Astronomers now apply data mining algorithms across most areas of astronomy. In addition, experts in
the field of data mining have carried out long-term studies and several mining projects using
astronomical data, because astronomy, along with other fields such as medicine and high energy
physics, has produced numerous large datasets that are amenable to the approach. Examples of such
projects are SKICAT (Sky Image Cataloging and Analysis System), for catalog production and analysis
from digitized sky surveys, in particular the scans of the second Palomar Observatory Sky Survey;
JARtool (Jet Propulsion Laboratory Adaptive Recognition Tool), used for recognition of volcanoes in
over 30,000 images of Venus returned by the Magellan mission; the subsequent and more general
Diamond Eye; and the Lawrence Livermore National Laboratory Sapphire project.

 Object classification

 Classification is a crucial preliminary step in the scientific process, as it provides a way of
arranging information so that it can be used to form hypotheses and be compared easily with
models. The two most useful concepts in object classification are the completeness and the
efficiency, also known as recall and precision. They are generally defined in terms of true and
false positives (TP and FP) and true and false negatives (TN and FN). The completeness is the
fraction of objects that are in reality of a given type that are classified as that type,
TP / (TP + FN), and the efficiency is the fraction of objects classified as a given type that are
truly of that type, TP / (TP + FP). These two quantities are interesting astrophysically because,
while one wants both high completeness and high efficiency, there is generally a tradeoff
involved. The importance of each often depends on the application: for instance, an investigation
of rare objects generally requires high completeness while allowing some contamination (lower
efficiency), but statistical clustering of cosmological objects requires high efficiency, even at
the expense of completeness.
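
 The two definitions above can be illustrated with a minimal Python sketch. The function names and the toy star/galaxy labels are assumptions made for illustration only; they do not come from any survey or from the projects mentioned earlier.

```python
from typing import Sequence, Tuple


def confusion_counts(true_labels: Sequence[str],
                     predicted_labels: Sequence[str],
                     target: str) -> Tuple[int, int, int, int]:
    """Count TP, FP, TN, FN for one target class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == target and truth == target:
            tp += 1
        elif guess == target and truth != target:
            fp += 1
        elif guess != target and truth == target:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def completeness(tp: int, fn: int) -> float:
    """Recall: fraction of true members of the class that were recovered."""
    if tp + fn == 0:
        raise ValueError("no objects of the target type in the sample")
    return tp / (tp + fn)


def efficiency(tp: int, fp: int) -> float:
    """Precision: fraction of objects assigned to the class that belong to it."""
    if tp + fp == 0:
        raise ValueError("no objects were classified as the target type")
    return tp / (tp + fp)


# Made-up labels for six objects (illustrative only).
truth = ["star", "star", "galaxy", "galaxy", "galaxy", "star"]
guess = ["star", "galaxy", "galaxy", "galaxy", "star", "star"]

tp, fp, tn, fn = confusion_counts(truth, guess, target="galaxy")
print(completeness(tp, fn), efficiency(tp, fp))
```

 In this toy sample one galaxy is missed and one star is mistaken for a galaxy, so both quantities come out at two thirds. Loosening the criterion for calling an object a galaxy would typically raise the completeness and lower the efficiency, which is the tradeoff described above.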

 Star-Galaxy Separation

 Due to their physical size in comparison to their distance from us, almost all stars are
unresolved in photometric datasets, and therefore appear as point sources. Galaxies, despite being
further away, generally subtend a larger angle and appear as extended sources. However, other
astrophysical objects, such as quasars and supernovae, also appear as point sources. Thus, the
separation of a photometric catalog into stars and galaxies, or more generally, stars, galaxies,
and other objects, is an important problem. The number of galaxies and stars in typical surveys
(of order 10^8 or above) requires that such separation be automated. This problem is well studied,
and automated approaches were employed before current data mining algorithms became popular, for
instance during digitization by the scanning of photographic plates by machines such as the APM
and DPOSS. Several data mining algorithms have been applied, including artificial neural networks
(ANN), decision trees (DT), mixture modelling, and self-organizing maps (SOM), with most
algorithms achieving an efficiency of around 95%. Typically, this is performed using a set of
measured morphological parameters that are derived
from the survey photometry, perhaps with colors or other information, such as the seeing. The
advantage of a data mining approach is that all such information about each object is easily
incorporated.
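
 As a rough illustration of automated separation from a measured morphological parameter, the sketch below fits a one-level decision tree (a decision stump) to a single feature. The feature, called "extendedness" here, the function names, and the numbers are all hypothetical; real separators combine many parameters and are trained on far larger samples.

```python
from typing import List


def fit_stump(values: List[float], labels: List[str]) -> float:
    """Find the threshold on one feature that best splits stars from galaxies.

    Objects with a value above the threshold are called 'galaxy'.
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct feature values")
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((v > threshold) == (lab == "galaxy")
                      for v, lab in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def classify(value: float, threshold: float) -> str:
    """Label an object as extended ('galaxy') or point-like ('star')."""
    return "galaxy" if value > threshold else "star"


# Made-up training sample: a hypothetical 'extendedness' measure per object.
extendedness = [0.02, 0.05, 0.01, 0.08, 0.40, 0.65, 0.30, 0.90]
labels = ["star"] * 4 + ["galaxy"] * 4

threshold = fit_stump(extendedness, labels)
print(threshold, classify(0.5, threshold), classify(0.03, threshold))
```

 A single threshold cannot distinguish quasars or supernovae from stars, since all of them are point-like; that is one reason additional information such as colors is brought in.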

 Galaxy Morphology

 Galaxies come in a range of sizes and shapes, or more collectively, morphology. The most
well-known system for the morphological classification of galaxies is the Hubble sequence of
elliptical, spiral, barred spiral, and irregular, along with various subclasses. This system
correlates with many physical properties known to be important in the formation and evolution of
galaxies. Because galaxy morphology is a complex phenomenon that correlates with the underlying
physics, but is not unique to any one given process, the Hubble sequence has endured, despite
being rather subjective and based on visible-light morphology originally derived from blue-biased
photographic plates. The Hubble sequence has been extended in various ways, and for data mining
purposes the T system has been extensively used. This system maps the categorical Hubble types E,
S0, Sa, Sb, Sc, Sd, and Irr onto the numerical values -5 to 10. One can train a supervised
algorithm to assign T types to images for which measured parameters are available. Such parameters
can be purely morphological, or include other information such as color. A series of papers by
Lahav and collaborators do exactly this, applying ANNs to predict the T type of galaxies at low
redshift, and finding accuracy equal to that of human experts. ANNs have also been applied to
higher redshift data to distinguish between normal and peculiar galaxies, and the fundamentally
topological and unsupervised SOM ANN has been used to classify galaxies from Hubble Space Telescope
images, where the initial distribution of classes is unknown. Likewise, ANNs have been used to
obtain morphological types from
galaxy spectra.

 Photometric redshifts

 An area of astrophysics that has greatly increased in popularity in the last few years is the
estimation of redshifts from photometric data (photo-zs). This is because, although the distances
are less accurate than those obtained with spectra, the sheer number of objects with photometric
measurements can often make up for the reduction in individual accuracy by suppressing the
statistical noise of an ensemble calculation. The two common approaches to photo-zs are the
template method and the empirical training set method. The template approach has many complicating
issues, including calibration, zero-points, priors, multiwavelength performance (e.g., poor in the
mid-infrared), and difficulty handling missing or incomplete training data. We focus in this review
on the empirical approach, as it
is an implementation of supervised learning.

 Galaxies

 At low redshifts, the calculation of photometric redshifts for normal galaxies is quite
straightforward due to the break in the typical galaxy spectrum at 4000 Å. Thus, as a galaxy is
redshifted with increasing distance, the color (measured as a difference in magnitudes) changes
relatively smoothly. As a result, both template and empirical photo-z approaches obtain similar
results, a root-mean-square deviation of ~0.02 in redshift, which is close to the best possible
result given the intrinsic spread in the properties. This has been shown with ANNs, SVM, DT, kNN,
empirical polynomial relations, numerous template-based studies, and several other methods. At
higher redshifts, achieving accurate results becomes more difficult because the 4000 Å break
is shifted redward of the optical, galaxies are fainter and thus spectral data
are sparser, and galaxies intrinsically evolve over time. While supervised
learning has been successfully used, beyond the spectral regime the obvious
limitation arises that in order to reach the limiting magnitude of the
photometric portions of surveys, extrapolation would be required. In this
regime, or where only small training sets are available, template-based results
can be used, but without spectral information, the templates themselves are
being extrapolated. However, the extrapolation of the templates is done in a more physically
motivated manner. It is likely that the more general hybrid approach of using empirical data to
iteratively improve the templates, or a semi-supervised procedure, will ultimately provide a more
elegant solution. Another issue at higher redshift is that the available numbers of objects can
become quite small (in the hundreds or fewer), thus reintroducing the curse of dimensionality by a
simple lack of objects compared to measured wavebands. Methods of dimension reduction can help to
mitigate this effect.
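
 The empirical training set method discussed in this section can be sketched with a k-nearest-neighbor (kNN) estimator: the redshift of a new galaxy is taken as the mean spectroscopic redshift of the training galaxies closest to it in color space. The function names, the two-color feature vector, and all numbers below are assumptions for illustration only, not values from any survey.

```python
import math
from typing import Sequence


def knn_photo_z(train_colors: Sequence[Sequence[float]],
                train_z: Sequence[float],
                query_colors: Sequence[float],
                k: int = 3) -> float:
    """Estimate a redshift as the mean redshift of the k nearest training objects."""
    if not 1 <= k <= len(train_colors):
        raise ValueError("k must be between 1 and the training set size")
    # Euclidean distance in color space from the query to every training object.
    by_distance = sorted(
        (math.dist(colors, query_colors), z)
        for colors, z in zip(train_colors, train_z)
    )
    return sum(z for _, z in by_distance[:k]) / k


def rms_deviation(predicted: Sequence[float], true: Sequence[float]) -> float:
    """Root-mean-square deviation between predicted and true redshifts."""
    if len(predicted) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))


# Made-up training set: two colors per galaxy and a spectroscopic redshift.
colors = [(0.9, 0.30), (1.0, 0.35), (1.2, 0.45), (1.4, 0.55), (1.6, 0.60)]
redshifts = [0.05, 0.08, 0.15, 0.22, 0.30]

print(knn_photo_z(colors, redshifts, query_colors=(1.1, 0.40), k=2))
```

 The sketch also shows the extrapolation problem noted above: a query fainter or redder than every training object still receives the mean redshift of its nearest neighbors, so the estimate can never fall outside the range covered by the training set.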