Variable Star Signature Classification using Slotted Symbolic Markov Modeling
Introduction
With the advent of digital astronomy, new benefits and new challenges have been presented to the modern-day astronomer. While data is captured more efficiently and accurately using digital means, the efficiency of data retrieval has led to an overload of scientific data for processing and storage. More stars are captured per night, and in more detail, but increasing data capture begets exponentially increasing data processing. Database management, digital signal processing, automated image reduction and statistical analysis of data have all made their way to the forefront of tools for the modern astronomer. Astro-statistics and astro-informatics are fields which focus on the application and development of these tools to aid in the processing of large-scale astronomical data resources.
A methodology for the reduction of stellar variable observations (time-domain data) into a novel feature space representation is introduced. The proposed methodology, referred to as Slotted Symbolic Markov Modeling (SSMM), has a number of advantages over other classification approaches for stellar variables. SSMM can be applied to both folded and unfolded data, and it does not need time-warping to align the waveforms. Given the reduction of a survey of stars into this new feature space, the problem of using prior patterns to identify newly observed patterns can be addressed via classification algorithms. These methods have two major advantages over manual classification procedures: the rate at which new data is processed depends only on the computational processing power available, and the performance of a supervised classification algorithm is quantifiable and consistent.
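To make the idea of a slotted, symbolic, Markov-style feature space concrete, the following is a minimal sketch and not the authors' SSMM implementation, whose details are not given in this section. It assumes three illustrative steps: averaging irregular samples into fixed-width time slots, mapping each slot to an amplitude symbol, and counting first-order symbol transitions while skipping gaps. The function names, the slot width, and the amplitude bin edges are all hypothetical.

```python
from bisect import bisect_right


def slot_series(times, mags, slot_width):
    """Average irregular samples into fixed-width time slots (empty slot -> None)."""
    t0 = min(times)
    n_slots = int((max(times) - t0) / slot_width) + 1
    sums = [0.0] * n_slots
    counts = [0] * n_slots
    for t, m in zip(times, mags):
        k = int((t - t0) / slot_width)
        sums[k] += m
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def symbolize(slotted, edges):
    """Map each slotted value to an integer symbol using sorted amplitude bin edges."""
    return [None if v is None else bisect_right(edges, v) for v in slotted]


def transition_matrix(symbols, n_symbols):
    """Row-normalised first-order transition counts; pairs touching a gap are skipped."""
    counts = [[0.0] * n_symbols for _ in range(n_symbols)]
    for a, b in zip(symbols, symbols[1:]):
        if a is not None and b is not None:
            counts[a][b] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical light curve with a dropout between t = 1.1 and t = 3.2.
times = [0.0, 0.4, 1.1, 3.2, 3.9, 4.5]
mags = [10.0, 10.2, 11.0, 12.0, 12.2, 10.1]
symbols = symbolize(slot_series(times, mags, slot_width=1.0), edges=[10.5, 11.5])
features = transition_matrix(symbols, n_symbols=3)
print(symbols)  # [0, 1, None, 2, 0]
```

Flattening the matrix gives a fixed-length feature vector regardless of how many observations a star has, which is one plausible reason such a representation can tolerate irregular sampling and dropouts.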
The remainder of this paper is structured as follows. First, the data, prior efforts, and challenges uniquely associated with the classification of stars via stellar variability are reviewed. Second, the novel methodology, SSMM, is outlined, including the feature space and signal conditioning methods used to extract the unique time-domain signatures. Third, a set of classifiers (random forest/bagged decision trees, k-nearest neighbor, and Parzen window classifier) is trained and tested on the extracted feature space using both a standardized stellar variability dataset and the LINEAR dataset. Fourth, performance statistics are generated for each classifier and the methods are compared and contrasted. Lastly, an anomaly detection algorithm is generated using the so-called one-class Parzen Window Classifier and the LINEAR dataset. The result will be the demonstration of the SSMM methodology as a competitive feature space reduction technique for use in supervised classification algorithms.
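As an illustration of the one-class Parzen window idea mentioned above, the sketch below estimates a Gaussian kernel density from training feature vectors and flags a new vector as anomalous when its density falls below a threshold. It is a generic sketch under stated assumptions, not the classifier configuration used in the paper: the Gaussian kernel, the leave-one-out threshold rule, the bandwidth, and the rejection fraction are all illustrative choices.

```python
import math


def parzen_density(x, train, bandwidth):
    """Gaussian Parzen window density estimate at x from the training points."""
    dim = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for p in train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(train) * norm)


def fit_threshold(train, bandwidth, reject_fraction):
    """Pick a density threshold from leave-one-out scores of the training set."""
    scores = sorted(
        parzen_density(p, train[:i] + train[i + 1:], bandwidth)
        for i, p in enumerate(train)
    )
    return scores[int(reject_fraction * len(scores))]


def is_anomaly(x, train, bandwidth, threshold):
    """A point is anomalous when its estimated density is below the threshold."""
    return parzen_density(x, train, bandwidth) < threshold


# Hypothetical 2-D feature vectors clustered near the origin.
train = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
threshold = fit_threshold(train, bandwidth=0.2, reject_fraction=0.05)
print(is_anomaly((0.2, 0.2), train, 0.2, threshold))  # False
print(is_anomaly((3.0, 3.0), train, 0.2, threshold))  # True
```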
The idea of constructing a supervised classification algorithm for stellar classification is not unique to this paper (Dubath et al., 2011). Methods pursued include the construction of a detector to determine variability (Barclay et al., 2011), the design of random forests for the detection of photometric redshifts in spectra (Carliles et al., 2010), the detection of transient events (Djorgovski et al., 2012) and the development of machine-assisted discovery of astronomical parameter relationships (Graham et al., 2013a). Debosscher (2009) explored several classification techniques for the supervised classification of variable stars, quantitatively comparing them in terms of computational speed and classification performance. Likewise, other efforts have focused on comparing the speed and robustness of various methods (Blomme et al., 2011; Pichara et al., 2012; Pichara and Protopapas, 2013). These methods span both different classifiers and different spectral regimes, including IR surveys (Angeloni et al., 2014; Masci et al., 2014), RF surveys (Rebbapragada et al., 2011) and optical surveys (Richards et al., 2012). Methods for automated supervised classification include procedures such as direct parametric analysis (Udalski et al., 1999), fully automated neural networking (Pojmanski, 2000, 2002) and Bayesian classification (Eyer and Blake, 2005).
The majority of these studies rely on periodicity-domain feature space reductions. Debosscher (2009) and Templeton (2004) review a number of feature spaces and a number of efforts to reduce the time domain data, most of which implement Fourier techniques, primarily the Lomb-Scargle (L-S) method (Lomb, 1976; Scargle, 1982), to estimate the primary periodicity (Eyer and Blake, 2005; Park and Cho, 2013; Richards et al., 2012; Ngeow et al., 2013; Deb and Singh, 2009). Lomb-Scargle is favored because of the flexibility it provides when sample rates are irregular and dropouts are common in the observed data. Long et al. (2014) advance L-S even further, introducing a multi-band (multidimensional) generalized L-S that allows the algorithm to take advantage of information across filters in cases where multi-channel time-domain data is available. There have also been efforts to estimate frequency using techniques other than L-S, such as the Correntropy Kernelized Periodogram (Huijse et al., 2011) or MUlti SIgnal Classificator (Tagliaferri et al., 2003).
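The way a Lomb-Scargle periodogram handles irregular sampling can be sketched with the classical formulation: for each trial frequency, a time offset is chosen so that the sine and cosine terms are orthogonal over the actual sample times, and the power is the sum of the two squared projections. The code below is a plain-Python illustration under that assumption, using an unnormalised power and a hypothetical uniform frequency grid; it is not the generalized or multi-band variant discussed above.

```python
import math


def lomb_scargle(times, mags, freqs):
    """Classical (unnormalised) Lomb-Scargle power at each trial frequency."""
    mean = sum(mags) / len(mags)
    y = [m - mean for m in mags]
    power = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # Offset tau makes the sine and cosine terms orthogonal on these samples.
        tau = math.atan2(
            sum(math.sin(2.0 * w * t) for t in times),
            sum(math.cos(2.0 * w * t) for t in times),
        ) / (2.0 * w)
        c = [math.cos(w * (t - tau)) for t in times]
        s = [math.sin(w * (t - tau)) for t in times]
        yc = sum(yi * ci for yi, ci in zip(y, c))
        ys = sum(yi * si for yi, si in zip(y, s))
        cc = sum(ci * ci for ci in c)
        ss = sum(si * si for si in s)
        p = 0.0
        if cc > 0.0:
            p += yc * yc / cc
        if ss > 0.0:
            p += ys * ys / ss
        power.append(0.5 * p)
    return power


# Hypothetical irregularly sampled sinusoid with a true frequency of 0.4 cycles per unit time.
times = [0.31 * i + 0.13 * math.sin(7.0 * i) for i in range(120)]
mags = [math.sin(2.0 * math.pi * 0.4 * t) for t in times]
freqs = [0.05 + 0.01 * k for k in range(100)]
power = lomb_scargle(times, mags, freqs)
best_freq = freqs[power.index(max(power))]
print(round(best_freq, 2))
```

The frequency with the highest power is the estimate of the primary periodicity that the folding step relies on.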
The assumption that the light curve is periodic, or even that the signal can be represented in the limited Fourier space that Lomb-Scargle uses, has been shown (Palaversa et al., 2013; Barclay et al., 2011) to result in biases and other challenges when used for signature identification purposes. Supervised classification algorithms that implement these frequency estimation techniques do so to estimate the primary frequency, which is then used to fold all observations into a plot of magnitude vs. phase, something Deb and Singh (2009) refer to as “reconstruction”. After some interpolation to place the magnitude vs. phase plots on similar regularly sampled scales, the new folded time series can be directly compared (1-to-1) with known folded time series. Comparisons can be performed via distance metric (Tagliaferri et al., 2003), correlation (Protopapas et al., 2006), further feature space reduction (Debosscher, 2009) or more novel methods (Huijse et al., 2012). It should be noted that the family of stars with the label “stellar variable” is a large and diverse population: eclipsing binaries, irregularly pulsating variables, novae (stars in outburst), multi-mode variables, and many others are frequently processed using the described methods, even though the underlying variability does not naturally lend itself to Fourier decomposition or to the assumptions that accompany it. Indeed, this is why Szatmary et al. (1994), Barclay et al. (2011), Palaversa et al. (2013) and others suggest using other decomposition methods such as discrete wavelet transformations, which have been shown to be powerful in decomposing a time series into the time-frequency (phase) space for analysis (Torrence and Compo, 1998; Bolós and Benítez, 2014; Rioul and Vetterli, 1991).
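The fold, interpolate, and compare pipeline described above can be sketched as follows. This is an illustrative outline only: the period is assumed to be already known, linear interpolation with wrap-around is one of several possible resampling choices, and Euclidean distance stands in for whichever comparison metric a given study uses. All function names are hypothetical.

```python
import math


def fold(times, mags, period):
    """Convert observation times to phase in [0, 1) and sort by phase."""
    phased = sorted(((t / period) % 1.0, m) for t, m in zip(times, mags))
    return [p for p, _ in phased], [m for _, m in phased]


def resample(phases, mags, n_bins):
    """Linearly interpolate a folded curve onto n_bins regular phase points."""
    # Pad with wrapped copies of the last and first points so phase 0 is bracketed.
    ext_p = [phases[-1] - 1.0] + phases + [phases[0] + 1.0]
    ext_m = [mags[-1]] + mags + [mags[0]]
    out = []
    j = 0
    for k in range(n_bins):
        x = k / n_bins
        while ext_p[j + 1] < x:
            j += 1
        p0, p1 = ext_p[j], ext_p[j + 1]
        m0, m1 = ext_m[j], ext_m[j + 1]
        out.append(m0 if p1 == p0 else m0 + (m1 - m0) * (x - p0) / (p1 - p0))
    return out


def euclidean(a, b):
    """Point-by-point (1-to-1) distance between two regularly sampled curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical example: compare an irregularly sampled sinusoid with a sine template.
period = 2.5
times = [0.31 * i + 0.13 * math.sin(7.0 * i) for i in range(150)]
mags = [math.sin(2.0 * math.pi * t / period) for t in times]
template = [math.sin(2.0 * math.pi * k / 32) for k in range(32)]
phases, folded = fold(times, mags, period)
distance = euclidean(resample(phases, folded, 32), template)
```

Note that this direct comparison implicitly assumes the two folded curves share a phase origin, which is one of the alignment issues that motivates alternatives to Fourier-based folding.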
It is noted that the digital signal processing possibilities beyond Fourier-domain analysis, time-series comparison, and wavelet transformation are too numerous to outline here; however, the near-complete review by Fulcher et al. (2013) is highly recommended.
Section snippets
Slotted Symbolic Markov Modeling
The discussion of the Slotted Symbolic Markov Modeling (SSMM) algorithm encompasses the analysis, reduction and classification of data. Since the a priori class labels are roughly evenly distributed for both experimental studies, the approach uses a multi-class classifier. Should the class labels become unbalanced as additional data is added, other approaches are possible (Rifkin and Klautau, 2004). Data-specific challenges, associated with astronomical time series observations, have
Datasets
Two datasets are addressed here: the first is the STARLIGHT dataset from the UCR time series database, and the second is published data from the LINEAR survey. The UCR time series dataset is used to baseline the proposed time-domain feature extraction methodology; the results are compared to those published on the UCR website. The UCR time series data contains only time domain data that has already been folded and put into magnitude-phase space, no photometric data from either SDSS or 2MASS,
Conclusions
The Slotted Symbolic Markov Modeling (SSMM) methodology developed has been able to generate a feature space which separates variable stars by class (supervised classification). This methodology has the benefit of being able to accommodate irregular sampling rates, dropouts and some degree of time-domain variance. It also provides a fairly simple methodology for feature space generation, necessary for classification. One of the major advantages of the methodology used is that a signature pattern
Acknowledgments
The authors are grateful for valuable discussion with Stephen Wiechecki-Vergara and Véronique Petit. Research was partially supported by Vencore, Inc. The LINEAR program is sponsored by the National Aeronautics and Space Administration (NRA Nos. NNH09ZDA001N, 09-NEOO09-0010) and the United States Air Force under Air Force Contract FA8721-05-C-0002.
References (66)
- et al. (2013). Highly comparative time-series analysis: the empirical structure of time series and their methods. J. R. Soc. Interface.
- et al. (2003). Feature extraction for one-class classification. Proceedings of the ICANN/ICONIP.
- et al. (2014). The VVV templates project towards an automated classification of VVV light-curves-I. Building a database of stellar variability in the near-infrared. Astron. Astrophys.
- et al. (2011). Stellar variability on time-scales of minutes: results from the first 5 yr of the rapid temporal survey. Mon. Not. Roy. Astron. Soc.
- et al. (2012). The Milky Way tomography with Sloan Digital Sky Survey. IV. Dissecting dust. Astrophys. J.
- et al. (2011). Improved methodology for the automated classification of periodic variable stars. Mon. Not. Roy. Astron. Soc.
- et al. (2014). The wavelet scalogram in the study of time series. Advances in Differential Equations and Applications.
- et al. (1984). Classification and Regression Trees.
- et al. (2010). Random forests for photometric redshifts. Astrophys. J.
- et al. (2009). Light curve analysis of variable stars using Fourier decomposition and principal component analysis. Astron. Astrophys.
- Flashes in a star stream: automated classification of astronomical transient events. E-Science (e-Science), 2012 IEEE 8th International Conference on.
- First results from the Catalina real-time transient survey. Astrophys. J.
- Random forest automated supervised classification of Hipparcos periodic variable stars. Mon. Not. Roy. Astron. Soc.
- Pattern Classification.
- Prtools4.1, a MATLAB toolbox for pattern recognition.
- Automated classification of variable stars for all-sky automated survey 1–2 data. Mon. Not. Roy. Astron. Soc.
- A review on time series data mining. Eng. Appl. Artif. Intell.
- Deformable Markov model templates for time-series pattern matching. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Invariant time-series classification. Machine Learning and Knowledge Discovery in Databases.
- Machine-assisted discovery of relationships in astronomy. Mon. Not. Roy. Astron. Soc.
- A comparison of period finding algorithms. Mon. Not. Roy. Astron. Soc.
- The Elements of Statistical Learning.
- An information theoretic algorithm for finding periodicities in stellar light curves. Signal Process. IEEE Trans.
- Period estimation in astronomical time series using slotted correntropy. Signal Process. Lett. IEEE.
- Applied Multivariate Statistical Analysis.
- Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst.
- The UCR time series classification/clustering homepage.
- Time delay evolution of five active galactic nuclei. J. Astrophys. Astron.
- Simulation Modeling and Analysis.
- Nonparametric Econometrics: Theory and Practice.
- A symbolic representation of time series, with implications for streaming algorithms. Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
- Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov.