Elsevier

New Astronomy

Volume 50, January 2017, Pages 1-11

Variable Star Signature Classification using Slotted Symbolic Markov Modeling

https://doi.org/10.1016/j.newast.2016.06.001

Highlights

  • We present a new feature space for the supervised classification of stellar variables.

  • Two surveys are used: data from the UCR database and data from the LINEAR survey.

  • Improved linear separation is generated using the new feature space.

Abstract

With the advent of digital astronomy, new benefits and new challenges have been presented to the modern-day astronomer. The astronomer can no longer rely on manual processing; instead, the profession as a whole has begun to adopt more advanced computational means. This paper focuses on the construction and application of a novel time-domain signature extraction methodology and the development of a supporting supervised pattern classification algorithm for the identification of variable stars. A methodology for the reduction of stellar variable observations (time-domain data) into a novel feature space representation is introduced. The methodology, referred to as Slotted Symbolic Markov Modeling (SSMM), has a number of advantages that will be demonstrated to benefit the supervised classification of stellar variables in particular. It will be shown that the methodology outperformed a baseline standard methodology on a standardized set of stellar light curve data. The performance on a set of data derived from the LINEAR dataset will also be shown.

Introduction

With the advent of digital astronomy, new benefits and new challenges have been presented to the modern-day astronomer. While data is captured in a more efficient and accurate manner using digital means, the efficiency of data retrieval has led to an overload of scientific data for processing and storage. More stars are captured per night, and in more detail, but increasing data capture begets exponentially increasing data processing. Database management, digital signal processing, automated image reduction and statistical analysis of data have all made their way to the forefront of tools for the modern astronomer. Astro-statistics and astro-informatics are fields which focus on the application and development of these tools to aid in the processing of large-scale astronomical data resources.

A methodology for the reduction of stellar variable observations (time-domain data) into a novel feature space representation is introduced. The proposed methodology, referred to as Slotted Symbolic Markov Modeling (SSMM), has a number of advantages over other classification approaches for stellar variables. SSMM can be applied to both folded and unfolded data. Also, it does not need time-warping for alignment of the waveforms. Given the reduction of a survey of stars into this new feature space, the problem of using prior patterns to identify new observed patterns can be addressed via classification algorithms. These methods have two large advantages over manual-classification procedures: the rate at which new data is processed is dependent only on the computational processing power available and the performance of a supervised classification algorithm is quantifiable and consistent.
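To make the general idea of a slotted, symbolic, Markov-style feature space concrete, the following is a minimal sketch and not the SSMM algorithm itself, whose details are not given in this section. It assumes fixed-width time slots, equal-width amplitude bins as symbols, and a row-normalized first-order transition matrix flattened into a feature vector; the function names `slot`, `symbolize` and `markov_features` and all parameter values are hypothetical.

```python
def slot(times, values, width):
    """Average the observations in each fixed-width time slot; empty slots become None."""
    t0 = min(times)
    n_slots = int((max(times) - t0) / width) + 1
    sums = [0.0] * n_slots
    counts = [0] * n_slots
    for t, v in zip(times, values):
        k = min(int((t - t0) / width), n_slots - 1)
        sums[k] += v
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def symbolize(slotted, n_symbols):
    """Quantize slot averages into integer symbols using equal-width amplitude bins."""
    present = [v for v in slotted if v is not None]
    lo = min(present)
    span = max(present) - lo
    symbols = []
    for v in slotted:
        if v is None:
            symbols.append(None)  # keep the gap so no transition is counted across it
        elif span == 0:
            symbols.append(0)
        else:
            symbols.append(min(int((v - lo) / span * n_symbols), n_symbols - 1))
    return symbols


def markov_features(symbols, n_symbols):
    """Flatten the row-normalized symbol transition matrix into a feature vector."""
    counts = [[0] * n_symbols for _ in range(n_symbols)]
    for a, b in zip(symbols, symbols[1:]):
        if a is not None and b is not None:
            counts[a][b] += 1
    features = []
    for row in counts:
        total = sum(row)
        features.extend(c / total if total else 0.0 for c in row)
    return features


# Irregularly sampled toy light curve with a gap between t = 2.2 and t = 4.3.
times = [0.0, 0.4, 1.1, 2.2, 4.3]
mags = [0.0, 0.0, 1.0, 0.0, 1.0]
slotted = slot(times, mags, width=1.0)
features = markov_features(symbolize(slotted, n_symbols=2), n_symbols=2)
```

Because the feature vector has a fixed length regardless of how many observations a star has, it can be passed directly to a standard classifier. In this sketch, empty slots simply break the chain of transitions rather than being interpolated.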

The remainder of this paper is structured as follows. First, the data, prior efforts, and challenges uniquely associated with the classification of stars via stellar variability are reviewed. Second, the novel methodology, SSMM, is outlined, including the feature space and signal conditioning methods used to extract the unique time-domain signatures. Third, a set of classifiers (random forest/bagged decision trees, k-nearest neighbor, and Parzen window classifier) is trained and tested on the extracted feature space using both a standardized stellar variability dataset and the LINEAR dataset. Fourth, performance statistics are generated for each classifier and the methods are compared and contrasted. Lastly, an anomaly detection algorithm is generated using the so-called one-class Parzen window classifier and the LINEAR dataset. The result will be the demonstration of the SSMM methodology as a competitive feature space reduction technique for use in supervised classification algorithms.

The idea of constructing a supervised classification algorithm for stellar classification is not unique to this paper (Dubath et al., 2011). Methods pursued include the construction of a detector to determine variability (Barclay et al., 2011), the design of random forests for the detection of photometric redshifts in spectra (Carliles et al., 2010), the detection of transient events (Djorgovski et al., 2012) and the development of machine-assisted discovery of astronomical parameter relationships (Graham et al., 2013a). Debosscher (2009) explored several classification techniques for the supervised classification of variable stars, quantitatively comparing them in terms of computational speed and classification performance. Likewise, other efforts have focused on comparing the speed and robustness of various methods (Blomme et al., 2011; Pichara et al., 2012; Pichara and Protopapas, 2013). These methods span both different classifiers and different spectral regimes, including IR surveys (Angeloni et al., 2014; Masci et al., 2014), RF surveys (Rebbapragada et al., 2011) and optical surveys (Richards et al., 2012). Methods for automated supervised classification include procedures such as direct parametric analysis (Udalski et al., 1999), fully automated neural networking (Pojmanski, 2000, 2002) and Bayesian classification (Eyer and Blake, 2005).

The majority of these studies rely on periodicity-domain feature space reductions. Debosscher (2009) and Templeton (2004) review a number of feature spaces and a number of efforts to reduce the time-domain data, most of which implement Fourier techniques, primarily the Lomb-Scargle (L-S) method (Lomb, 1976; Scargle, 1982), to estimate the primary periodicity (Eyer and Blake, 2005; Park and Cho, 2013; Richards et al., 2012; Ngeow et al., 2013; Deb and Singh, 2009). Lomb-Scargle is favored because of the flexibility it provides when sample rates are irregular and dropouts are common in the observed data. Long et al. (2014) advance L-S even further, introducing multi-band (multidimensional) generalized L-S, which allows the algorithm to take advantage of information across filters in cases where multi-channel time-domain data is available. There have also been efforts to estimate frequency using techniques other than L-S, such as the Correntropy Kernelized Periodogram (Huijse et al., 2011) or the MUlti SIgnal Classificator (Tagliaferri et al., 2003).
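As an illustration of why L-S suits irregular sampling, the sketch below evaluates the classical Lomb-Scargle periodogram in its unnormalized form directly from its definition. The toy light curve, its 0.8 cycles-per-unit-time signal, the jitter and dropout pattern, and the frequency grid are all assumptions chosen for the example and are not taken from the surveys discussed here.

```python
import math


def lomb_scargle(times, values, frequencies):
    """Classical (unnormalized) Lomb-Scargle power at each trial frequency (> 0)."""
    mean = sum(values) / len(values)
    y = [v - mean for v in values]
    power = []
    for f in frequencies:
        w = 2.0 * math.pi * f
        # The time offset tau makes the sine and cosine terms orthogonal.
        s2 = sum(math.sin(2.0 * w * t) for t in times)
        c2 = sum(math.cos(2.0 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2.0 * w)
        yc = ys = cc = ss = 0.0
        for t, v in zip(times, y):
            c = math.cos(w * (t - tau))
            s = math.sin(w * (t - tau))
            yc += v * c
            ys += v * s
            cc += c * c
            ss += s * s
        p = 0.0
        if cc > 0.0:
            p += yc * yc / cc
        if ss > 0.0:
            p += ys * ys / ss
        power.append(0.5 * p)
    return power


# Toy light curve: jittered sampling with every seventh observation dropped.
times = [0.25 * i + 0.1 * math.sin(1.7 * i) for i in range(200) if i % 7 != 3]
mags = [15.0 + 0.5 * math.sin(2.0 * math.pi * 0.8 * t) for t in times]

frequencies = [0.05 + 0.01 * k for k in range(196)]  # 0.05 to 2.00
power = lomb_scargle(times, mags, frequencies)
best_frequency = frequencies[power.index(max(power))]
```

No resampling onto a regular grid is required: the sums run over whatever observation times exist, so jitter and dropouts are accommodated. In practice a library routine such as `scipy.signal.lombscargle` or `astropy.timeseries.LombScargle` would normally replace this hand-written loop.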

The assumption that the light curve is periodic, or even that the signal can be represented in the limited Fourier space that Lomb-Scargle uses, has been shown (Palaversa et al., 2013; Barclay et al., 2011) to result in biases and other challenges when used for signature identification purposes. Supervised classification algorithms that implement these frequency estimation algorithms do so to generate an estimate of the primary frequency, which is used to fold all observations into a plot of magnitude vs. phase, something Deb and Singh (2009) refer to as “reconstruction”. After some interpolation to place the magnitude vs. phase plots on similar regularly sampled scales, the new folded time series can be directly compared (1-to-1) with known folded time series. Comparisons can be performed via distance metric (Tagliaferri et al., 2003), correlation (Protopapas et al., 2006), further feature space reduction (Debosscher, 2009) or more novel methods (Huijse et al., 2012). It should be noted that the family of stars with the label “stellar variable” is a large and diverse population: eclipsing binaries, irregularly pulsating variables, novae (stars in outburst), multi-model variables, and many others are frequently processed using the described methods, despite the underlying stellar variability functionality not naturally lending itself to Fourier decomposition and the assumptions that accompany it. Indeed, this is why Szatmary et al. (1994), Barclay et al. (2011), Palaversa et al. (2013) and others suggest using other decomposition methods such as discrete wavelet transformations, which have been shown to be powerful in the effort to decompose a time series into the time-frequency (phase) space for analysis (Torrence and Compo, 1998; Bolós and Benítez, 2014; Rioul and Vetterli, 1991).
It is noted that the digital signal processing possibilities beyond Fourier domain analysis, time series comparison and wavelet transformation are too numerous to outline here; however, the near-complete review by Fulcher et al. (2013) is highly recommended.
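The folding-and-comparison procedure described above can be sketched as follows. This is an illustrative simplification: averaging in equal-width phase bins stands in for the interpolation step, and the comparison is a plain Euclidean distance. The period, the observations, the template and the names `fold`, `bin_phase` and `distance` are hypothetical.

```python
import math


def fold(times, mags, period, epoch=0.0):
    """Return (phases, mags) sorted by phase in [0, 1) for an assumed period."""
    phases = [((t - epoch) / period) % 1.0 for t in times]
    order = sorted(range(len(times)), key=lambda i: phases[i])
    return [phases[i] for i in order], [mags[i] for i in order]


def bin_phase(phases, mags, n_bins):
    """Mean magnitude in each equal-width phase bin; empty bins become None."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, m in zip(phases, mags):
        k = min(int(p * n_bins), n_bins - 1)  # clamp guards against p rounding to 1.0
        sums[k] += m
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def distance(curve_a, curve_b):
    """Euclidean distance over the phase bins that are populated in both curves."""
    shared = [(a, b) for a, b in zip(curve_a, curve_b)
              if a is not None and b is not None]
    return math.sqrt(sum((a - b) ** 2 for a, b in shared))


# Toy observations folded on an assumed period of 1.0 time units.
times = [0.0, 0.5, 1.0, 1.5, 2.25]
mags = [1.0, 2.0, 3.0, 4.0, 5.0]
phases, folded_mags = fold(times, mags, period=1.0)
curve = bin_phase(phases, folded_mags, n_bins=4)

template = [2.0, 4.0, 3.0, 1.0]  # hypothetical known folded light curve
d = distance(curve, template)
```

The result depends entirely on the assumed period: an error in the period estimate smears the folded curve across phase bins, which is one way the biases noted above can propagate into the classifier.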

Section snippets

Slotted Symbolic Markov Modeling

The discussion of the Slotted Symbolic Markov Modeling (SSMM) algorithm encompasses the analysis, reduction and classification of data. Since the class labels are a priori roughly evenly distributed in both experimental studies, the approach uses a multi-class classifier. Should the class labels become unbalanced as additional data are included, other approaches are possible (Rifkin and Klautau, 2004). Data-specific challenges, associated with astronomical time series observations, have

Datasets

Two datasets are addressed here: the first is the STARLIGHT dataset from the UCR time series database, and the second is published data from the LINEAR survey. The UCR time series dataset is used to baseline the proposed time-domain feature extraction methodology, whose results are compared to those published on the UCR website. The UCR time series data contains only time-domain data that has already been folded and put into magnitude-phase space, no photometric data from either SDSS or 2MASS,

Conclusions

The Slotted Symbolic Markov Modeling (SSMM) methodology developed has been able to generate a feature space which separates variable stars by class (supervised classification). This methodology has the benefit of being able to accommodate irregular sampling rates, dropouts and some degree of time-domain variance. It also provides a fairly simple methodology for feature space generation, necessary for classification. One of the major advantages of the methodology used is that a signature pattern

Acknowledgments

The authors are grateful for valuable discussion with Stephen Wiechecki-Vergara and Véronique Petit. Research was partially supported by Vencore, Inc. The LINEAR program is sponsored by the National Aeronautics and Space Administration (NRA Nos. NNH09ZDA001N, 09-NEOO09-0010) and the United States Air Force under Air Force Contract FA8721-05-C-0002.

References (66)

  • B.D. Fulcher et al.

    Highly comparative time-series analysis: the empirical structure of time series and their methods

    J. R. Soc. Interface

    (2013)
  • D. Tax et al.

    Feature extraction for one-class classification

    Proceedings of the ICANN/ICONIP

    (2003)
  • R. Angeloni et al.

    The VVV templates project towards an automated classification of VVV light-curves-I. building a database of stellar variability in the near-infrared

    Astron. Astrophys.

    (2014)
  • T. Barclay et al.

    Stellar variability on time-scales of minutes: results from the first 5 yr of the rapid temporal survey

    Mon. Not. R. Astron. Soc.

    (2011)
  • M. Berry et al.

    The milky way tomography with sloan digital sky survey. IV. dissecting dust

    Astrophys. J.

    (2012)
  • J. Blomme et al.

    Improved methodology for the automated classification of periodic variable stars

    Mon. Not. R. Astron. Soc.

    (2011)
  • V.J. Bolós et al.

    The wavelet scalogram in the study of time series

    Advances in Differential Equations and Applications

    (2014)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • S. Carliles et al.

    Random forests for photometric redshifts

    Astrophys. J.

    (2010)
  • S. Deb et al.

    Light curve analysis of variable stars using fourier decomposition and principal component analysis

    Astron. Astrophys.

    (2009)
  • Debosscher, J., 2009. Automated classification of variable stars: application to the OGLE and CoRot databases.status:...
  • S.G. Djorgovski et al.

    Flashes in a star stream: automated classification of astronomical transient events

    E-Science (e-Science), 2012 IEEE 8th International Conference on

    (2012)
  • A. Drake et al.

    First results from the catalina real-time transient survey

    Astrophys. J.

    (2009)
  • P. Dubath et al.

    Random forest automated supervised classification of hipparcos periodic variable stars

    Mon. Not. R. Astron. Soc.

    (2011)
  • R.O. Duda et al.

    Pattern Classification

    (2012)
  • R. Duin et al.

    Prtools4.1, a matlab toolbox for pattern recognition,

    (2007)
  • L. Eyer et al.

    Automated classification of variable stars for all-sky automated survey 1–2 data

    Mon. Not. R. Astron. Soc.

    (2005)
  • T.-c. Fu

    A review on time series data mining

    Eng. Appl. Artif. Intell.

    (2011)
  • X. Ge et al.

    Deformable Markov model templates for time-series pattern matching

    Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2000)
  • J. Grabocka et al.

    Invariant time-series classification

    Machine Learning and Knowledge Discovery in Databases

    (2012)
  • M.J. Graham et al.

    Machine-assisted discovery of relationships in astronomy

    Mon. Not. R. Astron. Soc.

    (2013)
  • M.J. Graham et al.

    A comparison of period finding algorithms

    Mon. Not. R. Astron. Soc.

    (2013)
  • T. Hastie et al.

    The Elements of Statistical Learning

    (2009)
  • P. Huijse et al.

    An information theoretic algorithm for finding periodicities in stellar light curves

    Signal Process. IEEE Trans.

    (2012)
  • P. Huijse et al.

    Period estimation in astronomical time series using slotted correntropy

    Signal Process. Lett. IEEE

    (2011)
  • R.A. Johnson et al.

    Applied Multivariate Statistical Analysis

    (1992)
  • E. Keogh et al.

    Dimensionality reduction for fast similarity search in large time series databases

    Knowl. Inf. Syst.

    (2001)
  • E. Keogh et al.

    The ucr time series classification/clustering homepage

    (2011)
  • A. Kovačević et al.

    Time delay evolution of five active galactic nuclei

    J. Astrophys. Astron.

    (2015)
  • A.M. Law et al.

    Simulation Modeling and Analysis

    (1991)
  • Q. Li et al.

    Nonparametric Econometrics: Theory and Practice

    (2007)
  • J. Lin et al.

    A symbolic representation of time series, with implications for streaming algorithms

    Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery

    (2003)
  • J. Lin et al.

    Experiencing SAX: a novel symbolic representation of time series

    Data Min. Knowl. Discov.

    (2007)