Abstract
The Long Short-Term Memory (LSTM) network, a popular deep-learning model, is particularly effective for data with temporal correlation, such as text, sequences, or time series, thanks to its recurrent structure designed to capture such dependencies. In this article, we propose to generalize LSTM to generic machine-learning tasks in which the training data have no explicit temporal or sequential correlation. Our theme is to exploit feature correlation in the original data and convert each instance into a synthetic sentence format using a two-gram probabilistic language model. More specifically, for each instance represented in the original feature space, the conversion first horizontally aligns the original features into a sequentially correlated feature vector, resembling the letter coherence within a word. A vertical alignment is then carried out to create multiple time points that simulate the sequential order of words in a sentence (i.e., word correlation). The two-dimensional horizontal-and-vertical alignment not only ensures that feature correlations are maximally utilized but also preserves the original feature values in the new representation. As a result, an LSTM model can achieve good classification accuracy even when the underlying data have no temporal or sequential dependency. Experiments on 20 generic datasets show that applying LSTM to generic data improves classification accuracy compared to conventional machine-learning methods. This research opens a new opportunity for LSTM deep learning to be broadly applied to generic machine-learning tasks.
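The two-step conversion described above can be illustrated with a minimal sketch (our own illustration, not the authors' released code): features are reordered so that strongly correlated features sit next to each other (horizontal alignment), several such orderings are stacked as pseudo time steps (vertical alignment), and the resulting matrix is fed to a standard LSTM classifier. The greedy Pearson-correlation ordering, the choice of seed features, and the Keras model below are assumptions made for illustration only.

```python
import numpy as np
import tensorflow as tf

def horizontal_alignment(X, start=0):
    """Greedily order features so each feature is followed by the remaining
    feature it is most correlated with (one plausible reading of the paper's
    'horizontal alignment'; the exact criterion is an assumption here)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))      # |Pearson| between feature columns
    n = corr.shape[0]
    order, remaining = [start], set(range(n)) - {start}
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda j: corr[last, j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

def to_synthetic_sentence(X, n_steps=5):
    """Vertical alignment: stack n_steps orderings (each seeded from a different
    feature) so every instance becomes an (n_steps, n_features) 'sentence'."""
    n_features = X.shape[1]
    orders = [horizontal_alignment(X, start=s % n_features) for s in range(n_steps)]
    return np.stack([X[:, o] for o in orders], axis=1)  # (n_samples, n_steps, n_features)

def build_lstm(n_steps, n_features, n_classes):
    """A minimal LSTM classifier over the converted data (illustrative only)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_steps, n_features)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16)).astype("float32")     # generic tabular data
    y = (X[:, 0] + X[:, 3] > 0).astype("int64")          # toy labels
    X_seq = to_synthetic_sentence(X, n_steps=5)
    model = build_lstm(n_steps=5, n_features=16, n_classes=2)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_seq, y, epochs=3, batch_size=32, verbose=0)
```

Note that the original feature values are preserved in every pseudo time step; only their order changes, which is what allows a recurrent model to exploit feature-to-feature correlation without altering the data themselves.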