ABSTRACT
Real-time big data analytics, which is the combination of real-time analytics and big data, works on processing large scale data as it arrives and strives to obtain insights from it without exceeding a limited time period. Massive amount of data is being generated every moment through web sites, social networks, scientific experiments, etc. which is stored in the cloud. When decision making requires an insight of the raw data in real-time (often by applying machine learning algorithms such as dimensionality reduction), these algorithms fail to analyze these incessantly flowing high volume data i.e dynamic big data. Our work proposes a variant of scalable principal component analysis (PCA) which is suited for real-time big data applications. We maintain a sliding window (representing incoming data in real-time) over the most recent data and project every incoming data into lower dimensional subspace. Our goal is to minimize the reconstruction error of the output from the input and keep updating the principal components depending on it. We have implemented our scalable algorithm on popular Spark framework for distributed platform and performed extensive experiments on datasets from a variety of real-time applications e.g. activity recognition, customer expenditure, etc. Furthermore, we have demonstrated that our algorithm can capture the changing distributions of real-life datasets, thus enabling real-time PCA. We have also compared the performances of distributed and non-distributed versions of our algorithm over a variety of window sizes and showed that our distributed algorithm is scalable and performs better when window size and target dimension increase.
- [n.d.]. Mahout 2014. Apache Mahout Software.https://mahout.apache.org/Google Scholar
- [n.d.]. MLlib 2014. MLlib: Apache Spark’s scalable machine learning library.https://spark.apache.org/mllib/Google Scholar
- [n.d.]. Spark 2014. Apache Spark: Lightning-fast Cluster Computing. https://spark.apache.orgGoogle Scholar
- Nuno Gonçalo Costa Fernandes Marques Abreu 2011. Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Ph.D. Dissertation.Google Scholar
- Rajen B Bhatt and M Gopal. 2008. FRCT: fuzzy-rough classification trees. Pattern analysis and applications 11, 1 (2008), 73–88.Google Scholar
- Anil Bhattacharyya. 1943. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35 (1943), 99–109.Google Scholar
- Albert Bifet and Ricard Gavalda. 2007. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM international conference on data mining. SIAM, 443–448.Google ScholarCross Ref
- Christos Boutsidis, Dan Garber, Zohar Karnin, and Edo Liberty. 2015. Online principal components analysis. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 887–901.Google ScholarDigital Library
- Pierluigi Casale, Oriol Pujol, and Petia Radeva. 2012. Personalization and user verification in wearable systems using biometric walking patterns. Personal and Ubiquitous Computing 16, 5 (2012), 563–580.Google ScholarDigital Library
- Kenneth L. Clarkson and David P. Woodruff. 2017. Low-Rank Approximation and Regression in Input Sparsity Time. J. ACM 63, Article 54 (Jan. 2017), 45 pages.Google Scholar
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (2008), 107–113.Google ScholarDigital Library
- Dua Dheeru and Efi Karra Taniskidou. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/mlGoogle Scholar
- Chris Ding and Xiaofeng He. 2004. K-means Clustering via Principal Component Analysis. In Proceedings of the Twenty-first International Conference on Machine Learning(ICML ’04). ACM, 29–. http://doi.acm.org/10.1145/1015330.1015408Google ScholarDigital Library
- Tarek Elgamal, Maysam Yabandeh, Ashraf Aboulnaga, Waleed Mustafa, and Mohamed Hefeeda. 2015. sPCA: Scalable principal component analysis for big data on distributed platforms. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 79–91.Google ScholarDigital Library
- Dan Feldman, Melanie Schmidt, and Christian Sohler. 2013. Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Projective Clustering. In Proceedings of the Twenty-fourth Annual ACM-SIAM Symposium on Discrete Algorithms (New Orleans, Louisiana) (SODA ’13). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1434–1453. http://dl.acm.org/citation.cfm?id=2627817.2627920Google ScholarCross Ref
- Gene H Golub and Christian Reinsch. 1970. Singular value decomposition and least squares solutions.Numerische mathematik 14. 403––420 pages.Google Scholar
- Nathan P. Halko. 2012. Randomized Methods for Computing Low-rank Approximations of Matrices. Ph.D. Dissertation. Boulder, CO, USA. Advisor(s) Martinsson, Per-Gunnar. AAI3507998.Google Scholar
- Ian T Jolliffe. 1990. Principal component analysis: a beginner’s guide—I. Introduction and application. Weather 45, 10 (1990), 375–382.Google ScholarCross Ref
- Zohar Karnin and Edo Liberty. 2015. Online pca with spectral bounds. In Conference on Learning Theory. 1129–1140.Google Scholar
- Ludmila I Kuncheva and William J Faithfull. 2014. PCA feature extraction for change detection in multidimensional unlabeled data. IEEE transactions on neural networks and learning systems 25, 1(2014), 69–80.Google ScholarCross Ref
- Michael Mathioudakis and Nick Koudas. 2010. Twittermonitor: trend detection over the twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 1155–1158.Google ScholarDigital Library
- G. Chaudhuri (originator). [n.d.]. Bhattacharyya distance. https://www.encyclopediaofmath.org/index.php/Bhattacharyya_distanceGoogle Scholar
- Ninh Pham and Rasmus Pagh. 2013. Fast and Scalable Polynomial Kernels via Explicit Feature Maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD ’13). ACM, 239–247.Google ScholarDigital Library
- Nhathai Phan, Soon Ae Chun, Manasi Bhole, and James Geller. 2017. Enabling real-time drug abuse detection in tweets. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 1510–1514.Google ScholarCross Ref
- Daniel Preotiuc-Pietro, Sina Samangooei, Trevor Cohn, Nicholas Gibbins, and Mahesan Niranjan. 2012. Trendminer: An architecture for real time analysis of social media text. (2012).Google Scholar
- Abdulhakim A Qahtan, Basma Alharbi, Suojin Wang, and Xiangliang Zhang. 2015. A pca-based change detection framework for multidimensional data streams: Change detection in multidimensional data streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 935–944.Google ScholarDigital Library
- Muhammad Abdullah Adnan Ranak Roy Chowdhuryand Rajesh K. Gupta. 2019. Real Time Principal Component Analysis. In to appear in the Proceedings of 35th IEEE International Conference on Data Engineering(ICDE ‘19). IEEE.Google Scholar
- Sam Roweis. 1998. EM Algorithms for PCA and SPCA. In in Advances in Neural Information Processing Systems. MIT Press, 626–632.Google Scholar
- Lindsay I Smith. 2002. A tutorial on principal components analysis. Technical Report.Google Scholar
- Md. Mehrab Tanjim and Muhammad Abdullah Adnan. 2018. sSketch: A Scalable Sketching Technique for PCA in the Cloud. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining(WSDM ‘18). ACM, 574–582.Google Scholar
- Md. Mehrab Tanjim and Muhammad Abdullah Adnan. 2018. sSketch: A Scalable Sketching Technique for PCA in the Cloud. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining(WSDM ‘18). ACM, 574–582.Google Scholar
- Zhewei Wei, Xuancheng Liu, Feifei Li, Shuo Shang, Xiaoyong Du, and Ji-Rong Wen. 2016. Matrix sketching over sliding windows. In Proceedings of the 2016 International Conference on Management of Data. ACM, 1465–1480.Google ScholarDigital Library
- David P. Woodruff. 2014. Sketching As a Tool for Numerical Linear Algebra. Found. Trends Theor. Comput. Sci. 10, 1–2 (Oct. 2014), 1–157. https://doi.org/10.1561/0400000060Google ScholarDigital Library
Recommendations
A Brief Survey on Big Data in Healthcare
This article presents a brief introduction to big data and big data analytics and also their roles in the healthcare system. A definite range of scientific researches about big data analytics in the healthcare system have been reviewed. The definition ...
Responsible Big Data Analytics for E-Business Services
ICBDR '21: Proceedings of the 5th International Conference on Big Data ResearchThis paper examines responsible big data analytics for e-business services and looks at how to use responsible big data analytics to obtain responsible e-business services. It addresses why responsibility matters to big data analytics and e-business ...
Comparative analysis of real-time messages in big data pipeline architecture
Nowadays, real-time messaging system is the essential thing in enabling time-critical decision making in many applications where it is important to deal with real-time requirements and reliability requirements simultaneously. For dependability reasons, we ...
Comments