skip to main content
10.1145/3428363.3428369acmotherconferencesArticle/Chapter ViewAbstractPublication PagesnsyssConference Proceedingsconference-collections
research-article

Distributed Principal Component Analysis for Real-time Big Data Processing

Published:22 December 2020Publication History

ABSTRACT

Real-time big data analytics, which is the combination of real-time analytics and big data, works on processing large scale data as it arrives and strives to obtain insights from it without exceeding a limited time period. Massive amount of data is being generated every moment through web sites, social networks, scientific experiments, etc. which is stored in the cloud. When decision making requires an insight of the raw data in real-time (often by applying machine learning algorithms such as dimensionality reduction), these algorithms fail to analyze these incessantly flowing high volume data i.e dynamic big data. Our work proposes a variant of scalable principal component analysis (PCA) which is suited for real-time big data applications. We maintain a sliding window (representing incoming data in real-time) over the most recent data and project every incoming data into lower dimensional subspace. Our goal is to minimize the reconstruction error of the output from the input and keep updating the principal components depending on it. We have implemented our scalable algorithm on popular Spark framework for distributed platform and performed extensive experiments on datasets from a variety of real-time applications e.g. activity recognition, customer expenditure, etc. Furthermore, we have demonstrated that our algorithm can capture the changing distributions of real-life datasets, thus enabling real-time PCA. We have also compared the performances of distributed and non-distributed versions of our algorithm over a variety of window sizes and showed that our distributed algorithm is scalable and performs better when window size and target dimension increase.

References

  1. [n.d.]. Mahout 2014. Apache Mahout Software.https://mahout.apache.org/Google ScholarGoogle Scholar
  2. [n.d.]. MLlib 2014. MLlib: Apache Spark’s scalable machine learning library.https://spark.apache.org/mllib/Google ScholarGoogle Scholar
  3. [n.d.]. Spark 2014. Apache Spark: Lightning-fast Cluster Computing. https://spark.apache.orgGoogle ScholarGoogle Scholar
  4. Nuno Gonçalo Costa Fernandes Marques Abreu 2011. Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Ph.D. Dissertation.Google ScholarGoogle Scholar
  5. Rajen B Bhatt and M Gopal. 2008. FRCT: fuzzy-rough classification trees. Pattern analysis and applications 11, 1 (2008), 73–88.Google ScholarGoogle Scholar
  6. Anil Bhattacharyya. 1943. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35 (1943), 99–109.Google ScholarGoogle Scholar
  7. Albert Bifet and Ricard Gavalda. 2007. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM international conference on data mining. SIAM, 443–448.Google ScholarGoogle ScholarCross RefCross Ref
  8. Christos Boutsidis, Dan Garber, Zohar Karnin, and Edo Liberty. 2015. Online principal components analysis. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 887–901.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Pierluigi Casale, Oriol Pujol, and Petia Radeva. 2012. Personalization and user verification in wearable systems using biometric walking patterns. Personal and Ubiquitous Computing 16, 5 (2012), 563–580.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Kenneth L. Clarkson and David P. Woodruff. 2017. Low-Rank Approximation and Regression in Input Sparsity Time. J. ACM 63, Article 54 (Jan. 2017), 45 pages.Google ScholarGoogle Scholar
  11. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (2008), 107–113.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Dua Dheeru and Efi Karra Taniskidou. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/mlGoogle ScholarGoogle Scholar
  13. Chris Ding and Xiaofeng He. 2004. K-means Clustering via Principal Component Analysis. In Proceedings of the Twenty-first International Conference on Machine Learning(ICML ’04). ACM, 29–. http://doi.acm.org/10.1145/1015330.1015408Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Tarek Elgamal, Maysam Yabandeh, Ashraf Aboulnaga, Waleed Mustafa, and Mohamed Hefeeda. 2015. sPCA: Scalable principal component analysis for big data on distributed platforms. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 79–91.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Dan Feldman, Melanie Schmidt, and Christian Sohler. 2013. Turning Big Data into Tiny Data: Constant-size Coresets for K-means, PCA and Projective Clustering. In Proceedings of the Twenty-fourth Annual ACM-SIAM Symposium on Discrete Algorithms (New Orleans, Louisiana) (SODA ’13). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1434–1453. http://dl.acm.org/citation.cfm?id=2627817.2627920Google ScholarGoogle ScholarCross RefCross Ref
  16. Gene H Golub and Christian Reinsch. 1970. Singular value decomposition and least squares solutions.Numerische mathematik 14. 403––420 pages.Google ScholarGoogle Scholar
  17. Nathan P. Halko. 2012. Randomized Methods for Computing Low-rank Approximations of Matrices. Ph.D. Dissertation. Boulder, CO, USA. Advisor(s) Martinsson, Per-Gunnar. AAI3507998.Google ScholarGoogle Scholar
  18. Ian T Jolliffe. 1990. Principal component analysis: a beginner’s guide—I. Introduction and application. Weather 45, 10 (1990), 375–382.Google ScholarGoogle ScholarCross RefCross Ref
  19. Zohar Karnin and Edo Liberty. 2015. Online pca with spectral bounds. In Conference on Learning Theory. 1129–1140.Google ScholarGoogle Scholar
  20. Ludmila I Kuncheva and William J Faithfull. 2014. PCA feature extraction for change detection in multidimensional unlabeled data. IEEE transactions on neural networks and learning systems 25, 1(2014), 69–80.Google ScholarGoogle ScholarCross RefCross Ref
  21. Michael Mathioudakis and Nick Koudas. 2010. Twittermonitor: trend detection over the twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 1155–1158.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. G. Chaudhuri (originator). [n.d.]. Bhattacharyya distance. https://www.encyclopediaofmath.org/index.php/Bhattacharyya_distanceGoogle ScholarGoogle Scholar
  23. Ninh Pham and Rasmus Pagh. 2013. Fast and Scalable Polynomial Kernels via Explicit Feature Maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD ’13). ACM, 239–247.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Nhathai Phan, Soon Ae Chun, Manasi Bhole, and James Geller. 2017. Enabling real-time drug abuse detection in tweets. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 1510–1514.Google ScholarGoogle ScholarCross RefCross Ref
  25. Daniel Preotiuc-Pietro, Sina Samangooei, Trevor Cohn, Nicholas Gibbins, and Mahesan Niranjan. 2012. Trendminer: An architecture for real time analysis of social media text. (2012).Google ScholarGoogle Scholar
  26. Abdulhakim A Qahtan, Basma Alharbi, Suojin Wang, and Xiangliang Zhang. 2015. A pca-based change detection framework for multidimensional data streams: Change detection in multidimensional data streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 935–944.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Muhammad Abdullah Adnan Ranak Roy Chowdhuryand Rajesh K. Gupta. 2019. Real Time Principal Component Analysis. In to appear in the Proceedings of 35th IEEE International Conference on Data Engineering(ICDE ‘19). IEEE.Google ScholarGoogle Scholar
  28. Sam Roweis. 1998. EM Algorithms for PCA and SPCA. In in Advances in Neural Information Processing Systems. MIT Press, 626–632.Google ScholarGoogle Scholar
  29. Lindsay I Smith. 2002. A tutorial on principal components analysis. Technical Report.Google ScholarGoogle Scholar
  30. Md. Mehrab Tanjim and Muhammad Abdullah Adnan. 2018. sSketch: A Scalable Sketching Technique for PCA in the Cloud. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining(WSDM ‘18). ACM, 574–582.Google ScholarGoogle Scholar
  31. Md. Mehrab Tanjim and Muhammad Abdullah Adnan. 2018. sSketch: A Scalable Sketching Technique for PCA in the Cloud. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining(WSDM ‘18). ACM, 574–582.Google ScholarGoogle Scholar
  32. Zhewei Wei, Xuancheng Liu, Feifei Li, Shuo Shang, Xiaoyong Du, and Ji-Rong Wen. 2016. Matrix sketching over sliding windows. In Proceedings of the 2016 International Conference on Management of Data. ACM, 1465–1480.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. David P. Woodruff. 2014. Sketching As a Tool for Numerical Linear Algebra. Found. Trends Theor. Comput. Sci. 10, 1–2 (Oct. 2014), 1–157. https://doi.org/10.1561/0400000060Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Other conferences
    NSysS '20: Proceedings of the 7th International Conference on Networking, Systems and Security
    December 2020
    132 pages
    ISBN:9781450389051
    DOI:10.1145/3428363

    Copyright © 2020 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 22 December 2020

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate12of44submissions,27%
  • Article Metrics

    • Downloads (Last 12 months)12
    • Downloads (Last 6 weeks)3

    Other Metrics

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format