
Distributing Active Learning Algorithms

Published: 22 December 2020

ABSTRACT

Active learning is a machine learning strategy that seeks an optimal labeling sequence for a large pool of unlabeled data. In many settings, labeled data is scarce relative to the unlabeled samples, and labeling requires substantial time and expert supervision. In such cases, we need to choose an optimal labeling order so that the model reaches a reasonably good accuracy with a relatively small number of labeled samples. The problem becomes harder still when the data is itself distributed and requires a distributed processing framework. In this work, we propose distributed implementations of state-of-the-art active learning algorithms and analyze them from several angles. The algorithms are tested on real datasets using multi-node Spark clusters, with the data stored on a distributed file system (HDFS). We show that our algorithms outperform random labeling (i.e., the non-active-learning baseline) and compare their performance against one another. The code is publicly available at https://github.com/dv66/Distributed-Active-Learning
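The abstract describes the core active-learning loop: train on a small labeled seed, score the unlabeled pool, query the most informative points, and retrain. The sketch below is not the paper's distributed Spark implementation; it is a minimal single-machine illustration of pool-based uncertainty sampling (the least-confidence variant) using scikit-learn and synthetic data. All names, the seed-selection scheme, and the batch size are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sample(model, X_pool, batch_size=10):
    """Return indices into X_pool of the least-confident predictions."""
    proba = model.predict_proba(X_pool)
    confidence = proba.max(axis=1)             # top-class probability per sample
    return np.argsort(confidence)[:batch_size]  # least confident first

# Synthetic two-class data (illustrative, not one of the paper's datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Seed the labeled set with a few examples of each class.
pos = np.where(y == 1)[0][:5]
neg = np.where(y == 0)[0][:5]
labeled = list(pos) + list(neg)
pool = [i for i in range(200) if i not in labeled]

# One round of the active-learning loop: fit, query, "label", refit.
model = LogisticRegression().fit(X[labeled], y[labeled])
query = uncertainty_sample(model, X[pool], batch_size=5)
labeled.extend(pool[i] for i in query)         # oracle supplies labels here
model = LogisticRegression().fit(X[labeled], y[labeled])
```

In a distributed setting, the expensive step is scoring the pool; since each sample is scored independently, that step parallelizes naturally across partitions (e.g., as a map over an HDFS-backed dataset), with only the top-k selection requiring a reduce.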



Published in

    NSysS '20: Proceedings of the 7th International Conference on Networking, Systems and Security
    December 2020, 132 pages
    ISBN: 9781450389051
    DOI: 10.1145/3428363

    Copyright © 2020 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rate

    Overall acceptance rate: 12 of 44 submissions, 27%
