ABSTRACT
Active learning is a machine learning strategy that seeks an optimal labeling order for a large pool of unlabeled data. In many settings, labeled data are scarce relative to unlabeled samples, and labeling can demand substantial time and expert supervision. In such cases, an optimal labeling order is needed so that the model reaches good accuracy with relatively few labeled samples. The problem becomes harder still when the data are distributed and require a distributed processing framework. In this work, we propose distributed implementations of state-of-the-art active learning algorithms and analyze them from several angles. The algorithms are tested on real datasets using multi-node Spark clusters, with the data stored on a distributed file system (HDFS). We show that our algorithms outperform random labeling (i.e., the non-active-learning baseline) and compare their performance against one another. The code is publicly available at https://github.com/dv66/Distributed-Active-Learning
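To make the core idea concrete, below is a minimal single-machine sketch of pool-based uncertainty sampling, one of the standard query strategies in active learning: instead of labeling pool points at random, the learner asks for labels on the points its current model is least confident about. The nearest-centroid classifier here is a deliberately simple stand-in for illustration, not the paper's actual Spark/MLlib models, and the function names are assumptions of this sketch.

```python
import numpy as np

def train_centroids(X, y):
    # Fit one centroid per class on the currently labeled data.
    # (Stand-in for the distributed model training step.)
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_proba(centroids, X):
    # Turn distances to each class centroid into pseudo-probabilities
    # via a softmax over negative distances.
    d = np.stack([np.linalg.norm(X - mu, axis=1) for mu in centroids.values()],
                 axis=1)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def uncertainty_query(centroids, X_pool, batch=1):
    # Least-confidence uncertainty sampling: request labels for the pool
    # points whose highest predicted class probability is lowest.
    p = predict_proba(centroids, X_pool)
    confidence = p.max(axis=1)
    return np.argsort(confidence)[:batch]

# Toy usage: two labeled seeds at 0 and 10; the pool point at 5 sits on
# the decision boundary, so uncertainty sampling queries it first.
centroids = train_centroids(np.array([[0.0], [10.0]]), np.array([0, 1]))
query = uncertainty_query(centroids, np.array([[1.0], [5.0], [9.0]]), batch=1)
```

In a distributed setting like the paper's, the scoring step (`predict_proba` over the pool) is the part that maps naturally onto Spark: each partition scores its shard of the unlabeled pool, and only the top-uncertainty candidates are shipped back to the driver.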
Supplemental Material: Presentation slides (available for download)