ABSTRACT
Active learning is a machine learning strategy that seeks an optimal labeling order for a large pool of unlabeled data. In many settings, labeled data are scarce relative to unlabeled samples, and labeling can demand substantial time and expert supervision. In such cases, an optimal labeling order is needed so that the model reaches good accuracy with relatively few labeled samples. The problem becomes harder still when the data are distributed and require a distributed processing framework. In this work, we propose distributed implementations of state-of-the-art active learning algorithms and analyze them from several angles. The algorithms are tested on real datasets using multi-node Spark clusters, with the data stored on a distributed file system (HDFS). We show that our algorithms outperform random labeling (i.e., the non-active-learning baseline) and compare their performance against one another. The code is publicly available at https://github.com/dv66/Distributed-Active-Learning
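To make the core idea concrete, below is a minimal single-machine sketch of pool-based uncertainty sampling, one of the standard query strategies in active learning: instead of labeling pool points at random, the learner asks for labels on the points its current model is least confident about. The nearest-centroid classifier here is a deliberately simple stand-in for illustration, not the paper's actual Spark/MLlib models, and the function names are assumptions of this sketch.

```python
import numpy as np

def train_centroids(X, y):
    # Fit one centroid per class on the currently labeled data.
    # (Stand-in for the distributed model training step.)
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_proba(centroids, X):
    # Turn distances to each class centroid into pseudo-probabilities
    # via a softmax over negative distances.
    d = np.stack([np.linalg.norm(X - mu, axis=1) for mu in centroids.values()],
                 axis=1)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def uncertainty_query(centroids, X_pool, batch=1):
    # Least-confidence uncertainty sampling: request labels for the pool
    # points whose highest predicted class probability is lowest.
    p = predict_proba(centroids, X_pool)
    confidence = p.max(axis=1)
    return np.argsort(confidence)[:batch]

# Toy usage: two labeled seeds at 0 and 10; the pool point at 5 sits on
# the decision boundary, so uncertainty sampling queries it first.
centroids = train_centroids(np.array([[0.0], [10.0]]), np.array([0, 1]))
query = uncertainty_query(centroids, np.array([[1.0], [5.0], [9.0]]), batch=1)
```

In a distributed setting like the paper's, the scoring step (`predict_proba` over the pool) is the part that maps naturally onto Spark: each partition scores its shard of the unlabeled pool, and only the top-uncertainty candidates are shipped back to the driver.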
Supplemental Material: Presentation slides (available for download)