ABSTRACT
For finite document collections, continuous active learning ('CAL') has been observed to achieve high recall with high probability, at a labeling cost asymptotically proportional to the number of relevant documents. As the size of the collection increases, the number of relevant documents typically increases as well, thereby limiting the applicability of CAL to low-prevalence high-stakes classes, such as evidence in legal proceedings, or security threats, where human effort proportional to the number of relevant documents is justified. We present a scalable version of CAL ('S-CAL') that requires O(log N) labeling effort and O(N log N) computational effort---where N is the number of unlabeled training examples---to construct a classifier whose effectiveness for a given labeling cost compares favorably with previously reported methods. At the same time, S-CAL offers calibrated estimates of class prevalence, recall, and precision, facilitating both threshold setting and determination of the adequacy of the classifier.
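The abstract's complexity claim — O(log N) labeling effort via geometrically growing batches, with inverse-probability estimates of prevalence — can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' published S-CAL algorithm: `docs` stands in for a learned ranking, `label` for the human reviewer, and the batch cap, growth factor, and Horvitz-Thompson-style estimator are illustrative choices.

```python
import random


def s_cal_sketch(docs, label, n0=1, growth=2.0, cap=8, budget=64):
    """Illustrative S-CAL-style loop (NOT the authors' exact method).

    Batches of top-ranked documents grow geometrically, and only a
    capped random subsample of each batch is labeled, so the total
    labeling effort is bounded by `cap` labels per batch times
    O(log N) batches, rather than growing with the number of
    relevant documents.

    docs  : dict mapping doc_id -> score (stand-in for a classifier)
    label : oracle mapping doc_id -> 0/1 (stand-in for the reviewer)
    Returns the labeled set and an inverse-sampling-probability
    estimate of the number of relevant documents.
    """
    random.seed(0)  # deterministic for the demo
    ranked = sorted(docs, key=docs.get, reverse=True)  # best first
    labeled, est_relevant = {}, 0.0
    batch, i = n0, 0
    while i < len(ranked) and len(labeled) < budget:
        chunk = ranked[i:i + int(batch)]
        i += len(chunk)
        k = min(cap, len(chunk), budget - len(labeled))
        if k == 0:
            break
        sample = random.sample(chunk, k)  # subsample the batch
        w = len(chunk) / k  # inverse sampling probability
        for d in sample:
            y = label(d)
            labeled[d] = y
            est_relevant += w * y  # Horvitz-Thompson-style estimate
        batch *= growth  # geometric growth => O(log N) batches
    return labeled, est_relevant
```

With 1,000 documents whose scores decay with index and the first 50 relevant, the loop touches roughly log2(1000) batches and labels at most `budget` documents, while the weighted estimate tracks the true prevalence — the property the abstract attributes to S-CAL's calibrated estimates.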
Scalability of Continuous Active Learning for Reliable High-Recall Text Classification