CIKM '16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management
Research Article · Open Access · DOI: 10.1145/2983323.2983776

Scalability of Continuous Active Learning for Reliable High-Recall Text Classification

Published: 24 October 2016

ABSTRACT

For finite document collections, continuous active learning ('CAL') has been observed to achieve high recall with high probability, at a labeling cost asymptotically proportional to the number of relevant documents. As collection size increases, the number of relevant documents typically increases as well, thereby limiting the applicability of CAL to low-prevalence, high-stakes classes, such as evidence in legal proceedings or security threats, where human effort proportional to the number of relevant documents is justified. We present a scalable version of CAL ('S-CAL') that requires O(log N) labeling effort and O(N log N) computational effort, where N is the number of unlabeled training examples, to construct a classifier whose effectiveness for a given labeling cost compares favorably with that of previously reported methods. At the same time, S-CAL offers calibrated estimates of class prevalence, recall, and precision, facilitating both threshold setting and determination of the adequacy of the classifier.
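The abstract gives only the asymptotic costs; the minimal sketch below illustrates one way a CAL-style loop can meet them. It is an illustration under stated assumptions, not the authors' implementation: the tf-idf features, logistic-regression scorer, doubling batch size, fixed per-batch labeling budget b, and the oracle() stand-in for the human reviewer are all assumptions introduced here. Labeling at most b documents from each of roughly log2 N exponentially growing batches keeps total labeling effort O(log N), while scoring the remaining sample once per batch costs O(N log N) overall.

import random

from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def s_cal(pool_texts, seed_texts, seed_labels, oracle, b=5, rng=None):
    """One S-CAL-style pass over pool_texts (a list of N strings).

    seed_labels must contain both classes (e.g., a synthetic relevant
    seed document plus documents presumed non-relevant).  oracle(i)
    stands in for the human reviewer: it returns 1 iff pool document i
    is relevant.  Returns the final classifier and a prevalence estimate.
    """
    rng = rng or random.Random(0)
    vectorizer = TfidfVectorizer(sublinear_tf=True)
    X = vectorizer.fit_transform(list(seed_texts) + list(pool_texts))
    X_seed, X_pool = X[: len(seed_texts)], X[len(seed_texts):]

    labeled, labels = [], []                  # pool indices and their labels
    remaining = list(range(len(pool_texts)))
    batch_size, est_relevant = 1, 0.0

    while remaining:
        clf = LogisticRegression(max_iter=1000)
        X_train = vstack([X_seed] + ([X_pool[labeled]] if labeled else []))
        clf.fit(X_train, list(seed_labels) + labels)

        # Select the highest-scoring batch from what remains of the sample.
        scores = clf.decision_function(X_pool[remaining])
        order = sorted(range(len(remaining)), key=lambda j: -scores[j])
        batch = [remaining[j] for j in order[:batch_size]]

        # Label only a fixed-size random subsample of the batch, so total
        # labeling effort stays O(log N) while the batches double in size.
        subsample = rng.sample(batch, min(b, len(batch)))
        for i in subsample:
            labeled.append(i)
            labels.append(oracle(i))

        # Inverse-inclusion-probability estimate of relevant docs in batch.
        est_relevant += sum(labels[-len(subsample):]) * len(batch) / len(subsample)

        batch_set = set(batch)                # the whole batch leaves the pool
        remaining = [i for i in remaining if i not in batch_set]
        batch_size *= 2                       # doubling => O(log N) batches

    # Refit so the returned classifier reflects every label collected.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vstack([X_seed, X_pool[labeled]]), list(seed_labels) + labels)
    return clf, est_relevant / len(pool_texts)

Under these assumptions, a sample of one million documents yields about 20 batches and at most roughly 100 labels, consistent with the O(log N) bound, and the inverse-inclusion-probability sum gives the kind of calibrated prevalence estimate the abstract describes.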
