short-paper

Semi-Automated Text Classification for Sensitivity Identification

Authors:
Giacomo Berardi, Consiglio Nazionale delle Ricerche, Pisa, Italy
Andrea Esuli, Consiglio Nazionale delle Ricerche, Pisa, Italy
Craig Macdonald, University of Glasgow, Glasgow, United Kingdom
Iadh Ounis, University of Glasgow, Glasgow, United Kingdom
Fabrizio Sebastiani, Qatar Computing Research Institute, Doha, Qatar

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, October 2015, Pages 1711–1714. https://doi.org/10.1145/2806416.2806597

Published: 17 October 2015

ABSTRACT

Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F₂. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.
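
To make the recall-oriented ranking idea concrete, the following is a minimal sketch, not the authors' implementation. It computes F_β from contingency counts and then orders documents by the expected gain in F₂ that a reviewer would obtain by checking each one. The function names (`f_beta`, `rank_for_review`), the input layout, and the assumption that estimated counts of true positives, false positives, and false negatives are already available are all hypothetical choices made for illustration.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F_beta from contingency counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Convention assumed here: with no positives and no errors, score is 1.0.
    return (1 + b2) * tp / denom if denom > 0 else 1.0


def rank_for_review(docs, tp, fp, fn, beta=2.0):
    """Order documents by expected gain in F_beta from a human check.

    docs: list of (doc_id, predicted_sensitive, p_sensitive), where
    p_sensitive is the classifier's estimated probability of sensitivity.
    tp, fp, fn: estimated contingency counts (assumed given).
    """
    base = f_beta(tp, fp, fn, beta)
    # Gain from correcting one false positive (it becomes a true negative).
    gain_fix_fp = f_beta(tp, fp - 1, fn, beta) - base if fp > 0 else 0.0
    # Gain from correcting one false negative (it becomes a true positive).
    gain_fix_fn = f_beta(tp + 1, fp, fn - 1, beta) - base if fn > 0 else 0.0

    scored = []
    for doc_id, predicted, p in docs:
        if predicted:
            # Predicted sensitive: it is wrong with probability 1 - p.
            utility = (1 - p) * gain_fix_fp
        else:
            # Predicted not sensitive: it is wrong with probability p.
            utility = p * gain_fix_fn
        scored.append((utility, doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Hypothetical example: three classified documents and estimated counts.
docs = [("a", True, 0.6), ("b", False, 0.4), ("c", False, 0.1)]
ranking = rank_for_review(docs, tp=6, fp=4, fn=4, beta=2.0)
print(ranking)
```

With β = 2, correcting a missed sensitive document raises the score far more than removing a false alarm, so under this sketch documents predicted non-sensitive with a non-negligible sensitivity probability tend to be placed early in the review queue.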

Index Terms

Computing methodologies
  Machine learning

Published in
CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
October 2015
1998 pages
ISBN:9781450337946
DOI:10.1145/2806416
General Chairs:
James Bailey, The University of Melbourne
Alistair Moffat, The University of Melbourne
Program Chairs:
Charu C. Aggarwal, IBM
Maarten de Rijke, University of Amsterdam
Ravi Kumar, Google
Vanessa Murdock, Microsoft
Timos Sellis, RMIT University
Jeffrey Xu Yu, Chinese University of Hong Kong
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
semi-automated text classification
sensitive information
Qualifiers
- short-paper
Acceptance Rates
CIKM '15 Paper Acceptance Rate: 165 of 646 submissions, 26%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%