ABSTRACT
Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for sensitive information before they are released. Tools that help human reviewers spot sensitive information are therefore of great value to government organizations subject to Freedom of Information laws. We frame sensitivity identification as semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task and thus requires a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.
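To make the role of a recall-oriented measure concrete, the following is a minimal Python sketch, not the formulation used in the paper. It computes F-beta from contingency counts (beta = 2 gives F2) and ranks documents by the expected gain in F2 a reviewer would obtain by checking them. The function names, the example counts, and the assumption that the classifier outputs a calibrated probability of sensitivity are all illustrative.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F_beta from contingency counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return 1.0 if denom == 0 else (1 + b2) * tp / denom


def gain_from_fixing(tp, fp, fn, error, beta=2.0):
    """Increase in F_beta when a reviewer corrects one misclassification."""
    base = f_beta(tp, fp, fn, beta)
    if error == "fn":  # a missed sensitive document becomes a true positive
        return f_beta(tp + 1, fp, fn - 1, beta) - base
    if error == "fp":  # a false alarm becomes a true negative
        return f_beta(tp, fp - 1, fn, beta) - base
    raise ValueError(f"unknown error type: {error}")


def rank_for_review(docs, tp, fp, fn, beta=2.0):
    """Order documents by expected F_beta gain from inspecting them.

    docs: iterable of (doc_id, predicted_sensitive, prob_sensitive).
    tp, fp, fn: assumed estimates of the classifier's contingency counts.
    """
    gain_fp = gain_from_fixing(tp, fp, fn, "fp", beta)
    gain_fn = gain_from_fixing(tp, fp, fn, "fn", beta)

    def expected_gain(doc):
        _, predicted_sensitive, prob_sensitive = doc
        if predicted_sensitive:
            # The label is wrong with probability 1 - p; fixing it removes an FP.
            return (1.0 - prob_sensitive) * gain_fp
        # The label is wrong with probability p; fixing it removes an FN.
        return prob_sensitive * gain_fn

    return [doc[0] for doc in sorted(docs, key=expected_gain, reverse=True)]


# Hypothetical counts and documents, for illustration only.
docs = [("a", True, 0.6), ("b", False, 0.4), ("c", True, 0.95)]
ranking = rank_for_review(docs, tp=40, fp=10, fn=10)
print(ranking)
```

Under F2, correcting a missed sensitive document raises the score more than correcting a false alarm, so in this sketch a document predicted non-sensitive can be ranked above one predicted sensitive even when both labels are equally likely to be wrong.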
Index Terms
- Semi-Automated Text Classification for Sensitivity Identification