DOI: 10.1145/2806416.2806597
short-paper

Semi-Automated Text Classification for Sensitivity Identification

Published: 17 October 2015

ABSTRACT

Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.
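To make the role of a recall-oriented measure concrete, the sketch below computes F-beta from contingency counts and then orders documents for human checking by the expected gain in F2 from correcting each one. It is a simplified illustration of the general idea of ranking by expected improvement in the chosen effectiveness function, not the method of the paper or of reference [2]. The `Doc` class, the `p_error` field, the function names, and the estimated counts are all hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass


def f_beta(tp: float, fp: float, fn: float, beta: float = 2.0) -> float:
    """F-beta from contingency counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Convention assumed here: with no positives and no predictions, score 1.0.
    return 1.0 if denom == 0 else (1 + b2) * tp / denom


@dataclass
class Doc:
    doc_id: str
    predicted_sensitive: bool
    p_error: float  # hypothetical: estimated probability the predicted label is wrong


def expected_gain(doc: Doc, tp: float, fp: float, fn: float, beta: float = 2.0) -> float:
    """Expected increase in F-beta if a reviewer checks (and, if needed, fixes) doc.

    tp, fp, fn are estimated counts for the automatically classified set,
    assumed to be at least 1 each in this sketch.
    """
    base = f_beta(tp, fp, fn, beta)
    if doc.predicted_sensitive:
        # A wrong "sensitive" label is a false positive; fixing it removes one FP.
        corrected = f_beta(tp, fp - 1, fn, beta)
    else:
        # A wrong "not sensitive" label is a false negative; fixing it turns FN into TP.
        corrected = f_beta(tp + 1, fp, fn - 1, beta)
    return doc.p_error * (corrected - base)


def rank_for_review(docs, tp, fp, fn, beta=2.0):
    """Order documents so the most valuable ones to check come first."""
    return sorted(docs, key=lambda d: expected_gain(d, tp, fp, fn, beta), reverse=True)


if __name__ == "__main__":
    docs = [
        Doc("a", predicted_sensitive=True, p_error=0.30),
        Doc("b", predicted_sensitive=False, p_error=0.30),
        Doc("c", predicted_sensitive=False, p_error=0.05),
    ]
    for d in rank_for_review(docs, tp=50, fp=20, fn=10):
        print(d.doc_id, round(expected_gain(d, 50, 20, 10), 5))
```

Under F2, correcting a missed sensitive document raises the score more than correcting a false alarm, so at equal error probability a document predicted as not sensitive is placed ahead of one predicted as sensitive.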


    • Published in

      CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
      October 2015
      1998 pages
      ISBN: 9781450337946
      DOI: 10.1145/2806416

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      CIKM '15 Paper Acceptance Rate: 165 of 646 submissions, 26%
      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
