2022 | Original Paper | Book Chapter

Comparing Intrinsic and Extrinsic Evaluation of Sensitivity Classification

Authors: Mahmoud F. Sayed, Nishanth Mallekav, Douglas W. Oard

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Abstract

With the accelerating generation of digital content, it is often impractical to manually separate sensitive information from shareable information at the point of creation. As a result, a great deal of useful content becomes inaccessible simply because it is intermixed with sensitive content. This paper compares traditional and neural techniques for detecting sensitive content and finds that using the two together can yield improved results. Experiments with two test collections, one in which sensitivity is modeled as a topic and a second in which sensitivity is annotated directly, yield consistent improvements on an intrinsic (classification effectiveness) measure. Extrinsic evaluation uses a recently proposed learning-to-rank framework for sensitivity-aware ranked retrieval, together with a measure that rewards finding relevant documents but penalizes revealing sensitive ones.
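The abstract does not spell out how the two classifiers are combined or how the extrinsic measure is defined, so the following is only an illustrative sketch under stated assumptions. It assumes the combination is a simple weighted average of each classifier's sensitivity probability, and it uses a hypothetical DCG-style score (`penalized_dcg`) with a gain for relevant documents and a negative gain for sensitive ones. The function names, the `weight` and `penalty` parameters, and the toy document IDs are all invented for illustration and are not taken from the paper.

```python
import math


def combine_scores(p_traditional, p_neural, weight=0.5):
    """Interpolate two classifiers' sensitivity probabilities.

    Assumption: a plain weighted average; the paper may combine
    the two techniques differently.
    """
    return [weight * a + (1 - weight) * b
            for a, b in zip(p_traditional, p_neural)]


def penalized_dcg(ranking, relevant, sensitive, penalty=1.0, k=10):
    """Hypothetical DCG-style score for a ranked list.

    Each sensitive document shown contributes -penalty, each relevant
    non-sensitive document contributes +1, and everything else 0.
    Contributions are discounted by log2(rank + 1).
    """
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        if doc in sensitive:
            gain = -penalty
        elif doc in relevant:
            gain = 1.0
        else:
            gain = 0.0
        score += gain / math.log2(rank + 1)
    return score


# Toy example: d2 is sensitive, d1 and d3 are relevant.
relevant = {"d1", "d3"}
sensitive = {"d2"}

unfiltered = penalized_dcg(["d1", "d2", "d3"], relevant, sensitive)
filtered = penalized_dcg(["d1", "d3"], relevant, sensitive)
print(unfiltered < filtered)  # withholding the sensitive document scores higher
```

In this toy setting, intrinsic evaluation would compare the combined probabilities against sensitivity labels (for example, by accuracy after thresholding), while extrinsic evaluation would ask whether better probabilities lead to rankings that score higher on a measure of this kind.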


Metadata
Title
Comparing Intrinsic and Extrinsic Evaluation of Sensitivity Classification
Authors
Mahmoud F. Sayed
Nishanth Mallekav
Douglas W. Oard
Copyright Year
2022
DOI
https://doi.org/10.1007/978-3-030-99739-7_25
