Published in: Soft Computing 8/2016

28.11.2015 | Focus

Term frequency with average term occurrences for textual information retrieval

Authors: O. Ali Sadek Ibrahim, D. Landa-Silva


Abstract

In information retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism when the vector space model is used. In this paper, we propose a new TWS that computes the average term occurrences of terms in documents and applies a discriminative approach, based on the document centroid vector, to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, because achieving full judgement is expensive and may be infeasible for large collections. A document collection is fully judged when every document in it acts as a relevant document for a specific query or group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring prior knowledge of relevance judgements. We compare the proposed TF-ATO with the well-known TF-IDF approach and show that TF-ATO gives better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-word removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-word removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, the proposed discriminative approach is shown to improve IR effectiveness and performance without any information on the relevance judgements for the collection.
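The abstract describes the scheme only in words, so the following is a minimal Python sketch of one possible reading, not the authors' implementation. It assumes that a term's weight is its frequency in a document divided by that document's average term occurrence (total occurrences over distinct terms), that the centroid is the per-term mean weight over all documents, and that the discriminative step drops any weight below the centroid value for its term. The function names and the exact formulas are assumptions; the paper's definitions may differ.

```python
from collections import Counter


def tf_ato_weights(docs):
    """Weight each term by tf / ATO, where ATO is the document's
    average term occurrence (assumed: total tokens / distinct terms)."""
    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        if not counts:
            weighted.append({})
            continue
        ato = sum(counts.values()) / len(counts)
        weighted.append({term: tf / ato for term, tf in counts.items()})
    return weighted


def centroid(weighted):
    """Per-term mean weight across all documents (absent terms count as 0)."""
    totals = Counter()
    for doc in weighted:
        for term, weight in doc.items():
            totals[term] += weight
    n_docs = len(weighted)
    return {term: total / n_docs for term, total in totals.items()}


def discriminate(weighted):
    """Assumed discriminative step: keep a weight only if it is at
    least the centroid value for that term."""
    c = centroid(weighted)
    return [
        {term: w for term, w in doc.items() if w >= c[term]}
        for doc in weighted
    ]


if __name__ == "__main__":
    docs = [["a", "a", "b"], ["a", "c"]]
    weights = tf_ato_weights(docs)
    print(weights)
    print(discriminate(weights))
```

Note that this pruning step uses only the collection itself, which matches the abstract's point that no relevance judgements are needed.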


Metadata
Title
Term frequency with average term occurrences for textual information retrieval
Authors
O. Ali Sadek Ibrahim
D. Landa-Silva
Publication date
28.11.2015
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 8/2016
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-015-1935-7
