Published in: Discover Computing 3/2009

01-06-2009

Query structuring and expansion with two-stage term dependence for Japanese web retrieval

Authors: Koji Eguchi, W. Bruce Croft

Abstract

In this paper, we propose a new term dependence model for information retrieval, based on a theoretical framework using Markov random fields. We assume two types of dependencies between the terms given in a query: (i) long-range dependencies that may appear, for instance, within a passage or a sentence in a target document, and (ii) short-range dependencies that may appear, for instance, within a compound word in a target document. Based on this assumption, our two-stage term dependence model treats long-range and short-range term dependencies differently when more than one compound word appears in a query. We also investigate how query structuring with term dependence can improve the performance of query expansion using a relevance model. To expand the query, the relevance model is constructed from the retrieval results of the structured query with term dependence. Through experiments using a 100-gigabyte test collection of web documents mostly written in Japanese, we show that our term dependence model works well, particularly when using query structuring with compound words. We also show that the performance of the relevance model can be significantly improved by using the structured query with our term dependence model.
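The two kinds of dependence described in the abstract can be illustrated with a minimal sketch that builds a structured query string. It assumes Indri-style operators (`#weight`, `#combine`, `#1` for an exact ordered phrase, `#uwN` for an unordered window of N words); the function name `two_stage_query`, the window size, the weights, and the nesting of `#1` inside `#uwN` are all illustrative assumptions, not the formulation or parameters used in the paper.

```python
from itertools import combinations


def two_stage_query(compounds, window=8, weights=(0.8, 0.1, 0.1)):
    """Build an Indri-style structured query string (illustrative only).

    compounds: list of compound words, each given as a list of its
               constituent words after segmentation.
    window:    assumed width of the unordered window for long-range
               dependencies between compound words.
    weights:   assumed weights for (single terms, short-range, long-range).
    """
    w_term, w_short, w_long = weights

    # Term-independence part: every constituent word on its own.
    terms = [word for compound in compounds for word in compound]

    # Represent a compound as an exact phrase if it has several constituents.
    def unit(compound):
        if len(compound) > 1:
            return "#1(" + " ".join(compound) + ")"
        return compound[0]

    # Short-range dependence: constituents adjacent and in order
    # inside one compound word.
    short = [unit(c) for c in compounds if len(c) > 1]

    # Long-range dependence: pairs of compound words co-occurring
    # anywhere inside an unordered window.
    long_range = [
        f"#uw{window}({unit(a)} {unit(b)})"
        for a, b in combinations(compounds, 2)
    ]

    # Skip empty groups so that no empty #combine() is emitted.
    groups = [(w_term, terms), (w_short, short), (w_long, long_range)]
    body = " ".join(
        f"{w} #combine({' '.join(parts)})" for w, parts in groups if parts
    )
    return f"#weight({body})"


# Romanized placeholder tokens standing in for segmented Japanese words.
print(two_stage_query([["joho", "kensaku"], ["hyoka"]]))
```

With a single compound word there is nothing to pair at the long-range stage, so only the term and short-range parts are produced. Whether a particular search engine accepts this exact syntax should be checked against its query language documentation.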

Footnotes
1
The Japanese language is mainly written in kanji, hiragana and katakana characters. Kanji are derived from ancient Chinese characters. English alphabetic words are also sometimes used in Japanese text, especially as proper nouns.
 
2
This kind of approach was also employed for English (e.g., Mitra et al. 1997).
 
3
This is also referred to as the term independence model hereafter.
 
4
Query Term Expansion Subtask.
 
5
For the training, we used the relevance judgment data based on the page-unit document model (Eguchi et al. 2003) included in the NTCIR-3 WEB test collection.
 
6
The topics were a subset of those created for the NTCIR-4 WEB Informational Retrieval Subtask. Additional relevance judgments were made by extending the relevance data of NTCIR-4 WEB. The task was motivated by the question “Which terms should be added to the original query to improve search results?” The objectives of this paper differ from those of that task; however, the data set is suitable for our experiments.
 
8
With the MeCab tool, the resulting segmentation is the same whether or not the POS tagging function is selected.
 
9
As suffix words, we used suffix nouns, suffix verbs and suffix adjectives; and as prefix words, we used nominal prefixes, verbal prefixes, adjectival prefixes and numerical prefixes, according to the part-of-speech system used in the MeCab tool.
 
10
If Metzler and Croft’s model were instead applied to all decomposed words in the whole query, ignoring the boundaries between query components, the number of word combinations would increase exponentially. Indeed, in our preliminary experiments with some Japanese queries, searches using this simple application did not complete within a feasible time.
 
11
We tested using both the sequential dependence model and the full dependence model. The results of these two models were almost the same in this context. Only the result using the sequential dependence model is shown, as naive-lsd, in Fig. 4.
 
12
The results of significance tests on ‘AvgPrec_c’ or ‘AvgPrec_o’ are not presented, since having only 22 or 13 topics makes it difficult to achieve statistical significance.
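The exponential growth mentioned in footnote 10 can be made concrete with a small counting sketch. It assumes that a dependence feature may be defined over any subset of two or more words, which gives 2^n − n − 1 subsets for n words; the helper names and the two-stage counting scheme (subsets inside each query component plus subsets over whole components) are illustrative simplifications, not the feature set used in the paper.

```python
def dependent_subsets(n):
    """Number of subsets containing at least two of n items: 2^n - n - 1."""
    return 2 ** n - n - 1


def two_stage_subsets(component_sizes):
    """Assumed two-stage count: subsets of words inside each component,
    plus subsets of whole components treated as single units."""
    within = sum(dependent_subsets(k) for k in component_sizes)
    across = dependent_subsets(len(component_sizes))
    return within + across


# Hypothetical query: three components decomposed into 3, 3 and 2 words.
sizes = [3, 3, 2]
print(dependent_subsets(sum(sizes)))   # boundaries ignored, all 8 words pooled
print(two_stage_subsets(sizes))        # boundaries between components respected
```

For this hypothetical query, pooling all eight words yields 247 candidate subsets, while respecting component boundaries yields 13, which suggests why ignoring the boundaries quickly becomes impractical for longer queries.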
 
Literature
Buckley, C., Salton, G., Allan, J., & Singhal, A. (1994). Automatic query expansion using SMART: TREC 3. In Proceedings of the 3rd text retrieval conference, Gaithersburg, MD (pp. 69–80).
Chen, A., & Gey, F. C. (2002). Experiments on cross-language and patent retrieval at NTCIR-3 Workshop. In Proceedings of the 3rd NTCIR workshop, Tokyo, Japan.
Clarke, C., Craswell, N., & Soboroff, I. (2004). Overview of the TREC 2004 terabyte track. In Proceedings of TREC 2004, Gaithersburg, MD.
Craswell, N., & Hawking, D. (2003). Overview of the TREC 2003 web track. In Proceedings of TREC 2003, Gaithersburg, MD (pp. 78–92).
Croft, W. B., & Lafferty, J. (Eds.). (2003). Language modeling for information retrieval. Norwell, MA: Kluwer Academic Publishers.
Croft, W. B., Turtle, H. R., & Lewis, D. D. (1991). The use of phrases and structured queries in information retrieval. In Proceedings of ACM SIGIR 1991, Illinois, USA (pp. 32–45).
Eguchi, K. (2005). NTCIR-5 query expansion experiments using term dependence models. In Proceedings of the 5th NTCIR workshop, Tokyo, Japan.
Eguchi, K., Oyama, K., Aizawa, A., & Ishikawa, H. (2004). Overview of the informational retrieval task at NTCIR-4 WEB. In Proceedings of the 4th NTCIR workshop, Tokyo, Japan.
Eguchi, K., Oyama, K., Ishida, E., Kando, N., & Kuriyama, K. (2003). Overview of the web retrieval task at the third NTCIR workshop. In Proceedings of the 3rd NTCIR workshop, Tokyo, Japan.
Fujii, H., & Croft, W. B. (1993). A comparison of indexing techniques for Japanese text retrieval. In Proceedings of ACM SIGIR 1993, Pittsburgh, PA (pp. 237–246).
Fujita, S. (1999). Notes on phrasal indexing: JSCB evaluation experiments at NTCIR ad hoc. In Proceedings of the first NTCIR workshop, Tokyo, Japan (pp. 101–108).
Jones, G. J. F., Sakai, T., Kajiura, M., & Sumita, K. (1998). Experiments in Japanese text retrieval and routing using the NEAT system. In Proceedings of ACM SIGIR 1998, Melbourne, Australia (pp. 197–205).
Kando, N., Kageura, K., Yoshioka, M., & Oyama, K. (1998). Phrase processing methods for Japanese text retrieval. SIGIR Forum, 32(2), 23–28.
Lavrenko, V., & Croft, W. B. (2001). Relevance based language models. In Proceedings of ACM SIGIR 2001, New Orleans, LA (pp. 120–127).
Metzler, D., & Croft, W. B. (2004). Combining the language model and inference network approaches to retrieval. Information Processing and Management, 40(5), 735–750.
Metzler, D., & Croft, W. B. (2005). A Markov random field model for term dependencies. In Proceedings of ACM SIGIR 2005, Salvador, Brazil (pp. 472–479).
Metzler, D., Strohman, T., Turtle, H., & Croft, W. B. (2004). Indri at TREC 2004: Terabyte track. In Proceedings of TREC 2004, Gaithersburg, MD.
Mishne, G., & de Rijke, M. (2005). Boosting web retrieval through query operations. In Proceedings of the 27th European conference on information retrieval research, Santiago de Compostela, Spain (pp. 502–516).
Mitra, M., Buckley, C., Singhal, A., & Cardie, C. (1997). An analysis of statistical and syntactic phrases. In Proceedings of RIAO 97, Montreal, Canada (pp. 200–214).
Moulinier, I., Molina-Salgado, H., & Jackson, P. (2002). Thomson legal and regulatory at NTCIR-3: Japanese, Chinese and English retrieval experiments. In Proceedings of the 3rd NTCIR workshop, Tokyo, Japan.
Ogawa, Y., & Matsuda, T. (1997). Overlapping statistical word indexing: A new indexing method for Japanese text. In Proceedings of ACM SIGIR 1997, Philadelphia, PA (pp. 226–234).
Strohman, T. (2007). Efficient processing of complex features for information retrieval. PhD thesis, University of Massachusetts, Amherst.
Strohman, T., Turtle, H., & Croft, W. B. (2005). Optimization strategies for complex queries. In Proceedings of ACM SIGIR 2005, Salvador, Brazil (pp. 219–225).
Turtle, H. R., & Croft, W. B. (1991). Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3), 187–222.
Xu, J., & Croft, W. B. (1996). Query expansion using local and global document analysis. In Proceedings of ACM SIGIR 1996, Zurich, Switzerland (pp. 4–11).
Yoshioka, M. (2005). Overview of the NTCIR-5 WEB query expansion task. In Proceedings of the 5th NTCIR workshop, Tokyo, Japan.
Zhai, C., & Lafferty, J. (2001). Model-based feedback in the language modeling approach to information retrieval. In Proceedings of ACM CIKM 2001, Atlanta, GA (pp. 403–410).
Metadata
Title: Query structuring and expansion with two-stage term dependence for Japanese web retrieval
Authors: Koji Eguchi, W. Bruce Croft
Publication date: 01-06-2009
Publisher: Springer Netherlands
Published in: Discover Computing, Issue 3/2009
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-009-9092-1
