ABSTRACT
The primary function of current Web search engines is essentially relevance ranking at the document level. However, a large amount of structured information about real-world objects is embedded in static Web pages and online Web databases. Document-level information retrieval can lead to highly inaccurate relevance ranking when answering object-oriented queries. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. This reliability assumption no longer holds in the object retrieval context, where multiple copies of information about the same object typically exist. These copies may be inconsistent because Web sites vary in quality and current information extraction techniques have limited accuracy. If we simply combine the noisy and inaccurate attribute information extracted from different sources, we may not achieve satisfactory retrieval performance. We propose several language models for Web object retrieval: an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performance. We conclude that the hybrid model is superior because it takes into account extraction errors at varying levels.
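To make the idea of combining unreliable copies more concrete, here is a minimal Python sketch of one way an object could be scored against a query. It is an illustration under stated assumptions, not the models proposed in the paper, whose formulas the abstract does not give. The function name `object_log_likelihood`, the per-record `accuracy` values, the mixing rule, and the smoothing weight `lam` are all hypothetical.

```python
import math
from collections import Counter


def object_log_likelihood(query, records, collection, lam=0.5):
    """Log P(query | object) under a smoothed, accuracy-weighted mixture.

    Assumptions (not taken from the paper):
      records    -- list of (tokens, accuracy) pairs, one per extracted copy
                    of the object; accuracy is a positive reliability weight.
      collection -- Counter of term frequencies over the whole corpus.
      lam        -- weight on the collection model (Jelinek-Mercer smoothing).
    """
    total_acc = sum(acc for _, acc in records)
    coll_len = sum(collection.values())
    score = 0.0
    for term in query:
        # Mix the per-record term probabilities, trusting accurate copies more.
        p_obj = 0.0
        for tokens, acc in records:
            if tokens:
                p_obj += (acc / total_acc) * tokens.count(term) / len(tokens)
        # Back off to the collection model so unseen terms do not zero the score.
        p_coll = collection[term] / coll_len
        p = (1 - lam) * p_obj + lam * p_coll
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


# Two copies of the same paper object: one clean, one noisy.
clean_tokens = ["language", "models", "web", "object", "retrieval"]
noisy_tokens = ["home", "login", "web", "search"]
collection = Counter(clean_tokens + noisy_tokens + ["web", "page", "retrieval"])
query = ["object", "retrieval"]

trusted_clean = object_log_likelihood(
    query, [(clean_tokens, 0.9), (noisy_tokens, 0.3)], collection)
trusted_noisy = object_log_likelihood(
    query, [(clean_tokens, 0.3), (noisy_tokens, 0.9)], collection)
```

In this toy setup the object scores higher when the clean copy carries the larger weight, which is the intuition behind accounting for extraction errors rather than pooling all copies equally. A structured variant would apply the same weighting per attribute (title, author, and so on) instead of per record.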
Index Terms
- Web object retrieval