DOI: 10.1145/1242572.1242584
Article

Web object retrieval

Published: 08 May 2007

ABSTRACT

The primary function of current Web search engines is essentially relevance ranking at the document level. However, a great deal of structured information about real-world objects is embedded in static Web pages and online Web databases. Document-level information retrieval can unfortunately lead to highly inaccurate relevance ranking in answering object-oriented queries. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context, where multiple copies of information about the same object typically exist. These copies may be inconsistent because of the varying quality of Web sites and the limited performance of current information extraction techniques. If we simply combine the noisy and inaccurate attribute information extracted from different sources, we may not be able to achieve satisfactory retrieval performance. We therefore propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performance. We conclude that the hybrid model is superior because it takes into account extraction errors at varying levels.
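The abstract does not give the models' formulas, so the following Python sketch is only one plausible reading of the three ideas, not the authors' formulation. It assumes that an object is a list of extracted records, that each record carries a hypothetical `record_acc` (how reliable the record as a whole is) and `attr_acc` (how reliable its attribute labeling is), and that term probabilities are Dirichlet-smoothed unigram estimates. All names, weights, and sample data are illustrative assumptions.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def collection_counts(objects):
    """Term counts over every field of every record of every object."""
    return Counter(
        t
        for records in objects
        for r in records
        for text in r["fields"].values()
        for t in tokens(text)
    )


def term_prob(word, text_counts, coll, mu=10.0):
    """Dirichlet-smoothed unigram probability P(word | text)."""
    text_len = sum(text_counts.values())
    coll_len = sum(coll.values())
    # Add-one background estimate keeps unseen words at a nonzero probability.
    p_bg = (coll.get(word, 0) + 1) / (coll_len + len(coll) + 1)
    return (text_counts.get(word, 0) + mu * p_bg) / (text_len + mu)


def unstructured_prob(word, record, coll):
    """Treat the whole record as one bag of words, ignoring attribute labels."""
    counts = Counter(t for text in record["fields"].values() for t in tokens(text))
    return term_prob(word, counts, coll)


def structured_prob(word, record, coll, field_weights):
    """Weighted mixture of one language model per attribute (weights sum to 1)."""
    return sum(
        field_weights[name] * term_prob(word, Counter(tokens(text)), coll)
        for name, text in record["fields"].items()
    )


def hybrid_score(query, records, coll, field_weights):
    """Log-likelihood of the query under an accuracy-weighted hybrid model."""
    total_acc = sum(r["record_acc"] for r in records)
    log_score = 0.0
    for word in tokens(query):
        p = 0.0
        for r in records:
            g = r["attr_acc"]  # trust in this record's attribute labeling
            mix = g * structured_prob(word, r, coll, field_weights) \
                + (1.0 - g) * unstructured_prob(word, r, coll)
            p += (r["record_acc"] / total_acc) * mix
        log_score += math.log(p)
    return log_score


# Two toy "paper" objects; the first has two inconsistent extracted copies.
paper_a = [
    {"fields": {"title": "web object retrieval", "author": "nie ma wen"},
     "record_acc": 0.9, "attr_acc": 0.8},
    {"fields": {"title": "web object retrieval models", "author": "nie"},
     "record_acc": 0.6, "attr_acc": 0.5},
]
paper_b = [
    {"fields": {"title": "block level link analysis", "author": "cai he wen ma"},
     "record_acc": 0.9, "attr_acc": 0.8},
]
coll = collection_counts([paper_a, paper_b])
weights = {"title": 0.7, "author": 0.3}

score_a = hybrid_score("object retrieval", paper_a, coll, weights)
score_b = hybrid_score("object retrieval", paper_b, coll, weights)
```

In this sketch, setting every `attr_acc` to 0 reduces the score to a purely unstructured model, and setting it to 1 reduces it to a purely structured one. That is one way to picture a hybrid that leans on attribute structure only as far as the extraction is trusted.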


      Published in

      WWW '07: Proceedings of the 16th international conference on World Wide Web
      May 2007
      1382 pages
      ISBN: 9781595936547
      DOI: 10.1145/1242572

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
