Building efficient and effective metasearch engines

Published: 01 March 2002

Abstract

Frequently, the information a user needs is stored in the databases of multiple search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search engines and identify useful documents from the returned results. To support unified access to multiple search engines, a metasearch engine can be constructed. When a metasearch engine receives a query from a user, it invokes the underlying search engines to retrieve useful information for the user. Metasearch engines have other benefits as a search tool, such as increasing the search coverage of the Web and improving the scalability of the search. In this article, we survey techniques that have been proposed to tackle several underlying challenges for building a good metasearch engine. Among the main challenges, the database selection problem is to identify search engines that are likely to return useful documents for a given query. The document selection problem is to determine what documents to retrieve from each identified search engine. The result merging problem is to combine the documents returned from multiple search engines. We also point out some problems that need further research.
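
To make the three problems concrete, the following is a minimal sketch of how they might fit together in a single query path. It is an illustration only and not one of the surveyed techniques: the names `select_databases`, `merge_results`, and `metasearch`, the term-overlap selection rule, the fixed per-engine cutoff, and the divide-by-top-score normalization are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical types: a result is (doc_id, local_score); an engine maps
# (query, k) to its top-k results. Neither comes from the article.
Result = Tuple[str, float]
Engine = Callable[[str, int], List[Result]]


def select_databases(query: str, summaries: Dict[str, Set[str]], top_n: int) -> List[str]:
    """Database selection (naive assumption): rank engines by how many
    query terms appear in a per-engine vocabulary summary."""
    terms = set(query.lower().split())
    overlap = {name: len(terms & vocab) for name, vocab in summaries.items()}
    candidates = [name for name, hits in overlap.items() if hits > 0]
    candidates.sort(key=lambda name: (-overlap[name], name))
    return candidates[:top_n]


def merge_results(result_lists: List[List[Result]]) -> List[Result]:
    """Result merging (naive assumption): divide each engine's local scores
    by that engine's top score, keep the best normalized score for
    duplicate documents, and sort in descending order."""
    merged: Dict[str, float] = {}
    for results in result_lists:
        if not results:
            continue
        top = max(score for _, score in results)
        for doc_id, score in results:
            normalized = score / top if top > 0 else 0.0
            merged[doc_id] = max(merged.get(doc_id, 0.0), normalized)
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


def metasearch(query: str, engines: Dict[str, Engine],
               summaries: Dict[str, Set[str]],
               top_n: int = 2, per_engine: int = 10) -> List[Result]:
    """Document selection is reduced here to a fixed cutoff: ask every
    selected engine for its top `per_engine` documents."""
    selected = select_databases(query, summaries, top_n)
    result_lists = [engines[name](query, per_engine) for name in selected]
    return merge_results(result_lists)
```

In this sketch, each stage is a separate function so that a more refined selection or merging strategy could replace the naive one without changing the overall query path.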
