Abstract
The information a user needs is frequently spread across the databases of multiple search engines. It is inconvenient and inefficient for an ordinary user to invoke each search engine separately and then identify the useful documents among the returned results. A metasearch engine can be constructed to provide unified access to multiple search engines. When a metasearch engine receives a query from a user, it invokes the underlying search engines to retrieve useful information for that user. Metasearch engines also bring other benefits as a search tool, such as increasing the search coverage of the Web and improving the scalability of search. In this article, we survey techniques that have been proposed to tackle several of the underlying challenges in building a good metasearch engine. Among the main challenges, the database selection problem is to identify the search engines that are likely to return useful documents for a given query. The document selection problem is to determine which documents to retrieve from each identified search engine. The result merging problem is to combine the documents returned from multiple search engines into a single result list. We also point out some problems that need further research.
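The three problems named above can be illustrated with a toy pipeline. The following is a minimal sketch under stated assumptions, not any of the surveyed techniques: the names `make_engine`, `select_databases`, and `metasearch`, the term-overlap scoring, and the in-memory corpora are all hypothetical, and the merging step assumes that local scores from different engines are directly comparable, which is generally not true of real search engines.

```python
from typing import Callable, Dict, List, Set, Tuple

# A toy "search engine": maps a query to a ranked list of (doc_id, local_score).
Engine = Callable[[str], List[Tuple[str, float]]]


def make_engine(docs: Dict[str, str]) -> Engine:
    """Build a hypothetical engine that scores documents by query-term overlap."""
    def search(query: str) -> List[Tuple[str, float]]:
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in docs.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                hits.append((doc_id, overlap / len(terms)))
        # Highest score first; ties broken by document id for determinism.
        return sorted(hits, key=lambda h: (-h[1], h[0]))
    return search


def select_databases(query: str, summaries: Dict[str, Set[str]], k: int) -> List[str]:
    """Database selection (assumed heuristic): rank engines by how many
    query terms appear in their vocabulary summary, and keep the top k."""
    terms = set(query.lower().split())
    scored = [(len(terms & vocab), name) for name, vocab in summaries.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:k]]


def metasearch(query: str, engines: Dict[str, Engine],
               summaries: Dict[str, Set[str]],
               k_engines: int = 2, per_engine: int = 2) -> List[Tuple[str, float]]:
    merged: Dict[str, float] = {}
    for name in select_databases(query, summaries, k_engines):
        # Document selection: retrieve only the top `per_engine` results.
        for doc_id, score in engines[name](query)[:per_engine]:
            # Result merging: keep the best score seen for each document.
            merged[doc_id] = max(score, merged.get(doc_id, 0.0))
    return sorted(merged.items(), key=lambda h: (-h[1], h[0]))


# Hypothetical corpora, one per underlying search engine.
corpora = {
    "news": {"n1": "metasearch engine survey", "n2": "weather report today"},
    "papers": {"p1": "database selection for metasearch",
               "p2": "result merging metasearch engine"},
    "recipes": {"r1": "chocolate cake recipe"},
}
engines = {name: make_engine(docs) for name, docs in corpora.items()}
summaries = {name: {w for text in docs.values() for w in text.lower().split()}
             for name, docs in corpora.items()}

results = metasearch("metasearch engine", engines, summaries)
```

In this sketch the `recipes` engine is never invoked for the query, because its summary shares no terms with it; that is the saving database selection is meant to provide.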