ABSTRACT
Entity synonyms are critical for many applications, such as information retrieval and named entity recognition in documents. The current trend is to discover entity synonyms automatically using statistical techniques on web data. Prior techniques suffer from several limitations, such as click log sparsity and an inability to distinguish between entities of different concept classes. In this paper, we propose a general framework for robustly discovering entity synonyms, with two novel similarity functions that overcome the limitations of prior techniques. We develop efficient and scalable techniques leveraging the MapReduce framework to discover synonyms at large scale. To handle long entity names with extraneous tokens, we propose techniques to effectively map long entity names to short queries in the query log. Our experiments on real data from different entity domains demonstrate the superior quality of our synonyms as well as the efficiency of our algorithms. The entity synonyms produced by our system are in production in Bing Shopping and Video search, with experiments showing the significant improvement they bring to the search experience.
Index Terms
- A framework for robust discovery of entity synonyms