Abstract
A wealth of information produced by individuals and organizations is expressed in natural language text. This is a problem since text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools to discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task. In this paper, we address the limitations of state-of-the-art systems for the optimization of information extraction programs, with the objective of producing efficient extraction executions. Our solution relies on exploiting a wide range of optimization opportunities. For efficiency, we consider a wide spectrum of execution plans, including approximate plans whose results differ in their precision and recall. Our optimizer accounts for these characteristics of the competing execution plans, and uses accurate predictors of their extraction time, recall, and precision. We demonstrate the efficiency and effectiveness of our optimizer through a large-scale experimental evaluation over real-world datasets and multiple extraction tasks and approaches.
- E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In ICDE, pages 113-124, 2003.Google Scholar
- A. Baronchelli, E. Caglioti, V. Loreto, and E. Pizzi. Dictionary based methods for information extraction. Physica A: Statistical Mechanics and its Applications, 342:294-300, 2004.Google Scholar
- S. Brin. Extracting patterns and relations from the world wide web. In WebDB, pages 172-183, 1998. Google Scholar
- A. Ekbal and S. Bandyopadhyay. A Hidden Markov Model based named entity recognition system: Bengali and Hindi as case studies. Lecture Notes in Computer Science, 4815:545-552, 2007. Google Scholar
- Y. Fang and K. C.-C. Chang. Searching patterns for relation extraction over the web: Rediscovering the pattern-relation duality. In WSDM, pages 825-834, 2011. Google Scholar
- C. Giuliano, A. Lavelli, and L. Romano. Exploiting shallow linguistic information for relation extraction from biomedical literature. In EACL, pages 3-7, 2006.Google Scholar
- R. Grishman, S. Huttunen, and R. Yangarber. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35:236-246, 2002. Google Scholar
- P. G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano. To search or to crawl?: Towards a query optimizer for text-centric tasks. In SIGMOD, pages 265-276, 2006. Google Scholar
- A. Jain, A. Doan, and L. Gravano. Optimizing SQL queries over text databases. In ICDE, pages 636-645, 2008. Google Scholar
- A. Jain and P. Ipeirotis. A quality-aware optimizer for information extraction. ACM Transactions on Database Systems, 34:5:1-5:48, 2009. Google Scholar
- D. Knuth, J. Morris, Jr., and V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6:323-350, 1977.Google Scholar
- R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, and H. Zhu. SystemT: A system for declarative information extraction. In SIGMOD, pages 7-13, 2009. Google Scholar
- B. Lowerre. The Harpy speech understanding system. In Readings in speech recognition. Morgan Kaufmann Publishers Inc., 1990. Google Scholar
- A. McCallum. Information Extraction: Distilling structured data from unstructured text. ACM Queue, 3:48-57, 2005. Google Scholar
- A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In CONLL, pages 188-191, 2003. Google Scholar
- F. Reiss, S. Raghavan, R. Krishnamurthy, H. Zhu, and S. Vaithyanathan. An algebraic approach to rule-based information extraction. In ICDE, pages 933-942, 2008. Google Scholar
- J. A. Rice. Mathematical statistics and data analysis. Duxbury Press, 2nd edition, 1994.Google Scholar
- E. Sandhaus. The New York Times Annotated Corpus. In Linguistic Data Consortium, 2008.Google Scholar
- W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033-1044, 2007. Google Scholar
- G. Simões, H. Galhardas, and L. Gravano. When speed has a price: Fast information extraction using approximate algorithms. Technical Report 3/2013, INESC-ID, June 2013. URL: http://www.inesc-id.pt/ficheiros/publicacoes/8982.pdf.Google Scholar
- H. W. Sorenson. Parameter estimation: Principles and problems. M. Dekker, 1980.Google Scholar
- A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13:260-269, 1967. Google Scholar
- C. Whitelaw, A. Kehlenbeck, N. Petrovic, and L. Ungar. Web-scale named entity recognition. In CIKM, pages 123-132, 2008. Google Scholar
Index Terms
- When speed has a price: fast information extraction using approximate algorithms
Recommendations
A speed-up and speed-down strategy for swarm optimization
GECCO Comp '14: Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary ComputationIn this paper, inspired by speed-up and speed-down (SUSD) mechanism observed by the fish swarm avoiding light, an SUSD strategy is proposed to develop new swarm intelligence based optimization algorithms to enhance the accuracy and efficiency of swarm ...
High speed ant colony optimization CMOS chip
Ant colony optimization (ACO) is an optimization computation inspired by the study of the ant colonies' behavior. This paper presents design and CMOS implementation of the ant colony optimization based algorithm for solving the TSP problem. In order to ...
Optimal posted-price mechanism in microtask crowdsourcing
IJCAI'17: Proceedings of the 26th International Joint Conference on Artificial IntelligencePosted-price mechanisms are widely-adopted to decide the price of tasks in popular microtask crowdsourcing. In this paper, we propose a novel postedprice mechanism which not only outperforms existing mechanisms on performance but also avoids their need ...
Comments