article

When speed has a price: fast information extraction using approximate algorithms

Authors:
Gonçalo Simões

INESC-ID and Instituto Superior Técnico, Portugal

INESC-ID and Instituto Superior Técnico, Portugal
View Profile

,
Helena Galhardas

INESC-ID and Instituto Superior Técnico, Portugal

INESC-ID and Instituto Superior Técnico, Portugal
View Profile

,
Luis Gravano

Columbia University, New York

Columbia University, New York
View Profile

Proceedings of the VLDB Endowment Volume 6 Issue 13pp 1462–1473https://doi.org/10.14778/2536258.2536259

Published:01 August 2013Publication History

Proceedings of the VLDB Endowment

Abstract

A wealth of information produced by individuals and organizations is expressed in natural language text. This is a problem since text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools to discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task. In this paper, we address the limitations of state-of-the-art systems for the optimization of information extraction programs, with the objective of producing efficient extraction executions. Our solution relies on exploiting a wide range of optimization opportunities. For efficiency, we consider a wide spectrum of execution plans, including approximate plans whose results differ in their precision and recall. Our optimizer accounts for these characteristics of the competing execution plans, and uses accurate predictors of their extraction time, recall, and precision. We demonstrate the efficiency and effectiveness of our optimizer through a large-scale experimental evaluation over real-world datasets and multiple extraction tasks and approaches.

References

E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In ICDE, pages 113-124, 2003.Google Scholar
A. Baronchelli, E. Caglioti, V. Loreto, and E. Pizzi. Dictionary based methods for information extraction. Physica A: Statistical Mechanics and its Applications, 342:294-300, 2004.Google Scholar
S. Brin. Extracting patterns and relations from the world wide web. In WebDB, pages 172-183, 1998. Google Scholar
A. Ekbal and S. Bandyopadhyay. A Hidden Markov Model based named entity recognition system: Bengali and Hindi as case studies. Lecture Notes in Computer Science, 4815:545-552, 2007. Google Scholar
Y. Fang and K. C.-C. Chang. Searching patterns for relation extraction over the web: Rediscovering the pattern-relation duality. In WSDM, pages 825-834, 2011. Google Scholar
C. Giuliano, A. Lavelli, and L. Romano. Exploiting shallow linguistic information for relation extraction from biomedical literature. In EACL, pages 3-7, 2006.Google Scholar
R. Grishman, S. Huttunen, and R. Yangarber. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35:236-246, 2002. Google Scholar
P. G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano. To search or to crawl?: Towards a query optimizer for text-centric tasks. In SIGMOD, pages 265-276, 2006. Google Scholar
A. Jain, A. Doan, and L. Gravano. Optimizing SQL queries over text databases. In ICDE, pages 636-645, 2008. Google Scholar
A. Jain and P. Ipeirotis. A quality-aware optimizer for information extraction. ACM Transactions on Database Systems, 34:5:1-5:48, 2009. Google Scholar
D. Knuth, J. Morris, Jr., and V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6:323-350, 1977.Google Scholar
R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, and H. Zhu. SystemT: A system for declarative information extraction. In SIGMOD, pages 7-13, 2009. Google Scholar
B. Lowerre. The Harpy speech understanding system. In Readings in speech recognition. Morgan Kaufmann Publishers Inc., 1990. Google Scholar
A. McCallum. Information Extraction: Distilling structured data from unstructured text. ACM Queue, 3:48-57, 2005. Google Scholar
A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In CONLL, pages 188-191, 2003. Google Scholar
F. Reiss, S. Raghavan, R. Krishnamurthy, H. Zhu, and S. Vaithyanathan. An algebraic approach to rule-based information extraction. In ICDE, pages 933-942, 2008. Google Scholar
J. A. Rice. Mathematical statistics and data analysis. Duxbury Press, 2nd edition, 1994.Google Scholar
E. Sandhaus. The New York Times Annotated Corpus. In Linguistic Data Consortium, 2008.Google Scholar
W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033-1044, 2007. Google Scholar
G. Simões, H. Galhardas, and L. Gravano. When speed has a price: Fast information extraction using approximate algorithms. Technical Report 3/2013, INESC-ID, June 2013. URL: http://www.inesc-id.pt/ficheiros/publicacoes/8982.pdf.Google Scholar
H. W. Sorenson. Parameter estimation: Principles and problems. M. Dekker, 1980.Google Scholar
A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13:260-269, 1967. Google Scholar
C. Whitelaw, A. Kehlenbeck, N. Petrovic, and L. Ungar. Web-scale named entity recognition. In CIKM, pages 123-132, 2008. Google Scholar

Index Terms

When speed has a price: fast information extraction using approximate algorithms
1. Information systems
  1. Data management systems
    1. Database management system engines

Index terms have been assigned to the content through auto-classification.

Recommendations

A speed-up and speed-down strategy for swarm optimization
GECCO Comp '14: Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation

In this paper, inspired by speed-up and speed-down (SUSD) mechanism observed by the fish swarm avoiding light, an SUSD strategy is proposed to develop new swarm intelligence based optimization algorithms to enhance the accuracy and efficiency of swarm ...
Read More
High speed ant colony optimization CMOS chip

Ant colony optimization (ACO) is an optimization computation inspired by the study of the ant colonies' behavior. This paper presents design and CMOS implementation of the ant colony optimization based algorithm for solving the TSP problem. In order to ...
Read More
Optimal posted-price mechanism in microtask crowdsourcing
IJCAI'17: Proceedings of the 26th International Joint Conference on Artificial Intelligence

Posted-price mechanisms are widely-adopted to decide the price of tasks in popular microtask crowdsourcing. In this paper, we propose a novel postedprice mechanism which not only outperforms existing mechanisms on performance but also avoids their need ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 6, Issue 13
August 2013
180 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2013
Published in pvldb Volume 6, Issue 13
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 3
  Total Citations
  View Citations
- 83
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

When speed has a price: fast information extraction using approximate algorithms

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

A speed-up and speed-down strategy for swarm optimization

High speed ant colony optimization CMOS chip

Optimal posted-price mechanism in microtask crowdsourcing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

When speed has a price: fast information extraction using approximate algorithms

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

A speed-up and speed-down strategy for swarm optimization

High speed ant colony optimization CMOS chip

Optimal posted-price mechanism in microtask crowdsourcing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media