skip to main content
article

When speed has a price: fast information extraction using approximate algorithms

Published:01 August 2013Publication History
Skip Abstract Section

Abstract

A wealth of information produced by individuals and organizations is expressed in natural language text. This is a problem since text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools to discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task. In this paper, we address the limitations of state-of-the-art systems for the optimization of information extraction programs, with the objective of producing efficient extraction executions. Our solution relies on exploiting a wide range of optimization opportunities. For efficiency, we consider a wide spectrum of execution plans, including approximate plans whose results differ in their precision and recall. Our optimizer accounts for these characteristics of the competing execution plans, and uses accurate predictors of their extraction time, recall, and precision. We demonstrate the efficiency and effectiveness of our optimizer through a large-scale experimental evaluation over real-world datasets and multiple extraction tasks and approaches.

References

  1. E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In ICDE, pages 113-124, 2003.Google ScholarGoogle Scholar
  2. A. Baronchelli, E. Caglioti, V. Loreto, and E. Pizzi. Dictionary based methods for information extraction. Physica A: Statistical Mechanics and its Applications, 342:294-300, 2004.Google ScholarGoogle Scholar
  3. S. Brin. Extracting patterns and relations from the world wide web. In WebDB, pages 172-183, 1998. Google ScholarGoogle Scholar
  4. A. Ekbal and S. Bandyopadhyay. A Hidden Markov Model based named entity recognition system: Bengali and Hindi as case studies. Lecture Notes in Computer Science, 4815:545-552, 2007. Google ScholarGoogle Scholar
  5. Y. Fang and K. C.-C. Chang. Searching patterns for relation extraction over the web: Rediscovering the pattern-relation duality. In WSDM, pages 825-834, 2011. Google ScholarGoogle Scholar
  6. C. Giuliano, A. Lavelli, and L. Romano. Exploiting shallow linguistic information for relation extraction from biomedical literature. In EACL, pages 3-7, 2006.Google ScholarGoogle Scholar
  7. R. Grishman, S. Huttunen, and R. Yangarber. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35:236-246, 2002. Google ScholarGoogle Scholar
  8. P. G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano. To search or to crawl?: Towards a query optimizer for text-centric tasks. In SIGMOD, pages 265-276, 2006. Google ScholarGoogle Scholar
  9. A. Jain, A. Doan, and L. Gravano. Optimizing SQL queries over text databases. In ICDE, pages 636-645, 2008. Google ScholarGoogle Scholar
  10. A. Jain and P. Ipeirotis. A quality-aware optimizer for information extraction. ACM Transactions on Database Systems, 34:5:1-5:48, 2009. Google ScholarGoogle Scholar
  11. D. Knuth, J. Morris, Jr., and V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6:323-350, 1977.Google ScholarGoogle Scholar
  12. R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, and H. Zhu. SystemT: A system for declarative information extraction. In SIGMOD, pages 7-13, 2009. Google ScholarGoogle Scholar
  13. B. Lowerre. The Harpy speech understanding system. In Readings in speech recognition. Morgan Kaufmann Publishers Inc., 1990. Google ScholarGoogle Scholar
  14. A. McCallum. Information Extraction: Distilling structured data from unstructured text. ACM Queue, 3:48-57, 2005. Google ScholarGoogle Scholar
  15. A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In CONLL, pages 188-191, 2003. Google ScholarGoogle Scholar
  16. F. Reiss, S. Raghavan, R. Krishnamurthy, H. Zhu, and S. Vaithyanathan. An algebraic approach to rule-based information extraction. In ICDE, pages 933-942, 2008. Google ScholarGoogle Scholar
  17. J. A. Rice. Mathematical statistics and data analysis. Duxbury Press, 2nd edition, 1994.Google ScholarGoogle Scholar
  18. E. Sandhaus. The New York Times Annotated Corpus. In Linguistic Data Consortium, 2008.Google ScholarGoogle Scholar
  19. W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033-1044, 2007. Google ScholarGoogle Scholar
  20. G. Simões, H. Galhardas, and L. Gravano. When speed has a price: Fast information extraction using approximate algorithms. Technical Report 3/2013, INESC-ID, June 2013. URL: http://www.inesc-id.pt/ficheiros/publicacoes/8982.pdf.Google ScholarGoogle Scholar
  21. H. W. Sorenson. Parameter estimation: Principles and problems. M. Dekker, 1980.Google ScholarGoogle Scholar
  22. A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13:260-269, 1967. Google ScholarGoogle Scholar
  23. C. Whitelaw, A. Kehlenbeck, N. Petrovic, and L. Ungar. Web-scale named entity recognition. In CIKM, pages 123-132, 2008. Google ScholarGoogle Scholar

Index Terms

  1. When speed has a price: fast information extraction using approximate algorithms
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 6, Issue 13
      August 2013
      180 pages

      Publisher

      VLDB Endowment

      Publication History

      • Published: 1 August 2013
      Published in pvldb Volume 6, Issue 13

      Qualifiers

      • article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader