research-article

Information extraction challenges in managing unstructured data

Published: 20 March 2009

Abstract

Over the past few years, we have been trying to build an end-to-end system at Wisconsin to manage unstructured data, using extraction, integration, and user interaction. This paper describes the key information extraction (IE) challenges that we have run into, and sketches our solutions. We discuss in particular developing a declarative IE language, optimizing for this language, generating IE provenance, incorporating user feedback into the IE process, developing a novel wiki-based user interface for feedback, best-effort IE, pushing IE into RDBMSs, and more. Our work suggests that IE in managing unstructured data can open up many interesting research challenges, and that these challenges can greatly benefit from the wealth of work on managing structured data that has been carried out by the database community.


Published in

ACM SIGMOD Record, Volume 37, Issue 4
December 2008
116 pages
ISSN: 0163-5808
DOI: 10.1145/1519103

Copyright © 2009 Authors

Publisher

Association for Computing Machinery
New York, NY, United States
