skip to main content
10.1145/1055558.1055560acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
Article

The Lixto data extraction project: back and forth between theory and practice

Published:14 June 2004Publication History

ABSTRACT

We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. Then we present theoretical results on monadic datalog over trees and on Elog, its close relative which is used as the internal wrapper language in the Lixto system. These results include both a characterization of the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was inseminated by our theoretical study of logic-based languages for wrapping. Then we return to the practice of wrapping and the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages. This is a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.

References

  1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. Baumgartner, S. Eichholz, S. Flesca, G. Gottlob, and M. Herzog. Semantic Markup of News Items with Lixto, 2003.]]Google ScholarGoogle Scholar
  3. R. Baumgartner, S. Flesca, and G. Gottlob. "Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto". In Proc. LPNMR'01, Vienna, Austria, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Baumgartner, S. Flesca, and G. Gottlob. "Visual Web Information Extraction with Lixto". In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB'01), 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. Baumgartner, S. Flesca, G. Gottlob, and M. Herzog. "Building Dynamic Information Portals - A Case Study in the Agrarian Domain". In Proc. IS, 2002.]]Google ScholarGoogle Scholar
  6. R. Baumgartner, M. Herzog, and G. Gottlob. "Visual Programming of Web Data Aggregation Applications". In Proc. IIWeb-03, 2003.]]Google ScholarGoogle Scholar
  7. S. Cosmadakis, H. Gaifman, P. Kanellakis, and M. Vardi. "Decidable Optimization Problems for Database Logic Programs". In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 477--490, Chicago, Illinois, USA, 1988.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. B. Courcelle. "Graph Rewriting: An Algebraic and Logic Approach". In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume 2, chapter 5, pages 193--242. Elsevier Science Publishers B. V., 1990.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. E. Dantsin, T. Eiter, G. Gottlob, and A. Voronkov. "Complexity and Expressive Power of Logic Programming". ACM Computing Surveys,33(3):374--425, Sept. 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Doner. "Tree Acceptors and some of their Applications". Journal of Computer and System Sciences,4:406--451, 1970.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Flum, M. Frick, and M. Grohe. "Query Evaluation via Tree-Decompositions". In Proc. ICDT'01, volume 1973 of LNCS, pages 22--38. Springer, Jan. 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Frick, M. Grohe, and C. Koch. "Query Evaluation on Compressed Trees". In Proc. LICS'03, Ottawa, Canada, June 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. E. Gold. "Language Identification in the Limit". Inform. Control, 10:447--474, 1967.]]Google ScholarGoogle ScholarCross RefCross Ref
  14. G. Gottlob and C. Koch. "Monadic Datalog and the Expressive Power of Web Information Extraction Languages". Journal of the ACM,51(1):74--113, 2004.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. G. Gottlob, C. Koch, and R. Pichler. "Efficient Algorithms for Processing XPath Queries". In Proc. VLDB 2002, Hong Kong, China, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. G. Gottlob, C. Koch, and R. Pichler. "The Complexity of XPath Query Processing". In Proc. PODS'03, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. Gottlob, C. Koch, and R. Pichler. "XPath Query Evaluation: Improving Time and Space Efficiency". In ICDE'03, Bangalore, India, Mar. 2003.]]Google ScholarGoogle ScholarCross RefCross Ref
  18. G. Gottlob, C. Koch, and K. U. Schulz. Conjunctive Queries over Trees. In Proc. PODS'04, 2004.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. R. Greenlaw, H. J. Hoover, and W. L. Ruzzo. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Herzog and G. Gottlob: "InfoPipes: A Flexible Framework for M-Commerce Applications". In Proc. TES, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. C. Koch. "Efficient Processing of Expressive Node-Selecting Queries on XML Data in Secondary Storage: A Tree Automata-based Approach". In Proc. VLDB 2003, pages 249--260, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. Kosala, H. Blockeel, M. Bruynooghe, and J. V. den Bussche. "Information Extraction from Web Documents based on Local Unranked Tree Automaton Inference". In Proc. IJCAI, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. N. Kushmerick, D. Weld, and R. Doorenbos. "Wrapper Induction for Information Extraction". In Proc. IJCAI, 1997.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. H. F. Laender, B. Ribeiro-Neto, and A. S. da Silva. "DEByE -- Data Extraction By Example". Data and Knowledge Engineering,40(2):121--154, Feb. 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. L. Liu, C. Pu, and W. Han. "XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources". In Proc. ICDE 2000, pages 611--621, San Diego, USA, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. http://www.lixto.com.]]Google ScholarGoogle Scholar
  27. B. Ludäscher, R. Himmeröder, G. Lausen, W. May, and C. Schlepphorst. "Managing Semistructured Data with Florid: A Deductive Object-oriented Perspective". Information Systems,23(8):1--25, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. H. Meuss, K. U. Schulz, and F. Bry. "Towards Aggregated Answers for Semistructured Data". In Proc. ICDT'01, pages 346--360, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. Minoux. "LTUR: A Simplified Linear-Time Unit Resolution Algorithm for Horn Formulae and Computer Implementation". Information Processing Letters,29(1):1--12, 1988.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Mostrare project. www.grappa.univ-lille3.fr/mostrare/.]]Google ScholarGoogle Scholar
  31. I. Muslea, S. Minton, and C. Knoblock. "A Hierarchical Approach to Wrapper Induction". In Proc. 3rd Intern. Conf. on Autonomous Agents, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. F. Neven and T. Schwentick. "Query Automata on Finite Trees". Theoretical Computer Science,275:633--674, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. F. Neven and J. van den Bussche. "Expressiveness of Structured Document Query Languages Based on Attribute Grammars". Journal of the ACM, 49(1):56--100, Jan. 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. "A Query Translation Scheme for Rapid Implementation of Wrappers". In DOOD'95, pages 161--186, Singapore, 1995. Springer.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. A. Sahuguet and F. Azavant. "Building Intelligent Web Applications Using Lightweight Wrappers". Data and Knowledge Engineering,36(3):283--316, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. H. Seidl, T. Schwentick, and A. Muscholl. "Numerical Document Queries". In Proc. PODS'03, pages 155--166, San Diego, California, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. J. Thatcher and J. Wright. "Generalized Finite Automata Theory with an Application to a Decision Problem of Second-order Logic". Mathematical Systems Theory,2(1):57--81, 1968.]]Google ScholarGoogle ScholarCross RefCross Ref
  38. W. Thomas. "Languages, Automata, and Logic". In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, volume 3, chapter 7, pages 389--455. Springer Verlag, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. World Wide Web Consortium. XML Path Language (XPath) Recommendation. http://www.w3c.org/TR/xpath/, Nov. 1999.]]Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Conferences
    PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
    June 2004
    350 pages
    ISBN:158113858X
    DOI:10.1145/1055558

    Copyright © 2004 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 14 June 2004

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • Article

    Acceptance Rates

    Overall Acceptance Rate642of2,707submissions,24%

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader