ABSTRACT
We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. We then present theoretical results on monadic datalog over trees and on Elog, its close relative, which is used as the internal wrapper language in the Lixto system. These results characterize both the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was inspired by our theoretical study of logic-based languages for wrapping. We then return to the practice of wrapping and the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages, a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.