Abstract
Tabular data is an abundant source of information on the Web, but it remains mostly isolated from the Web's interconnections because tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables (attribute names, values, data types, etc.) are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines and is thus not reachable through user search queries. We address this lack of structure with a new method that leverages the principles of table construction to extract table schemas. The method discovers the schema by which a table is constructed by harnessing the similarities and differences of nearby table rows through a novel set of features and a feature processing scheme. The schemas are determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides a considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, and can therefore handle general tables such as those found in spreadsheets, instead of being restricted to HTML tables as the WebTables method is. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
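The abstract names logarithmic binning as a feature encoding but does not define it. The following is a minimal sketch of one plausible reading, assuming that a non-negative per-row count (for example, the number of numeric cells) is mapped to a bin whose width doubles at each step, so that small counts stay distinguishable while large counts are grouped together. The function names `log_bin`, `is_number`, and `row_features`, the choice of base 2, and the three row features are all illustrative assumptions, not the paper's actual feature set.

```python
def log_bin(count: int) -> int:
    """Map a non-negative count to a logarithmic bin index.

    Assumed encoding (not taken from the paper): 0 -> 0, 1 -> 1,
    2..3 -> 2, 4..7 -> 3, 8..15 -> 4, and so on.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # For count > 0, bit_length() equals floor(log2(count)) + 1; it is 0 for 0.
    return count.bit_length()


def is_number(cell: str) -> bool:
    """Return True if the cell parses as a number (thousands commas allowed)."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def row_features(row: list[str]) -> dict[str, str]:
    """Build hypothetical binned features for one table row."""
    numeric = sum(1 for cell in row if is_number(cell))
    empty = sum(1 for cell in row if not cell.strip())
    return {
        "numeric_bin": str(log_bin(numeric)),
        "empty_bin": str(log_bin(empty)),
        "width_bin": str(log_bin(len(row))),
    }


if __name__ == "__main__":
    header = ["Year", "Sales", ""]
    data = ["2012", "1,234", "5"]
    print(row_features(header))
    print(row_features(data))
```

Under these assumptions, each row yields a small set of categorical features that a sequence classifier such as a conditional random field could consume, one feature dictionary per row in table order. Comparing the bins of adjacent rows is one way the similarities and differences of nearby rows might be exposed to such a model.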
Index Terms
- Schema extraction for tabular data on the web