research-article

Dexter: large-scale discovery and extraction of product specifications on the web

Published: 01 September 2015

Abstract

The web is a rich resource of structured data. There has been increasing interest in using structured web data for applications such as data integration, web search, and question answering. In this paper, we present Dexter, a system that finds product sites on the web and detects and extracts product specifications from them. Since product specifications exist in multiple product sites, our focused crawler relies on search queries and backlinks to discover product sites. To perform the detection and handle the high diversity of specifications in terms of content, size, and format, our system uses supervised learning to classify HTML fragments (e.g., tables and lists) present in web pages as specifications or not. To perform large-scale extraction of the attribute-value pairs from the HTML fragments identified by the specification detector, Dexter adopts two lightweight strategies: a domain-independent and unsupervised wrapper method, which relies on the observation that these HTML fragments have very similar structure; and a combination of this strategy with a previous approach, which infers extraction patterns from annotations generated by automatic but noisy annotators. The results show that: (1) our crawler strategy to locate product specification pages is effective, discovering 1.46M product specification pages from 3,005 sites and 9 different categories; (2) the specification detector obtains high values of F-measure (close to 0.9) over a heterogeneous set of product specifications; and (3) our efficient wrapper methods for attribute-value extraction achieve very high precision (0.92) and recall (0.95) and obtain better results than a state-of-the-art, supervised rule-based wrapper.
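To make the notion of attribute-value extraction concrete, here is a minimal sketch in Python. It is not Dexter's wrapper method: it only illustrates the general idea that specification fragments tend to share a regular structure, under the simplifying assumption that each specification row is a table row with exactly two non-empty cells. The names `SpecTableParser` and `extract_pairs` are hypothetical and introduced only for this illustration.

```python
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collect (attribute, value) pairs from two-cell table rows.

    Illustrative assumption: a specification row is a <tr> holding
    exactly two non-empty cells (<th> or <td>).
    """

    def __init__(self):
        super().__init__()
        self.pairs = []    # extracted (attribute, value) tuples
        self._row = None   # cell texts of the row being parsed
        self._cell = None  # text chunks of the cell being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Keep only rows that look like attribute-value pairs.
            if len(self._row) == 2 and all(self._row):
                self.pairs.append((self._row[0], self._row[1]))
            self._row = None
            self._cell = None


def extract_pairs(html_fragment):
    """Return the attribute-value pairs found in an HTML fragment."""
    parser = SpecTableParser()
    parser.feed(html_fragment)
    parser.close()
    return parser.pairs
```

Calling `extract_pairs` on a fragment returns a list of `(attribute, value)` tuples; rows with a single spanning cell, such as section headers, are skipped. A real system would also need to handle lists, nested tables, and noisy markup, which this sketch does not attempt.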



  • Published in

    Proceedings of the VLDB Endowment  Volume 8, Issue 13
    Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
    September 2015
    144 pages

    Publisher

    VLDB Endowment

