research-article

Dexter: large-scale discovery and extraction of product specifications on the web

Authors:
Disheng Qiu, Università degli Studi Roma Tre
Luciano Barbosa, IBM Research Brazil
Xin Luna Dong, Google Inc.
Yanyan Shen, National University of Singapore
Divesh Srivastava, AT&T Labs - Research

Proceedings of the VLDB Endowment, Volume 8, Issue 13, pp. 2194–2205. https://doi.org/10.14778/2831360.2831372

Published: 01 September 2015

Abstract

The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In this paper, we present Dexter, a system to find product sites on the web, and detect and extract product specifications from them. Since product specifications exist in multiple product sites, our focused crawler relies on search queries and backlinks to discover product sites. To perform the detection, and handle the high diversity of specifications in terms of content, size and format, our system uses supervised learning to classify HTML fragments (e.g., tables and lists) present in web pages as specifications or not. To perform large-scale extraction of the attribute-value pairs from the HTML fragments identified by the specification detector, Dexter adopts two lightweight strategies: a domain-independent and unsupervised wrapper method, which relies on the observation that these HTML fragments have very similar structure; and a combination of this strategy with a previous approach, which infers extraction patterns from annotations generated by automatic but noisy annotators. The results show that our crawler strategy to locate product specification pages is effective: (1) it discovered 1.46M product specification pages from 3,005 sites and 9 different categories; (2) the specification detector obtains high values of F-measure (close to 0.9) over a heterogeneous set of product specifications; and (3) our efficient wrapper methods for attribute-value extraction get very high values of precision (0.92) and recall (0.95) and obtain better results than a state-of-the-art, supervised rule-based wrapper.
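
As a rough illustration of what "attribute-value pairs from HTML fragments" means in practice, the sketch below pulls pairs out of a two-column specification table using only Python's standard library. It is not Dexter's wrapper: the rule here is hand-written for one table shape, whereas the abstract describes wrappers that are inferred without supervision. The names `SpecTableParser`, `extract_pairs` and `FRAGMENT`, and the sample markup itself, are hypothetical.

```python
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collect (attribute, value) pairs from two-cell table rows.

    Hypothetical, hand-written rule for illustration only; it is not the
    unsupervised wrapper inference described in the abstract.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []      # extracted (attribute, value) tuples
        self._cells = None   # cell texts of the row being read, or None
        self._text = None    # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        # Only text inside an open cell is collected.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            # Join the chunks and normalize internal whitespace.
            self._cells.append(" ".join("".join(self._text).split()))
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Keep only rows shaped like "name | value" with both cells
            # non-empty; header and spacer rows are skipped.
            if len(self._cells) == 2 and all(self._cells):
                self.pairs.append((self._cells[0].rstrip(":"), self._cells[1]))
            self._cells = None


def extract_pairs(fragment):
    """Return the attribute-value pairs found in an HTML fragment."""
    parser = SpecTableParser()
    parser.feed(fragment)
    parser.close()
    return parser.pairs


# Made-up fragment standing in for a detected specification table.
FRAGMENT = """
<table>
  <tr><th colspan="2">Specifications</th></tr>
  <tr><td>Brand:</td><td>Acme</td></tr>
  <tr><td>Screen size</td><td>15.6 <b>inches</b></td></tr>
  <tr><td>Weight</td><td></td></tr>
</table>
"""

print(extract_pairs(FRAGMENT))
# → [('Brand', 'Acme'), ('Screen size', '15.6 inches')]
```

The single-cell header row and the row with an empty value are dropped, which hints at why a fixed rule like this is brittle across thousands of sites and why the paper turns to inferred wrappers instead.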

Published in

Proceedings of the VLDB Endowment Volume 8, Issue 13
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
September 2015
144 pages
ISSN:2150-8097
Publisher
VLDB Endowment