Abstract
The web is a rich source of structured data, and there is increasing interest in using it for applications such as data integration, web search, and question answering. In this paper, we present Dexter, a system that finds product sites on the web and then detects and extracts product specifications from them. Since product specifications exist in multiple product sites, our focused crawler relies on search queries and backlinks to discover product sites. To perform the detection and handle the high diversity of specifications in terms of content, size, and format, our system uses supervised learning to classify HTML fragments (e.g., tables and lists) present in web pages as specifications or not. To perform large-scale extraction of attribute-value pairs from the HTML fragments identified by the specification detector, Dexter adopts two lightweight strategies: a domain-independent, unsupervised wrapper method, which relies on the observation that these HTML fragments have very similar structure; and a combination of this strategy with a previous approach, which infers extraction patterns from annotations generated by automatic but noisy annotators. The results show that our approach is effective: (1) the crawler discovered 1.46M product specification pages from 3,005 sites and 9 different categories; (2) the specification detector obtains high F-measure values (close to 0.9) over a heterogeneous set of product specifications; and (3) our efficient wrapper methods for attribute-value extraction achieve high precision (0.92) and recall (0.95) and outperform a state-of-the-art, supervised rule-based wrapper.
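To make the notion of attribute-value extraction from an HTML fragment concrete, the following is a minimal sketch and not Dexter's wrapper method. It assumes the simplest layout, a table whose specification rows each hold exactly two cells (attribute name, then value), and skips every other row. The names `SpecTableParser` and `extract_pairs` and the sample fragment are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text chunks of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def extract_pairs(html_fragment):
    """Return (attribute, value) tuples for rows with exactly two non-empty cells.

    Assumption: a two-cell row is a specification row. Real pages are far
    more varied, which is the diversity the abstract refers to.
    """
    parser = SpecTableParser()
    parser.feed(html_fragment)
    parser.close()
    return [(r[0], r[1]) for r in parser.rows if len(r) == 2 and r[0] and r[1]]


# Hypothetical fragment, not taken from the paper's dataset.
sample = """
<table>
  <tr><th>Brand</th><td>Acme</td></tr>
  <tr><th>Weight</th><td>1.2 kg</td></tr>
  <tr><th>Screen size</th><td>13.3 in</td></tr>
  <tr><td colspan="2">Prices may vary</td></tr>
</table>
"""

for attribute, value in extract_pairs(sample):
    print(f"{attribute}: {value}")
```

The fixed two-cell rule is the part a learned or structure-driven wrapper would replace; it is kept here only so the sketch stays self-contained and uses nothing beyond the Python standard library.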