2016 | OriginalPaper | Chapter

Data Driven Discovery of Attribute Dictionaries

Authors : Fei Chiang, Periklis Andritsos, Renée J. Miller

Published in: Transactions on Computational Collective Intelligence XXI

Publisher: Springer Berlin Heidelberg

Abstract

Online product search engines, such as Google and Yahoo Shopping, rely on extensive and complete product information to return accurate and timely search results. Given the expanding scope of products and updates to existing products, automated techniques are needed to keep the underlying product dictionaries current and complete. Product search engines receive offers from merchants describing product-specific attributes and characteristics. These offers normally contain structured attribute-value pairs and unstructured (textual) descriptions of product characteristics and features. For example, a laptop offer may contain attribute-value pairs such as “model-X42” and “RAM-8 GB”, and a text description of the software, accessories, battery features, warranty, etc. Updating the product dictionaries from the textual descriptions is more challenging than updating them from the attribute-value pairs, since the relevant attribute values must first be extracted. The task is difficult because the text descriptions often do not follow a predefined format, and their content varies across merchants and products. However, this information must be captured to ensure a comprehensive and complete product listing. In this paper, we present techniques that extract attribute values from textual product descriptions. We introduce an end-to-end framework that takes a string record as input and parses its tokens to identify candidate attribute values. We then map these values to attributes. We take an information-theoretic approach to identify groups of tokens that represent an attribute value. We demonstrate the accuracy and relevance of our approach using a variety of real data sets.
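The abstract does not say which information-theoretic measure the chapter uses, so the following is only a minimal sketch of the general idea of grouping tokens into candidate attribute values. It assumes pointwise mutual information (PMI) between adjacent tokens as a stand-in measure and a greedy left-to-right merge. The function names `tokenize`, `pmi_table`, and `segment`, and the `threshold` parameter, are hypothetical and are not taken from the chapter.

```python
import math
from collections import Counter


def tokenize(record):
    """Split a text description into lowercase, whitespace-delimited tokens."""
    return record.lower().split()


def pmi_table(records):
    """Estimate PMI (in bits) for every adjacent token pair seen in the records.

    Assumption: PMI is used here only as an illustrative measure; it is not
    claimed to be the measure used in the chapter.
    """
    unigrams, bigrams = Counter(), Counter()
    for rec in records:
        toks = tokenize(rec)
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def segment(record, pmi, threshold):
    """Greedily merge adjacent tokens whose PMI is at least `threshold`.

    Each returned string is a candidate attribute value (a group of tokens).
    Pairs never seen in the corpus are treated as unrelated.
    """
    toks = tokenize(record)
    if not toks:
        return []
    segments = [[toks[0]]]
    for prev, cur in zip(toks, toks[1:]):
        if pmi.get((prev, cur), float("-inf")) >= threshold:
            segments[-1].append(cur)
        else:
            segments.append([cur])
    return [" ".join(seg) for seg in segments]


if __name__ == "__main__":
    # Hypothetical offer descriptions, made up for illustration.
    offers = [
        "laptop with 8 gb ram and extended warranty",
        "tablet with 8 gb storage",
        "laptop with extended warranty",
    ]
    table = pmi_table(offers)
    print(segment("laptop 8 gb ram extended warranty", table, threshold=2.0))
```

Mapping each candidate group to an attribute (for example, deciding that a group is a RAM value) is a separate step that this sketch does not attempt.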

Since a record contains tokens that may match a segment, we can think of records as containing segments.

Title: Data Driven Discovery of Attribute Dictionaries
Authors: Fei Chiang, Periklis Andritsos, Renée J. Miller
Publisher: Springer Berlin Heidelberg
Book: Transactions on Computational Collective Intelligence XXI
Print ISBN: 978-3-662-49520-9

Electronic ISBN: 978-3-662-49521-6

Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-662-49521-6_4
