2016 | OriginalPaper | Chapter

Data Driven Discovery of Attribute Dictionaries

Authors: Fei Chiang, Periklis Andritsos, Renée J. Miller

Published in: Transactions on Computational Collective Intelligence XXI

Publisher: Springer Berlin Heidelberg

Abstract

Online product search engines, such as Google and Yahoo Shopping, rely on extensive and complete product information to return accurate and timely search results. Given the expanding scope of products and updates to existing products, automated techniques are needed to ensure that the underlying product dictionaries remain current and complete. Product search engines receive offers from merchants describing product-specific attributes and characteristics. These offers normally contain structured attribute-value pairs and unstructured (textual) descriptions of product characteristics and features. For example, a laptop offer may contain attribute-value pairs such as “model-X42” and “RAM-8 GB”, and a text description of the software, accessories, battery features, warranty, etc. Updating the product dictionaries from the textual descriptions is more challenging than updating them from the attribute-value pairs, since the relevant attribute values must first be extracted. This task is difficult because the text descriptions often do not follow a predefined format, and the data in the descriptions vary across merchants and products. However, this information needs to be captured to ensure a comprehensive and complete product listing. In this paper, we present techniques that extract attribute values from textual product descriptions. We introduce an end-to-end framework that takes an input string record and parses its tokens to identify candidate attribute values. We then map these values to attributes. We take an information-theoretic approach to identify groups of tokens that represent an attribute value. We demonstrate the accuracy and relevance of our approach using a variety of real data sets.
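To give a feel for what grouping tokens with an information-theoretic criterion can look like, here is a minimal Python sketch. It is not the algorithm from the chapter, which the abstract does not specify. It assumes a simple stand-in criterion: a token is attached to the preceding token when the conditional entropy of "next token given previous token" is low, meaning the previous token almost always continues the same way. The function names, the toy records, and the 0.5-bit threshold are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(records):
    """Return H(next token | token) in bits for every token that has a successor."""
    followers = defaultdict(Counter)
    for tokens in records:
        for cur, nxt in zip(tokens, tokens[1:]):
            followers[cur][nxt] += 1
    entropy = {}
    for tok, counts in followers.items():
        total = sum(counts.values())
        entropy[tok] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def segment(tokens, entropy, threshold=0.5):
    """Split a token list into candidate values (assumed rule, not the paper's).

    A token joins the current group when its predecessor's successor entropy
    is at most `threshold`; otherwise it starts a new group. Tokens never seen
    with a successor are treated as maximally uncertain.
    """
    if not tokens:
        return []
    groups = [[tokens[0]]]
    for prev, cur in zip(tokens, tokens[1:]):
        if entropy.get(prev, float("inf")) <= threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return [" ".join(g) for g in groups]


# Toy offer descriptions (hypothetical data), already lower-cased.
records = [
    "acer laptop 8 gb ram".split(),
    "dell laptop 16 gb ram".split(),
    "acer desktop 8 gb ssd".split(),
    "dell desktop 16 gb ssd".split(),
]
entropy = successor_entropy(records)
print(segment(records[0], entropy))  # ['acer', 'laptop', '8 gb', 'ram']
```

In this toy data "8" and "16" are always followed by "gb", so their successor entropy is zero and "8 gb" is grouped as one candidate value, while "gb" is followed equally often by "ram" and "ssd" and therefore ends the group. Mapping such candidates to attributes, the next step the abstract mentions, is outside this sketch.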

Footnotes
1. Since a record contains tokens that may match a segment, we can think of records as containing segments.
 
Metadata
Title
Data Driven Discovery of Attribute Dictionaries
Authors
Fei Chiang
Periklis Andritsos
Renée J. Miller
Copyright Year
2016
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-662-49521-6_4
