2016 | OriginalPaper | Book chapter

A Hybrid Machine-Crowdsourcing Approach for Web Table Matching and Cleaning

Authors: Chunhua Li, Pengpeng Zhao, Victor S. Sheng, Zhixu Li, Guanfeng Liu, Jian Wu, Zhiming Cui

Published in: Web-Age Information Management

Publisher: Springer International Publishing

Abstract

Table matching and data cleaning are two crucial activities in integrating data from different web tables, and they have traditionally been treated as separate activities. We show that data cleaning can effectively help discover table matches, and vice versa. In this paper, we study a hybrid machine-crowdsourcing approach that handles the two activities together with the help of a well-developed knowledge base. Understanding the semantics of tables is fundamental to both matching and cleaning. We select the most valuable columns for crowdsourcing validation and infer the others by consolidating crowdsourcing results with machine-generated results. When resolving inconsistencies between data and semantics, relative trust is taken into account to validate either the data or the semantics via the crowd. Our experimental results on real-life datasets show the effectiveness of the proposed approach for matching and cleaning web tables.
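
The abstract does not say how the "most valuable" columns are chosen or how crowd and machine results are consolidated. As a purely illustrative sketch, and not the authors' method, the Python below assumes that a column's value is the entropy of its machine-generated concept scores, and that a crowd answer simply overrides the machine's top candidate. All names (`select_columns_for_crowd`, `consolidate`, the column and concept labels) and all scores are hypothetical.

```python
import math


def entropy(scores):
    """Shannon entropy (in bits) of a score distribution, after normalizing it."""
    total = sum(scores.values())
    probs = [s / total for s in scores.values() if s > 0]
    return -sum(p * math.log2(p) for p in probs)


def select_columns_for_crowd(machine_scores, budget):
    """Assumption: the most uncertain columns are the most valuable to ask about."""
    ranked = sorted(
        machine_scores,
        key=lambda col: entropy(machine_scores[col]),
        reverse=True,
    )
    return ranked[:budget]


def consolidate(machine_scores, crowd_answers):
    """Assumption: a crowd answer wins; otherwise keep the top machine candidate."""
    result = {}
    for col, scores in machine_scores.items():
        if col in crowd_answers:
            result[col] = crowd_answers[col]
        else:
            result[col] = max(scores, key=scores.get)
    return result


# Hypothetical machine-generated concept scores for three columns.
machine_scores = {
    "col_a": {"City": 0.5, "Country": 0.5},
    "col_b": {"Person": 0.9, "Film": 0.1},
    "col_c": {"Film": 0.6, "Book": 0.4},
}

to_ask = select_columns_for_crowd(machine_scores, budget=1)
labels = consolidate(machine_scores, {"col_a": "City"})
```

With a budget of one question, the sketch sends `col_a` to the crowd because its two candidate concepts are tied, and fills in the remaining columns from the machine scores.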

Metadata
Title
A Hybrid Machine-Crowdsourcing Approach for Web Table Matching and Cleaning
Authors
Chunhua Li
Pengpeng Zhao
Victor S. Sheng
Zhixu Li
Guanfeng Liu
Jian Wu
Zhiming Cui
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-39958-4_11
