
2016 | OriginalPaper | Chapter

14. Quality of Web Data and Quality of Big Data: Open Problems

Authors: Monica Scannapieco, Laure Berti

Published in: Data and Information Quality

Publisher: Springer International Publishing

Abstract

In this chapter we discuss open issues related to two types of information sources that are particularly significant today: Web data and Big Data.

Metadata

Title: Quality of Web Data and Quality of Big Data: Open Problems
Authors: Monica Scannapieco, Laure Berti
Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-319-24106-7_14
