Published in: Mobile Networks and Applications 3/2020

19.02.2020

A Cautionary Tale for Machine Learning Design: why we Still Need Human-Assisted Big Data Analysis

Authors: Marco Roccetti, Giovanni Delnevo, Luca Casini, Paola Salomoni


Abstract

Supervised Machine Learning (ML) requires that algorithms examine a very large number of labeled samples before they can make correct predictions. Yet even a very large dataset does not always suffice. In our experience, a neural network trained on a database of over fifteen million water meter readings essentially failed to predict, from a history of water consumption measurements, when a meter would malfunction or need disassembly. As a second step, we developed a methodology, based on enforcing a specialized data semantics, that allowed us to extract for training only those samples that were not corrupted by data impurities. With this methodology, we re-trained the neural network to a prediction accuracy of over 80%. At the same time, we realized that the new training dataset was significantly different from the initial one in statistical terms, and much smaller as well. We had reached a sort of paradox: we had alleviated the initial problem with a more interpretable model, but in doing so we had altered the form of the initial data. To reconcile that paradox, we further enhanced our data semantics with the contribution of field experts. This finally led us to derive a training dataset that is truly representative of regular and defective water meters and able to describe the underlying statistical phenomenon, while the resulting classifier still provides excellent prediction accuracy. The lesson we have learnt along this path is that a human-in-the-loop approach may significantly help to clean and reorganize noisy datasets for a more effective ML design.
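
The two steps described in the abstract, keeping only samples that pass semantic checks and then checking how far the cleaned set has moved from the original in statistical terms, can be sketched roughly as follows. This is an illustrative sketch only: the abstract does not state the actual semantic rules, so `MeterHistory`, `is_clean`, the `min_readings` threshold, and the sample values are all hypothetical. The two-sample Kolmogorov-Smirnov statistic is used here as one common way to quantify the difference between two distributions; the authors may have used a different comparison.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MeterHistory:
    """One labeled sample for a single meter (hypothetical schema)."""
    meter_id: str
    consumption: List[float]  # consumption per reading interval
    defective: bool           # label: True if the meter failed


def is_clean(sample: MeterHistory, min_readings: int = 3) -> bool:
    """Assumed semantic rules: enough readings and no negative consumption."""
    values = sample.consumption
    if len(values) < min_readings:
        return False
    return all(v >= 0 for v in values)


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of a and b."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def mean_consumption(sample: MeterHistory) -> float:
    """Per-meter summary used to compare the two datasets."""
    return sum(sample.consumption) / len(sample.consumption)


# Made-up sample data, for illustration only.
samples = [
    MeterHistory("m1", [10.0, 12.0, 11.0], False),
    MeterHistory("m2", [9.0, -4.0, 10.0], False),   # negative value: impurity
    MeterHistory("m3", [30.0, 0.0, 0.0, 0.0], True),
    MeterHistory("m4", [8.0], False),               # too few readings
]

# Step 1: keep only the samples that satisfy the semantic rules.
clean = [s for s in samples if is_clean(s)]

# Step 2: measure how far the cleaned set has drifted from the original.
drift = ks_statistic(
    [mean_consumption(s) for s in samples],
    [mean_consumption(s) for s in clean],
)
print(f"kept {len(clean)} of {len(samples)} samples, KS drift = {drift:.2f}")
```

A large drift value would be a signal of the paradox the abstract describes: the cleaned set may train a better classifier while no longer resembling the data it came from, which is where input from field experts on the filtering rules becomes relevant.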

Metadata
Title
A Cautionary Tale for Machine Learning Design: why we Still Need Human-Assisted Big Data Analysis
Authors
Marco Roccetti
Giovanni Delnevo
Luca Casini
Paola Salomoni
Publication date
19.02.2020
Publisher
Springer US
Published in
Mobile Networks and Applications / Issue 3/2020
Print ISSN: 1383-469X
Electronic ISSN: 1572-8153
DOI
https://doi.org/10.1007/s11036-020-01530-6
