Published in: Data Mining and Knowledge Discovery, Issue 2/2014

01.03.2014

Aggregative quantification for regression

Authors: Antonio Bella, Cèsar Ferri, José Hernández-Orallo, María José Ramírez-Quintana

Abstract

The problem of estimating the class distribution (or prevalence) for a new unlabelled dataset (from a possibly different distribution) is a very common problem which has been addressed in one way or another in the past decades. This problem has been recently reconsidered as a new task in data mining, renamed quantification when the estimation is performed as an aggregation (and possible adjustment) of a single-instance supervised model (e.g., a classifier). However, the study of quantification has been limited to classification, while it is clear that this problem also appears, perhaps even more frequently, with other predictive problems, such as regression. In this case, the goal is to determine a distribution or an aggregated indicator of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between the estimation of the whole distribution and the estimation of some indicators (summary statistics), for both classification and regression. This distinction is especially useful for regression, since predictions are numerical values that can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and see that the approaches borrowed from classification do not work. We present several techniques based on segmentation which are able to produce accurate estimations of the expected value and the distribution of the output variable. We show experimentally that these methods especially excel for the relevant scenarios where training and test distributions dramatically differ.
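The naive aggregative baseline for regression quantification (named Regress & Sum in footnote 1) simply trains a single-instance regressor and aggregates its predictions over the unlabelled test set. The following is a minimal sketch of that baseline under stated assumptions, not the authors' implementation; the regressor choice, the function name regress_and_sum and the toy data are illustrative.

```python
# Minimal sketch of the naive aggregative baseline ("Regress & Sum"):
# train a single-instance regressor, predict every unlabelled test
# instance, and aggregate the predictions. Model choice and names are
# illustrative, not the paper's code.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def regress_and_sum(X_train, y_train, X_test):
    """Estimate the mean of the output variable on an unlabelled test set
    by averaging the predictions of a regressor trained on labelled data."""
    model = DecisionTreeRegressor(min_samples_leaf=5).fit(X_train, y_train)
    y_hat = model.predict(X_test)
    return y_hat.mean(), y_hat        # estimated mean and raw predictions

# Toy usage: a test set whose distribution is shifted with respect to
# training, the scenario where naive aggregation tends to degrade.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 2 * X_train.ravel() + rng.normal(0, 1, 500)
X_test = rng.uniform(6, 10, size=(200, 1))        # shifted covariates
est_mean, _ = regress_and_sum(X_train, y_train, X_test)
print(est_mean)
```

The abstract's point is that such naive aggregation degrades precisely when the training and test distributions differ, which is the scenario the toy data above simulates and which the segmentation-based methods of the paper address.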


Footnotes
1
We use the same acronym for Regress & Splice and Regress & Sum, since both just aggregate the individual values without any further processing.
 
2
The example is elaborated, with some fictional elements, from the cars dataset in the UCI repository (Frank and Asuncion 2010).
 
3
The example is elaborated, with some fictional elements, from the lowbwt dataset in the UCI repository (Frank and Asuncion 2010), originally from Hosmer and Lemeshow (2000).
 
4
In this paper the methodology for indicators and for the distribution is the same (except for some minor specific techniques, mostly at the end of the process), but they could differ, given that some indicators require less information and effort to estimate than the whole distribution.
 
5
Just as a mean-unbiased estimator minimises the squared loss, a median-unbiased estimator is a different choice, which minimises the absolute error.
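A quick numerical check of this statement (illustrative, not taken from the paper): on a skewed sample, the value minimising the mean squared loss coincides with the sample mean, while the value minimising the mean absolute loss coincides with the sample median.

```python
# Numerical illustration of footnote 5: the sample mean minimises the
# squared loss, while the sample median minimises the absolute loss.
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])          # skewed sample
candidates = np.linspace(0, 12, 1201)              # candidate estimates

sq_loss  = [np.mean((y - c) ** 2) for c in candidates]
abs_loss = [np.mean(np.abs(y - c)) for c in candidates]

print(candidates[np.argmin(sq_loss)], y.mean())        # both 3.7
print(candidates[np.argmin(abs_loss)], np.median(y))   # both 2.5
```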
 
6
The idea of segmenting the set of outputs is not new and has led to some classifier calibration techniques, such as binning (Zadrozny and Elkan 2002; Bella et al. 2009b). Calibration techniques are somewhat related to quantification techniques. In fact, \(RS\) would be optimal if the predictive model were perfectly calibrated—for the test set. This is a key point because calibration is always understood relative to a distribution or dataset. Given the quantification problems with distribution shift we are considering here, it is the test set distribution that we want to infer, so calibrating for the training set may be useless.
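For readers unfamiliar with binning calibration, the following is a hedged sketch of the basic idea in Zadrozny and Elkan (2002), not the similarity-binning variant of Bella et al. (2009b): scores are segmented into equal-frequency bins on held-out data, and each new score is mapped to the empirical positive rate of its bin. The function name, bin count and the fallback rate of 0.5 for empty bins are illustrative assumptions.

```python
# Sketch of binning calibration: segment scores into equal-frequency bins
# and map each new score to the empirical positive rate of its bin.
import numpy as np

def binning_calibrate(scores_val, labels_val, scores_new, n_bins=10):
    """Return calibrated probabilities for scores_new using equal-frequency
    bins built from held-out (scores_val, labels_val)."""
    scores_val = np.asarray(scores_val, dtype=float)
    labels_val = np.asarray(labels_val, dtype=float)
    edges = np.quantile(scores_val, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover any new score
    bin_of_val = np.digitize(scores_val, edges[1:-1])
    bin_rate = np.array([labels_val[bin_of_val == b].mean()
                         if np.any(bin_of_val == b) else 0.5
                         for b in range(n_bins)])
    return bin_rate[np.digitize(np.asarray(scores_new, float), edges[1:-1])]
```

As the footnote stresses, the resulting map is calibrated with respect to the data used to build the bins; under distribution shift it need not be calibrated for the test set.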
 
7
Note that this adjustment is performed with information from the training data exclusively. An alternative possibility would be to use a validation dataset, but this would reduce the available training data.
 
8
An alternative, more lightweight approach could be to introduce a normal jitter to each prediction \(\hat{y}\). While this may have a similar effect, its random component may be significant for small datasets. The smoothing approach presented here always leads to the same result, since it has no random components.
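The contrast drawn in this footnote can be made concrete with a small sketch (illustrative, not the paper's code): option A jitters each prediction with normal noise, so repeated runs give different estimates on a small test set, whereas option B evaluates a deterministic, equally weighted mixture of normals centred on the predictions, so the estimated distribution is always the same. The bandwidth h is an assumed parameter.

```python
# Random jitter versus deterministic smoothing of the estimated
# output distribution built from a regressor's predictions.
import numpy as np
from scipy.stats import norm

y_hat = np.array([2.1, 2.4, 3.0, 5.2])   # predictions on a small test set
h = 0.5                                   # assumed smoothing bandwidth

# Option A: random jitter; results vary from run to run on small samples.
rng = np.random.default_rng()
jittered = y_hat + rng.normal(0.0, h, size=y_hat.shape)

# Option B: deterministic smoothing; the estimated CDF is an equally
# weighted mixture of normals centred on the predictions.
def smoothed_cdf(t):
    return norm.cdf(t, loc=y_hat, scale=h).mean()

print(jittered)            # changes on every run
print(smoothed_cdf(3.0))   # identical on every run
```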
 
9
Some alternatives could be explored here, such as the use of one-vs-previous or one-vs-adjacent schemes. This is left as a possibility for future work.
 
References
Alonzo TA, Pepe MS, Lumley T (2003) Estimating disease prevalence in two-phase studies. Biostatistics 4(2):313–326
Anderson T (1962) On the distribution of the two-sample Cramér–von Mises criterion. Ann Math Stat 33(3):1148–1159
Bakar AA, Othman ZA, Shuib NLM (2009) Building a new taxonomy for data discretization techniques. In: Proceedings of the 2nd conference on data mining and optimization (DMO'09), pp 132–140
Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2009a) Calibration of machine learning models. In: Handbook of research on machine learning applications. IGI Global, Hershey
Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2009b) Similarity-binning averaging: a generalisation of binning calibration. In: International conference on intelligent data engineering and automated learning. LNCS, vol 5788. Springer, Berlin, pp 341–349
Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2010) Quantification via probability estimators. In: International conference on data mining, ICDM 2010, pp 737–742
Chan Y, Ng H (2006) Estimating class priors in domain adaptation for word sense disambiguation. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp 89–96
Chawla N, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor Newsl 6(1):1–6
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Prieditis A, Russell S (eds) Proceedings of the twelfth international conference on machine learning. Morgan Kaufmann, San Francisco, pp 194–202
Ferri C, Hernández-Orallo J, Modroiu R (2009) An experimental comparison of performance measures for classification. Pattern Recogn Lett 30(1):27–38
Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press, Cambridge
Forman G (2005) Counting positives accurately despite inaccurate classification. In: Proceedings of the 16th European conference on machine learning (ECML), pp 564–575
Forman G (2006) Quantifying trends accurately despite classifier error and class imbalance. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 157–166
González-Castro V, Alaiz-Rodríguez R, Alegre E (2012) Class distribution estimation based on the Hellinger distance. Inf Sci 218(1):146–164
Hastie TJ, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, Berlin
Hernández-Orallo J, Flach P, Ferri C (2012) A unified view of performance metrics: translating threshold choice into expected classification loss. J Mach Learn Res 13:2813–2869
Hwang JN, Lay SR, Lippman A (1994) Nonparametric multivariate density estimation: a comparative study. IEEE Trans Signal Process 42(10):2795–2810
Hyndman RJ, Bashtannyk DM, Grunwald GK (1996) Estimating and visualizing conditional densities. J Comput Graph Stat 5(4):315–336
Moreno-Torres J, Raeder T, Alaiz-Rodríguez R, Chawla N, Herrera F (2012) A unifying view on dataset shift in classification. Pattern Recogn 45(1):521–530
Neyman J (1938) Contribution to the theory of sampling human populations. J Am Stat Assoc 33(201):101–116
Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in large margin classifiers. MIT Press, Cambridge, pp 61–74
Raeder T, Forman G, Chawla N (2012) Learning from imbalanced data: evaluation matters. Data Min 23:315–331
Sánchez L, González V, Alegre E, Alaiz R (2008) Classification and quantification based on image analysis for sperm samples with uncertain damaged/intact cell proportions. In: Proceedings of the 5th international conference on image analysis and recognition. LNCS, vol 5112. Springer, Heidelberg, pp 827–836
Sturges H (1926) The choice of a class interval. J Am Stat Assoc 21(153):65–66
R Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Tenenbein A (1970) A double sampling scheme for estimating from binomial data with misclassifications. J Am Stat Assoc 65(331):1350–1361
Weiss G (2004) Mining with rarity: a unifying framework. ACM SIGKDD Explor Newsl 6(1):7–19
Weiss G, Provost F (2001) The effect of class distribution on classifier learning: an empirical study. Technical Report ML-TR-44
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques with Java implementations. Elsevier, Amsterdam
Xiao Y, Gordon A, Yakovlev A (2006a) A C++ program for the Cramér–von Mises two-sample test. J Stat Softw 17:1–15
Xiao Y, Gordon A, Yakovlev A (2006b) The L1-version of the Cramér–von Mises test for two-sample comparisons in microarray data analysis. EURASIP J Bioinform Syst Biol 2006:85769
Xue J, Weiss G (2009) Quantification and semi-supervised classification methods for handling changes in class distribution. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 897–906
Yang Y (2003) Discretization for naive-Bayes learning. PhD thesis, Monash University
Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of the 18th international conference on machine learning (ICML), pp 609–616
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 694–699
Metadata
Title
Aggregative quantification for regression
Authors
Antonio Bella
Cèsar Ferri
José Hernández-Orallo
María José Ramírez-Quintana
Publication date
01.03.2014
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 2/2014
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-013-0308-z
