Published in: Data Mining and Knowledge Discovery, Issue 2/2014

01.03.2014

Aggregative quantification for regression

Authors: Antonio Bella, Cèsar Ferri, José Hernández-Orallo, María José Ramírez-Quintana

Abstract

The problem of estimating the class distribution (or prevalence) for a new unlabelled dataset (from a possibly different distribution) is a very common problem which has been addressed in one way or another in the past decades. This problem has been recently reconsidered as a new task in data mining, renamed quantification when the estimation is performed as an aggregation (and possible adjustment) of a single-instance supervised model (e.g., a classifier). However, the study of quantification has been limited to classification, while it is clear that this problem also appears, perhaps even more frequently, with other predictive problems, such as regression. In this case, the goal is to determine a distribution or an aggregated indicator of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between the estimation of the whole distribution and the estimation of some indicators (summary statistics), for both classification and regression. This distinction is especially useful for regression, since predictions are numerical values that can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and see that the approaches borrowed from classification do not work. We present several techniques based on segmentation which are able to produce accurate estimations of the expected value and the distribution of the output variable. We show experimentally that these methods especially excel for the relevant scenarios where training and test distributions dramatically differ.
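The naive aggregative baseline for regression quantification (named Regress & Sum in footnote 1) simply trains a single-instance regressor and aggregates its predictions over the unlabelled test set. The following is a minimal sketch of that baseline under stated assumptions, not the authors' implementation; the regressor choice, the function name regress_and_sum and the toy data are illustrative.

```python
# Minimal sketch of the naive aggregative baseline ("Regress & Sum"):
# train a single-instance regressor, predict every unlabelled test
# instance, and aggregate the predictions. Model choice and names are
# illustrative, not the paper's code.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def regress_and_sum(X_train, y_train, X_test):
    """Estimate the mean of the output variable on an unlabelled test set
    by averaging the predictions of a regressor trained on labelled data."""
    model = DecisionTreeRegressor(min_samples_leaf=5).fit(X_train, y_train)
    y_hat = model.predict(X_test)
    return y_hat.mean(), y_hat        # estimated mean and raw predictions

# Toy usage: a test set whose distribution is shifted with respect to
# training, the scenario where naive aggregation tends to degrade.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 2 * X_train.ravel() + rng.normal(0, 1, 500)
X_test = rng.uniform(6, 10, size=(200, 1))        # shifted covariates
est_mean, _ = regress_and_sum(X_train, y_train, X_test)
print(est_mean)
```

The abstract's point is that such naive aggregation degrades precisely when the training and test distributions differ, which is the scenario the toy data above simulates and which the segmentation-based methods of the paper address.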


Footnotes
1
We use the same acronym for Regress & Splice and Regress & Sum, since both just aggregate the individual values without any further processing.
 
2
The example is elaborated, with some fictional elements, from the cars dataset in the UCI repository (Frank and Asuncion 2010).
 
3
The example is elaborated, with some fictional elements, from the lowbwt dataset in the UCI repository (Frank and Asuncion 2010), originally from Hosmer and Lemeshow (2000).
 
4
In this paper the methodology for indicators and for the distribution is the same (except for some minor specific techniques, mostly at the end of the process), but they could differ, given that some indicators require less information and effort to estimate than the whole distribution.
 
5
Just as a mean-unbiased estimator minimises the squared loss, a median-unbiased estimator is a different choice, which minimises the absolute error.
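A quick numerical check of this statement (illustrative, not taken from the paper): on a skewed sample, the value minimising the mean squared loss coincides with the sample mean, while the value minimising the mean absolute loss coincides with the sample median.

```python
# Numerical illustration of footnote 5: the sample mean minimises the
# squared loss, while the sample median minimises the absolute loss.
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])          # skewed sample
candidates = np.linspace(0, 12, 1201)              # candidate estimates

sq_loss  = [np.mean((y - c) ** 2) for c in candidates]
abs_loss = [np.mean(np.abs(y - c)) for c in candidates]

print(candidates[np.argmin(sq_loss)], y.mean())        # both 3.7
print(candidates[np.argmin(abs_loss)], np.median(y))   # both 2.5
```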
 
6
The idea of segmenting the set of outputs is not new and has led to some classifier calibration techniques, such as binning (Zadrozny and Elkan 2002; Bella et al. 2009b). Calibration techniques are somewhat related to quantification techniques. In fact, \(RS\) would be optimal if the predictive model were perfectly calibrated—for the test set. This is a key point because calibration is always understood relative to a distribution or dataset. Given the quantification problems with distribution shift we are considering here, it is the test set distribution that we want to infer, so calibrating for the training set may be useless.
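For readers unfamiliar with binning calibration, the following is a hedged sketch of the basic idea in Zadrozny and Elkan (2002), not the similarity-binning variant of Bella et al. (2009b): scores are segmented into equal-frequency bins on held-out data, and each new score is mapped to the empirical positive rate of its bin. The function name, bin count and the fallback rate of 0.5 for empty bins are illustrative assumptions.

```python
# Sketch of binning calibration: segment scores into equal-frequency bins
# and map each new score to the empirical positive rate of its bin.
import numpy as np

def binning_calibrate(scores_val, labels_val, scores_new, n_bins=10):
    """Return calibrated probabilities for scores_new using equal-frequency
    bins built from held-out (scores_val, labels_val)."""
    scores_val = np.asarray(scores_val, dtype=float)
    labels_val = np.asarray(labels_val, dtype=float)
    edges = np.quantile(scores_val, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover any new score
    bin_of_val = np.digitize(scores_val, edges[1:-1])
    bin_rate = np.array([labels_val[bin_of_val == b].mean()
                         if np.any(bin_of_val == b) else 0.5
                         for b in range(n_bins)])
    return bin_rate[np.digitize(np.asarray(scores_new, float), edges[1:-1])]
```

As the footnote stresses, the resulting map is calibrated with respect to the data used to build the bins; under distribution shift it need not be calibrated for the test set.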
 
7
Note that this adjustment is performed with information from the training data exclusively. An alternative possibility would be to use a validation dataset, but this would reduce the available training data.
 
8
An alternative, more lightweight approach could be to introduce a normal jitter to each prediction \(\hat{y}\). While this may have a similar effect, its random component may be significant for small datasets. The smoothing approach presented here always leads to the same result, since it has no random components.
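The contrast drawn in this footnote can be made concrete with a small sketch (illustrative, not the paper's code): option A jitters each prediction with normal noise, so repeated runs give different estimates on a small test set, whereas option B evaluates a deterministic, equally weighted mixture of normals centred on the predictions, so the estimated distribution is always the same. The bandwidth h is an assumed parameter.

```python
# Random jitter versus deterministic smoothing of the estimated
# output distribution built from a regressor's predictions.
import numpy as np
from scipy.stats import norm

y_hat = np.array([2.1, 2.4, 3.0, 5.2])   # predictions on a small test set
h = 0.5                                   # assumed smoothing bandwidth

# Option A: random jitter; results vary from run to run on small samples.
rng = np.random.default_rng()
jittered = y_hat + rng.normal(0.0, h, size=y_hat.shape)

# Option B: deterministic smoothing; the estimated CDF is an equally
# weighted mixture of normals centred on the predictions.
def smoothed_cdf(t):
    return norm.cdf(t, loc=y_hat, scale=h).mean()

print(jittered)            # changes on every run
print(smoothed_cdf(3.0))   # identical on every run
```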
 
9
Some alternatives could be explored here, such as the use of one-vs-previous or one-vs-adjacent schemes. This is left as a possibility for future work.
 
References
Alonzo TA, Pepe MS, Lumley T (2003) Estimating disease prevalence in two-phase studies. Biostatistics 4(2):313–326
Anderson T (1962) On the distribution of the two-sample Cramér–von Mises criterion. Ann Math Stat 33(3):1148–1159
Bakar AA, Othman ZA, Shuib NLM (2009) Building a new taxonomy for data discretization techniques. In: Proceedings of the 2nd conference on data mining and optimization (DMO'09), pp 132–140
Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2009a) Calibration of machine learning models. In: Handbook of research on machine learning applications. IGI Global, Hershey
Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2009b) Similarity-binning averaging: a generalisation of binning calibration. In: International conference on intelligent data engineering and automated learning. LNCS, vol 5788. Springer, Berlin, pp 341–349
Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2010) Quantification via probability estimators. In: International conference on data mining, ICDM 2010, pp 737–742
Chan Y, Ng H (2006) Estimating class priors in domain adaptation for word sense disambiguation. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp 89–96
Chawla N, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor Newsl 6(1):1–6
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Prieditis A, Russell S (eds) Proceedings of the twelfth international conference on machine learning. Morgan Kaufmann, San Francisco, pp 194–202
Ferri C, Hernández-Orallo J, Modroiu R (2009) An experimental comparison of performance measures for classification. Pattern Recogn Lett 30(1):27–38
Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press, Cambridge
Forman G (2005) Counting positives accurately despite inaccurate classification. In: Proceedings of the 16th European conference on machine learning (ECML), pp 564–575
Forman G (2006) Quantifying trends accurately despite classifier error and class imbalance. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 157–166
González-Castro V, Alaiz-Rodríguez R, Alegre E (2012) Class distribution estimation based on the Hellinger distance. Inf Sci 218(1):146–164
Hastie TJ, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, Berlin
Hernández-Orallo J, Flach P, Ferri C (2012) A unified view of performance metrics: translating threshold choice into expected classification loss. J Mach Learn Res 13:2813–2869
Hwang JN, Lay SR, Lippman A (1994) Nonparametric multivariate density estimation: a comparative study. IEEE Trans Signal Process 42(10):2795–2810
Hyndman RJ, Bashtannyk DM, Grunwald GK (1996) Estimating and visualizing conditional densities. J Comput Graph Stat 5(4):315–336
Moreno-Torres J, Raeder T, Alaiz-Rodríguez R, Chawla N, Herrera F (2012) A unifying view on dataset shift in classification. Pattern Recogn 45(1):521–530
Neyman J (1938) Contribution to the theory of sampling human populations. J Am Stat Assoc 33(201):101–116
Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in large margin classifiers. MIT Press, Cambridge, pp 61–74
Raeder T, Forman G, Chawla N (2012) Learning from imbalanced data: evaluation matters. Data Min 23:315–331
Sánchez L, González V, Alegre E, Alaiz R (2008) Classification and quantification based on image analysis for sperm samples with uncertain damaged/intact cell proportions. In: Proceedings of the 5th international conference on image analysis and recognition. LNCS, vol 5112. Springer, Heidelberg, pp 827–836
Sturges H (1926) The choice of a class interval. J Am Stat Assoc 21(153):65–66
R Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Tenenbein A (1970) A double sampling scheme for estimating from binomial data with misclassifications. J Am Stat Assoc 65(331):1350–1361
Weiss G (2004) Mining with rarity: a unifying framework. ACM SIGKDD Explor Newsl 6(1):7–19
Weiss G, Provost F (2001) The effect of class distribution on classifier learning: an empirical study. Technical Report ML-TR-44
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques with Java implementations. Elsevier, Amsterdam
Xiao Y, Gordon A, Yakovlev A (2006a) A C++ program for the Cramér–von Mises two-sample test. J Stat Softw 17:1–15
Xiao Y, Gordon A, Yakovlev A (2006b) The L1-version of the Cramér–von Mises test for two-sample comparisons in microarray data analysis. EURASIP J Bioinform Syst Biol 2006:85769
Xue J, Weiss G (2009) Quantification and semi-supervised classification methods for handling changes in class distribution. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 897–906
Yang Y (2003) Discretization for naive-Bayes learning. PhD thesis, Monash University
Zadrozny B, Elkan C (2001) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of the 18th international conference on machine learning (ICML), pp 609–616
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 694–699
Metadata
Title
Aggregative quantification for regression
Authors
Antonio Bella
Cèsar Ferri
José Hernández-Orallo
María José Ramírez-Quintana
Publication date
01.03.2014
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 2/2014
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-013-0308-z
