A Review on Quantification Learning

Authors:
Pablo González, University of Oviedo, Spain
Alberto Castaño, University of Oviedo, Spain
Nitesh V. Chawla, University of Notre Dame, USA
Juan José Del Coz, University of Oviedo, Spain

ACM Computing Surveys, Volume 50, Issue 5, Article No. 74, pp. 1–40. https://doi.org/10.1145/3117807

Published: 26 September 2017

Abstract

The task of quantification consists of providing an aggregate estimate (e.g., the class distribution in a classification problem) for unseen test sets by applying a model trained on a training set with a different data distribution. Several real-world applications demand this kind of method, which does not require predictions for individual examples and focuses solely on obtaining accurate estimates at an aggregate level. During the past few years, several quantification methods have been proposed from different perspectives and with different goals. This article presents a unified review of the main approaches with the aim of serving as an introductory tutorial for newcomers to the field.
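
As a rough illustration of what an aggregate estimate can look like in the binary case, the sketch below contrasts two simple baselines commonly described in the quantification literature: Classify & Count, which reports the fraction of test examples that a classifier labels positive, and Adjusted Count, which corrects that fraction using the classifier's true positive and false positive rates estimated on held-out labeled data. The function names, the 0/1 label encoding, and the toy numbers are assumptions made for illustration; this is not code from the article.

```python
from typing import Sequence


def classify_and_count(predictions: Sequence[int]) -> float:
    """Return the fraction of examples predicted positive (labels are 0 or 1)."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(predictions) / len(predictions)


def adjusted_count(predictions: Sequence[int], tpr: float, fpr: float) -> float:
    """Correct the Classify & Count estimate with the classifier's error rates.

    Solves  cc = p * tpr + (1 - p) * fpr  for the prevalence p,
    then clips the result to the valid range [0, 1].
    """
    cc = classify_and_count(predictions)
    if tpr <= fpr:
        # The correction is undefined when the classifier is no better than
        # chance, so this sketch falls back to the uncorrected estimate.
        return cc
    p = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))


# Toy example with hypothetical numbers: 40 of 100 test items predicted positive,
# and a classifier with tpr = 0.8 and fpr = 0.2 measured on held-out data.
test_predictions = [1] * 40 + [0] * 60
print(classify_and_count(test_predictions))
print(adjusted_count(test_predictions, tpr=0.8, fpr=0.2))
```

In this toy case the raw count suggests a prevalence of 0.4, while the adjusted estimate is close to one third, which shows how an imperfect classifier can bias the aggregate estimate even though no individual prediction is ever inspected.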


Index Terms

A Review on Quantification Learning
1. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory

Published in
ACM Computing Surveys Volume 50, Issue 5
September 2018
573 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3145473
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 September 2017
- Revised: 1 June 2017
- Accepted: 1 June 2017
- Received: 1 December 2016

Author Tags
Class distribution estimation
prevalence estimation
quantification
Qualifiers
- tutorial
- Research
- Refereed