Abstract
We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., a difference in the distribution of the priors between the training documents and the unlabelled documents. Given a machine-learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and it is still considered a top contender when we need to estimate the priors (a task that has become known as “quantification”). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.
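The iterative, mutually recursive update that the abstract describes is an instance of expectation maximization: at each step the posteriors are rescaled by the ratio of the current prior estimates to the training priors, and the new prior estimates are obtained by averaging the rescaled posteriors. The following NumPy sketch illustrates this scheme; the function name, signature, and stopping criterion are our own choices for illustration, not taken from the original paper.

```python
import numpy as np

def sld(posteriors, train_priors, max_iter=1000, tol=1e-6):
    """EM-style prior/posterior adjustment in the spirit of SLD.

    posteriors   : (n_docs, n_classes) posteriors output by a classifier
    train_priors : (n_classes,) class prevalences observed in training
                   (assumed strictly positive)
    Returns (adjusted_priors, adjusted_posteriors).
    """
    priors = train_priors.copy()
    post = posteriors
    for _ in range(max_iter):
        # E-step: rescale the *original* posteriors by the ratio of the
        # current prior estimates to the training priors, then renormalize
        # each row so it is again a probability distribution.
        post = posteriors * (priors / train_priors)
        post = post / post.sum(axis=1, keepdims=True)
        # M-step: the new prior estimate is the mean of the adjusted
        # posteriors across the unlabelled documents.
        new_priors = post.mean(axis=0)
        converged = np.max(np.abs(new_priors - priors)) < tol
        priors = new_priors
        if converged:
            break
    return priors, post
```

As the abstract notes, convergence of this loop is not guaranteed in general, which is why the sketch caps the number of iterations rather than looping until the tolerance is met.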
- Tuomo Alasalmi, Jaakko Suutala, Heli Koskimäki, and Juha Röning. 2020. Better classifier calibration for small data sets. ACM Trans. Knowl. Discov. Data 14, 3 (2020), 1--19. DOI: https://doi.org/10.1145/3385656
- Antonio Bella, Cèsar Ferri, José Hernández-Orallo, and María José Ramírez-Quintana. 2014. Aggregative quantification for regression. Data Mining Knowl. Discov. 28, 2 (2014), 475--518. DOI: https://doi.org/10.1007/s10618-013-0308-z
- Artem Bequé, Kristof Coussement, Ross W. Gayler, and Stefan Lessmann. 2017. Approaches for credit scorecard calibration: An empirical analysis. Knowl.-based Syst. 134 (2017), 213--227. DOI: https://doi.org/10.1016/j.knosys.2017.07.034
- Glenn W. Brier. 1950. Verification of forecasts expressed in terms of probability. Month. Weath. Rev. 78, 1 (1950), 1--3. DOI: https://doi.org/10.1175/1520-0493(1950)078<0001:vofeit>2.0.co;2
- Gordon V. Cormack. 2008. Email spam filtering: A systematic review. Found. Trends Inf. Retr. 1, 4 (2008), 335--455. DOI: https://doi.org/10.1561/9781601981479
- Kristof Coussement and Wouter Buckinx. 2011. A probability-mapping algorithm for calibrating the posterior probabilities: A direct marketing application. Eur. J. Op. Res. 214, 3 (2011), 732--738. DOI: https://doi.org/10.1016/j.ejor.2011.05.027
- Morris H. DeGroot and Stephen E. Fienberg. 1983. The comparison and evaluation of forecasters. The Statistician 32, 1/2 (1983), 12--22. DOI: https://doi.org/10.2307/2987588
- Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B 39, 1 (1977), 1--38.
- Pedro M. Domingos and Michael J. Pazzani. 1996. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the 13th International Conference on Machine Learning (ICML'96). 105--112.
- Andrea Esuli, Alejandro Moreo, and Fabrizio Sebastiani. 2018. A recurrent neural network for sentiment quantification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM'18). 1775--1778. DOI: https://doi.org/10.1145/3269206.3269287
- Andrea Esuli, Alejandro Moreo, and Fabrizio Sebastiani. 2020. Cross-lingual sentiment quantification. IEEE Intell. Syst. 35, 3 (2020), 106--114. DOI: https://doi.org/10.1109/MIS.2020.2979203
- Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recog. Lett. 27 (2006), 861--874.
- Tom Fawcett and Peter Flach. 2005. A response to Webb and Ting's "On the application of ROC analysis to predict classification performance under varying class distributions." Mach. Learn. 58, 1 (2005), 33--38. DOI: https://doi.org/10.1007/s10994-005-5256-4
- Afonso Fernandes Vaz, Rafael Izbicki, and Rafael Bassi Stern. 2019. Quantification under prior probability shift: The ratio estimator and its extensions. J. Mach. Learn. Res. 20 (2019), 79:1--79:33.
- Peter A. Flach. 2017. Classifier calibration. In Encyclopedia of Machine Learning (2nd ed.), Claude Sammut and Geoffrey I. Webb (Eds.). Springer, DE, 212--219.
- George Forman. 2008. Quantifying counts and costs via classification. Data Mining Knowl. Discov. 17, 2 (2008), 164--206. DOI: https://doi.org/10.1007/s10618-008-0097-y
- Wei Gao and Fabrizio Sebastiani. 2016. From classification to quantification in tweet sentiment analysis. Soc. Netw. Anal. Mining 6, 19 (2016), 1--22. DOI: https://doi.org/10.1007/s13278-016-0327-z
- Tilmann Gneiting and Adrian E. Raftery. 2007. Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. 102, 477 (2007), 359--378. DOI: https://doi.org/10.1198/016214506000001437
- Pablo González, Alberto Castaño, Nitesh V. Chawla, and Juan José del Coz. 2017. A review on quantification learning. Comput. Surveys 50, 5 (2017), 74:1--74:40. DOI: https://doi.org/10.1145/3117807
- Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. J. Amer. Soc. Inf. Sci. Technol. 60, 1 (2009), 9--26. DOI: https://doi.org/10.1002/asi.20961
- David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR'94). 3--12. DOI: https://doi.org/10.1007/978-1-4471-2099-5_1
- Alessio Molinari. 2019. Leveraging the transductive nature of e-discovery in cost-sensitive technology-assisted review. In Proceedings of the 8th BCS-IRSG Symposium on Future Directions in Information Access (FDIA'19). 72--78.
- Alessio Molinari. 2019. Risk Minimization Models for Technology-assisted Review and Their Application to e-discovery. Master's thesis. Department of Computer Science, University of Pisa, Pisa, IT.
- Jose G. Moreno-Torres, Troy Raeder, Rocío Alaíz-Rodríguez, Nitesh V. Chawla, and Francisco Herrera. 2012. A unifying view on dataset shift in classification. Pattern Recog. 45, 1 (2012), 521--530. DOI: https://doi.org/10.1016/j.patcog.2011.06.019
- Alejandro Moreo and Fabrizio Sebastiani. 2020. Tweet sentiment quantification: An experimental re-evaluation. Submitted for publication. https://arxiv.org/abs/2011.08091.
- Allan H. Murphy. 1973. A new vector partition of the probability score. J. Appl. Meteorol. 12, 4 (1973), 595--600.
- Mahdi P. Naeini, Gregory F. Cooper, and Milos Hauskrecht. 2015. Obtaining well-calibrated probabilities using Bayesian binning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI'15). 2901--2907.
- Alexandru Niculescu-Mizil and Rich Caruana. 2005. Obtaining calibrated probabilities from boosting. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI'05). 413--420.
- Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML'05). 625--632. DOI: https://doi.org/10.1145/1102351.1102430
- Douglas W. Oard, Fabrizio Sebastiani, and Jyothi K. Vinjumur. 2018. Jointly minimizing the expected costs of review for responsiveness and privilege in e-discovery. ACM Trans. Inf. Syst. 37, 1 (2018), 11:1--11:35. DOI: https://doi.org/10.1145/3268928
- John C. Platt. 2000. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, Alexander Smola, Peter Bartlett, Bernhard Schölkopf, and Dale Schuurmans (Eds.). The MIT Press, Cambridge, MA, 61--74.
- Pablo Pérez-Gállego, Alberto Castaño, José Ramón Quevedo, and Juan José del Coz. 2019. Dynamic ensemble selection for quantification tasks. Inf. Fusion 45 (2019), 1--15. DOI: https://doi.org/10.1016/j.inffus.2018.01.001
- Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence (Eds.). 2009. Dataset Shift in Machine Learning. The MIT Press, Cambridge, MA. DOI: https://doi.org/10.7551/mitpress/9780262170055.001.0001
- Marco Saerens, Patrice Latinne, and Christine Decaestecker. 2002. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neur. Comput. 14, 1 (2002), 21--41. DOI: https://doi.org/10.1162/089976602753284446
- Fabrizio Sebastiani. 2020. Evaluation measures for quantification: An axiomatic approach. Inf. Retr. J. 23, 3 (2020), 255--288. DOI: https://doi.org/10.1007/s10791-019-09363-y
- David Spence, Christopher Inskip, Novi Quadrianto, and David Weir. 2019. Quantification under class-conditional dataset shift. In Proceedings of the 11th International Conference on Advances in Social Networks Analysis and Mining (ASONAM'19). 528--529. DOI: https://doi.org/10.1145/3341161.3342948
- D. B. Stephenson, C. A. S. Coelho, and I. T. Jolliffe. 2008. Two extra components in the Brier score decomposition. Weath. Forecast. 23, 4 (2008), 752--757. DOI: https://doi.org/10.1175/2007WAF2006116.1
- Meesun Sun and Sungzoon Cho. 2018. Obtaining calibrated probability using ROC binning. Pattern Anal. Applic. 21, 2 (2018), 307--322. DOI: https://doi.org/10.1007/s10044-016-0578-3
- Vladimir Vapnik. 1998. Statistical Learning Theory. Wiley, New York, NY.
- Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng. 2004. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res. 5 (2004), 975--1005.
- Bianca Zadrozny and Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (KDD'02). 694--699. DOI: https://doi.org/10.1145/775107.775151
Index Terms
- A Critical Reassessment of the Saerens-Latinne-Decaestecker Algorithm for Posterior Probability Adjustment