Top

Published in:

2020 | OriginalPaper | Chapter

Different Strategies of Fitting Logistic Regression for Positive and Unlabelled Data

Authors : Paweł Teisseyre, Jan Mielniczuk, Małgorzata Łazęcka

Published in: Computational Science – ICCS 2020

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

In the paper we revisit the problem of fitting logistic regression to positive and unlabelled data. There are two key contributions. First, a new light is shed on the properties of frequently used naive method (in which unlabelled examples are treated as negative). In particular we show that naive method is related to incorrect specification of the logistic model and consequently the parameters in naive method are shrunk towards zero. An interesting relationship between shrinkage parameter and label frequency is established. Second, we introduce a novel method of fitting logistic model based on simultaneous estimation of vector of coefficients and label frequency. Importantly, the proposed method does not require prior estimation, which is a major obstacle in positive unlabelled learning. The method is superior in predicting posterior probability to both naive method and weighted likelihood method for several benchmark data sets. Moreover, it yields consistently better estimator of label frequency than other two known methods. We also introduce simple but powerful representation of positive and unlabelled data under Selected Completely at Random assumption which yields straightforwardly most properties of such model.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

next chapter Branch-and-Bound Search for Training Cascades of Classifiers

https://github.com/teisseyrep/PUlogistic.

https://archive.ics.uci.edu/ml/datasets.php.

Bekker, J., Davis, J.: Learning from positive and unlabeled data: a survey (2018)

Sechidis, K., Sperrin, M., Petherick, E.S., Lująn, M., Brown, G.: Dealing with under-reported variables: an information theoretic solution. Int. J. Approx. Reason. 85, 159–177 (2017)MathSciNetCrossRef

Onur, I., Velamuri, M.: The gap between self-reported and objective measures of disease status in India. PLOS ONE 13(8), 1–18 (2018)CrossRef

Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.S.: Building text classifiers using positive and unlabeled examples. In: Proceedings of the Third IEEE International Conference on Data Mining, ICDM 2003, p. 179 (2003)

Fung, G.P.C., Yu, J.X., Lu, H., Yu, P.S.: Text classification without negative examples revisit. IEEE Trans. Knowl. Data Eng. 18(1), 6–20 (2006)CrossRef

Li, X., Liu, B.: Learning to classify texts using positive and unlabeled data. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence, pp. 587–592 (2003)

Mordelet, F., Vert, J.-P.: ProDiGe: prioritization of disease genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics 12(1), 389 (2011)CrossRef

Cerulo, L., Elkan, C., Ceccarelli, M.: Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinformatics 11, 228 (2010)CrossRef

Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008, pp. 213–220 (2008)

10.

du Plessis, M.C., Niu, G., Sugiyama, M.: Class-prior estimation for learning from positive and unlabeled data. Mach. Learn. 106(4), 463–492 (2016). https://doi.org/10.1007/s10994-016-5604-6MathSciNetCrossRefMATH

11.

Bekker, J., Davis, J.: Estimating the class prior in positive and unlabeled data through decision tree induction. In: Proceedings of the 32th AAAI Conference on Artificial Intelligence, February 2018

12.

Steinberg, D., Cardell, N.S.: Estimating logistic regression models when the dependent variable has no variance. Commun. Stat. Theory Methods 21(2), 423–450 (1992)CrossRef

13.

Lancaster, T., Imbens, G.: Case-control studies with contaminated controls. J. Econom. 71(1), 145–160 (1996)MathSciNetCrossRef

14.

Kiryo, R., Niu, G., du Plessis, M.C., Sugiyama, M.: Positive-unlabeled learning with non-negative risk estimator. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, pp. 1674–1684 (2017)

15.

Denis, F., Gilleron, R., Letouzey, F.: Learning from positive and unlabeled examples. Theoret. Comput. Sci. 348(1), 70–83 (2005)MathSciNetCrossRef

16.

Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press, Cambridge (2010)

17.

Candès, E., Fan, Y., Janson, L., Lv, J.: Panning for gold: model-x knockoffs for high-dimensional controlled variable selection. Manuscript (2018)

18.

Gottschalk, P.G., Dunn, J.R.: The five-parameter logistic: a characterization and comparison with the four-parameter logistic. Anal. Biochem. 343(1), 54–65 (2005)CrossRef

19.

Mielniczuk, J., Teisseyre, P.: What do we choose when we err? Model selection and testing for misspecified logistic regression revisited. In: Matwin, S., Mielniczuk, J. (eds.) Challenges in Computational Statistics and Data Mining. SCI, vol. 605, pp. 271–296. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-18781-5_15CrossRef

20.

Kubkowski, M., Mielniczuk, J.: Active set of predictors for misspecified logistic regression. Statistics 51, 1023–1045 (2017)MathSciNetCrossRef

21.

Sechidis, K., Brown, G.: Simple strategies for semi-supervised feature selection. Mach. Learn. 107(2), 357–395 (2017). https://doi.org/10.1007/s10994-017-5648-2MathSciNetCrossRefMATH

Title: Different Strategies of Fitting Logistic Regression for Positive and Unlabelled Data
Authors: Paweł Teisseyre
Jan Mielniczuk
Małgorzata Łazęcka
Publisher: Springer International Publishing
Book: Computational Science – ICCS 2020
Print ISBN: 978-3-030-50422-9

Electronic ISBN: 978-3-030-50423-6

Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-50423-6_1

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner