2020 | OriginalPaper | Chapter

Different Strategies of Fitting Logistic Regression for Positive and Unlabelled Data

Authors: Paweł Teisseyre, Jan Mielniczuk, Małgorzata Łazęcka

Published in: Computational Science – ICCS 2020

Publisher: Springer International Publishing


Abstract

In this paper we revisit the problem of fitting logistic regression to positive and unlabelled data. There are two key contributions. First, new light is shed on the properties of the frequently used naive method, in which unlabelled examples are treated as negative. In particular, we show that the naive method amounts to an incorrect specification of the logistic model and that, as a consequence, its parameter estimates are shrunk towards zero. An interesting relationship between the shrinkage parameter and the label frequency is established. Second, we introduce a novel method of fitting the logistic model based on simultaneous estimation of the coefficient vector and the label frequency. Importantly, the proposed method does not require prior estimation of the label frequency, which is a major obstacle in positive-unlabelled learning. On several benchmark data sets the proposed method predicts the posterior probability better than both the naive method and the weighted likelihood method. Moreover, it yields a consistently better estimator of the label frequency than the two other known methods. We also introduce a simple but powerful representation of positive and unlabelled data under the Selected Completely at Random assumption, from which most properties of such a model follow straightforwardly.
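The two strategies contrasted in the abstract can be sketched in a few lines. The snippet below is a minimal illustration on synthetic data, not the authors' implementation: it assumes a one-dimensional logistic model, and all numeric settings (true coefficients, label frequency, sample size, learning rates) are made up for the example. The naive fit regresses the observed label s on x directly; the joint fit ascends the PU log-likelihood built from P(s=1|x) = c · sigmoid(b0 + b1·x), which holds under the Selected Completely at Random assumption, estimating the label frequency c alongside the coefficients.

```python
import math
import random

def sigmoid(z):
    # numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# --- synthetic PU data under SCAR (all numbers are illustrative) ---
random.seed(1)
B0, B1, C = -0.5, 2.0, 0.6              # true intercept, slope, label frequency
xs, ss = [], []
for _ in range(2000):
    x = random.gauss(0.0, 1.0)
    y = random.random() < sigmoid(B0 + B1 * x)    # true (hidden) class
    s = 1 if (y and random.random() < C) else 0   # observed label, SCAR
    xs.append(x)
    ss.append(s)
n = len(xs)

def naive_fit(iters=400, lr=1.0):
    """Ordinary logistic regression of s on x: unlabelled treated as negative."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, s in zip(xs, ss):
            r = s - sigmoid(b0 + b1 * x)
            g0 += r
            g1 += r * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def joint_fit(b0, b1, c=0.5, iters=800, lr=0.5, lrc=0.1):
    """Gradient ascent on the PU log-likelihood with P(s=1|x) = c*sigmoid(eta)."""
    for _ in range(iters):
        g0 = g1 = gc = 0.0
        for x, s in zip(xs, ss):
            p = sigmoid(b0 + b1 * x)
            if s == 1:
                d = 1.0 - p                        # d log(c*p) / d eta
                gc += 1.0 / c                      # d log(c*p) / d c
            else:
                d = -c * p * (1.0 - p) / (1.0 - c * p)   # d log(1-c*p) / d eta
                gc += -p / (1.0 - c * p)                 # d log(1-c*p) / d c
            g0 += d
            g1 += d * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
        c = min(0.999, max(0.001, c + lrc * gc / n))     # keep c in (0, 1)
    return b0, b1, c

nb0, nb1 = naive_fit()
jb0, jb1, c_hat = joint_fit(nb0, nb1)
print(f"naive slope {nb1:.2f}, joint slope {jb1:.2f}, "
      f"estimated label frequency {c_hat:.2f} (true {C})")
```

Running the sketch shows the shrinkage discussed in the abstract: the naive slope stays well below the true value 2.0, while the joint fit, started from the naive estimate, pushes the slope back up and produces an estimate of the label frequency as a by-product.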
Metadata
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-50423-6_1