ABSTRACT
This paper presents a boosting based algorithm for learning a bipartite ranking function (BRF) with partially labeled data. Until now different attempts had been made to build a BRF in a transductive setting, in which the test points are given to the methods in advance as unlabeled data. The proposed approach is a semi-supervised inductive ranking algorithm which, as opposed to transductive algorithms, is able to infer an ordering on new examples that were not used for its training. We evaluate our approach using the TREC-9 Ohsumed and the Reuters-21578 data collections, comparing against two semi-supervised classification algorithms for ROCArea (AUC), uninterpolated average precision (AUP), mean precision@50 (TP) and Precision-Recall (PR) curves. In the most interesting cases where there are an unbalanced number of irrelevant examples over relevant ones, we show our method to produce statistically significant improvements with respect to these ranking measures.
- S. Agarwal. Ranking on Graph Data. In Proceedings of the 23rd International Conference on Machine Learning, pages 25--32, 2006. Google ScholarDigital Library
- M.-R. Amini and P. Gallinari. The Use of Unlabeled Data to Improve Supervised Learning for Text Summarization. In Proceedings of the 25th annual international ACM SIGIR, pages 105--112, 2002. Google ScholarDigital Library
- M.-R. Amini and P. Gallinari. Semi-Supervised Learning with Explicit Misclassification Modeling. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 555--560, 2003. Google ScholarDigital Library
- P. L. Bartlett, M. I. Jordan and Jon D. McAuliffe. Large Margin Classifiers: convex Loss, Low Noise and Convergence Rates. In Advances in Neural Information Processing Systems 16, pages 1173--1180, 2004.Google Scholar
- A. Blum and T. Mitchell. Combining labeled and unlabeled data with Co-training. In Proceedings of the Workshop on Computational Learning Theory, pages 92--100, Wisconsin, USA, 1998. Google ScholarDigital Library
- A.P. Bradley. The use of the Area under the ROC curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition, 30:1145--1159, 1997. Google ScholarDigital Library
- R. Caruana and A. Niculescu-Mizil. Data mining in metric space: an empirical analysis of supervised learning performance criteria. In Proceedings of the 10th ACM SIGKDD, pages 69--78, 2004. Google ScholarDigital Library
- O. Chapelle, B. Schölkopf and A. Zien. Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006 Google ScholarDigital Library
- C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems 16, pages 313--320, 2004.Google Scholar
- Y. Freund, R. Iyer, R.E. Schapire and Y. Singer. An efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research, 4:933--969, 2004. Google ScholarDigital Library
- E. Gaussier and C. Goutte. Learning with Partially Labelled Data - with Confidence. In ICML'05 Workshop on Learning from Partially Classified Training Data (ICML'05-LPCT), pages 29--36, 2005.Google Scholar
- W. Hersh, C. Buckley, T. J. Leone and David Hickam. OHSUMED: an interactive Retrieval Evaluation and new Large Test Collection for Research. In Proceedings of the 17th annual international ACM SIGIR, pages 192--201, 1994. Google ScholarDigital Library
- T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of 16th International Conference on Machine Learning, pages 200--209, 1999. Google ScholarDigital Library
- E.L. Lehmann. Nonparametric Statistical Methods Based on Ranks. McGraw-Hill, New York, 1975.Google Scholar
- D. D. Lewis. Reuters-21578, distribution 1.0 http://www.daviddlewis.com/resources/testcollections/reuters21578/. January 1997.Google Scholar
- K. Nigam, A.K. McCallum, S. Thrun and Tom Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning 39(2/3):127--163, 2000. Google ScholarDigital Library
- S. Robertson and D.A. Hull. The TREC-9 Filtering Track Final Report. In Proceedings of the 9th Text REtrieval Conference (TREC-9). pages 25--40, 2001.Google Scholar
- I. Soboroff and S. Robertson. Building a Filtering Test Collection for TREC 2002. In Proceedings of the 26th annual international ACM SIGIR. pages 243--250, 2003. Google ScholarDigital Library
- C. van Rijsbergen, Information Retrieval, Butterworths, London, 1979. Google ScholarDigital Library
- J.-N. Vittaut, M.-R. Amini and P. Gallinari, Learning Classification with Both Labeled and Unlabeled Data. In Proceedings of the 13th European Conference on Machine Learning. pages 468--476, 2002. Google ScholarDigital Library
- C.C. Vogt and G.W. Cottrell, Fusion Via a Linear Combination of Scores. Information Retrieval, 1(3):151--173, 1999. Google ScholarDigital Library
- J. Weston, R. Kuang, C. Leslie and W.S. Noble. Protein ranking by semi-supervised network propagation. BMC Bioinformatics, special issue, 2006.Google Scholar
- D. Zhou, J. Weston, A. Gretton, O. Bousquet and B. Schölkopf. Ranking on Data Manifolds. In Advances in Neural Information Processing Systems 16, pages 169--176, 2004.Google Scholar
- D. Zhou , C.J.C. Burges and T. Tao. Transductive link spam detection. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web. pages 21--28, 2007. Google ScholarDigital Library
Index Terms
- A boosting algorithm for learning bipartite ranking functions with partially labeled data
Recommendations
Learning to rank with partially-labeled data
SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrievalRanking algorithms, whose goal is to appropriately order a set of objects/documents, are an important component of information retrieval systems. Previous work on ranking algorithms has focused on cases where only labeled data is available for training (...
Boosting for multiclass semi-supervised learning
We present an algorithm for multiclass semi-supervised learning, which is learning from a limited amount of labeled data and plenty of unlabeled data. Existing semi-supervised learning algorithms use approaches such as one-versus-all to convert the ...
Non-linear dictionary learning with partially labeled data
While recent techniques for discriminative dictionary learning have demonstrated tremendous success in image analysis applications, their performance is often limited by the amount of labeled data available for training. Even though labeling images is ...
Comments