Training Data Optimization for Pairwise Learning to Rank

ABSTRACT
This paper studies data optimization for Learning to Rank (LtR): dropping training labels to increase ranking accuracy. Our work is inspired by data dropout, which shows that some training data do not positively influence learning and are better dropped out, despite the common belief that larger training datasets are beneficial. Our main contribution is to extend this intuition to noisy- and semi-supervised LtR scenarios: human annotations can be noisy or out-of-date, and so can machine-generated pseudo-labels in semi-supervised settings; dropping such unreliable labels would benefit both scenarios. State-of-the-art methods use the Influence Function (IF) to estimate how each training instance affects learning, and we identify and overcome two challenges specific to LtR: 1) non-convex ranking functions violate the assumptions required for the robustness of IF estimation, and 2) the pairwise learning of LtR incurs quadratic estimation overhead. Our technical contributions address these challenges: first, we revise estimation and data optimization to accommodate reduced reliability; second, we devise a group-wise estimation that reduces cost while keeping accuracy high. We validate the effectiveness of our approach on a wide range of ad-hoc information retrieval benchmarks and real-life search engine datasets in both noisy- and semi-supervised scenarios.
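To make the IF-based dropout idea concrete, the sketch below estimates the influence of each training pair on a validation loss, following the classical formulation I(z) = -∇L_val^T H^{-1} ∇L(z) of Koh and Liang (2017). This is a minimal illustration, not the paper's actual method: it assumes a linear scoring model with a logistic pairwise loss, and all function names are hypothetical. Pairs with large positive influence are those whose removal is predicted to lower the validation loss, i.e., candidates for dropout.

```python
import numpy as np

def pair_grad(w, xi, xj):
    """Gradient w.r.t. w of the logistic pairwise loss -log sigmoid(w.(xi - xj)),
    where the pair (xi, xj) encodes the label 'xi should rank above xj'."""
    d = xi - xj
    s = 1.0 / (1.0 + np.exp(-w @ d))
    return -(1.0 - s) * d

def pair_hess(w, pairs):
    """Hessian of the summed pairwise loss, with damping added because IF
    estimation assumes a convex objective with an invertible Hessian."""
    H = 1e-3 * np.eye(w.shape[0])
    for xi, xj in pairs:
        d = xi - xj
        s = 1.0 / (1.0 + np.exp(-w @ d))
        H += s * (1.0 - s) * np.outer(d, d)
    return H

def influences(w, train_pairs, val_pairs):
    """I(z) = -grad_val^T H^{-1} grad_z for each training pair z.
    Since dropping z perturbs its weight by -1/n, a pair with positive
    influence is one whose removal is estimated to reduce validation loss."""
    H_inv = np.linalg.inv(pair_hess(w, train_pairs))
    g_val = sum(pair_grad(w, xi, xj) for xi, xj in val_pairs)
    return np.array([-g_val @ H_inv @ pair_grad(w, xi, xj)
                     for xi, xj in train_pairs])
```

On a tiny example with a correctly labeled pair and a flipped (noisy) pair, the flipped pair receives a clearly positive influence and the clean pair a negative one, matching the intuition that unreliable labels are the ones worth dropping. Note that explicitly inverting the Hessian, as done here for clarity, is exactly the kind of cost the paper's group-wise estimation is designed to avoid.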