ABSTRACT
Knowing users' views and demographic traits offers great potential for personalizing web search results and related services such as query suggestion and query completion. Such signals, however, are often available for only a small fraction of search users, namely those who log in with their social network account and allow its use for personalization of search results. In this paper, we offer a solution to this problem by showing how user demographic traits such as age and gender, and even political and religious views, can be efficiently and accurately inferred from search query histories. This is accomplished in three steps. First, we train predictive models on the publicly available myPersonality dataset, which contains users' Facebook Likes and their demographic information. Second, we match Facebook Likes with search queries using Open Directory Project categories. Finally, we apply the models trained on Facebook Likes to large-scale query logs of a commercial search engine while explicitly taking into account the difference between the trait distributions of the two datasets. We find that the accuracies of classifying age and gender, expressed as the area under the ROC curve (AUC), are 77% and 84% respectively for predictions based on Facebook Likes, and degrade only to 74% and 80% when based on search queries. On a US state-by-state basis we find a Pearson correlation of 0.72 for political views between the predicted scores and Gallup data, and 0.54 for affiliation with Judaism between predicted scores and data from the US Religious Landscape Survey. We conclude that it is indeed feasible to infer important demographic data of users from their query history based on labelled Likes data, and we believe that this approach could provide valuable information for personalization and monetization even in the absence of demographic data.
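The abstract says that the model trained on Facebook Likes is applied to query logs while accounting for the difference in trait distributions between the two datasets, but it does not say how. As an illustration only, the sketch below shows one standard way to make such an adjustment for a binary trait: rescaling a predicted probability from the class prior of the training data to an assumed class prior of the target data. The function name and the example priors are hypothetical and are not taken from the paper.

```python
def adjust_for_prior_shift(p_source, source_prior, target_prior):
    """Rescale P(y=1 | x) from a source class prior to a target class prior.

    Assumes only the class prior differs between the two datasets
    (the class-conditional feature distributions are taken as unchanged).
    This is a generic correction, not necessarily the paper's method.
    """
    if not (0.0 < source_prior < 1.0 and 0.0 < target_prior < 1.0):
        raise ValueError("priors must lie strictly between 0 and 1")
    if not 0.0 <= p_source <= 1.0:
        raise ValueError("p_source must be a probability")
    # Reweight each class by the ratio of target prior to source prior.
    pos = p_source * target_prior / source_prior
    neg = (1.0 - p_source) * (1.0 - target_prior) / (1.0 - source_prior)
    return pos / (pos + neg)


if __name__ == "__main__":
    # Hypothetical numbers: the trait is held by 60% of the training users
    # but assumed to be held by 40% of the search users.
    for p in (0.2, 0.5, 0.8):
        print(p, round(adjust_for_prior_shift(p, 0.6, 0.4), 3))
```

When the two priors are equal the function returns the input unchanged, and a lower target prior pulls every score downward, which is the behaviour one would want before thresholding scores on the search-log population.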
Index Terms
- Inferring the demographics of search users: social data meets search queries