Leveraging linguistic traits and semi-supervised learning to single out informational content across how-to community question-answering archives
Introduction
The most popular community Question Answering (cQA) services host over 100 million resolved questions. As of December 2015, Yahoo! Answers1 had about one hundred million members, and its knowledge base held more than four hundred million questions posted since its inception (almost 250 million of them asked in English). As a logical consequence, cQA repositories are now perceived as vast sources of reusable information conveyed in natural language and enriched with some metadata [33], [43]. For instance, search engines attempt to reuse the best answers stored in these knowledge bases to produce snippets whenever they detect that a web user has submitted a question-like query. Needless to say, there are different types of cQA sites [9]; for example, some communities lack moderation, while others are strongly regulated. In essence, many of these cQA platforms are conceived as the synergy of an information-seeking service and a social network [32]: their members can ask any sort of simple, complex or detailed question, and they can additionally seek opinions and conduct polls. All these posts are expected to receive several responses from multiple users of the community, reflecting different walks of life and ranging from word-of-mouth tips and solutions to opinions and answers.
Moreover, members accrue social capital when they participate in the dynamics of these communities. For instance, they make comments, assess quality, or express their sympathy towards questions and answers (e.g., thumbs-up/thumbs-down). By means of these social interactions, users of these sites disseminate information (e.g., facts and word-of-mouth tips), resulting in a first-rate, quickly growing repository of questions and answers evaluated by humans. Nonetheless, there is an inherent lag between when a question is published and when the first good responses are posted. Hence, cQA platforms aim at responding to the pressing need for automatically reusing past resolved questions, and as a by-product, they also lessen duplicate asking.
By and large, Yahoo! Answers has been categorized as a social cQA site, which facilitates voluntary discussion among peers for questioning and answering [15]. Such sites encourage and thrive on communities built around information exchange, introducing a social aspect to information seeking [36]. This type of cQA site has proven more attractive for answering complex, especially how-to, questions (e.g., “How to remove pimples?” and “How to break up with my boyfriend?”) [17]. In effect, our exploratory analysis indicates that across manner questions the synergy between an information-seeking service and a social network becomes more conspicuous. On the one hand, this sort of question is highly likely to be informational; on the other hand, its answers show an almost even distribution between informational and non-informational content. Therefore, when retrieving and revitalizing elements stored in the knowledge base, it is relevant to match the intent behind newly posted questions with the goal of the archived answers that will be presented to the asker. Put differently, informational manner questions, such as “How to make beer?” and “How to make easy home-made mayonnaise?”, require good informational answers, while non-informational procedural questions, including “How to recover from a broken heart?” and “How can I get beautiful hair?”, need good, but non-informational, answers. Textbook cases of the latter are questions aiming at attracting or finding answers bearing great social capital.
In this work, we seek to effectively separate informational from non-informational content across how-to questions and answers, and show how sorting this out contributes to improving the retrieval, and thus the reuse, of their past best answers. We base this separation on discriminant models built on top of linguistically motivated properties. In other words, our reasoning is that good answers can be fetched from both groups, so their archival value depends on matching the intent of the question. This premise rests on the idea that there are linguistic traits characterizing both groups. To be more exact, this paper makes the following contributions with respect to prior work:
- 1.
Questions across cQA archives are composed of a title and a body. The former is utilized for: a) conveying the gist of the question; b) catching attention; or c) summarizing the whole question. Despite the fact that the latter is likely to go into the details of the question, current studies have focused their attention on models that discriminate the informational/non-informational nature of questions on the grounds of their titles only. This work furthers these studies by examining the contribution of their bodies, and it additionally aims at automatically discriminating informational/non-informational answers.
- 2.
Indeed, manually annotating full questions and answers demands considerable effort in time and human resources, especially if question bodies are taken into account. As a means of lowering this cost, our work tests two state-of-the-art semi-supervised approaches for classifying both questions and answers as informational or non-informational. Both methods are compared to their supervised counterparts in a scenario where little training material is available.
- 3.
In order to build classifiers that generalize well, we explored distinct models for questions and answers enriched with assorted linguistically-oriented properties. Contrary to previous works, we not only take into account surface features such as frequent words across question titles, but also leverage attributes distilled from natural language processing tools, including sentiment and morphological analysis as well as dependency parsing.
- 4.
Lastly, we carry out several experiments that empirically prove and quantify how matching the intent of questions and answers, via the obtained semi-supervised models, aids in significantly improving the retrieval of best answers to how-to questions from cQA archives.
In a nutshell, this piece of research advances the task in the following respects: a) a thorough investigation into which kind of linguistic processing is more effective when combining few labelled training instances with large volumes of unlabelled material; b) an extension of earlier studies that examines the contribution of question bodies; and c) a deeper look into linguistically-motivated models for recognizing informational/non-informational answers.
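To give a concrete flavour of the linguistically-motivated attributes discussed above, the sketch below derives a handful of shallow features from a question title. The tiny sentiment lexicons and the feature names are illustrative stand-ins for the output of real NLP tools, not the actual feature set used in our models.

```python
# Toy sentiment lexicons -- illustrative stand-ins for the output of a
# real sentiment analyser, not the resources used in our models.
POSITIVE = {"good", "easy", "beautiful", "best"}
NEGATIVE = {"broken", "bad", "hard", "remove"}

def linguistic_features(text):
    """Map a question title to a handful of shallow linguistic features."""
    tokens = text.lower().split()
    return {
        "n_tokens": len(tokens),
        "n_positive": sum(t in POSITIVE for t in tokens),
        "n_negative": sum(t in NEGATIVE for t in tokens),
        "starts_with_how_to": tokens[:2] == ["how", "to"],
    }

print(linguistic_features("How to recover from a broken heart"))
# → {'n_tokens': 7, 'n_positive': 0, 'n_negative': 1, 'starts_with_how_to': True}
```

In practice, features of this kind would be computed by full sentiment analysis, lemmatization and dependency parsing pipelines rather than by word lists and whitespace splitting.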
Despite the small number of training samples, our outcomes show that a semi-supervised variation of Support Vector Machine (SVM) models enriched with attributes produced by linguistic processing, such as lemmatization and sentiment analysis, can considerably enhance the classification rate, and consequently, the retrieval of best answers. Our empirical results also indicate that question bodies, when available, help to enhance performance. Overall, we show that the obtained models assist in increasing the retrieval of best answers by 4.12%. The remainder of this paper is organized as follows. Section 2 outlines the related work; Section 3 dissects our semi-supervised models; Section 4 describes our experiments and findings. Finally, Section 5 draws some conclusions and sketches future work.
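The general semi-supervised set-up, in which a handful of labelled examples is combined with an unlabelled pool, can be sketched as follows. This is a minimal illustration using scikit-learn's self-training wrapper over a probabilistic SVM and plain TF-IDF features; the toy titles are invented, and this is not the actual pipeline, feature set or semi-supervised algorithm evaluated in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Invented toy titles (1 = informational, 0 = non-informational).
labelled = [
    "How to make beer", "How to install a ceiling fan",
    "How to fix a flat tire", "How to cook white rice",
    "How to recover from a broken heart", "How can I get beautiful hair",
    "How to impress my crush", "How to feel happy again",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Unlabelled pool; the label -1 marks unlabelled rows for the wrapper.
unlabelled = ["How to make easy home-made mayonnaise",
              "How to break up with my boyfriend"]

X = TfidfVectorizer().fit_transform(labelled + unlabelled)
y = labels + [-1] * len(unlabelled)

# Self-training: an SVM is fitted on the labelled rows, then confident
# predictions on the unlabelled pool are folded in over several rounds.
model = SelfTrainingClassifier(SVC(probability=True), threshold=0.6)
model.fit(X, y)
print(model.predict(X[-2:]))
```

With realistic volumes of unlabelled material, the pseudo-labelling rounds are what allow the classifier to improve beyond its small labelled seed.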
Section snippets
Related work
In essence, cQA platforms are defined as places where members socially interact by posting questions frequently aimed at eliciting open-ended responses conveyed by multiple fellows of the community, e.g., thoughts, opinions and recommendations [9], fostering distinct kinds of relationships between their members. As a matter of fact, [22] discovered that users of Baidu Zhidao2 are more likely to perceive their close contacts as more reliable sources of good answers, and that
Leveraging linguistic traits and semi-supervised learning for identifying informational/non-informational manner questions and answers
In this paper, we propose to capitalize on linguistically-oriented semi-supervised models to tackle the lack of annotated corpora head-on. In doing so, we profited from the publicly available large-scale corpus of manner questions and answers provided by the Yahoo! Webscope Program.3 This corpus contains question and answer texts together with a small amount of metadata: the best answer, the top-level category and sub-categories
Experiments
In our empirical settings, we split our corpus of annotated questions into 1752 testing and 48 training instances by randomly singling out two elements from each category; in this way we attempt to avoid a bias towards any particular topic. In a similar manner, we divided the answers into 10,106 testing and 48 training samples, again choosing two random answers per category. Accordingly, the remaining 140,827 questions and 809,459 answers contained in the Yahoo! Webscope How-to corpus were considered as
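The per-category sampling described above can be sketched as follows; `split_two_per_category` and the toy category names are illustrative inventions, not the code or taxonomy actually used in our experiments.

```python
import random
from collections import defaultdict

def split_two_per_category(items, seed=0):
    """Pick two random items per category for training; the rest is test.

    `items` is a list of (text, category) pairs -- a stand-in for the
    annotated questions or answers described above.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for text, cat in items:
        by_cat[cat].append(text)
    train, test = [], []
    for cat, texts in by_cat.items():
        rng.shuffle(texts)
        train += [(t, cat) for t in texts[:2]]
        test += [(t, cat) for t in texts[2:]]
    return train, test

# Toy example: 3 categories x 4 items each -> 6 training, 6 testing
items = [(f"q{i}", cat) for cat in ("Food", "Health", "Pets") for i in range(4)]
train, test = split_two_per_category(items)
print(len(train), len(test))  # → 6 6
```

Sampling exactly two items per category yields the 48-instance training sets reported above while keeping every topic represented.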
Conclusions and future work
In this paper, we study two state-of-the-art semi-supervised models equipped with assorted linguistically motivated attributes for separating informational from non-informational content across manner questions and answers. With little training material, we show that it is possible to accomplish significant classification rates, which can be useful for enhancing the user experience by improving search across cQA archives. On the whole, our empirical outcomes point to the fact that
Acknowledgements
This work was partially supported by the project Fondecyt “Bridging the Gap between Askers and Answers in Community Question Answering Services” (11130094) funded by the Chilean Government.
References (44)
- et al., Evolutionary optimization for ranking how-to questions based on user-generated contents, Expert Syst. Appl. (2013)
- et al., Selection of relevant features and examples in machine learning, Artif. Intell. (1997)
- et al., An empirical study of sentence features for subjectivity and polarity classification, Inf. Sci. (2014)
- et al., Search clicks analysis for discovering temporally anchored questions in community question answering, Expert Syst. Appl. (2016)
- et al., Wrappers for feature subset selection, Artif. Intell. (1997)
- et al., Predicting the quality of user-generated answers using co-training in community-based question answering portals, Pattern Recognit. Lett. (2015)
- et al., Understanding and summarizing answers in community-based question answering services, Proceedings of the 22nd International Conference on Computational Linguistics (2008)
- et al., Sentiment analysis: a review and comparative analysis of web services, Inf. Sci. (2015)
- et al., Research agenda for social Q&A, Libr. Inf. Sci. Res. (2009)
- et al., Leveraging social Q&A collections for improving complex question answering, Comput. Speech Lang. (2015)