
Information Sciences

Volume 381, March 2017, Pages 20-32

Leveraging linguistic traits and semi-supervised learning to single out informational content across how-to community question-answering archives

https://doi.org/10.1016/j.ins.2016.11.006

Abstract

Community Question-Answering sites (e.g., Yahoo! Answers) have become large-scale knowledge bases of natural language questions formulated by their own members. In order to provide quick answers, these sites are compelled to make the best out of the content stored in their repositories. Researchers have discovered that, on the one hand, many of these services are the confluence of an information-seeking and a social network that constantly overlap, and on the other hand, how-to questions are frequently published across these platforms. By and large, informational procedural questions are highly likely to expect informational answers, while non-informational manner questions aim at socially interacting with other members of the community. In order to enhance user experience by reducing the delay in answering, these services are encouraged to identify, retrieve and revitalize the content maintained in their knowledge bases. For this purpose, it is key to match the intent of newly posted questions with the intent of the archived answers that will be presented to the asker.

By manually annotating a small number of how-to questions and answers, we carried out an exploratory analysis that unveils a dichotomy in the interaction of these two networks. More precisely, we corroborate previous findings indicating that procedural questions are more likely to bear an informational goal, but we also extend the analysis to their answers, revealing that they exhibit a more conspicuous confluence. In substance, we find that informational and non-informational answers are very likely to show up regardless of the intent of the question. We then take advantage of this tagged set, together with massive unlabelled material, to exploit two state-of-the-art single-view semi-supervised approaches aimed at discriminating informational from non-informational how-to content.

Moreover, our proposed models leverage assorted linguistically-motivated features, such as sentiment analysis, dependency parsing and named entity recognition. Our outcomes show that attributes harvested from morphological and sentiment analysis prove to be effective under a semi-supervised framework. With low annotation costs, these linguistically-motivated semi-supervised models reached an accuracy of 84.25% and 74.41% for classifying questions and answers, respectively. In addition, we quantify the impact of automatically detecting informational/non-informational intents on the retrieval of best answers, i.e., an improvement of 4.12% in terms of precision at one.

Introduction

The most massively popular community Question Answering (cQA) services keep over 100 million resolved questions. As of December 2015, Yahoo! Answers1 had about one hundred million members, and its knowledge base maintained more than four hundred million questions prompted since its inception (i.e., almost 250 million asked in English). As a logical consequence, cQA repositories are now perceived as vast sources of reusable information conveyed in natural language and enriched with some metadata [33], [43]. For instance, search engines attempt to reuse the best answers embodied in these knowledge bases to produce snippets whenever they detect that a web user has submitted a question-like query. Needless to say, there are different types of cQA sites [9]; for example, some communities lack moderation while others are strongly regulated. In essence, many of these cQA platforms are conceived as the synergy of an information-seeking and a social network [32], because their members can ask any sort of simple, complex or detailed question, and they can additionally seek opinions and conduct polls. All of these posts are expected to receive several responses from multiple users of the community, reflecting different walks of life: word-of-mouth tips, solutions and opinions as well as answers.

Moreover, members yield social capital when they participate in the dynamics of these communities. For instance, they make comments, assess the quality of questions and answers, or express their sympathy towards them (e.g., thumbs-up/thumbs-down). By means of these social interactions, users of these sites disseminate information (e.g., facts and word-of-mouth tips), resulting in a first-rate, quickly growing repository of questions and answers evaluated by humans. Nonetheless, there is an inherent lag between when questions are published and when the first good responses are posted. Hence, cQA platforms aim to respond to the pressing need for automatically reusing past resolved questions, and as a by-product, they also lessen duplicate-asking.

By and large, Yahoo! Answers has been categorized as a social cQA, which facilitates voluntary discussion among peers for questioning and answering [15]. Such sites encourage and thrive on communities built around information exchange, introducing a social aspect to information seeking [36]. This type of cQA site has proven more attractive for responding to complex, especially how-to, questions (e.g., “How to remove pimples?” and “How to break up with my boyfriend?”) [17]. In effect, our exploratory analysis indicates that across manner questions the synergy between an information-seeking and a social network becomes more conspicuous. On the one hand, this sort of question is highly likely to be informational, but on the other hand, its answers show an almost even distribution across informational and non-informational intents. Therefore, when retrieving and revitalizing elements stored in the knowledge base, it is relevant to match the intent behind newly posted questions with the goal of the archived answers that will be presented to the asker. Put differently, informational manner questions, such as “How to make beer?” and “How to make easy home-made mayonnaise?”, require good informational answers, but in a similar way, non-informational procedural questions, including “How to recover from a broken heart?” and “How can I get beautiful hair?”, need good, but non-informational, answers. Textbook cases are questions aiming at attracting or finding answers bearing great social capital.

In this work, we seek to effectively separate the informational from the non-informational content across how-to questions and answers, and show how doing so helps improve the retrieval, and thus the reuse, of their past best answers. We base this separation on discriminant models constructed on top of linguistically motivated properties. In other words, our reasoning is that good answers can be fetched from both groups, and thus their archival value depends on matching the intent of the question. This premise is predicated upon the idea that there are linguistic traits characterizing both groups. To be more exact, this paper makes the following contributions with respect to prior work:

  • 1.

    Questions across cQA archives are composed of a title and a body. The former is utilized for: a) conveying the gist of the question; b) catching attention; or c) summarizing the whole question. While the latter is likely to go into the details of the question, current studies have focused their attention on models that discriminate the informational/non-informational nature of questions on the grounds of their titles only. This work furthers these studies by examining the contribution of question bodies, and it additionally aims at automatically discriminating informational/non-informational answers.

  • 2.

    Indeed, manually annotating full questions and answers demands considerable effort in time and human resources, especially if question bodies are taken into account. As a means of lowering this cost, our work tests two state-of-the-art semi-supervised approaches for classifying both questions and answers as informational or non-informational. Both methods are compared to their supervised counterparts in a scenario where little training material is available.

  • 3.

    In order to build classifiers that generalize effectively, we explored distinct models for questions and answers enriched with assorted linguistically-oriented properties. Contrary to previous works, we not only take into account surface features such as frequent words across question titles, but also leverage attributes distilled from natural language processing tools, including sentiment and morphological analysis as well as dependency parsing (see the sketch after this list).

  • 4.

    Lastly, we carry out several experiments that empirically prove and quantify how matching the intent of questions and answers, via these obtained semi-supervised models, aids in significantly improving the retrieval of best answers to how-to questions from cQA archives.
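
As a minimal sketch of the kind of linguistic processing listed above, the following Python snippet distils lemmas (morphological analysis), part-of-speech tags, dependency arcs, named entities and a lexicon-based sentiment score from a question. It uses spaCy purely as an illustrative stand-in (this excerpt does not name the toolkit actually employed), and the tiny sentiment lexicon is a hypothetical placeholder for a proper sentiment analyzer.

    # Assumption-laden sketch: spaCy stands in for whatever NLP toolkit the
    # paper actually used; SENTIMENT is a toy, hypothetical lexicon.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    SENTIMENT = {"good": 1, "easy": 1, "beautiful": 1, "broken": -1}

    def extract_features(text):
        doc = nlp(text)
        feats = {}
        for tok in doc:
            feats["lemma=" + tok.lemma_] = 1  # morphological analysis
            feats["pos=" + tok.pos_] = 1      # part-of-speech tag
            feats["dep=%s<-%s" % (tok.dep_, tok.head.lemma_)] = 1  # dependency arc
        for ent in doc.ents:                  # named entity recognition
            feats["ner=" + ent.label_] = 1
        # Crude lexicon-based sentiment score over lowercased tokens.
        feats["sentiment"] = sum(SENTIMENT.get(t.lower_, 0) for t in doc)
        return feats

    print(extract_features("How to make easy home-made mayonnaise?"))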

In a nutshell, this piece of research advances this task in the following aspects: a) it thoroughly investigates which kind of linguistic processing is more effective when combining few labelled training instances with large volumes of unlabelled material; b) it extends earlier studies by examining the contribution of question bodies; and c) it digs deeper into linguistically-motivated models to recognize informational/non-informational answers.

Using only a few training samples, our outcomes show that a semi-supervised variation of Support Vector Machines (SVM) enriched with attributes produced by linguistic processing, such as lemmatization and sentiment analysis, can considerably enhance the classification rate, and consequently, the retrieval of best answers. Our empirical results also indicate that question bodies, when available, aid in enhancing the performance. Overall, we show that the obtained models assist in improving the retrieval of best answers by 4.12%. The remainder of this paper is organized as follows: Section 2 outlines the related work, Section 3 dissects our semi-supervised models, and Section 4 describes our experiments and findings. Finally, Section 5 draws some conclusions and sketches future work.
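
For clarity, the precision-at-one (P@1) figure quoted above is the fraction of test questions whose top-ranked retrieved answer is the known best answer. A small self-contained sketch of the metric (variable names are illustrative, not from the paper):

    def precision_at_one(rankings, gold):
        # rankings: {question_id: [answer ids, best-ranked first]}
        # gold:     {question_id: id of the known best answer}
        hits = sum(1 for q, ranked in rankings.items()
                   if ranked and ranked[0] == gold[q])
        return hits / len(rankings)

    # Toy usage: the top-ranked answer is correct for one of two questions.
    print(precision_at_one({"q1": [7, 3], "q2": [5, 9]}, {"q1": 7, "q2": 9}))  # 0.5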

Section snippets

Related work

In essence, cQA platforms are defined as places where their members socially interact by posting questions frequently aimed at more open-ended responses conveyed by multiple fellows of the community, e.g., thoughts, opinions and recommendations [9], fostering distinct kinds of relationships between their members. As a matter of fact, [22] discovered that users of Baidu Zhidao2 are more likely to perceive their close contacts as more reliable sources of good answers, and that

Leveraging linguistic traits and semi-supervised learning for identifying informational/non-informational manner questions and answers

In this paper, we propose to capitalize on linguistically-oriented semi-supervised models for tackling the lack of annotated corpora head-on. In so doing, we profited from the publicly available large-scale corpus of manner questions and answers provided by the Yahoo! Webscope Program.3 This corpus contains question and answer texts together with a small amount of meta-data: the best answer, the top-level category and sub-categories
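
This excerpt does not spell out which semi-supervised SVM variant is employed, so the following is only a rough, assumption-laden illustration of the general recipe: a handful of labelled how-to questions plus an unlabelled pool, here combined through scikit-learn's self-training wrapper around an SVM. The example questions are taken from the paper itself; the bag-of-words features and the threshold are illustrative stand-ins for the linguistically-motivated attributes described elsewhere.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    # A few labelled how-to questions (1 = informational, 0 = non-informational).
    labelled = [
        ("How to make beer?", 1),
        ("How to make easy home-made mayonnaise?", 1),
        ("How to install a ceiling fan?", 1),
        ("How to recover from a broken heart?", 0),
        ("How can I get beautiful hair?", 0),
        ("How to break up with my boyfriend?", 0),
    ]
    unlabelled = ["How to remove pimples?", "How to brew coffee at home?"]

    texts = [t for t, _ in labelled] + unlabelled
    X = TfidfVectorizer().fit_transform(texts)
    # scikit-learn convention: unlabelled instances carry the label -1.
    y = np.array([lab for _, lab in labelled] + [-1] * len(unlabelled))

    # SVC must expose probabilities so the wrapper can absorb, per iteration,
    # only the unlabelled predictions it is most confident about.
    model = SelfTrainingClassifier(SVC(probability=True), threshold=0.8)
    model.fit(X, y)
    print(model.predict(X[len(labelled):]))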

Experiments

In our empirical settings, we split our corpus of annotated questions into 1752 testing and 48 training samples by randomly singling out two elements from each category. This way we attempt to avoid a bias towards a particular topic. In a similar manner, we divided answers into 10,106 testing and 48 training samples; here, we also chose two random answers per category. Accordingly, the remaining 140,827 questions and 809,459 answers contained in the Yahoo! Webscope How-to corpus were considered as
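
As a hypothetical reconstruction of the split just described, the helper below draws two random items per category to form the small training set (which yields the 48 training samples reported above) and leaves the remainder for testing; the item layout and names are illustrative, not the corpus schema.

    import random
    from collections import defaultdict

    def split_two_per_category(items, seed=0):
        # items: list of (category, text, label) triples; names are illustrative.
        rng = random.Random(seed)
        by_cat = defaultdict(list)
        for item in items:
            by_cat[item[0]].append(item)
        train, test = [], []
        for group in by_cat.values():
            rng.shuffle(group)
            train.extend(group[:2])  # two per category keeps topics balanced
            test.extend(group[2:])
        return train, test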

Conclusions and future work

In this paper, we study two state-of-the-art semi-supervised models equipped with assorted linguistically motivated attributes for separating the informational from the non-informational content across manner questions and answers. With little training material, we show that it is possible to accomplish significant classification rates, which can be useful for enhancing user experience by improving search across cQA archives. On the whole, our empirical outcomes point to the fact that

Acknowledgements

This work was partially supported by the project Fondecyt “Bridging the Gap between Askers and Answers in Community Question Answering Services” (11130094) funded by the Chilean Government.

References (44)

  • Y. Yao et al.

    Detecting high-quality posts in community question answering sites

    Inf. Sci.

    (2015)
  • V. Barash et al.

    Distinguishing knowledge vs. social capital in social media with roles and context

    ICWSM

    (2009)
  • D. Carmel et al.

    Improving term weighting for community question answering search using syntactic analysis

    Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management

    (2014)
  • L. Chen

    Understanding and Exploiting User Intent in Community Question Answering

    (2014)
  • L. Chen et al.

    Understanding user intent in community question answering

    Proceedings of the 21st International Conference Companion on World Wide Web

    (2012)
  • E. Choi, Motivations and expectations for asking questions within online Q&A,...
  • E. Choi et al.

    Developing a typology of online Q&A models and recommending the right model for each question type

    Proc. Am. Soc. Inf. Sci. Technol.

    (2012)
  • E. Choi et al.

    User motivations for asking questions in online Q&A services

    J. Assoc. Inf. Sci. Technol.

    (2016)
  • A.Y. Chua et al.

    Answers or no answers: studying question answerability in Stack Overflow

    J. Inf. Sci.

    (2015)
  • P. Devijver et al.

    Pattern Recognition: A Statistical Approach

    (1982)
  • A. Figueroa et al.

    Learning to rank effective paraphrases from query logs for community question answering

    AAAI 2013

    (2013)
  • R. Gazan

    Social Q&A

    J. Am. Soc. Inf. Sci. Technol.

    (2011)