In this section, we discuss the obtained results, outline potential applications of our work in real scenarios, and acknowledge its limitations.
5.1 Obtained results
During our work, we first showed that it is possible to develop models that can identify expert, non-expert and out-of-scope comments, with the AUC score peaking at 0.93, accuracy at 0.83, MAE at 0.15 and the R2 score at 0.69. Based on these results and the respective best-selected features, we predicted the type of 100,226 comments. Next, we discussed the most representative features of expert and out-of-scope comments. Finally, we analysed 3,945 users, grouped them into experts, non-experts and out-of-scope users, and highlighted their common characteristics. We can conclude that it is feasible to detect not only expert comments but also experts who tend to provide helpful content in the Reddit community and who are active thread contributors. By contrast, we have a much larger sample of non-expert comments and users, which requires manual verification of the reasons why they did not fall into the expert group. Lastly, the characteristics we presented for out-of-scope comments are representative and can clearly distinguish them from the rest of the comments. We believe that by answering the initially stated RQs, we lay out future work directions for recognising not only data science expert comments but also spammers and malicious users, whose influence is enormous nowadays.
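As a toy illustration of how classification metrics of the kind reported above can be computed, the sketch below evaluates a hypothetical three-class labelling (expert / non-expert / out-of-scope) under an ordinal encoding; the encoding and the tiny prediction lists are invented for illustration and are not the study's data.

```python
# Hedged sketch: evaluating a 3-class comment classifier on toy data.
# The ordinal encoding below is an assumption made for illustration.
LABELS = {"out-of-scope": 0, "non-expert": 1, "expert": 2}

y_true = [2, 1, 0, 2, 1, 1, 0, 2]  # annotator labels (toy)
y_pred = [2, 1, 0, 1, 1, 1, 0, 2]  # model predictions (toy)

# Fraction of exactly matching predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# MAE here treats the ordinal class encoding as a distance, so a
# non-expert/expert confusion costs less than out-of-scope/expert.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy)  # 0.875
print(mae)       # 0.125
```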
We also would like to outline the contributions of our work, which build on the state-of-the-art expert identification methods presented in Table 2. Firstly, as far as we are aware, this is the first time that expert identification in such an active Q&A platform as Reddit has been done with manual labelling of comments by experts. We have not found any work applying a supervised learning model and user features to address the expert identification problem. We consider this a significant novelty, since user features provide an additional source for making better predictions. Finally, we did not find any characterisation of expert users in the literature; by providing one, we not only filled this gap but also facilitated several important applications, which are described next.
5.2 Application in real scenarios
Our final ML model, which can identify experts in the data science field, can have various applications, since several domains can benefit from such a study. First of all, we can follow the work of Yan et al. [43], who developed a framework for collecting validations of members' skill expertise in the LinkedIn professional social network. This work proved the importance of estimating users' skill expertise at a large scale, which, in turn, can serve as a basis for predicting who is, or can be, hired for a job requiring a particular skill. Reddit has also become an emerging resource for talent recognition in recent years. In this way, the results of our work can be used to bridge the gap between recruiters and candidates, so that recruiters can find relevant candidates who fulfil the job description.
Secondly, it would be helpful if the Reddit community could display every user's expertise next to their nickname. In this way, even controversial posts and comments would be evaluated beforehand, so readers can rely more on users who have already proved to be trustworthy. Moreover, it would be possible to rank the expert users of each subreddit.
Thirdly, we expect it to be easy to extrapolate our work, focused on data science authorities, to a different topic. We are of the opinion that all user and crowdsourced features, and many of the NLP features, can be re-used to identify experts in other fields. However, it would be useful to operate with topic-related dictionaries when computing the lexicon-count features.
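A minimal sketch of such a topic-related lexicon feature is shown below; the dictionary and the `lexicon_count` helper are hypothetical illustrations, not the feature set used in our pipeline.

```python
# Hypothetical topic lexicon; swapping this set is the main change
# needed to port the lexicon-count feature to a different field.
DATA_SCIENCE_LEXICON = {
    "regression", "overfitting", "gradient", "dataset",
    "cross-validation", "feature", "model",
}

def lexicon_count(comment: str, lexicon: set) -> int:
    """Count comment tokens that appear in the topic lexicon."""
    tokens = comment.lower().split()
    # Strip common trailing punctuation so "overfitting;" still matches.
    return sum(token.strip(".,!?;:") in lexicon for token in tokens)

print(lexicon_count("Your model is overfitting; try cross-validation.",
                    DATA_SCIENCE_LEXICON))  # → 3
```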
Moreover, the analysis of the common characteristics of expert comments and users can serve as a suggestion for teachers and professors of all levels to get new ideas about what skills are to be developed and practised for the successful career prospects of their students. In a typical Q&A community, every question has one or more tags indicating the required skills to answer this question. Correspondingly, these tags can be considered as skill areas that professors are interested in.
Another application is the recommendation system, which provides personalisation mechanisms by suggesting helpful answers to a questioner and interesting questions to a potential respondent. For instance, a user could be notified about new questions that are pertinent to his/her interests and expertise. This could minimise the percentage of questions that are poorly answered or not answered at all. Furthermore, we have seen that we can build user reputation schemes that capture a user's impact and significance in the system. Such schemes could help provide the right incentives for users to be fruitfully and meaningfully active, reducing noise and low-quality questions and answers. Searching and ranking answered questions based on reputation/quality or user interests would help users find answers more easily and avoid posting similar questions.
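As one possible shape for such a reputation scheme, the sketch below combines a model's predicted expert probability, average community feedback, and activity into a single ranking score; the weights, inputs, and users are illustrative assumptions, not a scheme evaluated in this study.

```python
import math

def reputation(expert_prob: float, avg_score: float, n_comments: int) -> float:
    """Toy reputation score; the weights below are arbitrary assumptions."""
    # Activity is log-damped so prolific posting alone cannot dominate
    # over predicted expertise and comment quality.
    activity = math.log1p(n_comments)
    return 0.6 * expert_prob + 0.3 * (avg_score / 10) + 0.1 * activity

# Rank two hypothetical users by the toy score.
ranked = sorted(
    [("alice", reputation(0.9, 8.0, 120)),
     ("bob", reputation(0.4, 3.0, 500))],
    key=lambda kv: kv[1], reverse=True,
)
print(ranked[0][0])  # "alice" ranks first under these toy weights
```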
Finally, in this modern era, the level of concern for security against malicious attacks has reached an unprecedented high [44]. Accordingly, by examining the prevalent traits in the out-of-scope group of comments, we can identify potentially malicious and unreliable users or social bots at an early stage and reduce their influence. As stated by Parra-Arnau et al. [45], even though social networks provide an easy and immediate way of communication, there exist significant privacy threats provoked by inexperienced or even irresponsible users recklessly publishing sensitive material. For example, Pastor-Galindo et al. [46] analysed the presence and behaviour of social bots on Twitter in the context of the Spanish general election. The authors classified users as social bots or humans, concluding that a non-negligible number of bots actively participated in the election. This, in turn, could influence the beliefs of social media users when deciding whom to vote for.
On the other hand, this study has several theoretical implications. Firstly, it contributes to the literature on social media analysis, online communities, and expertise identification by revealing the behaviours of different types of users in the data science domain and deepening our understanding of the dynamics and structure of such communities. Also, by introducing the concept of out-of-scope comments and classifying them alongside expert and non-expert comments, this research provides a novel perspective on expertise identification and expands the scope of the existing literature. In addition, this work can potentially inspire collaboration between fields such as data science, natural language processing, social network analysis, and human-computer interaction. It highlights the importance of interdisciplinary research in addressing complex problems and fostering innovation.
5.3 Limitations
However, our work has some limitations that we would like to acknowledge. First of all, the scope of our work was restricted to the identification of experts in only one subreddit. Accordingly, the obtained results are limited to the data science subreddit that we chose based on the selected metrics. There are several reasons for not expanding the dataset to include annotations from multiple subreddits. Firstly, manual annotation of comments by data science experts is a time-consuming and labour-intensive process. Also, subreddits differ in the quality of their discussions and the level of moderation; including multiple subreddits could introduce variability in the data, making it harder to identify clear patterns and features that distinguish experts from non-experts. To mitigate the impact of this bias on the framework's performance, several approaches can be considered for future work. These include cross-domain adaptation, i.e., fine-tuning the model on additional data from related subreddits so that it learns features more representative of the broader data science community, and augmentation with external sources such as other social media platforms or publication records.
Secondly, despite the fact that the two raters who classified the comments have experience and education in the data science field, and that the obtained Cohen's kappa agreement proved to be high, the possibility of human error cannot be excluded; it is a natural consequence of the discontinuity between human capabilities and system demands [47]. Moreover, there are threats to validity, because it is challenging to develop a perfect coding schema with no overlaps among the categories. In addition, we labelled 1113 comments out of 101,339 (1.1%). Thus, despite being confident that our coding schema produced reliable results, further studies are required to confirm and generalise the coding process. Lastly, while Reddit is the largest Q&A site, it would be useful to repeat the study with other portals. This would be valuable because such sites would, preferably, span users with more diverse backgrounds and interests than those of Reddit users.
Finally, while the manual coding approach described in this article is a valid method of creating a labelled dataset for supervised learning, there are alternative approaches and modifications that could potentially scale better or improve the efficiency of the process. One of them is active learning, in which the human annotators work iteratively with the ML model, refining the labelled dataset by focusing on the examples the model is most uncertain about. Also, instead of relying solely on a few experts, the labelling task could be distributed among a larger group of annotators, potentially including both domain experts and non-experts. Furthermore, applying transfer learning techniques [48] can reduce the amount of labelled data needed to achieve good performance. Transfer learning could be used to adapt the model to different topics or communities without the need for a completely new dataset or training procedure. Accordingly, by leveraging knowledge from related tasks or domains, this technique can help reduce the burden on human annotators and scale the manual coding process. However, more experiments are needed to evaluate performance with these modifications.
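A minimal sketch of the uncertainty-sampling step of active learning mentioned above: given per-comment class probabilities from the current model, select the least confident comments for human annotation. The probabilities and the `least_confident` helper are illustrative assumptions, not part of our implemented pipeline.

```python
def least_confident(probabilities, k):
    """Indices of the k samples whose top-class probability is lowest."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: max(probabilities[i]))
    return ranked[:k]

# Toy per-comment probabilities over (expert, non-expert, out-of-scope).
probs = [
    [0.95, 0.03, 0.02],  # confident: no annotation needed
    [0.40, 0.35, 0.25],  # very uncertain: send to annotators
    [0.55, 0.30, 0.15],  # borderline: send to annotators
    [0.90, 0.05, 0.05],  # confident
]
to_annotate = least_confident(probs, k=2)
print(to_annotate)  # [1, 2]
```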
Moreover, there are several annotation tools available that can assist human annotators in the labelling process. However, in this work, we did not use any particular tool for manual coding, because we wanted to maintain a high level of control over the annotation process and ensure that the annotators followed specific guidelines and instructions closely, which could be difficult to enforce or monitor when using an external tool. Additionally, we were concerned about the potential for bias or error introduced by a specific tool or its suggestions.