1 Introduction
2 The CROSSMINER project
2.1 CROSSMINER as a set of recommendation systems
The Data Preprocessing module contains tools that extract metadata from OSS repositories (see the middle part of Fig. 2). Data can be of different types, such as source code, configuration, or cross-project relationships. Natural language processing (NLP) tools are also deployed to analyze developer forums and discussions. The collected data is used to populate a knowledge base, which serves as the core for the mining functionalities. By capturing developers' activities (Capturing Context), an IDE is able to generate and display recommendations (Producing Recommendations and Presenting Recommendations). In particular, the developer context is used as a query sent to the knowledge base, which answers with recommendations relevant to that context (see the lower side of Fig. 2). Machine learning techniques are used to infer the knowledge underpinning the creation of relevant real-time recommendations.
The knowledge base allows developers to gain insights into raw data produced by different mining tools, which are the following:
- Source code miners to extract and store actionable knowledge from the source code of a collection of open-source projects;
- NLP miners to extract quality metrics related to the communication channels and bug tracking systems of OSS projects by using natural language processing and text mining techniques;
- Configuration miners to gather and analyze system configuration artefacts and data to provide an integrated DevOps-level view of a considered open-source project;
- Cross-project miners to infer cross-project relationships and additional knowledge underpinning the provision of real-time recommendations;
- CrossRec (Nguyen et al. 2019): a framework that makes use of Cross Projects Relationships among Open Source Software Repositories to build a library Recommendation System on top of CrossSim;
- MNBN (Di Sipio et al. 2020): an approach based on a Multinomial Naive Bayesian network technique to automatically recommend topics given the README file(s) of an input repository.
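The multinomial naive Bayes idea underlying MNBN can be illustrated with a minimal, self-contained sketch. This is not MNBN's actual implementation: the toy corpus, topic labels, and function names below are hypothetical, and only the mechanics of the technique are shown.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs):
    """docs: list of (tokens, topic) pairs, e.g., tokenized README files.
    Returns log-priors and Laplace-smoothed log-likelihoods per topic."""
    topic_counts = Counter(topic for _, topic in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, topic in docs:
        word_counts[topic].update(tokens)
        vocab.update(tokens)
    total = sum(topic_counts.values())
    log_prior = {t: math.log(c / total) for t, c in topic_counts.items()}
    log_like = {}
    for topic in topic_counts:
        # Laplace smoothing avoids zero probabilities for unseen words
        denom = sum(word_counts[topic].values()) + len(vocab)
        log_like[topic] = {w: math.log((word_counts[topic][w] + 1) / denom)
                           for w in vocab}
    return log_prior, log_like, vocab

def recommend_topic(tokens, log_prior, log_like, vocab):
    """Return the topic with the highest posterior score for a README's tokens."""
    score = lambda t: log_prior[t] + sum(log_like[t][w] for w in tokens if w in vocab)
    return max(log_prior, key=score)
```

In practice MNBN is trained on real README files and a curated set of GitHub topics; a toy corpus of a few labeled token lists is enough to exercise the classifier.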
| No. | Artifact | Description | Developed tool |
|---|---|---|---|
| 1 | Similar OSS projects | We crawl data from OSS platforms to find projects similar to the system being developed, with respect to different criteria, e.g., external dependencies or API usage (Nguyen et al. 2020). This type of recommendation is beneficial to development since it helps developers learn how similar projects are implemented. | CrossSim |
| 2 | Additional components | | CrossRec, FOCUS |
| 3 | Code snippets | Well-defined snippets showing how an API is used in practice are extremely useful. These snippets provide developers with a deeper insight into the usage of the APIs being included (Nguyen et al. 2019). | FOCUS |
| 4 | Relevant topics | GitHub uses tags as a means to narrow down the search scope. The goal is to help developers approach repositories, thus increasing the possibility of contributing to their development and widening their usage. | MNBN |
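As a rough illustration of dependency-based project similarity, one can compare the sets of external dependencies of two projects. This is a deliberately simplified stand-in for CrossSim, which actually relies on a graph-based similarity algorithm; the project and library names below are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two dependency sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def most_similar(target_deps, candidates):
    """candidates: mapping from project name to its set of external dependencies.
    Returns the candidate whose dependencies overlap most with the target's."""
    return max(candidates, key=lambda p: jaccard(target_deps, candidates[p]))
```

A similarity function of this kind is the building block on top of which library recommenders such as CrossRec rank candidate projects.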
All the extracted artifacts are stored in the Knowledge Base component. Moreover, it is important to remark that even though such tools can be used in an integrated manner directly from the Developer IDE, their combined usage is not mandatory. They are different services that developers can even use separately according to their needs.

- The recommendation systems have been developed in a real context to cope with the industrial needs of different use-case partners;
- According to the evaluation procedure performed towards the end of the CROSSMINER project, the industrial partners have been particularly satisfied with the developed recommendation systems, which were mainly graded as excellent by most of the partners asked to express their judgement on the scale insufficient, sufficient, good, excellent (see the public deliverable D8.164 for more details).
2.2 The CROSSMINER development process
- Requirement elicitation: identification of the expected features provided by the CROSSMINER platform in terms of recommendations and development support;
- Development: implementation of the needed recommendation systems to accommodate the requirements defined in the previous step;
- Evaluation: assessment of the performance of the produced recommendations by using properly defined evaluation procedures and selected metrics.
3 Challenges and lessons learned from eliciting the requirements of the CROSSMINER recommendation systems
3.1 Challenges
The method getScoresFromLivescore shown in Listing 1 should be designed so as to collect the football scores listed on the livescore.com home page. To this end, a Jsoup document is initialized with a connection to the site URL in the first line. By using the JSOUP facilities, the list of HTML elements of class sco is stored in the variable score in the second line. Finally, the third line updates the scores with all of the parents and ancestors of the selected score elements.
The purpose of the userAgent method is to prevent sites from blocking HTTP requests, and to predict the next jsoup invocation. Furthermore, some recommendations could be related to API function calls of a competitor library or extension. For this reason, the green and red boxes contain invocations of HTMLUnit, a direct competitor of jsoup that includes different browser user agent implementations, and of jsoupcrawler, a custom extension of jsoup. FOCUS has been conceptualized to suggest to developers recommendations consisting of a list of API method calls that should be used next. Furthermore, it also recommends real code snippets that can be used as a reference to support developers in finalizing the method definition under development. More code examples provided by FOCUS are available in an online appendix.
3.2 Lessons learned
- Requirement elicitation: The final user identifies use cases that are representative and that identify the functionalities that the wanted recommendation systems should implement. By considering such use cases, a list of requirements is produced;
- Requirement prioritization: The list of requirements produced in the previous step can be very long, because users tend to add all the wanted and ideal functionalities, even those that might be less crucial for them. For this reason, it can be useful to give a priority to each requirement in terms of the modalities shall, should, and may. Shall denotes essential requirements, which are of highest priority for validating the wanted recommendation systems. Should denotes a requirement that is not essential, even though it would make the wanted recommendation systems work better. May denotes requirements that would be interesting to satisfy and explore, even though they are irrelevant for validating the wanted technologies;
- Requirement analysis by R&D partners: The prioritized list of requirements is analyzed by the research and development partners with the aim of identifying the major components that need to be developed. Possible technological challenges that might compromise the satisfaction of some requirements are identified in this step and considered in the next step of the process;
- Requirement consolidation and final agreement: By considering the results of the analysis done by the R&D partners, the list of requirements is further refined and consolidated. After this step, use-case partners are assured that the highest-priority requirements will be implemented by the R&D partners.
4 Challenges and lessons learned from developing the CROSSMINER recommendation systems
4.1 Main design features
ASTParsing involves the analysis of structured data, typically the source code of a given software project. Several libraries and tools are available to properly perform operations on ASTs, e.g., fetching function calls, retrieving the employed variables, and analyzing the source code dependencies. Additionally, snippets of code can be analyzed using Fingerprints, i.e., a technique that maps every string to a unique sequence of bits. Such a strategy is useful to uniquely identify the input data and compute several operations on it, e.g., detecting code plagiarism as shown in Zheng et al. (2018).

Tensors can encode mutual relationships among data, typically users' preferences. Such a feature is commonly exploited by collaborative filtering approaches as well as by heavy computation on the input data to produce recommendations. Plain text is the most widespread type of unstructured data and it includes heterogeneous content, e.g., API documentation, repository descriptions, and Q&A posts, to mention a few. A real challenge is to extract valuable elements without losing any relevant information. Natural language processing (NLP) techniques are employed to perform this task by means of both syntactic and semantic analysis. Stemming, lemmatization, and tokenization are the main strategies successfully applied in existing recommendation systems. Even the MNBN approach previously presented employs such techniques as a preparatory task before the training phase. Similarly to tensors, GraphRepresentation is useful to model reciprocal associations among the considered elements. Furthermore, graph-based data encodings can be used to find peculiar patterns by considering the semantics of nodes and edges.

Capturing the developer's context relies on FeatureExtraction to concisely represent it. Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA) are just two of the techniques employed for such a purpose. Keyword extraction and APICallExtraction are two techniques mostly used when the Capturing Context phase has to analyze source code. Capturing context often involves searching over big software projects. Thus, a way to store and access a large amount of data is necessary to speed up the delivery of recommendation items. Indexing is a technique mainly used by code search engines to retrieve relevant elements in a short time.

Data Mining techniques are widely used to produce recommendations; some of them are based on pattern detection algorithms, i.e., Clustering, FrequentItemsetMining, and AssociationRuleMining. Clustering is usually applied to group objects according to some similarity function. The most common algorithm is K-means, which is based on minimizing the distance among the items. A most representative element, called the centroid, is calculated through a linkage function. After such a computation, the algorithm can represent a group of elements by referring to the most representative value. FrequentItemsetMining aims to group items with the same frequencies, whereas AssociationRuleMining uses a set of rules to discover possible semantic relationships among the analysed elements. Similarly, the EventStreamMining technique aims to find recurrent patterns in data streams. A stream is defined as a sequence of events, usually represented by a Markov chain. Through this model, the algorithm can exploit the probability of each event to establish relationships and predict a specific pattern. Finally, TextMining techniques often involve information retrieval concepts such as entropy, latent semantic analysis (LSA), or the extended boolean model. In the context of producing recommendations, such strategies can be used to find similar terms by exploiting different probabilistic models that analyze the correlation among textual documents.

Filtering strategies heavily exploit user data, e.g., the ratings assigned to purchased products. ContentBasedFiltering (CBF) employs historical data referring to items with positive ratings. It is based on the assumption that items with similar features have the same score. Enabling this kind of filtering requires the extraction of the item attributes as the initial step. Then, CBF compares the set of active items, namely the context, with possible similar items, using a similarity function to detect the ones closest to the user's needs. DemographicFiltering compares attributes coming from the users themselves instead of the purchased items. These two techniques can be combined in HybridFiltering techniques to achieve better results. CollaborativeFiltering (CF) approaches analyze the user's behaviour directly through their interaction with the system, i.e., the rating activity. UserBased CF relies on explicit feedback coming from the users, even though this approach suffers from scalability issues in case of extensive data. The ItemBased CF technique solves this issue by exploiting users' ratings to compute the item similarity. Finally, ContextAwareFiltering involves information coming from the environment, i.e., temperature, geolocalization, and time, to name a few. Though this kind of filtering goes beyond the software engineering domain, we list it to complete the landscape of filtering approaches.

The MemoryBased approach typically acts on user-item matrices to compute their distance, involving two different methodologies, i.e., SimilarityMeasure and AggregationApproach. The former involves the evaluation of the matrix similarity using various notions of similarity. For instance, the JaccardDistance measures the similarity of two sets of items based on their common elements, whereas the LevenshteinDistance is based on the edit distance between two strings. Similarly, the CosineSimilarity measures the cosine of the angle between two vectors. Besides the concept of similarity, techniques based on matrix factorization are employed to make the recommendation engine more scalable. Singular value decomposition (SVD) is a technique able to reduce the dimension of the matrix and summarize its features. Such a strategy is used to cope with a large amount of data, even though it is computationally expensive. AggregationApproaches analyze relevant statistical information of the dataset, such as the variance, the mean, and the least squares. To mitigate the bias introduced by noise in the data, the computation of such indexes uses adjusted weights as coefficients to rescale the results.

MemoryBased approaches require the direct usage of the input data, which may not be available under certain circumstances. Thus, ModelBased strategies can overcome this limit by generating a model from the data itself. MachineLearning offers several models that can support the recommendation activity. NeuralNetwork models can learn a set of features and recognize items after a training phase. By exploiting different layers of neurons, the input elements are labeled with different weights. Such values are recomputed during different training rounds, in which the model learns how to classify each element according to a predefined loss function. Depending on the number of layers, the internal structure of the network, and other parameters, it is possible to use different kinds of neural networks, including Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), Feed-forward Neural Networks (FNN), or Convolutional Neural Networks (CNN). Besides ML models, a recommendation system can employ several other models to suggest relevant items. GeneticAlgorithms are based on evolutionary principles that hold in the biology domain, i.e., natural species selection. FuzzyLogic relies on a logic model that extends the classical boolean operators with continuous variables. In this way, this model can represent real situations more accurately. Several probabilistic models can be used in a recommendation system. A BayesianNetwork is mostly employed to classify unlabeled data, although it is possible to employ it in recommendation activities as well.

Recommendation systems can also rely on Heuristics techniques to encode the know-how of domain experts. Heuristics employ different approaches and techniques together to obtain better results as well as to overcome the limitations of other techniques. On the one hand, heuristics are easy to implement, as they do not rely on a complex structure. On the other hand, they may produce results that are sub-optimal compared to more sophisticated techniques.

IDEIntegration offers several advantages, i.e., auto-complete shortcuts and dedicated views showing the recommended items. The integration is usually performed by developing a plug-in, as shown in existing recommendation systems (Lv et al. 2015; Ponzanelli et al. 2016). Nevertheless, developing such an artifact requires much effort, and the integration must take into account possible incompatibilities among all deployed components. A more flexible solution is represented by WebInterfaces, in which the recommendation system can be used as a stand-alone platform. Even though the setup phase is more accessible than with the IDE solution, presenting recommendations through a web service must handle some issues, including server connections and suitable response times. For presentation purposes, interactive data structures might be useful in navigating the recommended items. A TraversableGraph is just one successful example of this category. Strathcona (Holmes et al. 2005) makes use of this technique to show snippets of code rather than simply retrieving them as ranked lists. In this way, final users can figure out additional details about the recommended items.

4.2 Development challenges for CrossSim and CrossRec
4.3 Development challenges for FOCUS
pom.xml file. Because of such constraints, we ended up with a dataset consisting of 610 Java projects. Thus, we had to create a dataset ten times bigger than the one used for the evaluation.

4.4 Development challenges of MNBN
4.5 Lessons learned
5 Challenges and lessons learned from the evaluation of the CROSSMINER recommendation systems
- Which evaluation methodology is suitable? Assessing RSSE can be done in different ways. Conducting a user study has been accepted as the de facto method to analyze the outcome of a recommendation process by several studies (McMillan et al. 2012; Moreno et al. 2015; Ponzanelli et al. 2016; Zhang et al. 2017; Zhong et al. 2009). However, user studies are cumbersome, and they may take a long time to finish. Furthermore, the quality of a user study's outcome depends very much on the participants' expertise and willingness to participate. In this sense, setting up an automated evaluation, in which manual intervention is not required (or preferably limited), is greatly helpful.
- Which metric(s) can be used? Choosing suitable metrics accounts for an important part of the whole evaluation process. While accuracy metrics, such as success rate, precision, and recall, have been widely used to measure the prediction performance, we suppose that additional metrics should be incorporated into the evaluation (Ge et al. 2010; Nguyen et al. 2019), aiming to study RSSE better.
- How to prepare/identify datasets for the evaluation? One needs to take into account different parameters when it comes to choosing a dataset for evaluation. Moreover, the data used to evaluate a system depends very much on the underpinning algorithms. In this sense, advanced techniques and methods for curating suitable data are highly desirable.
- What could be a representative baseline for comparison? To show the features of a newly conceived tool and give evidence of its novelty and advantages, it is necessary to compare it with existing approaches with similar characteristics. Since the solution space is vast, comparing and evaluating candidate approaches can be a daunting task.
5.1 Challenges
- the developer is at an early stage of the development process, and the active method is almost empty;
- the developer is at an early stage of the development process, and the active method implementation is well defined;
- the developer is near the end of the development process, and the active method is almost empty;
- the project is in an advanced development phase, and the active method implementation is well defined.
- N is the cut-off value for the list of recommended items;
- for a testing project p, the ground-truth dataset is named GT(p);
- REC(p) is the top-N list of recommended items, ranked in descending order of real scores, with RECr(p) being the item in position r;
- if a recommended item i ∈ REC(p) for a testing project p is found in the ground truth of p, i.e., GT(p), we hereafter call this a match or hit.
- Which format? Depending on the employed recommendation techniques (e.g., collaborative filtering, CNNs, etc.), we had to identify the proper ways to encode the created datasets. For instance, to enable the application of the graph-based similarity algorithm underpinning CrossSim, we had to encode the different features of OSS projects in a graph-based representation. The same datasets needed to be represented in a TF-IDF format to enable the application of FOCUS;
- Which preprocessing should be applied to create the dataset? To minimize the size of the input datasets, and thus to make their manipulation efficient, we had to perform different data filtering tasks. For instance, in the case of CrossSim, to enable the application of the employed graph-similarity algorithm, we identified the features that are relevant for the task. For example, information about software developers, source code, and GitHub topics was filtered out from the available datasets, even though it was easy to encode all of them as elements in the input graphs. Similar data filtering phases were also performed in CrossRec to enable the recommendation of third-party libraries that might be added to the project under development. Indeed, such data filtering phases have to be performed without compromising the performance (in terms of accuracy, precision, recall, etc.) of the approach under evaluation;
- Which limitations should we tackle when collecting the dataset? The primary limitations we experienced when evaluating the CROSSMINER recommendation systems were related to the GitHub API restrictions. Unfortunately, the adoption of alternative sources like GHTorrent (Gousios 2013) was not enough due to the lack of needed artifacts such as source code. Knowing such limitations in advance, when collecting projects from GitHub, we decided to save as much data as possible for every single project. The goal was to enable the reuse of the collected data even for prospective evaluations to be done for future recommendation systems to be developed in the context of CROSSMINER.
| | | CrossSim (Nguyen et al. 2018) | CrossRec (Nguyen et al. 2019) | FOCUS (Nguyen et al. 2019) | MNBN (Di Sipio et al. 2020) |
|---|---|---|---|---|---|
| Methodology | Cross-Val. | | | | |
| | User study | | | | |
| Metric | Success rate | | | | |
| | Precision | | | | |
| | Recall | | | | |
| | nDCG | | | | |
| | TopRank | | | | |
| | Coverage | | | | |
| | Entropy | | | | |
| | Novelty | | | | |
| | Confidence | | | | |
| | Ranking | | | | |
| | Time | | | | |
| Dataset | Source | GitHub | GitHub | GitHub, Maven central repository | GitHub |
| | Size | 580 projects | 1,200 projects | 3,600 projects | 13,400 projects |
| | Artifact | Metadata | Metadata | Source code | README files |
| | | None | | | |
5.2 Lessons learned
6 Related work
- Change task: this type of recommendation system aims to support the developer in managing the evolution of the current programming task;
- API usage: this type of RSSE supports the usage of external third-party libraries;
- Refactoring task: recommendation systems that support refactoring activities fall into this category;
- Solving exceptions, failures, and bugs: this kind of RSSE handles the exceptions and unexpected behaviours of the considered software systems;
- Recommending software components and components' design: it recommends entire software components to be integrated into the software projects under development;
- Exploring local codebases and visited source locations: this type of system supports the information search over different data sources, i.e., online datasets.
- Hotspot recommender: it provides recommendations about methods and classes that belong to the current context;
- Navigation recommender: it suggests locations where the developer can find hints related to the current task;
- Snippet recommender: it produces snippets of code related to the developer's context;
- Documentation recommender: it aggregates posts coming from websites and Q&A forums to enhance the documentation of the APIs of interest.
- Heuristic approaches require implementation effort and usually derive from empirical evaluation;
- Data mining and machine learning techniques are adopted when a large amount of data is available, exploiting different algorithms and models;
- Collaborative filtering typically employs user-item matrices to filter data and find similar items.