Architectural knowledge discovery with latent semantic analysis: Constructing a reading guide for software product audits☆

doi:10.1016/j.jss.2007.12.815

Journal of Systems and Software

Volume 81, Issue 9, September 2008, Pages 1456-1469

Abstract

Architectural knowledge is reflected in various artifacts of a software product. In a software product audit this architectural knowledge needs to be uncovered and its effects assessed in order to evaluate the quality of the software product. A particular problem is to find and comprehend the architectural knowledge that resides in the software product documentation. In this article, we discuss how the use of a technique called Latent Semantic Analysis can guide auditors through the documentation to the architectural knowledge they need. We validate the use of Latent Semantic Analysis for discovering architectural knowledge by comparing the resulting vector-space model with the mental model of documentation that auditors possess.

Introduction

The architectural design of a software product and the architectural design decisions taken play a key role in software product audits. Architectural design decisions and their rationale provide, for instance, insight into the trade-offs that were considered, the forces that influenced the decisions, and the constraints that were in place. The architectural design that is the result of these decisions allows for comprehension of such matters as the structure of the software product, its interactions with external systems, and the enterprise environment in which the software product is to be deployed. Following a recent trend in software architecture research (e.g., Bosch, 2004; Jansen and Bosch, 2005; Kruchten et al., 2006; van der Ven et al., 2006) we refer to the collection of architectural design decisions and the resulting architectural design as ‘architectural knowledge’.

For a given software product there is no single source that contains or provides all relevant architectural knowledge. Instead, architectural knowledge is reflected in various artifacts such as source code, data models, and documentation. A complicating factor in distilling relevant architectural knowledge from software product documentation is the fact that there are often many different documents. Each of these documents is tailored to specific stakeholders and different documents can therefore reflect architectural knowledge at different levels of abstraction. A high-level project management summary, for instance, will reflect architectural design decisions and their effects differently than a document describing detailed technical design.

The ISO/IEC 14598-1 international standard (ISO/IEC, 1999) defines a software product as ‘the set of computer programs, procedures, and possibly associated documentation and data’. Quality is defined as ‘the totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs’, while quality evaluation is ‘a systematic examination of the extent to which an entity is capable of fulfilling specified requirements’. Consequently, when we refer in this article to a software product quality audit – i.e., an audit in which the quality of a software product is evaluated – we refer to ‘the systematic examination of the extent to which a set of computer programs, procedures, and possibly associated documentation and data are capable of fulfilling specified requirements’.

We have conducted a study at a company that has broad experience in performing software product audits. This company conducts independent quality audits of software products. Its customers range from large private companies to governmental institutions. In this study we have investigated the use of architectural knowledge in software product audits. To this end we observed an audit that was being conducted for one of the company’s customers. We attended and observed the audit team meetings and had discussions with the audit team members on their use of architectural knowledge in the audit. In addition, we held more general interviews on this topic with five employees who had been involved in various audits, two of whom were also directly involved in the observed audit. The interviewed employees possess different levels of experience and have different focal points when conducting an audit. The problem of finding relevant architectural knowledge sketched above corresponds to a problem that is perceived by all auditors as being difficult to deal with. In short, the auditors need a reading guide that guides them through the documentation.

In this article we outline the problem of discovering architectural knowledge in software product documentation and present a technique that can be used to alleviate this problem. This technique, Latent Semantic Analysis, uses a mathematical technique called Singular Value Decomposition to discover the semantic structure underlying a set of documents. We employ this latent semantic structure to guide the auditors through the documentation to the architectural knowledge needed. A comparison of the discovered semantic structure with the ideas auditors have of software product documentation shows that Latent Semantic Analysis produces a good approximation of the auditors’ mental models.
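As a brief sketch of the decomposition named above, written in notation that is standard in the LSA literature rather than taken from this article (the symbols $A$, $U$, $\Sigma$, $V$, $r$, and $k$ are assumptions introduced here): a term-by-document matrix $A$ of rank $r$ can be factored as

$$A = U \Sigma V^{T}, \qquad \Sigma = \mathrm{diag}(\sigma_{1}, \dots, \sigma_{r}), \qquad \sigma_{1} \ge \sigma_{2} \ge \dots \ge \sigma_{r} > 0$$

Keeping only the $k < r$ largest singular values yields the reduced model

$$A_{k} = U_{k} \Sigma_{k} V_{k}^{T}$$

where $U_{k}$ and $V_{k}$ hold the first $k$ columns of $U$ and $V$, and $\Sigma_{k}$ is the leading $k \times k$ block of $\Sigma$. $A_{k}$ is the best rank-$k$ approximation of $A$ in the least-squares sense; the latent semantic structure is read from this reduced space.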

The remainder of this article is organized as follows. The next section discusses the use of architectural knowledge in software product audits based on our observations in the case study we conducted. Section 3 presents Latent Semantic Analysis (LSA) and its mathematical background. Section 4 discusses the application of LSA to a set of documents that contain software product documentation and shows how we can employ the semantic structure uncovered by LSA to guide the auditor to relevant architectural knowledge. In Section 5 we validate the LSA results through a comparison with auditors’ mental models of software product documentation. Section 6 contains a discussion on related work regarding the application of LSA to similar problems as well as related work in the area of research into architectural knowledge. Section 7 outlines research areas that are still open for further study. In Section 8 we sketch the use of Architectural Knowledge Discovery in a broader scope, and Section 9 contains concluding remarks on this article.

Section snippets

Architectural knowledge in a software product audit

In a software product audit, two types of architectural knowledge can be distinguished. On the one hand there is architectural knowledge pertaining to the current state of the software product; this knowledge reflects the architectural decisions made. On the other hand there is architectural knowledge pertaining to the desired state of the software product; this knowledge reflects the architectural decisions demanded (or expected). It is the auditor’s job to compare the current state with the

Latent semantic analysis

A method that can be used to capture the meaning of a collection of documents is the construction of a vector-space model. Vector-space models are based on the assumption that the meaning of a document can be derived from the terms that are used in that document. In a vector-space model, a document $d$ is represented as a vector of terms $d = (t_{1}, t_{2}, \dots, t_{n})$, with $t_{i}$ $(i = 1, 2, \dots, n)$ being the number of occurrences of term $i$ in document $d$ (Letsche and Berry, 1997).
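The term-count representation described above can be illustrated with a minimal Python sketch. The tokenizer, the function names, and the two-document toy corpus are hypothetical and serve only as illustration; they are not the preprocessing or the data used in the case study.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for text in documents for term in tokenize(text)})


def document_vector(text, vocabulary):
    """Represent one document as d = (t_1, ..., t_n), where t_i counts term i."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


# Hypothetical toy corpus, not taken from the article.
documents = [
    "The architecture describes the system structure",
    "The audit evaluates the system quality",
]

vocabulary = build_vocabulary(documents)
vectors = [document_vector(text, vocabulary) for text in documents]

for text, vector in zip(documents, vectors):
    print(vector, "<-", text)
```

Stacking these vectors as columns gives a term-by-document matrix of occurrence counts. Practical systems usually add steps such as stop-word removal and term weighting before any further analysis; those steps are omitted here.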

Fig. 1 depicts a matrix based on the

Constructing a reading guide: A case study

The LSA technique introduced in Section 3 forms the basis of a detailed case study in which we examine how the semantic structure discovered by LSA can be employed to guide the auditors through the documentation. This section presents the results of this case study.

Fig. 5 depicts the interactive process by which an auditor is guided through the documentation. Initially, auditors start with a set of unread documents. Although the content of these documents is still unknown, the auditors have a

Validation of the use of LSA

The previous section shows how the application of LSA delivers results that support auditors in finding a route through the documentation. The auditors indicate that the results show correspondence to their preferences for selecting and reading documents. In this section we empirically validate this correspondence.

The knowledge discovered by using LSA can only be regarded as valid if it fits the expectations of the auditor. In other words, the discovered semantic structure must conform to the

Related work

The application of Latent Semantic Analysis to architectural knowledge discovery discussed in this article bears some relation to other work, both within and outside of the software engineering research domain. The origin of LSA lies in information retrieval. LSA was presented in 1990 by Deerwester et al. as ‘a new method for automatic indexing and retrieval’ of documents (Deerwester et al., 1990). Later research also focused on the psycholinguistic significance of LSA. Landauer and Dumais, for

Future work

The work presented in this article gives rise to a number of issues that warrant further research. An overall issue that remains to be investigated is the scalability of our approach. LSA proved to be feasible for a corpus of 80 documents, but in practice software product documentation might comprise many more documents. Document sets of several hundred documents are not uncommon.

Furthermore, the selection of the right number of reduced dimensions is still difficult. In this area, a

Architectural knowledge discovery in a broader scope

This article considers Architectural Knowledge Discovery (AKD) as a means to construct a reading guide for software product audits. Although this application is undoubtedly valuable, we believe AKD has merit in a broader scope.

We envision AKD as one particular technique used in a broad range of architectural knowledge management tools and methods. The role of AKD would mainly be to refine existing (codified) architectural knowledge from such diverse sources as documents, email, meeting minutes,

Conclusion

Document inspection is a method used in software product audits to distill architectural knowledge from the software product documentation. Unfortunately, document inspection is often hard to perform. Auditors are in need of a reading guide that tells them where to start reading, how to progress reading, and which documents to consult for more detail on a particular topic.

We have demonstrated how auditors can be guided through the documentation in a case study in which we reconstructed the

Acknowledgement

This research has been partially sponsored by the Dutch Joint Academic and Commercial Quality Research & Development (Jacquard) program on Software Engineering Research via contract 638.001.406 GRIFFIN: a GRId For inFormatIoN about architectural knowledge. The authors would like to thank Eefje Cuppen for helpful discussions on the repertory grid technique

Remco de Boer is a PhD researcher in Software Engineering at the VU University, Amsterdam, The Netherlands. He obtained his MSc in business informatics from the Erasmus University Rotterdam. His research interests include software architecture, knowledge management, and knowledge technologies. Prior to joining the Vrije Universiteit, he worked as a software developer and later as a researcher in knowledge technologies. He has been involved in various Dutch and EU research and development

References (30)

T.A. Letsche et al. Large-Scale Information Retrieval with Latent Semantic Indexing. Information Sciences (1997)

G. Salton et al. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management (1988)

Ali Babar, M., de Boer, R.C., Dingsøyr, T., Farenhorst, R., 2007. Architectural Knowledge Management Strategies:...

Babu T., L., Seetha Ramaiah, M., Prabhakar, T., Rambabu, D., 2007. ArchVoc–Towards an Ontology for Software...

M.W. Berry et al. Matrices, Vector Spaces, and Information Retrieval. SIAM Review (1999)

Berry, M.W., Dumais, S.T., O’Brien, G.W., 1994. Using Linear Algebra for Intelligent Information Retrieval. Tech. Rep....

E. Bonnet et al. zt: A Software Tool for Simple and Partial Mantel Tests. Journal of Statistical Software (2002)

Booch, G., http://www.booch.com/architecture/. Handbook of Software...

Bosch, J., 2004. Software Architecture: The Next Step. In: Oquendo, F., Warboys, B., Morrison, R. (Eds.), Software...

de Boer, R.C., 2006. Architectural Knowledge Discovery: Why and How? In: First Workshop on SHAring and Reusing...

R.C. de Boer et al. Constructing a Reading Guide for Software Product Audits

S. Deerwester et al. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science (JASIS) (1990)

F. Fransella et al. A Manual for Repertory Grid Technique (1977)

G.H. Golub et al. Matrix Computations (1996)

J.H. Hayes et al. Improving After-the-Fact Tracing and Mapping: Supporting Software Quality Predictions. IEEE Software (2005)

Hans van Vliet is Professor in Software Engineering at the VU University, Amsterdam, The Netherlands. He received his PhD from the University of Amsterdam. His research interests include software architecture and empirical software engineering. Before joining the VU University, he worked as a researcher at the Centrum voor Wiskunde en Informatica (Amsterdam). He is the author of “Software Engineering: Principles and Practice”, published by Wiley (3rd Edition, 2008). He is the Editor in Chief of the Journal of Systems and Software.

^☆: This article has been based on earlier work by the authors, presented at the 6th Working IEEE/IFIP Conference on Software Architecture in January 2007 in Mumbai, India (de Boer and van Vliet, 2007).
