Automatic clustering of construction project documents based on textual similarity

doi:10.1016/j.autcon.2014.02.006

Automation in Construction

Volume 42, June 2014, Pages 36-49

https://doi.org/10.1016/j.autcon.2014.02.006

Highlights

• Hybrid approach for clustering semantically-related project documents is proposed.
• For clustering, inverse relationship between dimensionality & similarity threshold.
• tf–idf weighting method results in high precision, average recall outcomes.
• Refining clustering outcome using supervised learning improves accuracy.
• Textual similarities can be used to reveal semantic relations between documents.

Abstract

Text classifiers, as supervised learning methods, require a comprehensive training set that covers all classes in order to classify new instances. This limits the use of text classifiers for organizing construction project documents since it is not guaranteed that sufficient samples are available for all possible document categories. To overcome the restriction imposed by the all-inclusive requirement, an unsupervised learning method was used to automatically cluster documents together based on textual similarities. Repeated evaluations using different randomizations of the dataset revealed a region of threshold/dimensionality values of consistently high precision values and average recall values. Accordingly, a hybrid approach was proposed which initially uses an unsupervised method to develop core clusters and then trains a text classifier on the core clusters to classify outlier documents in a consequent refinement step. Evaluation of the hybrid approach demonstrated a significant improvement in recall values, resulting in an overall increase in F-measure scores.
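
The two-stage idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the single-pass threshold clustering, the raw term-count vectors, the `min_core_size` cutoff that separates core clusters from outliers, and the nearest-centroid rule standing in for the text classifier. All function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    # Raw term counts; an assumed stand-in for the weighting methods in the study.
    return Counter(text.lower().split())


def cosine(u, v):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    # Summed vector; scale does not matter for cosine similarity.
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def threshold_cluster(vectors, threshold):
    # Unsupervised step (assumed algorithm): a document joins the most similar
    # existing cluster if the similarity reaches the threshold, else starts a new one.
    clusters = []  # each cluster is a list of document indices
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(v, centroid([vectors[j] for j in c]))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def hybrid_cluster(texts, threshold=0.5, min_core_size=2):
    vectors = [vectorize(t) for t in texts]
    clusters = threshold_cluster(vectors, threshold)
    core = [c for c in clusters if len(c) >= min_core_size]
    outliers = [i for c in clusters if len(c) < min_core_size for i in c]
    if not core:
        return clusters
    # Supervised refinement (assumed classifier): one centroid per core cluster,
    # each outlier is assigned to the class of the most similar centroid.
    centroids = [centroid([vectors[j] for j in c]) for c in core]
    refined = [list(c) for c in core]
    for i in outliers:
        sims = [cosine(vectors[i], cen) for cen in centroids]
        refined[sims.index(max(sims))].append(i)
    return refined


texts = [
    "concrete pour delayed on level two slab",
    "level two slab concrete pour rescheduled",
    "curtain wall anchor submittal rejected",
    "curtain wall anchor submittal resubmitted",
    "slab curing",
]
core_only = threshold_cluster([vectorize(t) for t in texts], 0.5)
result = hybrid_cluster(texts)
print(result)  # [[0, 1, 4], [2, 3]]
```

In this toy run the short document "slab curing" falls below the similarity threshold and is left as a singleton by the unsupervised step. The refinement step then attaches it to the slab-related core cluster, which is the kind of recall gain the abstract attributes to the hybrid approach.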

Introduction

Automatic classification of documents as a supervised learning method requires a set of class labels and samples of each class in order to conduct the learning process before being able to perform predictions for new document instances. Usually, the classification procedure assumes that the classes are all inclusive (that they make a complete set of all the possible outcomes for any new instance) and that they are mutually exclusive (any new instance can belong to one and only one class). Where classes are static and predefined, the use of text classifiers for automatically organizing documents is appropriate. Documents are traditionally organized in construction projects according to fixed, abstract categories based on document metadata [1]. Examples of studies investigating the use of automatic text classification of construction documents include identifying the corresponding project division for minutes of meeting items [2] and classifying product documents to their relevant division in a construction information classification system [3].

While traditional methods of organizing construction project documents are simple and easy to use, they are not very useful for information retrieval unless the information seeker has thorough knowledge of the document body [1]. Information regarding a researched knowledge topic is almost always distributed over multiple categories, so determining the relevance of a document to the researched topic requires understanding of document content, not just metadata. This is a time-consuming task that entails the application of human semantic capabilities. Also, the above-mentioned restrictions that constrain the use of classifiers do not apply to unsupervised methods: unsupervised methods do not require previous identification of all possible classes, nor are they trained from sample data. The objective of this study is to evaluate the performance of an unsupervised learning text analysis technique in organizing project documents into groups of semantically similar documents, each group defined by its relation to a specific searchable knowledge topic. It is hypothesized that textual similarity between project documents accurately reflects semantic relationships between the documents and, when applied in document management and information retrieval tasks, can achieve results comparable to what humans recognize using their semantic capabilities. In the next section, the text analysis technique used in the study is presented along with several of its applications in previous works. Then the methodology implemented for the evaluation is presented, followed by a detailed analysis of the results. The study is concluded with a summary of the main results and a discussion of the practical uses and limitations of implementing the proposed technique.
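
To make the notion of textual similarity concrete, the following is a minimal sketch of tf–idf weighting with cosine similarity, the weighting method named in the highlights. The whitespace tokenizer, the unsmoothed `log(N / df)` form of idf, and the sample document texts are assumptions chosen for illustration; the paper's exact preprocessing and weighting formula are not given in this excerpt.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Terms that occur in every document get weight log(1) = 0.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical project-document snippets, invented for this example.
docs = [
    "delay in concrete pour for level two slab",
    "concrete pour for level two slab rescheduled",
    "request for information on curtain wall anchors",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_related > sim_unrelated)
```

The first and third snippets share only the word "for", which appears in every document and so carries zero idf weight. Their similarity is therefore zero, while the two slab-related snippets score well above zero.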

Section snippets

Clustering

Research on clustering methods for information retrieval dates back to the second half of the twentieth century. The main objective of clustering is to provide structure to a large dataset by organizing similar data together, thus facilitating search and retrieval tasks. Clustering methods can be categorized according to the structure they generate into flat clustering and hierarchical clustering [4]. With flat (non-hierarchical) clustering, the dataset is divided into a number of subsets of

Methodology

Since the objective is to organize construction project documents into semantically related groups, a hierarchical clustering structure is not warranted, especially given the associated computational complexity of agglomerative clustering. For the current task, flat clustering is more suitable and economical. The use of K-means requires pre-defining the number of clusters (cardinality) before implementing the algorithm. It is up to the users to judge cardinality based on their knowledge of the

Results and analysis

A better understanding of the clustering performance is achieved by adopting a baseline against which to compare the results. A baseline gives perspective to the results by representing the lower boundary below which results are considered meaningless and unacceptable. The probability of a random correct result is a common criterion used in classification evaluations for specifying a baseline. However, using the random approach for evaluating clustering performance will grossly underestimate the

Clustering using a hybrid approach

Fig. 7 displays two different clustering outcomes with an almost identical F-measure score, one for each weighting method. The general characteristics of fragmentation and impurity discussed in the previous section apply to both cases. If the small clusters – the group of outliers – in the tf–idf outcome are ignored, the remaining large clusters with minimal impurity can still make an acceptable representation of every true class in the dataset. For example, class A is represented by cluster

Summary and conclusion

When the project document corpus is complete and appropriately organized (e.g. for previously completed projects), the use of text classifiers for document retrieval is suitable. However, in many cases the document corpus is gradually and continuously developing (as in an ongoing project) and the classes required for training in a supervised learning method are not readily available. Particularly when classes are not predetermined and do not cover the whole spectrum

References (17)

C.H. Caldas et al., Automating hierarchical document classification for construction management information systems, Automation in Construction (2003)
M. Al Qady et al., Document management in construction—practices and opinions, Journal of Construction Engineering and Management (2013)
C.H. Caldas et al., Automated classification of construction project documents, Journal of Computing in Civil Engineering (2002)
W.B. Frakes et al., Information Retrieval: Data Structure and Algorithms (1992)
C.D. Manning et al., Introduction to Information Retrieval (2008)
S. Saitta et al., Improving system identification using clustering, Journal of Computing in Civil Engineering (2008)
T. Cheng et al., Modeling tower crane operator visibility to minimize the risk of limited situational awareness, Journal of Computing in Civil Engineering (Dec. 14 2012)
H.S. Ng et al., Knowledge discovery in a facility condition assessment database using text clustering, Journal of Infrastructure Systems (2006)

There are more references available in the full text version of this article.


¹: Tel.: + 1 765 494 2246.
