Elsevier

Automation in Construction

Volume 42, June 2014, Pages 36-49

Automatic clustering of construction project documents based on textual similarity

https://doi.org/10.1016/j.autcon.2014.02.006

Highlights

  • Hybrid approach for clustering semantically-related project documents is proposed.

  • In clustering, dimensionality and similarity threshold are inversely related.

  • The tf–idf weighting method yields high precision and average recall.

  • Refining clustering outcome using supervised learning improves accuracy.

  • Textual similarities can be used to reveal semantic relations between documents.

Abstract

Text classifiers, as supervised learning methods, require a comprehensive training set that covers all classes in order to classify new instances. This limits their use for organizing construction project documents, since sufficient samples are not guaranteed to be available for every possible document category. To overcome the restriction imposed by this all-inclusive requirement, an unsupervised learning method was used to automatically cluster documents based on textual similarity. Repeated evaluations using different randomizations of the dataset revealed a region of threshold/dimensionality values that consistently yields high precision and average recall. Accordingly, a hybrid approach was proposed that first uses an unsupervised method to develop core clusters and then trains a text classifier on those core clusters to classify outlier documents in a subsequent refinement step. Evaluation of the hybrid approach demonstrated a significant improvement in recall, resulting in an overall increase in F-measure scores.
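The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the single-pass threshold clustering, the nearest-centroid rule standing in for the trained text classifier, the whitespace tokenizer, and the names `threshold_cluster`, `hybrid_cluster`, `threshold` and `min_core_size` are all assumptions made for illustration, and the dimensionality parameter studied in the paper is left out entirely.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {t: count * math.log(n / df[t]) for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def threshold_cluster(vectors, threshold):
    """Single-pass clustering (assumed stand-in for the unsupervised step).

    Each document joins the cluster whose centroid is most similar, provided
    that similarity reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of document indices
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(v, centroid([vectors[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def hybrid_cluster(docs, threshold=0.3, min_core_size=2):
    """Cluster, keep the large 'core' clusters, then reassign outliers.

    Outliers go to the core cluster with the nearest centroid, a simple
    classifier used here in place of whatever classifier the study trained.
    """
    vectors = tfidf_vectors(docs)
    clusters = threshold_cluster(vectors, threshold)
    core = [list(c) for c in clusters if len(c) >= min_core_size]
    outliers = [i for c in clusters if len(c) < min_core_size for i in c]
    if not core:
        return clusters  # nothing to train on; return the raw clustering
    centroids = [centroid([vectors[j] for j in c]) for c in core]
    for i in outliers:
        sims = [cosine(vectors[i], cen) for cen in centroids]
        core[sims.index(max(sims))].append(i)
    return core
```

Calling `hybrid_cluster` on a list of document strings returns lists of document indices, one list per core cluster. Raising `threshold` in this sketch tends to produce purer but more fragmented clusters, which leaves more outliers for the refinement step to absorb.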

Introduction

Automatic classification of documents as a supervised learning method requires a set of class labels and samples of each class in order to conduct the learning process before being able to perform predictions for new document instances. Usually, the classification procedure assumes that the classes are all inclusive (that they make a complete set of all the possible outcomes for any new instance) and that they are mutually exclusive (any new instance can belong to one and only one class). Where classes are static and predefined, the use of text classifiers for automatically organizing documents is appropriate. Documents are traditionally organized in construction projects according to fixed, abstract categories based on document metadata [1]. Examples of studies investigating the use of automatic text classification of construction documents include identifying the corresponding project division for minutes of meeting items [2] and classifying product documents to their relevant division in a construction information classification system [3].

While traditional methods of organizing construction project documents are simple and easy to use, they are not very useful for information retrieval unless the information seeker has thorough knowledge of the document body [1]. Information on a researched knowledge topic is almost always distributed over multiple categories. Determining the relevance of a document to the topic therefore requires understanding document content, not just metadata, which is a time-consuming task that relies on human semantic capabilities. Also, the above-mentioned restrictions that constrain the use of classifiers do not apply to unsupervised methods: unsupervised methods neither require prior identification of all possible classes nor are trained from sample data. The objective of this study is to evaluate the performance of an unsupervised learning text analysis technique in organizing project documents into groups of semantically similar documents, each group defined by its relation to a specific searchable knowledge topic. It is hypothesized that textual similarity between project documents accurately reflects semantic relationships between the documents and, when applied in document management and information retrieval tasks, can achieve results comparable to what humans recognize using their semantic capabilities. In the next section, the text analysis technique used in the study is presented along with several of its applications in previous works. Then the methodology implemented for the evaluation is presented, followed by a detailed analysis of the results. The study concludes with a summary of the main results and a discussion of the practical uses and limitations of the proposed technique.

Section snippets

Clustering

Research on clustering methods for information retrieval dates back to the second half of the twentieth century. The main objective of clustering is to provide structure to a large dataset by organizing similar data together, thus facilitating search and retrieval tasks. Clustering methods can be categorized according to the structure they generate into flat clustering and hierarchical clustering [4]. With flat (non-hierarchical) clustering, the dataset is divided into a number of subsets of

Methodology

Since the objective is to organize construction project documents into semantically related groups, a hierarchical clustering structure is not warranted, especially given the associated computational complexity of agglomerative clustering. For the current task, flat clustering is more suitable and economical. The use of K-means requires pre-defining the number of clusters (cardinality) before implementing the algorithm. It is up to the users to judge cardinality based on their knowledge of the

Results and analysis

A better understanding of the clustering performance is achieved by adopting a baseline to compare the results with. A baseline gives perspective to the results by representing the lower boundary below which results are considered meaningless and unacceptable. The probability of a random correct result is a common criterion used in classification evaluations for specifying a baseline. However, using the random approach for evaluating clustering performance will grossly underestimate the

Clustering using a hybrid approach

Fig. 7 displays two different clustering outcomes with an almost identical F-measure score, one for each weighting method. The general characteristics of fragmentation and impurity discussed in the previous section apply to both cases. If the small clusters – the group of outliers – in the tf–idf outcome are ignored, the remaining large clusters with minimal impurity can still make an acceptable representation of every true class in the dataset. For example, class A is represented by cluster

Summary and conclusion

When the project document corpus is complete and appropriately organized (e.g. for previously completed projects), the use of text classifiers for document retrieval is suitable. However, in many cases the document corpus is gradually and continuously developing (as in an ongoing project) and the classes required for training in a supervised learning method are not readily available. Particularly when classes are not predetermined and do not cover the whole spectrum

References (17)

  • C.H. Caldas et al.

    Automating hierarchical document classification for construction management information systems

    Automation in Construction

    (2003)
  • M. Al Qady et al.

    Document management in construction—practices and opinions

    Journal of Construction Engineering and Management

    (2013)
  • C.H. Caldas et al.

    Automated classification of construction project documents

    Journal of Computing in Civil Engineering

    (2002)
  • W.B. Frakes et al.

    Information Retrieval: Data Structure and Algorithms

    (1992)
  • C.D. Manning et al.

    Introduction to Information Retrieval

    (2008)
  • S. Saitta et al.

    Improving system identification using clustering

    Journal of Computing in Civil Engineering

    (2008)
  • T. Cheng et al.

    Modeling tower crane operator visibility to minimize the risk of limited situational awareness

    Journal of Computing in Civil Engineering

    (2012)
  • H.S. Ng et al.

    Knowledge discovery in a facility condition assessment database using text clustering

    Journal of Infrastructure Systems

    (2006)
