
About this book

This book covers the state of the art in learning algorithms, including semi-supervised methods, to provide a broad range of clustering and classification solutions for big data applications. Case studies and best practices are included along with theoretical models of learning, making the book a comprehensive reference to the field. The book is organized into eight chapters that cover the following topics: discretization, feature extraction and selection, classification, clustering, topic modeling, graph analysis, and applications. Practitioners and graduate students can use the volume as a reference for their current and future research, and faculty will find it useful for assignments that present current approaches to unsupervised and semi-supervised learning in graduate-level seminar courses. The book is based on selected, expanded papers from the Fourth International Conference on Soft Computing in Data Science (2018).

Includes new advances in clustering and classification using semi-supervised and unsupervised learning.
Addresses new challenges arising in feature extraction and selection using semi-supervised and unsupervised learning.
Features applications from healthcare, engineering, and text/social media mining that exploit techniques from semi-supervised and unsupervised learning.

Table of Contents




Chapter 1. A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

Machine learning is growing as fast as concepts such as big data and the field of data science in general. The purpose of this systematic review was to analyze scholarly articles published between 2015 and 2018 that address or implement supervised and unsupervised machine learning techniques in different problem-solving paradigms. Using the elements of PRISMA, the review process identified 84 scholarly articles that had been published in different journals. Of the 84 articles, 6 were published before 2015 even though their metadata indicated a 2015 publication date. The presence of these six articles in the final set was attributed to indexing errors. Among the reviewed papers, decision tree, support vector machine, and Naïve Bayes algorithms appeared to be the most cited, discussed, and implemented supervised learners. Likewise, k-means, hierarchical clustering, and principal component analysis emerged as the most commonly used unsupervised learners. The review also revealed other commonly used algorithms, including ensembles and reinforcement learners; future systematic reviews can focus on them, given the developments that machine learning and data science are currently undergoing.
Mohamed Alloghani, Dhiya Al-Jumeily, Jamila Mustafina, Abir Hussain, Ahmed J. Aljaaf

Chapter 2. Overview of One-Pass and Discard-After-Learn Concepts for Classification and Clustering in Streaming Environment with Constraints

With the advancement of internet technology and sensor networks, a tremendous amount of data has been generated, beyond our imagination. These data contain valuable and possibly relevant information for various fields of application. Learning these data online with current neural learning techniques is not simple because of many technical constraints, including data overflow, uncontrollable learning epochs, arbitrary class drift, and dynamic imbalanced class ratio. Recently, we have attempted to tackle this neural learning problem in a non-stationary environment. In this article, we summarize the new concept of One-Pass-Learning-and-Discard and two recently introduced neural network structures, called Malleable Hyper-ellipsoid and Hyper-cylinder. These structures are designed to cope with supervised as well as unsupervised learning under the constraints of data overflow, polynomial time and space complexities of the learning process, arbitrary class drift, life of data, and dynamic imbalanced class ratio. Both structures are rotatable, transposable, and expandable according to the distribution and location of the data cluster.
Chidchanok Lursinsap
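
The one-pass, discard-after-learn idea summarized in the Chapter 2 abstract can be illustrated with a running-statistics update. The sketch below is not the chapter's Malleable Hyper-ellipsoid or Hyper-cylinder structure; it is a simplified scalar illustration that assumes Welford's update for the mean and variance, and the class name `OnePassStats` is hypothetical. Each sample updates the summary once and is never stored, so memory use stays constant however long the stream runs.

```python
class OnePassStats:
    """Running mean and variance of a stream (hypothetical illustration).

    Each sample is used once to update the summary and is then discarded,
    so the memory footprint does not grow with the stream length.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def learn(self, x):
        """Absorb one sample (Welford's update); the sample is not kept."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0


stats = OnePassStats()
for sample in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.learn(sample)  # the sample can be discarded after this call
```

A multivariate version would keep a mean vector and a covariance matrix in the same way; that extension is left out here to keep the sketch short.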

Chapter 3. Distributed Single-Source Shortest Path Algorithms with Two-Dimensional Graph Layout

Single-source shortest path (SSSP) is a well-known graph computation that has been studied for more than half a century. It is one of the most common graph analyses in many research areas such as networks, communication, transportation, and electronics. In this chapter, we propose scalable SSSP algorithms for distributed memory systems. Our algorithms are based on a ∆-stepping algorithm and use a two-dimensional (2D) graph layout as the underlying graph data structure to reduce communication overhead and improve load balancing. A detailed evaluation of the algorithms on various large-scale, real-world graphs is also included. Our experiments show that the algorithm with the 2D graph layout delivers up to three times the performance (in TEPS) and uses only one-fifth of the communication time of the algorithm with a one-dimensional layout.
Thap Panitanarak
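
For readers unfamiliar with ∆-stepping, the following is a minimal sequential sketch of the basic technique. It is not the distributed algorithm or the 2D graph layout proposed in Chapter 3. It assumes non-negative edge weights, a positive bucket width `delta`, and an adjacency-list dictionary in which every vertex appears as a key; the function name and the toy graph are hypothetical.

```python
import math


def delta_stepping(graph, source, delta):
    """Sequential delta-stepping SSSP (illustrative sketch).

    graph: dict mapping each vertex to a list of (neighbor, weight) pairs.
    Returns a dict of tentative distances, final when the loop ends.
    """
    dist = {v: math.inf for v in graph}
    buckets = {}  # bucket index -> set of vertices with dist in that range

    def relax(v, d):
        # Move v to the bucket matching its improved tentative distance.
        if d < dist[v]:
            if dist[v] != math.inf:
                buckets.get(int(dist[v] // delta), set()).discard(v)
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0.0)
    while True:
        nonempty = [i for i, members in buckets.items() if members]
        if not nonempty:
            break
        i = min(nonempty)
        settled = set()
        # Phase 1: relax light edges (weight <= delta) until bucket i drains.
        while buckets.get(i):
            frontier = buckets[i]
            buckets[i] = set()
            settled |= frontier
            for u in frontier:
                for v, w in graph[u]:
                    if w <= delta:
                        relax(v, dist[u] + w)
        # Phase 2: relax heavy edges once for every vertex settled above.
        for u in settled:
            for v, w in graph[u]:
                if w > delta:
                    relax(v, dist[u] + w)
    return dist


toy_graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
    "d": [],
}
distances = delta_stepping(toy_graph, "a", delta=2)
```

The bucket width trades off work against parallelism: a small `delta` behaves more like Dijkstra's algorithm, while a large one behaves more like Bellman-Ford.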

Chapter 4. Using Non-negative Tensor Decomposition for Unsupervised Textual Influence Modeling

Documents are seldom created in a vacuum. In all literature, there exists some influencing factor, whether in the form of cited documents, collaboration, or documents that the authors have read. This influence can be seen within their works and is present as a latent variable. This chapter demonstrates a novel method for quantifying these influences and representing them in a semantically understandable fashion. The model is constructed by representing documents as tensors, decomposing them into a set of factors, and then searching the corpus factors for similarity.
Robert E. Lowe, Michael W. Berry



Chapter 5. Survival Support Vector Machines: A Simulation Study and Its Health-Related Application

The support vector machine (SVM) has become a state-of-the-art classification method. Extensive development of the SVM has led to support vector regression (SVR) being employed in many fields. In health applications in particular, survival data analysis is one of the most popular methods. This paper describes the use of the survival least square support vector machine (SURLS-SVM) applied to cervical cancer (CC) data, with the Cox proportional hazards model (Cox PHM) as the benchmark. The Cox PHM has assumptions that, unfortunately, cannot always be met in real cases. The SURLS-SVM overcomes this drawback. However, unlike the Cox PHM, the SURLS-SVM cannot indicate which predictors are significant. To address this issue, feature selection using backward elimination is employed, based on the increment in the concordance index. Moreover, a simulation study was conducted to assess the effect of the number of features, sample size, and censoring percentage on the performance of the SURLS-SVM.
Dedy Dwi Prastyo, Halwa Annisa Khoiri, Santi Wulan Purnami, Suhartono, Soo-Fen Fam, Novri Suhermi
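
The concordance index mentioned in the Chapter 5 abstract can be sketched as follows. This is a plain pairwise implementation in the style of Harrell's C-index, not the chapter's code. It assumes that a higher score means higher risk (shorter expected survival) and that ties in the score count as half-concordant; the function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs ordered correctly by the risk score.

    times: observed follow-up times.
    events: 1 if the event was observed, 0 if the record is censored.
    risk_scores: assumed convention is that higher means shorter survival.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only when the earlier time is an
            # observed event; a censored earlier time tells us nothing.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third record is censored at time 6.
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.7, 0.8, 0.1],
)
```

In a backward-elimination loop, one would recompute this value after dropping each candidate feature and keep the removal that improves it most; that loop is omitted here.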

Chapter 6. Semantic Unsupervised Learning for Word Sense Disambiguation

The identification of the particular meaning of a word based on the context of its usage is commonly referred to as Word Sense Disambiguation (WSD). Although considered a complex task, WSD is an important component of language processing and information analysis systems in several fields. Current methods for WSD rely on human input and are usually limited to a finite set of words. Complicating matters further, language is dynamic: current word usage changes and new words are introduced. Static definitions created by previous analyses become outdated or are inadequate to deal with current usage. Fully automated methods for WSD are needed both for sense discovery and for distinguishing the sense being used for a word in context. Latent Semantic Analysis (LSA) is a candidate automated unsupervised learning system that has not been widely applied in this area. In this chapter, advanced LSA techniques are deployed as an unsupervised learning approach to the WSD tasks of sense discovery and distinguishing senses in use.
Dian I. Martin, Michael W. Berry, John C. Martin

Chapter 7. Enhanced Tweet Hybrid Recommender System Using Unsupervised Topic Modeling and Matrix Factorization-Based Neural Network

Recommender systems (RS) were created to recommend interesting items to users. There are two main recommendation techniques: content-based filtering (CBF) and collaborative filtering (CF). CBF makes recommendations based on a user's past behavior, whereas CF makes recommendations using the past opinions of neighbors whose behavior is similar to that of the target user. Nowadays, a large amount of data is available on social networks, including Tweets on Twitter. Thus, many researchers have studied RS based on Tweets using latent Dirichlet allocation (LDA) to extract latent data from observed data. However, those researchers use only CBF or only CF with LDA, and CBF provides recommendations that are too specific, whereas CF suffers from sparsity and the cold-start problem. Therefore, this research proposes a new method of recommending Tweets based on a hybrid RS with LDA (unsupervised topic modeling) and generalized matrix factorization (a supervised learning-based neural network). In the experimental results, the proposed method outperforms on mean absolute error and coverage.
Arisara Pornwattanavichai, Prawpan Brahmasakha na sakolnagara, Pongsakorn Jirachanchaisiri, Janekhwan Kitsupapaisan, Saranya Maneeroj
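
As a rough illustration of the collaborative side of such a system, the sketch below trains a plain matrix factorization with stochastic gradient descent and scores it with mean absolute error. It is a simplified stand-in and not the generalized matrix factorization network or the LDA component described in Chapter 7. The function names, the hyperparameters, and the toy ratings are all hypothetical.

```python
import random


def train_mf(ratings, n_users, n_items, k=2, lr=0.02, reg=0.01,
             epochs=500, seed=0):
    """Fit user and item factor vectors to (user, item, rating) triples."""
    rng = random.Random(seed)
    p = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    q = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(p, q, u, i)
            for f in range(k):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on squared error with L2 regularization.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


def predict(p, q, u, i):
    """Predicted rating: dot product of user and item factors."""
    return sum(a * b for a, b in zip(p[u], q[i]))


def mean_absolute_error(ratings, p, q):
    """Average absolute gap between observed and predicted ratings."""
    return sum(abs(r - predict(p, q, u, i)) for u, i, r in ratings) / len(ratings)


# Toy data: (user index, item index, rating).
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
p0, q0 = train_mf(toy_ratings, n_users=3, n_items=3, epochs=0)  # untrained
p1, q1 = train_mf(toy_ratings, n_users=3, n_items=3)            # trained
mae_before = mean_absolute_error(toy_ratings, p0, q0)
mae_after = mean_absolute_error(toy_ratings, p1, q1)
```

In practice the error would be measured on held-out ratings rather than on the training triples used here.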

Chapter 8. New Applications of a Supervised Computational Intelligence (CI) Approach: Case Study in Civil Engineering

In this study, artificial neural network (ANN) techniques are used to predict the parameters of the nonlinear hyperbolic soil stress–strain relationship (k and R_f). Two ANN models are developed and trained with the aim of making the experimental test (the unconsolidated undrained triaxial test) unnecessary. The first model predicts the logarithm of the modulus number (log k), and the second predicts the failure ratio (R_f). A database of laboratory measurements comprises a total of 83 case records for the modulus number (k) and the failure ratio (R_f). Four parameters are considered to have the most significant impact on the nonlinear soil stress–strain relationship parameters and are used as the independent input variables (IIVs) of the proposed ANN models: plasticity index (PI), dry unit weight (γ_dry), water content (ω_o), and confining stress (σ_3). The model outputs are log k and R_f, respectively. Multilayer perceptrons trained with the back-propagation algorithm are used in this work. The effect of a number of issues related to ANN construction, such as ANN geometry and internal parameters, on the performance of the ANN models is investigated. Information on the relative importance of the factors affecting log k and R_f is presented, and practical equations for their prediction are proposed.
Ameer A. Jebur, Dhiya Al-Jumeily, Khalid R. Aljanabi, Rafid M. Al Khaddar, William Atherton, Zeinab I. Alattar, Adel H. Majeed, Jamila Mustafina
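
To show how the two predicted parameters are typically used, the sketch below evaluates a hyperbolic stress–strain curve in the commonly cited Duncan–Chang form. The abstract does not state the exact formulation, so this form is an assumption. The exponent `n`, the atmospheric pressure `p_a`, the failure deviator stress, and all numeric values are hypothetical inputs chosen for illustration only.

```python
def hyperbolic_deviator_stress(strain, *, k, n, r_f, sigma_3,
                               failure_stress, p_a=101.325):
    """Deviator stress (sigma_1 - sigma_3) at a given axial strain.

    Assumed Duncan-Chang form:
        E_i      = k * p_a * (sigma_3 / p_a) ** n   (initial tangent modulus)
        ultimate = failure_stress / r_f             (hyperbola asymptote)
        stress   = strain / (1 / E_i + strain / ultimate)
    Stresses and p_a share one unit (kPa here); strain is a fraction.
    """
    initial_modulus = k * p_a * (sigma_3 / p_a) ** n
    ultimate_stress = failure_stress / r_f
    return strain / (1.0 / initial_modulus + strain / ultimate_stress)


# Hypothetical parameter values, for illustration only.
params = dict(k=100.0, n=0.5, r_f=0.8, sigma_3=101.325, failure_stress=200.0)
curve = [hyperbolic_deviator_stress(e, **params) for e in (0.0, 0.01, 0.05, 0.10)]
```

Under this form, a failure ratio below 1 places the asymptote of the curve above the measured failure stress, which is why both k and R_f are needed to describe the curve.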

