2015 | Book

Soft Computing in Data Science

First International Conference, SCDS 2015, Putrajaya, Malaysia, September 2-3, 2015, Proceedings

Editors: Michael W. Berry, Azlinah Mohamed, Bee Wah Yap

Publisher: Springer Singapore

Book Series: Communications in Computer and Information Science

About this book

This book constitutes the refereed proceedings of the International Conference on Soft Computing in Data Science, SCDS 2015, held in Putrajaya, Malaysia, in September 2015.

The 25 revised full papers presented were carefully reviewed and selected from 69 submissions. The papers are organized in topical sections on data mining; fuzzy computing; evolutionary computing and optimization; pattern recognition; human machine interface; hybrid methods.

Table of Contents

Frontmatter

Data Mining

Frontmatter
An Improved Particle Swarm Optimization via Velocity-Based Reinitialization for Feature Selection
Abstract
The performance of a feature selection method is typically measured by the accuracy achieved and the number of features selected. The use of particle swarm optimization (PSO) as a feature selection method has been found to be more competitive than its optimization counterparts. However, the standard PSO algorithm suffers from premature convergence, a condition whereby PSO tends to get trapped in a local optimum, preventing it from converging to a better position. This paper attempts to improve the velocity-based reinitialization (VBR) method on the feature selection problem using a support vector machine classifier, following the wrapper strategy. Five benchmark datasets were used to implement the method. The results were analyzed based on classifier performance and the number of features selected. It was found that, on average, the accuracy of particle swarm optimization with the improved velocity-based reinitialization method is higher than that of the existing VBR method, and it generally selects fewer features.
Shuzlina Abdul-Rahman, Azuraliza Abu Bakar, Zeti-Azura Mohamed-Hussein
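
A minimal sketch of the wrapper approach this abstract describes: a binary PSO searches over feature masks, a support vector machine scores each mask, and particles whose velocities collapse are reinitialized. The stall criterion, all constants, and the benchmark dataset are illustrative assumptions; the paper's exact VBR rule is not reproduced here.

```python
# Hedged sketch: binary PSO feature selection with a velocity-based
# reinitialization step. Constants and the stall test are assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)           # stand-in benchmark dataset
n_feats, n_particles, n_iter = X.shape[1], 10, 15

def fitness(mask):
    # wrapper evaluation: cross-validated SVM accuracy on the selected features
    return cross_val_score(SVC(), X[:, mask], y, cv=3).mean() if mask.any() else 0.0

V = rng.uniform(-1, 1, (n_particles, n_feats))        # velocities
P = rng.random((n_particles, n_feats)) < 0.5          # positions = feature masks
pbest, pbest_fit = P.copy(), np.array([fitness(p) for p in P])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(V.shape), rng.random(V.shape)
    V = 0.7 * V + 1.5 * r1 * (pbest.astype(float) - P) \
                + 1.5 * r2 * (gbest.astype(float) - P)
    stalled = np.abs(V).mean(axis=1) < 0.05           # near-zero velocity = stagnation
    V[stalled] = rng.uniform(-1, 1, (stalled.sum(), n_feats))   # reinitialize
    P = rng.random(V.shape) < 1 / (1 + np.exp(-V))    # sigmoid transfer function
    fit = np.array([fitness(p) for p in P])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = P[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print(f"selected {gbest.sum()}/{n_feats} features, CV accuracy {pbest_fit.max():.3f}")
```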
Classifying Forum Questions Using PCA and Machine Learning for Improving Online CQA
Abstract
As one of the most popular e-Business models, community question answering (CQA) services increasingly gather large amounts of knowledge through the voluntary contributions of the online community across the globe. While most questions in CQA usually receive an answer posted by peer users, the number of unanswered or ignored questions has soared in the past few years. Understanding the factors that lead to questions being answered, as well as those that leave questions ignored, can help forum users improve the quality of their questions and increase their chances of getting answers from the forum. In this study, the feature extraction method Principal Component Analysis was used to extract the factors, or components, of the features. Data mining techniques were then used to identify the relevant features that help predict the quality of questions.
Simon Fong, Yan Zhuang, Kexing Liu, Shu Zhou
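
A hedged sketch of the pipeline outlined above: PCA extracts components from question features, and a classifier then predicts whether a question gets answered. The feature names and labels below are hypothetical stand-ins, not the paper's actual data.

```python
# Illustrative pipeline: PCA over question features, then a classifier.
# All feature names and the synthetic labels are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.integers(10, 500, n),    # question length in words (hypothetical)
    rng.integers(0, 6, n),       # number of tags (hypothetical)
    rng.integers(0, 1000, n),    # asker reputation (hypothetical)
    rng.random(n),               # readability score (hypothetical)
])
y = rng.integers(0, 2, n)        # 1 = answered, 0 = ignored (synthetic labels)

model = make_pipeline(StandardScaler(), PCA(n_components=3),
                      DecisionTreeClassifier(max_depth=5))
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```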
Data Projection Effects in Frequent Itemsets Mining
Abstract
Nowadays, a number of algorithms have been proposed for frequent itemset mining (FIM). Data projection is one of the key features in FIM that affects overall performance. Its aim is to speed up the search process by rearranging the items in a more compact form, fitting all items in the data set into main memory efficiently without losing any information. The data projection refers to how the data set is stored in main memory before the mining process begins. This paper explores the effects of data projection on frequent itemset mining for three different projection types: FP-Tree (tree-based), H-Struct (array-based) and FP-Graph (graph-based). Construction time and memory consumption are used to evaluate sparse and dense data sets. The results show that the construction of H-Struct is the fastest, but it takes more time to mine frequent itemsets than FP-Tree and FP-Graph.
Mohammad Arsyad Mohd Yakop, Sofianita Mutalib, Shuzlina Abdul-Rahman
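
As a point of reference for tree-based FIM, here is a small example using the FP-growth implementation in the mlxtend library; the paper's own FP-Tree, H-Struct and FP-Graph implementations are not publicly specified, so this is purely illustrative.

```python
# Reference example of tree-based frequent itemset mining (FP-growth)
# using mlxtend; the toy transactions are illustrative only.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer", "eggs"],
                ["milk", "diapers", "beer", "cola"],
                ["bread", "milk", "diapers", "beer"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```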
Data Quality Issues in Data Migration
Abstract
The main criterion of a successful data migration project is data quality. The quality of data can be compromised depending on how the data are received, integrated, maintained, processed and loaded. A data migration project requires the data to be extracted from multiple sources before being cleansed and transformed. Once the data are cleansed and transformed, they are loaded into a new system. Data cleansing is therefore the most important activity in a data migration project: it is the process of detecting and removing errors, inconsistencies and redundancies in order to improve the quality of the data.
Nurhidayah Muhamad Zahari, Wan Ya Wan Hussin, Mohd Yunus Mohd Yussof, Fauzi Mohd Saman
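
A minimal pandas sketch of the three error classes named in the abstract: errors (invalid values), inconsistencies (conflicting formats), and redundancies (duplicates). The column names and rules are hypothetical, not the paper's migration pipeline.

```python
# Hedged data-cleansing sketch; columns and validation rules are assumptions.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", " B@X.COM ", " B@X.COM ", "not-an-email"],
    "age": [34, 29, 29, -5],
})

df = df.drop_duplicates()                              # redundancies
df["email"] = df["email"].str.strip().str.lower()      # inconsistencies
df = df[df["email"].str.match(r"\S+@\S+\.\S+$")        # errors: bad email
        & df["age"].between(0, 120)]                   # errors: invalid age
print(df)
```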
Reviewing Classification Approaches in Sentiment Analysis
Abstract
The advancement of web technologies has changed the way people share and express their opinions. People enthusiastically share their thoughts and opinions via online media such as forums, blogs and social networks. The overwhelming volume of online opinionated data has gained much attention from researchers, especially in the fields of text mining and natural language processing (NLP), who study sentiment analysis in depth. There are several methods for classifying sentiment, including the lexicon-based approach and the machine learning approach. Each approach has its own advantages and disadvantages; however, few studies deliberate on the comparison of the two. This paper presents an overview of classification approaches in sentiment analysis. The advantages and limitations of the sentiment classification approaches are also discussed, based on several criteria such as domain, classification type and accuracy.
Nor Nadiah Yusof, Azlinah Mohamed, Shuzlina Abdul-Rahman
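
A toy contrast of the two approaches the review compares: a lexicon-based scorer versus a machine-learning classifier. The tiny lexicon, corpus and labels are illustrative assumptions only.

```python
# Lexicon-based vs. machine-learning sentiment classification, side by side.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}

def lexicon_sentiment(text):
    # sum word polarities from the lexicon; ties default to positive
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    return "positive" if score >= 0 else "negative"

train = ["great phone, love it", "poor battery, bad screen",
         "good value", "hate the camera"]
labels = ["positive", "negative", "positive", "negative"]

ml = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train, labels)
for review in ["love the screen", "bad service"]:
    print(review, "->", lexicon_sentiment(review), "/", ml.predict([review])[0])
```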
Comparisons of ADABOOST, KNN, SVM and Logistic Regression in Classification of Imbalanced Dataset
Abstract
Data mining classification techniques are affected by the presence of imbalance between the classes of a response variable. The difficulty of handling imbalanced data has led to an influx of methods that resolve the imbalance at either the data or the algorithmic level. The R programming language is one of many tools available for data mining. This paper compares several classification algorithms in R on an imbalanced medical data set. The classifiers ADABOOST, KNN, SVM-RBF and logistic regression were applied to the original, random oversampling and random undersampling data sets. Results show that ADABOOST, KNN and SVM-RBF exhibit over-fitting when applied to the original data set. No over-fitting occurs for the random oversampling data set, where SVM-RBF has the highest accuracy (training: 91.5%, testing: 90.6%), sensitivity (training: 91.0%, testing: 91.0%), specificity (training: 92.0%, testing: 90.2%) and precision (training: 91.9%, testing: 90.5%). For random undersampling, over-fitting is avoided only by ADABOOST and logistic regression. Logistic regression is the most stable classifier, exhibiting consistent training and testing results.
Hezlin Aryani Abd Rahman, Yap Bee Wah, Haibo He, Awang Bulgiba
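
A sketch of the experimental design described above: train the same classifier on the original, randomly oversampled, and randomly undersampled data, then compare training and testing accuracy to spot over-fitting. The paper works in R on a medical data set; this Python version with imbalanced-learn and a synthetic imbalanced set is a stand-in.

```python
# Compare resampling strategies for an imbalanced binary problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("original", None),
                      ("oversample", RandomOverSampler(random_state=0)),
                      ("undersample", RandomUnderSampler(random_state=0))]:
    Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = SVC(kernel="rbf").fit(Xr, yr)
    # a large train/test gap signals over-fitting
    print(f"{name:11s} train={clf.score(Xr, yr):.3f} test={clf.score(X_te, y_te):.3f}")
```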
Finding Significant Factors on World Ranking of e-Governments by Feature Selection Methods over KPIs
Abstract
Computing significant factors quantitatively is an imperative task in understanding the underlying reasons that contribute to a final outcome. In this paper, a case of e-Government ranking is studied by attempting to find the significance of each KPI that leads to the resultant rank of a country. Significant factors in this context are inferred as degrees of relation between the input variables (the KPIs in this case) and the final outcome (the rank). In the past, significant factors were acquired either as first-hand information via direct questioning in user-satisfaction surveys or through qualitative inference; a typical item is 'You are satisfied with a particular e-Government service', answered on a multi-level Likert scale (Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree). The replies are then counted and studied using traditional statistical methods. In this paper, an alternative method based on feature selection in data mining is proposed, which quantitatively computes the relative importance of each KPI with respect to the predicted class, the rank. The main advantage of the feature selection by data mining (FSDM) method is that it considers the cross-dependencies of the variables and how they contribute, as a whole predictive model, to a particular predicted outcome. In contrast, classical significant-factor analysis such as a correlogram tells only the strength of correlation between an individual factor and the outcome. Another advantage of the data mining method over simple statistics is that the inferred predictive model can serve as a predictor and/or what-if decision simulator: given some values of the KPIs, a corresponding rank can be estimated. A case study of computing the significant factors, in terms of KPIs, that lead to the world rank, using data from the UN e-Government Survey 2010, is presented.
Simon Fong, Yan Zhuang, Huilong Luo, Kexing Liu, Gia Kim
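
An illustrative FSDM-style computation: fit a predictive model on KPI columns and read off relative feature importances, which, unlike pairwise correlations, account for cross-dependencies. The KPI names and synthetic rank below are assumptions, not the UN survey data.

```python
# Model-based feature importance as a stand-in for the paper's FSDM step.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
kpis = ["online_service", "telecom_infra", "human_capital"]   # hypothetical KPIs
X = rng.random((150, len(kpis)))
rank = 100 * (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2])  # synthetic outcome

model = RandomForestRegressor(random_state=0).fit(X, rank)    # whole predictive model
for name, imp in sorted(zip(kpis, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:15s} importance={imp:.2f}")
```

The fitted model can also serve as the what-if simulator the abstract mentions: feeding it hypothetical KPI values yields an estimated rank.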

Fuzzy Computing

Frontmatter
Possibility Vague Soft Expert Set Theory and Its Application in Decision Making
Abstract
In this paper, we aim to extend the notion of classical soft expert sets to possibility vague soft expert sets by applying the theory of soft expert sets to possibility vague soft sets. The complement, union, intersection, AND and OR operations, as well as some related concepts pertaining to this notion, are defined. Algebraic properties such as De Morgan's laws and the relevant laws of possibility vague soft expert sets are studied and subsequently proved. Lastly, this concept is applied to a decision-making problem and its effectiveness is demonstrated using a hypothetical example.
Ganeshsree Selvachandran, Abdul Razak Salleh
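
For orientation, a hedged sketch of the underlying definitions, following the standard soft expert set literature; the paper's own notation and exact construction may differ.

```latex
% U: universe; E: parameters; X: experts; O = {1 = agree, 0 = disagree}.
\[
Z = E \times X \times O, \qquad A \subseteq Z .
\]
% A soft expert set is a pair (F, A) mapping each parameter-expert-opinion
% triple to a subset of U:
\[
F : A \to P(U).
\]
% A vague soft expert set replaces P(U) by V(U), the vague sets over U; a
% possibility variant additionally attaches a possibility degree mu(a) to
% each element of U:
\[
F_{\mu} : A \to V(U) \times I^{U}, \qquad
F_{\mu}(a) = \bigl( F(a)(u),\, \mu(a)(u) \bigr), \quad u \in U .
\]
```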
An Iterative Method for Solving Fuzzy Fractional Differential Equations
Abstract
The aim of this paper is to solve fuzzy fractional differential equations (FFDEs) of the Caputo type. The basic idea is to convert the FFDE to a type of fuzzy Volterra integral equation. The obtained Volterra integral equation is then exploited with suitable quadrature rules to obtain a fractional predictor-corrector method. The results show that the proposed method exhibits high precision at low cost.
Ali Ahmadian, Fudziah Ismail, Norazak Senu, Soheil Salahshour, Mohamed Suleiman, Sarkhosh Seddighi Chaharborj
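
The conversion step the abstract mentions is the standard one for Caputo-type problems; a sketch of it in LaTeX (written for the crisp case; the fuzzy setting applies it levelwise):

```latex
% A Caputo-type fractional initial value problem of order 0 < alpha <= 1,
\[
\bigl({}^{C}D^{\alpha} y\bigr)(t) = f\bigl(t, y(t)\bigr), \qquad y(0) = y_0 ,
\]
% is equivalent to the Volterra integral equation
\[
y(t) = y_0 + \frac{1}{\Gamma(\alpha)} \int_0^t (t-s)^{\alpha-1}
       f\bigl(s, y(s)\bigr)\, ds ,
\]
% which a predictor-corrector method discretizes with quadrature rules.
```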
Contrast Comparison of Flat Electroencephalography Image: Classical, Fuzzy, and Intuitionistic Fuzzy Set
Abstract
Image processing is used to enhance the visual appearance of images for further interpretation. One of its applications is in medical imaging. Generally, the pixel values of an image may not be precise, as uncertainty arises within the gray values due to several factors. In this paper, the image of Flat EEG (fEEG) is processed via classical, fuzzy, and intuitionistic fuzzy set (IFS) methods, and the input and output images are compared on the basis of contrast.
Suzelawati Zenian, Tahir Ahmad, Amidora Idris
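
The three representations compared above differ in how they model a gray value: a crisp threshold, a fuzzy membership degree, or an intuitionistic fuzzy triple. The standard IFS definition, sketched in LaTeX:

```latex
% An intuitionistic fuzzy set assigns each pixel x a membership mu_A(x)
% and a non-membership nu_A(x):
\[
A_{\mathrm{IFS}} = \bigl\{ \bigl(x, \mu_A(x), \nu_A(x)\bigr) : x \in X \bigr\},
\qquad 0 \le \mu_A(x) + \nu_A(x) \le 1 ,
\]
% with the hesitation degree
\[
\pi_A(x) = 1 - \mu_A(x) - \nu_A(x)
\]
% capturing the uncertainty in the gray values that the abstract refers to.
```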
An Autocatalytic Model of a Pressurized Water Reactor in a Nuclear Power Generation
Abstract
The control system of a nuclear reactor ensures the safe operation of a nuclear power plant. A Pressurized Water Reactor (PWR) is a complex system, since it contains uranium oxide. The aim of this paper is to model the operation of the primary system of a PWR in the form of a graphical representation. An autocatalytic set approach to the PWR is introduced and presented. Furthermore, the dynamic behaviour of the resulting model is presented and verified against published data.
Azmirul Ashaari, Tahir Ahmad, Mustaffa Shamsuddin, Wan Munirah Wan Mohammad

Evolutionary Computing/Optimization

Frontmatter
Selfish Gene Image Segmentation Algorithm
Abstract
This research proposes a selfish gene image segmentation algorithm as an alternative to the Genetic Algorithm. Genetic Algorithms, which originated from Darwin's theory of evolution, face the problem of finding the optimal solution due to their inherent characteristics of genetic drift and premature convergence. The selfish gene theory views genes as the basic units of evolution. The color image segmentation algorithm is therefore designed around a virtual population consisting of a collection of genes rather than chromosomes of fixed genes. The genes are positioned into predetermined loci, forming two chromosomes that make up the virtual population in each generation. The chromosomes are rewarded or penalized according to their performance. Evaluation against ground truth images shows that the selfish gene algorithm is able to detect variations of color very similarly to the way the human eye detects color.
Noor Elaiza Abd Khalid, Norharyati Md Ariff, Ahmad Firdaus Ahmad Fadzil, Noorhayati Mohamed Noor
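
A minimal sketch of the selfish-gene mechanic the abstract relies on: the virtual population is a table of allele probabilities per locus rather than explicit chromosomes; two individuals are sampled each generation, and the winner's alleles are rewarded while the loser's are penalized. The fitness function below is a stand-in, not the paper's color-segmentation objective, and the constants are assumptions.

```python
# Selfish gene loop over a probabilistic "virtual population".
import numpy as np

rng = np.random.default_rng(3)
n_loci, n_alleles, eps = 16, 8, 0.02
prob = np.full((n_loci, n_alleles), 1 / n_alleles)    # virtual population

def sample():
    # draw one chromosome: pick an allele per locus from its distribution
    return np.array([rng.choice(n_alleles, p=prob[i]) for i in range(n_loci)])

def fitness(genome):                                  # stand-in objective
    return np.sum(genome == 0)

for _ in range(2000):
    a, b = sample(), sample()
    winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
    for i in range(n_loci):
        prob[i, winner[i]] += eps                     # reward winner's allele
        prob[i, loser[i]] = max(prob[i, loser[i]] - eps, 1e-9)   # penalize
        prob[i] /= prob[i].sum()                      # renormalize the locus

best = prob.argmax(axis=1)
print("most probable genome:", best, "fitness:", fitness(best))
```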
Detecting IMSI-Catcher Using Soft Computing
Abstract
Lately, mobile communication has been degraded from a secure system providing adequate protection of users' confidentiality and privacy to a less trustworthy one, due to the revelation of IMSI catchers that enable mobile phone tapping. To fight these illegal infringements, many activities aim at detecting IMSI catchers. However, the existing solutions are so far only device-based and intended for users' self-protection. This paper presents an innovative network-based IMSI catcher detection solution that makes use of machine learning techniques. After giving a brief description of the IMSI catcher, the paper identifies the attributes of the IMSI catcher anomaly. The challenges that the proposed system has to surmount are also explained. Last but not least, the overall architecture of the proposed machine-learning-based IMSI catcher detection system is described thoroughly.
Thanh van Do, Hai Thanh Nguyen, Nikolov Momchil, Van Thuan Do
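
The paper does not disclose its feature set or model here; as a hedged illustration of the general idea of a network-based detector, the sketch below runs an unsupervised anomaly detector over hypothetical per-cell observations. Every feature name and value is an assumption.

```python
# Unsupervised anomaly detection over hypothetical cell-network features.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
# columns (all hypothetical): reselection rate, neighbor-list size,
# location-update burst count
normal = rng.normal([0.1, 8, 5], [0.05, 2, 2], size=(500, 3))
rogue = rng.normal([0.9, 0.5, 60], [0.05, 0.5, 5], size=(5, 3))  # catcher-like
X = np.vstack([normal, rogue])

det = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = det.predict(X)            # -1 marks an anomalous cell observation
print("flagged rows:", np.where(flags == -1)[0])
```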
Solving Curriculum Based Course Timetabling by Hybridizing Local Search Based Method within Harmony Search Algorithm
Abstract
Curriculum-based university course timetabling, which has been established as a non-deterministic polynomial (NP) problem, involves the allocation of timeslots and rooms to a set of courses subject to the hard and soft constraints listed by the university. To solve the problem, a set of hard constraints is first satisfied in order to obtain a feasible solution; the soft constraints are then satisfied as far as possible. In this paper we focus on satisfying the soft constraints using a hybridization of harmony search with the great deluge algorithm. Harmony search comprises two main operators: memory consideration and random consideration. The hybridization consists of three setups based on where the great deluge is applied within the harmony search: on the memory consideration operator, on the random consideration operator, or on both operators together. In addition, several harmony memory consideration rates were applied to these setups. The algorithms for all setups were tested on the curriculum-based datasets from the International Timetabling Competition, ITC2007. The results demonstrate that our approach produces comparable solutions (with lower penalties on several data instances) when compared to other techniques from the literature.
Juliana Wahid, Naimah Mohd Hussin
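
A structural sketch of the hybrid in one simplified arrangement: a harmony search loop with memory and random consideration, followed by a great-deluge acceptance test with a slowly dropping water level. This differs from the paper's operator-level application of great deluge, and the objective is a toy penalty function, not the ITC2007 evaluator.

```python
# Harmony search with great-deluge acceptance; all parameters illustrative.
import random

def penalty(sol):                       # toy stand-in for soft-constraint cost
    return sum((x - 0.3) ** 2 for x in sol)

dim, hms, hmcr, iters = 10, 8, 0.9, 5000
memory = [[random.random() for _ in range(dim)] for _ in range(hms)]
memory.sort(key=penalty)
level = penalty(memory[-1])             # great-deluge water level
decay = level / iters

for _ in range(iters):
    new = []
    for d in range(dim):
        if random.random() < hmcr:      # memory consideration
            new.append(random.choice(memory)[d])
        else:                           # random consideration
            new.append(random.random())
    # accept if under the dropping water level, or better than the worst harmony
    if penalty(new) <= level or penalty(new) <= penalty(memory[-1]):
        memory[-1] = new
        memory.sort(key=penalty)
    level -= decay

print("best penalty:", penalty(memory[0]))
```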
A Parallel Latent Semantic Indexing (LSI) Algorithm for Malay Hadith Translated Document Retrieval
Abstract
Latent Semantic Indexing (LSI) is a well-known searching technique that matches queries to documents in information retrieval applications. LSI has been proven to improve retrieval performance; however, as document collections grow larger, current implementations are not fast enough to compute results on a standard personal computer. In this paper, we propose a new parallel LSI algorithm for standard personal computers with multi-core processors to improve the performance of retrieving relevant documents. The proposed parallel LSI is designed to automatically run the matrix computations of the LSI algorithm as parallel threads on multi-core processors, and the fork-join technique is applied to execute the parallel programs. We used the Malay Translated Hadith of Shahih Bukhari, from Jilid 1 to Jilid 4, as the test collection, a total of 2028 text files. The processing time of the document pre-processing phase for the proposed parallel LSI was measured and compared against the sequential LSI algorithm. Our results show that the pre-processing time of the proposed parallel LSI system is lower than that of the sequential system; thus, the proposed parallel LSI algorithm improves searching time compared to the sequential LSI algorithm.
Nurazzah Abd Rahman, Zulaile Mabni, Nasiroh Omar, Haslizatul Fairuz Mohamed Hanum, Nik Nur Amirah Tuan Mohamad Rahim
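
A sketch of the two ideas above: fork-join parallelism for per-document pre-processing and SVD-based LSI for retrieval. It uses scikit-learn's TruncatedSVD for the LSI step; the documents are placeholders, not the hadith collection.

```python
# Fork-join pre-processing plus LSI retrieval via truncated SVD.
from concurrent.futures import ThreadPoolExecutor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["placeholder document one", "another placeholder document",
        "a third text about retrieval"]

def preprocess(text):                   # per-document work, run in parallel
    return text.lower().strip()

with ThreadPoolExecutor() as pool:      # fork-join: fan out, then gather
    clean = list(pool.map(preprocess, docs))

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(clean)
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = lsi.fit_transform(X)         # documents in the latent space

query = lsi.transform(tfidf.transform(["retrieval text"]))
print("similarities:", cosine_similarity(query, doc_vecs)[0])
```

Note that in CPython, threads only speed up I/O-bound pre-processing; CPU-bound work typically needs a process pool to realize the multi-core gains the paper targets.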
Short Term Traffic Forecasting Based on Hybrid of Firefly Algorithm and Least Squares Support Vector Machine
Abstract
The goal of active traffic management is to manage congestion based on current and predicted traffic conditions. This can be achieved by utilizing historical traffic data to forecast traffic flow, which in turn supports travellers in better journey planning. In this study, a new method that integrates the Firefly algorithm (FA) with the Least Squares Support Vector Machine (LSSVM), termed FA-LSSVM, is proposed for short-term traffic speed forecasting. In particular, the Firefly algorithm, which has an advantage in global search, is used to optimize the hyper-parameters of the LSSVM for efficient data training. Experimental results indicate that the proposed FA-LSSVM generates a lower error rate and higher accuracy than a non-optimized LSSVM, suggesting that FA-LSSVM is a competitive method in the area of time series forecasting.
Yuhanis Yusof, Farzana Kabir Ahmad, Siti Sakira Kamaruddin, Mohd Hasbullah Omar, Athraa Jasim Mohamed
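
A hedged sketch of FA-based hyper-parameter tuning. scikit-learn has no LSSVM, so an RBF-kernel SVR stands in; the two tuned values (C, gamma) play the role of LSSVM's regularization and kernel-width parameters, and the FA constants are illustrative.

```python
# Firefly algorithm tuning SVR hyper-parameters on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X, y = make_regression(n_samples=200, n_features=4, noise=5, random_state=0)

def brightness(p):                       # higher CV score = brighter firefly
    C, gamma = 10 ** p
    return cross_val_score(SVR(C=C, gamma=gamma), X, y, cv=3).mean()

n_flies, n_iter, beta0, gamma_fa, alpha = 8, 10, 1.0, 1.0, 0.1
pos = rng.uniform(-2, 2, (n_flies, 2))   # positions = (log10 C, log10 gamma)
light = np.array([brightness(p) for p in pos])

for _ in range(n_iter):
    for i in range(n_flies):
        for j in range(n_flies):
            if light[j] > light[i]:      # move firefly i toward brighter j
                r2 = np.sum((pos[i] - pos[j]) ** 2)
                beta = beta0 * np.exp(-gamma_fa * r2)   # attractiveness decays
                pos[i] += beta * (pos[j] - pos[i]) + alpha * rng.uniform(-.5, .5, 2)
                light[i] = brightness(pos[i])

best = pos[light.argmax()]
print(f"best C=10^{best[0]:.2f}, gamma=10^{best[1]:.2f}, CV={light.max():.3f}")
```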
Implementation of Dynamic Traffic Routing for Traffic Congestion: A Review
Abstract
Traffic congestion is a condition in which traffic demand exceeds traffic capacity. It is a global transportation problem that occurs around the world, especially in metropolitan cities. Dynamic traffic routing has been recognized as one of the methods capable of dispersing traffic congestion efficiently. This paper reviews recent implementations of dynamic traffic routing for traffic congestion problems. It studies how the dynamic, or online, concept has been implemented in traffic routing, focusing on the definition of dynamic routing, the traffic routing environment, traffic routing policy and routing strategy. Issues such as proactive routing and handling non-recurrent congestion are expounded, while limitations and suggestions for future research are highlighted. In conclusion, dynamic traffic routing is shown to be an important method for relieving traffic congestion, and more studies need to be conducted in search of better solutions.
Norulhidayah Isa, Azlinah Mohamed, Marina Yusoff

Pattern Recognition

Frontmatter
A Comparative Study of Video Coding Standard Performance via Local Area Network
Abstract
Intensive efforts have been undertaken in recent years to ensure compatibility among video codecs from distinct manufacturers and applications. A raw digital video is far too large to be stored in the memory of a storage device; in practice, video must be compressed to make sharing practical, while maintaining quality and avoiding errors during transmission. These issues are discussed in this paper, which compares video coding standards and discusses video transmission. As a working example, a sample full-color video with 320 x 240 pixels per frame, 24 frames per second and a total length of 265 minutes is used. Several video compression standards were applied to analyse throughput and round-trip time performance at different bit rates. The results show that video with a higher bit rate achieves higher throughput. This experiment could be extended by applying the latest video coding standards and analysing their performance.
Siti Eshah Che Osman, Hamidah Jantan, Mohamad Taib Miskon, Wan Ahmad Khusairi Wan Chek
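
Worked arithmetic for the sample video quoted in the abstract, which shows why compression is unavoidable. The 3 bytes per pixel (24-bit color, uncompressed) is an assumption; the abstract only says "full color".

```python
# Raw (uncompressed) size of the sample video described above.
width, height, fps, minutes, bytes_per_pixel = 320, 240, 24, 265, 3

frame = width * height * bytes_per_pixel          # 230,400 B per frame
per_second = frame * fps                          # ~5.5 MB/s raw
total = per_second * minutes * 60                 # whole clip, uncompressed
print(f"raw size = {total / 1e9:.1f} GB")         # ~87.9 GB, hence compression
```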
Partial Differential Equation (PDE) Based Image Smoothing System for Digital Radiographic Image
Abstract
Over the last few decades, partial differential equations (PDEs) have become one of the significant mathematical methods widely used in image processing. One of their common applications is image smoothing, an essential preliminary step that affects the results of all further processing. In this project, a system based on second-order and fourth-order PDE models is developed and applied to digital radiographic images that contain welding defects. The results obtained from these models show better image quality compared to conventional filters such as the median filter and the Gaussian filter. The system is beneficial in assisting radiographic inspectors to produce better evaluations and analyses of defects in welding images. In addition, non-destructive testing consultants from industry and academicians from universities can utilize this system for training and research purposes.
Suhaila Abd Halim, Arsmah Ibrahim, Yupiter HP Manurung
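
An illustrative second-order PDE smoother: Perona-Malik anisotropic diffusion, a common choice for this class of model. The abstract does not name its exact second- and fourth-order formulations, so this is a representative example, not the paper's system.

```python
# Perona-Malik anisotropic diffusion on a 2-D image.
import numpy as np

def perona_malik(img, n_iter=20, kappa=30.0, lam=0.2):
    u = img.astype(float).copy()
    for _ in range(n_iter):
        # finite differences toward the four neighbors
        # (np.roll gives periodic boundaries, acceptable for a sketch)
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        # edge-stopping conductance: diffuse less across strong gradients
        g = lambda d: np.exp(-(d / kappa) ** 2)
        u += lam * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u

noisy = np.random.default_rng(6).normal(128, 20, (64, 64))
print(perona_malik(noisy).std(), "<", noisy.std())  # smoothing reduces variance
```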
Main Structure of Handwritten Jawi Sub-word Representation Using Numeric Code
Abstract
Feature extraction is an important stage in a Jawi recognition system because it influences various aspects of recognition performance. Statistical feature extraction is strongly influenced by the presence of the pixels that make up a word, especially for techniques based on zoning and pixel density. Variability in writing style makes the presence of pixels forming the smallest primitive structure in a zone less uniform, which affects the pixel density value. To overcome this problem, a technique known as numeric code representation is proposed to represent the range of the primitive structure's tilt in a zone. The numeric code is generated by comparing the average row and column of the smallest primitive structure in each zone. The experimental results show that the numeric code representation is the best method for representing the main structure of Jawi sub-word images when compared with three other feature representation techniques, as it yields the highest recognition rate for both classifiers used, whether probability-based or voting-based.
Roslim Mohamad, Mazani Manaf, Rose Hafsah Abd. Rauf, Mohammad Faidzul Nasruddin
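
One plausible reading of the numeric-code step, offered as a sketch only: split the sub-word image into zones, locate the foreground pixels in each zone, and compare their row span against their column span to code the primitive structure's orientation. The 3-way coding below is hypothetical; the paper's code alphabet is not reproduced here.

```python
# Hypothetical zoning-based numeric code for a binary sub-word image.
import numpy as np

def zone_codes(binary_img, rows=2, cols=2):
    codes = []
    for zr in np.array_split(binary_img, rows, axis=0):
        for zone in np.array_split(zr, cols, axis=1):
            ys, xs = np.nonzero(zone)
            if ys.size == 0:
                codes.append(0)                   # empty zone
            elif np.ptp(ys) > np.ptp(xs):
                codes.append(1)                   # taller than wide: vertical
            else:
                codes.append(2)                   # wider than tall: horizontal
    return codes

img = np.zeros((8, 8), dtype=int)
img[1:7, 2] = 1                                   # a vertical stroke
print(zone_codes(img))                            # e.g. [1, 0, 1, 0]
```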

Human Machine Interface

Frontmatter
Evaluating the Usability of Homestay Websites in Malaysia Using Automated Tools
Abstract
Usability evaluation is an imperative phase in the development of user-centred product design. A growing number of usability studies have been conducted for different types of websites in Malaysia; however, little research has investigated the usability level of Malaysian homestay websites. The main objective of this preliminary study is to evaluate the usability of homestay websites in Malaysia using various automated tools. The study evaluated 347 homestay websites listed on the Cari Homestay portal using automated tools such as Web Page Analyzer (from Website Optimization) and the Dead Link Checker tool. The data were analyzed using descriptive analysis, which showed usability issues such as violations of usability guidelines in terms of (i) page size, (ii) broken links and (iii) download speed. Relevant recommendations that web developers can use to improve the websites are also provided. Future work may include a series of interviews with real users to capture their experience of interacting with, and perceptions of, the websites.
Wan Abdul Rahim Wan Mohd Isa, Muchlisah Md Yusoff, Dg Asnani Ag Nordin
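
A minimal stand-in for the broken-link check performed by tools like Dead Link Checker: issue HEAD requests to each link and flag non-success responses. The URLs are placeholders, not the evaluated homestay sites.

```python
# Tiny dead-link checker sketch using the requests library.
import requests

links = ["https://example.com/", "https://example.com/no-such-page"]

for url in links:
    try:
        status = requests.head(url, allow_redirects=True, timeout=5).status_code
        print(url, "OK" if status < 400 else f"BROKEN ({status})")
    except requests.RequestException as exc:
        print(url, "BROKEN", type(exc).__name__)   # timeout, DNS failure, etc.
```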
Humanoid-Robot Intervention for Children with Autism: A Conceptual Model on FBM
Abstract
Autism is a lifelong disability that affects children's development in terms of social interaction, communication, and imagination. Children with autism are often unable to communicate in a meaningful way with their surroundings and cannot relate to the real world. Involving a humanoid robot in therapy sessions is said to be one of the most beneficial therapies for these children, since autistic children are reported to be keener to engage with machinery and gadgets. Given the limited number of studies on these children's emotions and feelings, this study adopts Kansei assessment to investigate the emotions and feelings of autistic children while they engage with the robot. The Kansei assessment was done by the teacher, who interpreted the emotional responses of the autistic children. Two children with mild autism were involved in the study. The data were then analyzed and translated to Fogg's Behavioral Model to represent the children's learning motivation. The developed Modified Fogg's Behavioral Model successfully shows the inter-relation between the three components of ability, trigger and motivation while the children interact with the humanoid robot. The final model provides some evidence that, despite having limited ability, children with autism given the right intervention will exhibit the same level of motivation as typically developing children.
Azhar Abdul Aziz, Fateen Faiqa Mislan Moghanan, Mudiana Mokhsin, Afiza Ismail, Anitawati Mohd Lokman
Cross-cultural Kansei Measurement
Abstract
Kansei Engineering (KE) enables designers to make decisions and focus on the design elements that make a product better fit human feelings, by discovering relationships between customers' feelings and product features. However, using a paper-based checklist to evaluate the experiments leads to different results across cultural and demographic backgrounds and limits the potential for obtaining the desired results. This research attempts to fill the gap by providing a Web-based Kansei Measurement System and testing it across cultures to see whether it produces similar results. A comparative study of Kansei was conducted with subjects from two cultural backgrounds and two measurement mechanisms: a web-based and a paper-based Kansei checklist. The resulting Kansei structure shows encouraging evidence that the Web-based Kansei Measurement System can be used as a cross-cultural Kansei measurement mechanism. The findings could benefit researchers and designers in their efforts to improve the Kansei measurement process and obtain the desired results.
Anitawati Mohd Lokman, Mohd Khairul Ikhwan Zolkefley

Hybrid Methods

Frontmatter
Accuracy Assessment of Urban Growth Pattern Classification Methods Using Confusion Matrix and ROC Analysis
Abstract
Urban growth patterns can be categorized as infill, expansion or outlying. Studies on urban growth classification have focused on describing the geometric features of urban growth patterns using conventional landscape metrics. These metrics are too simple and unable to give detailed information on the accuracy of the classification methods. This paper aims to assess the accuracy of classification methods in determining urban growth patterns correctly for a specific growth area. Accuracy assessments are carried out for three different classification methods: moving window, topological relation border length and landscape expansion index. Based on confusion matrices and receiver operating characteristic (ROC) analysis, the results show that the landscape expansion index has the best accuracy of the three.
Nur Laila Ab Ghani, Siti Zaleha Zainal Abidin, Noor Elaiza Abd Khalid
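
The two assessment tools named above, demonstrated on synthetic labels for a binary growth-type decision (e.g., infill vs. not-infill; the paper's task has three classes, so this is a simplification). The classifier scores are simulated.

```python
# Confusion matrix and ROC/AUC on simulated classification scores.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 200)
scores = y_true * 0.6 + rng.random(200) * 0.7      # noisy but informative scores
y_pred = (scores > 0.65).astype(int)

print(confusion_matrix(y_true, y_pred))            # rows: true, cols: predicted
fpr, tpr, _ = roc_curve(y_true, scores)            # points along the ROC curve
print("AUC:", roc_auc_score(y_true, scores))
```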
Intrusion Detection System Based on Modified K-means and Multi-level Support Vector Machines
Abstract
This paper proposes a multi-level model for intrusion detection that combines two techniques: modified K-means and the support vector machine (SVM). Modified K-means is used to reduce the number of instances in a training data set and to construct new training data sets with high-quality instances. These new training data sets are then used to train SVM classifiers, and the resulting multi-level SVMs classify the testing data sets with high performance. The well-known KDD Cup 1999 data set is used to evaluate the proposed system: 10% KDD is applied for training, and corrected KDD is used in testing. The experiments demonstrate that the proposed model effectively detects attacks in the DoS, R2L and U2R categories, and it achieves a maximum overall accuracy of 95.71%.
Wathiq Laftah Al-Yaseen, Zulaiha Ali Othman, Mohd Zakree Ahmad Nazri
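
One plausible reading of the instance-reduction step, sketched below: cluster each class of the training data with K-means and keep the points nearest the centroids as a smaller, high-quality training set for the SVM stage. The paper's "modified" K-means and its multi-level arrangement are not reproduced exactly.

```python
# K-means instance reduction followed by SVM training.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

Xs, ys = [], []
for label in np.unique(y_tr):
    pts = X_tr[y_tr == label]
    km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(pts)
    # keep the single point closest to each centroid as a high-quality instance
    for c in km.cluster_centers_:
        Xs.append(pts[np.argmin(((pts - c) ** 2).sum(axis=1))])
        ys.append(label)

svm = SVC().fit(np.array(Xs), np.array(ys))
print(f"reduced {len(X_tr)} -> {len(Xs)} instances,",
      f"test acc = {svm.score(X_te, y_te):.3f}")
```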
Backmatter
Metadata
Title
Soft Computing in Data Science
Editors
Michael W. Berry
Azlinah Mohamed
Bee Wah Yap
Copyright Year
2015
Publisher
Springer Singapore
Electronic ISBN
978-981-287-936-3
Print ISBN
978-981-287-935-6
DOI
https://doi.org/10.1007/978-981-287-936-3
