2016 | Book

Soft Computing in Data Science

Second International Conference, SCDS 2016, Kuala Lumpur, Malaysia, September 21-22, 2016, Proceedings

Edited by: Michael W. Berry, Azlinah Hj. Mohamed, Bee Wah Yap

Publisher: Springer Singapore

Book series: Communications in Computer and Information Science

About this book

This book constitutes the refereed proceedings of the Second International Conference on Soft Computing in Data Science, SCDS 2016, held in Kuala Lumpur, Malaysia, in September 2016.

The 27 revised full papers presented were carefully reviewed and selected from 66 submissions. The papers are organized in topical sections on artificial neural networks; classification, clustering, visualization; fuzzy logic; information and sentiment analytics.

Table of Contents

Frontmatter

Artificial Neural Networks

Frontmatter
Shallow Network Performance in an Increasing Image Dimension
Abstract
This paper describes the performance of a shallow network as the dimensionality of the input image representation increases. It highlights the generalization problem in shallow neural networks despite their extensive usage. In this experiment, a backpropagation algorithm is chosen to test the network, as it is widely used in many classification problems. A set of binary images in three different sizes is used. The idea is to assess how the network performs as the scale of the input dimension increases. In addition, a sample of the benchmark MNIST handwritten digit dataset is used to test the performance of the shallow network. The results show how the network performance changes as the scale of the input increases, and they are then discussed and explained. From the conducted experiments, it is believed that the complexity of the input size and the breadth of the network affect the performance of the neural network. Such results can serve as a reference and guidance for people who are interested in doing research using the backpropagation algorithm.
Mohd Razif Shamsuddin, Shuzlina Abdul-Rahman, Azlinah Mohamed
Applied Neural Network Model to Search for Target Credit Card Customers
Abstract
Many credit card businesses are no longer profitable due to antiquated and increasingly obsolete methods of acquiring customers and, just as importantly, of identifying ideal customers. The objective of this study is to identify high-spending and revolving customers through the development of proper parameters. We combined the backpropagation neural network, decision tree and logistic methods as a way to overcome each method’s deficiency. Two sets of data were used to develop key eigenvalues that more accurately predict ideal customers. Eventually, after many rounds of testing, we settled on the 14 eigenvalues with the lowest error rates, which significantly improved accuracy when acquiring credit card customers. It is our hope that data mining and big data can successfully utilize these advantages in data classification and prediction.
Jong-Peir Li
Selection Probe of EEG Using Dynamic Graph of Autocatalytic Set (ACS)
Abstract
An electroencephalography (EEG) machine is a piece of medical equipment used to diagnose seizures. An EEG signal records data in the form of a graph that contains abnormal patterns such as spikes, sharp waves, and spike-and-wave complexes. These patterns also come in multiple line series, which makes them difficult to analyze. This paper introduces the implementation of a dynamic graph of an Autocatalytic Set (ACS) for the EEG signal during a seizure. The result is then compared with another published method, namely Principal Component Analysis (PCA), on the same EEG data.
Azmirul Ashaari, Tahir Ahmad, Suzelawati Zenian, Noorsufia Abdul Shukor
A Comparison of BPNN, RBF, and ENN in Number Plate Recognition
Abstract
In this paper, we discuss a research project related to the autonomous recognition of Malaysian car plates using neural network approaches. This research aims to compare the conventional Backpropagation Feed Forward Neural Network (BPNN), Radial Basis Function Network (RBF), and Ensemble Neural Network (ENN). Numerous research articles have discussed the performance of BPNN and RBF in various applications. Interestingly, there is a lack of discussion and application of the ENN approach, as the idea of ENN is still very young. Furthermore, this paper also discusses a novel technique used to localize car plates automatically without labelling them or matching their positions with a template. The proposed method could solve most of the localization challenges. The experimental results show that the proposed technique could automatically localize most of the car plates. The testing results show that the proposed ENN performed better than the BPNN and RBF. Furthermore, the proposed RBF performed better than the BPNN.
Chin Kim On, Teo Kein Yao, Rayner Alfred, Ag Asri Ag Ibrahim, Wang Cheng, Tan Tse Guan
Multi-step Time Series Forecasting Using Ridge Polynomial Neural Network with Error-Output Feedbacks
Abstract
Time series forecasting receives much attention due to its impact on many practical applications. A higher-order neural network with recurrent feedback is a powerful technique that has been used successfully for forecasting. It maintains fast learning and the ability to learn the dynamics of the series over time. In this paper, we therefore propose a novel model, called Ridge Polynomial Neural Network with Error-Output Feedbacks (RPNN-EOF), which combines three powerful properties: higher-order terms, output feedback and error feedback. The well-known Mackey–Glass time series is used to evaluate the forecasting capability of RPNN-EOF. Results show that the proposed RPNN-EOF models the Mackey–Glass time series better, with a root mean square error of 0.00416. This error is smaller than that of other models in the literature. Therefore, we can conclude that the RPNN-EOF can be applied successfully to time series forecasting. Furthermore, the error-output feedbacks can be investigated and applied with different neural network models.
Waddah Waheeb, Rozaida Ghazali

Classification/Clustering/Visualization

Frontmatter
Comparative Feature Selection of Crime Data in Thailand
Abstract
Crime is a major problem for communities and society, and it is increasing day by day. In Thailand especially, crime is a major problem that affects all aspects of the country, such as tourism, government administration and daily life. Therefore, the government and private sectors have to understand the various crime patterns in order to plan for, prevent and solve crime correctly. The purpose of this study is to generate a crime model for Thailand using data mining techniques. Data were collected from the Dailynews and Thairath online newspapers. The proposed model is generated by combining several feature selection techniques with several classification techniques to produce different models. Experiments show that feature selection with the wrapper attribute evaluator seems to be an appropriate evaluation algorithm because it yields the best accuracy rate on most data sets. This improves efficiency in identifying offenders more quickly and accurately. The model can be used for the prevention of crime that will occur in Thailand in the future.
Tanavich Sithiprom, Anongnart Srivihok
The Impact of Data Normalization on Stock Market Prediction: Using SVM and Technical Indicators
Abstract
Predicting a stock index and its movement has never lacked attention among traders and professional analysts because of the attractive financial gains. Over the last two decades, extensive research has combined technical indicators with machine learning techniques to construct effective prediction models. This study investigates the impact of various data normalization methods when using a support vector machine (SVM) and technical indicators to predict the price movement of a stock index. The experimental results suggest that a prediction system based on SVM and technical indicators should carefully choose an appropriate data normalization method so as to avoid a negative influence on prediction accuracy and on training time.
Jiaqi Pan, Yan Zhuang, Simon Fong
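As a rough illustration of what "data normalization" can mean for the abstract above, the sketch below implements two common schemes, min-max scaling and z-score standardization. The abstract does not name the methods the authors compared, so both choices, the function names and the sample prices are assumptions made purely for illustration.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Shift values to zero mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical closing prices of a stock index (illustrative numbers only).
closing_prices = [100.0, 102.0, 101.0, 105.0, 110.0]
print(min_max_scale(closing_prices))
print(z_score_scale(closing_prices))
```

Distance-based and kernel-based learners such as SVM are sensitive to feature scale, which is one plausible reason the choice of scheme could affect both accuracy and training time.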
Interactive Big Data Visualization Model Based on Hot Issues (Online News Articles)
Abstract
Big data is a popular term used to describe a massive volume of data, which is a key component of the current information age. Such data is complex and difficult to understand, and therefore may not be useful to users in that state. News extraction, aggregation, clustering, news topic detection and tracking, and social network analysis are some of the attempts that have been made to manage the massive data in social media. Current visualization tools are difficult to adapt to the constant growth of big data, specifically in online news articles. Therefore, this paper proposes the Interactive Big Data Visualization Model Based on Hot Issues (IBDVM). IBDVM can be used to visualize hot issues in daily news articles. It is based on textual data clusters in textual databases, which improve the performance, accuracy, and quality of big data visualization. This model is useful for online news readers, news agencies, editors, and researchers who work in textual document domains.
Wael M. S. Yafooz, Siti Z. Z. Abidin, Nasiroh Omar, Shadi Hilles
Metabolites Selection and Classification of Metabolomics Data on Alzheimer’s Disease Using Random Forest
Abstract
Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by gradual memory loss, impairment of cognitive functions and progressive disability. It is known from previous studies that the symptoms of AD are due to synaptic dysfunction and neuronal death in the area of the brain that performs memory consolidation. Thus, the investigation of deviations in various cellular metabolite linkages is crucial to advance our understanding of early disease mechanisms and to identify novel therapeutic targets. This study aims to identify small sets of metabolites that could be potential biomarkers of AD. Liquid chromatography/mass spectrometry-quadrupole time of flight (LC/MS-QTOF)-based metabolomics data were used to determine potential biomarkers. The metabolic profiling detected a total of 100 metabolites for 55 AD patients and 55 healthy controls. Random forest (RF), a supervised classification algorithm, was used to select the important features that might elucidate biomarkers of AD. A mean decrease in accuracy of 0.05 or higher indicates an important variable. Out of 100 metabolites, 10 were significantly modified, namely N-(2-hydroxyethyl) icosanamide, which had the highest Gini index, followed by X11-12-dihydroxy (arachidic) acid, N-(2-hydroxyethyl) palmitamide, phytosphingosine, dihydrosphingosine, deschlorobenzoyl indomethacin, XZN-2-hydroxyethyl (icos) 11-enamide, X1-hexadecanoyl (sn) glycerol, tryptophan and dihydroceramide C2.
Mohammad Nasir Abdullah, Bee Wah Yap, Yuslina Zakaria, Abu Bakar Abdul Majeed
A Multi-objectives Genetic Algorithm Clustering Ensembles Based Approach to Summarize Relational Data
Abstract
In learning relational data, the Dynamic Aggregation of Relational Attributes algorithm is capable of transforming a multi-relational database into a vector space representation, to which a traditional clustering algorithm can then be applied directly to summarize relational data. However, the performance of the algorithm is highly dependent on the quality of the clusters produced. A small change in the initialization of the clustering algorithm parameters may adversely affect the quality of the clusters produced. To optimize the quality of clusters, a Genetic Algorithm is used to find the best combination of initializations in order to produce the optimal clusters. The proposed method involves the task of finding the best initialization with respect to the number of clusters, proximity distance measurements, fitness functions, and classifiers used for the evaluation. Based on the results obtained, clustering coupled with Euclidean distance is found to perform better in the classification stage than clustering coupled with Cosine similarity. Based on the findings, cluster entropy is the best fitness function, followed by the multi-objectives fitness function used in the genetic algorithm. This is most probably because of the involvement of an external measurement that takes the class label into consideration in optimizing the structure of the cluster results. In short, this paper shows the influence of varying the initialization values on the predictive performance.
Rayner Alfred, Gabriel Jong Chiye, Yuto Lim, Chin Kim On, Joe Henry Obit
Automated Generating Thai Stupa Image Descriptions with Grid Pattern and Decision Tree
Abstract
This research presents a novel algorithm for generating descriptions of stupa images, such as the stupa’s era, the stupa’s architecture and other descriptions, by using information inside the image, which is divided into a grid, and learning stupa descriptions from the generated information with a decision tree. In this paper, we obtain information from the image by dividing it into a grid pattern, for example 10 × 10, and submit the data inside the image to the decision tree model. The proposed algorithm aims to generate the descriptions for each stupa image. A decision tree was used as the classifier for generating the description. We have presented a new approach to feature extraction based on analysis of the information in the image by using the grid information. The algorithms were tested with a stupa image dataset from Phra Nakhon Si Ayutthaya province, Sukhothai province and Bangkok. The experimental results show that the proposed framework can efficiently give the correct descriptions of the stupa images compared to using the traditional method.
Sathit Prasomphan, Panuwut Nomrubporn, Pirat Pathanarat
Imbalance Effects on Classification Using Binary Logistic Regression
Abstract
Imbalanced data in classification problems affect the performance of classifiers. In predictive analytics, logistic regression is a statistical technique that is often used as a benchmark when other classifiers, such as Naïve Bayes, decision tree, artificial neural network and support vector machine, are applied to a classification problem. This study investigates the effect of the imbalance ratio in the response variable on the parameter estimates of binary logistic regression via a simulation study. Datasets were simulated with controlled percentages of imbalance ratio (IR), from 1 % to 50 %, and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using MSE (Mean Square Error). The simulation results provided evidence that the imbalance ratio affects the parameter estimates, where severe imbalance (IR = 1 %, 2 %, 5 %) has higher MSE. Additionally, the effects of high imbalance (IR ≤ 5 %) are more severe when the sample size is small (n = 100 and n = 500). Further investigation using real datasets from the UCI repository (Bupa Liver, n = 345, and Diabetes Messidor, n = 1151) confirmed the effect of the imbalance ratio on the parameter estimates and the odds ratio, which will thus lead to misleading results.
Hezlin Aryani Abd Rahman, Bee Wah Yap
Weak Classifiers Performance Measure in Handling Noisy Clinical Trial Data
Abstract
Most research has concluded that machine learning performance is better on cleaned datasets than on dirty datasets. In this paper, we experimented with three weak or base machine learning classifiers, Decision Table, Naive Bayes and k-Nearest Neighbor, to see how they perform on a real-world, noisy and messy clinical trial dataset rather than on a beautifully designed dataset. We involved a clinical trial data scientist to guide us towards better data analysis exploration and to enhance the evaluation of the performance results. The classifiers’ performance was analyzed using Accuracy and Receiver Operating Characteristic (ROC), supported by sensitivity, specificity and precision values, and the results contradicted the conclusions of previous research. We employed pre-processing techniques such as the interquartile range technique to remove outliers and mean imputation to handle missing values. Even so, all three classifiers worked better on the dirty dataset than on the imputed and cleaned datasets, showing the highest accuracy and ROC measures. Decision Table turned out to be the best classifier when dealing with real-world noisy clinical trial data.
Ezzatul Akmal Kamaru-Zaman, Andrew Brass, James Weatherall, Shuzlina Abdul Rahman
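The two pre-processing steps named in the abstract above, interquartile-range outlier removal and mean imputation, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the 1.5 × IQR fence, the order of the two steps, the function names and the sample readings are all assumptions.

```python
import statistics

def remove_iqr_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

# Hypothetical clinical readings with one missing value and one extreme value.
readings = [10, 12, 11, None, 13, 12, 11, 100]

# Assumed order: drop outliers among the observed values first, so that the
# extreme reading does not distort the mean used for imputation.
observed = [v for v in readings if v is not None]
cleaned = remove_iqr_outliers(observed)
print(cleaned)
print(impute_mean([10, None, 14]))
```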
A Review of Feature Extraction Optimization in SMS Spam Messages Classification
Abstract
Spam has become a definite nuisance to mobile users. As mobile technology has advanced, the Short Message Service (SMS) has been intruded upon by the emergence of SMS spam. This issue has not only caused distress but also poses other serious threats such as money loss, fraud, and false news. The focus of this study is to examine feature extraction in classifying SMS spam messages at the users’ end. Its objective is to study the discriminatory power of the features and to consider their informative or influence factor in classifying SMS spam messages. This study was conducted by gathering research papers and journals from numerous sources on the subject of spam classification. The findings offer motivation for further work on a wider perspective of combating spam, such as measuring spam’s risk level.
Kamahazira Zainal, Mohd Zalisham Jali
Assigning Different Weights to Feature Values in Naive Bayes
Abstract
Assigning weights to features has been an important topic in some classification learning algorithms. While current weighting methods assign a weight to each feature, in this paper we assign a different weight to each value of each feature. The performance of naive Bayes learning with the value-based weighting method is compared with that of some other traditional methods on a number of datasets.
Chang-Hwan Lee
An Automatic Construction of Malay Stop Words Based on Aggregation Method
Abstract
In information retrieval, effective indexing depends on the removal of stop words. Despite the many theories and algorithms related to the construction of stop words in many languages, most of the Malay stop words in use are either borrowed from English stop words or constructed manually by different researchers, which is costly, time-consuming and susceptible to error. In other words, no standard stop word list has yet been constructed for the Malay language. In this study, we propose an aggregation technique using three different approaches for the automatic construction of general Malay stop words. The first approach is a statistical method that considers words’ frequencies (highest and lowest) against their ranks; this method is inspired by Zipf’s law. The second approach considers words’ distribution across documents using a variance measure. The third approach computes how informative a word is by using an entropy measure. As a result, a total of 339 Malay stop words were produced. The discussion and implications of these findings are further elaborated.
Khalifa Chekima, Rayner Alfred
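One way to read the entropy measure mentioned in the abstract above is as the Shannon entropy of a word's occurrences across documents: a word spread evenly over all documents scores high and is a stop word candidate, while a word concentrated in a few documents scores low. The sketch below follows that reading. The exact formula used by the authors is not given in the abstract, so the definition, the function name and the toy Malay sentences are assumptions.

```python
import math
from collections import Counter

def word_entropy(word, documents):
    """Shannon entropy (in bits) of a word's occurrence distribution across documents."""
    counts = [Counter(doc.lower().split())[word] for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)

# Made-up toy corpus, for illustration only.
docs = [
    "saya dan kamu pergi ke pasar",
    "ibu dan ayah makan nasi",
    "kucing dan anjing tidur",
]

# "dan" occurs once in every document; "pasar" occurs in a single document.
print(word_entropy("dan", docs))
print(word_entropy("pasar", docs))
```

With this definition the maximum score for a corpus of N documents is log2(N), reached when the word is equally frequent in every document.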
Modeling and Forecasting Mudharabah Investment with Risk by Using Geometric Brownian Motion
Abstract
This study developed a mudharabah investment model with risk by considering the rate of return as the sum of a deterministic profit rate and a white-noise function, that is, geometric Brownian motion. The results show that the investment is forecast accurately when using the developed model. The profit from mudharabah investment is compared with that of single-party investment. The results show that the profit difference between mudharabah investment and single-party investment is very small. It is verified that the developed model can be used in forecasting the investment and profit for two parties.
Nurizzati Azhari, Maheran Mohd Jaffar
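For readers unfamiliar with geometric Brownian motion, the sketch below simulates a single investment path in which the return over each step is a deterministic drift plus Gaussian noise. It illustrates the general technique only: the parameter values, the monthly step and the function name are assumptions, and the profit-sharing side of the authors' mudharabah model is not reproduced.

```python
import math
import random

def simulate_gbm(initial, mu, sigma, years, steps, rng):
    """Simulate one geometric Brownian motion path using the exact log-normal step."""
    dt = years / steps
    path = [initial]
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)  # standard normal shock for this step
        growth = math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
        path.append(path[-1] * growth)
    return path

# Hypothetical investment: 10,000 units, 5 % drift, 10 % volatility, monthly steps.
rng = random.Random(42)
path = simulate_gbm(initial=10_000.0, mu=0.05, sigma=0.1, years=1.0, steps=12, rng=rng)
print(len(path), path[-1])
```

Setting `sigma` to zero removes the noise term and reduces the path to continuous compounding at rate `mu`, which is a convenient sanity check.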
Multi-script Text Detection and Classification from Natural Scenes
Abstract
Most text detection and script classification approaches for natural scenes cater for only a single script, whereas text in natural scenes may come in various scripts. This research proposes a gestalt-based approach for multi-script text detection and classification based on human perception. Human perceptual organization is the ability of humans to organize visual input into meaningful information. This approach is based on figure-ground articulation, where we perceive the figure or text as standing in front of the background. Features extracted from wavelet coefficients and MSER are used as input to an SVM for text detection and script classification. Experimental results indicate that this approach is competitive with state-of-the-art text detection and script classification approaches.
Zaidah Ibrahim, Zolidah Kasiran, Dino Isa, Nurbaity Sabri

Fuzzy Logic

Frontmatter
Algebraic and Graphical Interpretation of Complex Fuzzy Annulus (an Extension of Complex Fuzzy Sets)
Abstract
Complex fuzzy sets, which include complex-valued grades of memberships, are extensions of standard fuzzy sets that better represent time-periodic problem parameters. However, the membership functions of complex fuzzy sets are difficult to enumerate, as they are subject to personal preferences and bias. To overcome this problem, we generalize complex fuzzy sets to the complex fuzzy annulus, whose image is a sub-disk lying in the unit circle in the complex plane. The set theoretic operations of this concept are introduced and their algebraic properties are verified. The proposed model is then applied to a real-life problem, namely, the influencers of the Malaysian economy and the time lag between the occurrences of these influencers and their first manifestations in the economy.
Ganeshsree Selvachandran, Omar Awad Mashaan, Abdul Ghafur Ahmad
A Hierarchical Fuzzy Logic Control System for Malaysian Motor Tariff with Risk Factors
Abstract
In many countries, including Malaysia, it is compulsory to have a motor insurance policy, and the premium is determined based on the Motor Tariff, which ensures that a standard premium is imposed on policyholders. At present, the premium in Malaysia includes only two factors, namely the sum insured and the cubic capacity of the engine. Many existing methods used to calculate the tariff depend solely on the data and do not enable experts to provide their input to the system. In contrast, the rule-based system used in the Fuzzy Logic Control System can cater for the experts’ input. This research aims to develop a system that can determine the motor tariff using the Hierarchical Fuzzy Logic Control System. Besides the sum insured and the cubic capacity of the engine, the system will also incorporate the risk level of policyholders into the Motor Tariff. As a prototype, two selected risk factors are used, namely the age of drivers and the age of cars. The risk premium subsystem is developed before being combined with the main tariff premium system to constitute the Hierarchical Fuzzy Logic Control System. The results confirmed that the premium is loaded when the risk level is high and discounted when the risk level is low. The finding is in tandem with the impending detariffication exercise by Bank Negara Malaysia (BNM) for determining the motor insurance policy.
Daud Mohamad, Lina Diyana Mohd Jamal
Modeling Steam Generator System of Pressurized Water Reactor Using Fuzzy Arithmetic
Abstract
The steam generator system is known as the bridge between the primary and secondary systems, where water changes phase into steam. The aim of this paper is to identify the input that most influences the steam generator system in the process of changing water to steam, to ensure the process is efficient. The method is the transformation method of fuzzy arithmetic, which computes the measure of influence of each parameter in the model system. The result is then verified against simulation and analysis.
Wan Munirah Wan Mohamad, Tahir Ahmad, Azmirul Ashaari
Edge Detection of Flat Electroencephalography Image via Classical and Fuzzy Approach
Abstract
Edge detection is a crucial step in image processing that marks the points where the light intensity changes significantly. It is widely applied to gray-scale and colour images in various fields such as medical image processing, machine vision systems and remote sensing. The classical edge detectors such as Prewitt, Roberts, and Sobel are quite sensitive to noise and sometimes inaccurate. In this paper, the boundary of the epileptic foci of Flat EEG (fEEG) is determined by implementing several methods ranging from the classical to the fuzzy approach. Two methods are applied for the fuzzy edge detector technique: the Minimum Constructor and Maximum Constructor methods, and the Fuzzy Mathematical Morphology approach.
Suzelawati Zenian, Tahir Ahmad, Amidora Idris

Information and Sentiment Analytics

Frontmatter
Feel-Phy: An Intelligent Web-Based Physics QA System
Abstract
Feel-Phy is a computerized and unmanned question answering system that is able to solve open-ended Physics problems, provide adaptive guidance and retrieve resources relevant to user inputs. Latent Semantic Indexing (LSI) is employed to process the user inputs and retrieve relevant references. The proposed architecture for Feel-Phy consists of four basic modules: data extraction, question classification, solution identification and answer formulation. The data extraction module is used to construct a Physics knowledge base. The question classification module is used to identify the question type and understand the question. The solution identification module computes the answer to the question and also retrieves the top n most relevant resource references for the users. Finally, the last module, answer formulation, presents the results to the users. Our preliminary experiments have shown that the proposed method is able to solve well-structured Physics questions and retrieve relevant references for the users.
Kwong Seng Fong, Chih How Bong, Zahrah Binti Ahmad, Norisma Binti Idris
Usability Evaluation of Secondary School Websites in Malaysia: Case of Federal Territories of Kuala Lumpur, Putrajaya and Labuan
Abstract
The main objective of this study is to investigate the usability of secondary school websites in the Federal Territories of Kuala Lumpur, Putrajaya and Labuan, Malaysia. The evaluation was done using three automated tools: (i) Web Page Analyzer (websiteoptimization.com), (ii) DeadLink Checker and (iii) Broken Link Checker. The sample includes 53 secondary school websites in Malaysia. The data were analyzed based on Nielsen’s usability guidelines for (i) the ideal size of web pages, (ii) the number of broken links and (iii) the webpage size. The results of this study show that the secondary school websites in the Federal Territories of Kuala Lumpur, Putrajaya and Labuan, Malaysia had few usability issues. This study provides recommendations for improving the usability of secondary school websites in Malaysia. Future work may involve accessibility evaluation of the secondary school websites in Malaysia.
Wan Abdul Rahim Wan Mohd Isa, Zawaliah Ishak, Siti Zulaiha Shabirin
Judgment of Slang Based on Character Feature and Feature Expression Based on Slang’s Context Feature
Abstract
Our research aim was to develop the means to automatically identify a particular character string as slang and then connect the detected slang word to words with similar meaning in order to successfully process the sentence in which the word appears. By recognizing a slang word in this way, one can apply different processing to the word and avoid the distinctive problems associated with processing slang words. This paper proposes a method to distinguish standard words from slang words using information from the characters comprising the character string. An experiment testing the effectiveness of our method showed a 30 % or more improvement in classification accuracy compared to the baseline method. We also use a contextual feature related to emotion to expand the unregistered slang word in the training data into other expressions and propose an emotion estimation method based on the expanded expressions. In our experiment, successful emotion estimation was obtained in nearly 54 % of the cases, a notably higher rate than with the baseline method. Our proposed method was shown to have validity.
Kazuyuki Matsumoto, Seiji Tsuchiya, Minoru Yoshida, Kenji Kita
Factors Affecting Sentiment Prediction of Malay News Headlines Using Machine Learning Approaches
Abstract
Most sentiment analysis research is done with the help of supervised machine learning techniques. Analyzing the sentiment of English text reviews in order to gauge public perception and acceptance of a particular issue is a non-trivial task. Nevertheless, not many studies have been conducted on analyzing the sentiment of Malay news headlines due to a lack of resources and tools. Malay news headlines normally consist of a few words and are often written with creativity to attract the readers’ attention. This paper proposes a standard framework that investigates factors affecting sentiment prediction of Malay news headlines using machine learning approaches. It is important to investigate factors (e.g., types of classifiers, proximity measurements and number of Nearest Neighbors, k) that influence the prediction performance of the sentiment analysis, as it helps to study and understand the parameters that can be tuned to optimize the prediction performance. Based on the results obtained, the Support Vector Machine and Naïve Bayes classifiers were capable of obtaining higher accuracy than the k-Nearest Neighbors (k-NN) classifier. In terms of proximity measurement and number of Nearest Neighbors, k, the k-NN classifier achieved higher prediction performance when Cosine similarity was applied with a small value of k (e.g., 3 and 5) than with the Euclidean distance, because the latter measure can be affected by the high dimensionality of the data.
Rayner Alfred, Wong Wei Yee, Yuto Lim, Joe Henry Obit
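The two proximity measures compared in the abstract above can be contrasted on a toy example. In the sketch below, two headlines that use the same words at different lengths point in the same direction, so cosine similarity treats them as identical, while Euclidean distance rates an unrelated headline as closer. The term-count vectors are hypothetical and chosen only to make the contrast visible.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical term-count vectors over a four-word vocabulary.
short_headline = [1, 1, 0, 0]   # two words, once each
long_headline = [3, 3, 0, 0]    # the same two words, repeated
other_headline = [0, 0, 1, 1]   # entirely different words

print(cosine_similarity(short_headline, long_headline))
print(cosine_similarity(short_headline, other_headline))
print(euclidean_distance(short_headline, long_headline))
print(euclidean_distance(short_headline, other_headline))
```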
Assessing Factors that Influence the Performances of Automated Topic Selection for Malay Articles
Abstract
The Malay language is a major language in use by citizens of Malaysia, Indonesia, Singapore and Brunei. As the language is widely used, there is an abundance of text articles written in Malay available on the internet, and the number of Malay articles published online has increased greatly over the years. Automatically labeling Malay text articles is crucial in managing these articles. Due to the lack of resources and tools for performing topic selection automatically for Malay text articles, this paper studies the factors that influence the performance of the algorithms that can be applied to perform topic selection automatically for Malay articles. This is done by comparing the contents of the articles with the corresponding topics, and all Malay articles are assigned to the appropriate topics depending on the results of the classification process. In this paper, all Malay articles are classified using the k-Nearest Neighbors (k-NN) and Naïve Bayes classifiers. Both classifiers are used to classify and assign a topic to these Malay articles according to a predefined set of topics. The effectiveness of classifying these Malay articles using the k-NN classifier is highly dependent on the distance methods used and the number of Nearest Neighbors, k. Thus, this paper also assesses the effects of using different distance methods (e.g., Cosine Similarity and the Euclidean Distance) and of varying the number of nearest neighbors, k. In addition, the effects of the stemming process on the performance of the classifiers are also studied. Based on the results obtained, the k-NN classifier performs better than the Naïve Bayes classifier in classifying the Malay articles into their respective topics. The stemming process also improves the overall performance of both classifiers. Other findings include that applying Cosine Similarity as the distance measure improved the performance of the k-NN classifier.
Rayner Alfred, Leow Jia Ren, Joe Henry Obit
Backmatter
Metadata
Title
Soft Computing in Data Science
Edited by
Michael W. Berry
Azlinah Hj. Mohamed
Bee Wah Yap
Copyright year
2016
Publisher
Springer Singapore
Electronic ISBN
978-981-10-2777-2
Print ISBN
978-981-10-2776-5
DOI
https://doi.org/10.1007/978-981-10-2777-2
