
2015 | Book

Machine Learning and Data Mining in Pattern Recognition

11th International Conference, MLDM 2015, Hamburg, Germany, July 20-21, 2015, Proceedings


About this book

This book constitutes the refereed proceedings of the 11th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2015, held in Hamburg, Germany, in July 2015. The 41 full papers presented were carefully reviewed and selected from 123 submissions. The topics range from theoretical aspects of classification, clustering, association rule and pattern mining to specific data mining methods for different multimedia data types such as image mining, text mining, video mining and Web mining.

Table of Contents

Frontmatter

Graph Mining

Frontmatter
Greedy Graph Edit Distance

In pattern recognition and data mining applications where the underlying data is characterized by complex structural relationships, graphs are often used as a formalism for object representation. Yet, the high representational power and flexibility of graphs is accompanied by a significant increase in the complexity of many algorithms. For instance, exact computation of pairwise graph dissimilarity, i.e. distance, is possible in exponential time only. A previously introduced approximation framework reduces the problem of graph comparison to an instance of a linear sum assignment problem, which allows graph dissimilarity computation in cubic time. The present paper introduces an extension of this approximation framework that runs in quadratic time. We empirically confirm the scalability of our extension with respect to run time, and moreover show that the quadratic approximation leads to graph dissimilarities which are sufficiently accurate for graph-based pattern classification.
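As an illustration of the two assignment strategies involved, the hedged sketch below contrasts an optimal linear sum assignment (cubic time, as in the earlier framework) with a greedy row-by-row assignment (quadratic time, in the spirit of this paper). The cost matrix and function names are invented; the actual cost model encodes node substitutions, insertions and deletions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_assignment_cost(cost: np.ndarray) -> float:
    """Cubic-time LSAP solution (Hungarian-style), as in the original framework."""
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum())

def greedy_assignment_cost(cost: np.ndarray) -> float:
    """Quadratic-time greedy variant: each row takes the cheapest free column."""
    cost = cost.copy()
    available = list(range(cost.shape[1]))
    total = 0.0
    for i in range(cost.shape[0]):
        j = min(available, key=lambda c: cost[i, c])  # cheapest remaining match
        total += cost[i, j]
        available.remove(j)
    return total

# Toy cost matrix between the nodes of two small graphs.
C = np.array([[1.0, 4.0, 2.0], [3.0, 0.5, 6.0], [2.0, 3.0, 1.0]])
print(optimal_assignment_cost(C), greedy_assignment_cost(C))  # greedy >= optimal
```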

Kaspar Riesen, Miquel Ferrer, Rolf Dornberger, Horst Bunke
Learning Heuristics to Reduce the Overestimation of Bipartite Graph Edit Distance Approximation

In data mining systems, which operate on complex data with structural relationships, graphs are often used to represent the basic objects under study. Yet, the high representational power of graphs is also accompanied by an increased complexity of the associated algorithms. Exact graph similarity or distance, for instance, can be computed in exponential time only. Recently, an algorithmic framework that allows graph dissimilarity computation in cubic time with respect to the number of nodes has been presented. This fast computation is at the expense, however, of generally overestimating the true distance. The present paper introduces six different post-processing algorithms that can be integrated in this suboptimal graph distance framework. These novel extensions aim at improving the overall distance quality while keeping the low computation time of the approximation. An experimental evaluation clearly shows that the proposed heuristics substantially reduce the overestimation in the existing approximation framework while the computation time remains remarkably low.

Miquel Ferrer, Francesc Serratosa, Kaspar Riesen
Seizure Prediction by Graph Mining, Transfer Learning, and Transformation Learning

We present in this study a novel approach to predicting EEG epileptic seizures: we accurately model and predict non-ictal cortical activity and use prediction errors as parameters that significantly distinguish ictal from non-ictal activity. We suppress seizure-related activity by modeling EEG signal acquisition as a cocktail party problem and obtaining seizure-related activity using Independent Component Analysis. Following recent studies intricately linking seizure to increased, widespread synchrony, we construct dynamic EEG synchronization graphs in which the electrodes are represented as nodes and the pair-wise correspondences between them are represented by edges. We extract 38 intuitive features from the synchronization graph as well as the original signal. From this, we use a rigorous method of feature selection to determine minimally redundant features that can describe the non-ictal EEG signal maximally. We learn a one-step forecast operator restricted to just these features, using autoregression (AR(1)). We improve this in a novel way by cross-learning common knowledge across patients and recordings using Transfer Learning, and devise a novel transformation to increase the efficiency of transfer learning. We declare imminent seizure based on detecting outliers in our prediction errors using a simple and intuitive method. Our median seizure detection time is 11.04 min prior to the labeled start of the seizure compared to a benchmark of 1.25 min prior, based on previous work on the topic. To the authors’ best knowledge this is the first attempt to model seizure prediction in this manner, employing efficient seizure suppression, the use of synchronization graphs and transfer learning, among other novel applications.
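As a rough illustration of the forecasting-and-outlier idea (not the authors’ code), the sketch below fits a one-step AR(1) predictor on a synthetic feature series and raises an alarm when the forecast error leaves its normal range; the signal, window sizes and threshold are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(0, 0.1, 600))           # stand-in for one graph feature
x[450:] += np.linspace(0, 30, 150)               # drift mimicking pre-ictal change

# Least-squares AR(1): x[t] ~ a * x[t-1] + b, fitted on an early "normal" window.
a, b = np.polyfit(x[:299], x[1:300], 1)

pred_err = np.abs(x[1:] - (a * x[:-1] + b))      # one-step forecast errors
mu, sigma = pred_err[:299].mean(), pred_err[:299].std()
alarms = np.where(pred_err > mu + 5 * sigma)[0]  # simple outlier rule
print("first alarm at t =", alarms[0] + 1 if alarms.size else None)
```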

Nimit Dhulekar, Srinivas Nambirajan, Basak Oztan, Bülent Yener

Classification and Regression

Frontmatter
Local and Global Genetic Fuzzy Pattern Classifiers

Fuzzy pattern classifiers are a recent type of classifier making use of fuzzy membership functions and fuzzy aggregation rules, providing a simple yet robust classifier model. The fuzzy pattern classifier is parametric, giving the choice of fuzzy membership function and fuzzy aggregation operator. Several methods for estimating appropriate fuzzy membership functions and fuzzy aggregation operators have been suggested, but they consider only fuzzy membership functions with symmetric shapes, found by heuristically selecting a “middle” point from the learning examples. Here, an approach for learning the fuzzy membership functions and the fuzzy aggregation operator from data is proposed, using a genetic algorithm for the search. The method is experimentally evaluated on several public datasets, and its performance is found to be significantly better than existing fuzzy pattern classifier methods, which is notable given the simplicity of the fuzzy pattern classifier model.
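A compact sketch of the general fuzzy pattern classifier structure described here, with invented details: one triangular membership function per class and feature, an aggregation operator, and an argmax decision. The paper’s genetic algorithm would search the membership parameters and the operator; this sketch just fixes them heuristically from the data.

```python
import numpy as np

def triangular(x, left, mid, right):
    # Membership rises linearly to 1 at `mid`, then falls back to 0.
    return np.clip(np.minimum((x - left) / (mid - left + 1e-9),
                              (right - x) / (right - mid + 1e-9)), 0, 1)

def fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.min(0), Xc.mean(0), Xc.max(0))  # per-feature triangle
    return params

def predict(X, params, aggregate=np.mean):               # operator is a choice
    classes = sorted(params)
    scores = [aggregate(triangular(X, *params[c]), axis=1) for c in classes]
    return np.array(classes)[np.argmax(scores, axis=0)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.repeat([0, 1], 50)
print("training accuracy:", (predict(X, fit(X, y)) == y).mean())
```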

Søren Atmakuri Davidsen, E. Sreedevi , M. Padmavathamma
IKLTSA: An Incremental Kernel LTSA Method

Since 2000, manifold learning methods have been extensively studied and have demonstrated excellent performance in dimensionality reduction in some application scenarios. However, they still have some drawbacks in approximating real nonlinear relationships during the dimensionality reduction process, and are thus unable to retain the original data’s structure well. In this paper, we propose an incremental version of the manifold learning algorithm LTSA based on the kernel method, called IKLTSA (Incremental Kernel LTSA). IKLTSA exploits the advantages of the kernel method and can detect the explicit mapping from the high-dimensional data points to their low-dimensional embedding coordinates. It is also able to reflect the intrinsic structure of the original high-dimensional data more exactly and to deal with new data points incrementally. Extensive experiments on both synthetic and real-world data sets validate the effectiveness of the proposed method.

Chao Tan, Jihong Guan, Shuigeng Zhou

Sentiment Analysis

Frontmatter
SentiSAIL: Sentiment Analysis in English, German and Russian

Sentiment analysis has been well in the focus of researchers in recent years. Nevertheless, despite the considerable amount of literature in the field, the majority of publications target the domains of movie and product reviews in English. The current paper presents a novel sentiment analysis method, which extends the state of the art by trilingual sentiment classification in the domains of general news and, particularly, the coverage of natural disasters in general news. The languages examined are English, German and Russian. The approach targets both traditional and social media content. Extensive experiments demonstrate that the proposed approach outperforms human annotators, as well as the original method on which it is built.

Gayane Shalunts, Gerhard Backfried
Sentiment Analysis for Government: An Optimized Approach

This paper describes a Sentiment Analysis (SA) method to analyze the polarity of tweets and to enable government to describe quantitatively the opinion of active users on social networks with respect to topics of interest to the Public Administration.

We propose an optimized approach employing a document-level and a dataset-level supervised machine learning classifier to provide accurate results in both individual and aggregated sentiment classification.

The aim of this work is also to identify the types of features that make it possible to obtain the most accurate sentiment classification for a dataset of Italian tweets in the context of a Public Administration event, also taking into account the size of the training set. This work uses a dataset of 1,700 Italian tweets relating to the public event of “Lecce 2019 – European Capital of Culture”.

Angelo Corallo, Laura Fortunato, Marco Matera, Marco Alessi, Alessio Camillò, Valentina Chetta, Enza Giangreco, Davide Storelli

Data Preparation and Missing Values

Frontmatter
A Novel Algorithm for the Integration of the Imputation of Missing Values and Clustering

In this article we present a new and efficient algorithm to handle missing values in databases applied in data mining (DM). Missing values may harm the calculation of the clustering algorithm and might lead to distorted results, so they must be treated before the DM. Commonly, methods to handle missing values are implemented as a separate process from the DM. This may cause a long runtime and may lead to redundant I/O accesses. As a result, the entire DM process may be inefficient. We present a new algorithm (km-Impute) which integrates clustering and imputation of missing values in a unified process. The algorithm was tested on real Red wine quality measures (from the UCI Machine Learning Repository). km-Impute succeeded in imputing missing values and in building clusters as a unified integrated process. The structure and quality of the clusters produced by km-Impute were similar to the clusters of k-means. In addition, the clusters were analyzed by a wine expert: the clusters represented different types of Red wine quality. The success and the accuracy of the imputation were validated using another two datasets, White wine and Page blocks (from the UCI). The results were consistent with the tests applied on Red wine: the ratio of success of imputation in all three datasets was similar. Although the complexity of km-Impute was the same as k-means, in practice it was more efficient when applied on middle-sized databases: the runtime was significantly shorter than k-means, fewer iterations were required until convergence, and km-Impute also performed far fewer I/O accesses in comparison to k-means.
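The abstract does not give the algorithm itself, but the general idea of interleaving imputation with k-means-style iterations can be sketched as follows; all details (initial fill, update order, convergence handling) are assumptions for illustration.

```python
import numpy as np

def km_impute_sketch(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.isnan(X)
    Xf = np.where(mask, np.nanmean(X, axis=0), X)     # crude initial fill
    centers = Xf[rng.choice(len(Xf), k, replace=False)]
    for _ in range(iters):
        d = ((Xf[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                     # k-means-style assignment
        for j in range(k):
            if (labels == j).any():
                centers[j] = Xf[labels == j].mean(axis=0)
        Xf[mask] = centers[labels][mask]              # re-impute from centroids
    return labels, Xf, centers

X = np.array([[5.1, np.nan], [4.9, 3.0], [6.7, 3.1], [np.nan, 3.2]])
labels, imputed, _ = km_impute_sketch(X, k=2)
print(labels, imputed.round(2))
```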

Roni Ben Ishay, Maya Herman
Improving the Algorithm for Mapping of OWL to Relational Database Schema

Ontologies have been applied in many applications in recent years, especially on the semantic web and in information retrieval, information extraction, and question answering. The purpose of a domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain, as well as their definitions and interrelationships. There are several languages for representing ontologies, such as RDF and OWL. However, these languages are only suitable for ontologies with small amounts of data; for representing ontologies with big data, a database is usually used. Many techniques and tools have been proposed over the last years. In this paper, we introduce an improved approach for mapping RDF or OWL to a relational database based on the algorithm proposed by Kaunas University of Technology. This approach can be applied to ontologies with many classes and big data.

Chien D. C. Ta, Tuoi Phan Thi
Robust Principal Component Analysis of Data with Missing Values

Principal component analysis is one of the most popular machine learning and data mining techniques. Having its origins in statistics, principal component analysis is used in numerous applications. However, there seems to be little systematic testing and assessment of principal component analysis for cases with erroneous and incomplete data. The purpose of this article is to propose multiple robust approaches for carrying out principal component analysis and, especially, for estimating the relative importance of the principal components in explaining the data variability. Computational experiments are first focused on carefully designed simulated tests where the ground truth is known and can be used to assess the accuracy of the results of the different methods. In addition, a practical application and evaluation of the methods for an educational data set is given.
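One simple way to make explained-variance estimates tolerant of missing values is sketched below, assuming pairwise-complete covariance estimation via numpy masked arrays; the paper proposes several approaches, and this is just one plausible instance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
X[rng.random(X.shape) < 0.1] = np.nan          # knock out 10% of the entries

Xm = np.ma.masked_invalid(X)
C = np.ma.cov(Xm, rowvar=False)                # pairwise-complete covariance
eigvals = np.linalg.eigvalsh(np.asarray(C))[::-1]
print("explained variance ratios:", (eigvals / eigvals.sum()).round(3))
```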

Tommi Kärkkäinen, Mirka Saarela

Association and Sequential Rule Mining

Frontmatter
Efficient Mining of High-Utility Sequential Rules

High-utility pattern mining is an important data mining task with wide applications. It consists of discovering patterns generating a high profit in databases. Recently, the task of high-utility sequential pattern mining has emerged to discover patterns generating a high profit in sequences of customer transactions. However, a well-known limitation of sequential patterns is that they do not provide a measure of the confidence or probability that they will be followed. This greatly hampers their usefulness for several real applications such as product recommendation. In this paper, we address this issue by extending the problem of sequential rule mining to utility mining. We propose a novel algorithm named HUSRM (High-Utility Sequential Rule Miner), which includes several optimizations to mine high-utility sequential rules efficiently. An extensive experimental study with four datasets shows that HUSRM is highly efficient and that its optimizations improve its execution time by up to 25 times and its memory usage by up to 50 %.
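As a worked toy example of what a sequential rule adds over a plain sequential pattern, the snippet below computes the confidence and a simple utility for a rule X → Y over invented purchase sequences; HUSRM’s actual definitions and optimizations are more involved.

```python
# Toy sequences: ordered (item, profit) purchases of individual customers.
sequences = [
    [("bread", 1), ("milk", 2), ("cheese", 5)],
    [("bread", 1), ("cheese", 5)],
    [("bread", 1), ("milk", 2)],
]

def rule_stats(seqs, x, y):
    """Confidence and utility of the rule x -> y (names are illustrative)."""
    has_x, support, utility = 0, 0, 0
    for s in seqs:
        items = [i for i, _ in s]
        if x in items:
            has_x += 1
            if y in items[items.index(x) + 1:]:    # y occurs after x
                support += 1
                utility += sum(p for i, p in s if i in (x, y))
    return support / has_x, utility

print(rule_stats(sequences, "bread", "cheese"))    # confidence 2/3, utility 12
```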

Souleymane Zida, Philippe Fournier-Viger, Cheng-Wei Wu, Jerry Chun-Wei Lin, Vincent S. Tseng
MOGACAR: A Method for Filtering Interesting Classification Association Rules

Knowledge Discovery process is intended to provide valid, novel, potentially useful and finally understandable patterns from data. An interesting research area concerns the identification and use of interestingness measures, in order to rank or filter results and provide what might be called better knowledge. For association rules mining, some research has been focused on how to filter itemsets and rules, in order to guide knowledge acquisition from the user’s point of view, as well as to improve efficiency of the process. In this paper, we explain MOGACAR, an approach for ranking and filtering association rules when there are multiple technical and business interestingness measures; MOGACAR uses a multi-objective optimization method based on genetic algorithm for classification association rules, with the intention to find the most interesting, and still valid, itemsets and rules.

Diana Benavides Prado

Support Vector Machines

Frontmatter
Classifying Grasslands and Cultivated Pastures in the Brazilian Cerrado Using Support Vector Machines, Multilayer Perceptrons and Autoencoders

One of the most biodiverse regions on the planet, Cerrado is the second largest biome in Brazil. Among the land changes in the Cerrado, over 500,000 km² of the biome have been changed into cultivated pastures in recent years. Categorizing types of land cover and its native formations is important for protection policy and monitoring of the biome. Based on remote sensing techniques, this work aims at developing a methodology to map pasture and native grassland areas in the biome. Data related to EVI vegetation indices obtained from MODIS images were used to perform image classification. Support Vector Machine, Multilayer Perceptron and Autoencoder algorithms were used, and the results showed that the analysis of different attributes extracted from EVI indices can aid the classification process. The best result obtained an accuracy of 85.96 % in the study area, identifying the data and attributes required to map pasture and native grassland in the Cerrado.

Wanderson Costa, Leila Fonseca, Thales Körting
Hybrid Approach for Inductive Semi Supervised Learning Using Label Propagation and Support Vector Machine

Semi-supervised learning methods have gained importance in today’s world because of the large expense and time involved in labeling unlabeled data by human experts. The proposed hybrid approach uses SVM and Label Propagation to label the unlabeled data. In the process, at each step an SVM is trained to minimize the error and thus improve the prediction quality. Experiments are conducted using SVM and logistic regression (Logreg). Results show that SVM performs considerably better than Logreg. The approach is tested using 12 datasets of different sizes, ranging from the order of 1000s to the order of 10000s. Results show that the proposed approach outperforms Label Propagation by a large margin, with an F-measure almost twice as high on average. A parallel version of the proposed approach is also designed and implemented; the analysis shows that the training time decreases significantly when the parallel version is used.
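A hedged sketch of a hybrid of this kind (the paper’s exact training loop may differ): propagate labels to the unlabeled points, keep the pseudo-labels on which an SVM trained on the labeled seed agrees, and retrain the SVM on the enlarged set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:40] = True                                       # small labeled seed

lp = LabelPropagation().fit(X, np.where(labeled, y, -1))  # -1 marks unlabeled
svm = SVC().fit(X[labeled], y[labeled])

agree = lp.transduction_ == svm.predict(X)                # consensus pseudo-labels
train = labeled | agree
final_svm = SVC().fit(X[train], np.where(labeled, y, lp.transduction_)[train])
print("accuracy on all points:", final_svm.score(X, y))
```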

Aruna Govada, Pravin Joshi, Sahil Mittal, Sanjay K. Sahay

Frequent Item Set Mining and Time Series Analysis

Frontmatter
Optimizing the Data-Process Relationship for Fast Mining of Frequent Itemsets in MapReduce

Despite crucial recent advances, the problem of frequent itemset mining is still facing major challenges. This is particularly the case when (i) the mining process must be massively distributed and (ii) the minimum support (MinSup) is very low. In this paper, we study the effectiveness and leverage of specific data placement strategies for improving parallel frequent itemset mining (PFIM) performance in MapReduce, a highly distributed computation framework. By offering a clever data placement and an optimal organization of the extraction algorithms, we show that the itemset discovery effectiveness does not only depend on the deployed algorithms. We propose ODPR (Optimal Data-Process Relationship), a solution for fast mining of frequent itemsets in MapReduce. Our method allows discovering itemsets from massive datasets, where standard solutions from the literature do not scale. Indeed, in a massively distributed environment, the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. Our proposal has been evaluated using real-world data sets and the results illustrate a significant scale-up obtained with very low MinSup, which confirms the effectiveness of our approach.

Saber Salah, Reza Akbarinia, Florent Masseglia
Aggregation-Aware Compression of Probabilistic Streaming Time Series

In recent years, there has been a growing interest in probabilistic data management. We focus on probabilistic time series, where a main characteristic is the high volume of data, calling for efficient compression techniques. To date, most work on probabilistic data reduction has provided synopses that minimize the error of representation w.r.t. the original data. However, in most cases, the compressed data will be meaningless for usual queries involving aggregation operators such as SUM or AVG. We propose PHA (Probabilistic Histogram Aggregation), a compression technique whose objective is to minimize the error of such queries over compressed probabilistic data. We incorporate the aggregation operator given by the end-user directly in the compression technique, and obtain much lower error in the long term. We also adopt a global error-aware strategy in order to manage large sets of probabilistic time series, where the available memory is carefully balanced between the series, according to their individual variability.

Reza Akbarinia, Florent Masseglia

Clustering

Frontmatter
Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF)

We define a heterogeneous dataset as a set of complex objects, that is, those defined by several data types including structured data, images, free text or time series. We envisage this could be extensible to other data types. There are currently research gaps in how to deal with such complex data. In our previous work, we have proposed an intermediary fusion approach called SMF which produces a pairwise matrix of distances between heterogeneous objects by fusing the distances between the individual data types. More precisely, SMF aggregates partial distances that we compute separately from each data type, taking into consideration uncertainty. Consequently, a single fused distance matrix is produced that can be used to produce a clustering using a standard clustering algorithm. In this paper we extend the practical work by evaluating SMF using the k-means algorithm to cluster heterogeneous data. We used a dataset of prostate cancer patients where objects are described by two basic data types, namely structured and time-series data. We assess the results of clustering using external validation on multiple possible classifications of our patients. The results show that the SMF approach can improve the clustering configuration when compared with clustering on an individual data type.
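A minimal sketch of the intermediary-fusion step, with invented weights and distance choices; since k-means needs feature vectors rather than distances, an agglomerative clusterer that accepts a precomputed matrix is used here as a stand-in for the final clustering step.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering

structured = np.random.default_rng(0).normal(size=(30, 5))   # tabular part
series = np.random.default_rng(1).normal(size=(30, 100))     # time-series part

D_struct = cdist(structured, structured)                     # Euclidean
D_series = cdist(series, series, metric="correlation")       # shape-based

# Normalize each matrix to [0, 1] before fusing so no data type dominates.
fuse = lambda D: D / D.max()
D_fused = 0.5 * fuse(D_struct) + 0.5 * fuse(D_series)

labels = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(D_fused)
print(labels)
```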

Aalaa Mojahed, Joao H. Bettencourt-Silva, Wenjia Wang, Beatriz de la Iglesia
On Bicluster Aggregation and its Benefits for Enumerative Solutions

Biclustering involves the simultaneous clustering of objects and their attributes, thus defining local two-way clustering models. Recently, efficient algorithms were conceived to enumerate all biclusters in real-valued datasets. In this case, the solution composes a complete set of maximal and non-redundant biclusters. However, the ability to enumerate biclusters revealed a challenging scenario: in noisy datasets, each true bicluster may become highly fragmented and exhibit a high degree of overlap. This prevents a direct analysis of the obtained results. Aiming at reverting the fragmentation, we propose here two approaches for properly aggregating the whole set of enumerated biclusters: one based on single linkage and the other directly exploring the rate of overlapping. Both proposals were compared with each other and with the actual state of the art in several experiments, and they not only significantly reduced the number of biclusters but also consistently increased the quality of the solution.

Saullo Oliveira, Rosana Veroneze, Fernando J. Von Zuben
Semi-Supervised Stream Clustering Using Labeled Data Points

Semi-supervised stream clustering performs cluster analysis of data streams by exploiting background or domain expert knowledge. Almost all existing semi-supervised stream clustering techniques exploit background knowledge in the form of constraints, such as must-link and cannot-link constraints. The use of constraints is not appropriate with respect to the dynamic nature of data streams. In this paper, we propose a new semi-supervised stream clustering algorithm, SSE-Stream. SSE-Stream exploits background knowledge in the form of single labeled data points to monitor and detect change in the evolution of the clustering structure. Exploiting background knowledge as single labeled data points is more appropriate for data streams: they can be immediately utilised for determining the class of clusters, and effectively support the changing behavior of data streams. SSE-Stream defines a new cluster representation that includes labeled data points, and uses it to extend the clustering operations such as merge and split for detecting change in the evolution of the clustering structure. Experimental results on real-world stream datasets show that SSE-Stream is able to improve the output clustering quality, especially for highly complex and drifting datasets.

Kritsana Treechalong, Thanawin Rakthanmanon, Kitsana Waiyamai
Avalanche: A Hierarchical, Divisive Clustering Algorithm

Hierarchical clustering has been successfully used in many applications, such as bioinformatics and social sciences. In this paper, we introduce Avalanche, a new top-down hierarchical clustering approach that takes a dissimilarity matrix as its input. Such a tool can be used for applications where the dataset is partitioned based on pairwise distances among the examples, such as taxonomy generation tools and molecular biology applications in which dissimilarities among gene sequences are used as inputs, as opposed to flat-file attribute/value pair datasets. The proposed algorithm uses local as well as global information to recursively split the data associated with a tree node into two sub-nodes until some predefined termination condition is met. To split a node, initially the example that is furthest away from the other examples, the anti-medoid, is assigned to the right sub-node; then additional examples which are nearest neighbors of the previously added example are progressively assigned to this node, as long as a given objective function improves. Experimental evaluations done with artificial and real-world datasets show that the new approach has improved speed, and obtained clustering results comparable to the well-known UPGMA algorithm on all datasets used in the experiment.
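A hypothetical sketch of a single anti-medoid split on a dissimilarity matrix follows; the growth criterion used here (mean left-right dissimilarity must increase) is an invented stand-in for the paper’s objective function.

```python
import numpy as np

def split_node(D: np.ndarray):
    n = D.shape[0]
    right = [int(D.sum(axis=1).argmax())]          # anti-medoid seed
    left = [i for i in range(n) if i not in right]
    while len(left) > 1:
        last = right[-1]
        cand = min(left, key=lambda i: D[last, i]) # nearest neighbor of last
        # Accept the move only if it increases mean left-right dissimilarity.
        before = D[np.ix_(left, right)].mean()
        new_left = [i for i in left if i != cand]
        after = D[np.ix_(new_left, right + [cand])].mean()
        if after <= before:
            break
        right.append(cand)
        left = new_left
    return left, right

D = np.array([[0, 1, 9, 8], [1, 0, 8, 9], [9, 8, 0, 2], [8, 9, 2, 0]], float)
print(split_node(D))  # expected split: [0, 1] vs [2, 3]
```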

Paul K. Amalaman, Christoph F. Eick

Text Mining

Frontmatter
Author Attribution of Email Messages Using Parse-Tree Features

Most existing research on authorship attribution uses various types of lexical, syntactic, and structural features for classification. Some of these features are not meaningful for small texts such as email messages. In this paper we demonstrate a very effective use of a syntactic feature of an author’s writing, the parse-tree characteristics of the text, for authorship analysis of email messages. We define author templates consisting of context-free grammar (CFG) production frequencies occurring in an author’s training set of email messages. We then use similar frequencies extracted from a new email message to match against the various authors’ templates and identify the best match. We evaluate our approach on the Enron email dataset and show that CFG production frequencies work very well and are robust in attributing the authorship of email messages.
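A small sketch of the feature idea, assuming parse trees are already available (in practice a statistical parser would produce them from the email text); the profile matching here is a simple histogram overlap, invented for illustration.

```python
from collections import Counter
from nltk import Tree

def production_counts(trees):
    c = Counter()
    for t in trees:
        c.update(str(p) for p in t.productions())  # CFG productions per tree
    total = sum(c.values())
    return {p: n / total for p, n in c.items()}    # relative frequencies

author_a = [Tree.fromstring("(S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN dog))))")]
message  = [Tree.fromstring("(S (NP (PRP I)) (VP (VBD met) (NP (DT a) (NN friend))))")]

profile, probe = production_counts(author_a), production_counts(message)
overlap = sum(min(profile.get(p, 0), f) for p, f in probe.items())
print("similarity to author A:", round(overlap, 2))
```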

Jagadeesh Patchala, Raj Bhatnagar, Sridharan Gopalakrishnan
Query Click and Text Similarity Graph for Query Suggestions

Query suggestion is an important feature of search engines given the explosive and diverse growth of web content. Different kinds of suggestions, such as queries, images, movies, music and books, are used every day, and various types of data sources are used for the suggestions. If we model the data as various kinds of graphs, we can build a general method for any kind of suggestion. In this paper, we propose a general method for query suggestion by combining two graphs: (1) a query click graph which captures the relationship between queries frequently clicked on common URLs and (2) a query text similarity graph which finds the similarity between two queries using the Jaccard similarity. The proposed method provides literally as well as semantically relevant queries for users’ needs. Simulation results show that the proposed algorithm outperforms the heat diffusion method by providing more relevant queries. It can be used for recommendation tasks such as query, image, and product suggestion.
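The text-similarity edges can be illustrated in a few lines; the tokenization and threshold below are assumptions, and the click graph would contribute additional edges between queries sharing frequently clicked URLs.

```python
def jaccard(q1: str, q2: str) -> float:
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

queries = ["cheap flight tickets", "cheap air tickets", "machine learning"]
edges = [
    (q1, q2, jaccard(q1, q2))
    for i, q1 in enumerate(queries)
    for q2 in queries[i + 1:]
    if jaccard(q1, q2) >= 0.3          # illustrative threshold
]
print(edges)  # [('cheap flight tickets', 'cheap air tickets', 0.5)]
```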

D. Sejal, K. G. Shailesh, V. Tejaswi, Dinesh Anvekar, K. R. Venugopal, S. S. Iyengar, L. M. Patnaik
Offline Writer Identification in Tamil Using Bagged Classification Trees

In this paper, we explore the effectiveness of bagged classification trees in solving the writer identification problem in the Tamil language. Unlike in other languages, in Tamil the writer identification problem is mostly unexplored. Novel feature extraction methods tailored to better capture Tamil characters are proposed. The feature extraction methods used in this paper are chosen after analysing the statistical spread of each feature across different handwriting classes. We have also analysed how increasing the number of bagged classification trees affects the classification accuracy. Our learning algorithm is trained with one hundred and forty-four samples and tested with twenty different samples per handwriting style. In total the algorithm is trained with ten different handwriting styles. Using the proposed features and bagged classification trees, we achieve 76.4 % accuracy. The practicality of the proposed method is also analysed using a few time-consumption measures.
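A generic scikit-learn sketch of bagged classification trees and of varying the ensemble size; random features stand in for the paper’s hand-crafted Tamil features, so the scores are meaningless except as a usage pattern.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(160, 24))            # stand-in for extracted features
y = rng.integers(0, 10, size=160)         # ten handwriting styles, as in the paper

for n_trees in (10, 50, 100):             # effect of ensemble size on accuracy
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_trees,
                            random_state=0)
    score = cross_val_score(clf, X, y, cv=4).mean()
    print(n_trees, "trees:", round(score, 3))
```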

Sudarshan Babu

Applications of Data Mining

Frontmatter
Data Analysis for Courses Registration

Data mining is a knowledge discovery process to extract interesting, previously unknown, potentially useful and non-trivial patterns from large repositories of data. There is currently increasing interest in data mining in educational systems, making it a growing new research community. This paper applies a frequent-pattern extraction approach to analyze the distribution of courses in universities, where there are core and elective courses. The system analyzes the data that is stored in the department’s database. The objective is to determine whether the allocation of courses is appropriate, i.e., when courses are likely to be taken in the same semester by most students. A workflow is proposed, where the data is assumed to be collected over many semesters for already graduated students. A case study is presented and results are summarized. The results show the importance of the proposed system for analyzing course registration in a given department.
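The underlying frequent-pattern idea can be sketched with plain pair counting over per-semester “baskets”; the course codes and the support threshold are invented.

```python
from collections import Counter
from itertools import combinations

semesters = [
    {"CS101", "MATH110", "PHYS120"},     # courses one student took together
    {"CS101", "MATH110"},
    {"CS101", "MATH110", "ENG101"},
    {"PHYS120", "ENG101"},
]
min_support = 0.5                         # fraction of semester records

pair_counts = Counter()
for basket in semesters:
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

frequent = {tuple(sorted(p)): c / len(semesters)
            for p, c in pair_counts.items()
            if c / len(semesters) >= min_support}
print(frequent)  # {('CS101', 'MATH110'): 0.75}
```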

Nada Alzahrani, Rasha Alsulim, Nourah Alaseem, Ghada Badr
Learning the Relationship Between Corporate Governance and Company Performance Using Data Mining

The objective of this paper is to identify the relationship between corporate governance variables and firm performance by employing data mining methods. We choose two dependent variables, Tobin’s Q ratio and the Altman Z-score, as measures of the companies’ performances and apply machine learning techniques to the data collected from the component companies of three major stock indexes: S&P 500, STOXX Europe 600 and STOXX Eastern Europe 300. We use decision trees and logistic regressions as learning algorithms, and then we compare their performances. For the US components, we found a positive connection between the presence of women on the board and company performance, while for Western Europe we found that it is better to employ a larger audit committee in order to lower the bankruptcy risk. An independent chairperson is a positive factor related to the Altman Z-score for the companies from Eastern Europe.

Darie Moldovan, Simona Mutu
A Bayesian Approach to Sparse Learning-to-Rank for Search Engine Optimization

Search engine optimization (SEO) is the process of affecting the visibility of a web page in the engine’s search results. SEO specialists must understand how search engines work and which features of the web page affect its position in the search results. This paper employs machine learning ranking algorithms to construct the ranking model of a web-search engine. Ranking a set of retrieved documents according to their relevance to a given query has become a popular problem at the intersection of web search, machine learning and information retrieval. Feature selection in learning to rank has recently emerged as a crucial issue. Recent work on ranking has focused on a number of different paradigms, namely point-wise, pair-wise, and list-wise approaches, for which several preprocessing feature selection methods have been proposed. Unfortunately, only a few works have focused on integrating feature selection into the learning process, and all of these embedded methods are based on the $l_1$ regularization technique. This type of regularization does not possess many properties essential for SEO, such as unbiasedness, the grouping effect and the oracle property. In this paper we suggest a new Bayesian framework for feature selection in the learning-to-rank problem. The proposed approach gives a strong probabilistic statement of the shrinkage criterion for feature selection. The proposed regularization is unbiased, has the grouping and oracle properties, and its maximal risk diverges to a finite value. Experimental results show that the proposed framework is competitive on both artificial data and publicly available LETOR data sets.

Olga Krasotkina, Vadim Mottl
Data Driven Geometry for Learning

High dimensional covariate information provides a detailed description of the individuals involved in a machine learning and classification problem. The inter-dependence patterns among these covariate vectors may be unknown to researchers. This fact is not well recognized in the classic and modern machine learning literature; most popular model-based algorithms are implemented using some version of the dimension-reduction approach or even impose a built-in complexity penalty. This is a defensive attitude toward high dimensionality. In contrast, an accommodating attitude can exploit the potential inter-dependence patterns embedded within the high dimensionality. In this research, we implement this latter attitude throughout by first computing the similarity between data nodes and then discovering pattern information in the form of ultrametric tree geometry among almost all the covariate dimensions involved. We illustrate with real microarray datasets, where we demonstrate that such dual relationships are indeed class specific, each precisely representing the discovery of a biomarker. The whole collection of computed biomarkers constitutes a global feature-matrix, which is then shown to give rise to a very effective learning algorithm.

Elizabeth P. Chou
Mining Educational Data to Predict Students’ Academic Performance

Data mining is the process of extracting useful information from a huge amount of data. One of the most common applications of data mining is the use of different algorithms and tools to estimate future events based on previous experiences. In this context, many researchers have been using data mining techniques to support and solve challenges in higher education. There are many challenges facing this level of education, one of which is helping students to choose the right course to improve their success rate. An early prediction of students’ grades may help to solve this problem and improve students’ performance, selection of courses, success rate and retention. In this paper we use different classification techniques in order to build a performance prediction model, which is based on previous students’ academic records. The model can be easily integrated into a recommender system that can help students in their course selection, based on their and other graduated students’ grades. Our model uses two of the most recognised decision tree classification algorithms: ID3 and J48. The advantages of such a system have been presented along with a comparison in performance between the two algorithms.

Mona Al-Saleem, Norah Al-Kathiry, Sara Al-Osimi, Ghada Badr
Patient-Specific Modeling of Medical Data

Patient-specific models are instance-based learning algorithms that take advantage of the particular features of the patient case at hand to predict an outcome. We introduce two patient-specific algorithms based on the decision tree paradigm that use the AUC as the metric to select an attribute. We apply the patient-specific algorithms to predict outcomes in several datasets, including medical datasets. Compared to the standard entropy-based method, the AUC-based patient-specific decision path models performed equivalently on area under the ROC curve (AUC). Our results provide support for patient-specific methods being a promising approach for making clinical predictions.
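A sketch of the attribute-selection metric on synthetic data: score each attribute by the AUC of its raw values against the outcome and take the most discriminative one, as an AUC-based decision-path learner might at each step. All details are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 2] += 1.5 * y                        # make attribute 2 informative

# AUC is symmetric around 0.5, so score by distance from a random ranking.
aucs = [roc_auc_score(y, X[:, j]) for j in range(X.shape[1])]
best = int(np.argmax([abs(a - 0.5) for a in aucs]))
print("best attribute:", best, "AUC:", round(aucs[best], 3))  # expect 2
```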

Guilherme Alberto Sousa Ribeiro, Alexandre Cesar Muniz de Oliveira, Antonio Luiz S. Ferreira, Shyam Visweswaran, Gregory F. Cooper
A Bayesian Approach to Sparse Cox Regression in High-Dimensional Survival Analysis

Survival prediction and prognostic factor identification play an important role in machine learning research. This paper employs machine learning regression algorithms for constructing survival models. The paper suggests a new Bayesian framework for feature selection in high-dimensional Cox regression problems. The proposed approach gives a strong probabilistic statement of the shrinkage criterion for feature selection. The proposed regularization gives estimates that are unbiased, possess the grouping and oracle properties, and whose maximal risk diverges to a finite value. Experimental results show that the proposed framework is competitive on both simulated data and publicly available real data sets.

Olga Krasotkina, Vadim Mottl

Data Mining in System Biology, Drug Discovery, and Medicine

Frontmatter
Automatic Cell Tracking and Kinetic Feature Description of Cell Paths for Image Mining

Live-cell assays are used to study dynamic functional cellular processes in High-Content Screening (HCS) for drug discovery or in computational biology experiments. The large amount of image data created during the screening requires automatic image-analysis procedures that can describe these dynamic processes. One class of tasks in this application is the tracking of cells. We describe in this paper a fast and robust cell tracking algorithm applied to High-Content Screening in drug discovery or computational biology experiments. We developed a similarity-based tracking algorithm that can track the cells without an initialization phase for the parameters of the tracker. The similarity-based detection algorithm is robust enough to find similar cells even though small changes in the cell morphology have occurred. The cell tracking algorithm can track normal cells as well as mitotic cells by classifying the cells based on our previously developed texture classifier. Results for the cell path are given on a test series from a real drug discovery process. We present the path of the cell and the low-level features that describe the path of the cell. This information can be used for further image mining of high-level descriptions of the kinetics of the cells.
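A hedged sketch of similarity-based frame-to-frame linking, reduced here to centroid distance with a gating threshold; the actual algorithm matches cells by richer similarity measures and handles mitosis via the texture classifier mentioned above.

```python
import numpy as np

def link_frames(cells_t, cells_t1, max_dist=15.0):
    """cells_t, cells_t1: arrays of (x, y) centroids; returns index pairs."""
    links = []
    used = set()
    for i, c in enumerate(cells_t1):
        d = np.linalg.norm(cells_t - c, axis=1)
        j = int(d.argmin())
        if d[j] <= max_dist and j not in used:   # gate + one-to-one constraint
            links.append((j, i))
            used.add(j)
    return links

frame_t  = np.array([[10.0, 12.0], [40.0, 41.0]])
frame_t1 = np.array([[12.0, 13.0], [39.0, 44.0], [80.0, 80.0]])  # new cell appears
print(link_frames(frame_t, frame_t1))  # [(0, 0), (1, 1)]; cell 2 starts a new path
```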

Petra Perner
Backmatter
Metadata
Title
Machine Learning and Data Mining in Pattern Recognition
Edited by
Petra Perner
Copyright Year
2015
Electronic ISBN
978-3-319-21024-7
Print ISBN
978-3-319-21023-0
DOI
https://doi.org/10.1007/978-3-319-21024-7