Sentiment Knowledge Discovery in Twitter Streaming Data

Micro-blogs are a challenging new source of information for data mining techniques. Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short, generated constantly, and well suited to knowledge discovery using data stream mining. We briefly discuss the challenges that Twitter data streams pose, focusing on classification problems, and then consider these streams for opinion mining and sentiment analysis. To deal with streaming unbalanced classes, we propose a sliding window Kappa statistic for evaluation in time-changing data streams. Using this statistic we perform a study on Twitter data using learning algorithms for data streams.
Albert Bifet, Eibe Frank
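
The sliding-window evaluation idea above can be sketched in a few lines: maintain a fixed-size window of (predicted, actual) class pairs and compute the Kappa statistic over it, so that chance agreement under class imbalance is accounted for as the stream evolves. This is a minimal illustrative sketch, not the authors' implementation; the default window size is an arbitrary assumption.

```python
from collections import deque

class SlidingKappa:
    """Kappa statistic over a sliding window of (predicted, actual) pairs."""

    def __init__(self, window_size=1000):
        self.window = deque(maxlen=window_size)  # old pairs fall out automatically

    def add(self, predicted, actual):
        self.window.append((predicted, actual))

    def kappa(self):
        n = len(self.window)
        if n == 0:
            return 0.0
        classes = {c for pair in self.window for c in pair}
        pred_counts = {c: 0 for c in classes}
        true_counts = {c: 0 for c in classes}
        agree = 0
        for p, a in self.window:
            pred_counts[p] += 1
            true_counts[a] += 1
            if p == a:
                agree += 1
        p0 = agree / n                                   # observed agreement
        pc = sum(pred_counts[c] * true_counts[c]
                 for c in classes) / (n * n)             # chance agreement
        if pc == 1.0:                                    # degenerate window
            return 0.0
        return (p0 - pc) / (1 - pc)
```

A constant predictor on an unbalanced stream scores Kappa near zero even when its raw accuracy is high, which is the point of using this statistic for evaluation.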

A Similarity-Based Adaptation of Naive Bayes for Label Ranking: Application to the Metalearning Problem of Algorithm Recommendation

The problem of learning label rankings is receiving increasing attention from several research communities. A number of common learning algorithms have been adapted for this task, including k-Nearest Neighbours (k-NN) and decision trees. Following this line, we propose an adaptation of the naive Bayes classification algorithm for the label ranking problem. Our main idea lies in the use of similarity between the rankings to replace the concept of probability. We empirically test the proposed method on some metalearning problems that consist of relating characteristics of learning problems to the relative performance of learning algorithms. Our method generally performs better than the baseline indicating that it is able to identify some of the underlying patterns in the data.
Artur Aiguzhinov, Carlos Soares, Ana Paula Serra

Topology Preserving SOM with Transductive Confidence Machine

We propose a novel topology preserving self-organized map (SOM) classifier with transductive confidence machine (TPSOM-TCM). Typically, SOM acts as a dimension reduction tool for mapping training samples from a high-dimensional input space onto a neuron grid. However, current SOM-based classifiers cannot provide degrees of classification reliability for new unlabeled samples, which makes them difficult to use in risk-sensitive applications where incorrect predictions may result in serious consequences. Our method extends a typical SOM classifier to supply such reliability degrees. To achieve this objective, we define a nonconformity measurement with which a randomness test can predict how nonconforming a new unlabeled sample is with respect to the training samples. In addition, we notice that the definition of the nonconformity measurement depends more on the quality of topology preservation than on the reduction of quantization error. We thus incorporate the grey relation coefficient (GRC) into the calculation of neighborhood radii to improve topology preservation without increasing the quantization error. Our method improves the time efficiency of a previous method, kNN-TCM, when the number of samples is large. Extensive experiments on both the UCI and KDDCUP 99 data sets show the effectiveness of our method.
Bin Tong, ZhiGuang Qin, Einoshin Suzuki
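
The reliability degrees described above come from the transductive-confidence-machine randomness test. A generic sketch of the p-value computation (not the paper's SOM-specific nonconformity measure) looks like this:

```python
def conformal_p_value(calibration_scores, new_score):
    """Randomness-test p-value from nonconformity scores, as used in
    transductive confidence machines. A small p-value means the new
    sample is highly nonconforming with respect to the training samples.
    """
    n = len(calibration_scores)
    ge = sum(1 for a in calibration_scores if a >= new_score)
    return (ge + 1) / (n + 1)

# One common nonconformity score for a labeled sample (kNN-TCM style, an
# assumption here) is the ratio of its distance to the nearest prototype of
# the same class over its distance to the nearest prototype of any other
# class: larger means more nonconforming.
```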

An Artificial Experimenter for Enzymatic Response Characterisation

Identifying the characteristics of biological systems through physical experimentation is restricted by the resources available, which are limited in comparison to the size of the parameter spaces being investigated. New tools are required to assist scientists in the effective characterisation of such behaviours. By combining artificial intelligence techniques for active experiment selection with a microfluidic experimentation platform that reduces the volumes of reactants required per experiment, a fully autonomous experimentation machine is in development to assist biological response characterisation. Part of this machine, an artificial experimenter, has been designed to automatically propose hypotheses and then determine experiments that test those hypotheses and explore the parameter space. Using a multiple-hypotheses approach that allows representative models of response behaviours to be produced with few observations, the artificial experimenter has been employed in a laboratory setting, where it selected experiments for a human scientist to perform to investigate the optical absorbance properties of NADH.
Chris Lovell, Gareth Jones, Steve R. Gunn, Klaus-Peter Zauner

Subgroup Discovery for Election Analysis: A Case Study in Descriptive Data Mining

In this paper, we investigate the application of descriptive data mining techniques, namely subgroup discovery, for the purpose of the ad-hoc analysis of election results. Our inquiry is based on the 2009 German federal Bundestag election (restricted to the City of Cologne) and additional socio-economic information about Cologne’s polling districts. The task is to describe relations between socio-economic variables and the votes in order to summarize interesting aspects of the voting behavior. Motivated by the specific challenges of election data analysis we propose novel quality functions and visualizations for subgroup discovery.
Henrik Grosskreutz, Mario Boley, Maike Krause-Traudes

On Enumerating Frequent Closed Patterns with Key in Multi-relational Data

We study the problem of mining closed patterns in multi-relational databases. Garriga et al. (IJCAI’07) proposed an algorithm RelLCM2 for mining closed patterns (i.e., conjunctions of literals) in multi-relational data, which is an extension of LCM, an efficient enumeration algorithm for frequent closed itemsets mining proposed in the seminal paper by Uno et al. (DS’04). We assume that a database considered contains a special predicate called key (or target), which determines the entities of interest and what is to be counted. We introduce a notion of closed patterns with key (key-closedness for short), where variables in a pattern other than the one in a key predicate are considered to be existentially quantified, and they are linked to a given target object. We then define a closure operation (key-closure) for computing key-closed patterns, and show that the difference between the semantics of key-closed patterns and that of the closed patterns in RelLCM2 implies different properties of the closure operations; in particular, the uniqueness of closure does not hold for key-closure. Nevertheless, we show that we can enumerate key-closed patterns using the technique of ppc-extensions à la LCM, thereby making the enumeration possible without storage space for previously generated patterns. We also propose a literal order designed for mining key-closed patterns, which will require less search space. The correctness of our algorithm is shown, and its computational complexity is discussed. Some preliminary experimental results are also given.
Hirohisa Seki, Yuya Honda, Shinya Nagano

Why Text Segment Classification Based on Part of Speech Feature Selection

The aim of our research is to develop a scalable automatic why-question answering system for English based on a supervised method that uses part-of-speech analysis. The prior approach consisted of building a why-classifier using function words. This paper investigates the performance of combining supervised data mining methods with various feature selection strategies in order to obtain a more accurate why-classifier. Feature selection was performed a priori on the dataset to extract representative verbs and/or nouns and avoid the curse of dimensionality. LogitBoost and SVM were used for the classification process. Three methods of extending the initial “function words only” approach to handle context-dependent features are proposed and experimentally evaluated on various datasets. The first considers function words and context-independent adverbs; the second incorporates selected lemmatized verbs; the third contains selected lemmatized verbs and nouns. Experiments on web-extracted datasets showed that all methods performed better than the baseline, with slightly more reliable results for the third one.
Iulia Nagy, Katsuyuki Tanaka, Yasuo Ariki

Speeding Up and Boosting Diverse Density Learning

In multi-instance learning, each example is described by a bag of instances instead of a single feature vector. In this paper, we revisit the idea of performing multi-instance classification based on a point-and-scaling concept by searching for the point in instance space with the highest diverse density. This is a computationally expensive process, and we describe several heuristics designed to improve runtime. Our results show that simple variants of existing algorithms can be used to find diverse density maxima more efficiently. We also show how significant increases in accuracy can be obtained by applying a boosting algorithm with a modified version of the diverse density algorithm as the weak learner.
James R. Foulds, Eibe Frank
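
The diverse density objective being maximized can be illustrated with the standard noisy-OR formulation: a candidate point scores highly when every positive bag has at least one nearby instance and no negative instance is nearby. This is a generic sketch of the classic objective, with per-feature scaling factors as in the point-and-scaling concept; it is not the authors' code.

```python
import math

def diverse_density(point, scales, pos_bags, neg_bags):
    """Noisy-OR diverse density of a candidate point in instance space.

    Bags are lists of instances; each instance is a list of floats of the
    same length as `point`. `scales` weights each feature dimension.
    """
    def instance_prob(x):
        # Probability that instance x is "close" to the candidate point.
        d2 = sum((s * (xi - ti)) ** 2
                 for s, xi, ti in zip(scales, x, point))
        return math.exp(-d2)

    dd = 1.0
    for bag in pos_bags:   # at least one instance should be close (noisy OR)
        dd *= 1.0 - math.prod(1.0 - instance_prob(x) for x in bag)
    for bag in neg_bags:   # no instance should be close
        dd *= math.prod(1.0 - instance_prob(x) for x in bag)
    return dd
```

Maximizing this function over `point` and `scales` is the expensive search the paper's heuristics accelerate; gradient ascent from multiple starting instances is the usual baseline.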

Incremental Learning of Cellular Automata for Parallel Recognition of Formal Languages

Parallel language recognition by cellular automata (CAs) is currently an important subject in computation theory. This paper describes incremental learning of one-dimensional, bounded, one-way, cellular automata (OCAs) that recognize formal languages from positive and negative sample strings. The objectives of this work are to develop automatic synthesis of parallel systems and to contribute to the theory of real-time recognition by cellular automata. We implemented methods to learn the rules of OCAs in the Occam system, which is based on grammatical inference of context-free grammars (CFGs) implemented in Synapse. An important feature of Occam is incremental learning by a rule generation mechanism called bridging and the search for rule sets. The bridging looks for and fills gaps in incomplete space-time transition diagrams for positive samples. Another feature of our approach is that the system synthesizes minimal or semi-minimal rule sets of CAs. This paper reports experimental results on learning several OCAs for fundamental formal languages including sets of balanced parentheses and palindromes as well as the set {aⁿbⁿcⁿ | n ≥ 1}.
Katsuhiko Nakamura, Keita Imada

Sparse Substring Pattern Set Discovery Using Linear Programming Boosting

In this paper, we consider finding a small set of substring patterns which classifies the given documents well. We formulate the problem as a 1-norm soft margin optimization problem where each dimension corresponds to a substring pattern. Then we solve this problem using LPBoost and an optimal substring discovery algorithm. Since the problem is a linear program, the resulting solution is likely to be sparse, which is useful for feature selection. We evaluate the proposed method on real data such as movie reviews.
Kazuaki Kashihara, Kohei Hatano, Hideo Bannai, Masayuki Takeda

Discovery of Super-Mediators of Information Diffusion in Social Networks

We address the problem of discovering a different kind of influential node, which we call a “super-mediator”, i.e., a node which plays an important role in passing information to other nodes, and propose a method for discovering super-mediators from information diffusion samples without using a network structure. We divide the diffusion sequences into two groups (lower and upper), each assumed to follow some probability distribution, find the best split by maximizing the likelihood, and rank the nodes in the upper sequences by the F-measure. We apply this measure to information diffusion samples generated by two real networks, and identify and rank the super-mediator nodes. We show that the high-ranked super-mediators are also the high-ranked influential nodes when the diffusion probability is large, i.e., the influential nodes also play the role of super-mediator for the other source nodes. Interestingly, when the high-ranked super-mediators differ from the top-ranked influential nodes, which is the case when the diffusion probability is small, those super-mediators become the high-ranked influential nodes when the diffusion probability becomes larger. This finding will be useful for predicting the influential nodes for an unexperienced spread of new information, e.g., the spread of a new acute contagion.
Kazumi Saito, Masahiro Kimura, Kouzou Ohara, Hiroshi Motoda

Integer Linear Programming Models for Constrained Clustering

We address the problem of building a clustering as a subset of a (possibly large) set of candidate clusters under user-defined constraints. In contrast to most approaches to constrained clustering, we do not constrain the way observations can be grouped into clusters, but the way candidate clusters can be combined into suitable clusterings. The constraints may concern the type of clustering (e.g., complete clusterings, overlapping or encompassing clusters) and the composition of clusterings (e.g., certain clusters excluding others). In the paper, we show that these constraints can be translated into integer linear programs, which can be solved by standard optimization packages. Our experiments with benchmark and real-world data investigate the quality of the clusterings and the running times depending on a variety of parameters.
Marianne Mueller, Stefan Kramer
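
The idea of selecting candidate clusters under clustering-level constraints can be illustrated with a tiny brute-force stand-in for the ILP: one binary variable per candidate cluster and one covering constraint per observation (here, the "complete, non-overlapping clustering" case). A real instance would be handed to a standard ILP solver; the cost model below is an illustrative assumption.

```python
from itertools import combinations

def best_complete_clustering(universe, candidates, cost):
    """Pick a subset of candidate clusters that exactly covers `universe`
    (each observation in exactly one cluster) with minimal total cost.

    Brute-force stand-in for the ILP: enumerate subsets instead of solving
    min sum_j cost[j]*x_j s.t. sum_{j: i in C_j} x_j = 1, x_j in {0,1}.
    Only practical for tiny instances.
    """
    best, best_cost = None, float("inf")
    for r in range(1, len(candidates) + 1):
        for subset in combinations(range(len(candidates)), r):
            covered = []
            for j in subset:
                covered.extend(candidates[j])
            # Exact cover: every observation appears exactly once.
            if sorted(covered) == sorted(universe):
                c = sum(cost[j] for j in subset)
                if c < best_cost:
                    best, best_cost = subset, c
    return best, best_cost
```

Other constraint types in the abstract (overlap allowed, mutual exclusion of clusters) correspond to changing the covering constraint to an inequality or adding constraints like x_a + x_b <= 1.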

Efficient Visualization of Document Streams

In machine learning and data mining, multidimensional scaling (MDS) and MDS-like methods are extensively used for dimensionality reduction and for gaining insights into overwhelming amounts of data through visualization. With the growth of the Web and the activities of Web users, the amount of data not only grows exponentially but is also becoming available in the form of streams, where new data instances constantly flow into the system, requiring the algorithm to update the model in near-real time. This paper presents an algorithm for document stream visualization through an MDS-like distance-preserving projection onto a 2D canvas. The visualization algorithm is essentially a pipeline employing several methods from machine learning. Experimental verification shows that each stage of the pipeline is able to process a batch of documents in constant time. It is shown that in the experimental setting, with a limited buffer capacity and a constant document batch size, it is possible to process roughly 2.5 documents per second, which corresponds to approximately 25% of the entire blogosphere rate and should be sufficient for most real-life applications.
Miha Grčar, Vid Podpečan, Matjaž Juršič, Nada Lavrač

Bridging Conjunctive and Disjunctive Search Spaces for Mining a New Concise and Exact Representation of Correlated Patterns

In the literature, many works have addressed the mining of frequent patterns. Unfortunately, these patterns do not convey complete information about the degree of correlation among the items that constitute a given pattern, since they are based mainly on frequency of appearance. In this situation, many correlation measures have been proposed in order to convey information on the dependencies within sets of items. In this work, we adopt the correlation measure bond, which provides several interesting properties. Motivated by the fact that the number of correlated patterns is often huge while many of them are redundant, we propose a new exact concise representation of the frequent correlated patterns associated with this measure, through the definition of a new closure operator. The proposed representation allows not only the correlation rate of a given pattern to be derived efficiently, but also its conjunctive, disjunctive and negative supports to be recovered exactly. To prove the utility of our approach, we undertake an empirical study on several benchmark data sets that are commonly used within the data mining community.
Nassima Ben Younes, Tarek Hamrouni, Sadok Ben Yahia

Graph Classification Based on Optimizing Graph Spectra

Kernel methods such as the SVM are becoming increasingly popular due to their high performance in graph classification. In this paper, we propose a novel graph kernel, called SPEC, based on graph spectra and the Interlace Theorem, as well as an algorithm, called OPTSPEC, to optimize the SPEC kernel used in an SVM for graph classification. The fundamental performance of the method is evaluated using artificial datasets, and its practicality confirmed through experiments using a real-world dataset.
Nguyen Duy Vinh, Akihiro Inokuchi, Takashi Washio

Algorithm for Detecting Significant Locations from Raw GPS Data

We present a fast algorithm for probabilistically extracting significant locations from raw GPS data based on data point density. Extracting significant locations from raw GPS data is the first essential step of algorithms designed for location-aware applications. Assuming that a location is significant if users spend a certain time around that area, most current algorithms compare spatial/temporal variables, such as stay duration and roaming diameter, with given fixed thresholds to extract significant locations. However, the appropriate threshold values are not known a priori, and algorithms with fixed thresholds are inherently error-prone, especially under high noise levels. Moreover, for N data points, they are generally O(N²) algorithms, since distance computation is required. We developed a fast algorithm for selective sampling of data points around significant locations based on density information, constructing random histograms using locality sensitive hashing. Evaluations show competitive performance in detecting significant locations even under high noise levels.
Nobuharu Kami, Nobuyuki Enomoto, Teruyuki Baba, Takashi Yoshikawa
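
The density-based sampling idea can be sketched with randomly offset grids acting as a simple locality-sensitive hash: each grid buckets all points in linear time, and voting over several grids softens cell-boundary effects. This is an illustrative approximation of the random-histogram construction, not the authors' algorithm; all parameter values are assumptions.

```python
import random
from collections import Counter

def dense_point_mask(points, cell_size=0.001, num_grids=8,
                     min_count=20, seed=0):
    """Flag GPS points lying in dense regions using randomly offset grids.

    Each grid hashes a point to an integer cell; a point gets one vote per
    grid whose cell holds at least `min_count` points. Runs in O(N * grids),
    avoiding pairwise distance computation.
    """
    rng = random.Random(seed)
    offsets = [(rng.uniform(0, cell_size), rng.uniform(0, cell_size))
               for _ in range(num_grids)]
    votes = [0] * len(points)
    for ox, oy in offsets:
        cell = lambda x, y: (int((x + ox) / cell_size),
                             int((y + oy) / cell_size))
        counts = Counter(cell(x, y) for x, y in points)
        for i, (x, y) in enumerate(points):
            if counts[cell(x, y)] >= min_count:
                votes[i] += 1
    # Keep a point if a majority of grids call its cell dense.
    return [v > num_grids // 2 for v in votes]
```

The surviving dense points can then be clustered to produce the candidate significant locations.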

Discovery of Conservation Laws via Matrix Search

One of the main goals of Discovery Science is the development and analysis of methods for automatic knowledge discovery in the natural sciences. A central area of natural science research concerns reactions: how entities in a scientific domain interact to generate new entities. Classic AI research due to Valdés-Pérez, Żytkow, Langley and Simon has shown that many scientific discovery tasks that concern reaction models can be formalized as a matrix search. In this paper we present a method for finding conservation laws, based on two criteria for selecting a conservation law matrix: (1) maximal strictness: rule out as many unobserved reactions as possible, and (2) parsimony: minimize the L1-norm of the matrix. We provide an efficient and scalable minimization method for the joint optimization of criteria (1) and (2). For empirical evaluation, we applied the algorithm to known particle accelerator data of the type that are produced by the Large Hadron Collider in Geneva. It matches the important Standard Model of particles that physicists have constructed through decades of research: the program rediscovers Standard Model conservation laws and the corresponding particle families of baryon, muon, electron and tau number. The algorithm also discovers the correct molecular structure of a set of chemical substances.
Oliver Schulte, Mark S. Drew

Gaussian Clusters and Noise: An Approach Based on the Minimum Description Length Principle

We introduce a well-grounded minimum description length (MDL) based quality measure for a clustering consisting of either spherical or axis-aligned normally distributed clusters and a cluster with a uniform distribution in an axis-aligned rectangular box. The uniform component extends the practical usability of the model e.g. in the presence of noise, and using the MDL principle for the model selection makes comparing the quality of clusterings with a different number of clusters possible. We also introduce a novel search heuristic for finding the best clustering with an unknown number of clusters. The heuristic is based on the idea of moving points from the Gaussian clusters to the uniform one and using MDL for determining the optimal amount of noise. Tests with synthetic data having a clear cluster structure imply that the search method is effective in finding the intuitively correct clustering.
Panu Luosto, Jyrki Kivinen, Heikki Mannila

Exploiting Code Redundancies in ECOC

We study an approach for speeding up the training of error-correcting output codes (ECOC) classifiers. The key idea is to avoid unnecessary computations by exploiting the overlap of the different training sets in the ECOC ensemble. Instead of re-training each classifier from scratch, classifiers that have been trained for one task can be adapted to related tasks in the ensemble. The crucial issue is the identification of a schedule for training the classifiers which maximizes the exploitation of the overlap. For solving this problem, we construct a classifier graph in which the nodes correspond to the classifiers, and the edges represent the training complexity for moving from one classifier to the next in terms of the number of added training examples. The solution of the Steiner Tree problem is an arborescence in this graph which describes the learning scheme with the minimal total training complexity. We experimentally evaluate the algorithm with Hoeffding trees, as an example for incremental learners where the classifier adaptation is trivial, and with SVMs, where we employ an adaptation strategy based on adapted caching and weight reuse, which guarantees that the learned model is the same as the one obtained by batch learning.
Sang-Hyeun Park, Lorenz Weizsäcker, Johannes Fürnkranz
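
The classifier graph described above can be sketched as follows: compute each column's training set from the code matrix, then weight each directed edge by the number of examples that must be added when adapting one classifier into another. This is a minimal sketch under assumed data structures; the minimum-arborescence (Steiner tree) step the paper performs over this graph is omitted.

```python
def ecoc_training_sets(code_matrix, class_examples):
    """Training set (a set of example ids) for each ECOC column.

    `code_matrix[c][j]` is the code bit of class c in binary task j;
    an entry of 0 means the class is ignored by that task.
    """
    n_classes = len(code_matrix)
    n_cols = len(code_matrix[0])
    sets = []
    for j in range(n_cols):
        s = set()
        for c in range(n_classes):
            if code_matrix[c][j] != 0:
                s |= class_examples[c]
        sets.append(s)
    return sets

def adaptation_edges(train_sets):
    """Directed edge weights of the classifier graph: the number of
    training examples added when adapting classifier u into classifier v.
    """
    edges = {}
    for u, su in enumerate(train_sets):
        for v, sv in enumerate(train_sets):
            if u != v:
                edges[(u, v)] = len(sv - su)
    return edges
```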

Concept Convergence in Empirical Domains

How to achieve shared meaning is a significant issue when more than one intelligent agent is involved in the same domain. We define the task of concept convergence, by which intelligent agents can achieve a shared, agreed-upon meaning of a concept (restricted to empirical domains). For this purpose we present a framework that, integrating computational argumentation and inductive concept learning, allows a pair of agents to (1) learn a concept in an empirical domain, (2) argue about the concept’s meaning, and (3) reach a shared agreed-upon concept definition. We apply this framework to marine sponges, a biological domain where the actual definitions of concepts such as orders, families and species are currently open to discussion. An experimental evaluation on marine sponges shows that concept convergence is achieved, within a reasonable number of interchanged arguments, and reaching short and accurate definitions (with respect to precision and recall).
Santiago Ontañón, Enric Plaza

Equation Discovery for Model Identification in Respiratory Mechanics of the Mechanically Ventilated Human Lung

Lung protective ventilation strategies reduce the risk of ventilator-associated lung injury. To develop such strategies, knowledge about the mechanical properties of the mechanically ventilated human lung is essential. This study was designed to develop an equation discovery system that identifies mathematical models of the respiratory system in time-series data obtained from mechanically ventilated patients. Two techniques were combined: (i) the use of declarative bias to reduce the complexity of the search space while inherently providing for the processing of background knowledge, and (ii) a newly developed heuristic for traversing the hypothesis space with a greedy, randomized strategy analogous to the GSAT algorithm. In 96.8% of all runs the equation discovery system was able to detect the well-established equation-of-motion model of the respiratory system in the provided data. We see the potential of this semi-automatic approach to detect more complex mathematical descriptions of the respiratory system from respiratory data.
Steven Ganzert, Josef Guttmann, Daniel Steinmann, Stefan Kramer

Mining Class-Correlated Patterns for Sequence Labeling

Sequence labeling is the task of assigning a label sequence to an observation sequence. Since many methods to solve this problem depend on the specification of predictive features, automated methods for their derivation are desirable. Unlike in other areas of pattern-based classification, however, no algorithm to directly mine class-correlated patterns for sequence labeling has been proposed so far. We introduce the novel task of mining class-correlated sequence patterns for sequence labeling and present a supervised pattern growth algorithm to find all patterns in a set of observation sequences, which correlate with the assignment of a fixed sequence label no less than a user-specified minimum correlation constraint. From the resulting set of patterns, features for a variety of classifiers can be obtained in a straightforward manner. The efficiency of the approach and the influence of important parameters are shown in experiments on several biological datasets.
Thomas Hopf, Stefan Kramer
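
The minimum-correlation constraint can be illustrated with the phi coefficient between pattern occurrence and a binary label, computed from a 2x2 contingency table. This generic stand-in checks whole-sequence occurrence rather than the position-wise labeling the paper targets; it is a sketch, not the proposed pattern growth algorithm.

```python
import math

def pattern_label_correlation(sequences, labels, pattern):
    """Phi (Pearson) correlation between occurrence of `pattern` and a
    binary label over a set of observation sequences.
    """
    a = b = c = d = 0  # 2x2 contingency table: pattern presence x label
    for seq, y in zip(sequences, labels):
        hit = pattern in seq
        if hit and y:
            a += 1
        elif hit:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return 0.0 if denom == 0 else (a * d - b * c) / denom
```

A pattern miner would keep only patterns whose correlation meets the user-specified minimum, pruning the pattern-growth search with an upper bound on achievable correlation.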

ESTATE: Strategy for Exploring Labeled Spatial Datasets Using Association Analysis

We propose an association analysis-based strategy for the exploration of multi-attribute spatial datasets possessing a naturally arising classification. The proposed strategy, ESTATE (Exploring Spatial daTa Association patTErns), inverts this classification by interpreting the different classes found in the dataset in terms of sets of discriminative patterns of its attributes. It consists of several core steps, including discriminative data mining, similarity between transactional patterns, and visualization. An algorithm for calculating the similarity measure between patterns is the major original contribution; it facilitates the summarization of discovered information and makes the entire framework practical for real-life applications. A detailed description of the ESTATE framework is followed by its application to the domain of ecology, using a dataset that fuses information on the geographical distribution of the biodiversity of bird species across the contiguous United States with the distributions of 32 environmental variables across the same area.
Tomasz F. Stepinski, Josue Salazar, Wei Ding, Denis White

Adapted Transfer of Distance Measures for Quantitative Structure-Activity Relationships

Quantitative structure-activity relationships (QSARs) are regression models relating chemical structure to biological activity. Such models make it possible to predict toxicologically or pharmacologically relevant endpoints, which constitute the target outcomes of trials or experiments. The task is often tackled with instance-based methods (like k-nearest neighbors), which are all based on the notion of chemical (dis-)similarity. Our starting point is the observation by Raymond and Willett that the two big families of chemical distance measures, fingerprint-based and maximum common subgraph-based measures, provide orthogonal information about chemical similarity. The paper presents a novel method for finding suitable combinations of them, called adapted transfer, which adapts a distance measure learned on another, related dataset to a given dataset. Adapted transfer thus combines distance learning and transfer learning in a novel manner. In a set of experiments, we compare adapted transfer with distance learning on the target dataset itself and with inductive transfer without adaptations. In our experiments, we visualize the performance of the methods by learning curves (i.e., depending on training set size) and present a quantitative comparison for 10% and 100% of the maximum training set size.
Ulrich Rückert, Tobias Girschick, Fabian Buchwald, Stefan Kramer

Incremental Mining of Closed Frequent Subtrees

We study the problem of mining closed frequent subtrees from tree databases that are updated regularly over time. Closed frequent subtrees provide condensed and complete information for all frequent subtrees in the database. Although mining closed frequent subtrees is in general faster than mining all frequent subtrees, this is still a very time-consuming process, and thus it is undesirable to mine from scratch when the change to the database is small. The set of previously mined closed subtrees should be reused as much as possible to compute new emerging subtrees. We propose, in this paper, a novel and efficient incremental mining algorithm for closed frequent labeled ordered trees. We adopt a divide-and-conquer strategy and apply different mining techniques in different parts of the mining process. The proposed algorithm requires no additional scan of the whole database while its memory usage is reasonable. Our experimental study on both synthetic and real-life datasets demonstrates the efficiency and scalability of our algorithm.
Viet Anh Nguyen, Akihiro Yamamoto

Optimal Online Prediction in Adversarial Environments

In many prediction problems, including those that arise in computer security and computational finance, the process generating the data is best modeled as an adversary with whom the predictor competes. Even decision problems that are not inherently adversarial can be usefully modeled in this way, since the assumptions are sufficiently weak that effective prediction strategies for adversarial settings are very widely applicable.
Peter L. Bartlett

Discovery of Abstract Concepts by a Robot

This paper reviews experiments with an approach to discovery through a robot’s experimentation in its environment. In addition to discovering laws that enable predictions, we are particularly interested in the mechanisms that enable the discovery of abstract concepts that are not explicitly observable in the measured data, such as the notions of a tool or of stability. The approach is based on the use of Inductive Logic Programming. Examples of abstract concepts actually discovered in the experiments include the concepts of a movable object, an obstacle and a tool.
Ivan Bratko

Contrast Pattern Mining and Its Application for Building Robust Classifiers

The ability to distinguish, differentiate and contrast between different data sets is a key objective in data mining. Such ability can assist domain experts to understand their data and can help in building classification models. This presentation will introduce the techniques for contrasting data sets. It will also focus on some important real world applications that illustrate how contrast patterns can be applied effectively for building robust classifiers.
Kotagiri Ramamohanarao

Towards General Algorithms for Grammatical Inference

Many algorithms for grammatical inference can be viewed as instances of a more general algorithm which maintains a set of primitive elements, which distributionally define sets of strings, and a set of features or tests that constrain various inference rules. Using this general framework, which we cast as a process of logical inference, we re-analyse Angluin’s famous lstar algorithm and several recent algorithms for the inference of context-free grammars and multiple context-free grammars. Finally, to illustrate the advantages of this approach, we extend it to the inference of functional transductions from positive data only, and we present a new algorithm for the inference of finite state transducers.
Alexander Clark

The Blessing and the Curse of the Multiplicative Updates

Multiplicative updates multiply the parameters by nonnegative factors. These updates are motivated by a Maximum Entropy Principle and they are prevalent in evolutionary processes where the parameters are, for example, concentrations of species and the factors are survival rates. The simplest such update is Bayes rule, and we give an in vitro selection algorithm for RNA strands that implements this rule in the test tube, where each RNA strand represents a different model. In one liter of the RNA “soup” there are approximately 10²⁰ different strands, and therefore this is a rather high-dimensional implementation of Bayes rule.
We investigate multiplicative updates for the purpose of learning online while processing a stream of examples. The “blessing” of these updates is that they learn very fast because the good parameters grow exponentially. However their “curse” is that they learn too fast and wipe out parameters too quickly. We describe a number of methods developed in the realm of online learning that ameliorate the curse of these updates. The methods make the algorithm robust against data that changes over time and prevent the currently good parameters from taking over. We also discuss how the curse is circumvented by nature. Some of nature’s methods parallel the ones developed in Machine Learning, but nature also has some additional tricks.
Manfred K. Warmuth
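
The update described above, and one standard remedy for its “curse”, can be sketched directly. The learning rate eta and mixing rate alpha are illustrative assumptions; with losses taken as negative log-likelihoods and eta = 1, the multiplicative update reduces to Bayes rule.

```python
import math

def hedge_update(weights, losses, eta=0.5):
    """One multiplicative (Hedge) update: each parameter is multiplied by a
    nonnegative factor exp(-eta * loss), then the weights are renormalized.
    """
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(new)
    return [w / z for w in new]

def fixed_share(weights, alpha=0.01):
    """Fixed-share mixing: a standard remedy for the curse, keeping every
    weight bounded away from zero so that wiped-out parameters can recover
    when the data changes over time.
    """
    n = len(weights)
    return [(1 - alpha) * w + alpha / n for w in weights]
```

Alternating `hedge_update` and `fixed_share` while processing a stream of examples gives a learner that is fast to concentrate on good parameters yet robust to shifts, which is exactly the blessing-and-curse trade-off discussed above.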

