
2014 | Book

Foundations of Intelligent Systems

21st International Symposium, ISMIS 2014, Roskilde, Denmark, June 25-27, 2014. Proceedings

Edited by: Troels Andreasen, Henning Christiansen, Juan-Carlos Cubero, Zbigniew W. Raś

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 21st International Symposium on Methodologies for Intelligent Systems, ISMIS 2014, held in Roskilde, Denmark, in June 2014. The 61 revised full papers were carefully reviewed and selected from 111 submissions. The papers are organized in topical sections on complex networks and data stream mining; data mining methods; intelligent systems applications; knowledge representation in databases and systems; textual data analysis and mining; special session: challenges in text mining and semantic information retrieval; special session: warehousing and OLAPing complex, spatial and spatio-temporal data; ISMIS posters.

Table of contents

Frontmatter

Complex Networks and Data Stream Mining

Community Detection by an Efficient Ant Colony Approach

Community detection is an efficient tool for analyzing large complex networks, offering new insights into their structure and functioning. A community is a significant organizational unit formed by nodes that are more densely connected among themselves. Ant colony algorithms have been used to detect communities in a fast and efficient way. In this work, changes are made to an ant colony algorithm for community detection based on modularity optimization. The changes concern the way an ant moves and the adopted stopping criteria. To assess the proposed strategy, benchmark networks are studied, and preliminary results indicate that the suggested changes make the original algorithm more robust, reaching higher modularity values for the detected communities.

Lúcio Pereira de Andrade, Rogério Pinto Espíndola, Nelson Francisco Favilla Ebecken
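For context, the modularity that such algorithms optimize is usually the standard Newman-Girvan measure (the abstract does not state the exact variant used):

```latex
% Modularity Q of a partition c of a graph with adjacency matrix A,
% node degrees k_i and m edges; \delta(c_i,c_j)=1 iff i and j share a community
Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
```

Higher Q indicates communities whose internal edge density exceeds what a random graph with the same degree sequence would produce.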
Adaptive XML Stream Classification Using Partial Tree-Edit Distance

XML classification finds many applications, ranging from data integration to e-commerce. However, existing classification algorithms are designed for static XML collections, while modern information systems frequently deal with streaming data that needs to be processed on-line using limited resources. Furthermore, data stream classifiers have to be able to react to concept drifts, i.e., changes in the stream's underlying data distribution. In this paper, we propose XStreamClass, an XML classifier capable of processing streams of documents and reacting to concept drifts. The algorithm combines incremental frequent tree mining with partial tree-edit distance and associative classification. XStreamClass was experimentally compared with four state-of-the-art data stream ensembles and provided the best average classification accuracy on real and synthetic datasets simulating different drift scenarios.

Dariusz Brzezinski, Maciej Piernik
RILL: Algorithm for Learning Rules from Streaming Data with Concept Drift

Incremental learning of classification rules from data streams with concept drift is considered. We introduce a new algorithm, RILL, which induces rules and single instances, uses bottom-up rule generalization based on nearest rules, and performs intensive pruning of the obtained rule set. Its experimental evaluation shows that it achieves better classification accuracy and memory usage than the related rule algorithm VFDR, and that it is also competitive with VFDT-NB decision trees.

Magdalena Deckert, Jerzy Stefanowski
Community Detection for Multiplex Social Networks Based on Relational Bayesian Networks

Many techniques have been proposed for community detection in social networks. Most of these techniques are only designed for networks defined by a single relation. However, many real networks are multiplex networks that contain multiple types of relations and different attributes on the nodes. In this paper we propose to use relational Bayesian networks for the specification of probabilistic network models, and develop inference techniques that solve the community detection problem based on these models. The use of relational Bayesian networks as a flexible high-level modeling framework enables us to express different models capturing different aspects of community detection in multiplex networks in a coherent manner, and to use a single inference mechanism for all models.

Jiuchuan Jiang, Manfred Jaeger
Mining Dense Regions from Vehicular Mobility in Streaming Setting

The detection of congested areas can play an important role in the development of traffic management systems. Usually, the problem is investigated under two main perspectives, which concern the representation of space and the shape of the dense regions, respectively. However, the adoption of movement tracking technologies enables the generation of mobility data in a streaming fashion, which adds an aspect of complexity not yet addressed in the literature. We propose a computational solution to mine dense regions in the urban space from mobility data streams. Our proposal adopts a stream data mining strategy which enables the detection of two types of dense regions, one based on spatial closeness, the other on temporal proximity. We prove the viability of the approach on vehicular data streams in the urban space.

Corrado Loglisci, Donato Malerba
Mining Temporal Evolution of Entities in a Stream of Textual Documents

One recently addressed research direction focuses on the problem of mining topic evolutions from textual documents. Following this main stream of research, in this paper we face the different, but related, problem of mining the topic evolution of entities (persons, companies, etc.) mentioned in the documents. To this aim, we incrementally analyze streams of time-stamped documents in order to identify clusters of similar entities and represent their evolution over time. The proposed solution is based on the concept of temporal profiles of entities extracted at periodic instants in time. Experiments performed on both synthetic and real-world datasets show that the proposed framework is a valuable tool for discovering underlying evolutions of entities, with significant improvements over the considered baseline methods.

Gianvito Pio, Pasqua Fabiana Lanotte, Michelangelo Ceci, Donato Malerba
An Efficient Method for Community Detection Based on Formal Concept Analysis

This work proposes an original approach based on formal concept analysis (FCA) for community detection in social networks (SN). Firstly, we study FCA methods which partially detect communities in social networks. Secondly, we propose a GroupNode modularity function whose goal is to improve a partial detection method by taking into account all actors of the social network. Our approach is validated through different experiments based on well-known real social networks in the field and on synthetic benchmark networks. In addition, we adapted the F-measure to the multi-class case in order to evaluate the quality of a detected community.

Selmane Sid Ali, Fadila Bentayeb, Rokia Missaoui, Omar Boussaid

Data Mining Methods

On Interpreting Three-Way Decisions through Two-Way Decisions

Three-way decisions for classification consist of the actions of acceptance, rejection and non-commitment (i.e., neither acceptance nor rejection) in deciding whether an object is in a class. A difficulty with three-way decisions is that one must consider costs of three actions simultaneously. On the other hand, for two-way decisions, one simply considers costs of two actions. The main objective of this paper is to take advantage of the simplicity of two-way decisions by interpreting three-way decisions as a combination of a pair of two-way decision models. One consists of acceptance and non-acceptance and the other consists of rejection and non-rejection. The non-commitment of the three-way decision model is viewed as non-acceptance and non-rejection of the pair of two-way decision models.

Xiaofei Deng, Yiyu Yao, JingTao Yao
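A minimal sketch of this interpretation, with hypothetical thresholds alpha and beta: the three-way decision is recovered by combining an acceptance-oriented and a rejection-oriented two-way model.

```python
def accepts(p, alpha):
    """Two-way model 1: acceptance vs. non-acceptance at threshold alpha."""
    return p >= alpha

def rejects(p, beta):
    """Two-way model 2: rejection vs. non-rejection at threshold beta."""
    return p <= beta

def three_way(p, alpha=0.7, beta=0.3):
    """Non-commitment is exactly non-acceptance combined with non-rejection."""
    if accepts(p, alpha):
        return "accept"
    if rejects(p, beta):
        return "reject"
    return "non-commitment"

print(three_way(0.5))  # -> 'non-commitment'
```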
FHM: Faster High-Utility Itemset Mining Using Estimated Utility Co-occurrence Pruning

High utility itemset mining is a challenging task in frequent pattern mining, which has wide applications. The state-of-the-art algorithm is HUI-Miner. It adopts a vertical representation and performs a depth-first search to discover patterns and calculate their utility without performing costly database scans. Although this approach is effective, mining high-utility itemsets remains computationally expensive because HUI-Miner has to perform a costly join operation for each pattern that is generated by its search procedure. In this paper, we address this issue by proposing a novel strategy based on the analysis of item co-occurrences to reduce the number of join operations that need to be performed. An extensive experimental study with four real-life datasets shows that the resulting algorithm, named FHM (Fast High-Utility Miner), reduces the number of join operations by up to 95% and is up to six times faster than the state-of-the-art algorithm HUI-Miner.

Philippe Fournier-Viger, Cheng-Wei Wu, Souleymane Zida, Vincent S. Tseng
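A much simplified sketch of the co-occurrence pruning idea; real FHM works on per-item utilities and transaction-weighted utilization, and the data here is illustrative only.

```python
from itertools import combinations

# toy transactions: (set of items, transaction utility)
transactions = [({"a", "b", "c"}, 10), ({"a", "c"}, 6), ({"b", "d"}, 4)]

# co-occurrence structure: summed transaction utility per item pair
eucs = {}
for items, tu in transactions:
    for pair in combinations(sorted(items), 2):
        eucs[pair] = eucs.get(pair, 0) + tu

def skip_join(x, y, min_util):
    """If the pair's estimated utility is below the threshold, no itemset
    containing both x and y can be high-utility, so the join is skipped."""
    return eucs.get(tuple(sorted((x, y))), 0) < min_util

print(skip_join("a", "d", min_util=5))  # True: 'a' and 'd' never co-occur
```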
Automatic Subclasses Estimation for a Better Classification with HNNP

Although much artificial intelligence and especially machine learning research nowadays concerns big data, there are still many real-world problems for which only small and noisy data sets exist. Applying learning models to such data may not lead to desirable results. Hence, in former work we proposed a hybrid neural network plait (HNNP) for improving classification performance on those data. To address the high intraclass variance in the investigated data, we used manually estimated subclasses for the HNNP approach. In this paper we investigate, on the one hand, the impact of using those subclasses instead of the main classes for HNNP and, on the other hand, an approach for automatic subclass estimation for HNNP to overcome the expensive and time-consuming manual labeling. The results of experiments with two different real data sets show that using automatically estimated subclasses for HNNP delivers the best classification performance and also outperforms single state-of-the-art neural networks as well as ensemble methods.

Ruth Janning, Carlotta Schatten, Lars Schmidt-Thieme
A Large-Scale, Hybrid Approach for Recommending Pages Based on Previous User Click Pattern and Content

In a large-scale recommendation setting, item-based collaborative filtering is preferable due to the availability of huge amounts of user preference information and the relative stability of item-item similarity. Item-based collaborative filtering uses only users' item preference information to compute recommendations for targeted users. This process may not always be effective if the amount of preference information available is very small. For this kind of problem, item-content-based similarity plays an important role in addition to item co-occurrence-based similarity. In this paper we propose and evaluate a MapReduce-based, large-scale, hybrid collaborative algorithm that incorporates both content similarity and co-occurrence similarity. To generate recommendations for users having more or less preference information, the relative weights of the item-item content-based and co-occurrence-based similarities are tuned in a user-dependent way. Our experimental results on the Yahoo! Front Page "Today Module User Click Log" dataset show that the proposed user-dependent parametric combination of the two similarity metrics yields a significant improvement in average precision compared to other recent work.

Mohammad Amir Sharif, Vijay V. Raghavan
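One plausible shape for the user-dependent weighting; the blending function and shrinkage constant below are assumptions for illustration, not the paper's exact scheme.

```python
def hybrid_similarity(i, j, user_profile, cooc_sim, content_sim, k=10.0):
    """Blend the two similarities per user: the more preference information
    a user has, the more weight the co-occurrence similarity receives."""
    n = len(user_profile)   # number of items the user has given feedback on
    w = n / (n + k)         # hypothetical shrinkage toward content similarity
    return w * cooc_sim[(i, j)] + (1.0 - w) * content_sim[(i, j)]

cooc, content = {("x", "y"): 0.9}, {("x", "y"): 0.4}
print(hybrid_similarity("x", "y", ["a", "b"], cooc, content))  # cold user, ~0.48
print(hybrid_similarity("x", "y", ["a"] * 90, cooc, content))  # heavy user, ~0.85
```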
EverMiner Prototype Using LISp-Miner Control Language

The goal of the EverMiner project is to run an automatic data mining process that starts with several items of initial domain knowledge and leads to new knowledge being inferred. A formal description of the items of domain knowledge, as well as of all particular steps of the process, is used. The EverMiner project is based on the LISp-Miner software system, which involves several data mining tools. Experiments with the proposed approach have been realized by manually chaining LISp-Miner tools. The paper describes experiences with the LISp-Miner Control Language, which makes it possible to transform a formal description of the data mining process into an executable program.

Milan Šimůnek, Jan Rauch
Local Characteristics of Minority Examples in Pre-processing of Imbalanced Data

Informed pre-processing methods for improving classifiers learned from class-imbalanced data are considered. We discuss different ways of analyzing the characteristics of local distributions of examples in such data. Then, we experimentally compare the main informed pre-processing methods and show that identifying types of minority examples depending on their k nearest neighbourhood may help in explaining differences in the performance of these methods. Finally, we exploit the information about the local neighbourhood to modify the oversampling ratio in a SMOTE-related method.

Jerzy Stefanowski, Krystyna Napierała, Małgorzata Trzcielińska
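A sketch of the neighbourhood analysis, assuming the common safe/borderline/rare/outlier typing with k = 5; the paper's exact thresholds may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minority_types(X_min, X_all, y_all, k=5):
    """Type each minority example by how many of its k nearest neighbours
    are minority too (label 1 assumed minority; X_min is a subset of X_all,
    so the first neighbour returned is the point itself)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn.kneighbors(X_min)
    types = []
    for row in idx:
        same = int(np.sum(np.asarray(y_all)[row[1:]] == 1))
        types.append("safe" if same >= 4 else
                     "borderline" if same >= 2 else
                     "rare" if same == 1 else "outlier")
    return types
```

The oversampling ratio of the SMOTE-related method can then be raised for the harder "rare" and "outlier" types.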
Visual-Based Detection of Properties of Confirmation Measures

The paper presents a visualization technique that facilitates and eases analyses of interestingness measures with respect to their properties. Detection of the properties possessed by these measures is especially important when choosing a measure for KDD tasks. Our visual-based approach is a useful alternative to often laborious and time-consuming theoretical studies, as it makes it possible to promptly perceive properties of the visualized measures. Assuming a common, four-dimensional domain of the measures, a synthetic dataset consisting of all possible contingency tables with the same number of observations is generated. It is then visualized in 3D using a tetrahedron-based barycentric coordinate system. An additional scalar function, an interestingness measure, is rendered using colour. To demonstrate the capabilities of the proposed technique, we detect properties of a particular group of measures, known as confirmation measures.

Robert Susmaga, Izabela Szczęch
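A sketch of the tetrahedron embedding: each 2x2 contingency table (a, b, c, d) becomes one barycentric point, coloured by a confirmation measure; here the simple measure D = P(H|E) - P(H) stands in for the group of measures studied.

```python
import numpy as np

# one vertex of a regular tetrahedron per contingency-table cell (a, b, c, d)
V = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)

def embed(a, b, c, d):
    """Barycentric coordinates: cell counts, normalised to sum to one,
    weight the four vertices."""
    w = np.array([a, b, c, d], dtype=float)
    return (w / w.sum()) @ V

def measure_D(a, b, c, d):
    """Confirmation measure D = P(H|E) - P(H), with a = |E and H| etc."""
    n = a + b + c + d
    return a / (a + b) - (a + c) / n

print(embed(40, 10, 30, 20), measure_D(40, 10, 30, 20))  # 3D point, D = 0.1
```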

Intelligent Systems Applications

A Recursive Algorithm for Building Renovation in Smart Cities

Layout configuration algorithms in civil engineering follow two major strategies, called constructive and iterative improvement. Both strategies have been successfully applied to different facility scenarios such as room configurations and apartment layouts. Yet most of this work shares two commonalities: it attacks problems in which the reference plane is parallel to the ground and, in most cases, the number of activities is known in advance. This work aims to close that gap by developing a constructive algorithm for the layout configuration of building facades in the context of a French project called CRIBA. The project develops a smart-city support system for high-performance renovation of apartment buildings. Algorithm details are explained and one example is presented to illustrate the kind of facades it can deal with.

Andrés Felipe Barco, Elise Vareilles, Michel Aldanondo, Paul Gaborit
Spike Sorting Based upon PCA over DWT Frequency Band Selection

When analyzing neurobiological data, many of its aspects have to be looked upon carefully. Data coming from MRI, EMG or microrecording all have special properties that have to be extracted during analysis. Recordings obtained by the microrecording procedure, i.e., from microelectrodes placed within the neuronal tissue, can be analyzed in at least two ways. The first approach focuses on the background noise present in such recordings. The second looks at the presence of spikes, the electrical signs of the bioelectrical neurophysiological activity of neuron cells. In a given recording one may often find many spikes with different shapes. For further analysis it is frequently desirable that spikes be grouped according to their shape. Such grouping, or shape clustering, is called spike sorting, and there are many known approaches to the problem. Still, before spikes are detected and sorted, the raw recorded signal is almost always filtered and altered by various DSP processes. These preliminary DSP operations may significantly hamper spike-sorting efficiency. The analysis presented in this paper provides an answer as to which frequency bands alone are sufficient for proper and successful spike sorting.

Konrad Ciecierski, Zbigniew W. Raś, Andrzej W. Przybyszewski
Neural Network Implementation of a Mesoscale Meteorological Model

Numerical weather prediction is a computationally expensive task that requires not only the numerical solution to a complex set of non-linear partial differential equations, but also the creation of a parameterization scheme to estimate sub-grid scale phenomena. This paper outlines an alternative approach to developing a mesoscale meteorological model – a modified recurrent neural network that learns to simulate the solution to these equations. Along with an appropriate time integration scheme and learning algorithm, this method can be used to create multi-day forecasts for a large region.

The learning method presented in this paper is an extended form of Backpropagation Through Time for a recurrent network with outputs that feed back through as inputs only after undergoing a fixed transformation.

Robert Firth, Jianhua Chen
Spectral Machine Learning for Predicting Power Wheelchair Exercise Compliance

Pressure ulcers are a common and devastating condition faced by users of power wheelchairs. However, proper use of power wheelchair tilt and recline functions can alleviate pressure and reduce the risk of ulcer occurrence. In this work, we show that when using data from a sensor instrumented power wheelchair, we are able to predict with an average accuracy of 92% whether a subject will successfully complete a repositioning exercise when prompted. We present two models of compliance prediction. The first, a spectral Hidden Markov Model, uses fast, optimal optimization techniques to train a sequential classifier. The second, a decision tree using information gain, is computationally efficient and produces an output that is easy for clinicians and wheelchair users to understand. These prediction algorithms will be a key component in an intelligent reminding system that will prompt users to complete a repositioning exercise only in contexts in which the user is most likely to comply.

Robert Fisher, Reid Simmons, Cheng-Shiu Chung, Rory Cooper, Garrett Grindle, Annmarie Kelleher, Hsinyi Liu, Yu Kuang Wu
Mood Tracking of Radio Station Broadcasts

This paper presents an example of a system for the analysis of emotions contained within radio broadcasts. We prepared training data, performed feature extraction, and built classifiers for music/speech discrimination and for emotion detection in music. To study changes in emotions, we used recorded broadcasts from 4 selected European radio stations. The collected data allowed us to determine the dominant emotion in the radio broadcasts and construct maps visualizing the distribution of emotions in time. The obtained results provide an interesting new view of the emotional content of radio station broadcasts.

Jacek Grekow
Evidential Combination Operators for Entrapment Prediction in Advanced Driver Assistance Systems

We propose the use of evidential combination operators for advanced driver assistance systems (ADAS) for vehicles. More specifically, we elaborate on how three different operators, one precise and two imprecise, can be used for the purpose of entrapment prediction, i.e., to estimate when the relative positions and speeds of the surrounding vehicles can potentially become dangerous. We motivate the use of the imprecise operators by their ability to model uncertainty in the underlying sensor information and we provide an example that demonstrates the differences between the operators.

Alexander Karlsson, Anders Dahlbom, Hui Zhong
Influence of Feature Sets on Precision, Recall, and Accuracy of Identification of Musical Instruments in Audio Recordings

In this paper we investigate how various feature sets influence precision, recall, and accuracy of the identification of multiple instruments in polyphonic recordings. Our investigations were performed on classical music and on musical instruments typical for this genre. Five feature sets were investigated. The results show that precision and recall change to a great extent, beyond the usual trade-off, whereas accuracy is relatively stable. The results also depend on the polyphony level of particular pieces of music. The investigated music varies in polyphony level, from a 2-instrument duet (with piano) to symphonies.

Elżbieta Kubera, Alicja A. Wieczorkowska, Magdalena Skrzypiec
Multi-label Ferns for Efficient Recognition of Musical Instruments in Recordings

In this paper we introduce multi-label ferns, and apply this technique for automatic classification of musical instruments in audio recordings. We compare the performance of our proposed method to a set of binary random ferns, using jazz recordings as input data. Our main result is obtaining much faster classification and higher F-score. We also achieve substantial reduction of the model size.

Miron B. Kursa, Alicja A. Wieczorkowska
Computer-Supported Polysensory Integration Technology for Educationally Handicapped Pupils

In this paper, a multimedia system providing technology for hearing and visual attention stimulation is briefly presented. The system aims to support the development of educationally handicapped pupils. It is presented in the context of its configuration, architecture, and therapeutic exercise implementation issues. Results of pupils' improvements after 8 weeks of training with the system are also provided. Training with the system led to the development of spatial orientation and of the understanding of cause-and-effect relationships.

Michal Lech, Andrzej Czyzewski, Waldemar Kucharski, Bozena Kostek
Integrating Cluster Analysis to the ARIMA Model for Forecasting Geosensor Data

Clustering geosensor data is a problem that has recently attracted a large amount of research. In this paper, we focus on clustering geophysical time series data measured by a geo-sensor network. Clusters are built by accounting for both the spatial and the temporal information of the data. We use clusters to produce globally meaningful information from time series obtained by individual sensors. The cluster information is integrated into the ARIMA model in order to yield accurate forecasting results. Experiments investigate the trade-off between accuracy and efficiency of the proposed algorithm.

Sonja Pravilovic, Annalisa Appice, Donato Malerba
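A minimal sketch of one way such integration can look, using the cluster's mean series as an exogenous ARIMA regressor; this is an illustrative assumption, not necessarily the paper's formulation.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
cluster_mean = np.cumsum(rng.normal(size=200))           # cluster centroid series
sensor = cluster_mean + rng.normal(scale=0.5, size=200)  # one geosensor's series

# fit on the first 150 steps, forecast the rest with the cluster signal as exog
fit = ARIMA(sensor[:150], exog=cluster_mean[:150], order=(1, 1, 1)).fit()
forecast = fit.forecast(steps=50, exog=cluster_mean[150:])
```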
Unsupervised and Hybrid Approaches for On-line RFID Localization with Mixed Context Knowledge

Indoor localization of humans is still a complex problem, especially in resource-constrained environments, e.g., if only a small amount of data is available over time. We address this problem using active RFID technology and focus on room-level localization. We propose several unsupervised localization approaches and compare their accuracy to state-of-the-art unsupervised and supervised localization methods. In addition, we combine unsupervised and supervised methods into a hybrid approach using different types of mixed context knowledge. We show that the new unsupervised approaches significantly outperform state-of-the-art supervised methods, and that the hybrid approach performs best in our application setting. We analyze real-world data collected during a two-day evaluation of our working group management system MyGroup.

Christoph Scholz, Martin Atzmueller, Gerd Stumme
Mining Surgical Meta-actions Effects with Variable Diagnoses’ Number

Commonly, information systems are organized by the use of tables that are composed of a fixed number of columns representing the information system's attributes. However, in a typical hospital scenario, patients may have a variable number of diagnoses, and this data is recorded in the patients' medical records in random order. Treatments are prescribed based on these diagnoses, which makes it harder to mine meta-actions from healthcare datasets. In such a scenario, patients are not necessarily followed for a specific disease, but are treated for whatever they are diagnosed with. This makes it even more complex to prescribe personalized treatments, since patients react differently to treatments based on their state (diagnoses). In this work, we present a method to extract personalized meta-actions from surgical datasets with a variable number of diagnoses. We used the Florida State Inpatient Databases (SID), which are part of the Healthcare Cost and Utilization Project (HCUP) [1], to demonstrate how to extract meta-actions and evaluate them.

Hakim Touati, Zbigniew W. Raś, James Studnicki, Alicja A. Wieczorkowska

Knowledge Representation in Databases and Systems

A System for Computing Conceptual Pathways in Bio-medical Text Models

This paper describes the key principles in a system for querying and conceptual path finding in a logic-based knowledge base. The knowledge base is extracted from textual descriptions in bio-, pharma- and medical areas. The knowledge base applies natural logic, that is, a variable-free term-algebraic form of predicate logic. Natural logics are distinguished by coming close to natural language so that propositions are readable by domain experts. The natural logic knowledge base is accompanied by an internal graph representation, where the nodes represent simple concept terms as well as compound concepts stemming from entire phrases. Path finding between concepts is facilitated by a labelled graph form that represents the knowledge base as well as the ontological information.

Troels Andreasen, Henrik Bulskov, Jørgen Fischer Nilsson, Per Anker Jensen
Putting Instance Matching to the Test: Is Instance Matching Ready for Reliable Data Linking?

To extend the scope of retrieval and reasoning spanning several linked data stores, it is necessary to find out whether information in different collections actually points to the same real-world object. Thus, data stores are interlinked through owl:sameAs relations. Unfortunately, this cross-linkage is not as extensive as one would hope. To remedy this problem, instance matching systems that automatically discover owl:sameAs links have been proposed recently. According to results on existing benchmarks, such systems seem to have reached a convincing level of maturity. But these evaluations miss out on some important characteristics encountered in real-world data. To establish whether instance matching systems are really ready for real-world data interlinking, we analyzed the main challenges of instance matching. We built a representative data set that emphasizes these challenges and evaluated the overall quality of instance matching systems on the example of a top performer from last year's Instance Matching track organized by the Ontology Alignment Evaluation Initiative (OAEI).

Silviu Homoceanu, Jan-Christoph Kalo, Wolf-Tilo Balke
Improving Personalization and Contextualization of Queries to Knowledge Bases Using Spreading Activation and Users’ Feedback

Facilitating knowledge acquisition when users are consulting knowledge bases (KB) is often a challenge, given the large amount of data they contain. Providing users with appropriate contextualization and personalization of the content of KBs is a way to try to achieve this goal. This paper presents a mechanism intended to provide contextualization and personalization of queries to KBs based on collected data regarding users' preferences, both implicit (users' profiles) and explicit (users' feedback). This mechanism combines user data with a spreading activation (SA) algorithm to generate the contextualization. The initial positive results of the evaluation of the contextualization are presented in this paper.

Ana Belen Pelegrina, Maria J. Martin-Bautista, Pamela Faber
Plethoric Answers to Fuzzy Queries: A Reduction Method Based on Query Mining

Querying large-scale databases may often lead to plethoric answers, even when fuzzy queries are used. To overcome this problem, we propose to strengthen the initial query with additional predicates, selected among predefined ones mainly according to their degree of semantic relationship with the initial query. In the approach we propose, related predicates are identified by mining a repository of previously executed queries.

Olivier Pivert, Grégory Smits
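A toy sketch of the repository-mining step: candidate predicates are ranked by how often they co-occurred with the current query's predicates in past queries. Names and scoring are illustrative, not taken from the paper.

```python
from collections import Counter

# hypothetical repository of previously executed fuzzy queries
log = [{"recent", "low_mileage"}, {"recent", "low_mileage", "cheap"},
       {"recent", "cheap"}, {"spacious", "cheap"}]

def related_predicates(query, log, top=2):
    """Score each unseen predicate by its co-occurrence with the query."""
    scores = Counter()
    for past in log:
        overlap = len(query & past)
        for p in past - query:
            scores[p] += overlap
    return [p for p, s in scores.most_common(top) if s > 0]

# strengthen a plethoric query {"recent"} with its best-related predicates
print(related_predicates({"recent"}, log))  # -> ['low_mileage', 'cheap']
```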
Generating Description Logic $\mathcal{ALC}$ from Text in Natural Language

In this paper, we present a natural language translator for expressive ontologies and show that it is a viable solution for the automated acquisition of ontologies and complete axioms, constituting an effective solution for automating the expressive ontology building process. The translator is based on syntactic and semantic text analysis. The viability of our approach is demonstrated through the generation of descriptions of complex axioms from concepts defined by users and from glossaries found on Wikipedia. We evaluated our approach in an initial experiment with input sentences enriched with hierarchy axioms, disjunction, conjunction, negation, as well as existential and universal quantification to impose property restrictions.

Ryan Ribeiro de Azevedo, Fred Freitas, Rodrigo Rocha, José Antônio Alves de Menezes, Luis F. Alves Pereira
DBaaS-Expert: A Recommender for the Selection of the Right Cloud Database

The most important benefit of Cloud Computing is that organizations no longer need to spend capital up-front on hardware and software purchases; all services are provided on a pay-per-use basis. The cloud services market is forecast to grow, and numerous providers offer database as a service (DBaaS). Nevertheless, as the number of DBaaS offerings increases, it becomes difficult to compare the various offerings by checking their ads-oriented documentation. In this paper, we propose and describe DBaaS-Expert, a framework which helps a user choose the right DBaaS cloud provider among DBaaS offerings. The core components of DBaaS-Expert are, first, an ontology which captures the concepts of cloud data management system services and, second, a ranking core which scores each DBaaS offer in terms of a set of criteria.

Soror Sahri, Rim Moussa, Darrell D. E. Long, Salima Benbernou
Context-Aware Decision Support in Dynamic Environments: Methodology and Case Study

Dynamic environments assume on-the-fly decision support based on available information and current situation development. The paper addresses the methodology of context-aware decision support in dynamic environments. Context is modeled as a “problem situation.” It specifies domain knowledge describing the situation and problems to be solved in this situation. The context is produced based on the knowledge extracted from an application ontology, which is formalized as an object-oriented constraint network. The paper proposes a set of technologies that can be used to implement the ideas behind the research. An application of these ideas is illustrated by an example of decision support for tourists travelling by car. In this example, the proposed system generates ad hoc travel plans and assists tourists in planning their attraction attending times depending on the context information about the current situation in the region and its foreseen development.

Alexander Smirnov, Tatiana Levashova, Alexey Kashevnik, Nikolay Shilov

Textual Data Analysis and Mining

Unsupervised Aggregation of Categories for Document Labelling

We present a novel algorithm for document categorization that assigns multiple labels out of a large, hierarchically arranged (but not necessarily tree-like) set of possible categories. It extends our Wikipedia-based method presented in [1] via unsupervised aggregation (generalization) of document categories. We compare the resulting categorization with the original (non-aggregated) version and with a variant which maps categories to a manually selected set of labels.

Piotr Borkowski, Krzysztof Ciesielski, Mieczysław A. Kłopotek
Classification of Small Datasets: Why Using Class-Based Weighting Measures?

In text classification, providing an efficient classifier even when the number of documents involved in the learning step is small remains an important issue. In this paper we evaluate the performance of traditional classification methods in order to better assess their limitations in the learning phase when dealing with small amounts of documents. We then propose a new way of weighting the features used for classification. These features have been integrated into two well-known classifiers, Class-Feature-Centroid and Naïve Bayes, and evaluations have been performed on two real datasets. We have also investigated the influence of parameters such as the number of classes, documents or words on classification. Experiments have shown the efficiency of our proposal relative to state-of-the-art classification methods: whether with a very small amount of data or with a small number of features extractable from poor-content documents, our approach performs well.

Flavien Bouillot, Pascal Poncelet, Mathieu Roche
Improved Factorization of a Connectionist Language Model for Single-Pass Real-Time Speech Recognition

Statistical Language Models are often difficult to derive because of the so-called "dimensionality curse". Connectionist Language Models defeat this problem by utilizing a distributed word representation which is modified simultaneously with the neural network synaptic weights. This work describes certain improvements in the utilization of Connectionist Language Models for single-pass real-time speech recognition. These include comparing the word probabilities independently between words and a novel mechanism for factorization of the lexical tree. Experiments comparing the improved model to the standard Connectionist Language Model in a Large-Vocabulary Continuous Speech Recognition (LVCSR) task show that the new method obtains about a 33-fold speed increase with only minimally worse word-level speech recognition performance.

Łukasz Brocki, Danijel Koržinek, Krzysztof Marasek
Automatic Extraction of Logical Web Lists

Recently, there has been increased interest in the extraction of structured data from the web (both the "Surface" Web and the "Hidden" Web). In particular, in this paper we focus on the automatic extraction of Web lists. Although this task has been studied extensively, existing approaches are based on the assumption that lists are wholly contained in a single Web page. They do not consider that many websites span their listings over several Web pages and show only a partial view on each of them. Similar to databases, where a view can represent a subset of the data contained in a table, such sites split a logical list into multiple views (view lists). Automatic extraction of logical lists is an open problem. To tackle this issue, we propose an unsupervised and domain-independent algorithm for logical list extraction. Experimental results on real-life and data-intensive Web sites confirm the effectiveness of our approach.

Pasqua Fabiana Lanotte, Fabio Fumarola, Michelangelo Ceci, Andrea Scarpino, Michele Damiano Torelli, Donato Malerba
Combining Formal Logic and Machine Learning for Sentiment Analysis

This paper presents a formal logical method for deep structural analysis of the syntactical properties of texts, using machine learning techniques for efficient syntactical tagging. To evaluate the method, it is used for entity-level sentiment analysis as an alternative to pure machine learning methods for sentiment analysis, which often work at the sentence or word level and are argued to have difficulties in capturing long-distance dependencies.

Niklas Christoffer Petersen, Jørgen Villadsen
Clustering View-Segmented Documents via Tensor Modeling

We propose a clustering framework for view-segmented documents, i.e., relatively long documents made up of smaller fragments that can be provided according to a target set of views or aspects. The framework is designed to exploit a view-based document segmentation in a third-order tensor model, whose decomposition result enables any standard document clustering algorithm to better reflect the multi-faceted nature of the documents. Experimental results on document collections featuring paragraph-based, metadata-based, or user-driven views have shown the significance of the proposed approach, highlighting performance improvements in the document clustering task.

Salvatore Romeo, Andrea Tagarelli, Dino Ienco
Searching XML Element Using Terms Propagation Method

In this paper, we describe a terms propagation method for focussed XML component retrieval, one of the most important challenges in the XML IR field. The aim of the focussed retrieval approach is to find the most exhaustive and specific element that addresses the user's need. These needs can be expressed through content queries composed of simple keywords. Our method provides a natural representation of a document, its elements and its content, and allows automatic selection of the combination of elements that best answers the user's query. We show the efficiency of the terms propagation method using a term-weighting formula that takes into account the size of the nodes and the size of the document. Our method has been evaluated on the "Focused" task of INEX 2006 and compared to the XFIRM model, which is based on a relevance propagation method. Evaluations have shown a significant improvement in retrieval efficiency.

Samia Berchiche-Fellag, Mohamed Mezghiche

Special Session: Challenges in Text Mining and Semantic Information Retrieval

AI Platform for Building University Research Knowledge Base

This paper is devoted to three years of research performed at Warsaw University of Technology, aimed at building advanced software for a university research knowledge base. As a result, a text mining platform has been built, enabling research in the areas of text mining and semantic information retrieval. In the paper, some of the implemented methods are tested from the point of view of their applicability in a real-life system.

Jakub Koperwas, Łukasz Skonieczny, Marek Kozłowski, Piotr Andruszkiewicz, Henryk Rybiński, Wacław Struk
A Seed Based Method for Dictionary Translation

The paper addresses the topic of automatic machine translation. The proposed method enables translating a dictionary by mining repositories in the source and target languages, without any directly given relationships connecting the two languages. It consists of two stages: (1) translation by lexical similarity, where words are compared graphically, and (2) translation by semantic similarity, where contexts are compared. The Polish and English versions of Wikipedia were used as multilingual corpora. The method and its stages are thoroughly analyzed. The results allow implementing this method in human-in-the-middle systems.

Robert Krajewski, Henryk Rybiński, Marek Kozłowski
SAUText — A System for Analysis of Unstructured Textual Data

Nowadays, semantic lexical resources such as ontologies are becoming increasingly important in many systems, in particular those providing access to structured textual data. Typically, such resources are built based on already existing repositories and by analyzing available texts. In practice, however, building new resources of this type, or enriching existing ones, cannot be accomplished without an appropriate tool. In this paper we present SAUText, a new system which provides infrastructure for carrying out research involving the usage of semantic resources and the analysis of unstructured textual data. The system uses a dedicated repository for storing various kinds of text data and takes advantage of parallelization in order to speed up the analysis.

Grzegorz Protaziuk, Jacek Lewandowski, Robert Bembenik
Evaluation of Path Based Methods for Conceptual Representation of the Text

Typical text clustering methods use the bag of words (BoW) representation to describe the content of documents. However, this representation is known to have several limitations. Employing Wikipedia as the lexical knowledge base has been shown to improve text representation for data-mining purposes. Promising extensions of that trend employ the hierarchical organization of the Wikipedia category system. In this paper we propose three path-based measures for calculating document relatedness in such a conceptual space and compare them with the widely used Path Length approach. We evaluate them using the OPTICS clustering algorithm for the categorization of keyword-based search results. The results show that our measures outperform the Path Length approach.

Łukasz Kucharczyk, Julian Szymański

Special Session: Warehousing and OLAPing Complex, Spatial and Spatio-Temporal Data

Restructuring Dynamically Analytical Dashboards Based on Usage Profiles

Today, analytical dashboards play a very important role in the daily life of any company. For some, they may be seen as simple "cosmetic" software artefacts presenting analytical data in a pleasant way. For others, however, they are very important analysis instruments, quite indispensable for current decision-making tasks, and decision-makers strongly defend their use: they are simple to interpret, easy to handle, and fast at showing data. However, a regular dashboard is not capable of adapting itself to new user needs, lacking the ability to personalize itself dynamically during a regular OLAP session. In this paper, we present the structure, components and services of an analytical system that has the ability to dynamically restructure the organization and contents of its dashboards, following usage patterns established previously in specific users' OLAP sessions.

Orlando Belo, Paulo Rodrigues, Rui Barros, Helena Correia
Enhancing Traditional Data Warehousing Architectures with Real-Time Capabilities

In this paper we explore the possibility of taking a data warehouse with a traditional architecture and making it real-time-capable. Real-time in warehousing concerns data freshness, the capacity to integrate data constantly, or at a desired rate, without requiring the warehouse to be taken offline. We discuss the approach and show experimental results that prove the validity of the solution.

Alfredo Cuzzocrea, Nickerson Ferreira, Pedro Furtado
Inference on Semantic Trajectory Data Warehouse Using an Ontological Approach

The use of location-aware devices is becoming more and more widespread, generating a huge quantity of mobility data. Such data describes the movement of mobile objects and is also called trajectory data. These raw trajectories lack contextual information about the moving object's goals and activity during the travel. Therefore, they must be enhanced with semantic information, yielding what is called a semantic trajectory. The semantic models proposed in the literature are in many cases ontology-based, and are composed of thematic, temporal and spatial ontologies together with rules to support inference and reasoning tasks on the data. However, computing inference on moving object trajectories while considering all thematic, spatial, and temporal rules can take very long, depending on the amount of data involved in the process. On the other hand, a trajectory data warehouse (TDW) is an efficient tool for analyzing and extracting valuable information from raw mobility data. We therefore propose in this work a TDW design inspired by an ontology model. We emphasize that the trajectory is to be seen as a first-class semantic concept. We then apply inference on the proposed model to see whether we can enhance it and make the complexity of this mechanism manageable.

Thouraya Sakouhi, Jalel Akaichi, Jamal Malki, Alain Bouju, Rouaa Wannous
Combining Stream Processing Engines and Big Data Storages for Data Analysis

We propose a system combining stream processing engines and big data storages for analyzing large amounts of data streams. It allows us to analyze data online and to store data for later offline analysis. An emphasis is laid on designing a system to facilitate simple implementations of data analysis algorithms.

Thomas Steinmaurer, Patrick Traxler, Michael Zwick, Reinhard Stumptner, Christian Lettner

ISMIS Posters

Representation and Evolution of User Profile in Information Retrieval Based on Bayesian Approach

In web personalization, how to represent the user profile is one of the key issues. The user profile refers to the user's interests, which change over time. This paper presents a personalized search approach for the representation and evolution of the user profile, based on a dynamic Bayesian network. The theoretical framework provided by these networks makes it possible to infer and evolve the user profile from the user's interactions with the search system. An experimental evaluation was designed to assess the impact of exploiting the user profile, defined by the user's interests, on the relevance of search results.

Farida Achemoukh, Rachid Ahmed-Ouamer
Creating Polygon Models for Spatial Clusters

This paper proposes a novel methodology for creating efficient polygon models for spatial datasets. A comprehensive analysis framework is proposed that takes a spatial cluster as an input and generates a polygon model for the cluster as an output. The framework creates a visually appealing, simple, and smooth polygon for the cluster by minimizing a fitness function. We propose a novel polygon fitness function for this task. Moreover, a novel emptiness measure is introduced for quantifying the presence of empty spaces inside polygons.

Fatih Akdag, Christoph F. Eick, Guoning Chen
Skeleton Clustering by Autonomous Mobile Robots for Subtle Fall Risk Discovery

In this paper, we propose two new instability features, a data pre-processing method, and a new evaluation method for skeleton clustering by autonomous mobile robots for subtle fall risk discovery. We previously proposed an autonomous mobile robot which clusters the skeletons of a monitored person for distinct fall risk discovery, and achieved promising results. A more natural setting posed problems such as ambiguities in class labels and the low discriminative power of our original instability features between safe and unsafe skeletons. We validate our three new proposals through experimental evaluation.

Yutaka Deguchi, Einoshin Suzuki
Sonar Method of Distinguishing Objects Based on Reflected Signal Specifics

This paper presents a method of pattern recognition based on sonar signal specificity. Environment data is collected by a Lego Mindstorms NXT mobile robot using a static sonar sensor. The primary stage of the research consists of offline data processing, as a result of which a set of object features enabling effective pattern recognition was established. The most essential features, reflected in object parameters, are described. The set of objects consists of two types of solids: parallelepipeds and cylinders. The main objective is to set clear and simple rules for distinguishing the objects and to implement them in a real-time system on the NXT robot. The tests confirmed the offline calculations and assumptions. The object recognition system achieves an average accuracy of 86%. The experimental results are presented. Further work aims at mobile robot localization: building a relative confidence-degree map to define the vehicle's location.

Teodora Dimitrova-Grekow, Marcin Jarczewski
Endowing Semantic Query Languages with Advanced Relaxation Capabilities

Most of studies on relaxing Semantic Web Database (

$\mathcal{S}\mathcal{W}\mathcal{D}\mathcal{B}$

) queries focus on developing new relaxation techniques or on optimizing the top-k query processing. However, only few works have been conducted to provide a fine and declarative control of query relaxation using an

$\mathcal{S}\mathcal{W}\mathcal{D}\mathcal{B}$

query language. In this paper we first define a set of requirements for an

$\mathcal{S}\mathcal{W}\mathcal{D}\mathcal{B}$

cooperative query language(

$\mathcal{C}\mathcal{Q}\mathcal{L}$

). Then, based on these requirements, we propose an extension of query language with a new clause to use and combine the relaxation operators we introduce. A similarity function is associated with these operators to rank the alternative answers retrieved.

Géraud Fokou, Stéphane Jean, Allel Hadjali
A Business Intelligence Solution for Monitoring Efficiency of Photovoltaic Power Plants

Photovoltaics (PV) is the field of technology and research related to the application of solar cells, in order to convert sunlight directly into electricity. In the last decade, PV plants have become ubiquitous in several countries of the European Union (EU). This paves the way for marketing new smart systems, designed to monitor the energy production of a PV park grid and supply intelligent services for customer and production applications. In this paper, we describe a new business intelligence system developed to monitor the efficiency of the energy production of a PV park. The system includes services for data collection, summarization (based on trend cluster discovery), synthetic data generation, supervisory monitoring, report building and visualization.

Fabio Fumarola, Annalisa Appice, Donato Malerba
WBPL: An Open-Source Library for Predicting Web Surfing Behaviors

We present WBPL (Web users Behavior Prediction Library), a cross-platform open-source library for predicting the behavior of web users. WBPL allows training prediction models from server logs and offers support for three of the most used web servers (Apache, Nginx and Lighttpd). Models can then be used to predict the next resources fetched by users and can be updated efficiently with new logs. WBPL offers multiple state-of-the-art prediction models such as PPM, All-K-Order-Markov and DG, as well as a novel prediction model, CPT (Compact Prediction Tree). Experiments on various web click-stream datasets show that the library can be used to predict web surfing or buying behaviors with an overall accuracy of up to 38% and is very efficient (up to 1,000 predictions/s).

Ted Gueniche, Philippe Fournier-Viger, Roger Nkambou, Vincent S. Tseng
Data-Quality-Aware Skyline Queries

This paper deals with skyline queries in the context of “dirty databases”, i.e., databases that may contain bad quality or suspect data. We assume that each tuple or attribute value of a given dataset is associated with a quality level and we define several extensions of skyline queries that make it possible to take data quality into account when checking whether a tuple is dominated by another. This leads to the computation of different types of gradual (fuzzy) skylines.

Hélène Jaudoin, Olivier Pivert, Grégory Smits, Virginie Thion
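One possible extension, sketched in code: dominance is granted only to tuples whose quality level clears a threshold. The q_min cut-off and the maximisation convention are assumptions; the paper defines several variants, including gradual (fuzzy) ones.

```python
def dominates(u, v, qu, q_min=0.5):
    """u dominates v (all criteria maximised) only if u's quality clears q_min."""
    return (qu >= q_min
            and all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def quality_aware_skyline(tuples, quality):
    return [t for i, t in enumerate(tuples)
            if not any(dominates(u, t, quality[j])
                       for j, u in enumerate(tuples) if j != i)]

data = [(3, 4), (2, 5), (3, 5)]
# the strongest tuple has suspect quality, so it cannot knock the others out
print(quality_aware_skyline(data, [0.9, 0.8, 0.2]))  # all three survive
```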
Neuroscience Rough Set Approach for Credit Analysis of Branchless Banking

This paper focuses on mobile banking, often referred to as "branchless banking", which presents a platform wherein rough set theory algorithms can enhance autonomous machine learning to analyze credit for a purely mobile banking platform. First, the terms "mobile banking" and "branchless banking" are defined. Next, the paper reviews the huge impact branchless banking with credit analysis will have on the world and on traditional banking models as it becomes a reality in Africa. The credit analysis techniques of current branchless banks such as Wonga are then explained, and an improvement on their techniques is presented. Finally, experiments implementing the author's neuroscience algorithms, applied with rough SVMs, Variable Precision Rough Set Models and Variable Consistency Dominance-based Rough Set Approach models, are performed on financial data sets and their results are presented.

Rory Lewis
Collective Inference for Handling Autocorrelation in Network Regression

In predictive data mining tasks, we should account for the autocorrelation of both the independent variables and the dependent variable, which can be observed in the neighborhood of a target node as well as at the node itself. The prediction for a target node should be based on the values of its neighbours, which might themselves be unavailable. To address this problem, the values of the neighbours should be inferred collectively. We present a novel computational solution for performing collective inference in a network regression task. We define an iterative algorithm that makes regression inferences about the predictions of multiple nodes simultaneously and feeds back the more reliable predictions made by the previous models in the labeled network. Experiments investigate the effectiveness of the proposed algorithm on spatial networks.

Corrado Loglisci, Annalisa Appice, Donato Malerba
On Predicting a Call Center’s Workload: A Discretization-Based Approach

Agent scheduling in call centers is a major management problem, as the optimal ratio between service quality and costs is hardly ever achieved. In the literature, regression and time series analysis methods have been used to address this problem by predicting future arrival counts. In this paper, we propose to discretize these target variables into finite intervals. By reducing the domain length, the goal is to accurately mine the demand peaks, as these are the main cause of abandoned calls. This is done by employing multi-class classification. The approach was tested on a real-world dataset acquired from a taxi dispatching call center. The results demonstrate that this framework can accurately reduce the number of abandoned calls while maintaining a reasonable staff-based cost.

Luis Moreira-Matias, Rafael Nunes, Michel Ferreira, João Mendes-Moreira, João Gama
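A sketch of the discretize-then-classify idea, with hypothetical interval boundaries and an off-the-shelf learner; the paper's exact bins, features and classifier are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
counts = rng.poisson(lam=20, size=500)   # arrival counts per period (synthetic)

y = np.digitize(counts, bins=[15, 25])   # discretised target: low/normal/peak

# three lagged counts as features; predict the next period's interval
X = np.array([counts[i - 3:i] for i in range(3, len(counts))])
y = y[3:]

clf = RandomForestClassifier(random_state=0).fit(X[:-50], y[:-50])
print(clf.score(X[-50:], y[-50:]))       # held-out interval accuracy
```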
Improved Approximation Guarantee for Max Sum Diversification with Parameterised Triangle Inequality

We present an improved 2/α approximation guarantee for the problem of selecting a diverse set of p items when the problem is formulated as the Max Sum Facility Dispersion problem and the underlying dissimilarity measure satisfies the parameterised triangle inequality with parameter α.

The diversity-aware approach is gaining interest in many important applications such as web search, recommendation, database querying or summarisation, especially in the context of an ambiguous user query or an unknown user profile.

In addition, we make some observations on the applicability of these results to practical computations on real data and link them to important recent applications of the result diversification problem in web search and semantic graph summarisation. The results apply to both relaxed and strengthened variants of the triangle inequality.

Marcin Sydow
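For reference, a common formulation of the parameterised triangle inequality (the paper's exact convention may differ): a dissimilarity d satisfies it with parameter α when

```latex
% parameterised triangle inequality with parameter \alpha
\alpha \cdot d(x, z) \;\le\; d(x, y) + d(y, z) \qquad \text{for all items } x, y, z
```

so α = 1 recovers the usual triangle inequality and the classical 2-approximation for Max Sum Dispersion, α > 1 (strengthened) improves the guarantee to 2/α, and α < 1 (relaxed) weakens it accordingly.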
Learning Diagnostic Diagrams in Transport-Based Data-Collection Systems

Insights about service improvement in a transit network can be gained by studying transit service reliability. In this paper, a general procedure for constructing a transit service reliability diagnostic (Tsrd) diagram based on a Bayesian network is proposed to automatically build a behavioural model from Automatic Vehicle Location (AVL) and Automatic Passenger Counter (APC) data. Our purpose is to discover the variability of transit service attributes and their effects on traveller behaviour. A Tsrd diagram describes and helps to analyse the factors affecting public transport by combining domain knowledge with statistical data.

Vu The Tran, Peter Eklund, Chris Cook
Backmatter
Metadata
Title
Foundations of Intelligent Systems
Edited by
Troels Andreasen
Henning Christiansen
Juan-Carlos Cubero
Zbigniew W. Raś
Copyright year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-08326-1
Print ISBN
978-3-319-08325-4
DOI
https://doi.org/10.1007/978-3-319-08326-1