
2005 | Book

Intelligent Information Processing and Web Mining

Proceedings of the International IIS: IIPWM’05 Conference held in Gdańsk, Poland, June 13–16, 2005

Edited by: Prof. Dr. Mieczysław A. Kłopotek, Prof. Dr. Sławomir T. Wierzchoń, Dr. Krzysztof Trojanowski

Publisher: Springer Berlin Heidelberg

Book series: Advances in Intelligent and Soft Computing


About this book

The international conference Intelligent Information Processing and Web Mining IIS:IIPWM’05, organized in Gdańsk-Sobieszewo on 13–16 June 2005, was a continuation of a long tradition of conferences on applications of Artificial Intelligence (AI) in Information Systems (IS), organized by the Institute of Computer Science of the Polish Academy of Sciences in cooperation with other scientific and business institutions. The Institute itself is deeply engaged in research both in AI and IS, and many scientists view it as a leading institution in both fundamental and applied research in these areas in Poland. The originators of this conference series, Prof. M. Dąbrowski and Dr. M. Michalewicz, had in 1992 a long-term goal of bringing together scientists and industry of different branches from Poland and abroad to achieve a creative synthesis. One can say that their dream has come to reality. Scientists from five continents made their submissions to this conference. A brief look at the affiliations makes the international cooperation visible. The research papers have either a motivation in concrete applications or are offsprings of some practical requests. This volume presents the best papers carefully chosen from a large set of submissions (about 45%). At this point we would like to express our thanks to the members of the Programme Committee for their excellent job. Also we are thankful to the organizers of the special sessions accompanying this conference: Jan Komorowski, Adam Przepiórkowski, Zbigniew W.

Table of Contents

Frontmatter

Regular Sessions: Knowledge Discovery and Exploration

Frontmatter
Feature Extraction by the SCoTLASS: An Illustrative Example

Derivation of new features of observed variables has two important goals: reduction of dimensionality and de-noising. A desired property of the derived new features is their meaningful interpretation. The SCoTLASS method (Jolliffe, Trendafilov and Uddin, 2003) offers such a possibility.

We explore the properties of the SCoTLASS method applied to the yeast genes data investigated in (Bartkowiak et al., 2003, 2004). All the derived features have a really simple, meaningful structure: each new feature is spanned by two original variables belonging to the same block.

Anna Bartkowiak, Nickolay T. Trendafilov
Rule Induction for Click-Stream Analysis: Set Covering and Compositional Approach

We present a set covering algorithm and a compositional algorithm for describing sequences of www page visits in click-stream data. The set covering algorithm follows the rule-specialization approach of the well-known CN2 algorithm; the compositional algorithm is based on our original KEX algorithm. Both algorithms deal with sequences of events (visited pages) instead of sets of attribute-value pairs. The learned rules can be used to predict the next page a user will view, or to describe the most typical paths of www page visitors and the dependencies among the pages. We have successfully used both algorithms on real data from an internet shop and mined useful information from the data.

Petr Berka, Vladimír Laš, Tomáš Kočka
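
The rule format below is only a toy illustration of the sequence-rule idea described in the abstract, not Berka et al.'s CN2- or KEX-based algorithms; the pages and confidence values are hypothetical.

```python
# A minimal, hypothetical sketch of using sequence rules to predict the next
# page in a click-stream; rule format and matching policy are illustrative.

def matches(antecedent, session):
    """A rule fires when its antecedent is the suffix of the session so far."""
    n = len(antecedent)
    return n <= len(session) and session[-n:] == antecedent

def predict_next(session, rules):
    """Return the successor page predicted by the most confident firing rule."""
    firing = [r for r in rules if matches(r["if"], session)]
    if not firing:
        return None
    return max(firing, key=lambda r: r["confidence"])["then"]

# Hypothetical rules mined from sessions of an internet shop.
rules = [
    {"if": ["home", "laptops"], "then": "laptop-x200", "confidence": 0.42},
    {"if": ["laptops"],         "then": "basket",      "confidence": 0.17},
]

print(predict_next(["home", "laptops"], rules))  # -> laptop-x200
```
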
Family of Instance Reduction Algorithms Versus Other Approaches

The goal of the paper is to compare the performance of instance reduction algorithms (IRA) with other approaches. The paper briefly describes a family of instance reduction algorithms proposed by the authors. To evaluate their performance, a computational experiment is carried out that compares IRA with several alternative machine learning classification tools.

Ireneusz Czarnowski, Piotr Jędrzejowicz
Minimum Spanning Trees Displaying Semantic Similarity

Similarity of semantic content of web pages is displayed using interactive graphs presenting fragments of minimum spanning trees. Homepages of people are analyzed, parsed into XML documents and visualized using TouchGraph LinkBrowser, displaying clusters of people who share common interests. The structure of these graphs is strongly affected by the selection of information used to calculate similarity. The influence of simple selection and Latent Semantic Analysis (LSA) on the structure of such graphs is analyzed. Homepages and lists of publications are converted to word frequency vectors, filtered and weighted, and a similarity matrix between normalized vectors is used to create separate minimum sub-trees showing the clustering of people's interests. Results show that in this application simple selection of important keywords is as good as LSA, but with much lower algorithmic complexity.

Włodzisław Duch, Paweł Matykiewicz
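
As a rough sketch of the pipeline described above, the following builds a cosine-similarity matrix from hypothetical word-frequency vectors and extracts a minimum spanning tree with SciPy; the filtering, weighting and TouchGraph visualization steps are omitted.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical word-frequency vectors for four homepages (rows), assumed
# already filtered and weighted as the abstract describes.
X = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 0, 2],
    [0, 3, 1, 2],
], dtype=float)

# Normalize rows, then cosine similarity; distance = 1 - similarity.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
dist = 1.0 - Xn @ Xn.T
np.fill_diagonal(dist, 0.0)

# The MST keeps only the strongest links, which is what gets visualized.
mst = minimum_spanning_tree(dist).toarray()
for i, j in zip(*np.nonzero(mst)):
    print(f"edge {i}-{j}, distance {mst[i, j]:.3f}")
```
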
Concept Description Vectors and the 20 Question Game

Knowledge of the properties that are applicable to a given object is a necessary prerequisite to formulating intelligent questions. Concept description vectors provide the simplest representation of this knowledge, storing for each object information about the values of its properties. Experiments with automatic creation of concept description vectors from various sources, including ontologies, dictionaries, encyclopedias and unstructured text sources, are described. Information collected in this way is used to formulate questions that have high discriminating power in the twenty questions game.

Włodzisław Duch, Julian Szymański, Tomasz Sarnatowicz
Automatic Scripts Retrieval and Its Possibilities for Social Sciences Support Applications

This paper introduces our method for automatic retrieval of Schankian-like scripts from Internet resources and its preliminary results, which might be interesting for Social Sciences researchers. We describe the first module of our system, which is supposed to automatically retrieve commonsense knowledge from Web resources using web-mining techniques. It retrieves minimal “object — action — action” scripts which show how humans’ common activities change depending on the origin of a webpage author. Such data can be used in economics, psycholinguistics, sociolinguistics, psychology, sociology, or in language education. With this paper we would like to draw NLP researchers’ attention to the potential of commonsense retrieval and encourage them to consider creating such tools for their languages.

Yali Ge, Rafał Rzepka, Kenji Araki
The Analysis of the Unlabeled Samples of the Iron Age Glass Data

The late Iron Age glass database contains a significant proportion of samples whose classification is unknown. Data-mining methods such as rule induction, clustering and visualization are used in this paper to classify these samples into one of the three main chronological periods (La Tene C1, La Tene C2, La Tene D1) of the glass artifacts. The results of experiments performed with the C4.5 and Ridor algorithms, followed by analysis conducted by domain experts, indicate that the unlabeled samples constitute a mixture of all classes, in which LT C2 and LT D1 are in the majority.

Karol Grudziński, Maciej Karwowski
Discriminant versus Strong Rule Sets

The main objective of our research was to compare two completely different approaches to rule induction. In the first approach, represented by the LEM2 rule induction algorithm, induced rules are discriminant, i.e., every concept is completely described and the rules are consistent. In the second approach, represented by the IRIM rule induction algorithm, a few strong and simple rules are induced. These rules do not necessarily describe concepts completely and are, in general, inconsistent. Though LEM2 frequently outperforms IRIM, the difference in performance is statistically insignificant. Thus IRIM, inducing a few strong but simple rules, is a new and interesting addition to the LERS data mining system.

Jerzy W. Grzymala-Busse, Witold J. Grzymala-Busse, Jay Hamilton IV
IDARM — Mining of Indirect Association Rules

Typical association rules, called “direct” in this paper, reflect relationships between items that relatively often co-occur in common transactions. In the web domain, items correspond to pages and transactions to user sessions. The main idea of the new approach is to discover indirect associations between pages that rarely occur together but for which there are other, “third” pages, called transitive, with which both appear relatively frequently. Two types of indirect association rules are described in the paper: partial indirect associations and complete ones. The former respect a single transitive page, while the latter cover all existing transitive pages. The presented IDARM algorithm extracts complete indirect association rules, with their important confidence measure, using pre-calculated direct rules.

Przemysław Kazienko
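
A hedged sketch of the indirect-association idea, not the IDARM algorithm itself: the confidence measure below (minimum of the two direct confidences, averaged over qualifying transitive pages) is an illustrative assumption, as the paper defines its own measures.

```python
# Pages x and y rarely co-occur, but both are directly associated with a
# transitive page t; hypothetical direct rules with their confidences.
direct = {
    ("news", "sport"): 0.60, ("weather", "sport"): 0.55,
    ("news", "tv"): 0.20,    ("weather", "tv"): 0.15,
}

def partial_indirect(x, y, t, min_conf=0.5):
    """Confidence of the partial indirect rule between x and y via transitive
    page t, here taken as the smaller of the two direct confidences."""
    c = min(direct.get((x, t), 0.0), direct.get((y, t), 0.0))
    return c if c >= min_conf else 0.0

def complete_indirect(x, y, pages, min_conf=0.5):
    """Average the partial confidences over all transitive pages that work."""
    parts = [c for t in pages if (c := partial_indirect(x, y, t, min_conf)) > 0]
    return sum(parts) / len(parts) if parts else 0.0

print(complete_indirect("news", "weather", ["sport", "tv"]))  # 0.55 via 'sport'
```
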
Building a Concept Hierarchy from a Distance Matrix

Concept hierarchies are important in many generalized data mining applications, such as multiple-level association rule mining. In the literature, a concept hierarchy is usually given by domain experts. In this paper, we propose algorithms that automatically build a concept hierarchy from a provided distance matrix, by modifying traditional hierarchical clustering algorithms. For evaluation, a distance matrix is derived from the concept hierarchy built by our algorithm, and the root mean squared error between the provided and derived distance matrices is used as the criterion. We compare traditional hierarchical clustering and our modified algorithm under three strategies of computing cluster distance, namely single link, average link, and complete link. Empirical results show that the traditional algorithm performs best under the complete link strategy. Our modified algorithms perform almost the same under the three strategies, and they outperform the traditional algorithms in various situations.

Huang-Cheng Kuo, Jen-Peng Huang
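
The evaluation loop described for the traditional baseline can be sketched with SciPy: cluster a given distance matrix under the three link strategies, then score each tree by the RMSE between the provided distances and the tree's cophenetic distances. The distance matrix is hypothetical, and the authors' modified algorithm is not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Hypothetical distance matrix over five concepts (symmetric, zero diagonal).
D = np.array([
    [ 0,  2,  6, 10,  9],
    [ 2,  0,  5,  9,  8],
    [ 6,  5,  0,  4,  5],
    [10,  9,  4,  0,  3],
    [ 9,  8,  5,  3,  0],
], dtype=float)
provided = squareform(D)  # condensed form, as SciPy expects

# Cluster under each link strategy, derive the tree's cophenetic distances,
# and score the tree by RMSE against the provided distances.
for method in ("single", "average", "complete"):
    Z = linkage(provided, method=method)
    _, derived = cophenet(Z, provided)
    rmse = np.sqrt(np.mean((provided - derived) ** 2))
    print(f"{method:8s} RMSE = {rmse:.3f}")
```
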
Literal Trees and Resolution Technique

The problem of mechanized deduction requires research on and comparison of different methods for inference search in first-order classical logic, such as resolution-type methods, the model elimination method, SLD-resolution, and so on. In this connection, it is desirable to have a way of investigating their common and distinct features in order to use them better in theory and practice. This paper is devoted to such an investigation. An interconnection is established between a complete extension of SLD-type resolution, having the form of a literal tree calculus, and a certain resolution technique. The interconnection permits obtaining some results on soundness and completeness for different resolution-type methods under the weakest requirements on factorization. In addition, when classical logic with equality is considered, it opens an original way for completely incorporating paramodulation into resolution with weak factorization, as well as into the model elimination method.

Alexander Lyaletski, Alexander Letichevsky, Oleksandr Kalinovskyy
Rough Classification Used for Learning Scenario Determination in Intelligent Learning System

Learning scenario determination is one of the key tasks of every Intelligent Learning System (ILS). This paper presents a method for learner classification in ILS based on the rough classification methods proposed by Pawlak. The goal of rough learner classification is the selection of a minimal set of learner profile attributes and their values that can be used to determine the optimal learning scenario. To this end, the problems of rough classification are defined and their solutions presented.

Ngoc Thanh Nguyen, Janusz Sobecki
Rough Ethograms: Study of Intelligent System Behavior

This article introduces a new form of ethogram that provides a basis for studying reinforcement learning in biologically inspired collective robotics systems. In general, an ethogram is a record of behavior patterns, which has grown out of ethology (ways to explain agent behavior). The rough set approach introduced by Zdzisław Pawlak in 1982 provides a ground for deriving pattern-based rewards in the context of an approximation space. The framework provided by an approximation space makes it possible to derive pattern-based reference rewards used to compute action rewards as well as action preferences. A brief description of a prototype of an ecosystem testbed used to record ethograms in a dynamically changing system of agents is presented. The contribution of this article is an introduction to an ethological approach to the study of action preferences and action rewards during reinforcement learning in intelligent systems considered in the context of approximation spaces.

James F. Peters, Christopher Henry, Sheela Ramanna
Automatic Knowledge Retrieval from the Web

This paper presents a method of automatic knowledge retrieval from the web. The aim of the system that implements it is to automatically create entries in a knowledge database, similar to the ones provided by volunteer contributors. As only a small fraction of the statements accessible on the web can be treated as valid knowledge concepts, we considered a method for their filtering and verification, based on similarity measurements with the concepts found in a manually created knowledge database. The results demonstrate that the system can retrieve valid knowledge concepts both for topics that are described in the manually created database and for ones that are not covered there.

Marcin Skowron, Kenji Araki
Knowledge Visualization Using Optimized General Logic Diagrams

Knowledge Visualizer (KV) uses a General Logic Diagram (GLD) to display examples and/or various forms of knowledge learned from them in a planar model of a multi-dimensional discrete space. Knowledge can be in different forms, for example decision rules, decision trees, logical expressions, clusters, classifiers, and neural nets with discrete input variables. KV is implemented as a module of the inductive database system VINLEN, which integrates a conventional database system with a range of inductive inference and data mining capabilities. This paper briefly describes the KV module and then focuses on the problem of arranging the attributes that span the diagram in a way that leads to the most readable rule visualization. This problem has been solved by applying simulated annealing.

Bartłomiej Śnieżyński, Robert Szymacha, Ryszard S. Michalski
Efficient Processing of Frequent Itemset Queries Using a Collection of Materialized Views

One of the classic data mining problems is the discovery of frequent itemsets. Frequent itemset discovery tasks can be regarded as advanced database queries specifying the source dataset, the minimum support threshold, and optional constraints on itemsets. We consider a data mining system which supports storing the results of previous queries in the form of materialized data mining views. Previous work on materialized data mining views addressed the issue of reusing the results of one previous frequent itemset query to efficiently answer a new query. In this paper we present a new approach to frequent itemset query processing in which a collection of materialized views can be used for that purpose.

Marek Wojciechowski, Maciej Zakrzewicz

Regular Sessions: Computational Linguistics

Frontmatter
GramCat and GramEsp: two grammars for chunking

In this article we present two grammars (GramCat and GramEsp) for chunking of unrestricted Catalan and Spanish texts. With these grammars we extend the classical notion of chunk as it is defined by Abney, taking advantage of Catalan and Spanish morphosyntactic features: Catalan and Spanish rich inflectional morphology and the high frequency of some prepositional patterns allow us to include both pre- and post-nominal modifiers in the noun phrase.

Montserrat Civit, M. Antònia Martí
Dynamic Perfect Hashing with Finite-State Automata

Minimal perfect hashing provides a mapping between a set of n unique words and n consecutive numbers. When implemented with minimal finite-state automata, the mapping is determined only by the (usually alphabetical) order of words in the set. Addition of new words would change the order of words already in the language of the automaton, changing the whole mapping, and making it useless in many domains. Therefore, we call it static. Dynamic minimal perfect hashing assigns consecutive numbers to consecutive words as they are added to the language of the automaton. Dynamic perfect hashing is important in many domains, including text retrieval and databases. We investigate three methods for its implementation.

Jan Daciuk, Denis Maurel, Agata Savary
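
For context, here is a minimal sketch of the static scheme the abstract starts from, with a plain trie standing in for the minimal automaton: each node counts the words in its subtree, and a word's hash is its alphabetical rank. The paper's dynamic methods, which number words in insertion order, are not reproduced here.

```python
# Static minimal perfect hashing over a trie: hash(w) is the number of
# stored words that precede w alphabetically. Assumes unique words.

class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self.final = False
        self.count = 0      # number of words whose path passes through here

def build(words):
    root = Node()
    for w in words:
        root.count += 1
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
            node.count += 1
        node.final = True
    return root

def hash_word(root, w):
    """Perfect hash: alphabetical rank of w, or None if w is not stored."""
    node, rank = root, 0
    for ch in w:
        if node.final:
            rank += 1                      # a stored prefix of w precedes it
        for c in node.children:
            if c < ch:                     # whole earlier subtrees precede w
                rank += node.children[c].count
        node = node.children.get(ch)
        if node is None:
            return None
    return rank if node.final else None

root = build(["car", "cat", "dog"])
print([hash_word(root, w) for w in ("car", "cat", "dog")])  # [0, 1, 2]
```
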
Dictionary-Based Part-of-Speech Tagging of Polish

A set of 70 lexical symbols is defined which covers written Polish. Under the assumption of a second-order Markov process, a dictionary-based method of tagging parts of speech is presented. Instead of having to be trained on previously tagged text, the parameters of the method are estimated from frequencies of unambiguously classifiable unigrams, bigrams and trigrams of lexical symbols found in untagged literary text. The method is shown to tag 88% of all words in text correctly.

Stanisław Galus
Enhancing a Portuguese Text Classifier Using Part-of-Speech Tags

Support Vector Machines have been applied to text classification with great success. In this paper, we apply and evaluate the impact of using part-of-speech tags (nouns, proper nouns, adjectives and verbs) as a feature selection procedure on a dataset written in European Portuguese — the Portuguese Attorney General’s Office documents.

From the results, we can conclude that verbs alone do not carry enough information to produce good learners. On the other hand, we obtain learners with equivalent performance and a reduced number of features (at least half) if we use specific part-of-speech tags instead of all words.

Teresa Gonçalves, Paulo Quaresma
A Neural Network Based Morphological Analyser of the Natural Language

The paper proposes a morphological analyser supported by a neural network to inflect words written in Polish. The approach can also be applied to other languages. The main task of the analyser is to create base forms from the analysed word forms. Another objective is to provide grammatical information for the analysed form. Computational experiment results confirm that both objectives are fulfilled by the proposed neural network based morphological analyser. Common words are inflected with a very high accuracy of nearly 99.9%. Other words, like geographical names and people's names, are inflected with an accuracy reaching 93.3%, thanks to the incorporation of the neural network.

Piotr Jędrzejowicz, Jakub Strychowski
An Oracle Grammar-Rule Learning Based on Syntactic Knowledge

In this paper, we put forward two algorithms for learning Chinese oracle-bone grammar rules: automatic learning of rule sets and error-emended learning from the corpus. Through analysing the syntactic knowledge of oracle-bone inscriptions, an error-emended learning method is used to construct a rule base of oracle-bone phrase structure, which combines linguists' introspection and summarization with corpora to capture rules and describe them with formalized symbols. By using part-of-speech and contextual information together, our experimental results show that the learning and its expression are effective for the oracle-bone grammar rule base.

Minghu Jiang, Huiying Cai, Dafan Liu, Junzhen Wang
Speech Interface for Internet Service “Yellow Pages”

The paper describes a new automatic service intended for telephone directory inquiries without a human operator. We propose to use an automatic speech recognition system for answering users' queries. The information about firms and organizations is taken from the electronic catalogue “Yellow Pages of Saint-Petersburg”, located on the Internet and regularly updated with new information. Developing such a system for Russian poses significant problems connected with the complex word-formation structure of the Russian language. The developed speech recognition system SIRIUS therefore has one additional level of Russian language representation: the morphemic level. As a result, the size of the vocabulary and the time for speech processing are significantly decreased. The dialogue model for voice control of the electronic catalogue “Yellow Pages” is also presented in the paper. The first experimental results indicate good prospects for a large-vocabulary telephone speech recognizer for Russian.

Alexey Karpov, Andrey Ronzhin
Automatic Information Classifier Using Rhetorical Structure Theory

Information classification aims to protect documents from being disclosed. Information is classified according to its critical semantics. The decision to classify a portion of a document as ‘secret’ depends on the effect its disclosure would have on the organization the document was written for. However, understanding the semantics of a document is not an easy task. Rhetorical structure theory (RST) is one of the leading theories aimed at this problem. In this paper, we explain a technique to classify information using RST.

Hassan Mathkour, Ameur Touir, Waleed Al-Sanie
Rule-Based Medical Content Extraction and Classification

We present the final version of the system for automatic content extraction from Polish medical data. The system combines general IE techniques with external post-processing. The obtained data are normalized and linked to a simplified ontology, then automatically grouped to form more complex structures representing medical reports.

Agnieszka Mykowiecka, Anna Kupść, Małgorzata Marciniak
A Rule-Based Tagger for Polish Based on Genetic Algorithm

In the paper an approach to the construction of a rule-based morphosyntactic tagger for Polish is proposed. The core of the tagger consists of modules of rules (classification systems) acquired from the IPI PAN corpus by application of Genetic Algorithms. Each module is specialised in making decisions concerning a different part of a tag (a structure of attributes). The acquired rules are combined with hand-written linguistic rules and with memory-based rules also acquired from the corpus. The construction of the tagger and experiments concerning its properties are also presented in the paper.

Maciej Piasecki, Bartłomiej Gaweł

Regular Sessions: Search Engines

Frontmatter
On Some Clustering Algorithms for Document Maps Creation

In this research paper we point out the need to redesign the WebSOM document map creation algorithm. We argue that SOM clustering should be preceded by identification of the major topics of the document collection, and by a pre-clustering process resulting in groups of documents with stronger relationships; the groups, not the individual documents, should then be the subject of SOM clustering. We propose appropriate algorithms and report on the achieved improvements.

Krzysztof Ciesielski, Michał Dramiński, Mieczysław A. Kłopotek, Mariusz Kujawiak, Sławomir T. Wierzchoń
Filtering Web Documents for a Thematic Warehouse Case Study: eDot a Food Risk Data Warehouse (extended)

Ordinary sources, like databases and general-purpose document collections, seem insufficient and inadequate to meet the needs and requirements of the new generation of warehouses: thematic data warehouses. Since more and more thematic data is available online, the web can be considered a useful data source for populating thematic data warehouses. To do so, the warehouse data supplier must be able to filter the heterogeneous web content to keep only the documents corresponding to the warehouse topic. Therefore, building efficient automatic tools to characterize web documents dealing with a given theme is essential to meeting the warehouse data acquisition challenge. In this paper, we present our filtering approach implemented in an automatic tool called “eDot-Filter”. This tool is used to filter crawled documents to keep only the documents dealing with food risk. These documents are then stored in a thematic warehouse called “eDot”. Our filtering approach is based on “WeQueL”, a declarative web query language that improves the expressive power of keyword-based queries.

Amar-Djalil Mezaour
Data Merging in Life Science Data Integration Systems

An index-driven integration system provides access to a multitude of data sources: it uses pre-compiled indexes covering the content of these sources. Such a scenario is especially attractive in life science applications, which integrate data from hundreds of very valuable, carefully maintained databases. A key bottleneck in building such systems is data merging, where partial answers obtained from different data sources are to be merged and the problem of overlapping data must be solved. In response to a query, the most informative redundancy-free answer should be constructed. In the paper we propose a formal foundation for merging XML-like data and discuss indexing support for data merging.

Tadeusz Pankowski, Ela Hunt
Approximation Quality of the RBS Ranking Algorithm

The RBS algorithm is a novel link-based algorithm for ranking the results of a search engine. RBS may be viewed as an extension of PageRank with parameterized “back button” modeling. RBS is based on the “random surfer with back step” model [7], much as PageRank is based on the simpler “random surfer” model [4]. To scale to the real Web, RBS computes merely a fast probabilistic approximation of the ranking induced by the “random surfer with back step” model [6].

In this paper we experimentally measure the quality of this approximation using our own implementation of a high-quality synthetic Web evolution model [5].

The results demonstrate that RBS is a very good approximation of the “ideal” ranking. Furthermore, as the experiment shows, RBS clearly outperforms PageRank in “back step” modeling even if we try to parameterize the latter.

Marcin Sydow
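
RBS's back-step model needs the surfer's history and is not reproduced here; the following sketches only the baseline “random surfer” PageRank [4] that RBS extends, via standard power iteration on a tiny hypothetical graph.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """Plain 'random surfer' PageRank by power iteration: follow an outlink
    with probability `damping`, otherwise jump to a random page."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling pages jump uniformly.
    P = np.where(out > 0, adj / np.maximum(out, 1), 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - damping) / n + damping * (P.T @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# Tiny hypothetical web graph: adj[i, j] = 1 if page i links to page j.
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
print(pagerank(adj).round(3))
```
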

Regular Sessions: Biologically Motivated Algorithms and Systems

Frontmatter
Effects of Versatile Crossover and Mutation Operators on Evolutionary Search in Partition and Permutation Problems

The paper investigates the influence of versatile crossover and mutation operators on the efficiency of evolutionary search in solving two important classes of hard optimization problems. The chromosome representations of set partitions and permutations defined in the paper are not problem-oriented and are described together with their versatile variation operators. The proposed representations are tested in evolutionary programs on standard partition and permutation problems, i.e. graph coloring (GCP) and the traveling salesman problem (TSP). The optimization results vary depending on the problem class: they are relatively positive for GCP and negative for TSP.

Zbigniew Kokosiński
Global Induction of Oblique Decision Trees: An Evolutionary Approach

A new evolutionary algorithm for the induction of oblique decision trees is proposed. In contrast to the classical top-down approach, it searches for the whole tree at once. Specialized genetic operators are developed which enable modifying both the tree structure and the splitting hyper-planes in non-terminal nodes. The problem of over-fitting can be avoided thanks to a suitably defined fitness function. Experimental results on both synthetic and real-life data are presented and compared with those obtained by state-of-the-art decision tree systems.

Marek Krętowski, Marek Grześ
Nature-Inspired Algorithms for the TSP

Three nature-inspired algorithms are applied to solve the Travelling Salesman Problem (TSP). The first, an originally developed Multi-agent Evolutionary Algorithm (MAEA), is based on a multi-agent interpretation of the TSP. An agent is assigned to a single city and locally builds its neighbourhood — a subset of cities considered as local candidates for a global solution of the TSP. Creating cycles — global solutions of the TSP — is based on the Ant Colony (AC) paradigm. Found cycles are placed in a Global Table and evaluated by a genetic algorithm (GA) to modify the rank of cities in the local neighbourhood. MAEA is compared with two other algorithms applied to the TSP: an artificial immune system (AIS) and a standard AC. We present experimental results showing that MAEA outperforms both the AIS and AC algorithms.

Jarosław Skaruz, Franciszek Seredyński, Michał Gamus
Graph-Based Analysis of Evolutionary Algorithm

Evolutionary algorithms work in an algorithmically simple manner but produce a huge amount of data. Extracting useful information from it to gain further insight into the state of the algorithm is a non-trivial task. In the paper, we propose a method of analysing an evolutionary algorithm by means of graph theory. The method is inspired by recent results on scale-free networks and the small-world phenomenon. The paper presents a visualization of the evolutionary process based on network visualization software. The properties of such a network are analyzed and various research possibilities are discussed.

Zbigniew Walczak

Regular Sessions: Statistical and Database Methods in AI

Frontmatter
Probability of Misclassification in Bayesian Hierarchical Classifier

The paper deals with the probability of misclassification in a multistage classifier. The classification problem is based on a decision-tree scheme. For a given tree skeleton and features to be used, the Bayes decision rules at each non-terminal node are presented. The information on object features may be fuzzy or non-fuzzy. An upper bound on the difference between the probabilities of misclassification for the two kinds of information is presented. In the paper we use the maximum likelihood estimator for fuzzy data.

Robert Burduk
A study on Monte Carlo Gene Screening

In the paper, three conceptually simple but computer-intensive versions of an approach to selecting informative genes for classification are proposed. All of them rely on repeated construction of a tree classifier for many training sets randomly chosen from the original sample set, where samples in each training set consist of only a fraction of all of the genes. It is argued that the resulting ranking of genes can then be used to advantage for classification via a classifier of any type.

Michał Dramiński, Jacek Koronacki, Jan Komorowski
Spatial Telemetric Data Warehouse Balancing Algorithm in Oracle9i/Java Environment

Balancing the workload of parallel systems is essential to ensure minimal response time for submitted tasks. The complexity of data warehouse systems is very high with respect to system structure, data model and the many mechanisms used, all of which strongly influence overall performance. In this paper we present a dynamic algorithm for balancing the workload of a spatial telemetric data warehouse. We implement the HCAM data partitioning scheme, which uses Hilbert curves for space ordering. The scheme was modified so that the size of the dataset stored in each system node can be set. The presented algorithm iteratively calculates the optimal size of the partitions loaded into each node by executing a series of aggregations on a test data set. We investigate situations in which data both are and are not fragmented, and we test a set of fragment sizes. The performed system tests confirmed that the spatial telemetric data warehouse can be balanced by selecting the dataset size stored in each node. The project was implemented in the Java programming language using a set of available technologies.

Marcin Gorawski, Robert Chechelski
Comparison of Efficiency of Some Updating Schemes on Bayesian Networks

The problem of the efficiency of general reasoning with knowledge updating on Bayesian networks is considered. The optimization should take into account not only reasoning efficiency but also prospective updating issues. An improved updating process based on the idea of data removal is proposed, and a further possible improvement of the reasoning architecture is presented. Existing and proposed approaches are compared on the basis of a computational experiment, whose results are presented.

Tomasz Łukaszewski
Analysis of the Statistical Characteristics in Mining of Frequent Sequences

The paper deals with the search for and analysis of subsequences in large-volume sequences (texts, DNA sequences, etc.). A new algorithm for mining frequent sequences, ProMFS, is proposed and investigated. It is based on estimated probabilistic-statistical characteristics of the appearance of sequence elements and their order. The algorithm builds a new, much shorter sequence and makes decisions about the main sequence in accordance with the results of the analysis of the shorter one.

Romanas Tumasonis, Gintautas Dzemyda
PCA and ICA Methods for Prediction Results Enhancement

In this paper we show that applying multidimensional decompositions can improve modelling results. Predictions usually consist of two kinds of components: wanted and destructive ones. Rejecting the destructive components should improve the model. Statistical methods like PCA and ICA, with new modifications, are employed. An example from the telecom market proves the correctness of the approach.

Ryszard Szupiluk, Piotr Wojewnik, Tomasz Ząbkowski
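
A hedged illustration of the decomposition idea on synthetic data, not the authors' exact PCA/ICA procedure: three hypothetical forecasts share a wanted signal plus a destructive component, and dropping all but the leading principal component improves a single forecast.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 8 * np.pi, 400)
signal = np.sin(t)
shared_noise = rng.normal(size=t.size)

# Three hypothetical forecasts: a common wanted component (the signal) plus
# a destructive component entering with different signs and scales.
loadings = np.array([0.4, -0.4, 0.1])
X = signal[:, None] + shared_noise[:, None] * loadings

# PCA via SVD of the centered prediction matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep only the leading component; deciding which components are
# 'destructive' is exactly the modelling question the paper addresses.
X_clean = (U[:, :1] * s[:1]) @ Vt[:1] + X.mean(axis=0)

mse = lambda a: np.mean((a - signal) ** 2)
print(f"first forecast: raw MSE {mse(X[:, 0]):.4f}, "
      f"cleaned MSE {mse(X_clean[:, 0]):.4f}")
```
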
Creating Reliable Database for Experiments on Extracting Emotions from Music

Emotions can be expressed in various ways. Music is one possible medium for expressing emotions. However, the perception of music depends on many aspects and is very subjective. This paper focuses on collecting and labelling data for further experiments on discovering emotions in music audio files. A database of more than 300 songs was created and the data were labelled with adjectives. The whole collection represents 13 more detailed or 6 more general classes, covering the diverse moods, feelings and emotions expressed in the gathered music pieces.

Alicja Wieczorkowska, Piotr Synak, Rory Lewis, Zbigniew Ras

Poster Session

Frontmatter
You Must Be Cautious While Looking For Patterns With Multiresponse Questionnaire Data

The problem of testing statistical hypotheses of the independence of two multiresponse variables is considered. This is a specific inferential environment for analyzing certain patterns, particularly in questionnaire data. A data analyst normally looks for certain combinations of responses being chosen more frequently than others. As a result of an experimental study, we formulate some practical advice and suggest areas of further research.

Guillermo Ch Bali, Dariusz Czerski, Mieczysław Kłopotek, Andrzej Matuszewski
Immunological Selection Mechanism in Agent-Based Evolutionary Computation

Artificial immune systems have turned out to be an interesting technique introduced into the area of soft computing. In the paper, the idea of an immunological selection mechanism in agent-based evolutionary computation is presented. General considerations are illustrated by a particular system dedicated to function optimization. Selected experimental results conclude the work.

Aleksander Byrski, Marek Kisiel-Dorohinicki
Unfavorable Behavior Detection in Real World Systems Using the Multiagent System

Nowadays, detecting destructive attacks and dangerous activities is a crucial problem in many real-world security systems. A security system should make it possible to distinguish actors whose behavior may be unfavorable for the considered area. The considered areas are real-world systems, e.g. airports, shops or city centers. The design of a real-world security system should assume that dangerous activities are changing and unpredictable in type; the system has to detect and react to new kinds of dangers that have never been encountered before. This article presents methods derived from ethically-social and immunological mechanisms that should enable automated intrusion detection.

Krzysztof Cetnarowicz, Renata Cięciwa, Edward Nawarecki, Gabriel Rojek
Intelligent Navigation in Documents Sets Comparative Study

Today’s text searching mechanisms are based on keyword search. However, a large number of results makes such searches less effective. This paper presents results of searching for relevant documents based on the currently browsed document instead of keywords. Five methods were tested in a comparative study: LSA, PLSA, WebSOM, PHITS, and Query-Dependent PageRank.

Maciej Czyżowicz
Assessing Information Heard on the Radio

Much work in AI is devoted to the problem of how computers can acquire beliefs about the world through perception, but little effort is devoted to the problem of how computers can acquire beliefs through testimony. This paper is part of a continuing project whose ultimate goal is constructing an implementable model of how agents acquire knowledge through testimony. In particular, it looks at how agents acquire information from the radio, and it identifies many factors that may cause an agent to override the defeasible rule to believe what he hears.

Antoni Diller
Belief Rules vs. Decision Rules: A Preliminary Appraisal of the Problem

An in-house developed computer program system, Belief SEEKER, capable of generating belief networks and converting them into respective sets of belief rules, was applied to mining the melanoma database. The obtained belief rules were compared with production rules generated by the LERS system. It was found that belief rules can presumably be treated as a generalization of standard IF...THEN rules.

Jerzy W. Grzymała-Busse, Zdzisław S. Hippe, Teresa Mroczek
iMatch — A New Matchmaker For Large Scale Agent Applications

This paper discusses a new design for the matchmaker used in DECAF.

Ashwin Bhat Gurpur
Machine Learning and Statistical MAP Methods

For machine learning of an input-output function f from examples, we show it is possible to define an a priori probability density function on the hypothesis space to represent knowledge of the probability distribution of f, even when the hypothesis space H is large (i.e., nonparametric). This allows the extension of maximum a posteriori (MAP) estimation methods to nonparametric function estimation. Among other things, the resulting MAPN (MAP for nonparametric machine learning) procedure easily reproduces spline and radial basis function solutions of learning problems.

Mark Kon, Leszek Plaskota, Andrzej Przybyszewski
Autocovariance Based Weighting Strategy for Time Series Prediction with Weighted LS-SVM

Classic kernel methods (SVM, LS-SVM) use arbitrarily chosen loss functions that penalize errors on all training samples equally. In the problem of time series prediction, better results can be achieved when the relative importance of the samples is expressed in the loss function. In this paper, an autocovariance-based weighting strategy for chaotic time series prediction is presented. The proposed method can be considered a way to improve the performance of kernel algorithms by incorporating additional knowledge and information about the learning problem being analyzed.

Paweł Majewski
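
One plausible reading of an autocovariance-based weighting, shown as a sketch only (the paper's exact weighting scheme and the weighted LS-SVM solver are not reproduced): samples lagged k steps from the prediction point are weighted by the magnitude of the series' autocovariance at lag k. Such weights would then scale the squared-error terms in the weighted LS-SVM objective.

```python
import numpy as np

def autocovariance(x, max_lag):
    """Sample autocovariance of a series for lags 0..max_lag."""
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])

# Hypothetical chaotic-looking series (logistic map).
x = np.empty(500)
x[0] = 0.3
for i in range(1, len(x)):
    x[i] = 4.0 * x[i - 1] * (1.0 - x[i - 1])

# Weight for the sample lagged k steps behind the prediction point,
# normalized so the most informative lag gets weight 1.
acov = autocovariance(x, max_lag=20)
weights = np.abs(acov) / np.abs(acov).max()
print(np.round(weights[:6], 3))
```
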
Workflow Mining Alpha Algorithm — A Complexity Study

Workflow mining algorithms are used to improve and/or refine the design of existing workflows. Workflows are composed of sequential, parallel, conflict and iterative structures. In this paper we present the results of an experimental complexity study of the alpha workflow mining algorithm. We studied time and space complexity as dependent on the workflow's internal structure and on the number of workflow tasks.

Bolesław Mikolajczak, Jian-Lun Chen
Recommendation Rules — a Data Mining Tool to Enhance Business-to-Customer Communication in Web Applications

Contemporary information systems are facing challenging tasks involving advanced data analysis, pattern discovery, and knowledge utilization. Data mining can be successfully employed to sieve through huge amounts of raw data in search of interesting patterns. Knowledge discovered during data mining activity can be used to provide value-added services to users, customers, and organizations.

The adoption of the Web as one of the main media for business-to-customer (B2C) communication provides novel opportunities for using data mining to personalize and enhance customer interfaces. In this paper we introduce the notion of recommendation rules — a simple knowledge model that can be successfully used in the Web environment to improve the quality of B2C relationship by highly personalized communication. We present the formalism and we show how to efficiently generate recommendation rules from a large body of customer data.

Mikołaj Morzy
Feasibility Studies of Quality of Knowledge Mined from Multiple Secondary Sources
I. Implementation of generic operations

This paper briefly discusses the continuation of our research described earlier. The main goal of the current investigation is to build a combined (probably quasi-optimal) learning model merging several knowledge models developed by means of different machine learning algorithms. To reach this goal, a set of generic operations was implemented and tested on a melanocytic dataset.

Wiesław Paja, Zdzisław S. Hippe
Modern Metaheuristics for Function Optimization Problem

This paper compares the behaviour of three metaheuristics for the function optimization problem on a set of classical functions with a large number of variables that are known to be hard. The first algorithm described is Particle Swarm Optimization (PSO). The second is based on the paradigm of Artificial Immune Systems (AIS). Both algorithms are then compared with a Genetic Algorithm (GA). New insights into how these algorithms behave on a set of difficult objective functions with a large number of variables are provided.

Marek Pilski, Pascal Bouvry, Franciszek Seredyński
Property Driven Mining in Workflow Logs

We present a language for property specification for workflows and a tool for property checks. The language is based on Propositional Linear Temporal Logic and the structure of workflow logs. This language and tool help companies to diagnose business processes, react to changes in the business environment and collect formal definitions of business properties. We give examples of specifications of business properties that set relations between events of business processes.

Ella E. Roubtsova
Logical Design with Molecular Components

We propose a theoretical model to realize DNA-made circuits, based on in-vitro algorithms, to perform arithmetic and logical operations. The physical components of the resulting Arithmetic-Logic Unit are a variety of elements such as biochemical laboratories, test tubes and human operators. The advantage of the model is the possibility of performing arithmetic operations with huge binary numbers.

Filomena de Santis, Gennaro Iaccarino
A New Programming Interface for Reinforcement Learning Simulations

The field of Reinforcement Learning, a sub-field of machine learning, represents an important direction for research in Artificial Intelligence: a way of improving an agent's behavior given certain feedback about its performance. In this paper we propose an original interface for programming reinforcement learning simulations in known environments. Using this interface, simulations are possible both for reinforcement learning based on state utilities and for learning based on action values (Q-learning).

Gabriela Şerban
Anomaly Detection System for Network Security: Immunity-based Approach

In this paper we present the architecture of a recently built experimental anomaly detection system based on the paradigm of artificial immune systems and working in a network environment. We show how network traffic data are mapped into antibodies or antigens of the artificial immune system and how similarities between signatures of attackers and antibodies are measured. We present an example of the system at work in a real network environment.

Franciszek Seredyński, Pascal Bouvry, Dawid R. Rutkowski
Importance of TDS Attribute in Computer Assisted Classification of Melanocytic Skin Lesions

The paper describes a new algorithm, and the results of its application, for determining the importance of the TDS attribute in identifying melanocytic skin lesions. The algorithm consists of a decision table, a TDS belief net, and the nearest neighbor method applied to databases obtained by reducing the number of attributes from thirteen to four. The obtained results show that this algorithm is very promising.

Aleksander Sokołowski
A Rules-to-Trees Conversion in the Inductive Database System VINLEN

Decision trees and rules are complementary methods of knowledge representation; each has advantages in some applications. Algorithms that convert trees to rules are common. In this paper, an algorithm that converts rules to a decision tree, and its implementation in the inductive database system VINLEN, is presented.

Tomasz Szydło, Bartłomiej Śnieżyński, Ryszard S. Michalski

Invited Session: Syntactic Parsing and Machine Learning

Frontmatter
Deep Parser for Free English Texts Based on Machine Learning with Limited Resources

The paper presents an attempt at the construction of a wide scale parser for English based on Inductive Learning and limited resources. The parser loosely preserves the shift-reduce scheme, enriched with powerful actions, and uses a compound Decision Tree instead of the decision table. The attempt originates directly from Hermjakob's ideas [3], but an important goal was to analyse possible extensions to a wide scale solution. Several supporting heuristics, as well as an overview of the development process and experiments, are described in the paper.

Marek Łabuzek, Maciej Piasecki
Baseline Experiments in the Extraction of Polish Valence Frames

Initial experiments in learning valence (subcategorisation) frames of Polish verbs from a morphosyntactically annotated corpus are reported here. The learning algorithm consists of a linguistic module, responsible for very simple shallow parsing of the input text (nominal and prepositional phrase recognition) and for the identification of valence frame cues (hypotheses), and a statistical module which implements three well-known inferential statistics (likelihood ratio, t test, binomial miscue probability test). The results of the three statistics are evaluated and compared with a baseline approach of selecting frames on the basis of the relative frequencies of frame/verb co-occurrences. The results, while clearly reflecting the many deficiencies of the linguistic analysis and the inadequacy of the statistical measures employed here for a free word order language rich in ellipsis and morphosyntactic syncretisms, are nevertheless promising.

Adam Przepiórkowski, Jakub Fast
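
Of the three statistics, the binomial miscue probability test is the easiest to sketch (in the style of Brent's classic filter); the miscue rate and counts below are hypothetical.

```python
from math import comb

def binomial_tail(m, n, p):
    """P[X >= m] for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

def accept_frame(cues, occurrences, miscue_rate=0.05, alpha=0.05):
    """Keep a valence frame for a verb when the observed number of frame
    cues is unlikely to be explained by cue noise alone."""
    return binomial_tail(cues, occurrences, miscue_rate) < alpha

# Hypothetical counts: verb seen 50 times, frame cue observed 7 (or 3) times.
print(accept_frame(7, 50))   # True: 7 cues out of 50 is unlikely at 5% noise
print(accept_frame(3, 50))   # False: could easily be noise
```
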

Invited Session: New Trends in Data Mining and Knowledge Discovery in Uncertain, Nonstationary Spatio-Temporal Data

Frontmatter
Gene Expression Clustering: Dealing with the Missing Values

We propose a new method of dealing with missing values in gene expression data. It is applied to improve the quality of clustering genes with respect to their functionality. Calculations are run on real-life data within the framework of self-organizing maps. The applied gene distances correspond to the rank-based Spearman correlation and an entropy-based information measure.

Alicja Grużdź, Aleksandra Ihnatowicz, Dominik Ślęzak
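
A baseline sketch of a rank-based distance that tolerates missing values by using only pairwise-complete positions; the paper's actual treatment of missing values is its contribution and is not reproduced here.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_distance(u, v, min_shared=3):
    """Rank-correlation distance computed on pairwise-complete positions,
    one simple way to compare expression profiles with missing values."""
    mask = ~(np.isnan(u) | np.isnan(v))
    if mask.sum() < min_shared:
        return 1.0  # too few shared measurements to say anything
    rho, _ = spearmanr(u[mask], v[mask])
    return 1.0 - rho

# Two hypothetical expression profiles with missing measurements.
g1 = np.array([0.1, 0.4, np.nan, 0.9, 1.3])
g2 = np.array([0.2, 0.5, 0.7, np.nan, 1.1])
print(round(spearman_distance(g1, g2), 3))  # uses the 3 shared positions
```
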
Estimation the Rhythmic Salience of Sound with Association Rules and Neural Networks

In this paper, experiments aimed at improving the performance of systems retrieving musical rhythm are described. The authors briefly review machine learning models used to estimate the tendency of sounds to be located in accented positions. This is done on the basis of their physical attributes, such as duration, frequency and amplitude. For this purpose, the data mining association rule model and neural networks with discrete output (LVQ networks) are used. An evaluation method introduced by the authors makes it possible to compare the results returned by both models. This work aims at retrieving the multi-level rhythmic structure of a musical piece on the basis of its melody, which may result in retrieval systems for automatic music identification.

Bożena Kostek, Jarosław Wójcik, Piotr Holonowicz
A Neuro-Fuzzy Classifier Based on Rough Sets

In this paper, we use the concept of rough sets to define equivalence classes encoding input data and to eliminate redundant or insignificant attributes in data sets, which reduces the complexity of the designed systems. To deal with ill-defined or real experimental data, we represent input objects as fuzzy variables via fuzzy membership functions. Furthermore, we incorporate the significance factor of each input feature, corresponding to output pattern classification, to constitute a fuzzy inference which enhances classification, considered as a nonlinear mapping. A new kind of rough fuzzy neural classifier and a learning algorithm with LSE are proposed in this paper. The neuro-fuzzy classifier proposed here can realize a nonlinear mapping from the input feature vector space (which may have overlapping characteristics) into the output classification vector space.

Huanglin Zeng, Roman W. Swiniarski

Invited Session: Knowledge Base Systems

Frontmatter
Evolutionary Multi-Agent Model for Knowledge Acquisition

In this paper the conception of an evolutionary multi-agent model for knowledge acquisition is introduced. The basic idea of the proposed solution is to use the multi-agent paradigm to enable the integration and co-operation of different knowledge acquisition and representation methods. At the single-agent level a reinforcement learning process is realized, with the obtained knowledge represented as a set of simple decision rules. One of the conditions of effective agent learning is the optimization of the set of its features (parameters), which are represented by the genotype vector. Evolutionary optimization runs at the level of the population of agents.

Wojciech Froelich
Restricted Linear Information Systems

In the paper a class of infinite information systems is described. For decision tables over each such information system, there exist low upper bounds on the minimal complexity of decision trees, and polynomial algorithms for decision tree optimization under various complexity measures.

Mikhail Ju. Moshkov
The Concept of the Hierarchical Clustering Algorithms for Rules Based Systems

This paper presents a conception of a fast and useful inference process in knowledge based systems. The main known weakness is the long and unintelligent process of searching for rules during inference. The basic inference algorithm, used by the rule interpreter, tries to match the facts to rules in the knowledge base: it takes each rule and tries to execute it. As a result we receive a set of new facts, but it often contains redundant information unexpected by the user. The main goal of our work is to discover methods of controlling the inference process which allow us to obtain only the necessary decision information. The main idea is to create partitions of rules which can drive the inference process. That is why we try to use hierarchical clustering to agglomerate the rules.

Agnieszka Nowak, Alicja Wakulicz-Deja
Petri Net and Matrix Representation of Rule Knowledge Base for Verification Task

The problem of verification of a rule knowledge base covers the verification of dynamic properties, which reflect the processes occurring during inference. Detecting these anomalies requires modelling the dynamics of these processes. The decision unit conception suggested in previous papers did not guarantee such a possibility. This paper focuses on the analysis of the possible use of Petri nets and the incidence matrix as representations of the knowledge base. The paper presents the relation between Petri nets and decision units' nets, and simultaneously points to the possible use of Petri nets to develop the properties of decision units.

Roman Siminski
Artificial Neural Networks in Incomplete Data Sets Processing

This paper presents some results obtained in experiments with artificial neural networks trained with different learning algorithms in the case of missing data in the training and testing sets.

Magdalena Tkacz
Intelligent Data Processing in Distributed Internet Applications

We discuss the use of elements of the .NET platform — Web Services and XML — to create distributed internet applications as data processing systems. The main aim is to create a system for retrieving hidden relations between data. The application uses elements of rough set theory in order to obtain data which can be used for further exploration. Data are sent to particular Web Services in XML format and can represent various domains, e.g. medicine or pharmacy. The implementation of intelligent data processing techniques based on Web Services and the XML standard illustrates new possibilities for data processing in a distributed environment. It offers scalability and the possibility of adaptation in many specialized solutions.

Beata Zielosko, Alicja Wakulicz-Deja

Invited Session: KDD and Facial Recognition

Frontmatter
Skyline with Presorting: Theory and Optimizations

There has been interest recently in skyline queries, also called Pareto queries, on relational databases. Relational query languages do not support search for “best” tuples beyond the order by statement. The proposed skyline operator allows one to query for the best tuples with respect to any number of attributes as preferences. In this work, we explore what the skyline means and why skyline queries are useful, particularly for expressing preference. We describe the theoretical aspects and possible optimizations of an efficient algorithm for computing skyline queries presented in [6].

Jan Chomicki, Parke Godfrey, Jarek Gryz, Dongming Liang
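
A minimal sketch of the presorting idea (in the spirit of sort-filter-skyline, not necessarily the exact algorithm of [6]): after sorting by any monotone score such as the attribute sum, no tuple can be dominated by a later one, so a single pass with a window of current skyline points suffices.

```python
def dominates(p, q):
    """p dominates q when p is no worse everywhere and better somewhere
    (minimizing every attribute)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline_sfs(tuples):
    """Presort by a monotone score (here the attribute sum): any dominator
    has a strictly smaller sum, so it always appears earlier. One pass
    against the window of skyline points found so far then suffices."""
    window = []
    for t in sorted(tuples, key=sum):
        if not any(dominates(s, t) for s in window):
            window.append(t)
    return window

# Hypothetical hotels as (price, distance_to_beach); lower is better on both.
hotels = [(50, 8), (40, 9), (60, 2), (45, 5), (70, 1), (55, 5)]
print(skyline_sfs(hotels))  # [(40, 9), (45, 5), (60, 2), (70, 1)]
```
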
Faster Clustering with DBSCAN

Grouping data into meaningful clusters is one of the important tasks in the areas of artificial intelligence and data mining. DBSCAN is recognized as a high-quality, scalable algorithm for clustering data. It enables the determination of clusters of any shape and the identification of noise data. In this paper, we propose a method for improving the performance of DBSCAN. The usefulness of the method is verified experimentally, both for indexed and non-indexed data.

Marzena Kryszkiewicz, Łukasz Skonieczny
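
For readers unfamiliar with the baseline, a minimal DBSCAN usage sketch on synthetic data (scikit-learn's implementation, not the authors' improved method): the two dense blobs become clusters, and scattered points get the noise label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two dense blobs plus uniform background noise.
blob1 = rng.normal(loc=(0, 0), scale=0.2, size=(60, 2))
blob2 = rng.normal(loc=(3, 3), scale=0.2, size=(60, 2))
noise = rng.uniform(low=-2, high=5, size=(20, 2))
X = np.vstack([blob1, blob2, noise])

# eps is the neighborhood radius; min_samples the density threshold.
labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X)
print("clusters:", sorted(set(labels) - {-1}),
      "noise points:", int((labels == -1).sum()))
```
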
Innertron: New Methodology of Facial Recognition, Part I

On October 10, 2001, the US President identified the most wanted persons sought by the United States of America. Agencies such as the FBI, CIA and Homeland Security spread images of the most wanted persons across the United States. Even though US citizens saw their images on television, the Internet and posters, computers had, and still have for that matter, no ability at all to identify these persons. To date the FBI, CIA and Homeland Security depend entirely on human beings, not computers, to identify persons at borders and international airports. In other words, facial recognition remains an incompetent technology.

Accordingly, the authors first succinctly show the weaknesses of the current facial recognition methodologies, namely Eigenface Technology, Local Feature Analysis (from the classical 7-point to the 32–50 blocks approach), the Scale-Space Approach, Morphological Operations, and industrial or patented methodologies such as ILEFIS™, Viisage™, Visionics™, Cognitec's FaceVACS-Logon™, Identix™ and Neven Vision™. Secondly, they introduce a completely new, simple and robust methodology called Innertron.

Rory A. Lewis, Zbigniew W. Raś
Innertron: New Methodology of Facial Recognition, Part II

The first part of the Innertron methodology for facial recognition, presented in [1], removes those faces that certainly do not fall within the constraints of the nose domain, with an upper limit from (2.0, −1.5) to (−2.0, −1.5) and a lower domain from (3.0, −7.5) to (−3.0, −7.5). The 300 square pixels will contain the nose of the subject sought. The second part of the Innertron methodology, presented in this paper, in essence knows that the subject sought is somewhere close by, and now starts looking at features such as the eyebrows, the angles of the eyes, the thickness of the nose, and the shape of the nostril. Seven regions are distinguished that take into account the ability to lift eyebrows and, most importantly, the ability to cover portions of the face and still garner a hit in the database.

Rory A. Lewis, Zbigniew W. Raś
Approximating a Set of Approximate Inclusion Dependencies

Approximating a collection of patterns is a new and active area of research in data mining. The main motivation lies in two observations: the number of mined patterns is often too large to be useful for any end-user, and the user-defined input parameters of many data mining algorithms are most of the time almost arbitrarily defined (e.g. the frequency threshold).

In this setting, we apply the results given in the seminal paper [11] for frequent sets to the problem of approximating a set of approximate inclusion dependencies with k inclusion dependencies. Using the fact that inclusion dependencies are “representable as sets”, we point out how the approximation schemes defined in [11] for frequent patterns also apply in our context. A heuristic solution is also proposed for this particular problem. Even if the quality of this approximation with respect to the best solution cannot be precisely defined, an interaction property between INDs and FDs may be used to justify this heuristic.

Some interesting perspectives of this work are pointed out from the results obtained so far.

Fabien De Marchi, Jean-Marc Petit

Invited Session: Recent Developments in Bioinformatics

Frontmatter
Informatic Approaches to Molecular Translation

Recent results on the evolution of the genetic code are informally put in the context of Von Neumann’s theory of self-reproducing automata and the classical treatment of error-correcting codes in information theory. I discuss the sufficiency of genetic descriptions for self-reproduction. This implies a duality, a division of the information for maintenance and reproduction of organisms into two components, which is fundamental to life as we know it. The two components are programmatic and operational in nature, and they require the transmission and transduction of information to cooperate. In this context, I review recent results on the coevolution of genes and genetic codes. These results show how desirable informatic properties can evolve in systems of tapes and tape-players that are mutually co-dependent for their reproductive success.

David H. Ardell
An Approach to Mining Data with Continuous Decision Values

We propose a novel approach to discovering useful patterns from ill-defined decision tables with a real-valued decision and nominal conditional attributes. The proposed solution is based on a two-layered learning algorithm. In the first layer the preference relation between objects is approximated from the data. In the second layer the approximated preference relation is used to create three applications: (1) to learn a ranking order on a collection of combinations, (2) to predict the real decision value, and (3) to optimize the process of searching for the combination with the maximal decision.

Hung Son Nguyen, Marta Łuksza, Ewa Mąkosa, Henryk Jan Komorowski
Soft Computing Approach to the Analysis of the Amino Acid Similarity Matrices

Amino acid similarity matrices are used for protein sequence comparison. A new method for studying connections between amino acid properties and similarity matrices is proposed. A representation of the amino acid similarity matrices by means of equivalence classes between amino acids is introduced. It is shown that mappings between these equivalence classes and contiguous subsets of the amino acid property space are possible, and that these mappings are consistent between equivalence classes. This method has possible practical applications in sequence similarity searches.

Witold R. Rudnicki, Jan Komorowski

Special Session: Artificial Immune Systems

Frontmatter
Computing with Idiotypic Networks

The paper presents a computer experiment inspired by the immune metaphor and based on the work of Farmer, Packard, and Perelson [FPP86]. We develop a model, influenced by the way the immune system works, that is well-suited to a particular class of NP-hard problems. We discuss the results obtained when applying the model to an artificial vision problem called the museum problem, where artificial agents successfully accomplish a surveillance assignment to protect the pieces of an art exhibition from badly behaved visitors.

Francisco Martins, Neva Slani
Metadata
Title
Intelligent Information Processing and Web Mining
Edited by
Prof. Dr. Mieczysław A. Kłopotek
Prof. Dr. Sławomir T. Wierzchoń
Dr. Krzysztof Trojanowski
Copyright Year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-32392-1
Print ISBN
978-3-540-25056-2
DOI
https://doi.org/10.1007/3-540-32392-9