2016 | Book

Machine Intelligence and Big Data in Industry

Editors: Dominik Ryżko, Piotr Gawrysiak, Marzena Kryszkiewicz, Henryk Rybiński

Publisher: Springer International Publishing

Book Series: Studies in Big Data

About this book

This book presents valuable contributions devoted to practical applications of Machine Intelligence and Big Data in various branches of industry. All the contributions are extended versions of presentations delivered at the Industrial Session of the 6th International Conference on Pattern Recognition and Machine Intelligence (PREMI 2015), held in Warsaw, Poland, from June 30 to July 3, 2015, and all of them passed through a rigorous reviewing process. The contributions address real-world problems and show innovative solutions used to solve them. This volume will serve as a bridge between researchers and practitioners, as well as between different industry branches, which can benefit from sharing ideas and results.

Table of Contents

Frontmatter

Text Processing

Frontmatter
Automatic Sentiment Analysis in Polish Language
Abstract
We introduce a fully automated process for sentiment analysis of short texts in Polish. The process consists of (a) generating an emotion lexicon from annotated Twitter messages, (b) building a sentiment data set from the annotated messages and the generated lexicon, (c) training the NEAT genetic algorithm on the prepared data set, and (d) a final evaluation using 10-fold cross-validation. We show that this method provides good results and can be used to simplify sentiment analysis of Polish-language content.
Antoni Sobkowicz
Learning Curve with Machine Translation Based on Parallel, Bilingual Corpora
Abstract
Machine Translation is a branch of computer science that automatically handles translation of a text from a source language to a target language. This article summarizes the experience gained during the UKSW project, part of which deals with the translation of legal phrases between English and Polish. The article describes the consecutive steps of the project: collecting data and creating parallel, bilingual corpora, evaluating ready-made open source solutions, and proposing a novel, effective SMT solution. The final chapter summarizes the solution, together with results based on the BLEU metric.
Maciej Kowalski
N-Gram Collection from a Large-Scale Corpus of Polish Internet
Abstract
The paper is devoted to the processing of a multi-terabyte web archive. The aim of this work is to create an N-gram collection based on a large-scale corpus of all Polish sites on the Internet provided by The Common Crawl Foundation project [1]. After lexical processing, the data is used to extract flat N-gram compilations. Such compilations have many successful applications in machine learning within natural language processing.
Szymon Roziewski, Wojciech Stokowiec, Antoni Sobkowicz
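As a rough illustration of what extracting flat N-grams from lexically processed text can look like, here is a minimal sketch. It is not the authors' pipeline: the regex tokenizer, the function names `tokenize` and `count_ngrams`, and the sample sentences are all assumptions made for the example, and a multi-terabyte corpus would need a distributed implementation rather than an in-memory counter.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed lexical processing: lowercase and split on word characters.
    # Python's \w is Unicode-aware, so Polish diacritics are preserved.
    return re.findall(r"\w+", text.lower())


def count_ngrams(texts, n):
    # Count every contiguous run of n tokens across all input texts.
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample sentences, not taken from the corpus.
bigrams = count_ngrams(["Ala ma kota", "Ala ma psa"], n=2)
for ngram, frequency in bigrams.most_common():
    print(" ".join(ngram), frequency)
```

The result is a flat mapping from N-gram to frequency, which is the form typically fed into language models or feature extractors.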
Study Fields Clustering Using KRK Competences
Abstract
The paper addresses the clustering of study fields using information extracted from semi-structured documents, namely documents describing the KRK competences of each study field. KRK competences are specialized descriptions of the qualifications that students gain after graduating from a given study field. The proposed method enables extracting and processing KRK competences from diverse types of semi-structured documents. It consists of two stages: (1) entity extraction from documents (building vectors of KRK competences for each study field), and (2) study field clustering using those competence representations. Polish KRK competence files, describing almost 3000 study fields in Poland, were used as a corpus. The method and its stages are thoroughly analyzed. The results make it possible to compare and identify similar study fields according to their final educational outcomes.
Marek Kozlowski
Semantic Textual Similarity Using Various Approaches
Abstract
The paper is devoted to the semantic textual similarity (STS) problem. Given two sentences of text, s1 and s2, the systems participating in this task should compute how similar s1 and s2 are, returning a similarity score. We present our experience in this topic, ranging from knowledge-poor approaches to some compact and easily applied knowledge-rich methods (using structured knowledge base frameworks such as WordNet, Wikipedia or BabelNet). The proposed methods were evaluated using the datasets from the SemEval-2014/15 tasks.
Maciej Kazuła, Marek Kozłowski

Data Mining

Frontmatter
Identification of Diabetes Disease Using Committees of Neural Network-Based Classifiers
Abstract
Diabetes mellitus is one of the most serious health challenges in both developing and developed countries. In this paper, we present a design of a classifier committee for the detection of diabetes based on the Pima Indian diabetic database from the UCI machine learning repository. The proposed method uses multi-layer perceptron (MLP) and cascade-forward back propagation network (CFBN) predictors as base classifiers. The combined committee is based on varying the parameters related to both the design and the training of the neural network classifiers. Our experimental evaluation confirms that the derived approach provides a robust classification system, and yields classification accuracies of 95.31 % and 96.88 % using the combined MLP and combined CFBN classifiers respectively. The experimental results thus show that the proposed classifier committee can form a useful basis for automatic diagnosis of diabetes.
Ali Hassan El-Baz, Aboul Ella Hassanien, Gerald Schaefer
Enzyme Function Classification Based on Borda Count Ranking Aggregation Method
Abstract
Prediction of enzyme functions is an important research topic due to their role in chemical reactions. In this paper, we propose a model for enzyme function classification that combines the outputs of different pairwise sequence alignments based on local sequence alignment. The output of each pairwise sequence alignment is represented by a ranked list, while the main idea of the proposed model is to combine all ranked lists into one ranked list. The highest-ranked candidate is then assigned as the function of the unknown sequence. Unbalanced and balanced datasets are used for evaluation, and the obtained results show that our approach yields good performance and that ranking aggregation achieves better results than any single sequence alignment.
Mahir M. Sharif, Alaa Tharwat, Aboul Ella Hassanien, Hesham A. Hefny, Gerald Schaefer
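The ranking aggregation idea in this abstract (merge several ranked lists into one and take the top candidate) can be sketched with a textbook Borda count. This is an illustrative sketch only, not the chapter's model: the scoring variant (a list of length n awards n points to its first entry down to 1 for its last), the alphabetical tie-break, and the enzyme class labels are all assumptions.

```python
from collections import defaultdict


def borda_aggregate(ranked_lists):
    # Each ranked list awards len(list) points to its top candidate,
    # one point fewer to each following candidate, and nothing to
    # candidates it does not contain.
    scores = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - position
    # Sort by descending score; ties are broken alphabetically (assumption).
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical ranked lists, one per pairwise alignment method.
rankings = [
    ["EC1", "EC2", "EC3"],
    ["EC2", "EC1", "EC3"],
    ["EC2", "EC3", "EC1"],
]
combined = borda_aggregate(rankings)
print(combined[0])  # the candidate assigned to the unknown sequence
```

With these sample lists EC2 collects 8 points against 6 for EC1 and 4 for EC3, so it is the class assigned to the unknown sequence.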
Mining of Frequent Action Rules
Abstract
An action rule is constructed as a series of changes, or actions, which can be made to some of the flexible characteristics of the information system and which ultimately trigger a change in the targeted attribute. The existing action rule discovery methods consider the input decision system as their search domain and are limited to expensive and ambiguous strategies. In this paper, we define the notion of an action base as the search domain for actions, and then propose a strategy based on the FP-Growth algorithm to achieve high performance in action rule extraction. The method was initially tested on a real medical diabetes database. The obtained results are quite promising.
Agnieszka Dardzinska, Anna Romaniuk

Text and Multimedia Processing

Frontmatter
Automatic Translation of Multi-word Labels
Abstract
Application of semantic resources often requires linking phrases expressed in a natural language to formally defined notions. In the case of ontologies, lexical layers may be used for that purpose. In the paper we propose an automatic machine translation method for translating multi-word labels from the lexical layers of domain ontologies. The method takes advantage of Wikipedia and dictionary services available on the Internet in order to provide translations of thematic texts from a given area of interest. Experimental evaluation shows the usefulness of the proposed method in translating specialized thematic dictionaries.
Grzegorz Protaziuk, Marcin Kaczyński, Robert Bembenik
VTLN Using Different Warping Functions for Template Matching
Abstract
In most automatic speech recognition (ASR) systems, speaker differences are compensated by normalizing the vocal tract lengths of the speakers. This is implemented by warping the frequency axis by an appropriate warping factor. However, it is computationally expensive to find a warping factor for each speaker. This problem is overcome by incorporating a universal warping function for all speakers. Different psychoacoustic scales have been proposed over the past decade that are assumed to be similar to the frequency response of the basilar membrane (BM) of the human auditory system. In this paper, different warping functions are studied with the aim of vocal tract length normalization (VTLN), and template matching experiments are carried out using the dynamic time warping (DTW) algorithm to test the performance of the various warping functions. It was observed that features obtained by warping the frequency axis by psychoacoustic scales improve the classification performance. In particular, Equivalent Rectangular Bandwidth (ERB)-scale based warping improves the precision by 7.17 % over state-of-the-art mel frequency cepstral coefficients (MFCC) for template matching on isolated digits of the TIDIGITS database and by 6.16 % on words from the TIMIT database.
Maulik C. Madhavi, Shubham Sharma, Hemant A. Patil
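The template matching step named in this abstract relies on dynamic time warping. The sketch below shows the classic DTW recurrence and a nearest-template classifier under simplifying assumptions: sequences are lists of scalars with absolute difference as the local cost, whereas a speech system would compare frames of cepstral feature vectors. The function names, templates, and query are hypothetical.

```python
import math


def dtw_distance(a, b):
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the best of: insertion, deletion, or match.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def classify(query, templates):
    # Return the label of the template with the smallest DTW distance.
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))


# Hypothetical one-dimensional templates for two spoken words.
templates = {"one": [1, 2, 3], "two": [3, 2, 1]}
print(classify([1, 2, 2, 3], templates))
```

Because DTW allows one element to align with several, the query `[1, 2, 2, 3]` matches the template `[1, 2, 3]` at zero cost despite the different lengths, which is what makes the method tolerant of speaking-rate variation.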
A Comparative Study on Music Genre Classification Algorithms
Abstract
Music Genre Classification is one of the fundamental tasks in the field of Music Information Retrieval (MIR). In this paper, the performance of various music genre classification algorithms, including Random Forests, Multi-class Support Vector Machines and Deep Belief Networks, is compared. The study is based on the “Million Song Dataset”, a freely available collection of audio features and metadata. The emphasis is put not only on classification accuracy but also on the robustness and scalability of the algorithms.
Wojciech Stokowiec

Software

Frontmatter
Information Selection and Data Compression RapidMiner Library
Abstract
We present an Information Selection and Data Compression RapidMiner Library, which contains several known instance selection algorithms and several algorithms developed by us for classification and regression tasks. We present the motivation for creating the library and the need for developing new instance selection algorithms or extending the existing ones. We discuss how the library works and how to use it.
Marcin Blachnik, Mirosław Kordos
Automatic Clustering Methods of Offers in an E-Commerce Marketplace
Abstract
This work describes fully automatic clustering methods of offers in an e-commerce marketplace. Three different grouping approaches are proposed. We also designed and applied quality measures of clustering based on user-generated events. We assessed the proposed methods of clustering and compared them.
Anna Wroblewska, Bartlomiej Twardowski, Pawel Zawistowski, Dominik Ryżko
Application of Machine Learning Algorithms for Bitcoin Automated Trading
Abstract
The aim of this paper is to compare and analyze different approaches to the problem of automated trading on the Bitcoin market. We compare a simple technical analysis method with more complex machine learning models. Experimental results showed that the performance of the tested algorithms is promising, that the Bitcoin market is still in its youth, and that further market opportunities can be found. To the best of our knowledge, this is the first work that investigates applying machine learning methods to create trading strategies on the Bitcoin market.
Kamil Żbikowski

Complex Systems, Internet of Things and Agent Systems

Frontmatter
Maximal Discernibility Discretization of Attributes—A FPGA Approach
Abstract
In this paper we propose a design for a hardware module that generates cuts on an FPGA. Calculations are supported by a softcore CPU. The presented architecture has been simulated and tested in a VHDL IDE on real data. The implemented algorithm uses the Maximal Discernibility (MD) approach. Results show a substantial acceleration of computation time when discretization is supported by hardware, compared with a pure software implementation.
Maciej Kopczynski, Tomasz Grzes, Jaroslaw Stepaniuk
Big Data Solutions for Smart Grids and Smart Meters
Abstract
The article describes the architecture of Big Data systems and clarifies what Big Data is. It also presents basic problems related to the management of Big Data in Smart Grid infrastructure and smart meters. Advanced Metering Infrastructure is described using the example of the current implementation status, plans and perspectives of pilot projects already carried out by PGE Dystrybucja SA in two locations: Łódź City and Augustów City. The current state of data generation in the Polish grid is analyzed and a realistic future scenario is illustrated.
Joanna Konopko
Intelligent System of Limited Resource Allocation for Large-Scale Agent Systems
Abstract
This paper describes an intelligent decision support system for semi-autonomous agents. The solution is developed for a large-scale network of double-interface mobile routers. The network is part of an advanced metering infrastructure deployed in the north of Poland. The limited capacity of cellular networks makes system stability and performance dependent on smart management of radio resource utilization. A model of system phenomena was based on observations of a real system consisting of more than 10,000 devices. A network simulator was developed as a tool for testing solutions outside of the real system. Computational Intelligence methods were found suitable for a system of this scale. The paper introduces an iterative method based on a cellular neural network model for determining the optimal resource allocation for the system.
Jakub Weclawski, Stanislaw Jankowski
Searching for Logical Patterns in Multi-sensor Data from the Industrial Internet
Abstract
Engineers analysing large volumes of multi-sensor data from vehicles, engines etc. often seek to search for events such as “hard-stops”, “lane passing” or “engine overload”. Apart from such visual analysis for engineering purposes, manufacturers also need to count occurrences of such events via on-board monitoring sensors that ideally rely on classifiers; searching for patterns in available data is also useful for preparing training sets in this context. In this paper, we propose a method for searching for multi-sensor patterns in large volumes of sensor data using qualitative symbols (QSIM (Say, Functions representable in pure QSIM, 251–255, 1996, [1])) such as “steady”, “increasing”, “decreasing”. Patterns can include symbol sequences for multiple sensors, as well as approximate duration, level or slope values. Logical symbols are extracted from multi-sensor time series and registered in a trie-based index structure. We demonstrate the effectiveness of our retrieval and ranking technique on real-life vehicular sensor data in the visual analytics as well as the classifier training and detection scenarios.
Mohit Yadav, Ehtesham Hassan, Gautam Shroff, Puneet Agarwal, Ashwin Srinivasan
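To make the notion of qualitative symbols in this abstract concrete, the sketch below converts a single sensor's readings into runs of “steady”, “increasing” and “decreasing”. It is only a guess at the flavour of the first step, not the authors' method: the fixed `tolerance` threshold, the function name `symbolize`, and the sample readings are assumptions, and the trie-based index and ranking described in the abstract are not shown.

```python
from itertools import groupby


def symbolize(values, tolerance=0.5):
    # Label each consecutive pair of readings by the sign of its change;
    # changes within +/- tolerance (an assumed threshold) count as steady.
    symbols = []
    for previous, current in zip(values, values[1:]):
        delta = current - previous
        if delta > tolerance:
            symbols.append("increasing")
        elif delta < -tolerance:
            symbols.append("decreasing")
        else:
            symbols.append("steady")
    # Collapse repeated symbols into (symbol, duration-in-steps) runs.
    return [(symbol, len(list(run))) for symbol, run in groupby(symbols)]


# Hypothetical speed readings from one sensor.
print(symbolize([0, 0, 0, 5, 10, 10, 4]))
```

A query such as “steady, then increasing, then decreasing” can then be matched against these run sequences, with the run lengths standing in for approximate durations.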
Backmatter
Metadata
Title
Machine Intelligence and Big Data in Industry
Editors
Dominik Ryżko
Piotr Gawrysiak
Marzena Kryszkiewicz
Henryk Rybiński
Copyright Year
2016
Electronic ISBN
978-3-319-30315-4
Print ISBN
978-3-319-30314-7
DOI
https://doi.org/10.1007/978-3-319-30315-4
