
2012 | Book

Intelligent Tools for Building a Scientific Information Platform

edited by: Robert Bembenik, Lukasz Skonieczny, Henryk Rybiński, Marek Niezgodka

Publisher: Springer Berlin Heidelberg

Book series: Studies in Computational Intelligence


About this book

This book is a selection of results obtained within one year of research performed under SYNAT, a nation-wide scientific project aiming to create an infrastructure for scientific content storage and sharing for academia, education, and an open knowledge society in Poland. The selection covers research in artificial intelligence, knowledge discovery and data mining, information retrieval, and natural language processing, addressing the problems of implementing intelligent tools for building a scientific information platform.

The idea of this book is based on the very successful SYNAT Project Conference and the SYNAT Workshop accompanying the 19th International Symposium on Methodologies for Intelligent Systems (ISMIS 2011). The papers included in this book present an overview of and insight into such topics as the architecture of scientific information platforms, semantic clustering, ontology-based systems, as well as multimedia data processing.

Table of Contents

Frontmatter
1. Semantic Search and Analytics over Large Repository of Scientific Articles
Abstract
We present the architecture of a system aimed at search and synthesis of information within document repositories originating from different sources, with documents provided not necessarily in the same format or at the same level of detail. The system is expected to provide domain knowledge interfaces enabling the internally implemented algorithms to identify relationships between documents (as well as authors, institutions, et cetera) and concepts (such as areas of science) extracted from various types of knowledge bases. The system should be scalable in terms of scientific content storage, the performance of analytic processes, and search speed. In the case of compound computational tasks (such as the production of richer semantic indexes for search improvements), it should follow the paradigms of hierarchical modeling and computing, designed as an interaction between domain experts, system experts, and appropriately implemented intelligent modules.
Hung Son Nguyen, Dominik Ślęzak, Andrzej Skowron, Jan G. Bazan
2. On Designing the SONCA System
Abstract
The SYNAT project aims to develop a universal, open hosting and communication platform for networked knowledge resources for science, education, and an open information society. Stage B13 of this project aims to develop methods and algorithms for semantic indexing, classification and retrieval using dictionaries, thesauri and ontologies, as well as methods of processing and visualizing results. The methods and algorithms aim to support dialogue with the repositories of text and multimedia resources gathered on servers. To realize the objectives of stages B13 and B14 of the SYNAT project, we plan to develop and implement a system called SONCA (Search based on ONtologies and Compound Analytics).
We present ideas and proposals for the SONCA system. The main idea is to allow a combination of metadata-based search, syntactic keyword-based search and semantic search, and to use ranks of objects. Semantic search criteria may be keywords, concepts, or objects (for checking similarity). Search criteria based on metadata play the role of exact restrictions, while syntactic keywords and semantic search criteria are fuzzy restrictions. To enable metadata-based search, an appropriate document representation is used. To enable syntactic keyword-based search, each document (object) is stored together with information about the terms occurring in its text attributes. The terms are normalized and only the important ones are stored. To enable semantic search, the representation of each document (object) is extended further with the most important concepts that characterize the document. Such concepts belong to the main ontology of SONCA.
We provide an abstract model for the SONCA system, an instantiation of that model, some ideas for the user interface of SONCA, as well as proposals for increasing the efficiency of the query-answering process.
Linh Anh Nguyen, Hung Son Nguyen
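To make the three-layer representation from the abstract above concrete (exact metadata restrictions, weighted terms, weighted ontology concepts), here is a minimal sketch. The class layout and the scoring scheme are illustrative assumptions, not the SONCA implementation.

```python
# Illustrative sketch only (not the SONCA implementation): a minimal document
# representation combining the three search modes described in the abstract.
# All names and the scoring scheme are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Document:
    metadata: dict                                # exact restrictions, e.g. {"year": 2011}
    terms: dict = field(default_factory=dict)     # normalized term -> weight
    concepts: dict = field(default_factory=dict)  # ontology concept -> weight

def score(doc, meta_filter, query_terms, query_concepts):
    """Metadata acts as a hard filter; terms and concepts contribute
    fuzzy (weighted) relevance, as outlined in the abstract."""
    if any(doc.metadata.get(k) != v for k, v in meta_filter.items()):
        return 0.0                                # exact restriction failed
    syntactic = sum(doc.terms.get(t, 0.0) for t in query_terms)
    semantic = sum(doc.concepts.get(c, 0.0) for c in query_concepts)
    return syntactic + semantic

doc = Document(metadata={"year": 2011},
               terms={"ontology": 0.8, "search": 0.5},
               concepts={"Information_retrieval": 0.9})
print(score(doc, {"year": 2011}, ["ontology"], ["Information_retrieval"]))
```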
3. Retrieval and Management of Scientific Information from Heterogeneous Sources
Abstract
The paper describes the process of automated retrieval and management of scientific information from various sources, including the Internet. The application of semantic methods in different phases of the process is described. The system envisaged in the project is a scientific digital library with automated retrieval and hosting capabilities. An overall architecture for the system is proposed.
Piotr Gawrysiak, Dominik Ryżko, Przemysław Więch, Marek Kozłowski
4. RDBMS Model for Scientific Articles Analytics
Abstract
We present a relational database schema aimed at efficient storage and querying of parsed scientific articles, as well as of entities corresponding to researchers, institutions, scientific areas, et cetera. An important requirement for the proposed model is that it operate with various types of entities without increasing the schema's complexity. Another aspect is to store detailed information about parsed articles in order to conduct advanced analytics, in combination with domain knowledge about scientific topics, by means of standard SQL and RDBMS management. The overall goal is to enable offline, possibly incremental computation of semantic indexes supporting end users via other modules, optimized for fast search and not necessarily for fast analytics, as well as direct ad-hoc SQL access by the most advanced users.
Marcin Kowalski, Dominik Ślęzak, Krzysztof Stencel, Przemysław Pardel, Marek Grzegorowski, Michał Kijowski
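The requirement of "various types of entities without increasing the schema's complexity" suggests a generic-entity design. The sketch below, with entirely hypothetical table names and SQLite standing in for a full RDBMS, shows one minimal way such a schema can serve ad-hoc SQL analytics; it is not the authors' actual model.

```python
# A minimal sketch (not the authors' schema): one generic entity table plus a
# property table, so articles, researchers and institutions can be added
# without new tables. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
CREATE TABLE property (
    entity_id INTEGER REFERENCES entity(id),
    name TEXT NOT NULL,
    value TEXT
);
""")
con.execute("INSERT INTO entity (id, kind) VALUES (1, 'article')")
con.execute("INSERT INTO entity (id, kind) VALUES (2, 'researcher')")
con.executemany("INSERT INTO property VALUES (?, ?, ?)",
                [(1, "title", "Rough Sets in KDD"),
                 (1, "year", "2011"),
                 (2, "name", "J. Kowalski")])
# Ad-hoc analytics stay plain SQL, e.g. counting articles per year:
for row in con.execute("""SELECT value, COUNT(*) FROM property
                          WHERE name = 'year' GROUP BY value"""):
    print(row)
```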
5. Semantic Clustering of Scientific Articles with Use of DBpedia Knowledge Base
Abstract
A case study of semantic clustering of scientific articles related to Rough Sets is presented. The proposed method groups the documents on the basis of their content, with the assistance of the DBpedia knowledge base. The text corpus is first treated with natural language processing tools in order to produce vector representations of the content, and then matched against a collection of concepts retrieved from DBpedia. As a result, a new representation is constructed that better reflects the semantics of the texts. With this new representation, the documents are hierarchically clustered to form a partition of papers that share semantic relatedness. The steps of textual data preparation, utilization of DBpedia, and clustering are explained and illustrated with experimental results. An assessment of clustering quality by human experts and by comparison to a traditional approach is presented.
Marcin Szczuka, Andrzej Janusz, Kamil Herba
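The pipeline in the abstract above (NLP-produced vectors, matching against DBpedia concepts, hierarchical clustering) can be compressed into a few lines. The sketch below uses scikit-learn and SciPy as stand-ins for whatever stack the authors used, with toy texts in place of real articles and DBpedia entries.

```python
# A condensed sketch of the described pipeline, under assumptions: short
# concept descriptions stand in for DBpedia entries, and scikit-learn/SciPy
# replace the authors' NLP and clustering tools.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import linkage, fcluster

papers = ["rough sets and attribute reduction", "fuzzy rough hybrid models",
          "neural networks for image data"]
concepts = ["rough set theory", "fuzzy logic", "artificial neural network"]

vec = TfidfVectorizer().fit(papers + concepts)
P, C = vec.transform(papers), vec.transform(concepts)

# New representation: each paper described by its similarity to each concept.
S = cosine_similarity(P, C)

# Hierarchical (agglomerative) clustering of the concept-based representation.
Z = linkage(S, method="average")
print(fcluster(Z, t=2, criterion="maxclust"))
```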
6. Extended Document Representation for Search Result Clustering
Abstract
Organizing query results into clusters facilitates quick navigation through search results and helps users to specify their search intentions. Most meta-search engines group documents based on short fragments of source text called snippets. Such a model of data representation in many cases proves insufficient to reflect the semantic correlation between documents. In this paper, we discuss a framework for document description extension which utilizes domain knowledge and semantic similarity. Our idea is based on the application of the Tolerance Rough Set Model, semantic information extracted from the source text, and a domain ontology to approximate the concepts associated with documents and to enrich the vector representation.
S. Hoa Nguyen, Wojciech Świeboda, Grzegorz Jaśkiewicz
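The Tolerance Rough Set Model mentioned above rests on a simple mechanism: terms that co-occur frequently enough are treated as mutually tolerant, and a document's representation is extended with the tolerant neighbors of its terms. A toy version of that enrichment step, with an arbitrary threshold and invented snippets, might look as follows.

```python
# Minimal sketch of the enrichment idea (not the authors' code): terms that
# co-occur in at least `theta` snippets form a tolerance relation, and each
# snippet's term set is extended with tolerant terms. Threshold is arbitrary.
from itertools import combinations
from collections import Counter

snippets = [{"rough", "sets", "approximation"},
            {"rough", "approximation", "classifier"},
            {"rough", "classifier", "neural"}]
theta = 2

cooc = Counter()
for s in snippets:
    for a, b in combinations(sorted(s), 2):
        cooc[(a, b)] += 1

def tolerant(term):
    """All terms co-occurring with `term` in at least theta snippets."""
    return {b if a == term else a
            for (a, b), n in cooc.items() if term in (a, b) and n >= theta}

enriched = [s | set().union(*(tolerant(t) for t in s)) for s in snippets]
print(enriched[0])  # gains "classifier" via its tolerance with "rough"
```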
7. Semantic Recognition of Digital Documents
Abstract
The paper presents methods developed by the Methods of Semantic Recognition of Scientific Documents group within the scope of the SYNAT project. It describes a document representation format, together with a proof-of-concept system converting scientific articles in PDF format into this representation. Another topic presented in the article is an experiment with clustering documents by style.
Paweł Betliński, Paweł Gora, Kamil Herba, Trung Tuan Nguyen, Sebastian Stawicki
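The first step of any such conversion is recovering text and rudimentary structure from the PDF. The sketch below is a toy stand-in for that step, using the pypdf library (an assumption; the paper does not name its tools) and an invented dict layout in place of the paper's representation format.

```python
# A toy stand-in for the conversion step (the authors' proof-of-concept is far
# more involved): extract text per page with pypdf and wrap it in a simple
# structured record. The dict layout and title heuristic are hypothetical.
from pypdf import PdfReader

def pdf_to_document(path):
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return {
        "n_pages": len(pages),
        "pages": pages,
        # crude stand-in for structure recovery:
        "title_guess": pages[0].splitlines()[0] if pages and pages[0] else None,
    }

doc = pdf_to_document("article.pdf")  # assumes a local PDF file
print(doc["title_guess"], doc["n_pages"])
```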
8. Towards Semantic Evaluation of Information Retrieval
Abstract
The paper discusses the fundamentals of semantic evaluation of information retrieval systems. Semantic evaluation is understood in two ways. Semantic evaluation sensu stricto consists of automatic global methods of information retrieval evaluation which are based on knowledge representation systems. Semantic evaluation sensu largo also includes the evaluation of retrieved results presented using new methods, comparing them to previously used methods that evaluated unordered sets of documents or ranked lists of documents. Semantic information retrieval methods can be treated as storing the meaning of words, which are the basic building blocks of retrieved texts. In the paper, ontologies are taken as the systems which represent knowledge and meaning. Ontologies serve as a basis for semantic modeling of information needs, which are modeled as families of concepts. Semantic modeling also depends on algorithmic methods of assigning concepts to documents. Some algebraic and partially-ordered-set methods in semantic modeling are proposed, leading to different types of semantic modeling. Then the semantic value of a document is discussed; it is relativized to a family of concepts and depends essentially on the ontology used. The paper focuses on the semantic relevance of documents, both binary and graded, together with semantic ranking of documents. Various types of semantic value and semantic relevance are proposed, and some semantic versions of information retrieval evaluation measures are also given.
Piotr Wasilewski
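One concrete way to read "semantic value relativized to a family of concepts" is sketched below. The definitions are an illustrative guess in the spirit of the abstract, not the paper's actual measures.

```latex
% Illustrative only: a simple relativization of relevance to a concept
% family, in the spirit of the abstract; not the paper's exact definitions.
% Let the information need be modeled by a concept family
% C = \{c_1, \ldots, c_k\}, and let concepts(d) be the concepts assigned
% to document d by the ontology.
\[
  \mathrm{val}_C(d) \;=\; \frac{|\,\mathrm{concepts}(d) \cap C\,|}{|C|},
  \qquad
  \mathrm{Prec}_C(R) \;=\; \frac{1}{|R|} \sum_{d \in R} \mathrm{val}_C(d),
\]
% where R is the set of retrieved documents. Binary semantic relevance is
% recovered by thresholding, e.g. val_C(d) > 0, while the graded value
% val_C(d) itself induces a semantic ranking of documents.
```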
9. Methods and Tools for Ontology Building, Learning and Integration – Application in the SYNAT Project
Abstract
One of the main goals of the SYNAT project is to equip the scientific community with a knowledge-based infrastructure providing fast access to relevant scientific information. We have started building an experimental platform where different kinds of stored knowledge will be modeled with the use of ontologies, e.g. a reference/system ontology, domain ontologies, and auxiliary knowledge including lexical language ontology layers. In our platform we use a system ontology defining the “system domain” (a kind of meta-knowledge) for the scientific community, covering concepts and activities related to scientific life, as well as domain ontologies dedicated to specific areas of science. Moreover, the platform is supposed to include a wide range of tools for building and maintaining ontologies throughout their life cycle, as well as for interoperation among the introduced ontologies.
The paper makes a contribution to understanding semantically modeled knowledge and its incorporation into the SYNAT project. We present a review of ontology building, learning, and integration methods and their potential application in the project.
Anna Wróblewska, Teresa Podsiadły-Marczykowska, Robert Bembenik, Grzegorz Protaziuk, Henryk Rybiński
10. Transforming a Flat Metadata Schema to a Semantic Web Ontology: The Polish Digital Libraries Federation and CIDOC CRM Case Study
Abstract
This paper describes the transformation of the metadata schema used by the Polish Digital Libraries Federation to the CIDOC CRM model implemented in OWL as Erlangen CRM. The need to transform the data from a flat schema to a full-fledged ontology arose during preliminary work in the Polish research project SYNAT. The Digital Libraries Federation site offers aggregated metadata of more than 600,000 publications that constitute the first portion of data introduced into the Integrated Knowledge System (IKS) - one of the services developed in the SYNAT project. To be able to perform the desired functions, IKS needs heavily linked data that can be searched semantically. The issue is not only one of mapping one schema element to another, as the conceptualization of CIDOC CRM is significantly different from that of the DLF schema. In the paper we identify a number of problems that are common to all such transformations and propose solutions. Finally, we present statistics concerning the mapping process and the resulting knowledge base.
Cezary Mazurek, Krzysztof Sielski, Maciej Stroiński, Justyna Walkowska, Marcin Werla, Jan Węglarz
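The mismatch between a flat schema and CIDOC CRM's event-centric conceptualization is easiest to see on a tiny example. The sketch below maps a Dublin-Core-like record to CRM-style triples; the record and the routing of fields are invented for illustration, although the CRM class and property identifiers used (E33, E65, P14, ...) are real CRM names, here abbreviated rather than given as full IRIs.

```python
# Hypothetical illustration of the flat-to-ontology problem: a flat record
# becomes event-centred triples in the spirit of CIDOC CRM, where creation
# is modeled as an event linking the work, its author and its date.
record = {"title": "Pan Tadeusz", "creator": "Adam Mickiewicz", "date": "1834"}

def to_triples(rec, doc_id="doc:1"):
    creation = doc_id + "/creation"          # CRM models creation as an event
    return [
        (doc_id, "rdf:type", "crm:E33_Linguistic_Object"),
        (doc_id, "crm:P102_has_title", rec["title"]),
        (doc_id, "crm:P94i_was_created_by", creation),
        (creation, "rdf:type", "crm:E65_Creation"),
        (creation, "crm:P14_carried_out_by", rec["creator"]),
        (creation, "crm:P4_has_time-span", rec["date"]),
    ]

for t in to_triples(record):
    print(t)
```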
11. Modularized Knowledge Bases Using Contexts, Conglomerates and a Query Language
Abstract
The paper presents a novel approach to the design and development of a modularized knowledge base. It is assumed that the knowledge base consists of a terminology (axioms) and a world description (assertions), both formulated in a Description Logics (DL) dialect. The approach is oriented towards decomposition of a knowledge base into logical components called contexts and further into semantic components called conglomerates. Both notions were elaborated separately elsewhere. The paper shows how the notions of contexts and conglomerates can work in harmony to create a maintainable knowledge base. An architecture of a system that conforms to this approach, and which additionally uses a query language called KQL (Knowledge Query Language), is presented. The approach is intended to be used to build a prototypical system that aims at integrating knowledge on cultural heritage coming from digital libraries, including user-defined libraries. A thorough discussion of related work is also given.
Krzysztof Goczyła, Aleksander Waloszek, Wojciech Waloszek, Teresa Zawadzka
12. FPGA Implementation of the Selected Parts of the Fast Image Segmentation
Abstract
This paper presents preliminary implementation results for the SVM (Support Vector Machine) algorithm. SVM is a mathematical model that allows selected objects to be extracted from a picture and assigned to an appropriate class. Consequently, a black-and-white image reflecting the occurrence of the desired feature is derived from the original picture fed into the classifier. This work is primarily focused on the FPGA implementation aspects of the algorithm, as well as on a comparison of hardware and software performance. A human skin classifier was used as an example and implemented both on an AMD Athlon II P320 Dual-Core 2.10 GHz CPU and on a Xilinx Spartan-6 FPGA. It is worth emphasizing that the critical hardware components were designed using HDL (Hardware Description Language), whereas the less demanding or standard ones, such as communication interfaces, FIFOs and FSMs, were implemented in HLL (High Level Language). Such an approach allowed us both to cut design time and to preserve the high performance of the hardware classification module.
Maciej Wielgosz, Ernest Jamro, Dominik Żurek, Kazimierz Wiatr
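For readers unfamiliar with the software side of such a classifier: an SVM decision function applied independently to every pixel yields exactly the black-and-white feature mask the abstract describes. The sketch below uses an RBF kernel with placeholder support vectors rather than a trained skin model, and says nothing about the FPGA port.

```python
# Software-side sketch only (the paper's focus is the FPGA port): an RBF-SVM
# decision function applied per pixel to yield a binary mask. Support
# vectors and parameters are placeholders, not a trained skin model.
import numpy as np

sv = np.array([[0.8, 0.5, 0.4], [0.2, 0.3, 0.6]])   # support vectors (RGB)
alpha_y = np.array([1.0, -1.0])                     # alpha_i * y_i
b, gamma = 0.0, 5.0

def classify(image):
    """image: H x W x 3 array in [0,1] -> H x W boolean mask."""
    px = image.reshape(-1, 3)
    # RBF kernel between every pixel and every support vector
    d2 = ((px[:, None, :] - sv[None, :, :]) ** 2).sum(-1)
    scores = (alpha_y * np.exp(-gamma * d2)).sum(-1) + b
    return (scores > 0).reshape(image.shape[:2])

mask = classify(np.random.rand(4, 4, 3))
print(mask.astype(np.uint8))
```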
13. Evaluation of Image Descriptors for Retrieval of Similar Images
Abstract
The paper addresses the issue of searching for similar images in an information repository. The contained images are annotated with the help of sparse descriptors. In the presented research, different color and edge histogram descriptors were used. To measure distances between images, the sets of their descriptors are compared. For this purpose, different similarity measures were employed. Results of these experiments are presented, together with a discussion of the advantages and limitations of different combinations of methods for the retrieval of similar images.
Rafał Frączek, Bogusław Cyganek
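A minimal version of the experimental setup, i.e. histogram descriptors compared under different similarity measures, can be written down directly. The bin count and the two measures chosen below (histogram intersection and chi-square) are arbitrary examples from the families the abstract alludes to, not necessarily the ones the authors evaluated.

```python
# A compact illustration of the evaluation setup: per-channel color
# histograms as descriptors, compared with two of many possible measures.
import numpy as np

def color_histogram(image, bins=8):
    """Concatenated per-channel histograms, L1-normalized."""
    h = np.concatenate([np.histogram(image[..., c], bins=bins,
                                     range=(0.0, 1.0))[0] for c in range(3)])
    return h / h.sum()

def histogram_intersection(h1, h2):
    return np.minimum(h1, h2).sum()                   # 1.0 = identical

def chi_square(h1, h2):
    return ((h1 - h2) ** 2 / (h1 + h2 + 1e-9)).sum()  # 0.0 = identical

a, b = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
ha, hb = color_histogram(a), color_histogram(b)
print(histogram_intersection(ha, hb), chi_square(ha, hb))
```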
14. Online Sound Restoration for Digital Library Applications
Abstract
In this paper, a sound restoration system conceived and engineered at the Multimedia Systems Department of the Gdansk University of Technology is discussed with regard to the principles of its design, its features of operation, and the achieved results. The system has been designed so that: no special sound restoration software is needed to perform audio restoration; no skills in digital signal processing are required of the user; and the process of online restoration employs automatic reduction of noise, wow, and impulse distortions.
Andrzej Czyżewski, Adam Kupryjanow, Bożena Kostek
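Of the three restoration tasks listed (noise, wow, impulse distortions), noise reduction has the shortest textbook illustration: spectral subtraction. The sketch below is that generic textbook method with arbitrary parameters, not the Gdansk system's algorithm.

```python
# Not the described system: a textbook spectral-subtraction sketch to
# illustrate automatic noise reduction. All parameters are arbitrary.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
# one second of a 440 Hz tone buried in white noise
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs) + 0.1 * np.random.randn(fs)

f, t, Z = stft(x, fs=fs, nperseg=512)
noise_floor = np.abs(Z[:, :5]).mean(axis=1, keepdims=True)  # leading frames
mag = np.maximum(np.abs(Z) - noise_floor, 0.0)              # subtract estimate
Z_clean = mag * np.exp(1j * np.angle(Z))                    # keep phase
_, x_clean = istft(Z_clean, fs=fs, nperseg=512)
print(x_clean.shape)
```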
15. Connectionist Language Model for Polish
Abstract
This article describes a connectionist language model which may be used as an alternative to the well-known n-gram models. A comparison experiment between n-gram and connectionist language models is performed on a Polish text corpus. Statistical language modeling is based on estimating the joint probability function of a sequence of words in a given language. This task is made problematic by a phenomenon commonly known as the “curse of dimensionality”: the sequence of words used to test the model is most likely going to be different from anything present in the training data. Classic solutions to this problem use n-grams, which generalize the data by concatenating short overlapping word sequences gathered from the training data. Connectionist models, however, accomplish this by learning a distributed representation for words. They simultaneously learn both the distributed representation for each word in the dictionary and the synaptic weights used for modeling the joint probability of word sequences. Generalization is obtained thanks to the fact that a sequence made up of words that were already seen will receive a higher probability than an unseen sequence of words. In the experiments, perplexity is used as a measure of language model quality.
Łukasz Brocki, Krzysztof Marasek, Danijel Koržinek
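The "distributed representation" idea is easiest to grasp from the forward pass of a Bengio-style feed-forward language model: context word embeddings are concatenated, pushed through a hidden layer, and softmaxed over the vocabulary, with perplexity computed from the resulting probabilities. The sketch below uses random, untrained weights, so its output is meaningless; it only fixes the shapes and the computation, and is not the paper's model.

```python
# Stripped-down forward pass of a Bengio-style feed-forward neural language
# model. Weights are random, so the printed perplexity is meaningless.
import numpy as np

V, dim, ctx, hidden = 50, 16, 2, 32      # vocab, embedding, context, hidden
rng = np.random.default_rng(0)
E = rng.normal(size=(V, dim))            # word embeddings (learned jointly)
W1 = rng.normal(size=(ctx * dim, hidden))
W2 = rng.normal(size=(hidden, V))

def next_word_probs(context_ids):
    x = E[context_ids].reshape(-1)       # concatenate context embeddings
    h = np.tanh(x @ W1)
    logits = h @ W2
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # softmax over the vocabulary

def perplexity(token_ids):
    logp = [np.log(next_word_probs(token_ids[i - ctx:i])[token_ids[i]])
            for i in range(ctx, len(token_ids))]
    return float(np.exp(-np.mean(logp)))

print(perplexity(rng.integers(0, V, size=20)))
```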
16. Commonly Accessible Web Service Platform — Wiki-WS
Abstract
Web service technology was originally intended to supply complete and reliable system components. Nowadays this technology is commonly used by companies providing the results of their work to end users while hiding implementation details. This paper presents a SOA-enabled platform — Wiki-WS — that empowers users to deploy, modify, discover and invoke web services. The main concept of the Wiki-WS platform is the search and invocation of web services written by different workgroups, in different technologies, and deployed on different servers. Wiki-based modification of web service code allows engineers anywhere to construct and implement components that are mature and ready to use. Components deployed in the platform cover different functionalities from various knowledge domains. Service categorization, performed by users and by the platform's classification engine at initial deployment and after every modification, improves future search and usage decisions. The paper describes the fundamental architecture components, characterizes the user categories, and also presents sample scenarios of Wiki-WS usage and the advantages derived from its deployment.
Henryk Krawczyk, Marek Downar
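The discover-then-invoke flow the abstract describes can be pictured with a toy registry. Every name, category and endpoint below is invented for the example, and the platform's actual discovery and classification machinery is far richer.

```python
# Hypothetical illustration of a discover-then-invoke flow; not Wiki-WS code.
import requests

registry = [
    {"name": "ocr", "category": "text processing",
     "endpoint": "https://example.org/services/ocr"},
    {"name": "lang-detect", "category": "text processing",
     "endpoint": "https://example.org/services/lang"},
]

def discover(category):
    """Find deployed services by category, as a user or engine might."""
    return [s for s in registry if s["category"] == category]

def invoke(service, payload):
    # In a SOA setting this could equally be a SOAP call; REST keeps it short.
    return requests.post(service["endpoint"], json=payload, timeout=10)

for svc in discover("text processing"):
    print(svc["name"], svc["endpoint"])  # invoke(svc, {...}) would call it
```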
Backmatter
Metadata
Title
Intelligent Tools for Building a Scientific Information Platform
edited by
Robert Bembenik
Lukasz Skonieczny
Henryk Rybiński
Marek Niezgodka
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-24809-2
Print ISBN
978-3-642-24808-5
DOI
https://doi.org/10.1007/978-3-642-24809-2
