
2015 | Book

Semantic Web Evaluation Challenges

Second SemWebEval Challenge at ESWC 2015, Portorož, Slovenia, May 31 - June 4, 2015, Revised Selected Papers


About this book

This book constitutes the thoroughly refereed post-conference proceedings of the second edition of the Semantic Web Evaluation Challenge, SemWebEval 2015, co-located with the 12th European Semantic Web Conference (ESWC 2015), held in Portorož, Slovenia, in May/June 2015.

This book includes the descriptions of all methods and tools that competed at SemWebEval 2015, together with a detailed description of the tasks, evaluation procedures and datasets. The contributions are grouped into the following areas: Open Knowledge Extraction Challenge (OKE 2015); Semantic Publishing Challenge (SemPub 2015); Schema-Agnostic Queries over Large-Schema Databases Challenge (SAQ 2015); Concept-Level Sentiment Analysis Challenge (CLSA 2015).

Table of contents

Frontmatter

Open Knowledge Extraction Challenge (OKE-2015)

Frontmatter
Open Knowledge Extraction Challenge
Abstract
The Open Knowledge Extraction (OKE) challenge is aimed at promoting research in the automatic extraction of structured content from textual data and its representation and publication as Linked Data. We designed two extraction tasks: (1) Entity Recognition, Linking and Typing, and (2) Class Induction and Entity Typing. Four systems participated in the challenge: CETUS-FOX and FRED competed in both tasks, Adel in Task 1, and OAK@Sheffield in Task 2. In this paper we describe the OKE challenge, the tasks, the datasets used for training and evaluating the systems, the evaluation method, and the obtained results.
Andrea Giovanni Nuzzolese, Anna Lisa Gentile, Valentina Presutti, Aldo Gangemi, Darío Garigliotti, Roberto Navigli
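To make the output of Task 1 (recognizing an entity mention, linking it to a knowledge base, and typing it) concrete, the following minimal sketch uses Python's rdflib to build a small annotation graph. The document and annotation URIs and the linkedTo property are hypothetical; the official challenge serialization is NIF-based and richer than this.

```python
# Minimal illustrative sketch of a Task 1 annotation: an entity mention,
# its link to DBpedia, and its DOLCE+DnS Ultra Lite type.
# The example.org URIs and the ex:linkedTo property are hypothetical.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

DUL = Namespace("http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#")
DBR = Namespace("http://dbpedia.org/resource/")
EX = Namespace("http://example.org/oke/")  # hypothetical namespace

g = Graph()
g.bind("dul", DUL)
g.bind("dbr", DBR)

mention = EX["sentence1#char=0,20"]  # hypothetical mention URI
g.add((mention, RDF.type, EX.EntityMention))
g.add((mention, RDFS.label, Literal("Florence May Harding")))
g.add((mention, EX.linkedTo, DBR["Florence_May_Harding"]))
g.add((DBR["Florence_May_Harding"], RDF.type, DUL.Person))

print(g.serialize(format="turtle"))
```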
CETUS - A Baseline Approach to Type Extraction
Abstract
The concurrent growth of the Document Web and the Data Web demands accurate information extraction tools to bridge the gap between the two. In particular, the extraction of knowledge on real-world entities is indispensable to populate knowledge bases on the Web of Data. Here, we focus on the recognition of types for entities to populate knowledge bases and enable subsequent knowledge extraction steps. We present CETUS, a baseline approach to entity type extraction. CETUS is based on a three-step pipeline comprising (i) offline, knowledge-driven type pattern extraction from natural-language corpora based on grammar rules, (ii) an analysis of the input text to extract types, and (iii) the mapping of the extracted type evidence to a subset of the DOLCE+DnS Ultra Lite ontology classes. We implement and compare two approaches for the third step, using the YAGO ontology as well as the FOX entity recognition tool.
Michael Röder, Ricardo Usbeck, René Speck, Axel-Cyrille Ngonga Ngomo
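As an illustration of the kind of grammar-rule-driven type evidence extraction described above, the sketch below applies a single copula pattern ("X is a/an Y") to pull candidate type phrases out of a sentence. The actual CETUS patterns are derived offline from corpora and are considerably more elaborate than this one regular expression.

```python
import re

# Toy copula pattern ("<entity> is a/an <type phrase>") standing in for the
# grammar rules CETUS derives offline from corpora.
TYPE_PATTERN = re.compile(
    r"(?P<entity>[A-Z][\w .-]*?) (?:is|was) an? (?P<type>[a-z][\w -]*)"
)

def extract_type_evidence(sentence: str):
    """Return (entity, candidate type phrase) pairs found in the sentence."""
    return [(m.group("entity"), m.group("type"))
            for m in TYPE_PATTERN.finditer(sentence)]

print(extract_type_evidence("Brian Banner is a fictional villain created by Marvel."))
# -> [('Brian Banner', 'fictional villain created by Marvel')]
```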
A Hybrid Approach for Entity Recognition and Linking
Abstract
Numerous research efforts are tackling the entity recognition and entity linking tasks, resulting in a large body of literature. The proposed approaches can be roughly categorized into two strategies: linguistic-based and semantic-based methods. In this paper, we present our participation in the OKE challenge, where we experiment with a hybrid approach that combines the strengths of a linguistic-based method with the high annotation coverage obtained by using a large knowledge base as an entity dictionary. The main goal of this hybrid approach is to maximize recall at the extraction and recognition stage before applying a pruning step. On the training set, the results are promising and the breakdown figures are comparable with the state-of-the-art performance of top-ranked systems. Our hybrid approach ranked first in the OKE Challenge on the test set.
Julien Plu, Giuseppe Rizzo, Raphaël Troncy
Using FRED for Named Entity Resolution, Linking and Typing for Knowledge Base Population
Abstract
FRED is a machine reader for extracting RDF graphs that are linked to LOD and compliant with Semantic Web and Linked Data patterns. We describe the capabilities of FRED as a semantic middleware for Semantic Web applications. In particular, we show (i) how FRED recognizes and resolves named entities, (ii) how it links them to existing knowledge bases, and (iii) how it assigns them a type. Given a sentence in any language, it provides different semantic functionalities (frame detection, topic extraction, named entity recognition, resolution and coreference, terminology extraction, sense tagging and disambiguation, taxonomy induction, semantic role labeling, type induction) by means of a versatile user interface, which can also be invoked as a REST Web service. The system can be freely used at http://wit.istc.cnr.it/stlab-tools/fred.
Sergio Consoli, Diego Reforgiato Recupero
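Since the abstract notes that FRED can be invoked as a REST Web service, the sketch below shows what such a call might look like from Python. The parameter name, the response format, and the absence of an API key are assumptions; consult the service documentation at the URL above before use.

```python
import requests

# Hypothetical invocation of the FRED REST service. The "text" parameter name
# and the RDF/XML Accept header are assumptions; current deployments may also
# require an API key.
FRED_ENDPOINT = "http://wit.istc.cnr.it/stlab-tools/fred"

def call_fred(sentence: str) -> str:
    response = requests.get(
        FRED_ENDPOINT,
        params={"text": sentence},
        headers={"Accept": "application/rdf+xml"},
        timeout=30,
    )
    response.raise_for_status()
    return response.text  # RDF graph describing frames, entities and types

if __name__ == "__main__":
    print(call_fred("Miles Davis was an American jazz trumpeter."))
```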
Exploiting Linked Open Data to Uncover Entity Types
Abstract
Extracting structured information from text plays a crucial role in automatic knowledge acquisition and is at the core of any knowledge representation and reasoning system. Traditional methods rely on hand-crafted rules and are restricted by the performance of various linguistic pre-processing tools. More recent approaches rely on supervised learning of relations trained on labelled examples, which can be created manually or, in some cases, generated automatically (referred to as distant supervision). We propose a supervised method for entity typing and alignment. We argue that a rich feature space can improve extraction accuracy, and we propose to exploit Linked Open Data (LOD) for feature enrichment. Our approach is tested on Task 2 of the Open Knowledge Extraction challenge, which includes automatic entity typing and alignment. We demonstrate that combining evidence derived from LOD (e.g., DBpedia) and conventional lexical resources (e.g., WordNet) (i) improves the accuracy of the supervised induction method and (ii) enables easy matching with the DOLCE+DnS Ultra Lite ontology classes.
Jie Gao, Suvodeep Mazumdar
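The LOD-based feature enrichment mentioned above can be illustrated with a small sketch that pulls the types asserted for a DBpedia resource via SPARQL and exposes them as binary features for a supervised typing model. The SPARQLWrapper call and query are standard; the feature encoding is a simplified assumption, not the paper's actual feature space.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Simplified illustration of LOD-based feature enrichment: fetch rdf:type
# assertions for a DBpedia resource and turn them into binary features.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

def dbpedia_type_features(resource_uri: str) -> dict:
    sparql.setQuery(f"SELECT ?type WHERE {{ <{resource_uri}> a ?type }}")
    results = sparql.query().convert()
    return {
        f"lod_type:{binding['type']['value']}": 1
        for binding in results["results"]["bindings"]
    }

features = dbpedia_type_features("http://dbpedia.org/resource/Skara_Cathedral")
print(sorted(features)[:5])
```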

Semantic Publishing Challenge (SemPub2015)

Frontmatter
Semantic Publishing Challenge - Assessing the Quality of Scientific Output by Information Extraction and Interlinking
Abstract
The Semantic Publishing Challenge series aims at investigating novel approaches for improving scholarly publishing using Linked Data technology. In 2014 we bootstrapped this effort with a focus on extracting information from non-semantic publications – computer science workshop proceedings volumes and their papers – to assess their quality. The objective of this second edition was to improve information extraction but also to interlink the 2014 dataset with related ones in the LOD Cloud, thus paving the way for sophisticated end-user services.
Angelo Di Iorio, Christoph Lange, Anastasia Dimou, Sahar Vahdati
Information Extraction from Web Sources Based on Multi-aspect Content Analysis
Abstract
Information extraction from web pages is often considered a difficult task, mainly due to the loose structure and insufficient semantic annotation of their HTML code. Since web pages are primarily created for viewing by human readers, their authors usually do not pay much attention to the structure, or even the validity, of the HTML code itself. The CEUR Workshop Proceedings pages are a good illustration of this: their code ranges from invalid HTML markup to fully valid and semantically annotated documents, while preserving a largely unified visual presentation of the contents. In this paper, as a contribution to the ESWC 2015 Semantic Publishing Challenge, we present an information extraction approach based on analyzing the rendered pages rather than their code. The documents are represented by an RDF-based model that allows the results of different page analysis methods, such as layout analysis and visual and textual feature classification, to be combined. This makes it possible to specify a set of generic rules for extracting particular information from a page independently of its code.
Martin Milicka, Radek Burget
Extracting Contextual Information from Scientific Literature Using CERMINE System
Abstract
CERMINE is a comprehensive open source system for extracting structured metadata and references from born-digital scientific literature. Among other information, the system is able to extract information related to the context in which an article was written, such as the authors and their affiliations, the relations between them, and references to other articles. Extracted information is presented in a structured, machine-readable form. CERMINE is based on a modular workflow whose loosely coupled architecture allows individual components to be evaluated and adjusted, enables straightforward improvement and replacement of independent parts of the algorithm, and facilitates future expansion of the architecture. The implementation of the workflow is based mostly on supervised and unsupervised machine-learning techniques, which simplifies the procedure of adapting the system to new document layouts and styles. In this paper we outline the overall workflow architecture, describe key aspects of the system implementation, provide details about the training and adjustment of individual algorithms, and finally report how CERMINE was used for extracting contextual information from scientific articles in PDF format in the context of the ESWC 2015 Semantic Publishing Challenge. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl.
Dominika Tkaczyk, Łukasz Bolikowski
Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications
Abstract
Scholarly publishing increasingly requires automated systems that semantically enrich documents in order to support management and quality assessment of scientific output. However, contextual information, such as the authors' affiliations, references, and funding agencies, is typically hidden within PDF files. To access this information we have developed a processing pipeline that analyses the structure of a PDF document using a diverse set of machine learning techniques. First, unsupervised learning is used to extract contiguous text blocks from the raw character stream as the basic logical units of the article. Next, supervised learning is employed to classify blocks into different metadata categories, including authors and affiliations. Then, a set of heuristics is applied to detect the reference section at the end of the paper and segment it into individual reference strings. Sequence classification is then used to categorise the tokens of individual references and obtain information such as the journal and the year of the reference. Finally, we make use of named entity recognition techniques to extract references to research grants, funding agencies, and EU projects. Our system is modular in nature: some parts rely on models learnt from training data, and the overall performance scales with the quality of these data sets.
Stefan Klampfl, Roman Kern
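As a rough sketch of the supervised block-classification step described above (assigning extracted text blocks to metadata categories), the snippet below trains a linear classifier on a few simple surface features. The features, labels and training examples are illustrative assumptions only, not the authors' actual feature set or data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy version of supervised block classification: describe each text block by
# surface features and map it to a metadata category. Illustrative only.
def block_features(block: str) -> dict:
    tokens = block.split()
    return {
        "n_tokens": len(tokens),
        "has_digit": any(ch.isdigit() for ch in block),
        "has_at_sign": "@" in block,
        "title_case_ratio": sum(t.istitle() for t in tokens) / max(len(tokens), 1),
    }

train_blocks = [
    "Stefan Klampfl and Roman Kern",
    "Know-Center GmbH, Graz, Austria",
    "Scholarly publishing increasingly requires automated systems ...",
]
train_labels = ["authors", "affiliation", "body"]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit([block_features(b) for b in train_blocks], train_labels)
print(model.predict([block_features("Graz University of Technology, Austria")]))
```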
MACJa: Metadata and Citations Jailbreaker
Abstract
This paper presents the Metadata And Citations Jailbreaker (a.k.a. MACJa, IPA: /'matsja/), a method for processing the research papers available on CEUR-WS.org and stored as PDF files in order to extract relevant semantic data and publish them in an RDF triplestore according to the Semantic Publishing And Referencing (SPAR) Ontologies. In particular, MACJa extracts all the information needed to address the queries of the Semantic Publishing Challenge 2015 (Task 2) by using techniques based on Natural Language Processing (i.e., Combinatory Categorial Grammar, Discourse Representation Theory, Linguistic Frames), Semantic Web technologies, and good Ontology Design practices (i.e., Content Analysis, Ontology Design Patterns, Discourse Referent Extraction and Linking, Topic Extraction).
Andrea Giovanni Nuzzolese, Silvio Peroni, Diego Reforgiato Recupero
Automatic Construction of a Semantic Knowledge Base from CEUR Workshop Proceedings
Abstract
We present an automatic workflow that performs text segmentation and entity extraction from scientific literature, primarily addressing Task 2 of the Semantic Publishing Challenge 2015. The goal of Task 2 is to extract various information from full-text papers to represent the context in which a document is written, such as the affiliation of its authors and the corresponding funding bodies. Our proposed solution is composed of two subsystems: (i) a text mining pipeline, developed based on the GATE framework, which extracts structural and semantic entities, such as authors' information and references, and produces semantic (typed) annotations; and (ii) a flexible exporting module, the LODeXporter, which translates the document annotations into RDF triples according to custom mapping rules. Additionally, we leverage existing Named Entity Recognition (NER) tools to extract named entities from text and ground them to their corresponding resources on the Linked Open Data cloud, thus briefly covering the Task 3 objectives, which involve linking detected entities to resources in existing open datasets. The output of our system is an RDF graph stored in a scalable TDB-based store with a public SPARQL endpoint for the task's queries.
Bahar Sateli, René Witte
CEUR-WS-LOD: Conversion of CEUR-WS Workshops to Linked Data
Abstract
CEUR-WS.org is a well-known venue for publishing workshop proceedings and is very popular within the Computer Science community. This makes it an interesting source for different kinds of analytics, e.g., measuring the popularity of a workshop series or a person's contribution to the field through workshop organization. Insightful and effective analytics require combining information from different sources that supplement each other, which raises a number of challenges that can be mitigated by using Semantic Web technologies.
Maxim Kolchin, Eugene Cherny, Fedor Kozlov, Alexander Shipilo, Liubov Kovriguina
Metadata Extraction from Conference Proceedings Using Template-Based Approach
Abstract
The paper describes a number of metadata extraction procedures based on a rule-based approach and pattern matching applied to the CEUR Workshop Proceedings (cf. http://ceur-ws.org), and their conversion to a Linked Open Data (LOD) dataset in the framework of the ESWC 2015 Semantic Publishing Challenge (cf. http://github.com/ceurws/lod/wiki/SemPub2015).
Liubov Kovriguina, Alexander Shipilo, Fedor Kozlov, Maxim Kolchin, Eugene Cherny
Semantically Annotating CEUR-WS Workshop Proceedings with RML
Abstract
In this paper, we present our solution for the first task of the second Semantic Publishing Challenge. The task requires extracting and semantically annotating information regarding CEUR-WS workshops, their chairs and conference affiliations, as well as their papers and their authors, from a set of HTML-encoded workshop proceedings volumes. Our solution builds on last year's submission, while we address a number of shortcomings, assess the generated dataset for its quality, and publish the queries as SPARQL query templates. This is accomplished using the RDF Mapping Language (RML) to define the mappings, the RMLProcessor to execute them, RDFUnit to both validate the mapping documents and assess the generated dataset's quality, and The DataTank to publish the SPARQL query templates. This results in an overall improved quality of the generated dataset that is reflected in the query results.
Pieter Heyvaert, Anastasia Dimou, Ruben Verborgh, Erik Mannens, Rik Van de Walle
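For readers unfamiliar with RML, the sketch below embeds a minimal mapping document (over a hypothetical CSV logical source rather than the HTML sources used in the paper) and parses it with rdflib to check that it is well-formed Turtle. The file name and the example vocabulary are assumptions; the paper's actual mappings are richer.

```python
from rdflib import Graph

# A minimal RML mapping document, embedded as a string and parsed with rdflib
# to verify it is well-formed Turtle. It maps rows of a hypothetical
# volumes.csv file to workshop-proceedings resources.
RML_MAPPING = """
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.org/ns#> .

<#VolumeMapping>
    rml:logicalSource [
        rml:source "volumes.csv" ;
        rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [
        rr:template "http://example.org/volume/{number}"
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:title ;
        rr:objectMap [ rml:reference "title" ]
    ] .
"""

graph = Graph().parse(data=RML_MAPPING, format="turtle")
print(f"Parsed {len(graph)} mapping triples.")
```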
On the Automated Generation of Scholarly Publishing Linked Datasets: The Case of CEUR-WS Proceedings
Abstract
The availability of highly informative semantic descriptions of scholarly publishing contents enables easier sharing and reuse of research findings as well as a better assessment of the quality of scientific production. In the context of the ESWC 2015 Semantic Publishing Challenge, we present a system that automatically generates rich RDF datasets from CEUR-WS workshop proceedings and exposes them as Linked Data. Web pages of proceedings and the textual contents of papers are analyzed through dedicated text processing pipelines. Semantic annotations are added by a set of SVM classifiers and refined by heuristics, gazetteers and rule-based grammars. Web services are exploited to link annotations to external datasets such as DBpedia, CrossRef, FundRef and Bibsonomy. Finally, the data is modelled and published as an RDF graph.
Francesco Ronzano, Beatriz Fisas, Gerard Casamayor del Bosque, Horacio Saggion

Schema-Agnostic Queries over Large-Schema Databases Challenge (SAQ-2015)

Frontmatter
The Schema-Agnostic Queries (SAQ-2015) Semantic Web Challenge: Task Description
Abstract
As datasets grow in schema size and heterogeneity, the development of infrastructures which can support users in querying and exploring the data, without the need to fully understand the conceptual model behind it, becomes a fundamental functionality for contemporary data management. The first edition of the Schema-agnostic Queries Semantic Web Challenge (SAQ-2015) aims at creating a test collection to evaluate schema-agnostic/schema-free query mechanisms, i.e. mechanisms which are able to semantically match user queries expressed in the users' own vocabulary to dataset elements, allowing users to be partially or fully abstracted from the representation of the data.
André Freitas, Christina Unger
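To make "schema-agnostic" concrete, the sketch below contrasts a query phrased in the user's own vocabulary with the schema-matched SPARQL a query mechanism should resolve it to over DBpedia. The example pair and the mapping are illustrative assumptions, not part of the official test collection.

```python
# Illustrative contrast between a schema-agnostic query written in the user's
# own vocabulary and the schema-matched SPARQL it should resolve to.
USER_QUERY = """
SELECT ?wife WHERE { ?wife spouse_of :Barack_Obama }
"""

RESOLVED_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?wife WHERE { dbr:Barack_Obama dbo:spouse ?wife }
"""

print("User vocabulary :", USER_QUERY.strip())
print("Dataset schema  :", RESOLVED_QUERY.strip())
```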
UMBC_Ebiquity-SFQ: Schema Free Querying System
Abstract
Users need better ways to explore large, complex linked data resources. Using SPARQL requires not only mastering its syntax and semantics but also understanding the RDF data model, the ontology, and the URIs for entities of interest. Natural language question answering systems address the problem, but these are still a subject of research. The schema-agnostic SPARQL queries task defined in the SAQ-2015 challenge consists of schema-agnostic queries following the syntax of the SPARQL standard, where the syntax and semantics of operators are maintained, while users are free to choose words, phrases and entity names irrespective of the underlying schema or ontology. This combination of a query skeleton with keywords helps to remove some of the ambiguity. We describe our framework for handling schema-agnostic or schema-free queries and discuss enhancements made to handle the SAQ-2015 challenge queries. The key contributions are robust methods that combine statistical association and semantic similarity to map user terms to the most appropriate classes and properties in the underlying ontology, and type inference for user input concepts based on concept linking.
Zareen Syed, Lushan Han, Muhammad Rahman, Tim Finin, James Kukla, Jeehye Yun
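A drastically simplified version of the term-mapping step (choosing the ontology property closest to a user's term) can be sketched with plain string similarity. The real system combines corpus-based statistical association with semantic similarity, so this stand-in only illustrates the shape of the mapping interface; the candidate property list is an assumption.

```python
from difflib import SequenceMatcher

# Stand-in for the term-mapping step: pick the ontology property whose local
# name is most similar to the user's term. Plain string similarity is only a
# placeholder for the statistical/semantic similarity used in the paper.
ONTOLOGY_PROPERTIES = ["spouse", "birthPlace", "almaMater", "doctoralAdvisor"]

def map_user_term(term: str, candidates=ONTOLOGY_PROPERTIES) -> str:
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, term.lower(), c.lower()).ratio(),
    )

print(map_user_term("advisor"))  # -> 'doctoralAdvisor'
# The paper's semantic-similarity component would also handle terms with no
# string overlap, e.g. mapping the user term "wife" to the property "spouse".
```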

Concept-Level Sentiment Analysis Challenge (CLSA2015)

Frontmatter
ESWC 15 Challenge on Concept-Level Sentiment Analysis
Abstract
A consequence of the massive use of social networks, blogs, wikis, etc., is a change in users' behaviour on, and their interaction with, the Web: opinions, emotions and sentiments are now expressed differently than in the past. Lexical understanding of text is no longer enough to detect sentiment polarities; semantics has become key for sentiment detection. This generates potential business opportunities, especially within the marketing area, and key stakeholders need to catch up with the latest technology if they want to remain competitive in the market. Understanding opinions and their peculiarities in written text therefore requires a deep understanding of natural language and the semantics behind it. Recently, it has been shown that the use of semantics improves the accuracy of existing sentiment analysis systems, which are mainly based on pure machine learning or other statistical approaches. The second edition of the Concept-Level Sentiment Analysis challenge aims to provide a further stimulus in this direction by offering researchers an event where they can learn and experiment with how to employ Semantic Web features within their sentiment analysis systems in order to reach higher performance.
Diego Reforgiato Recupero, Mauro Dragoni, Valentina Presutti
The Benefit of Concept-Based Features for Sentiment Analysis
Abstract
Sentiment analysis is an active field of research, which has moved from traditional algorithms that operated on complete documents to fine-grained variants where aspects of the topic being discussed are extracted, together with their associated sentiment. Recently, a move from traditional word-based approaches to concept-based approaches has started. In this work, it is shown, using a simple machine learning baseline, that concepts are useful as features within a machine learning framework. In all our experiments, performance increases when the concept-based features are included.
Kim Schouten, Flavius Frasincar
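A minimal sketch of the idea (augmenting word-based features with concept-based ones in an off-the-shelf classifier) is given below. The concept lookup is a toy dictionary standing in for a real concept resource such as SenticNet, and the tiny training set is invented, so results obtained with it are not comparable to the paper's.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy illustration of adding concept-based features on top of word features:
# the concept dictionary stands in for a real resource such as SenticNet and
# simply appends concept tokens to the text before vectorizing.
CONCEPTS = {"battery life": "CONCEPT_battery_life", "screen": "CONCEPT_screen"}

def add_concepts(text: str) -> str:
    extra = [c for phrase, c in CONCEPTS.items() if phrase in text.lower()]
    return text + " " + " ".join(extra)

train_texts = [
    "The battery life is excellent",
    "The screen broke after a week",
    "Great screen and solid build",
    "Terrible battery life, avoid it",
]
train_labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit([add_concepts(t) for t in train_texts], train_labels)
print(model.predict([add_concepts("Excellent battery life and a great screen")]))
```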
An Information Retrieval-Based System for Multi-domain Sentiment Analysis
Abstract
This paper describes the SHELLFBK system that participated in the ESWC 2015 Sentiment Analysis challenge. Our system takes a supervised approach that builds on techniques from information retrieval. The algorithm populates an inverted index with pseudo-documents that encode dependency parse relationships extracted from the sentences in the training set. Each record stored in the index is annotated with the polarity and domain of the sentence it represents; this way, it is possible to have a more fine-grained representation of the learnt sentiment information. When the polarity of a new sentence has to be computed, the sentence is converted into a query and a two-step computation is performed: first, a domain is assigned to the sentence by comparing its content with the domain contextual information learnt during the training phase; second, once the domain is assigned, the polarity is computed and assigned to the new sentence. Preliminary experiments on an in-vitro test case showed promising results.
Giulio Petrucci, Mauro Dragoni
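The retrieval-style design described above can be illustrated with a compact sketch: training sentences are stored as pseudo-documents keyed (very roughly) by their word pairs, each tagged with a polarity, and a new sentence votes over the matching records. The indexing key is a simplification of the dependency-parse relations the system actually uses, and the domain-assignment step is omitted.

```python
from collections import defaultdict
from itertools import combinations

# Rough sketch of the retrieval-style approach: index training sentences as
# pseudo-documents keyed by word pairs (a crude stand-in for dependency
# relations), each tagged with a polarity; classify a new sentence by
# majority vote over matching records. Domain assignment is omitted.
index = defaultdict(list)

def keys(sentence: str):
    words = sorted(set(sentence.lower().split()))
    return combinations(words, 2)

def train(sentence: str, polarity: str):
    for key in keys(sentence):
        index[key].append(polarity)

def predict(sentence: str) -> str:
    votes = [p for key in keys(sentence) for p in index.get(key, [])]
    return max(set(votes), key=votes.count) if votes else "unknown"

train("the plot is brilliant", "positive")
train("the plot is dull and predictable", "negative")
print(predict("a brilliant plot"))  # -> 'positive'
```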
Detecting Sentiment Polarities with Sentilo
Abstract
We present the tool used for Task #1 of the Concept-Level Sentiment Analysis Challenge ESWC-CLSA 2015, concerning binary polarity detection of the sentiment of a sentence. Our tool is a slight modification of Sentilo [7], an unsupervised, domain-independent system, previously developed by our group, that performs sentiment analysis by hybridizing natural language processing techniques with Semantic Web technologies. Sentilo is able to recognize the opinion holder and measure the sentiment expressed on topics and sub-topics. The knowledge extracted from the text is represented by means of an RDF graph, and holders and topics are linked to external knowledge. Sentilo is available as a REST service as well as a user-friendly demo.
Andrea Giovanni Nuzzolese, Misael Mongiovì
Supervised Opinion Frames Detection with RAID
Abstract
Most systems for opinion analysis focus on the classification of opinion polarities and rarely consider the task of identifying the different elements and relations forming an opinion frame. In this paper, we present RAID, a tool featuring a processing pipeline for the extraction of opinion frames from text with their opinion expressions, holders, targets and polarities. RAID leverages a lexical, syntactic and semantic analysis of text, using several NLP tools such as dependency parsing, semantic role labelling, named entity recognition and word sense disambiguation. In addition, linguistic resources such as SenticNet and the MPQA Subjectivity Lexicon are used both to locate opinions in the text and to classify their polarities according to a fuzzy model that combines the sentiment values of different opinion words. RAID was evaluated on three different datasets and is released as open source software under the GPLv3 license.
Alessio Palmero Aprosio, Francesco Corcoglioniti, Mauro Dragoni, Marco Rospocher
Backmatter
Metadata
Title
Semantic Web Evaluation Challenges
Edited by
Fabien Gandon
Elena Cabrio
Milan Stankovic
Antoine Zimmermann
Copyright year
2015
Electronic ISBN
978-3-319-25518-7
Print ISBN
978-3-319-25517-0
DOI
https://doi.org/10.1007/978-3-319-25518-7