About this book

This book constitutes the refereed proceedings of the 22nd International Conference on Asia-Pacific Digital Libraries, ICADL 2020, which was planned to be held in Kyoto, Japan, in November/December 2020, but it was held virtually due to the COVID-19 pandemic.

The 10 full, 15 short, 4 practitioners, and 10 work-in-progress papers presented in this volume were carefully reviewed and selected from 79 submissions. The papers were organized in topical sections named: natural language processing; knowledge structures; citation data analysis; user analytics; application of cultural and historical data; social media; metadata and infrastructure; and scholarly data mining.

Table of Contents


Natural Language Processing


Improving Scholarly Knowledge Representation: Evaluating BERT-Based Models for Scientific Relation Classification

With the rapid growth of research publications, there is a vast amount of scholarly knowledge that needs to be organized in digital libraries. To deal with this challenge, techniques relying on knowledge-graph structures are being advocated. Within such graph-based pipelines, inferring relation types between related scientific concepts is a crucial step. Recently, advanced techniques relying on language models pre-trained on large corpora have been popularly explored for automatic relation classification. Despite the remarkable contributions that have been made, many of these methods were evaluated under different scenarios, which limits their comparability. To address this shortcoming, we present a thorough empirical evaluation of eight BERT-based classification models by focusing on two key factors: 1) BERT model variants, and 2) classification strategies. Experiments on three corpora show that a domain-specific pre-training corpus helps the BERT-based classification model identify the type of scientific relations. Although the strategy of predicting a single relation each time generally achieves a higher classification accuracy than the strategy of identifying multiple relation types simultaneously, the latter strategy demonstrates a more consistent performance in the corpus with either a large or small number of annotations. Our study aims to offer recommendations to the stakeholders of digital libraries for selecting the appropriate technique to build knowledge-graph-based systems for enhanced scholarly information organization.

Ming Jiang, Jennifer D’Souza, Sören Auer, J. Stephen Downie

A Framework for Classifying Temporal Relations with Question Encoder

Temporal-relation classification plays an important role in the field of natural language processing. Various deep learning-based classifiers, which can generate better models using sentence embedding, have been proposed to address this challenging task. These approaches, however, do not work well because of the lack of task-related information. To overcome this problem, we propose a novel framework that incorporates prior information by employing awareness of events and time expressions (time–event entities) as a filter. We name this module “question encoder.” In our approach, this kind of prior information can extract task-related information from sentence embedding. Our experimental results on a publicly available Timebank-Dense corpus demonstrate that our approach outperforms some state-of-the-art techniques.

Yohei Seki, Kangkang Zhao, Masaki Oguni, Kazunari Sugiyama

When to Use OCR Post-correction for Named Entity Recognition?

In the last decades, a huge number of documents has been digitised, before undergoing optical character recognition (OCR) to extract their textual content. This step is crucial for indexing the documents and to make the resulting collections accessible. However, the fact that documents are indexed through their OCRed content is posing a number of problems, due to the varying performance of OCR methods over time. Indeed, OCR quality has a considerable impact on the indexing and therefore the accessibility of digital documents. Named entities are among the most adequate information to index documents, in particular in the case of digital libraries, for which log analysis studies have shown that around 80% of user queries include a named entity. Taking full advantage of the computational power of modern natural language processing (NLP) systems, named entity recognition (NER) can be operated over enormous OCR corpora efficiently. Despite progress in OCR, resulting text files still have misrecognised words (or noise for short) which are harming NER performance. In this paper, to handle this challenge, we apply a spelling correction method to noisy versions of a corpus with variable OCR error rates in order to quantitatively estimate the contribution of post-OCR correction to NER. Our main finding is that we can indeed consistently improve the performance of NER when the OCR quality is reasonable (error rates respectively between 2% and 10% for characters (CER) and between 10% and 25% for words (WER)). The noise correction algorithm we propose is both language-independent and with low complexity.

Vinh-Nam Huynh, Ahmed Hamdi, Antoine Doucet
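
The character and word error rates (CER, WER) quoted in the abstract above are conventionally computed as an edit distance divided by the length of the reference text. The following is a minimal sketch of that conventional definition, not the authors' evaluation code; the function names and the sample strings are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, ocr_output):
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


def wer(reference, ocr_output):
    """Word error rate: word edits per reference word (whitespace tokens)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, ocr_output.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical ground truth and noisy OCR output, for illustration only.
    truth = "the quick brown fox"
    noisy = "the qu1ck brovvn fox"
    print(f"CER = {cer(truth, noisy):.3f}, WER = {wer(truth, noisy):.3f}")
```

Because a single wrong character spoils a whole word, WER is typically much higher than CER on the same text, which is consistent with the paired ranges given in the abstract.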

Semi-supervised Named-Entity Recognition for Product Attribute Extraction in Book Domain

Products sold in today’s marketplace are numerous and varied, and books are one such product. Detailed information about a book, such as its title, author, and publisher, is often presented in an unstructured format in the product title. To be useful for commercial applications such as catalogs, search functions, and recommendation systems, these attributes need to be extracted from the product title. In this study, we apply a Named-Entity Recognition model in a semi-supervised style to extract the attributes of e-commerce products in the book domain. We experiment with a number of extracted features, i.e. lexical, position, word shape, and embedding features. We extract the book attributes from nearly 30K product titles with an F-1 measure of 65%.

Hadi Syah Putra, Faisal Satrio Priatmadji, Rahmad Mahendra

Knowledge Structures


Semantic Segmentation of MOOC Lecture Videos by Analyzing Concept Change in Domain Knowledge Graph

Long lecture video metadata needs topic-wise annotation information for quick topic searching and video browsing. In this work we perform topical segmentation of long MOOC lecture videos to obtain the start-time and end-time of the different topics taught by the instructor. While teaching, an instructor uses different concepts to explain a topic, so each instructor has their own way of selecting and binding these concepts to represent a topic. Additionally, the knowledge graph of a subject domain contains inherent domain knowledge. In this work we analyze how the instructor changes concepts during topic change, the inherent knowledge available in a domain knowledge graph, and the semantic similarity and contextual relationship between different concepts to perform topical segmentation of long lecture videos. As output, we get semantically coherent topics taught by the instructor along with their interval (start-time and end-time). We tested our approach on 61 long NPTEL [1] videos delivered on the software engineering domain. Experimentally we find that the topic intervals generated by our system have ∼83% similarity with the intervals present in the ground truth. Holistic evaluation shows that our approach performs better than the other approaches in the literature.

Ananda Das, Partha Pratim Das

Towards Customizable Chart Visualizations of Tabular Data Using Knowledge Graphs

Scientific articles are typically published as PDF documents, thus rendering the extraction and analysis of results a cumbersome, error-prone, and often manual effort. New initiatives, such as ORKG, focus on transforming the content and results of scientific articles into structured, machine-readable representations using Semantic Web technologies. In this article, we focus on tabular data of scientific articles, which provide an organized and compressed representation of information. However, chart visualizations can additionally facilitate their comprehension. We present an approach that employs a human-in-the-loop paradigm during the data acquisition phase to define additional semantics for tabular data. The additional semantics guide the creation of chart visualizations for meaningful representations of tabular data. Our approach organizes tabular data into different information groups which are analyzed for the selection of suitable visualizations. The set of suitable visualizations serves as a user-driven selection of visual representations. Additionally, customization for visual representations provides the means for facilitating the understanding and sense-making of information.

Vitalis Wiens, Markus Stocker, Sören Auer

Wikipedia-Based Entity Linking for the Digital Library of Polish and Poland-Related News Pamphlets

The paper presents a series of experiments related to enhancing the content of digital library items with links to relevant Wikipedia entries that could offer the reader additional background information. Two methods of gathering such links are investigated: a Wikifier-based solution and search in Wikipedia using its integrated engine. The results are additionally filtered using frequency information from a large corpus and additional rules.

Maciej Ogrodniczuk, Włodzimierz Gruszczyński

Representing Semantified Biological Assays in the Open Research Knowledge Graph

In the biotechnology and biomedical domains, recent text mining efforts advocate for machine-interpretable, and preferably, semantified, documentation formats of laboratory processes. This includes wet-lab protocols, (in)organic materials synthesis reactions, genetic manipulations and procedures for faster computer-mediated analysis and predictions. Herein, we present our work on the representation of semantified bioassays in the Open Research Knowledge Graph (ORKG). In particular, we describe a semantification system work-in-progress to generate, automatically and quickly, the critical semantified bioassay data mass needed to foster a consistent user audience to adopt the ORKG for recording their bioassays and facilitate the organisation of research, according to FAIR principles.

Marco Anteghini, Jennifer D’Souza, Vitor A. P. Martins dos Santos, Sören Auer

Construction of Dunhuang Cultural Heritage Knowledge Base: Take Cave 220 as an Example

Based on the research of Dunhuang resources and the Silk Road culture and history, this paper discusses applying digital culture to humanities research. With advanced computing technology and network technology, this paper aims to create an open collaboration environment for academic resources, and it finally proposes a new knowledge organization and management paradigm which is helpful for “Belt and Road Initiative” studies and Dunhuang studies.

Xiaofei Sun, Ting Zhang, Lei Chen, Xiaoyang Wang, Jiakeng Tang

Citation Data Analysis


ReViz: A Tool for Automatically Generating Citation Graphs and Variants

A systematic literature review provides an overview of multiple scientific publications in an area of research, and visualizations of the data of the systematic review enable further in-depth analyses. The creation of such a review and its visualizations is a very time- and labor-intensive process. For this reason, we propose a tool for automatically generating visualizations for systematic reviews. Using this tool, the citations between the included articles can be depicted in a citation graph. However, because the clarity of the information contained in the citation graph is highly dependent on the number of included publications, several strategies are implemented in order to reduce the complexity of the graph without losing (much) information. The generated graphs and developed strategies are evaluated using different instruments, including a user survey, in which they are rated positively.

Sven Groppe, Lina Hartung

A Large-Scale Analysis of Cross-lingual Citations in English Papers

Citation data is an important source of insight into the scholarly discourse and the reception of publications. Outcomes of citation analyses and the applicability of citation based machine learning approaches heavily depend on the completeness of citation data. One particular shortcoming of scholarly data nowadays is language coverage. That is, non-English publications are often not included in data sets, or language metadata is not available. While national citation indices exist, these are often not interconnected to other data sets. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on one million English papers, covering three scientific disciplines and a time span of 27 years. Our results unveil differences between languages and disciplines, show developments over time, and give insight into the impact of cross-lingual citations on scholarly data mining as well as the publications that contain them. To facilitate further analyses, we make our collected data and code for analysis publicly available.

Tarek Saier, Michael Färber

How Do Retractions Influence the Citations of Retracted Articles?

Scientific retraction helps purge the continued use of flawed research. However, its practical influence needs to be identified and quantified. In this study, we analyzed the citations of 106 psychological articles from Web of Science to explore the influence of retraction using quantitative methods. Our results show that 1) retraction caused a significant decline (1.6–1.8 times) in the post-retraction citations; 2) retractions from open-access or high-quality journals are effective; 3) retraction is incapable of thoroughly eliminating the dissemination of flawed results. Our findings may provide useful insights for scholars and practitioners to understand and integrate the retraction system.

Siluo Yang, Fan Qi

Identification of Research Data References Based on Citation Contexts

In this paper, a method for the automatic identification of research data references in publications is proposed for automatically generating research data repositories. The International Conference on Language Resources and Evaluation (LREC) requires authors to list research data references separately from other publication references. The goal of our research is to automate the discrimination process. We investigated the reference lists in LREC papers and the citation contexts to find characteristic features that are useful for identifying research data references. We confirmed that key phrases appeared in the citation contexts and the bibliographical elements in the reference lists. Our proposed method uses the presence or absence of key phrases to identify research data references. Experiments on LREC proceedings papers proved the effectiveness of using key phrases in the citation context.

Tomoki Ikoma, Shigeki Matsubara

User Analytics


A Predictive Model for Citizens’ Utilization of Open Government Data Portals

Open government data (OGD) initiatives for building OGD portals have not yet delivered the expected benefits of OGD to the whole of society. Although citizens’ reluctance to use OGD has become a key problem in the present OGD development, limited studies have been carried out to investigate citizens’ actual usage of OGD and OGD portals. In order to fill this research gap, this study primarily focuses on predicting citizens’ actual utilization of OGD portals. To find features influencing citizens’ utilization of OGD portals and to predict their actual usage of OGD portals, an experiment was designed and carried out in China. A predictive model was built with C5.0 algorithm based on data collected through the experiment, with a predictive accuracy rate of 84.81%. Citizens’ monthly income, the compatibility of OGD portals, and citizens’ attentiveness regarding their interactions with OGD portals are found to be the most important factors influencing citizens’ actual utilization of OGD portals. Positive effects of compatibility, attentiveness, and perceived usefulness on citizens’ usage of OGD portals are noticed.

Di Wang, Deborah Richards, Ayse Aysin Bilgin, Chuanfu Chen

Extracting User Interests from Operation Logs on Museum Devices for Post-Learning

Nowadays, a variety of information on museum collections online has been stored as digital archives. With the increasing use of smartphones and tablets in daily life, visitors can obtain various knowledge of museum exhibits for pre-learning by using mobile devices and applications. Also, interactive learning systems in museums are very active in the field of information engineering, and interactive on-site learning is necessary for recent education. However, existing learning support systems mainly focus on support for pre-learning or on-site learning, and they are not enough to provide more advanced learning in pre-learning or to deepen user interests in on-site learning. Therefore, it is necessary to support diverse knowledge levels of users on museum education for post-learning. In this paper, we aim to utilize video materials related to museums to support post-learning based on user interests by analyzing user interactions with exhibits on multimedia museum devices. For this, we propose a scoring method based on four features of user operation log data: keyword appearance frequency, keyword transition, media type, and media transition. Finally, we verified and discussed the effectiveness of our proposed scoring method through a user study.

Yuanyuan Wang, Yukiko Kawai, Kazutoshi Sumiya

A Motivational Design Approach to Integrate MOOCs in Traditional Classrooms

Despite the promising benefits of blended Massive Open Online Courses (MOOCs) over traditional face-to-face classes, it is still unclear how MOOCs should be integrated in the classroom. The findings regarding the effectiveness of such a blended learning approach are mixed and inconclusive. The present study aims to address this gap by investigating how MOOCs can be embedded in traditional classrooms. An embedded MOOC learning approach is proposed, in which students use MOOCs together with their classmates during class under the guidance of their class instructors. Drawing from a motivational design perspective, we adopted the ARCS model (i.e. Attention, Relevance, Confidence and Satisfaction) to evaluate the proposed learning approach and compare it with traditional classroom learning and blended learning approaches. The results showed that students in the embedded MOOC learning group had higher evaluations regarding attention, satisfaction and relevance perceptions than those in the traditional face-to-face learning group. In addition, the embedded MOOC learning approach received higher scores in terms of attention, relevance, confidence and satisfaction perceptions compared to the traditional approach of blended MOOCs. The implications for research, educators and practitioners are discussed at the end of the paper.

Long Ma, Chei Sian Lee

Analysis of Crowdsourced Multilingual Keywords in the Futaba Digital Archive: Lessons Learned for Better Metadata Collection

Metadata reflecting user needs is necessary to facilitate multilingual access to a digital archive. This paper describes the lessons learned from our experience of crowdsourcing the addition of multilingual keywords to the contents of the Futaba Digital Archive Project. We analyzed keywords offered for pictures of materials collected from evacuation shelters. We found that (1) the type of keyword differs according to the language, and (2) the term used for the same item is not always the same between languages. We propose to provide categories in the input interface and to create a keyword correspondence table for automatic completion of keywords for multilingual access.

Mari Kawakami, Tetsuo Sakaguchi, Tetsuya Shirai, Masaki Matsubara, Takamitsu Yoshino, Atsuyuki Morishima

Aging Well with Health Information: Examining Health Literacy and Information Seeking Behavior Using a National Survey Dataset

Health literacy is critical in disease prevention particularly in the older population. This secondary data analysis of a national survey is to determine the levels of health literacy, and to investigate how it links to health information seeking behavior, disease prevention behavior, and personal characteristics in adults aged 50 and above in Taiwan. Data were obtained from the Taiwan Longitudinal Study on Aging (TLSA) conducted in 2015 (N = 8,300). Cluster analysis and comparison analyses were used in this study. Health literacy was measured using self-rated questions about the barriers to communicate or learn health-related information in clinical and daily living scenarios. Health information seeking behavior was measured based on the engagement and frequency in using health information sources. Self-perceived health was measured based on self-rated health conditions. Disease prevention behavior was measured using self-reported activities regarding disease prevention. Two clusters of health literacy were identified: high (69.58%) and low (30.42%). The participants in the high health literacy cluster tended to have higher levels of education, younger age, and be male. In addition, high health literacy is associated with more frequent health information seeking behavior, better self-perceived health, and participation in more activities to prevent chronic diseases. Health professionals in geriatrics and librarians should pay more attention to those at risk with lower health literacy and facilitate the accessibility of health information sources. Social and regional characteristics of older adults’ health literacy can be further explored for a better design of interventions to help people age well in the future.

Fang-Lin Kuo, Tien-I Tsai

Application of Cultural and Historical Data


Entity Linking for Historical Documents: Challenges and Solutions

Named entities (NEs) are among the most relevant types of information that can be used to efficiently index and retrieve digital documents. Furthermore, the use of Entity Linking (EL) to disambiguate and relate NEs to knowledge bases provides supplementary information which can be useful to differentiate ambiguous elements such as geographical locations and people’s names. In historical documents, the detection and disambiguation of NEs is a challenge. Most historical documents are converted into plain text using an optical character recognition (OCR) system at the expense of some noise. Documents in digital libraries will, therefore, be indexed with errors that may hinder their accessibility. OCR errors affect not only document indexing but also the detection, disambiguation, and linking of NEs. This paper aims at analysing the performance of different EL approaches on two multilingual historical corpora, CLEF HIPE 2020 (English, French, German) and NewsEye (Finnish, French, German, Swedish), while proposing several techniques for alleviating the impact of historical data problems on the EL task. Our findings indicate that the proposed approaches not only outperform the baseline in both corpora but also considerably reduce the impact of historical document issues on different subjects and languages.

Elvys Linhares Pontes, Luis Adrián Cabrera-Diego, Jose G. Moreno, Emanuela Boros, Ahmed Hamdi, Nicolas Sidère, Mickaël Coustaty, Antoine Doucet

Using Deep Learning to Recognize Handwritten Thai Noi Characters in Ancient Palm Leaf Manuscripts

Extracting knowledge from ancient palm leaf manuscripts is essential for historians and other scholars who would like to access accumulated knowledge in the Thai Noi language manuscripts. In the absence of Thai Noi language readers, computer technologies play an important role in fulfilling this need. This research aims to apply deep learning approaches to recognize Thai Noi characters written in palm leaf manuscripts. The experiments were carried out by firstly collecting the page images of the manuscripts archived in the Museum of Art and Culture of Loei. Then the page images were preprocessed by converting to grayscale. To recognize Thai Noi characters, four convolutional neural network models based on inception and mobilenet networks namely Inception-v3, Inception-v4, MobileNetV1, and MobileNetV2 were evaluated. Handwritten Thai Noi characters were segmented from the grayscale images based on 26 Thai Noi characters. In this process, 100 images of each character were segmented and the whole dataset contained 2,600 images. Two image augmentation methods were applied to increase the amount of training data. Three experiments were carried out with three different datasets based on a 10-fold cross-validation design. The results indicate that MobileNetV1 outperformed other models in all experiments with an accuracy rate higher than 90%, while MobileNetV2 showed an interesting performance, which was almost equivalent to MobileNetV1 in the last experiment.

Wichai Puarungroj, Narong Boonsirisumpun, Pongsakon Kulna, Thanapong Soontarawirat, Nattiya Puarungroj
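
The 10-fold cross-validation design mentioned in the abstract above can be sketched as follows for a dataset of 2,600 images (26 characters with 100 images each). This is an illustrative sketch only: the abstract does not say whether the folds were stratified by character or how augmentation interacted with the split, so the plain shuffled split, the seed, and the function name `k_fold_indices` are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    n_images = 26 * 100  # 26 Thai Noi characters, 100 images each
    for fold_no, (train, test) in enumerate(k_fold_indices(n_images), start=1):
        print(f"fold {fold_no}: {len(train)} training images, {len(test)} test images")
```

Each image is used for testing exactly once and for training in the other nine folds, so the reported accuracy is an average over ten held-out subsets.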

Unchiku Generation Using Narrative Explanation Mechanism

In this paper, the authors propose an “unchiku generation mechanism” they have developed to create deep rhetorical structures (narrative discourses) in a narrative. The system uses a mechanism that can generate unchiku, which refers to the detailed and excessive knowledge regarding a specific object, theme, or topic. In particular, the attribute information of each noun concept is automatically extracted from the Japanese Wikipedia and stored in the noun conceptual dictionary of an integrated narrative generation system. The proposed generation mechanism enables the generation of unchiku information related to various objects and topics by inserting parts of the extracted unchiku knowledge content into various points in the story created by the integrated narrative generation system. The attribute information related to Kabuki is derived from the Japanese Wikipedia and utilized by the formulated unchiku generation mechanism.

Jumpei Ono, Takashi Ogata

Analyzing the Stage Performance Structure of a Kabuki-Dance, Kyoganoko Musume Dojoji, Using an Animation System

Although the Kabuki-dance Kyōganoko Musume Dōjōji is a type of sequel to the original Dōjōji legend, it has been performed by several excellent onnagata actors since the Edo era as a masterpiece that has original content beyond the original legends. Referring to the analysis of Kyōganoko Musume Dōjōji by Tamotsu Watanabe, this study aims to analyze in detail its “stage performance structures” that include characters, background (stage setting), music (instruments, musicians, and genres), poetry, prose, speech, and core conceptual themes of scenes as the main elements. Furthermore, using a system called KOSERUBE, which the authors have developed as an animation tool for a narrative generation system, this study builds its stage performance structures as an easy visual image. The future goal is the application of the system as a representation method for narrative generation systems, computer games, and automatically generated content, among others.

Miku Kawai, Jumpei Ono, Takashi Ogata

Artwork Information Embedding Framework for Multi-source Ukiyo-e Record Retrieval

Ukiyo-e culture has endured throughout Japanese art history to this day. With its high artistic value, ukiyo-e remains an important part of art history. Possibly more than one million ukiyo-e prints have been collected by institutions and individuals worldwide. Many public ukiyo-e databases of various scales have been created in different languages. The sharing of ukiyo-e culture could advance to a new stage if the information from all the databases could be shared without differences in information. However, understanding different languages in different databases, redundant data, missing data, uncertain data, and inconsistent data are all barriers to knowledge discovery in each database. Therefore, this paper uses Ukiyo-e Portal Database [1] prints that were released from the Art Research Center (ARC) of Ritsumeikan University as examples, explains the challenges that are currently solvable, and proposes a multi-source artwork information embedding framework for multimodal and multilingual retrieval.

Kangying Li, Biligsaikhan Batjargal, Akira Maeda, Ryo Akama

A Preliminary Attempt to Evaluate Machine Translations of Ukiyo-e Metadata Records

Providing multilingual metadata records for digital objects is a way of expanding access to digital cultural collections. Recent advancements in deep learning techniques have made machine translation (MT) more accurate. Therefore, we evaluate the performance of three well-known MT systems (i.e., Google Translate, Microsoft Translator, and DeepL Translator) in translating metadata records of ukiyo-e images from Japanese to English. We evaluate the quality of their translations with the automatic evaluation metric BLEU. The evaluation results show that DeepL Translator is better at translating ukiyo-e metadata records than Google Translate or Microsoft Translator, with Microsoft Translator performing the worst.

Yuting Song, Biligsaikhan Batjargal, Akira Maeda
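
BLEU, the metric named in the abstract above, combines clipped n-gram precisions with a brevity penalty. The following is a minimal sketch of a sentence-level variant with a single reference, uniform weights up to 4-grams, and no smoothing. The abstract does not state which BLEU configuration or tooling the authors used, so these choices are assumptions, and the sample sentences are invented rather than taken from ukiyo-e records.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference translation."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Invented example sentences, for illustration only.
    reference = "the cat sat on a mat"
    print(bleu("the cat sat on the mat", reference))
    print(bleu(reference, reference))
```

In practice, BLEU is usually reported at corpus level with an established implementation, since unsmoothed sentence-level scores drop to zero whenever a short record shares no 4-gram with its reference.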

Social Media


Collective Sensemaking and Location-Related Factors in the Context of a Brand-Related Online Rumor

This paper examines collective sensemaking over the life cycle of an online rumor while considering two location-related factors: geographical proximity and cultural context. It draws data from a rumor case in which a US-based customer claimed that Kentucky Fried Chicken (KFC) had served a fried rat. The rumor became viral on the Internet but was eventually debunked. The data included tweets across the three stages of the rumor life cycle (parturition, diffusion, and control). Content analysis was employed followed by chi-square tests and binary logistic regression. Based on content analysis of 1,276 tweets, opinion-related posts were found to be prevalent at the onset of the rumor life cycle while information-related entries continued to swell through the stages. Tweets from both within as well as outside the US were evident in the early stages but they became localized before the rumor subsided. While there was a blurring of high and low cultural context in opinion-related tweets, information-related tweets reflected the communication of low-context culture as the process of collective sensemaking unfolded. The paper augments the rumor literature by exploring geographical proximity and cultural context in the process of collective sensemaking over the three stages of the rumor life cycle. It offers implications for practitioners to deal with online rumors.

Alton Yeow Kuan Chua, Anjan Pal, Dion Hoe-Lian Goh

Identifying the Types of Digital Footprint Data Used to Predict Psychographic and Human Behaviour

Digital footprints can be defined as any data related to any online activity. When engaging online, the user leaves digital footprints that can be tracked across a range of digital activities, such as web explorer, checked-in location, YouTube, photo-tag and record purchase. Indeed, the use of all social media applications is also part of the digital footprint. This research was therefore conducted to classify the types of digital footprint data used to predict psychographic and human behaviour. A systematic analysis of 48 studies was undertaken to examine which form of digital footprint was taken into account in ongoing research. The results show that there are different types of data from digital footprints, such as structured data, unstructured data, geographic data, time-series data, event data, network data, and linked data. In conclusion, the use of digital footprint data is a practically new way of completing research into predicting psychographic and human behaviour. The use of digital footprint data also provides a tremendous opportunity for enriching insights into human behaviour.

Aliff Nawi, Zalmizy Hussin, Chua Chy Ren, Nurfatin Syahirah Norsaidi, Muhammad Syafiq Mohd Pozi

Profiling Bot Accounts Mentioning COVID-19 Publications on Twitter

This paper presents preliminary findings regarding automated bots mentioning scientific publications about COVID-19 on Twitter. A quantitative approach was adopted to characterize the social and posting patterns of bots, in contrast to other users, in Twitter scholarly communication. Our findings indicate that bots play a prominent role in research dissemination and discussion on the social web. We observed 0.45% explicit bots in our sample, producing 2.9% of tweets. The results imply that bots tweeted differently from non-bot accounts in terms of the volume and frequency of tweeting, the way they handled the content of tweets, and their preferences in article selection. Meanwhile, their behavioral patterns may not be the same as those of Twitter bots in other contexts. This study contributes to the literature by enriching the understanding of automated accounts in the process of scholarly communication and demonstrating the potential of bot-related studies in altmetrics research.

Yingxin Estella Ye, Jin-Cheon Na

Uncovering Topics Related to COVID-19 Pandemic on Twitter

The World Health Organization declared COVID-19 a pandemic on 11 March 2020 due to its rapid spread worldwide. This work-in-progress paper aims to uncover topics related to COVID-19 discussed on Twitter. Using topic modelling, we analyzed two weeks of tweets (11 March–25 March 2020) in English and found 17 latent topics, covering a broad range of issues such as health and economic impact, political and legislative responses, prevention measures, as well as disruption to individuals’ daily lives. The results of this preliminary study are a helpful step toward understanding public communication about the virus and can thus inform health practitioners in proposing effective safety measures against COVID-19.

Han Zheng, Dion Hoe-Lian Goh, Edmund Wei Jian Lee, Chei Sian Lee, Yin-Leng Theng

Classification in the LdoD Archive: A Crowdsourcing and Gamification Approach

This article presents a solution developed on top of the LdoD Archive for the classification of fragments in the context of a virtual edition, through the use of a serious game strategy. Participants select a classification for a fragment after following a series of steps that require them to propose tags for fragments and then vote on other participants’ tags. The goal of the game is twofold: it can be used as a crowdsourced tool to classify texts from the Book of Disquiet, in the context of a virtual edition, and it functions as a collaborative learning tool for the reading and analysis of texts from the Book of Disquiet.

Gonçalo Montalvão Marques, António Rito Silva, Manuel Portela

Metadata and Infrastructure


SchenQL: Evaluation of a Query Language for Bibliographic Metadata

Information access needs to be uncomplicated, as users may not benefit from complex and potentially richer data that may be less easy to obtain. A user’s demand for answering more sophisticated research questions, including aggregations, could be fulfilled by the usage of SQL. However, this comes at the cost of high complexity, which requires a high level of expertise even from trained programmers. A domain-specific query language could provide a straightforward solution to this problem. Although less generic, it is desirable that users not familiar with query construction are supported in the formulation of complex information needs. In this paper, we extend and evaluate SchenQL, a simple and applicable query language that is accompanied by a prototypical GUI. SchenQL focuses on querying bibliographic metadata while using the vocabulary of domain-experts. The easy-to-learn domain-specific query language is suitable for domain-experts as well as casual users while still providing the possibility to answer complicated queries. Query construction and information exploration are supported by the prototypical GUI. Finally, the complete system is evaluated: interviews with domain-experts and a bipartite quantitative user study demonstrate SchenQL’s suitability and high level of users’ acceptance.

Christin Katharina Kreutz, Michael Wolz, Benjamin Weyers, Ralf Schenkel

Domain-Focused Linked Data Crawling Driven by a Semantically Defined Frontier

A Cultural Heritage Case Study in Europeana

We propose a method for focused crawling of linked data with a frontier based on the semantic data elements in use within a knowledge domain. This method addresses the challenges of crawling large volumes of heterogeneous linked data, aiming to achieve improvements in crawling efficiency and accuracy. We present the results obtained by our method in a case study on the cultural heritage domain, more specifically on Europeana, the European Union digital platform for cultural heritage. We have evaluated the crawling method in two Europeana data providers that are publishing linked metadata with elements. We conclude that the proposed focused crawling method worked well in the case study, but it may need to be supplemented with additional frontier-delimiting strategies when applied to other domains.

Nuno Freire, Mário J. Silva

The Intellectual Property Risks of Integrating Public Digital Cultural Resources in China

To integrate public digital cultural resources is to cluster, integrate, and reorganize the scattered and relatively independent digital resources from libraries, museums, art galleries, cultural museums, and other public cultural institutions, to form an orderly joined digital resource system. However, there are intellectual property risks in the integration process. This paper discusses the intellectual property risks of integrating public digital cultural resources from three aspects: the change of the subject, object, and content. Moreover, some management measures are put forward, including easing restrictions on the integration of public digital cultural resources, establishing a copyright collective agency system, publishing necessary copyright statements, and protecting independent intellectual property.

Yi Chen, Si Li

Metadata Interoperability for Institutional Repositories: A Case Study in Malang City Academic Libraries

The aim of this study is to understand, describe, and analyze metadata interoperability at Universitas Brawijaya Library, which used Brawijaya Knowledge Garden (BKG) and Eprints software; University of Muhammadiyah Malang Library, which used Ganesha Digital Library (GDL) and Eprints software; and Malang State Library, which used Muatan Lokal (Mulok) software. The study also discusses supporting and inhibiting factors for metadata interoperability. It employed a qualitative case study approach. The findings indicate that metadata interoperability can be achieved by using metadata crosswalks, implemented by mapping BKG fields and GDL fields to the Dublin Core Metadata Element Set (DCMES). This results in a mapping of appropriate metadata schemes that does not remove existing metadata scheme elements and that demonstrates technical specifications for standard metadata. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) features can be used for metadata interoperability in the union catalog being developed by the National Library of Indonesia, called Indonesia OneSearch. The supporting factors in metadata interoperability are standard metadata and a standard protocol for interoperability, whereas the inhibiting factors are the limited number of staff with metadata skills and an open access policy that has not yet been applied in every academic library. All of these academic libraries need to work toward external interoperability with union catalogs to improve the visibility of their digital content, and to apply open access policies.

Gani Nur Pramudyo, Muhammad Rosyihan Hendrawan
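
The crosswalk idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the local field names (`local_title`, `local_author`, and so on) and the helper `to_dublin_core` are hypothetical, invented here for illustration, and do not come from BKG, GDL, or the paper. Only the Dublin Core element names (title, creator, date, subject) are real. The sketch maps known local fields to Dublin Core elements and keeps unmapped fields aside rather than dropping them, in line with the abstract's point that existing elements are not removed.

```python
# Hypothetical crosswalk table: the local field names are invented for
# illustration; only the Dublin Core element names on the right are real.
DC_CROSSWALK = {
    "local_title": "dc:title",
    "local_author": "dc:creator",
    "local_year": "dc:date",
    "local_keyword": "dc:subject",
}


def to_dublin_core(record, crosswalk):
    """Split a local record into (mapped, unmapped).

    mapped   -- Dublin Core element -> list of values (elements are repeatable)
    unmapped -- local fields with no crosswalk entry, kept rather than discarded
    """
    mapped = {}
    unmapped = {}
    for field, value in record.items():
        element = crosswalk.get(field)
        if element is None:
            unmapped[field] = value
        else:
            mapped.setdefault(element, []).append(value)
    return mapped, unmapped


# Example record with one field that has no Dublin Core counterpart in the table.
record = {
    "local_title": "Sample thesis",
    "local_author": "A. Author",
    "local_year": "2019",
    "local_faculty": "Engineering",
}
mapped, unmapped = to_dublin_core(record, DC_CROSSWALK)
```

In this sketch `mapped` holds the title, creator, and date values under their Dublin Core names, while `local_faculty` stays in `unmapped` so that a later step can decide how to handle it.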

MetaProfiles - A Mechanism to Express Metadata Schema, Privacy, Rights and Provenance for Data Interoperability

Documenting datasets in an actionable way is an essential approach to ensure data interoperability. Guidelines like FAIR (Findability, Accessibility, Interoperability, and Reusability) ensure better use-cases for the data. Proposals like metadata application profiles provide mechanisms to express constraints and metadata schema of the datasets. In order to address Ethical, Legal, and Social Aspects/Implications (ELSA/ELSI), datasets require more than the application profiles. Along with the schema, expressing privacy aspects of the data and constraints on rights and licenses also ensures proper ELSI. A good dataset profile needs validation rules provided in actionable formats and with human-readable documentation. Sample data will help consumers streamline the process of adapting the datasets. Different solutions exist to express these various components required to represent the datasets, such as DCAT to express the datasets, ShEx and SHACL to provide validation for datasets, Datapackage for providing the schema for tabular data, DCAP for creating metadata application profiles, and vocabularies like DPV to provide privacy constraints and ODRL to express rights of datasets. However, there is no simplified mechanism to interlink and distinguish these various elements in an actionable format. This research is intended to devise a mechanism to express a complete profile package for datasets, as ‘MetaProfile.’ MetaProfile is intended to cover a dataset’s profile with privacy, rights, and other essential components to ensure ELSI and interoperability of datasets. This research’s expected outcome is to provide a format and vocabulary to fill in the gaps of existing solutions for interlinking and notating different components of a profile.

Nishad Thalhath, Mitsuharu Nagamori, Tetsuo Sakaguchi

Scholarly Data Mining


Creating a Scholarly Knowledge Graph from Survey Article Tables

Due to the lack of structure, scholarly knowledge remains hardly accessible to machines. Scholarly knowledge graphs have been proposed as a solution. Creating such a knowledge graph requires manual effort and domain experts, and is therefore time-consuming and cumbersome. In this work, we present a human-in-the-loop methodology used to build a scholarly knowledge graph leveraging literature survey articles. Survey articles often contain manually curated and high-quality tabular information that summarizes findings published in the scientific literature. Consequently, survey articles are an excellent resource for generating a scholarly knowledge graph. The presented methodology consists of five steps, in which tables and references are extracted from PDF articles, tables are formatted and finally ingested into the knowledge graph. To evaluate the methodology, 92 survey articles, containing 160 survey tables, have been imported into the graph. In total, 2,626 papers have been added to the knowledge graph using the presented methodology. The results demonstrate the feasibility of our approach, but also indicate that manual effort is required and thus underscore the important role of human experts.

Allard Oelen, Markus Stocker, Sören Auer

A Novel Researcher Search System Based on Research Content Similarity and Geographic Information

Collaborative research is becoming increasingly important because it yields effective results and helps difficult research projects run smoothly. Previous studies have proposed many kinds of collaborator recommendation methods based on research features, such as specialty fields. However, few studies have constructed systems in which users can discover experts who have similar research interests using recommendation techniques. This paper proposes a novel researcher search system where users can efficiently discover potential candidates whose work locations are near theirs. Researchers are visualized on a map by our proposed system, and users can use researchers’ names and research keywords to narrow down the search. Specifically, given a researcher’s name as a query, the system displays relevant individuals based on one of the following measures among researchers: research content similarity or collaborative relationship similarity. Our experiments demonstrated that the recommendation results of these two similarity measures overlap only minimally, indicating that our system could potentially help researchers discover collaborator candidates.

Tetsuya Takahashi, Koya Tango, Yuto Chikazawa, Marie Katsurai

Predicting Response Quantity from Linguistic Characteristics of Questions on Academic Social Q&A Sites

Academic social Q&A websites have a lower response quantity than other types of social Q&A. To help academic social Q&A platforms implement mechanisms to improve the quantities of responses to questions that are rarely answered and to predict these quantities, this study uses 93 features representing the linguistic characteristics of academic questions, and compares several methods of prediction to determine the one that delivers the best performance. It also identifies the most useful feature set for such predictions.

Lei Li, Anrunze Li, Xue Song, Xinran Li, Kun Huang, Edwin Mouda Ye

An Empirical Study of Importance of Different Sections in Research Articles Towards Ascertaining Their Appropriateness to a Journal

Deciding the appropriateness of a manuscript to the aims and scope of a journal is very important in the first stage of peer review. Editors should be confident about the article’s suitability to the intended journal to further channel its progress through the steps in the review process. However, not all sections in a research article are equally contributory or essential to determine its aptness to the journal under consideration. In this work, we investigate which sections in a manuscript are more significant in deciding whether it belongs within the intended journal’s scope. Our empirical studies on two Computer Science journals suggest that the meta information from bibliography and author profiles can achieve performance competitive with that of the full text. The features we develop in this study display the potential to evolve into a decision support system for journal editors to identify out-of-scope submissions.

Tirthankar Ghosal, Rajeev Verma, Asif Ekbal, Sriparna Saha, Pushpak Bhattacharyya

On the Correlation Between Research Complexity and Academic Competitiveness

Academic capacity is a common way to reflect the educational level of a country or district. The aim of this study is to explore the differences in scientific research level among institutions and countries. By proposing an indicator named Citation-weighted Research Complexity Index (CRCI), we profile the academic capacity of universities and countries with respect to research complexity. The relationships between the CRCI of universities and other relevant academic evaluation indicators are examined. To explore the correlation between academic capacity and economic level, the relationship between research complexity and GDP per capita is analysed. With experiments on the Microsoft Academic Graph data set, we investigate publications across 183 countries and universities from the Academic Ranking of World Universities in 19 research fields. Experimental results reveal that universities with higher research complexity have higher fitness. In addition, for developed countries, economic development is positively correlated with scientific research. Furthermore, we visualize the current level of scientific research across all disciplines from a global perspective.

Jing Ren, Ivan Lee, Lei Wang, Xiangtai Chen, Feng Xia

