A survey on scholarly data: From big data perspective

https://doi.org/10.1016/j.ipm.2017.03.006

Highlights

  • Survey of big scholarly data with respect to the different phases of the big data lifecycle.

  • Identifies the different big data tools and technologies that can be used for development of scholarly applications.

  • Investigates research challenges and limitations specific to big scholarly data and its applications.

  • Provides research directions and paves the way towards the development of a generic and comprehensive big scholarly data platform.

Abstract

Recently, organizations and governments have increasingly focused on digitizing academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of the data so generated satisfy the big data definition, and this scholarly reserve is therefore popularly referred to as big scholarly data. To facilitate data analytics for big scholarly data, suitable architectures and services need to be developed. The evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing demand for scholarly applications such as collaborator discovery, expert finding and research recommendation systems, among several others. This paper investigates the current trends and identifies the existing challenges in the development of a big scholarly data platform, maps them to the different phases of the big data lifecycle, and focuses specifically on directions for future research.

Introduction

The digital world is facing the aftermath of a data explosion. In view of this, terms such as ‘data deluge’, a phrase describing the excessively large volume of data generated worldwide at a steadily increasing rate, have come into existence. A significant implication of the data deluge, it has been argued, is that it makes the scientific method obsolete (Anderson, 2008), so the right questions, ones that this data can answer, need to be framed. This paradigm shift has given birth to the concept of big data analytics.

Big data analytics faces two fundamental challenges. Firstly, owing to the huge volume, variety and velocity of the data involved, the storage and processing requirements of the system are overwhelming. Secondly, the analytics techniques and algorithms are complex, which makes big data analytics a computing-intensive task. To support the storage and processing requirements of big data analytics applications, the cloud has been found to be the most appropriate infrastructural solution (Chen & Zhang, 2014). Cloud computing offers a cost-effective way of storing, processing and managing big data for analytical purposes, and enables the implementation of distributed and parallel paradigms that meet the efficiency requirements.

Big data analytics is a vast field that has found applications in diverse domains and studies. Some of the most impactful research efforts that have merged big data analytics with other fields of study include business analytics (Duan & Xiong, 2015), multi-scale climate data analytics (Lu et al., 2011), banking customer analytics (Sun, Morris, Xu, Zhu, & Xie, 2014), smart cities (Khan et al., 2015), e-commerce recommender systems (Hammond & Varde, 2013), social media analytics (Burnap et al., 2014), healthcare data analytics (Raghupathi & Raghupathi, 2014), intelligent transport management systems (Chandio et al., 2016) and railway asset management systems (Thaduri, Galar, & Kumar, 2015). One of the lesser-explored applications of big data analytics lies in scholarly data. Moreover, the use of this synergistic approach to develop a big scholarly data platform for the implementation of diverse scholarly applications needs to be explored.

The need for research in ‘big scholarly data’ and its analytics arises from the lack of scholarly platforms and tools that can use this huge reservoir of data to create applications benefiting the research community at large. Effective and efficient management of big scholarly data using the cloud infrastructure can facilitate the processes involved in big data analytics, such as data acquisition, storage, processing, analytics and visualization, and thereby support research data management and its analytical uses.

Scholarly documents are generated on a daily basis in the form of research papers, project proposals, technical reports and academic documents by researchers and students all over the world. Moreover, there have been many initiatives by governments and organizations to digitize existing literary and academic resources (Meity, 2016; IFLA, 2016; Christenson, 2016). It is this huge reservoir of data that is popularly referred to as ‘scholarly data’. However, it is important to note that this is a generalized description, and the definition may vary from one scholarly community to another. For instance, Google Scholar does not count patents as a scholarly resource. Owing to the massive volume of these digital resources, the data needs to be viewed from the big data perspective.

The use of big data analytics in the scholarly ecosystem for what can be called ‘research analytics’ has far-reaching implications for the ease with which scholarly documents are managed and research is performed. Analytics for big scholarly data can be divided into five primary categories, namely research management, collaborator discovery, expert finder systems, recommender systems and visualization tools. Such analytics have gained immense importance and relevance lately, particularly with the advent of multi-disciplinary research projects.

Such projects have increased the scale and complexity of research problems manifold and emphasize the pressing need for collaboration among researchers as well as among institutes and organizations. Research collaboration is not a new concept. However, there has been a recent shift in the manner in which collaborations are initiated. Traditionally, researchers and scholars met periodically at conferences and symposiums to explore new research domains and possibilities for collaboration.

With the increasing popularity of the Internet, these platforms have been complemented by academic search engines like Google Scholar and academic social networking portals like ResearchGate and Academia. While these platforms allow researchers to follow each other's research activities and interests, they have also made the research community realize that the final published article is merely a milestone in research. Other aspects of research, such as the dataset used and the supporting material considered, are equally important.

In view of the overwhelming volume, variety and velocity of this data, scholarly data has been popularly named ‘big scholarly data’. Several cloud-based tools and technologies can be used to develop advanced analytical applications for big scholarly data. Hadoop is the most popular framework for big data storage and processing, alongside a plethora of other tools, such as Zeppelin, that are popularly used for data acquisition and visualization.
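As a rough illustration of the map and reduce style of processing that frameworks such as Hadoop popularized, the following single-machine Python sketch counts how often each paper is cited. The record layout, the paper identifiers and all function names are assumptions made for this example; they are not taken from the survey or from any Hadoop API, and a real deployment would distribute the map and reduce phases across a cluster.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy records: (citing_paper, [cited_papers]).
RECORDS = [
    ("p1", ["p2", "p3"]),
    ("p2", ["p3"]),
    ("p4", ["p3", "p2"]),
]


def map_phase(record):
    """Emit a (cited_paper, 1) pair for every reference in one record."""
    _citing, cited = record
    return [(paper, 1) for paper in cited]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the partial counts collected for one cited paper."""
    return key, sum(values)


def citation_counts(records):
    """Run map, shuffle and reduce in sequence on a single machine."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(citation_counts(RECORDS))
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase only on its own key, both phases could in principle run in parallel on separate nodes, which is the property that makes this paradigm attractive for large scholarly corpora.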

There are research challenges and limitations specific to big scholarly data at every stage of the big data lifecycle. In particular, the services that a big scholarly data platform needs to support include user data analytics and information extraction. The reliability and accuracy of information extraction methods remain a major concern, because the accuracy of analytics results depends directly on the accuracy of the method employed. Moreover, there is a dearth of innovative applications that make use of the big scholarly data reserve, with applications like research management, recommendation systems and time-evolution of research needing attention.

Another important aspect of big scholarly data management and analytics is the subject-specificity of data and applications. Generic, cross-domain solutions need to be developed to create comprehensive, commercially viable analytical solutions for this domain. Other areas of research that have gained attention recently are academic social network analysis and research evaluation. The motivation behind this survey is the lack of a comprehensive survey of scholarly data that views this data reserve from the big data perspective and takes the different stages of the big data lifecycle into consideration.

The results of the survey are expected to play a crucial role in putting the pieces together for integrating a big scholarly data platform on which effective and efficient applications in this domain can be developed. The contributions of this paper are as follows: (1) it studies big scholarly data with respect to the phases of the big data lifecycle, namely data management, analytics and visualization; (2) it identifies the challenges specific to each phase and its sub-phases; (3) it investigates the research issues in developing big scholarly data analytics applications; and (4) it explores future research directions in this field, with a specific focus on creating innovative applications that can find commercial ground and real-world adoption.

This paper surveys the existing literature on the challenges faced by the implementation of analytics techniques on big scholarly data using cloud computing. This paper is structured as follows – Section 2 covers the background and methodology followed for this survey, elaborating on the concepts, platforms and frameworks that rule the big data scenario, in general and specifically for big scholarly data; Section 3 covers data acquisition, pre-processing, storage and processing phases of the big data lifecycle; Section 4 elaborates on the challenges associated with integrating these established concepts, in the big scholarly data perspective. Section 5 discusses recent trends and tools that are used for supporting visualization of big scholarly data and Section 6 concludes the results of the survey to predict scope for future research in this field.

Section snippets

Overview of survey

This paper conducts the survey of scholarly data systematically from the point of view of big data. The big data lifecycle can be broadly divided into four categories namely data generation, acquisition, storage and processing. However, Assuncao et al. (2015) described the typical analytical workflow as composed of the following phases: (1) data management (2) model building and scoring (3) visualization and user interaction. A typical workflow for big data analytics given by has been

Big scholarly data platforms

Big data analytics requires the use of mathematical, statistical and optimization techniques, in addition to several others. Besides this, the use of machine learning, signal processing, visualization techniques and neural networks is also common. Chen and Zhang (2014) provided an extensive survey of the tools, techniques and technologies used to implement the techniques mentioned above. Although the research work paid little heed to deploying Hadoop on the Cloud,

Data management

Data is generated in many diverse forms in any scholarly platform. One of the primary sources of data is the huge reservoir of existing scholarly documents on the Internet. In addition to this, there are author webpages, academic social networks and secondary sources of scholarly information like institution and organization webpages that also render significant data for a comprehensive analysis of the scholarly ecosystem.

The three main characteristics of big data are volume, variety and

Analytics and applications

Systems need to analyze static as well as stream data. To create generic solutions that satisfy these requirements, different programming models need to be integrated in the analytics engine. Moreover, energy efficiency and optimal resource usage also have to be taken into account. Specifically, there is a need for standardization in solutions, and the development of the most effective and efficient data processing solutions needs to be emphasized (Assuncao et al., 2015).
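The difference between analyzing static and stream data can be sketched with one aggregation expressed in two programming models. The Python example below is only an illustration under stated assumptions: the keyword lists, the function names and the choice of keyword counting as the analytic are hypothetical and do not come from the survey.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_keywords_batch(documents: Iterable[List[str]]) -> Counter:
    """Static (batch) model: consume the whole collection, return one result."""
    totals: Counter = Counter()
    for keywords in documents:
        totals.update(keywords)
    return totals


def count_keywords_stream(documents: Iterable[List[str]]) -> Iterator[Counter]:
    """Stream model: yield an updated snapshot after each arriving document."""
    totals: Counter = Counter()
    for keywords in documents:
        totals.update(keywords)
        yield totals.copy()  # copy so earlier snapshots are not mutated later


# Hypothetical keyword lists, one per scholarly document.
DOCS = [["hadoop", "cloud"], ["cloud"], ["citation", "cloud"]]

batch_result = count_keywords_batch(DOCS)
stream_snapshots = list(count_keywords_stream(iter(DOCS)))
```

Both functions share the same update step, so the final stream snapshot equals the batch result; the stream version additionally exposes intermediate results, which is what an application needs when documents arrive continuously.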

Many scholarly

Visualization

Broadly, in the area of visualization and user interaction, real-time visualization of data is an important area of research. The research community is yet to devise solutions that can visualize data at the rate at which the same is generated and in the amounts that it exists. Parallel research in the development of cost-effective devices for large-scale visualization is also underway (Assuncao et al., 2015).

With specific reference to scholarly data, visualization poses several challenges.

Conclusion

This survey includes a detailed study of big scholarly data and the use of big data in the scholarly ecosystem. Besides this, it also discusses the current trends and existing challenges in the different sub-systems of the big scholarly data platform, with specific focus on directions for future research in this area.

Scholarly data is a huge data reserve that grows substantially on a daily basis and includes a variety of data. As a result, it is popularly termed big scholarly data.

Acknowledgments

This work was supported by a grant from “Young Faculty Research Fellowship” under Visvesvaraya PhD Scheme for Electronics and IT, Department of Electronics & Information Technology (DeitY), Ministry of Communications & IT, Government of India.

References (146)

  • Anderson, C. (2008). The end of theory: the data deluge makes the scientific method obsolete. www.wired.com Retrieved 7...
  • Bahrami, M., et al. (2014). The role of cloud computing architecture in big data. Studies in Big Data.
  • Bar-Ilan, J. (2007). Which h-index? — A comparison of WoS, Scopus and Google Scholar. Scientometrics.
  • Bar-Ilan, J. (2010). Citations to the “Introduction to informetrics” indexed by WoS, Scopus and Google Scholar. Scientometrics.
  • Bauer, F. & Kaltenböck, M. (2016). Linked open data: the essentials. Semantic Web. Retrieved 8 November 2016, from...
  • Beel, J., et al. Docear's PDF inspector.
  • Beel, J., Langer, S., Kapitsaki, G.M., & Gipp, B. Mind-Map based user modeling and research paper recommender systems,...
  • Beel, J., et al. (2015). Research-paper recommender systems: A literature survey. International Journal on Digital Libraries.
  • Bethard, S., et al. Who should I cite: Learning literature search models from citation behavior.
  • Björk, B-C., et al. (2009). Scientific journal publishing: Yearly volume and open access availability. Information Research.
  • Bollacker, K., et al. CiteSeer.
  • Burnap, P., et al. (2014). COSMOS: Towards an integrated and scalable service for analysing social media on demand. International Journal of Parallel, Emergent and Distributed Systems.
  • Caragea, C., et al. (2014). CiteSeerX: A scholarly big dataset. Lecture Notes in Computer Science.
  • Carberry, S., et al. Information graphics.
  • Chandio, A., et al. (2016). Big-data processing techniques and their challenges in transport domain. ZTE Communications.
  • Chen, H., et al. CollabSeer.
  • Chen, H., et al. CSSeer.
  • Chen, C.P., et al. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences.
  • Chen, C. (2016). Grand challenges in measuring and characterizing scholarly impact. Frontiers in Research Metrics and Analytics.
  • Choudhury, S., et al. A figure search engine architecture for a chemistry digital library.
  • Choudhury, S., et al. Figure metadata extraction from digital documents.
  • Christenson, H. (2016). Mass Digitization Overview: California Digital Library. Cdlib.org. Retrieved 7 November 2016,...
  • Collins, L.M., et al. ScienceSifter: Facilitating activity awareness in collaborative research groups through focused information feeds.
  • Cortez, E., et al. FLUX-CIM.
  • Councill, I., et al. ParsCit: An open-source CRF reference string parsing package.
  • Crystal, D. (2001). Weaving a Web of linguistic diversity. the Guardian. Retrieved 3 March 2017, from...
  • Cuzzocrea, A. Provenance research issues and challenges in the big data era.
  • Debattista, J., Lange, C., Scerri, S., & Auer, S. (2015). Linked'Big'Data: towards a manifold increase in big data...
  • Demmans Epp, C., et al. (2015). Uncertainty representation in visualizations of learning analytics for learners: current approaches and opportunities. IEEE Transactions on Learning Technologies.
  • Dong, Y., et al. (2016). Can scientific impact be predicted? IEEE Transactions on Big Data.
  • Duan, L., et al. (2015). Big data analytics and business analytics. Journal of Management Analytics.
  • Ehsan, N., et al. Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information.
  • Faensen, D., et al. Hermes: A notification service for digital libraries.
  • Fung, I. (2008). Citation of non-English peer review publications – some Chinese examples. Emerging Themes in Epidemiology.
  • Gipp, B. (2014). Citation-based plagiarism detection.
  • Gao, L., et al. Structure extraction from PDF-based book documents.
  • Gollapalli, S.D., et al. Similar researcher search in academic environments.
  • Gollapalli, S., et al. Ranking experts using author-document-topic graphs.
  • Habib, W., et al. A new approach for scholars matching using universal quantifier queries.
  • Han, H., et al. Automatic document metadata extraction using support vector machines.
