A survey on scholarly data: From big data perspective
Introduction
The digital world is facing the aftermath of a data explosion. In view of this, several terms have come into existence, among them ‘data deluge’, a phrase that describes the excessively large volume of data generated worldwide at a steadily increasing rate. A significant implication of the data deluge, as argued by Anderson (2008), is that it renders the traditional scientific method obsolete, so that the emphasis shifts to framing the right questions that this data can answer. This paradigm shift has given birth to the concept of big data analytics.
Big data analytics faces two fundamental challenges. Firstly, owing to the huge volume, variety and velocity of the data involved, the storage and processing requirements of the system are overwhelming. Secondly, the analytics techniques and algorithms are complex, which makes big data analytics a computing-intensive task. To support the storage and processing requirements of big data analytics applications, the cloud has been identified as the most appropriate infrastructural solution (Chen & Zhang, 2014). Cloud computing offers a cost-effective solution for storing, processing and managing big data for analytical purposes, and it enables distributed and parallel paradigms that meet the efficiency requirements.
Big data analytics is a vast field that has found applications in diverse domains and studies. Some of the most impactful studies that have merged big data analytics with other fields include business analytics (Duan & Xiong, 2015), multi-scale climate data analytics (Lu et al., 2011), banking customer analytics (Sun, Morris, Xu, Zhu, & Xie, 2014), smart cities (Khan et al., 2015), e-commerce recommender systems (Hammond & Varde, 2013), social media analytics (Burnap et al., 2014), healthcare data analytics (Raghupathi & Raghupathi, 2014), intelligent transport management systems (Chandio et al., 2016) and railway asset management systems (Thaduri, Galar, & Kumar, 2015). One of the lesser-explored applications of big data analytics lies in scholarly data. Moreover, the use of this synergistic approach to develop a big scholarly data platform for the implementation of diverse scholarly applications needs to be explored.
The need for research in ‘big scholarly data’ and its analytics stems from the lack of scholarly platforms and tools that can use this huge reservoir of data to create applications that benefit the research community at large. Effective and efficient management of big scholarly data using the cloud infrastructure can facilitate the processes involved in big data analytics, such as data acquisition, storage, processing, analytics and visualization, to support research data management and its analytical uses.
Scholarly documents are generated on a daily basis in the form of research papers, project proposals, technical reports and academic documents by researchers and students from all over the world. Moreover, there have been many initiatives by governments and organizations to digitize existing literary and academic resources (Meity, 2016; IFLA, 2016; Christenson, 2016). However, it is important to note that this is a generalized description and the definition may vary from one scholarly community to another. For instance, Google Scholar does not count patents as a scholarly resource. It is this huge reservoir of data that is popularly referred to as ‘scholarly data’. Owing to the massive volume of these digital resources, the data needs to be looked upon from the big data perspective.
The use of big data analytics in the scholarly ecosystem for what can be called ‘research analytics’ has far-reaching implications for the ease with which scholarly documents are managed and research is performed. Primarily, analytics for big scholarly data can be divided into five categories, namely research management, collaborator discovery, expert finder systems, recommender systems and visualization tools. Such analytics have gained immense importance and relevance lately, particularly with the advent of multi-disciplinary research projects.
Such projects have increased the scale and complexity of research problems manifold and emphasize the pressing need for collaboration among researchers as well as institutes or organizations. Research collaboration is not a new concept. However, there has been a recent shift in the manner in which collaborations are initiated. Traditionally, researchers and scholars used to meet periodically at conferences and symposiums to explore new research domains and possibilities for collaboration.
With the increasing popularity of the Internet, these platforms have been complemented by academic search engines like Google Scholar and academic social networking portals like ResearchGate and Academia. While these platforms allow researchers to follow each other's research activities and interests, they have also made the research community realize that the final published article is merely a milestone in research. Other aspects of research, such as the dataset used and the supporting material considered, are equally important.
In view of the overwhelming volume, variety and velocity characteristics of this data, scholarly data has been popularly named ‘big scholarly data’. To develop advanced analytical applications for big scholarly data, several cloud-based tools and technologies can be used. Hadoop is the most popular framework for big data storage and processing, alongside a plethora of other tools like Zeppelin that are popularly used for data acquisition and visualization.
There are research challenges and limitations specific to big scholarly data at every stage of the big data lifecycle. In particular, the specific services that a big data platform needs to support include user data analytics and information extraction. The reliability and accuracy of information extraction methods remain a major area of concern, because the accuracy of the analytics results depends directly on the accuracy of the method employed. Moreover, there is a dearth of innovative applications that can make use of the big scholarly data reserve, with applications like research management, recommendation systems and time-evolution of research needing attention.
Another important aspect of big scholarly data management and analytics is the subject-specificity of data and applications. Generic, cross-domain solutions need to be developed to create comprehensive, commercially viable analytical solutions for this domain. Other areas of research that have gained attention recently are academic social network analysis and research evaluation. The motivation behind this survey is the lack of a comprehensive survey in the field of scholarly data that views this data reserve from the big data perspective, keeping the different stages of the big data lifecycle in consideration.
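The dependence of analytics on extraction accuracy can be made concrete with precision and recall, one common way to score an information extraction method against a hand-labelled reference. The following is a minimal sketch, not a method taken from the surveyed literature; the function name and the author strings are hypothetical placeholders.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Score extracted metadata values against a hand-labelled gold set."""
    true_positives = len(extracted & gold)
    # Precision: share of extracted values that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold values that the extractor found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical author names for one paper (illustrative only).
gold_authors = {"A. Author", "B. Writer", "C. Scholar"}
extracted_authors = {"A. Author", "B. Writer", "D. Wrong"}

precision, recall = precision_recall(extracted_authors, gold_authors)
```

Any downstream analysis built on the extracted authors, such as a co-authorship graph, inherits both the spurious entry and the missing one.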
The results of the survey are expected to play a crucial role in putting the pieces together for integrating a big scholarly data platform for the development of effective and efficient applications in this domain. The contributions of this research paper are as follows: (1) study big scholarly data with respect to the different phases of the big data lifecycle, namely data management, analytics and visualization; (2) identify the challenges specific to every phase and its sub-phases; (3) investigate the research issues in the development of big scholarly data analytics applications; (4) explore future domains of research in this field, with specific focus on the creation of innovative applications that can find commercial ground and real-world adoption.
This paper surveys the existing literature on the challenges faced in implementing analytics techniques on big scholarly data using cloud computing. The paper is structured as follows: Section 2 covers the background and methodology followed for this survey, elaborating on the concepts, platforms and frameworks that dominate the big data scenario, in general and specifically for big scholarly data; Section 3 covers the data acquisition, pre-processing, storage and processing phases of the big data lifecycle; Section 4 elaborates on the challenges associated with integrating these established concepts from the big scholarly data perspective; Section 5 discusses recent trends and tools used to support visualization of big scholarly data; and Section 6 summarizes the results of the survey and outlines the scope for future research in this field.
Overview of survey
This paper conducts the survey of scholarly data systematically from the point of view of big data. The big data lifecycle can be broadly divided into four categories namely data generation, acquisition, storage and processing. However, Assuncao et al. (2015) described the typical analytical workflow as composed of the following phases: (1) data management (2) model building and scoring (3) visualization and user interaction. A typical workflow for big data analytics given by has been
Big scholarly data platforms
Big data analytics requires the use of mathematical, statistical and optimization techniques, in addition to several others. Besides this, the use of machine learning, signal processing, visualization techniques and neural networks is also common. Chen and Zhang (2014) provided an extensive survey of the tools, techniques and technologies used to implement these techniques for big data analytics. Although the research work paid little heed to deploying Hadoop on the Cloud,
Data management
Data is generated in many diverse forms in any scholarly platform. One of the primary sources of data is the huge reservoir of existing scholarly documents on the Internet. In addition to this, there are author webpages, academic social networks and secondary sources of scholarly information like institution and organization webpages that also render significant data for a comprehensive analysis of the scholarly ecosystem.
The three main characteristics of big data are volume, variety and
Analytics and applications
Systems need to analyze static as well as stream data. To create generic solutions that satisfy these requirements, different programming models need to be integrated in the analytics engine. Moreover, energy efficiency and optimal resource usage also have to be taken into account. Specifically, there is a need for standardization in solutions, and the development of the most effective and efficient data processing solutions needs to be emphasized (Assuncao et al., 2015).
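The idea of serving static and stream data from one engine can be illustrated with a small sketch in which a batch entry point and a streaming entry point share the same update logic. This is an illustrative assumption rather than an architecture described by the survey; the function names and sample titles are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator


def update_counts(counts: Counter, record: str) -> None:
    """Shared logic: add the keywords of one record to the running counts."""
    counts.update(record.lower().split())


def count_batch(records: Iterable[str]) -> Counter:
    """Batch model: process a static collection and return the final counts."""
    counts: Counter = Counter()
    for record in records:
        update_counts(counts, record)
    return counts


def count_stream(records: Iterable[str]) -> Iterator[Counter]:
    """Stream model: yield a snapshot of the counts after every record."""
    counts: Counter = Counter()
    for record in records:
        update_counts(counts, record)
        yield counts.copy()


# Hypothetical paper titles used as input records.
titles = ["Big data analytics", "Scholarly big data"]

batch_result = count_batch(titles)
stream_snapshots = list(count_stream(iter(titles)))
```

Because both entry points call `update_counts`, the last streaming snapshot matches the batch result, while intermediate snapshots are available before the input is exhausted.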
Many scholarly
Visualization
Broadly, in the area of visualization and user interaction, real-time visualization of data is an important area of research. The research community is yet to devise solutions that can visualize data at the rate at which it is generated and in the amounts in which it exists. Parallel research in the development of cost-effective devices for large-scale visualization is also underway (Assuncao et al., 2015).
With specific reference to scholarly data, visualization poses several challenges.
Conclusion
This survey includes a detailed study of big scholarly data and the use of big data in the scholarly ecosystem. Besides this, it also discusses the current trends and existing challenges in the different sub-systems of the big scholarly data platform, with specific focus on directions for future research in this area.
Scholarly data is a huge data reserve that grows substantially on a daily basis and includes a variety of data. As a result, it is popularly termed big scholarly data.
Acknowledgments
This work was supported by a grant from “Young Faculty Research Fellowship” under Visvesvaraya PhD Scheme for Electronics and IT, Department of Electronics & Information Technology (DeitY), Ministry of Communications & IT, Government of India.
References (146)
Big data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing (2015)
Collaborator recommendation in interdisciplinary computer science using degrees of collaborative forces, temporal evolution of research interest, and comparative seniority status. Knowledge-Based Systems (2015)
Scholarly publishing in the Internet age: A citation analysis of computer science literature. Information Processing & Management (2001)
The rise of “big data” on cloud computing: Review and open research issues. Information Systems (2015)
Parallel data processing with MapReduce. ACM SIGMOD Record (2012)
Information extraction from research papers using conditional random fields. Information Processing & Management (2006)
Detecting emerging research fronts based on topological measures in citation networks of scientific publications. Technovation (2008)
Coherent citation-based summarization of scientific papers
Is Google Scholar useful for bibliometrics? A webometric analysis. Scientometrics (2011)
Comprehensive personalized information access in an educational digital library
The role of cloud computing architecture in big data. Studies in Big Data
Which h-index? A comparison of WoS, Scopus and Google Scholar. Scientometrics
Citations to the “Introduction to informetrics” indexed by WOS, Scopus and Google Scholar. Scientometrics
Docear's PDF inspector
Research-paper recommender systems: A literature survey. International Journal on Digital Libraries
Who should I cite: Learning literature search models from citation behavior
Scientific journal publishing: Yearly volume and open access availability. Information Research
CiteSeer
COSMOS: Towards an integrated and scalable service for analysing social media on demand. International Journal of Parallel, Emergent and Distributed Systems
CiteSeerX: A Scholarly Big Dataset. Lecture Notes in Computer Science
Information graphics
Big-data processing techniques and their challenges in transport domain. ZTE Communications
CollabSeer
CSSeer
Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences
Grand challenges in measuring and characterizing scholarly impact. Frontiers in Research Metrics and Analytics
A figure search engine architecture for a chemistry digital library
Figure metadata extraction from digital documents
ScienceSifter: Facilitating activity awareness in collaborative research groups through focused information feeds
FLUX-CIM
ParsCit: An open-source CRF reference string parsing package
Provenance research issues and challenges in the big data era
Uncertainty representation in visualizations of learning analytics for learners: Current approaches and opportunities. IEEE Transactions on Learning Technologies
Can scientific impact be predicted? IEEE Transactions on Big Data
Big data analytics and business analytics. Journal of Management Analytics
Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information
Hermes: A notification service for digital libraries
Citation of non-English peer review publications: Some Chinese examples. Emerging Themes in Epidemiology
Citation-based plagiarism detection
Citation-based plagiarism detection
Structure extraction from PDF-based book documents
Similar researcher search in academic environments
Ranking experts using author-document-topic graphs
A new approach for scholars matching using universal quantifier queries
Automatic document metadata extraction using support vector machines