
2021 | Book

Trends of Data Science and Applications

Theory and Practices

Editors: Dr. Siddharth Swarup Rautaray, Phani Pemmaraju, Prof. Dr. Hrushikesha Mohanty

Publisher: Springer Singapore

Book Series: Studies in Computational Intelligence

About this book

This book includes extended versions of selected papers presented at the 11th Industry Symposium 2021, held during January 7–10, 2021. The book covers contributions ranging from theoretical and foundational research to platforms, methods, applications, and tools in all areas. It presents theory and practices in data science that add a social, geographical, and temporal dimension to data science research. It also includes application-oriented papers that prepare and use data in discovery research. The book contains chapters from academics as well as practitioners on big data technologies, artificial intelligence, machine learning, deep learning, data representation and visualization, business analytics, healthcare analytics, bioinformatics, etc. It is useful for students, practitioners, researchers, and industry professionals.

Table of Contents

Frontmatter
NLP for Sentiment Computation
Abstract
Sentiment is natural to human beings. Sentiments are expressed in different forms, from written and spoken to exhibited. Our ancient scriptures record classifications of sentiments and propose Rasa Theory as the earliest study of the human mind and its expressions. With the advent of the digital era, sentiments are being poured into social media. Research in computational linguistics has made encouraging progress in developing computational models for automatic detection of sentiments in a given text, be it a document or a social media posting, though the two varieties pose different classes of problems for finding sentiments in them. This article discusses some linguistic approaches known in natural language processing and their uses in sentiment detection. The approaches discussed are lexical analysis, corpus-based, aspect-based, and social semantics, along with the trends of research in this field. It also discusses exploring sentiments in a text spanning multiple domains.
Hrushikesha Mohanty
Productizing an Artificial Intelligence Solution for Intelligent Detail Extraction—Synergy of Symbolic and Sub-Symbolic Artificial Intelligence Techniques
Abstract
Businesses today are moving from a data-driven focus toward becoming ‘insight-driven’ businesses. This paradigm shift hinges on the use of artificial intelligence technology and needs accurate and smooth input of all pertinent details. This is facilitated by using artificial intelligence to provide intelligent detail extraction from documents and other sources. The techniques behind building an intelligent detail extraction product are discussed. The path from an initial solution, through identifying and developing the characteristics that mark a product, is described. A powerful synergy is created by appropriately using symbolic machine learning as well as deep learning techniques.
Arunkumar Balakrishnan
Digital Consumption Pattern and Impacts of Social Media: Descriptive Statistical Analysis
Abstract
Dependence on digital media has increased manifold during the COVID pandemic lockdowns across the globe. Most offices and academic institutions started operating in ‘work-from-home’ mode on digital platforms. Even senior citizens, non-working homemakers and kids spent more time on social (networking and communication) media. The ‘difficult times’ of the COVID pandemic have also shown the world the ‘different times’ and the difference in our preferred way of functioning. The ‘digital consumption pattern’ changed substantially, both in scale and in diversity. The present paper discusses the ‘effectiveness and impact’ of the ‘digital medium’ on the ‘digital life’ of its users (digital consumers), particularly during this extended lockdown. The objective is to discuss issues relating to the ‘effect of extended/longer use of net, during this pandemic, for academic and professional activities from home, continuously’. The study is about life in the virtual world, particularly during this extended lockdown. The paper adopts a method of descriptive statistical analysis, covering different categories of users in Eastern India. Using a structured questionnaire, descriptive primary data relating to digital consumption were collected from around 1350 respondents from Odisha and its neighbouring states. The results show how different categories of users prefer and use their preferred social media. The study finds a significant contribution of Internet-based social media (SM) in countering the ‘social isolation’ of senior citizens.
Rabi N. Subudhi
Applicational Statistics in Data Science and Machine Learning
Abstract
In data science and machine learning, statistics plays a huge role. Statistical tools and techniques, along with the concepts of exploratory data analysis, help in gaining insights and building quality features from the data used to train a model. A data scientist or data analyst is incomplete without a knowledge of statistics, because it is the building block of any machine learning or deep learning model that has learned, or needs to learn, trends and patterns from features built by analysing the data end-to-end, whether the data is in tabular, picture or video format. Statistics covers many concepts, such as variables, sampling, correlation and outlier treatment; this chapter aims to take the reader on a tour of applicational statistics and show how it can be combined with exploratory data analysis to work easily on data science and machine learning. Data analysis and machine learning are also experiment-heavy domains that need correct statistical methods for correct inference. Hence, the statistical methods available for these experiments are discussed here in detail. Languages such as Python, MATLAB and R have libraries for statistical mathematics that make the required experiments on any dataset possible through simple API calls.
Indrashis Das, Anoushka Mishra
Evolutionary Algorithms-Based Machine Learning Models
Abstract
Machine learning models have found immense applications in various sectors such as energy, the stock market, demand–supply chains, logistics management, health and many more, but their efficiency and accuracy depend on a very important factor: data. Data plays a major role in the performance of these models. Mostly, a dataset with several features is fed as input to the machine learning models. A dataset containing crisp and relevant features can greatly improve accuracy, whereas a dataset containing redundant and irrelevant features can deteriorate it. During data collection, one typically tries to collect as much information as possible about a specific domain, but when we need to draw a viable inference from that data, the inference is drawn from a particular perspective, and we may not need all the features of the collected data for this purpose. So, choosing appropriate features of the dataset to feed to the models has always been a crucial task. The complexity of feature selection increases with the number of features, and choosing features by human intuition alone is not advisable. Evolutionary algorithms have proved highly beneficial for solving such issues because of their stochastic nature. This chapter discusses some recent applications of evolutionary algorithms such as genetic algorithm, particle swarm optimization, artificial bee colony, etc. to optimize the parameters of machine learning algorithms, e.g., support vector regression, artificial neural networks, random forest, etc., and their uses in various sectors like engineering, applied sciences, disaster management, finance and economy, and the health sector.
Junali Jasmine Jena, Manjusha Pandey, Siddharth Swarup Rautaray, Sushovan Jena
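As a rough illustration of the feature-selection idea in the abstract above, the following is a minimal genetic-algorithm sketch, not the chapter's method. The feature count, the `RELEVANT` set and the toy `fitness` function are all hypothetical stand-ins; in practice the fitness would more likely be the validation score of a model trained on the selected features.

```python
import random

N_FEATURES = 8
RELEVANT = {0, 2, 5}  # hypothetical: indices of the features that actually help


def fitness(mask):
    """Toy score: reward relevant features, penalise irrelevant ones."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in RELEVANT)
    extras = sum(1 for i, bit in enumerate(mask) if bit and i not in RELEVANT)
    return hits - 0.5 * extras


def evolve(generations=30, pop_size=20, mutation_rate=0.1, seed=0):
    """Evolve bit masks (1 = keep feature) and return the best mask found."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    population = [
        [rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)
    ]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # truncation selection
        children = [parents[0][:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, N_FEATURES)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation with a small per-bit probability.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness), history


best_mask, best_history = evolve()
print("selected features:", [i for i, bit in enumerate(best_mask) if bit])
```

Because the best mask is copied unchanged into each new generation, the best fitness never decreases from one generation to the next; whether the search reaches the ideal mask depends on the seed and settings.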
Application to Predict the Impact of COVID-19 in India Using Deep Learning
Abstract
The COVID-19 pandemic has hit almost all parts of the world. Having originated in Wuhan, China, and spread all across the world, it is safe to say that the pandemic has shocked people. Despite all the advancements in medicine and science, it has been a stark realization for the world that a virus can rip apart everyone’s lives [1]. With the USA the worst impacted [2], countries like India are experiencing rapid growth of the virus. From social distancing to living with the virus, we humans are finding different ways to survive this pandemic and return to normalcy. Understanding the situation is core to the lives of everyone. Keeping up to date on changes in events, cases reported, people deceased, people recovered, contacts for essential services, and the current status of the state/country is essential to providing a system that people can depend on during the pandemic. This research works on the core principle of providing real-time information to people, supplementing it with state-wise reports, national reports, essential services, and contacts provided by the state. This chapter also explains the power of AI/ML, which is used to predict how cases will progress in the coming days. The model uses a convolutional neural network to predict the spread of the pandemic and hence gives an idea of how the pandemic would spread, with a close approximation to the real-world data.
Kiran S. Raj, Priyanka Kumar
Role of Data Analytics in Bio Cyber Physical Systems
Abstract
Data science has proved its versatility in every field known to mankind, making decision making faster and more accurate over the past two decades. Coupled with IoT devices and their setups, it has been a forerunner in terms of data generation and accurate prognosis. According to the advisory firm International Data Corporation (IDC), the number of IoT devices is forecast to reach 41.6 billion by 2025, and the data generated from these devices is expected to be 79.4 zettabytes. One broad sector that has emerged as a gold mine for data generation is Bio Cyber Physical Systems. Bio Cyber Physical Systems are based on the incorporation of computational elements with biological processes of the human body. This chapter discusses a new design and implementation of a system based on Bio-CPS, focused primarily on health wearable technologies equipped with state-of-the-art sensors, coupling their data with machine learning algorithms to detect real-time health complications, primarily in a diabetic person, and using long short-term memory (LSTM) for prediction of such health complications.
Utkarsh Singh
Evolution of Sentiment Analysis: Methodologies and Paradigms
Abstract
With the advent of the digital age, almost everything has come down to a better understanding of data. Natural language processing shares this pursuit and is among the most researched areas of computer science. Post 1980, a major revolution in NLP began with the emergence of machine learning algorithms, resulting from a steady escalation in computational power. Unlike other data, text semantics is more complex because of both its contextual nature and daily evolving language usage. While continuous efforts to improve language representation for interpreting logical units are still prevalent, traditional, long-established recurrent neural networks, which were supposed to grasp the bi-directional context of language, have been surpassed by attention models in constructing improved embeddings that allow systems to better understand language. Among its numerous applications, understanding the sentiment of text has been widespread in fields including but not limited to customer reviews, stock markets, elections, healthcare analytics, and online and social media analytics. From binary classification to more challenging cases such as negation handling, sarcasm, toxicity, multiple attitudes, or polarity, this chapter explores the evolution of sentiment analysis in the light of emerging text processing and the transition of text understanding from rule-based to statistical, with a comparison of benchmark performance of state-of-the-art models over various applications and datasets.
Aseer Ahmad Ansari
Healthcare Analytics: An Advent to Mitigate the Risks and Impacts of a Pandemic
Abstract
Healthcare analytics is a broad term for a specific facet of analytics that covers a wide swath of the healthcare industry, providing macro- and micro-level insights on patient data, risk scoring for chronic diseases, hospital management, optimizing costs, expediting diagnosis and so on. An efficacious way of implementing healthcare analytics in this sector is to combine predictive analysis, data visualization tools and business suites to get valuable information that can both deliver actionable insights to support decisions and reduce variations to optimize utilization. To devise a more potent approach to prepare for the next pandemic, clinical data has to be improved so that it provides more granular information to augment treatment effectiveness and success rates. Confirmatory data analysis combined with predictive modeling can be used to fill unexpected gaps in obtaining the required clinical data, which is vital for the success of this approach. The healthcare analytics domain has the potential to monitor the health of the masses to identify disease trends and can provide enhanced health strategies based on demographics, geography and socio-economics. Hence, healthcare analytics is indeed an advent to counter the imminent threats of outbreaks such as epidemics and pandemics, provided it gets the necessary course of action to achieve its aims.
Shubham Kumar
Image Classification for Binary Classes Using Deep Convolutional Neural Network: An Experimental Study
Abstract
Convolutional neural networks (CNNs) have proved themselves a robust model for image recognition in modern computing. Inspired by the successes of CNNs, we present an elaborate experimental assessment of CNNs on image classification using a newly built dataset of high-resolution images belonging to two different classes. The dataset is partitioned into two distinct categories of high-resolution images: cats and dogs. This chapter presents an extensive experimental study of the effect of training size on training and validation accuracy and loss. We designed a fine-tuned two-class image classification model for a large training size, which achieved a training accuracy of 100% and a validation accuracy close to 99.13%.
Biswajit Jena, Amiya Kumar Dash, Gopal Krishna Nayak, Puspanjali Mohapatra, Sanjay Saxena
Leveraging Analytics for Supply Chain Optimization in Freight Industry
Abstract
We live in a country whose logistics industry is slated to be worth $160 billion. Several start-ups have emerged in India trying to crack this globally unsolved problem. Moreover, the age-old freight sector has its share of bottlenecks, including but not limited to fragmentation, inflated costs, lack of visibility into lanes, limited digital capabilities and back hauling. Tech-driven start-ups provide solutions that help fleet owners overcome these challenges: optimized demand–supply matching, elimination of brokerage by connecting supply with relevant demand and thereby reducing costs, exposure of truckers and fleet owners to uncharted lanes, and reverse loads that optimize costs for these businesses. This is where companies are leveraging data science and analytics to tackle these issues and help the businesses grow. Companies like Uber Freight, BlackBuck and Rivigo are using the best of technologies to monetize this industry. Data, when logged in the right manner, can help industries understand the intricacies of issues and overcome them. A typical example would be using a simple regression technique to predict demand in a specific region so that supply can be exposed in good time, avoiding idle periods for truckers. Tracking key metrics like supply turnaround time (TAT), truck in transit duration (TiT), placement index and others can help organizations optimize these metrics to maximize revenue. Being a convoluted system, this industry has been a tough nut to crack, especially in a country like India. This chapter discusses how companies set up the entire data platform and infrastructure, thereby facilitating the use of data for advanced analytics techniques to solve some crucial supply chain problems in the freight industry.
The chapter also talks about some of the use cases for analytics and machine learning to solve problems related to the freight industry. Firstly, we demonstrate some visualizations and representations as to how insights are drawn through analytics to solve these kinds of problems. We follow this up by discussing how data infrastructures are set up in organizations to collect freight data and then finally we showcase some ML techniques that are used in the freight sector of businesses. This in turn would help users understand the nuances of decision science and analytics with its capabilities in scaling businesses.
Kashyap Barua, Parikshit Barua, Sandeep Agarwal
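The abstract above mentions using a simple regression technique to predict regional demand. A minimal sketch of that idea is an ordinary least-squares line fitted to weekly load counts; the data and the names `fit_line`, `weeks` and `loads` are hypothetical and not taken from the chapter.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical weekly load requests for one region.
weeks = [1, 2, 3, 4, 5]
loads = [12, 14, 16, 18, 20]

slope, intercept = fit_line(weeks, loads)
forecast = predict(slope, intercept, 6)  # demand estimate for week 6
print(forecast)  # 22.0 for this perfectly linear toy series
```

A real demand model would likely add seasonality, lane-level features and a held-out evaluation; the straight line only shows the mechanics.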
Trends and Application of Data Science in Bioinformatics
Abstract
Advances in sequencing technologies and rapid progress in omics have generated an extensive volume of biological data in recent years. Sophisticated analytical tools are required to analyze and draw conclusions from such a massive amount of data. Bioinformatics is an inter-disciplinary science of analyzing and interpreting biological data by application of statistics, computational methodologies, and information technology. As huge volumes of genomic, proteomic, and other data are generated, analysis and interpretation of such biological data sets involve the use of data science and data mining tools. Hence, researchers are required to rely increasingly on data-science tools to store and analyze the data. Data science is an inter-disciplinary science that uses algorithms and scientific methods to derive information and insights from big data. Data science draws on scientific work from a wide variety of subjects, viz. computer science, mathematics, statistics, databases, machine learning and optimization, etc. These strategies promote investigation and advancement of innovative methods to improve the incorporation of big data and data science into biological research. Advancements in computing and data science offer viable analytical techniques for processing huge biological data. Consequently, there is a huge possibility to enhance the interaction between bioinformatics and data science. Future applications of data science should concentrate on creating high-end integrated technologies for relatively low-cost processing of enormous biological data, greater efficiency, and reliable protection measures to advance bioinformatics research.
P. Supriya, Balakrishnan Marudamuthu, Sudhir Kumar Soam, Cherukumalli Srinivasa Rao
Mathematical and Algorithmic Aspects of Scalable Machine Learning
Abstract
Although a number of machine learning models have been proposed and successfully deployed in organizations, there is a new emerging challenge that organizations are going to face in the upcoming years. As various organizations rely on data for decision-making and optimization of processes, the volume of data is an important factor in developing precise models. The ever-increasing volume of data and its storage is driven by the advancement of communication technology and storage services like cloud computing. The large volume of data collected is usually stored in a distributed storage and computing environment to ensure fault tolerance and scalability. Developing machine learning models with traditional machine learning algorithms is quite inefficient in a distributed environment. The inefficiency is attributed to the distributed nature of the dataset and computing. The development of the models needs to be carried out in a distributed manner. Thus, additional challenges related to distributed computing need to be addressed by the machine learning algorithms. Scalable machine learning is an adaptation of traditional machine learning to a distributed environment. As the nature of computing changes, the mathematical formulas and equations need to be revisited along with the algorithms to make them suitable for a distributed environment. This chapter discusses the challenges faced by traditional machine learning algorithms in distributed environments, the various mathematical backgrounds of scalable machine learning models, and the state-of-the-art distributed algorithms for scalable machine learning models.
Gananath Bhuyan, Mainak Bandyopadhyay
An Implementation of Text Mining Decision Feedback Model Using Hadoop MapReduce
Abstract
A very large amount of unstructured text data is generated every day on the Internet as well as in real life. Text mining has dramatically lifted the commercial value of these data by pulling out unknown, comprehensive potential patterns from them. Text mining uses the algorithms of data mining, statistics, machine learning, and natural language processing for hidden knowledge discovery from unstructured text data. This paper reviews the extensive research done on text mining in recent years. Then, the overall process of text mining is discussed with some high-end applications. The entire process is classified into different modules: text parsing, text filtering, transformation, clustering, and predictive analytics. A more efficient and more sophisticated text mining model with decision feedback is also proposed, which is considerably more advanced than the conventional models, providing better accuracy and addressing broader objectives. The text filtering module is discussed in detail with the implementation of word stemming algorithms like the Lovins stemmer and the Porter stemmer using MapReduce. The implementation has been set up on a single-node Hadoop cluster operating in pseudo-distributed mode. An enhanced implementation technique, Porter stemmer with partitioner (PSP), has also been proposed. Then, a comparative analysis using MapReduce has been done on the above three algorithms, in which PSP provides better stemming performance than the Lovins stemmer and the Porter stemmer. Experimental results show that PSP provides 20–25% more stemming capacity than the Lovins stemmer and 3–15% more stemming capacity than the Porter stemmer algorithm.
Swagat Khatai, Siddharth Swarup Rautaray, Swetaleena Sahoo, Manjusha Pandey
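To make the map, partition and reduce flow described in the abstract above more concrete, here is a small in-memory sketch in plain Python. It does not use Hadoop, and `toy_stem` is a deliberately crude suffix stripper standing in for a real stemmer; it implements neither the Porter nor the Lovins rules. The partitioning rule (first letter of the stem) is also an assumption chosen for illustration, not the chapter's PSP design.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "es", "s")  # toy rule set, NOT the Porter or Lovins rules


def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def map_phase(line):
    """Mapper: emit (stem, 1) for every token in a line of text."""
    for token in line.lower().split():
        word = token.strip(".,;:!?\"'()")
        if word:
            yield toy_stem(word), 1


def partition(key, num_partitions):
    """Partitioner: route a stem to a reducer by its first character."""
    return ord(key[0]) % num_partitions


def run_job(lines, num_partitions=2):
    """Run map, shuffle by partition, then reduce by summing counts."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for line in lines:
        for stem, count in map_phase(line):
            partitions[partition(stem, num_partitions)][stem].append(count)
    result = {}
    for part in partitions:  # each partition plays the role of one reducer
        for stem, counts in part.items():
            result[stem] = sum(counts)
    return result


counts = run_job(["Mining mined mines", "text mining of texts"])
print(counts)
```

In a real Hadoop job the partitioner decides which reducer receives each key, so all occurrences of a stem are summed in one place; the list of dictionaries above only imitates that routing.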
Business Analytics: Process and Practical Applications
Abstract
Today, automation of business processes and devices like IoT for monitoring/activating services generate massive raw data that, taken alone, may not look useful but together carry domain-specific signatures that are immensely useful for decision making. The problem of deducing strategic information by detecting patterns, analyzing and reasoning over them, and learning business trends is popularly known as business analytics and uses artificial intelligence and machine intelligence techniques. This chapter, while introducing the basic characteristics of business data analytics, presents the types and uses of analytics and standard processes. Further, the chapter includes an approach to designing a recommendation system (with techniques such as content-based filtering, collaborative filtering, and hybrid recommendation methods). It also provides a comparative analysis of business analytics processes, the various types, and the choice of recommendation systems.
Amit Kumar Gupta
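Of the recommendation techniques named in the abstract above, collaborative filtering is perhaps the easiest to sketch. The example below is a minimal user-based variant using cosine similarity; the users, items and ratings are invented for illustration, and the weighting scheme is one common choice rather than the chapter's design.

```python
import math

# Hypothetical user -> {item: rating} data.
ratings = {
    "ana": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 4, "B": 5, "D": 5},
    "cy": {"C": 5, "D": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only score unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


print(recommend("ana", ratings))
print(recommend("cy", ratings))
```

A content-based recommender would instead compare item attributes with a user's profile, and a hybrid method would combine both signals.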
Challenges and Issues of Recommender System for Big Data Applications
Abstract
Information overload is a big problem in today’s world. It means the availability of so much data or knowledge that it goes beyond the user’s manageable limits and causes great difficulty in all kinds of decision making. The main reason we need recommender systems in modern society is that, because of the proliferation of the Internet, people have so many options to choose from. A recommender system is a system that can predict a user’s future preference for a set of items and recommend the top items. It is a knowledge retrieval application that enhances accessibility and efficiently and effectively suggests relevant items to users by considering their interests and preferences. A recommendation framework tries to tackle the problem of information overload. There are many instances where we have plenty of data but cannot decide what we want. Even as the volume of information has increased, a new problem has arisen: people have difficulty selecting the items they actually want to see. Recommender systems have the ability to change the way websites interact with users and allow businesses to optimize their return on investment (ROI) based on the information they can collect on the preferences and purchases of each customer. A traditional recommendation system cannot do its work without sufficient information, and big data offers plenty of user data, such as past transactions, browsing history, and reviews, for recommendation systems to provide accurate and efficient recommendations. In short, even the most advanced recommenders cannot succeed without big data. This work primarily discusses and reflects on current issues, challenges, and research gaps in the production of high-quality recommender systems. These problems and challenges will open new paths for study toward the goal of high-quality recommender systems.
The research is broken into major tasks, including the study of state-of-the-art approaches for recommender systems and big data applications; overcoming the problems of cold start and scalability; and building a proactive recommender system. This research considers the development process for a generic intelligent recommender system that can work on more than one domain; it also expands the basic definition of a recommender program.
Chandrima Roy, Siddharth Swarup Rautaray
Metadata
Title
Trends of Data Science and Applications
Editors
Dr. Siddharth Swarup Rautaray
Phani Pemmaraju
Prof. Dr. Hrushikesha Mohanty
Copyright Year
2021
Publisher
Springer Singapore
Electronic ISBN
978-981-336-815-6
Print ISBN
978-981-336-814-9
DOI
https://doi.org/10.1007/978-981-33-6815-6
