2022 | Book

Advances in Big Data Analytics

Theory, Algorithms and Practices

About this book

Today, big data affects countless aspects of our daily lives. This book provides a comprehensive and cutting-edge study on big data analytics, based on the research findings and applications developed by the author and his colleagues in related areas. It addresses the concepts of big data analytics and/or data science, multi-criteria optimization for learning, expert and rule-based data analysis, support vector machines for classification, feature selection, data stream analysis, learning analysis, sentiment analysis, link analysis, and evaluation analysis. The book also explores lessons learned in applying big data to business, engineering and healthcare. Lastly, it addresses the advanced topic of intelligence-quotient (IQ) tests for artificial intelligence.
Since each aspect mentioned above concerns a specific domain of application, the algorithms, procedures, analyses and empirical studies presented here together offer a general picture of big data developments. Accordingly, the book can serve not only as a textbook for graduate students with a fundamental grasp of big data analytics, but also as a guide that shows practitioners how to apply the proposed techniques to real-world big data problems.

Table of Contents

Frontmatter

Concept and Theoretical Foundation

Frontmatter
Chapter 1. Big Data and Big Data Analytics
Abstract
Big data is now a common term. Its evolution, however, has two sources. The creation of the computer in the 1940s gradually gave human beings the tools to collect massive data, while the term “big data” became a popular slogan for the collection, processing, and analysis of various data [1]. Data has been growing exponentially for the last seven decades. EMC2 [2] estimated that the world had generated 1.8 zettabytes of data (1.8 × 10^21 bytes) by 2011. By 2020 this figure had grown to 44 zettabytes, about 24 times as much. Big Data Analytics has arisen as the technical means of dealing with both the theory and the application of big data. This chapter elaborates on the understanding of big data and its analytics. Section 1.1 briefly describes the evolution and challenges of big data. Section 1.2 is about the current status of big data, including its development in the world as well as in China. Section 1.3 explores big data analysis and data science problems.
Yong Shi
Chapter 2. Multiple Criteria Optimization Classification
Abstract
As the increasingly strong computational power of computers compensates for the limits of the human brain in calculation, data mining, a major component of data science, has emerged for its ability to extract novel, useful, and potentially valuable knowledge from large-scale complex data. From a mathematical perspective, however, some data mining methods, such as decision trees, genetic algorithms, and association rules, can be considered heuristic algorithms, meaning that they select a “better solution” from several alternative solutions as the criterion of classification. These methods do not explore how to locate the “best solution” systematically.
Yong Shi
Chapter 3. Support Vector Machine Classification

Support vector machine (SVM) has been a popular technique in data analytics. Shi et al. [1] have reported a number of SVM algorithms, ranging from leave-one-out (LOO) bound approaches to multi-class, unsupervised, semi-supervised and robust SVMs. Following the subsequent direction of that research, this chapter provides five sections on advances of SVM in big data analytics. Section 3.1 has two subsections. The first outlines the recent findings of the author’s research team on SVM [2], while the second is about two new decomposition algorithms for training bound-constrained SVM [3]. Section 3.2 describes different twin SVMs for classification in four subsections. The first explores the improved twin SVM [4]. The second extends twin SVM to multi-category classification problems [5]. The third provides a robust twin SVM for pattern classification [6]. The fourth elaborates on a structural twin SVM for classification [7]. Section 3.3 presents nonparallel SVMs in four subsections. The first is about a nonparallel SVM for a classification problem with universum learning [8]. The second is about a divide-and-combine method for large-scale nonparallel SVM [9]. The third explores nonparallel SVM for pattern classification [4]. The fourth is a multi-instance learning algorithm based on a nonparallel classifier [10]. Section 3.4 presents Laplacian SVM classifiers in two subsections. One is about successive overrelaxation for Laplacian SVM [11], while the other is about Laplacian twin SVM for semi-supervised classification [12]. Finally, Sect. 3.5 discusses loss functions of SVM classification in three subsections. The first is about the ramp loss least squares SVM [13]. The second is about the ramp loss nonparallel SVM for pattern classification [14]. The third is about a classification model using privileged information and its application [10].

Yong Shi

Functional Analysis

Frontmatter
Chapter 4. Feature Selection
Abstract
In big data analytics, irrelevant and redundant features may not only degrade the performance of classifiers, but also slow down the prediction process. Although many classification models are available for prediction, it is a challenge to choose a set of important features that leads to a satisfactory classifier. This chapter outlines some achievements of feature selection research in the last decade. Section 4.1 has three subsections. The first is an integrated scheme for feature selection and classifier evaluation in the context of prediction [1]. The second is about two-stage hybrid feature selection algorithms [2]. The third is feature selection with attribute clustering by the maximal information coefficient [3]. Section 4.2 presents two regularizations for feature selection: feature selection with MCP² regularization [4] and feature selection with ℓ2,1−2 regularization [5]. Finally, Sect. 4.3 describes two distance-based feature selection methods: the spatial distance join based feature selection [1] and a domain-driven two-phase feature selection method based on Bhattacharyya distance and kernel distance measurements [6].
Yong Shi
Chapter 5. Data Stream Analysis
Abstract
Data streams are a typical form of big data. They can be found in many real-life applications, such as wireless sensor networks, power consumption, information security and financial markets. Data stream classification has drawn increasing attention from the data mining community in recent years. In such real-world applications it is typically subject to three major challenges: concept drifting, large volumes, and partial labeling. As a result, training examples in data streams can be very diverse, and it is very hard to learn accurate models efficiently. This chapter provides two related research findings in the field. Section 5.1 describes a novel framework for application-driven classification of data streams [1]. The section first reviews the concepts of data streams, then categorizes diverse training examples into four types and assigns learning priorities to them. Following the discussion, it derives four learning cases based on the proportion and priority of the different types of training examples. Finally, the respective support vector machine models are presented. Section 5.2 studies the problem of learning from concept drifting data streams with noise, where samples in a data stream may be mislabeled or contain erroneous values [2]. It has three subsections. The first is about the noise description for data streams, the second is about ensemble frameworks for mining data streams, and the third is about theoretical studies of the Aggregate Ensemble.
Yong Shi
Chapter 6. Learning Analysis
Abstract
It is evident that most big data is represented in non-structured or semi-structured forms, such as images, text and others. It is important to study how to use an abstract form to represent data, whether structured, non-structured or semi-structured, or how to use label proportions to categorize the nature of data, so that a data mining or data analytic algorithm can be performed smoothly. Learning methods are very useful tools for understanding the data. Learning algorithms can be considered from different aspects, such as cognitive computing, mathematics, and machine learning.
Yong Shi
Chapter 7. Sentiment Analysis
Abstract
Sentiment analysis (SA) refers to the use of computational linguistics to identify and extract subjective information in source material, usually unstructured and heterogeneous text data. This chapter summarizes the recent findings of the authors’ research team on SA. It has two sections. Section 7.1 covers word embedding in two subsections: Sect. 7.1.1 is about single sense vs. multiple sense models, while Sect. 7.1.2 is about intrinsic vs. extrinsic evaluation. Section 7.2 outlines the SA applications.
Yong Shi
Chapter 8. Link Analysis
Abstract
Link analysis has been recognized as an effective technique in data science for exploring the relationships among objects. The objects can be social events, people, organizations and even business transactions. This chapter reports practical models of link analysis in various data-driven application areas. Section 8.1 presents a recommendation system for marketing optimization [1]. Section 8.2 is about advertisement clicking prediction [2]. Section 8.3 presents a model for customer churn prediction [3]. Section 8.4 provides node coupling clustering approaches for link prediction [4]. Finally, Sect. 8.5 discusses a pyramid scheme model for consumption rebate frauds [5].
Yong Shi
Chapter 9. Evaluation Analysis
Abstract
Evaluation is one of the key steps in big data analytics; it determines the merit of a data analysis with respect to the experimental objectives. It usually involves a trade-off among multiple criteria that may conflict with each other, or complex interpretations of the nature of the problem. This chapter provides several evaluation models from recent studies on data science. Section 9.1 reviews three evaluation formations for the known methodologies. Section 9.1.1 describes decision-making support for the evaluation of clustering algorithms based on multiple criteria decision making (MCDM) [1]. Section 9.1.2 is about the evaluation of classification algorithms using MCDM and rank correlation [2]. Section 9.1.3 discusses public blockchain evaluation using entropy and the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) [3]. Section 9.2 outlines two evaluation methods for software. Section 9.2.1 is about a classifier evaluation for software defect prediction [4], while Sect. 9.2.2 is about an ensemble of software defect predictors built with an AHP-based evaluation method [5]. Section 9.3 describes four evaluation methods for sociology and economics. Section 9.3.1 is about a delivery efficiency and supplier performance evaluation in China’s E-retailing industry [6]. Section 9.3.2 is about credit risk evaluation with a kernel-based affine subspace nearest points learning method [7]. Section 9.3.3 is a dynamic assessment method for urban eco-environmental quality evaluation [8], while Sect. 9.3.4 is an empirical study of classification algorithm evaluation for financial risk prediction [9].
Yong Shi
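
As an illustration of the entropy and TOPSIS combination named in the Chapter 9 abstract, the following is a minimal sketch of generic entropy-weighted TOPSIS. It is not the chapter's own model: the function names, the two criteria, and the decision matrix are hypothetical, chosen only to show how alternatives are ranked by closeness to an ideal solution.

```python
import math

def entropy_weights(matrix):
    """Entropy weights for an m x n decision matrix of positive scores."""
    m, n = len(matrix), len(matrix[0])
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [value / total for value in column]
        # Shannon entropy scaled to [0, 1] by log(m); 0 * log(0) is treated as 0.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        divergences.append(1.0 - entropy)
    spread = sum(divergences)
    if spread <= 1e-12:  # every criterion is uniform: fall back to equal weights
        return [1.0 / n] * n
    return [d / spread for d in divergences]

def topsis(matrix, weights, benefit):
    """Closeness of each alternative to the ideal solution, in [0, 1]."""
    n = len(matrix[0])
    # Vector-normalise each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    # Ideal and anti-ideal points; benefit[j] is True when larger is better.
    ideal = [max(r[j] for r in weighted) if benefit[j] else min(r[j] for r in weighted)
             for j in range(n)]
    anti = [min(r[j] for r in weighted) if benefit[j] else max(r[j] for r in weighted)
            for j in range(n)]
    scores = []
    for r in weighted:
        d_pos = math.sqrt(sum((a - b) ** 2 for a, b in zip(r, ideal)))
        d_neg = math.sqrt(sum((a - b) ** 2 for a, b in zip(r, anti)))
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical data: three alternatives scored on one benefit criterion
# (column 0, larger is better) and one cost criterion (column 1, smaller is better).
matrix = [[9.0, 2.0], [5.0, 6.0], [7.0, 4.0]]
benefit = [True, False]
weights = entropy_weights(matrix)
scores = topsis(matrix, weights, benefit)
ranking = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
```

In this made-up example the first alternative is best on both criteria and the second is worst on both, so they receive closeness scores of 1 and 0 respectively, with the third alternative in between. The sketch assumes at least two alternatives that are not all identical.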

Application and Future Analysis

Frontmatter
Chapter 10. Business and Engineering Applications
Abstract
Building on the algorithms for big data analytics described in the previous chapters, this chapter presents three sections on related business and engineering applications. Section 10.1 relates to banking and financial market analysis in three subsections. The first is a quantitative analysis of domestic systemically important banks in the Chinese banking system [1]. The second examines how credit portfolio diversification affects banks’ return and risk, with evidence from Chinese listed commercial banks [2]. The third is about an approach that integrates piecewise linear representation and weighted support vector machine for forecasting stock turning points [3]. Section 10.2 describes an agricultural problem: the classification of orange varieties based on near-infrared spectroscopy [4]. Section 10.3 provides two engineering applications. The first is about automatic road crack detection using random structured forests [5], while the second is an efficient method for railway track detection and turnout recognition using HOG features [6].
Yong Shi
Chapter 11. Healthcare Applications
Abstract
Healthcare is also a very active application area of data science, especially during the worldwide COVID-19 pandemic that began in early 2020. This chapter provides two sections on related healthcare applications. Section 11.1 deals with the evaluation of medical doctors’ performance using an ordinal regression-based approach [1], while Sect. 11.2 outlines a cutting-edge research finding on learning the transmission patterns of the COVID-19 outbreak by using an age-specific social contact characterization [2].
Yong Shi
Chapter 12. Artificial Intelligence IQ Test
Abstract
Since 2015, “artificial intelligence” has become a popular topic in science, technology, and industry. New products such as intelligent refrigerators, intelligent air conditioning, smart watches, smart robots, and of course, artificially intelligent mind emulators produced by companies such as Google and Baidu continue to emerge. However, the view that artificial intelligence is a threat remains persistent. An open question is this: if we compare the developmental levels of artificial intelligence products and systems with measured human intelligence quotients (IQs), can we develop a quantitative analysis method to assess the problem of the artificial intelligence threat?
Yong Shi
Backmatter
Metadata
Title
Advances in Big Data Analytics
Author
Prof. Yong Shi
Copyright Year
2022
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-16-3607-3
Print ISBN
978-981-16-3606-6
DOI
https://doi.org/10.1007/978-981-16-3607-3
