2022 | Book

Advances in Big Data Analytics

Theory, Algorithms and Practices

Author: Prof. Yong Shi

Publisher: Springer Nature Singapore

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Today, big data affects countless aspects of our daily lives. This book provides a comprehensive and cutting-edge study on big data analytics, based on the research findings and applications developed by the author and his colleagues in related areas. It addresses the concepts of big data analytics and/or data science, multi-criteria optimization for learning, expert and rule-based data analysis, support vector machines for classification, feature selection, data stream analysis, learning analysis, sentiment analysis, link analysis, and evaluation analysis. The book also explores lessons learned in applying big data to business, engineering and healthcare. Lastly, it addresses the advanced topic of intelligence-quotient (IQ) tests for artificial intelligence.
Since each aspect mentioned above concerns a specific domain of application, taken together, the algorithms, procedures, analysis and empirical studies presented here offer a general picture of big data developments. Accordingly, the book can serve not only as a textbook for graduate students with fundamental training in big data analytics, but also as a guide that shows practitioners how to use the proposed techniques to deal with real-world big data problems.

Frontmatter

Concept and Theoretical Foundation

Frontmatter

Chapter 1. Big Data and Big Data Analytics

Abstract

Big data is now a common term. However, the evolution of big data has two origins. The creation of the computer in the 1940s gradually provided tools for human beings to collect massive data, while the term “big data” became a popular slogan representing the collection, processing, and analysis of various data [1]. Data has been growing exponentially for the last 70 years. EMC2 [2] estimated that the world had generated 1.8 zettabytes of data (1.8 × 10²¹ bytes) by 2011. By 2020, this figure had grown to 44 zettabytes, about 24 times as much. Big data analytics has arisen as the technical means of dealing with both the theory and the application of big data. This chapter elaborates on the understanding of big data and its analytics. Section 1.1 briefly describes big data evolution and challenges. Section 1.2 is about big data’s current status, including its development in the world as well as in China. Section 1.3 explores big data analysis and data science problems.

Yong Shi

Chapter 2. Multiple Criteria Optimization Classification

Abstract

As the increasingly strong computational power of computers compensates for the limits of the human brain in calculation, data mining, a major component of data science, has emerged in response to the needs of the times, owing to its ability to extract novel, useful, and potentially valuable knowledge from large-scale complex data. However, from the mathematical perspective, some data mining methods, such as decision trees, genetic algorithms, and association rules, could be considered heuristic algorithms, meaning that they select a “better solution” from several alternative solutions as the criterion of classification. These methods do not explore how to locate the “best solution” systematically.

Yong Shi

Chapter 3. Support Vector Machine Classification

Abstract

Support vector machine (SVM) has been a popular technique in data analytics. Shi et al. [1] have reported some SVM algorithms. They range from leave-one-out (LOO) bound approaches to multi-class, unsupervised, semi-supervised and robust SVMs. Following the direction of the research afterwards, this chapter provides five sections about advances of SVM in big data analytics. Section 3.1 has two subsections. The first one outlines the recent findings of the author’s research team on SVM [2], while the second one is about two new decomposition algorithms for training bound-constrained SVM [3]. Section 3.2 describes different twin SVMs in classification with four subsections. The first one explores the improved twin SVM [4]. The second one extends twin SVM to multi-category classification problems [5]. The third one provides robust twin SVM for pattern classification [6]. The fourth one elaborates structural twin SVM for classification [7]. Section 3.3 shows nonparallel SVM with four subsections. The first one is about a nonparallel SVM for a classification problem with universum learning [8]. The second one is about a divide-and-combine method for large scale nonparallel SVM [9]. The third one explores nonparallel SVM for pattern classification [4]. The fourth one is a multi-instance learning algorithm based on a nonparallel classifier [10]. Section 3.4 shows Laplacian SVM classifiers with two subsections. One is about successive overrelaxation for Laplacian SVM [11], while the other is about Laplacian twin SVM for semi-supervised classification [12]. Finally, Sect. 3.5 discusses loss functions of SVM classification with three subsections. The first one is about the ramp loss least squares SVM [13]. The second is about the ramp loss nonparallel SVM for pattern classification [14]. The third one is about a classification model using privileged information and its application [10].

Yong Shi
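
The Chapter 3 abstract mentions ramp loss as an alternative loss function for SVM classification. As a rough illustration only, the sketch below compares the standard hinge loss with the common textbook form of the ramp loss, which caps the hinge loss so that badly misclassified points (outliers) cannot contribute an unbounded penalty. The function names and the default parameter `s = -1.0` are assumptions made for this example; the exact formulations used in [13] and [14] may differ.

```python
def hinge_loss(margin: float) -> float:
    """Hinge loss max(0, 1 - margin), where margin = y * f(x)."""
    return max(0.0, 1.0 - margin)


def ramp_loss(margin: float, s: float = -1.0) -> float:
    """Ramp loss: the hinge loss capped at 1 - s.

    Equivalent to max(0, 1 - margin) - max(0, s - margin) for s < 1.
    The default s = -1.0 is an illustrative choice, not taken from the book.
    """
    if s >= 1.0:
        raise ValueError("s must be smaller than 1")
    return min(1.0 - s, hinge_loss(margin))


if __name__ == "__main__":
    # Compare both losses on a few margins, from badly wrong to safely right.
    for m in (-5.0, -1.0, 0.0, 0.5, 2.0):
        print(f"margin={m:+.1f}  hinge={hinge_loss(m):.1f}  ramp={ramp_loss(m):.1f}")
```

For a margin of -5 the hinge loss is 6 while the ramp loss stays at its cap of 2, which is the property that makes ramp-loss models less sensitive to outliers.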

Functional Analysis

Frontmatter

Chapter 4. Feature Selection

Abstract

In big data analytics, irrelevant and redundant features may not only deteriorate the performance of classifiers, but also slow down the prediction process. Although many classification models are available for prediction, it is a challenge to choose a set of important features that can lead to a satisfactory classifier. This chapter outlines some achievements of feature selection research in the last decade. Section 4.1 has three subsections. The first is an integrated scheme for feature selection and classifier evaluation in the context of prediction [1]. The second is about two-stage hybrid feature selection algorithms [2]. The third one is feature selection with attribute clustering by maximal information coefficient [3]. Section 4.2 presents two regularization methods for feature selection. They are feature selection with MCP² regularization [4] and feature selection with ℓ_{2,1−2} regularization [5]. Finally, Sect. 4.3 describes two distance-based feature selection methods. They are the spatial distance join based feature selection [1] and a domain driven two-phase feature selection method based on Bhattacharyya distance and kernel distance measurements [6].

Yong Shi

Chapter 5. Data Stream Analysis

Abstract

Data streams are a typical form of big data. They can be found in many real-life applications, such as wireless sensor networks, power consumption, information security and financial markets. Data stream classification has drawn increasing attention from the data mining community in recent years. Data stream classification in such real-world applications is typically subject to three major challenges: concept drifting, large volumes, and partial labeling. As a result, training examples in data streams can be very diverse and it is very hard to learn accurate models efficiently. This chapter provides two related research findings in the field. Section 5.1 describes a novel framework for application-driven classification of data streams [1]. The section first reviews the concepts of data streams, then categorizes diverse training examples into four types and assigns learning priorities to them. Following the discussion, it derives four learning cases based on the proportion and priority of the different types of training examples. Finally, the respective support vector machine models are presented. Section 5.2 studies the problem of learning from concept drifting data streams with noise, where samples in a data stream may be mislabeled or contain erroneous values [2]. It has three subsections. The first one is about noisy description for data streams, the second one covers the ensemble frameworks for mining data streams, and the third one presents the theoretical studies of the Aggregate Ensemble.

Yong Shi

Chapter 6. Learning Analysis

Abstract

It is evident that most big data is represented in non-structured or semi-structured forms, such as images, text and others. It is important to study how to use an abstract form to represent data, whether structured, non-structured or semi-structured, or how to use label proportions to categorize the nature of data, so that a data mining or data analytics algorithm can be performed smoothly. Learning methods are very useful tools for understanding the data. Learning algorithms can be considered from different aspects, such as cognitive computing, mathematics, and machine learning.

Yong Shi

Chapter 7. Sentiment Analysis

Abstract

Sentiment analysis (SA) refers to the use of computational linguistics to identify and extract subjective information in source material, usually unstructured and heterogeneous text data. This chapter summarizes the recent findings of the author’s research team on SA. It has two sections. Section 7.1 covers word embedding in two subsections: Sect. 7.1.1 is about single sense model vs. multiple sense model, while Sect. 7.1.2 is about intrinsic vs. extrinsic evaluation. Section 7.2 outlines the SA applications.

Yong Shi

Chapter 8. Link Analysis

Abstract

Link analysis has been recognized as an effective technique in data science to explore the relationships of objects. The objects can be social events, people, organizations and even business transactions. This chapter reports the practical models of link analysis in various data-driven application areas. Section 8.1 presents a recommendation system for marketing optimization [1]. Section 8.2 is about advertisement clicking prediction [2]. Section 8.3 presents a model for customer churn prediction [3]. Section 8.4 provides node coupling clustering approaches for link prediction [4]. Finally, Sect. 8.5 discusses a pyramid scheme model for consumption rebate frauds [5].

Yong Shi

Chapter 9. Evaluation Analysis

Abstract

Evaluation is one of the key steps in big data analytics, which determines the merit of data analysis with respect to the experimental objectives. It usually involves a trade-off comparison among multiple criteria, which may conflict with each other, or complex interpretations of the nature of the problems. This chapter provides several evaluation models from recent studies in data science. Section 9.1 reviews three evaluation formations for the known methodologies. Section 9.1.1 describes a decision-making support for the evaluation of clustering algorithms based on multiple criteria decision making (MCDM) [1]. Section 9.1.2 is about evaluation of classification algorithms using MCDM and rank correlation [2]. Section 9.1.3 discusses the public blockchain evaluation using entropy and the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) [3]. Section 9.2 outlines two evaluation methods for software. Section 9.2.1 is about a classifier evaluation for software defect prediction [4], while Sect. 9.2.2 is about an ensemble of software defect predictors by an AHP-based evaluation method [5]. Section 9.3 describes four evaluation methods for sociology and economics. Section 9.3.1 is about a delivery efficiency and supplier performance evaluation in China’s E-retailing industry [6]. Section 9.3.2 is about the credit risk evaluation with a kernel-based affine subspace nearest points learning method [7]. Section 9.3.3 is a dynamic assessment method for urban eco-environmental quality evaluation [8], while Sect. 9.3.4 is an empirical study of classification algorithm evaluation for financial risk prediction [9].

Yong Shi
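
The Chapter 9 abstract refers to evaluation using entropy and TOPSIS. To make that combination concrete, here is a minimal sketch of the textbook versions of both steps: entropy-based criterion weights followed by a TOPSIS closeness score for each alternative. It assumes a strictly positive decision matrix with at least two alternatives that are not all identical. The function names, the criteria, and the example numbers are hypothetical and are not taken from the book or from [3].

```python
import math
from typing import List, Sequence


def entropy_weights(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Entropy weights for the criteria (columns) of a positive decision matrix."""
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / math.log(m)  # scales each column entropy into [0, 1]
    divergences = []
    for j in range(n):
        col_sum = sum(row[j] for row in matrix)
        shares = [row[j] / col_sum for row in matrix]
        entropy = -k * sum(p * math.log(p) for p in shares if p > 0)
        # Columns whose values differ more carry more information.
        divergences.append(max(0.0, 1.0 - entropy))
    total = sum(divergences)
    if total <= 1e-12:  # every column is constant: fall back to equal weights
        return [1.0 / n] * n
    return [d / total for d in divergences]


def topsis(
    matrix: Sequence[Sequence[float]],
    weights: Sequence[float],
    benefit: Sequence[bool],
) -> List[float]:
    """Closeness of each alternative (row) to the ideal solution, in [0, 1]."""
    n = len(matrix[0])
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    columns = list(zip(*v))
    # benefit[j] is True when larger is better, False when smaller is better.
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


if __name__ == "__main__":
    # Hypothetical alternatives scored on throughput (benefit),
    # latency (cost) and fee (cost); the numbers are made up.
    names = ["A", "B", "C"]
    data = [[250.0, 16.0, 12.0], [200.0, 10.0, 8.0], [300.0, 32.0, 16.0]]
    w = entropy_weights(data)
    scores = topsis(data, w, benefit=[True, False, False])
    for name, score in sorted(zip(names, scores), key=lambda t: -t[1]):
        print(f"{name}: {score:.3f}")
```

Entropy weighting is only one way to obtain the weights; any weight vector that sums to one can be passed to `topsis` instead.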

Application and Future Analysis

Frontmatter

Chapter 10. Business and Engineering Applications

Abstract

Building on the algorithms for big data analytics described in the previous chapters, this chapter presents three sections on related business and engineering applications. Section 10.1 relates to banking and financial market analysis with three subsections. The first one is about domestic systemically important banks: a quantitative analysis for the Chinese banking system [1]. The second is about how credit portfolio diversification affects banks’ return and risk, with evidence from Chinese listed commercial banks [2]. The third one is about an approach of integrating piecewise linear representation and weighted support vector machine for forecasting stock turning points [3]. Section 10.2 describes an agricultural problem: the classification of orange varieties based on near-infrared spectroscopy [4]. Section 10.3 provides two engineering applications. The first one is about automatic road crack detection using random structured forests [5], while the second one is an efficient railway track detection and turnout recognition method using HOG features [6].

Yong Shi

Chapter 11. Healthcare Applications

Abstract

Healthcare is also a very active application area of data science, especially during the worldwide COVID-19 pandemic that began in early 2020. This chapter provides two sections on related healthcare applications. Section 11.1 deals with the evaluation of medical doctors’ performance using an ordinal regression-based approach [1], while Sect. 11.2 outlines a cutting-edge research finding on learning transmission patterns of the COVID-19 outbreak using an age-specific social contact characterization [2].

Yong Shi

Chapter 12. Artificial Intelligence IQ Test

Abstract

Since 2015, “artificial intelligence” has become a popular topic in science, technology, and industry. New products such as intelligent refrigerators, intelligent air conditioning, smart watches, smart robots, and of course, artificially intelligent mind emulators produced by companies such as Google and Baidu continue to emerge. However, the view that artificial intelligence is a threat remains persistent. An open question is: if we compare the developmental levels of artificial intelligence products and systems with measured human intelligence quotients (IQs), can we develop a quantitative analysis method to assess the problem of artificial intelligence threat?

Yong Shi

Backmatter

Title: Advances in Big Data Analytics
Author: Prof. Yong Shi
Publisher: Springer Nature Singapore
Electronic ISBN: 978-981-16-3607-3
Print ISBN: 978-981-16-3606-6
DOI: https://doi.org/10.1007/978-981-16-3607-3
