
2022 | Book

Data Classification and Incremental Clustering in Data Mining and Machine Learning


About this book

This book is a comprehensive, hands-on guide to the basics of data mining and machine learning, with a special emphasis on supervised and unsupervised learning methods. The book stresses the new ways of thinking needed to master machine learning on the Python, R, and Java programming platforms. It first provides an understanding of data mining, machine learning, and their applications, giving special attention to classification and clustering techniques. The authors offer a discussion of data mining and machine learning techniques with case studies and examples. The book also provides hands-on coding examples of some well-known supervised and unsupervised learning techniques in three different and popular coding platforms: R, Python, and Java. It explains some of the most popular classification techniques (K-NN, Naïve Bayes, decision tree, random forest, support vector machine, etc.) along with a basic description of artificial neural networks and deep neural networks. The book is useful for professionals, students studying data mining and machine learning, and researchers working on supervised and unsupervised learning techniques.

Table of Contents

Frontmatter
Chapter 1. Introduction to Data Mining and Knowledge Discovery
Abstract
Data mining is a process of discovering necessary hidden patterns from large volumes of data that may be stored across multiple heterogeneous resources. It is of enormous use to business executives, who make strategic decisions after analyzing the hidden truth in the data. Data mining is one of the steps in the knowledge-creation process. A data mining system consists of a data warehouse, a database server, a data mining engine, a pattern analysis module, and a graphical user interface. Data mining techniques include mining frequent patterns, association rule learning, and sequence analysis. Data mining can be applied on top of various kinds of intelligent data storage systems, such as data warehouses, and it provides analysis processes that support useful strategic decisions. There are various issues and challenges faced by a data mining system in large databases, which makes it a great place to work for data researchers and developers. Classification, one data mining task, is carried out by examining training data (i.e., objects whose class labels are predefined). With the help of a set of previous objects with known class labels, a model can be built that predicts the class of an object with an unknown class label. These classification models fall into a variety of categories, including the Bayesian model, decision tree, nearest neighbor, neural network, random forest, and support vector machine (SVM). By analyzing the most common class among the k closest samples, the K-nearest neighbor (KNN) technique helps predict the class of an object with an unknown class label. It is an easy-to-use strategy that yields a solid classification result for data from almost any distribution. The Naive Bayes theorem helps to perform classification; it is one of the fastest classification algorithms, capable of efficiently handling real-world discrete data.
Sanjay Chakraborty, SK Hafizul Islam, Debabrata Samanta
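As a companion to this abstract, the following is a minimal, hedged Python sketch of a Naive Bayes classifier. It is not the book's own code; it assumes scikit-learn and its bundled Iris dataset purely for illustration.

```python
# Minimal sketch (not the book's code): Gaussian Naive Bayes on the Iris data,
# assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()            # assumes features are conditionally Gaussian per class
model.fit(X_train, y_train)     # learn class priors and per-class feature statistics
y_pred = model.predict(X_test)  # assign each test sample to its most probable class

print(f"Naive Bayes accuracy: {accuracy_score(y_test, y_pred):.3f}")
```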
Chapter 2. A Brief Concept on Machine Learning
Abstract
Machine learning is a subset of AI. It is a field of research aimed at building computer programs capable of performing intelligent actions based on prior facts or experiences. Most of us utilize various machine learning techniques every day when we use the Netflix, YouTube, and Spotify recommendation algorithms, the Google and Yahoo search engines, and voice assistants like Google Home and Amazon Alexa. In supervised learning, all of the data is labeled, and algorithms learn to predict the output from the input. In unsupervised learning, the algorithms learn from the underlying structure of the data, which is unlabeled. In semi-supervised learning, because some data is labeled but not all of it, a combination of supervised and unsupervised techniques can be used.
Sanjay Chakraborty, SK Hafizul Islam, Debabrata Samanta
Chapter 3. Supervised Learning-Based Data Classification and Incremental Clustering
Abstract
Using supervised learning-based data classification, an unknown example can be classified using the most common class among its K nearest examples. The KNN classifier claims, "Tell me who your neighbors are, and I will tell you who you are." KNN is a simple yet powerful approach with applications in computer vision, pattern recognition, optical character recognition, facial recognition, genetic pattern recognition, and other fields. It is also known as a lazy learner because it does not develop a model to classify a given test tuple until the very last minute. When we say "yes" or "no," there may be an element of chance involved. However, the fact that a diner can recognise an unseen food using his senses of taste, flavour, and smell is highly fascinating. At first, there can be a brief data collection phase: what are the most noticeable spices, aromas, and textures? Is the flavour of the food savoury or sweet? This information can then be used by the diner to compare the bite to other items he or she has had in the past. Earthy flavours may conjure up images of mushroom-based dishes, while briny flavours may conjure up images of fish. We view the discovery process through the lens of a slightly modified adage: if it smells like a duck and tastes like a chicken, you are probably eating chicken. This is a case of supervised learning in action, a concept from which machine learning (ML) benefits broadly.
Sanjay Chakraborty, SK Hafizul Islam, Debabrata Samanta
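The majority-vote idea described in this abstract can be written down in a few lines. The sketch below is illustrative only (the book's own worked examples are in R, Python, and Java); the helper name knn_predict and the toy data are made up for this page.

```python
# From-scratch sketch of the K-nearest-neighbor idea: classify a query point by the
# majority class among its k closest training examples.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Return the majority label among the k training points nearest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every stored example
    nearest = np.argsort(distances)[:k]                    # indices of the k closest training points
    votes = Counter(y_train[nearest])                      # count class labels among those neighbors
    return votes.most_common(1)[0][0]

# Toy usage with made-up 2-D points
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["sweet", "sweet", "savoury", "savoury"])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> "savoury"
```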
Chapter 4. Data Classification and Incremental Clustering Using Unsupervised Learning
Abstract
Clustering is studied through data modelling, which is based on mathematics, statistics, and numerical analysis. Clusters in machine learning correspond to hidden patterns; unsupervised learning is used to find clusters, and the resulting system is a data concept. As a result, clustering is the unsupervised discovery of a hidden data concept. The computing needs of clustering analysis are increased because data mining deals with massive databases. As a result of these challenges, data mining clustering algorithms that are both powerful and widely applicable have emerged. Clustering is also known as data segmentation in some applications because it splits large datasets into categories based on their similarities. Outliers (values that are "far away" from any cluster) can be more interesting than typical examples; hence outlier detection can be done using clustering. Outlier detection applications include the identification of credit card fraud and the monitoring of unlawful activities in Internet commerce. The K-means method is sensitive to the initial placement of cluster centers, so multiple runs with alternative initial placements must be scheduled to identify near-optimal solutions. A global K-means algorithm is used to solve this problem: a deterministic global optimization approach that uses the K-means algorithm as a local search strategy and does not require any initial parameter values. Instead of selecting initial values for all cluster centers at random, as most global clustering algorithms do, this technique operates in stages, adding one new cluster center at a time.
Sanjay Chakraborty, SK Hafizul Islam, Debabrata Samanta
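The "multiple runs with alternative initial center placements" strategy mentioned in this abstract can be sketched briefly. This is not the book's code or the global K-means algorithm itself; it is a minimal Python example assuming scikit-learn, where the n_init parameter performs the repeated restarts.

```python
# Sketch of K-means with several random restarts to mitigate sensitivity to initial centers.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic data with 3 true clusters

# n_init=10 runs K-means from 10 different initial center placements and keeps the
# solution with the lowest within-cluster sum of squares (inertia).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Inertia of best run:", km.inertia_)
print("Cluster centers:\n", km.cluster_centers_)
```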
Chapter 5. Research Intention Towards Incremental Clustering
Abstract
Incremental clustering is a process of grouping new incoming, or incremental, data into classes or clusters. It mainly places newly arriving data into the most similar existing clusters. The existing K-means and DBSCAN clustering algorithms are inefficient at handling large dynamic databases because, for every change in the incremental database, they simply rerun their algorithms from scratch, taking a long time to properly cluster the newly arriving data. It has also been realized that frequently reapplying the existing algorithms to updated databases may be too costly. So, the existing K-means clustering algorithm is not suitable for a dynamic environment. That is why incremental versions of K-means and DBSCAN have been introduced in our work to overcome these challenges. To address the aforementioned issue, incremental clustering algorithms were developed that measure new cluster centers by simply computing the distance of new data from the means of the current clusters rather than rerunning the entire clustering procedure. Both the incremental K-means and the incremental DBSCAN algorithms use a similar approach. This work also specifies the delta change in the original database at which incremental K-means or DBSCAN clustering outperforms the prior techniques.
Sanjay Chakraborty, SK Hafizul Islam, Debabrata Samanta
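The core incremental step described above, comparing a new point against existing cluster means instead of reclustering everything, can be hedged into a small sketch. This is not the authors' exact algorithm; the helper name incremental_assign and the toy data are hypothetical.

```python
# Sketch of incremental assignment: place a newly arriving point into the nearest
# existing cluster and update that cluster's mean in place, without reclustering.
import numpy as np

def incremental_assign(centers, counts, new_point):
    """Assign new_point to the nearest center (returned index) and update that center as a running mean."""
    distances = np.linalg.norm(centers - new_point, axis=1)  # distance to every current cluster mean
    j = int(np.argmin(distances))                            # index of the closest cluster
    counts[j] += 1                                           # one more member in that cluster
    centers[j] += (new_point - centers[j]) / counts[j]       # incremental (running-mean) center update
    return j

# Existing clusters (means and sizes), then one incremental data point
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
counts = np.array([50, 50])
print(incremental_assign(centers, counts, np.array([9.5, 10.5])))  # joins cluster 1
```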
Chapter 6. Real-Time Application with Data Mining and Machine Learning
Abstract
Data mining and machine learning are among the most expressive research and application domains. Almost all real-time applications depend, directly or indirectly, on data mining and machine learning. Data analysis in finance, retail, and the telecommunications sector, the analysis of biological data, other scientific applications, and intrusion detection are just a few examples of the relevant fields. Data mining (DM) has many applications in the retail industry because that industry captures a lot of data from sales, client purchase histories, product transportation, consumption, and services. It is only logical that the amount of data collected will continue to climb as the Internet's accessibility, affordability, and popularity increase. In the retail industry, DM assists in the detection of customer buying behaviors and trends, resulting in improved customer service and increased customer retention and satisfaction.
Sanjay Chakraborty, SK Hafizul Islam, Debabrata Samanta
Chapter 7. Feature Subset Selection Techniques with Machine Learning
Abstract
Scientists and analysts in machine learning and data mining face a problem when it comes to high-dimensional data processing. Variable selection is an excellent method to address this issue. It removes unnecessary and repetitive data, reduces computation time, improves learning accuracy, and makes the learning strategy or the data easier to comprehend. This chapter describes various commonly used variable selection evaluation metrics before surveying supervised, unsupervised, and semi-supervised variable selection techniques that are often employed in machine learning tasks such as classification and clustering. Finally, remaining variable selection difficulties are addressed. Variable selection is an essential topic in machine learning and pattern recognition, and numerous methods have been suggested. This chapter scrutinizes the performance of various variable selection techniques using public domain datasets. We assessed the number of variables removed and the improvement in learning performance with the selected variable selection techniques and then evaluated and compared each approach based on these measures. The evaluation criteria for the filter model are critical. Meanwhile, the embedded model selects variables during the learning model's training process, and the variable selection result is outputted automatically when the training process is concluded. Lasso minimizes the sum of squares of the residuals subject to the constraint that the sum of the absolute values of the regression coefficients is less than a constant, which shrinks the regression coefficients and drives the weakest ones to zero. The variables are then trimmed using the AIC and BIC criteria, resulting in a dimension reduction. Lasso-based variable selection strategies, such as the Lasso in the regression model and others, provide a high level of stability. However, Lasso techniques are prone to high computing costs or overfitting difficulties when dealing with high-dimensional data.
Sanjay Chakraborty, SK Hafizul Islam, Debabrata Samanta
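The Lasso-based selection described in this abstract can be illustrated with a short, hedged sketch. It is not taken from the book; it assumes scikit-learn and a synthetic regression dataset, and the penalty strength alpha=1.0 is an arbitrary illustrative choice.

```python
# Sketch of Lasso-based variable selection: coefficients driven exactly to zero
# mark the variables the model drops.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 candidate variables, only 5 of which actually carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)      # L1 penalty shrinks weak coefficients toward zero
selected = np.flatnonzero(lasso.coef_)  # indices of variables the model keeps
print("Selected variable indices:", selected)
```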
Chapter 8. Data Mining-Based Variant Subset Features
Abstract
During the variant selection procedure, a subset of the available variants is chosen for the learning approach. It retains the important variants in the fewest dimensions, the ones that contribute the most to learner accuracy. The benefit of variant selection is that essential information about a particular variant is not lost; however, if only a limited number of variants are kept and the original variants are extremely varied, there tends to be a risk of losing information, since certain variants must be ignored. Dimensionality reduction based on variant extraction, on the other hand, allows the size of the variant space to be reduced without losing information from the original variant space. Filters, wrappers, and embedded approaches are the three categories of variant selection procedures. Wrapper strategies outperform filter methods because the variant selection procedure is tailored to the classifier to be used. Wrapper techniques, on the other hand, are too expensive to use for large variant spaces because of their high computational cost: each candidate variant subset must be evaluated with the trained classifier, which slows down the variant selection process. Filter techniques have a lower computing cost and are faster than wrapper procedures, but they have worse classification reliability and are better suited to high-dimensional datasets. Hybrid techniques, which combine the benefits of both the filter and wrapper approaches, are now being developed.
Sanjay Chakraborty, SK Hafizul Islam, Debabrata Samanta
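To make the filter/wrapper contrast concrete, here is a minimal sketch, again not from the book: it assumes scikit-learn, uses univariate F-scores as the filter and recursive feature elimination around a logistic regression as the wrapper, and keeps 10 variants from each as an arbitrary illustrative choice.

```python
# Sketch contrasting a filter method (classifier-independent scoring) with a wrapper
# method (recursive feature elimination that repeatedly refits a classifier).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each variant independently of any classifier, keep the top 10
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: repeatedly refit the classifier, dropping the weakest variant each round
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

print("Filter keeps:", filt.get_support(indices=True))
print("Wrapper keeps:", wrap.get_support(indices=True))
```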
Backmatter
Metadata
Title
Data Classification and Incremental Clustering in Data Mining and Machine Learning
Authors
Sanjay Chakraborty
Sk Hafizul Islam
Dr. Debabrata Samanta
Copyright Year
2022
Electronic ISBN
978-3-030-93088-2
Print ISBN
978-3-030-93087-5
DOI
https://doi.org/10.1007/978-3-030-93088-2