2020 | Book

Classification and Data Analysis

Theory and Applications

About this book

This volume gathers peer-reviewed contributions on data analysis, classification and related areas presented at the 28th Conference of the Section on Classification and Data Analysis of the Polish Statistical Association, SKAD 2019, held in Szczecin, Poland, on September 18–20, 2019. Providing a balance between theoretical and methodological contributions and empirical papers, it covers a broad variety of topics, ranging from multivariate data analysis, classification and regression, symbolic (and other) data analysis, visualization, data mining, and computer methods to composite measures, and numerous applications of data analysis methods in economics, finance and other social sciences. The book is intended for a wide audience, including researchers at universities and research institutions, graduate and doctoral students, practitioners, data scientists and employees in public statistical institutions.

Table of Contents

Frontmatter

Methods

Frontmatter
Comparison of Proposals of Transformation of Nominants into Stimulants on the Example of Financial Ratios of Companies Listed on the Warsaw Stock Exchange
Abstract
In linear ordering, it is important to determine the character of the variables describing the examined objects. When the set of variables contains nominants alongside stimulants and destimulants, the nominants need to be transformed into stimulants so that the variables are comparable. The paper focuses on formulas that transform nominants with a recommended range of values into stimulants with the range [0; 1]. The motivation came from a linear ordering of companies by financial condition: after nominant indicators with a recommended range of values were transformed, the standardized stimulants obtained for companies with values outside the recommended range did not always order the companies in accordance with the expectations resulting from the original values of these indicators. Therefore, the aim of the study was to compare selected formulas of transformation of nominants into stimulants and to indicate those for which the order of companies before and after the transformation was more consistent. The authors also proposed modifications of selected formulas that allow this consistency to be maintained. The study used data on two financial indicators, the current ratio and the debt ratio, which are considered in the literature to be nominants with a recommended range of values. The data on the ratios come from Notoria Serwis and concern companies from the Machinery industry sector listed on the Warsaw Stock Exchange in 2016.
Barbara Batóg, Katarzyna Wawrzyniak
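
To make the idea in the abstract above concrete, here is a minimal sketch of one possible nominant-to-stimulant transformation. The formula (1 inside the recommended range, 1/(1 + distance) outside it), the function name `nominant_to_stimulant`, and the current-ratio bounds 1.2 and 2.0 are assumptions chosen for illustration; the abstract does not state the formulas compared in the chapter or the modifications the authors propose.

```python
def nominant_to_stimulant(x, low, high):
    """Map a nominant with recommended range [low, high] to a stimulant in (0, 1].

    Illustrative formula only (an assumption, not taken from the chapter):
    values inside the range score 1, values outside decay with their
    distance from the nearest bound.
    """
    if low <= x <= high:
        return 1.0
    distance = low - x if x < low else x - high
    return 1.0 / (1.0 + distance)


# Hypothetical current ratios and a hypothetical recommended range [1.2, 2.0].
current_ratios = [0.8, 1.2, 1.6, 2.0, 3.5]
scores = [nominant_to_stimulant(x, 1.2, 2.0) for x in current_ratios]
```

This symmetric form treats a shortfall and an excess of equal size identically, which may or may not match what an analyst expects for a given ratio.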
Silhouette Index as Clustering Evaluation Tool
Abstract
The silhouette index is commonly used in cluster analysis for finding the optimal number of clusters, as well as for final clustering validation and evaluation, as a synthetic indicator that measures the general quality of clustering (relative compactness and separability of clusters; see Walesiak and Gatnar in Statystyczna analiza danych z wykorzystaniem programu R. PWN, Warszawa, p. 420, 2009). Its advantages are low computational complexity and simple interpretation rules. Recently, some proposals have appeared to use this index directly as the basis of clustering algorithms. The paper attempts to evaluate this approach. It shows examples in which the “mechanical” use of the silhouette index leads to results that do not correspond to the actual structure of the classes, and presents recommendations on the correct application of the index.
Andrzej Dudek
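
For readers unfamiliar with the silhouette index discussed in the abstract above, the following sketch computes it from its standard definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), using Euclidean distance. The function names and the four toy points are assumptions for illustration and are not taken from the chapter.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette value s(i) of every point (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: s(i) is defined as 0
            values.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_index(points, labels):
    """Average silhouette value over all points."""
    vals = silhouette_values(points, labels)
    return sum(vals) / len(vals)


# Two well-separated pairs of points; the second labelling cuts across them.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_index(points, [0, 0, 1, 1])
bad = silhouette_index(points, [0, 1, 0, 1])
```

In this toy case the natural grouping scores close to 1 and the crossed grouping scores below 0, which is the behaviour the interpretation rules rely on.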
The Role of Discretization of Continuous Variables in Socioeconomic Classification Models on the Example of Logistic Regression Models and Artificial Neural Networks
Abstract
Logistic regression models and artificial neural networks require the use of appropriate quality data. One of the methods of improving the quality of raw data is the discretization of continuous variables. It can be a way to deal with outliers and influential observations and can be helpful when the assumptions of some of the models are not met. This paper shows that despite the fact that the discretization of continuous variables means that reduced information is used for the modeling, it can improve the classification accuracy of machine learning models. This is particularly important when searching for the best predictive model when a limited set of explanatory variables is available, as well as when analyzing large data sets. In addition, by selecting the methods used to discretize continuous variables we decide about the number and type of variables that are included in the model and, as a result, are subject to interpretation. The selection of cut-off points matching the purpose of the research can be made using supervised discretization methods. In this study, the data from the Generations and Gender Survey (GGS) for Poland was used. The status of respondents on the labor market was considered. For the considered data, the advantages of using supervised discretization of continuous variables based on the entropy criterion and the Gini criterion were pointed out. Importantly, discretization based on these methods provided predictive models of better classification accuracy than the models based on discretization procedure frequently applied in socioeconomic studies.
Wioletta Grzenda
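
The supervised discretization based on the entropy criterion mentioned in the abstract above can be sketched, in its simplest form, as a search for the single cut-off point that minimises the weighted class entropy of the two resulting intervals. This is only a one-split illustration: the function names, the toy ages and the labour-market labels are assumptions, and the chapter's actual procedure (number of cuts, stopping rule, Gini variant) is not described in the abstract.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_entropy_cut(values, labels):
    """Return (cut, weighted_entropy) of the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[k - 1][0] + pairs[k][0]) / 2  # midpoint between neighbours
        if best is None or score < best[1]:
            best = (cut, score)
    return best


# Hypothetical data: age and a binary labour-market status (1 = employed).
ages = [19, 22, 25, 31, 38, 44, 52, 60]
employed = [0, 0, 0, 1, 1, 1, 1, 1]
cut, score = best_entropy_cut(ages, employed)
```

Applying the same search recursively inside each interval would give a multi-interval discretization; replacing `entropy` with the Gini impurity would give the Gini-based variant.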
Intuitionistic Fuzzy Synthetic Measure for Ordinal Data
Abstract
The paper presents an intuitionistic fuzzy synthetic measure for ordinal data based on Hellwig’s pattern of the development method. The intuitionistic fuzzy synthetic measure allows for a comparative analysis of objects due to the complex phenomenon described by ordinal measurement scales. It also allows for taking into account the uncertainty in comparing objects expressed in the form of neutral points on the ordinal measurement scales. The proposed approach is a part of the research into ordinal data using the fuzzy set theory. The method of the construction of the proposed synthetic measure was presented on the example of the subjective quality of life research of the residents of the communes of the Kraina Łęgów Odrzańskich region in Poland.
Bartłomiej Jefmański
Improving Classification Accuracy of Ensemble Learning for Symbolic Data Through Neural Networks’ Feature Extraction
Abstract
Two key elements with a major impact on the modeling process are the choice of method and the selection of variables (the information that will be used in the model). One approach to improving a model’s accuracy is the selection of variables; another is the transformation of variables. The paper presents a procedure that combines these two approaches: extracting variables from neural networks (a multilayer perceptron for symbolic data) as the method of variable selection for the purposes of ensemble learning for symbolic data. The main aim of the paper is to analyze the usefulness of the proposed approach for the predictive power of the ensemble model. In the empirical part, a symbolic data set describing a thousand German borrowers is used.
Marcin Pełka

Applications in Finance

Frontmatter
Inequality Restricted Least Squares (IRLS) Model of Real Estate Prices
Abstract
The aim of the paper is developing an econometric model that may support the process of real estate mass appraisal. The research hypothesis assumes that a model with restrictions enables a more precise determination of the impact of real property attributes on the prices than an analogous model without restrictions. The so-called Szczecin algorithm of real estate mass appraisal serves as a starting point for the model determination. A unitary price of undeveloped land real properties designated for low-rise residential development constitutes an explained variable. A set of explanatory variables is comprised of the following real estate attributes: surface area, plot physical properties, utilities, transport availability, real estate neighbourhood. The impact of a location was considered through dummy variables adopted for city surveying sections. All the variables were introduced into the model taking into account the measurement scales best suited for each of them. Two types of restrictions, (1) non-negativity of an attribute impact and (2) monotonicity of an attribute impact, will be imposed on the model parameters. These restrictions refer to the parameters with variables (attributes) other than surface area, which is measured in m2. The procedure of estimation of a model with restrictions will be discussed. The model will be verified with the use of a real transaction database from the Szczecin real estate market concerning undeveloped land real estate designated for low-rise residential development.
Mariusz Doszyń
Application of Hill Estimator to Assess Extreme Risks in the Metals Market
Abstract
Rare phenomena are extremely important for risk assessment, mainly because their occurrence usually has significant consequences. There are many methods for identifying such events. Some of these use extreme statistics, which are part of extreme value theory. High-order quantiles make it possible to determine the level of risk for which the probability of a risky event occurring is negligible. In this paper, the estimation of the tail index of a probability distribution using the Hill estimator and its modification is presented. The results for selected nonparametric and parametric models were compared. The empirical analysis was carried out on assets from the base metals market: aluminium and copper.
Dominik Krężołek
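
As a pointer to what the Hill estimator in the abstract above computes, the sketch below implements its classical form: the mean log-excess of the k largest observations over the (k+1)-th largest, whose reciprocal estimates the tail index. The function name, the choice k = 100 and the synthetic Pareto-quantile sample are assumptions for illustration; the modification studied in the chapter is not specified in the abstract and is not reproduced here.

```python
import math


def hill_estimator(sample, k):
    """Classical Hill estimate of the extreme value index from the k largest values."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    ordered = sorted(sample, reverse=True)
    if ordered[k] <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    threshold = math.log(ordered[k])  # log of the (k+1)-th largest value
    return sum(math.log(x) - threshold for x in ordered[:k]) / k


# Synthetic, deterministic sample: quantiles of a Pareto law with tail index 2.
n, alpha = 1000, 2.0
sample = [(i / (n + 1)) ** (-1.0 / alpha) for i in range(1, n + 1)]

gamma_hat = hill_estimator(sample, k=100)  # should be near 1 / alpha
alpha_hat = 1.0 / gamma_hat                # estimated tail index
```

In practice the estimate depends strongly on k, so it is usually inspected over a range of k values rather than read off at a single one.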
Segmentation of Enterprises on the Basis of Their Duration Using Survival Trees—Results of an Analysis for Legal Persons and Organizational Entities Without Legal Personality in the Łódzkie Voivodship
Abstract
The studies carried out so far on established and liquidated enterprises in the Łódzkie Voivodship show that, for legal persons and organizational entities without legal personality, a firm’s duration is significantly longer than for natural persons conducting economic activity. These entities constitute a clearly different group of enterprises. The article presents the results of an analysis whose aim was the segmentation of legal persons and organizational entities without legal personality by their duration. A total of 10,562 enterprises were studied. Survival trees (CTree algorithm) were used to define groups of enterprises similar in duration, with the specific legal form, the firm’s location (county, “powiat”), the type of activity conducted, size (measured by the number of employees) and type of ownership as explanatory variables. The recursive partitioning method made it possible to divide the set of objects into homogeneous subsets. The survival function was then estimated in each of the obtained subsets with the Kaplan–Meier method. This approach enables a more precise estimate of a firm’s duration than the Kaplan–Meier estimator applied to the total data. Prediction error curves based on bootstrap cross-validation estimates of the prediction error were used to assess and compare the predictions obtained from both models.
Artur Mikulec, Małgorzata Misztal
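
The Kaplan–Meier step mentioned in the abstract above can be illustrated with a short product-limit estimator for a single group of firms. The function name and the seven toy durations are assumptions; the survival-tree segmentation (CTree) that precedes this step in the chapter is not reproduced.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 if the firm's liquidation was observed at durations[i]
    and 0 if the observation was censored at that time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk  # product-limit update
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical durations in years; 0 marks a firm still active (censored).
durations = [2, 3, 3, 5, 8, 8, 12]
events = [1, 1, 0, 1, 1, 0, 0]
curve = kaplan_meier(durations, events)
```

Running the same function separately on each segment produced by a partitioning step would give one survival curve per segment, which is the comparison the abstract describes.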
Corporate Bankruptcy Prediction with the Use of the Logit Leaf Model
Abstract
Various data classification methods are used for bankruptcy prediction. Among them is the logit leaf model, a hybrid classification algorithm that builds on logistic regression and decision trees. The logit leaf model consists of two stages. In the first stage, sets of companies are identified using a decision tree, and in the second stage a logit model is created for every leaf of this tree. The purpose of the paper is to present the results of research on the usefulness of the logit leaf model for corporate bankruptcy prediction. The added value of the paper is the application of the logit leaf model to the prediction of firms’ bankruptcy. The research was carried out with the use of 61 financial ratios for enterprises from the manufacturing sector in Poland. The CART classification tree, the logit model and the logit leaf model were applied. Models were constructed for balanced and non-balanced data sets. The bankruptcy prediction was made one year in advance. The following measures of prediction effectiveness were used: sensitivity, specificity, precision, F1, G-mean and AUC. The results of the research did not confirm an advantage of the hybrid approach over the use of individual classifiers. Calculations were performed in R.
Barbara Pawełek, Józef Pociecha
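
Most of the effectiveness measures listed in the abstract above follow directly from a confusion matrix, as the sketch below shows. The function name and the counts are assumptions for illustration; AUC is left out because it requires predicted scores rather than counts.

```python
import math


def classification_measures(tp, fp, tn, fn):
    """Threshold-based measures from confusion-matrix counts (bankrupt = positive)."""
    sensitivity = tp / (tp + fn)   # share of bankrupt firms correctly flagged
    specificity = tn / (tn + fp)   # share of healthy firms correctly cleared
    precision = tp / (tp + fp)     # share of flagged firms that are bankrupt
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical counts for a non-balanced test set.
m = classification_measures(tp=30, fp=10, tn=150, fn=20)
```

On non-balanced data the G-mean and F1 tend to be more informative than overall accuracy, because both fall when the minority (bankrupt) class is predicted poorly.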
The Impact of Longevity on a Valuation of Long-Term Investments Returns: The Case of Selected European Countries
Abstract
The impact of longevity risk has been a topic of growing importance in academic research and public debate over the past few years. From the individual’s perspective, the need for long-term investment increases as life expectancy increases. Improvements in longevity and the changing structure of the population affect the economy and financial stability. In this paper, we consider selected economic, financial and demographic variables in the context of their impact on longevity in terms of long-term investment. Principal component regression is used to construct long-term investment portfolios that are sensitive to risk factors according to the APT portfolio factor model for selected European countries. Three investment portfolios with different fixed risk profiles (low, medium and high) are proposed as the final results of the main research. For the selected European countries, the PCA-derived factors associated with longevity risk have a significant impact on the return of a long-term portfolio.
Grażyna Trzpiot

Applications in Economics

Frontmatter
Sustainable Development and Green Economy in the European Union Countries—Statistical Analysis
Abstract
The literature on the subject indicates that green growth is a direct result of implementing a sustainable development strategy. According to this assumption, countries that include sustainable development goals in their strategic documents should also achieve results in the area of green growth or the green economy. The monitoring of green growth should be based on indicators that make it possible to distinguish the green economy from the traditional one by taking into account, inter alia, indicators covering: green products, services, investments, and public procurement as well as green jobs in the green sectors of the economy. OECD proposes to use for this purpose indicators divided into four main groups: environmental and resource productivity, natural asset base, the environmental dimension of the quality of life and economic opportunities and policy responses. The aim of the work is to examine the relationships between sustainable development and green economy, especially in the area of their measurement and to determine the relationship between the results achieved by the EU countries in this area. The result of the research is the assessment of the results obtained by the EU countries in each of the analyzed areas using a taxonomic development measure based on the Weber median and the identification of relation between the results.
Katarzyna Cheba, Iwona Bąk
The Review of Indicators of Data Quality in Intra-Community Trade in Goods. The Choice of an Indicator and Its Effect on the Ranking of Countries
Abstract
This article deals with the issue of mirror data concerning intra-Community supplies of goods. Theoretically, in two sources—registers of two countries, trading partners—goods of the same value should be recorded. The observed asymmetries in the declared values were of interest to numerous researchers. The first one to pay special attention to these differences was Morgenstern (1963), who, among other things, dealt with the study of differences in data on world exports and imports. As a result of the literature review, the methods of examining the quality of mirror data in foreign trade (measures and indicators) were systematised. We have also put forward our own proposals. Selected indicators of data asymmetry were used in the study (including aggregated data asymmetry index ZW, aggregated weighted asymmetry index AER, symmetric mean absolute discrepancy index SMADI and general data asymmetry index OW). Similarities and differences in the obtained results were pointed out (values of indicators and ranking of countries were determined according to the quality of data). The research was conducted using data from the Eurostat COMEXT database—intra-Community supplies (ICS) and acquisitions (ICA) of goods between the EU member states in 2017 were analysed.
Iwona Markowicz, Paweł Baran
Development of ICT in Poland in Comparison with the European Union Countries—Multivariate Statistical Analysis
Abstract
Statistics Poland defines Information and Communication Technologies (ICT) as “a family of technologies that are processing, collecting and sending information in an electronic format.” Wide access to ICT, its ceaseless expansion and its growing range of applications form the core of contemporary social development in Poland, Europe and all over the world. Monitoring and analyzing changes in the ICT area are of great importance in both economic and social dimensions. ICT expansion is considered to be a stimulant for various processes taking place in the modern economy, significantly affecting innovation growth in many sectors, as well as increasing competitiveness on both micro- and macroeconomic scales. The study presents and discusses an assessment of the ICT development level in Poland against other European Union countries from the perspective of individual users and households. The reasons for the lack of Internet access in households and the types of online activities were also investigated. Attention was focused on identifying subgroups of countries similar in terms of ICT access and development in the studied societies over the years considered. The analysis was based on Eurostat and ITU data for the years 2008–2017. Exploratory data analysis methods dealing with three-way data structures (the between- and within-class principal component analysis) were applied. Factorial maps (scatterplots and biplots) were presented to summarize the results. The Hellwig method for linear ordering was used to rank the EU-28 countries by ICT development level.
Małgorzata Misztal, Aleksandra Kupis-Fijałkowska
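
The Hellwig method for linear ordering named in the abstract above can be sketched as follows: standardize the variables, take the best standardized value of each variable as the pattern, measure each object's Euclidean distance to that pattern, and rescale. The sketch assumes all variables are stimulants and uses population standard deviations; the function name, the country labels and the two indicator values are hypothetical and are not taken from the chapter.

```python
import math
from statistics import mean, pstdev


def hellwig_measure(rows):
    """Hellwig's synthetic measure for objects described by stimulants only."""
    columns = list(zip(*rows))
    # Standardize each variable (a constant column would divide by zero).
    z_cols = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z_cols.append([(v - m) / s for v in col])
    pattern = [max(col) for col in z_cols]      # the "pattern of development"
    z_rows = list(zip(*z_cols))
    dist = [math.dist(row, pattern) for row in z_rows]
    d0 = mean(dist) + 2 * pstdev(dist)          # normalizing constant
    return [1 - d / d0 for d in dist]


# Hypothetical indicators, e.g. two ICT access shares in percent.
data = {"A": (90.0, 80.0), "B": (70.0, 60.0), "C": (50.0, 40.0)}
scores = dict(zip(data, hellwig_measure(list(data.values()))))
ranking = sorted(scores, key=scores.get, reverse=True)
```

Destimulants would first need to be converted (for example by changing their sign after standardization) before the pattern is taken as the column maximum.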
Sensitivity Analysis in Causal Mediation Effects for TAM Model
Abstract
One of the goals of scientific research is to identify cause-effect relationships, which in many cases are inferred in non-experimental research designs, based on correlation measures or using regression methods. A special case is structural equation models (SEM), which are often incorrectly labeled “causal” models. The aim of the paper is to identify causal relationships in technology acceptance models (TAM) (Davis in MIS Q 13:319–340, 1989; Davis et al. in Manage Sci 35:982–1003, 1989) using the analysis of mediation effects and causal dependencies that stem from Markov’s causal rule. Causal relationships are identified using d-separation (Pearl in Stat Surv 3:96–146, 2009) and sensitivity analysis (Imai et al. in Stat Sci 1:51–71, 2010; Tingley et al. in J Stat Soft 59:1–38, 2013). A further aim is to assess the impact of unknown disturbing variables (confounders) affecting both the mediating and the focal dependent variables. The analysis allowed for simulations of the effect of correlated disturbances of the dependent variables in the TAM model on the degree of bias in the average causal mediation effect. The TAM model was built on the basis of research conducted on a quota sample of 150 students of the Cracow University of Economics.
Adam Sagan, Mariusz Grabowski

Applications in Social Problems

Frontmatter
Prentice–Williams–Peterson Models in the Assessment of the Influence of the Characteristics of the Unemployed on the Intensity of Subsequent Registrations in the Labour Office
Abstract
In the analysis of the duration of socio-economic phenomena, events subject to the study may occur more than once. They are called recurring or multiple events. Most analyses focus only on the first event and ignore the next one. In many cases, the risk of the next event occurring depends on the previous events. The aim of the paper is to analyse risk of subsequent registrations in the labour office depending on the characteristics of the unemployed (gender, age, education, seniority) using Prentice–Williams–Peterson’s conditional models. Two types of models for multidimensional survival data were used in this paper. The first one (PWP-CP model) considers the time until the event occurs from the beginning of observation, and the second one (PWP-GP model) considers the time from the previous event. The basis of these models is the stratified Cox proportional hazards model, in which the strata are created by subsequent events. These models are an extension of the classical approach to survival analysis. In the study, individual data of persons registered in the Poviat Labour Office in Szczecin were used. The research revealed that age and education influenced the risk of multiple registrations in the office, while gender and seniority did not have a significant impact. In a similar way, the characteristics of the unemployed affected the risk of first return to office. However, they did not affect subsequent registrations.
Beata Bieszk-Stolorz
Right-Skewed Distribution of Features and the Identification Problem of the Financial Autonomy of Local Administrative Units
Abstract
Linear ordering methods with ideal solutions may sometimes suffer from identification problems when determining the development levels of examined objects. These problems manifest themselves in the form of inconsistencies between the range of the constructed synthetic measure and the development level of the complex phenomenon it should depict—especially, when the measure’s low values are assigned to objects with obviously high level of development. Such inconsistencies often arise when simple features are strongly skewed—which is the case of the financial autonomy of the second-level local administrative units (communes) that is described by a number of asymmetric financial indicators. The aim of the research was to pose the problem of identifying levels of financial autonomy of Polish communes—assessed by synthetic measures constructed with ideal solution methods such as Hellwig’s and TOPSIS—and to present proposals to resolve it. Two variants of the classical Hellwig’s and TOPSIS methods were analyzed: standard and with correction of ideal values by the quartile criterion. Additionally, the positional TOPSIS method was also considered. It was found that with the standard classical methods or the positional TOPSIS, the prevalence of asymmetric simple features would reduce the range of the synthetic measure and shift it toward lower values. The variants with correction of ideal values had the range much broader and more centered, the broadest in the case of the corrected TOPSIS method. This contributed to the improvement of consistency between the identified levels of the communes financial autonomy and the synthetic measure values assigned to them.
Romana Głowicka-Wołoszyn, Feliks Wysocki
Multi-criteria Rankings with Interdependent Criteria: Case of EU Countries on Their Way to Healthy Lives and Well-Being
Abstract
One of the main assumptions when constructing multi-criteria rankings or using multivariate statistical analysis methods in comparative surveys is the independence of the diagnostic criteria considered. In practical research, however, the chosen essential properties of the objects under comparison are often not statistically independent. Another problem is the choice of adequate weights for the criteria. Weights are usually chosen subjectively (as in the AHP method), set equal (if there are no reasons justifying diversification) or justified statistically, taking into account the discriminant capability and information capacity of the criteria. Our proposal is to choose weights according to the values of the variance inflation factors (VIFs), that is, the diagonal elements of the inverse of the matrix of correlation coefficients across criteria. A greater VIF means a smaller information capacity of the criterion, and consequently a smaller weight is assigned. A second proposal is to choose weights based on principal component analysis (PCA) of the covariance matrix, which reduces the dimension of the criteria space. We compare these proposals for the classical simple average weighting method (SAW). A further proposal is the multi-criteria principal components (MCPC) method, in which weights are assigned to principal components. Our rankings are based on EUROSTAT indices for the 28 EU countries measuring their achievement of the targets of the UN 2030 Agenda for Sustainable Development Goal 3: “Ensure healthy lives and promote well-being for all at all ages” for the year 2017 (or the closest available year).
Iwona Konarzewska
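
The VIF-based weighting idea in the abstract above can be sketched in a few lines: the VIFs are read off the diagonal of the inverse correlation matrix, and criteria with larger VIFs receive smaller weights. The abstract does not give the exact weighting formula, so weights proportional to 1/VIF are an assumption here, as are the function names and the 3 × 3 correlation matrix.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_weights(corr):
    """Return (VIFs, weights), with weights assumed proportional to 1 / VIF."""
    inverse = invert(corr)
    vifs = [inverse[j][j] for j in range(len(corr))]
    raw = [1.0 / v for v in vifs]
    total = sum(raw)
    return vifs, [r / total for r in raw]


# Hypothetical correlations: criteria 1 and 2 overlap, criterion 3 is independent.
corr = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
vifs, weights = vif_weights(corr)
```

Here the two correlated criteria share information and are down-weighted, while the independent criterion receives the largest weight.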
The Comparison of Income Distributions for Women and Men in the European Union Countries
Abstract
The purpose of this study was to compare personal income distributions in countries of the European Union, taking into account gender differences. Using data from the EU-SILC project, the gender income gap for 28 European countries was examined. First, we examined the income inequalities of men and women in each country using the Oaxaca–Blinder decomposition procedure. The unexplained part of the gender pay gap gave us information about the wage discrimination. Second, we extended the decomposition procedure to different quantile points along the whole income distribution. To construct the counterfactual distribution, we used the recentered influence function—regression approach. We found that there exists an important diversity in the size of the gender pay gap across members of the European Union. The results obtained for these countries allowed us to group them into four clusters using the agglomerative clustering algorithm. The results of decomposition were analyzed and compared across the formulated groups of countries.
Joanna Landmesser
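
The Oaxaca–Blinder decomposition used in the abstract above splits a mean gap into a part explained by differences in characteristics and an unexplained part. The sketch below shows the twofold version for a single explanatory variable, taking the men's coefficients as the reference; that reference choice, the function names and the toy data are assumptions, and the quantile (recentered influence function) extension is not covered.

```python
from statistics import mean


def ols(x, y):
    """Intercept and slope of a one-regressor least-squares fit."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def oaxaca_blinder(x_m, y_m, x_w, y_w):
    """Twofold decomposition of mean(y_m) - mean(y_w), men's slope as reference."""
    a_m, b_m = ols(x_m, y_m)
    a_w, b_w = ols(x_w, y_w)
    gap = mean(y_m) - mean(y_w)
    explained = (mean(x_m) - mean(x_w)) * b_m          # differences in endowments
    unexplained = (a_m - a_w) + mean(x_w) * (b_m - b_w)  # differences in coefficients
    return gap, explained, unexplained


# Hypothetical data: years of education (x) and log income (y).
x_men, y_men = [10, 12, 14, 16], [2.0, 2.2, 2.5, 2.7]
x_women, y_women = [9, 11, 13, 15], [1.8, 2.0, 2.2, 2.5]
gap, explained, unexplained = oaxaca_blinder(x_men, y_men, x_women, y_women)
```

Because each fitted line passes through its group's means, the explained and unexplained parts add up to the raw gap exactly.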
Common Stochastic Mortality Trends for Multiple European Populations
Abstract
The main hypothesis for multi-population mortality models is that mortality rate differences for any two populations having similar socioeconomic status and close connections with each other do not diverge indefinitely over time. Recent evaluation studies have demonstrated that multi-population mortality models are superior to individual mortality forecasting models. However, the key point is to understand, extract and model the common trends driving the mortality patterns for a group of countries in order to improve national long-term mortality forecasts. The aim of the paper is twofold. Firstly, different approaches to identifying the existence of common mortality trends are discussed. Secondly, the time-varying mortality indicator derived from the Lee–Carter model is used to assess the similarities of different countries via a semi-parametric comparison approach. Two-country and multi-country cases are provided.
Justyna Majewska, Grażyna Trzpiot
Impact of the Selected Factors on the Men and Women Wages in Poland in 2014. The Conjoint Analysis Application
Abstract
Numerous studies on the labour market relate to wages. They indicate a significant diversity of factors affecting remuneration in groups of employees with different characteristics. The aim of the study is to assess the relative importance of selected variables (attributes of employees) for the level of wages. The analysis is conducted for employees working in Poland in 2014 in enterprises with at least ten workers. We also take into account sub-samples that include only men or only women. Additionally, the impact of outliers on changes in the relative importance of the analysed attributes was assessed. The analysis applies the relative importance measure used in conjoint analysis. The data employed in the study come from Eurostat’s Structure of Earnings Survey. The results indicate primarily two variables of greater importance for the diversity of wages: economic activity (industry, according to NACE Rev. 2) and occupation (according to ISCO-08). Their joint relative importance is greater than 50%. It is also noticeable that, as outliers are eliminated, the importance of the enterprise’s economic activity increases and the importance of occupation decreases. In the samples comprising men, higher relative importance is noted for the economic activity and the size of the enterprise. In turn, in the samples comprising women, we observe higher relative importance for occupation, educational level, age group and contractual working time (full- or part-time).
Aleksandra Matuszewska-Janica
Metadata
Title
Classification and Data Analysis
Editors
Prof. Krzysztof Jajuga
Jacek Batóg
Marek Walesiak
Copyright Year
2020
Electronic ISBN
978-3-030-52348-0
Print ISBN
978-3-030-52347-3
DOI
https://doi.org/10.1007/978-3-030-52348-0