
2021 | Book

Data Science and SDGs

Challenges, Opportunities and Realities

Editors: Prof. Bikas Kumar Sinha, Md. Nurul Haque Mollah

Publisher: Springer Singapore


About this book

The book presents contributions on statistical models and methods applied to both data science and the SDGs in one place. Measuring progress on the SDGs and communicating that data-driven measurement to stakeholders call for more than traditional data gathering and manipulation techniques; here, the techniques of data science, especially big data analytics, play an important role. This book fills this space through its twenty contributions, selected from those presented at the 7th International Conference on Data Science and Sustainable Development Goals organized by the Department of Statistics, University of Rajshahi, Bangladesh. The contributions cover topics mainly in SDGs, bioinformatics, public health, medical informatics, environmental statistics, data science and machine learning.

The contents of the volume would be useful to policymakers, researchers, government entities, civil society, and nonprofit organizations for monitoring and accelerating the progress of the SDGs.

Table of Contents

Frontmatter
SDGs in Bangladesh: Implementation Challenges and Way Forward
Abstract
The Sustainable Development Goals (SDGs) were adopted by the UN as the post-2015 development agenda and comprise 17 goals and 169 targets. This agenda calls for action by all countries: poor, rich, and middle income. For its proper implementation, the General Economics Division (GED) of the Bangladesh Planning Commission has devised a mapping of ministries by targets (A Handbook on Mapping of Ministries by Targets in the Implementation of SDGs aligning with 7th Five Year Plan (2016–20), September 2016, GED, Bangladesh Planning Commission) and formulated a national action plan (National Action Plan of Ministries/Divisions by Targets for the Implementation of SDGs, June 2018, GED, Bangladesh Planning Commission) related to SDG goals and targets, linking SDG targets with the current 7th FYP to harmonize the national and global agendas.
Shamsul Alam
Some Models and Their Extensions for Longitudinal Analyses
Abstract
In this article, I present some of my statistical research in the field of longitudinal data analysis along with applications of these methods to real data sets. The aim is not to cover the whole field; rather, the perspective is based on my own personal preferences. The presented methods are mainly based on growth curve and mixture regression models and their extensions, where the focus is on continuous longitudinal data. In addition, an example of the analysis of extensive register data for categorical longitudinal data is presented. Applications range from forestry and health sciences to social sciences.
Tapio Nummi
Association of IL-6 Gene rs1800796 Polymorphism with Cancer Risk: A Meta-Analysis
Abstract
Interleukin-6 (IL-6) gene polymorphisms are crucial functional markers in the human body. Several genetic association studies have reported significant associations between the IL-6 gene and various major diseases and cancers. In this study, the association of the IL-6 gene polymorphism rs1800796 with cancer risk was investigated through a meta-analysis with a larger combined sample size. To assess the association of the IL-6 (−572 G/C) polymorphism with cancer risk, we extracted data from 27 eligible studies comprising 24,138 subjects through an efficient search strategy covering PubMed, PubMed Central, Web of Science, Google Scholar, and other relevant biological literature databases up to February 2019. We investigated the association by comparing allelic and genotypic case–control frequencies based on the odds ratio with 95% confidence interval and other statistical tests. According to the results, the rs1800796 SNP was significantly associated with increased risk of overall cancer (CG vs. CC + GG: OR = 1.12, 95% CI = 1.01–1.23, p = 0.0288), particularly for lung, stomach, and prostate cancer and for Asian ethnicity. These findings suggest that IL-6 gene polymorphisms may be appraised as a genetic biomarker for cancer risk.
Md. Harun-Or-Roshid, Md. Borqat Ali, Jesmin, Md. Nurul Haque Mollah
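For readers unfamiliar with the odds-ratio comparison used in the meta-analysis chapter above, the following is a minimal, hypothetical sketch of how an odds ratio and its 95% confidence interval are computed from a single 2×2 case–control table; the counts are invented for illustration and are not from the study. A meta-analysis would then pool such log odds ratios across studies, e.g., by inverse-variance weighting.

```python
# Hypothetical example: odds ratio and 95% CI from one 2x2 case-control table.
# The counts below are invented for illustration; they are not from the chapter.
import math

a, b = 320, 480   # cases:    risk-allele carriers, non-carriers
c, d = 250, 550   # controls: risk-allele carriers, non-carriers

or_hat = (a * d) / (b * c)                       # odds ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)     # standard error of log(OR)
lo = math.exp(math.log(or_hat) - 1.96 * se_log_or)
hi = math.exp(math.log(or_hat) + 1.96 * se_log_or)
print(f"OR = {or_hat:.2f}, 95% CI = {lo:.2f}-{hi:.2f}")
```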
Two Level Logistic Regression Analysis of Factors Influencing Dual Form of Malnutrition in Mother–Child Pairs: A Household Study in Bangladesh
Abstract
Bangladesh is undergoing a nutrition transition associated with rapid social and economic change, giving rise to the double burden of malnutrition. It is therefore essential to study malnutrition among mother and under-five child pairs at the household level. The objective of this study was to determine the prevalence and risk factors of malnutrition among mother and under-five child pairs in the same household in Bangladesh. Secondary data from the BDHS-2014 were used. The sample consisted of 7,368 married, currently non-pregnant Bangladeshi women with their under-five children. Descriptive statistics, Chi-square tests, and a two-level binary logistic regression model were used. The prevalence of underweight mother and under-five child pairs was 22.0%, the prevalence of overweight mother with underweight child was close to 10%, and fewer than 20% (19.6%) of mother and child pairs were of normal weight (healthy). The two-level binary logistic model showed that division, type of residence, parents' education, household wealth index, mother's age, and child birth weight are risk factors for undernutrition among mother and under-five child pairs. The selected model identified the risk factors of undernutrition among mother and under-five child pairs in Bangladesh, and these factors can be considered for reducing malnutrition among such pairs.
Md. Akhtaruzzaman Limon, Abu Sayed Md. Al Mamun, Kumkum Yeasmin, Md. Moidul Islam, Md. Golam Hossain
Divide and Recombine Approach for Analysis of Failure Data Using Parametric Regression Model
Abstract
The failure data of some products depend on factors or covariates such as the operating environment, usage conditions, etc. In this situation, a parametric regression model is applied to model the failure data of the product as a function of the covariates. Divide and recombine (D&R) is a new statistical approach to the analysis of big data: the data are divided into manageable subsets, an analytic method is applied independently to each subset, and the outputs are recombined. This chapter applies the D&R approach to the analysis of automobile component failure data using the Weibull regression model. Extensive simulation studies are presented to evaluate the performance of the proposed methodology in comparison with the traditional statistical estimation method.
Md. Razanmiah, Md. Kamrul Islam, Md. Rezaul Karim
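The divide-apply-recombine workflow described in the D&R chapter above can be illustrated with a minimal sketch. Everything in this sketch is hypothetical: it assumes uncensored simulated failure times, a single covariate, a Weibull regression in accelerated-failure-time form fitted by maximum likelihood, and simple averaging as the recombination rule; the chapter's actual data, censoring scheme, and recombination step may differ.

```python
# Minimal D&R sketch: divide the data, fit a Weibull regression on each subset,
# recombine by averaging the subset estimates. Illustrative assumptions only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Simulate failure times whose Weibull scale depends on one covariate x.
n = 30_000
x = rng.uniform(0, 1, n)
shape_true, b0_true, b1_true = 1.5, 1.0, 0.8
scale = np.exp(b0_true + b1_true * x)
t = scale * rng.weibull(shape_true, n)

def neg_loglik(params, t, x):
    """Negative log-likelihood of a Weibull AFT model (no censoring)."""
    log_k, b0, b1 = params
    k = np.exp(log_k)                 # shape parameter, kept positive
    lam = np.exp(b0 + b1 * x)         # scale parameter per unit
    z = t / lam
    ll = np.log(k) - np.log(lam) + (k - 1) * np.log(z) - z**k
    return -ll.sum()

def fit_subset(t, x):
    res = minimize(neg_loglik, x0=[0.0, 0.0, 0.0], args=(t, x),
                   method="Nelder-Mead")
    return res.x

# Divide into manageable subsets, apply the fit independently, recombine.
subsets = np.array_split(np.arange(n), 10)
estimates = np.array([fit_subset(t[idx], x[idx]) for idx in subsets])
combined = estimates.mean(axis=0)
print("shape:", np.exp(combined[0]), "beta0:", combined[1], "beta1:", combined[2])
```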
Performance of Different Data Mining Methods for Predicting Rainfall of Rajshahi District, Bangladesh
Abstract
Efficient rainfall prediction is always of interest for a particular region because timely and accurate rainfall forecasts are extremely helpful for taking necessary precautionary action in advance in agricultural production, flood management, drought monitoring, and ongoing construction projects. Data mining techniques are suitable for predicting different environmental attributes by extracting new relationships from past data, so researchers continually try to predict rainfall with maximum accuracy by optimizing and integrating different data mining techniques for different weather stations. In this study, we compare the forecasting performance of Linear Discriminant Analysis, Classification and Regression Trees, Random Forest, K-Nearest Neighbors, and Support Vector Machine for rainfall prediction in Rajshahi district, Bangladesh. Monthly time series data for the period January 1964 to December 2017 are considered for the analysis. The data mining steps of data collection, data pre-processing, modeling, and evaluation are strictly followed in the empirical study. The forecasting performance of the models is assessed by precision, recall, F-measure, and overall accuracy, as well as graphically. The empirical results show that the K-Nearest Neighbors method is the most suitable for predicting rainfall in Rajshahi district, Bangladesh, for the subsequent period.
Md. Mostafizur Rahman, Md. Abdul Khalek, M. Sayedur Rahman
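A skeleton of the model comparison described in the rainfall chapter above can be sketched with scikit-learn. The data, features, and class encoding below are placeholders (a synthetic classification problem with 648 rows standing in for the 648 monthly observations), not the chapter's actual Rajshahi series or preprocessing.

```python
# Hedged sketch: compare LDA, CART, Random Forest, KNN and SVM with
# precision, recall, F-measure and accuracy. Placeholder data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier          # CART
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Placeholder features and a binary rainfall class label (e.g., rain / no rain).
X, y = make_classification(n_samples=648, n_features=6, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}

for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:4s} precision={precision_score(y_te, y_hat):.3f} "
          f"recall={recall_score(y_te, y_hat):.3f} "
          f"f1={f1_score(y_te, y_hat):.3f} "
          f"accuracy={accuracy_score(y_te, y_hat):.3f}")
```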
Generalized Vector Autoregression Controlling Intervention and Volatility for Climatic Variables
Abstract
The purpose of this study is to build time series models for forecasting the climatic variables of Rajshahi district using the VAR model, controlling for intervention and volatility. Seven models for seven climatic variables are obtained, and the stability of every model is checked with proper validation techniques. The fitted models are GVAR with GARCH(2,1) and intervention for cloud coverage; GVAR with GARCH(3,1) and intervention for relative humidity; ARIMA(1,0,1) with GARCH(1,1) for rainfall; GVAR with GARCH(2,1) and intervention for maximum temperature; GVAR with ARCH(2) and intervention for minimum temperature; GVAR with intervention for sunshine; and ARIMA(2,0,2) for wind speed. The stable models are used to forecast the daily data, which may be beneficial to people and policymakers. Finally, the forecasts indicate that maximum temperature (T1), humidity (H), bright sunshine (S), and wind speed (W) might show an upward trend, while minimum temperature (T2), rainfall (R), and cloud coverage (Cl) might show a decreasing trend from 2018 to 2022. Considering the findings of this study, the government and policymakers can make people aware of the adverse effects of climate change.
Md. Ashek Al Naim, Md. Abeed Hossain Chowdhury, Md. Abdul Khalek, Md. Ayub Ali
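A rough outline of the VAR-plus-GARCH modeling strategy in the climate chapter above can be sketched with statsmodels and the arch package. This is only an illustrative skeleton on simulated series; the chapter's GVAR specification, intervention terms, and variable-specific orders (e.g., GARCH(2,1) or ARCH(2)) are not reproduced here.

```python
# Illustrative skeleton: fit a VAR to several climatic series, then model the
# conditional volatility of one residual series with GARCH. Simulated data only.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
from arch import arch_model

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "tmax": 30 + np.cumsum(rng.normal(0, 0.05, n)) + rng.normal(0, 1.0, n),
    "humidity": 70 + np.cumsum(rng.normal(0, 0.05, n)) + rng.normal(0, 2.0, n),
    "wind": 5 + rng.normal(0, 0.5, n),
}).diff().dropna()                         # difference the toy series to stationarity

var_res = VAR(df).fit(maxlags=7, ic="aic")  # joint mean dynamics across variables
resid = pd.DataFrame(var_res.resid, columns=df.columns)

# GARCH(1,1) on the residuals of one equation to capture volatility clustering.
garch_res = arch_model(resid["tmax"], vol="GARCH", p=1, q=1,
                       mean="Zero").fit(disp="off")
print("selected VAR lag order:", var_res.k_ar)
print(garch_res.params)
```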
Experimental Designs for fMRI Studies in Small Samples
Abstract
Functional Magnetic Resonance Imaging (fMRI) is a technology for studying how our brains respond to mental stimuli. At the design stage, one is interested in developing the best sequence of mental stimuli for collecting the most informative data, in order to render the most precise inference about the 'unknown parameters' under an assumed statistical model. The simplest such model incorporates a linear relation between the mean response and the parameters describing the effects of the stimuli, applied at regularly spaced time points during the study period. In this paper, we introduce the linear model and discuss estimation issues and related concepts such as 'orthogonality' and 'balance'.
Bikas K. Sinha
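The linear model and the notion of an 'orthogonal' stimulus design in the fMRI chapter above can be made concrete with a small numeric sketch. The design below is hypothetical, not from the chapter: two stimulus types presented at regularly spaced scans, an information matrix X^T X whose entry linking the two stimulus columns is zero (so their effect estimates are not confounded), and each stimulus appearing equally often ('balance').

```python
# Tiny illustration (not from the chapter): y = X beta + error for two stimulus
# types at regularly spaced scans, with an orthogonality check via X^T X.
import numpy as np

n_scans = 12
stim_A = np.tile([1, 0, 0, 0], n_scans // 4)   # stimulus A on every 4th scan
stim_B = np.tile([0, 0, 1, 0], n_scans // 4)   # stimulus B, offset by two scans
X = np.column_stack([np.ones(n_scans), stim_A, stim_B])   # intercept + 2 effects

# The (stim_A, stim_B) entry of X^T X is zero because the stimuli never coincide,
# and each stimulus appears the same number of times (balance).
print(X.T @ X)

# Least-squares estimate of the stimulus effects for a simulated response.
rng = np.random.default_rng(0)
beta_true = np.array([10.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(0, 0.5, n_scans)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)
```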
Bioinformatic Analysis of Differentially Expressed Genes (DEGs) Detected from RNA-Sequence Profiles of Mouse Striatum
Abstract
Bioinformatic analysis is a powerful statistical approach for investigating significant genes and their biological information from RNA-sequence (RNA-Seq)-based gene expression profiles. The most differentially expressed genes (DEGs) of the mouse striatum, together with their biological information, may contribute significantly to neuroscience research. Two inbred mouse strains, C57BL/6J (B6) and DBA/2J (D2), are commonly used in neuroscience research, and B6 strain sequences are the most widely available. Our study focuses on identifying significant DEGs between B6 and D2 samples, constructing the protein–protein interaction network, identifying their biological functions, molecular pathway analysis, miRNA–target gene interactions, downstream analysis, and finding driver genes. Samples from 10 B6 and 11 D2 mice were analyzed in depth; they were retrieved from the Gene Expression Omnibus (GEO) database under accession number GSE26024. The DESeq2, edgeR, and limma tools were used to screen DEGs between the B6 and D2 samples, identifying totals of 736, 757, and 530 DEGs with 37, 48, and 31 up-regulated genes, respectively. Protein–protein interaction networks of those DEGs were visualized using the Search Tool for the Retrieval of Interacting Genes (STRING) and Cytoscape software. We selected the top 50 high-degree hub DEGs for each of the three methods and found 21 common hub genes, along with three up-regulated genes, Bdkrb2, Aplnr, and Ccl28. To explore the biological insights of these 21 common hub DEGs, Gene Ontology (GO) and KEGG pathway analyses were executed. In the downstream analysis, hierarchical and k-means clustering techniques were used, and both methods clustered the Bdkrb2, Aplnr, and Ccl28 genes into the same group. These DEGs, specifically Bdkrb2, Aplnr, and Ccl28, are probably core genes in these inbred mouse strains and may serve as biomarkers for further neuroscience research.
Bandhan Sarker, Md. Matiur Rahaman, Suman Khan, Priyanka Bosu, Md. Nurul Haque Mollah
Role of Serum High-Sensitivity C-Reactive Protein Level as Risk Factor in the Prediction of Coronary Artery Disease in Hyperglycemic Subjects
Abstract
In this study, we evaluated the clinical value of the high-sensitivity C-reactive protein (hs-CRP) level in predicting the risk of coronary artery disease (CAD) in hyperglycemic subjects in Bangladesh. A total of 201 participants were selected, and fasting venous blood samples were collected from them to measure fasting plasma glucose (FPG), serum total cholesterol (TC), triglycerides (TG), LDL-cholesterol (LDL-C), HDL-cholesterol (HDL-C), hs-CRP, apolipoprotein A-1, apolipoprotein B and lipoprotein(a). The CAD risk of the study subjects was estimated using the Framingham Risk Score (FRS). Of the 201 participants, 91 were classified as normal fasting glucose (NFG), 56 as impaired fasting glucose (IFG) and 54 as diabetes mellitus. The average levels of TC, TG, LDL-C, HDL-C, the TC/HDL-C ratio, apolipoprotein A-1, apolipoprotein B and lipoprotein(a) did not differ significantly among the NFG, IFG and diabetes groups, whereas statistically significant (p < 0.001) differences in hs-CRP levels were observed among the groups. Among the components of the FRS, age, systolic blood pressure and HDL-C were significantly correlated with an increase in FPG concentration. The estimated Framingham 10-year risk of CAD and hs-CRP levels increased significantly with FPG concentration. Both before and after adjusting for covariates (age, sex, smoking status, TC and HDL-C), FPG was significantly associated with hs-CRP level, and serum hs-CRP levels were significantly higher in the higher FPG groups. Finally, this study demonstrates that hs-CRP is a strong predictor of cardiovascular events in hyperglycemic subjects, thereby helping to assess the risk of CAD induced by hyperglycemia.
Md. Saiful Islam, Rowshanul Habib, Md. Rezaul Karim, Tanzima Yeasmin
Identification of Outliers in Gene Expression Data
Abstract
Identification of outliers is a major challenge in big data, although it has drawn a great deal of attention in recent years. Among big data problems, the detection of outliers in gene expression data warrants extra attention because of its inherent complexity. Although a variety of outlier detection methods are available in the literature, Tomlins et al. (Science 310:644–648, 2005) argued that traditional analytical methods, for example a two-sample t-statistic, which search for common activation of genes across a class of cancer samples, will fail to detect cancer genes that show differential expression in only a subset of cancer samples (cancer outliers). They developed the cancer outlier profile analysis (COPA) method to detect cancer genes and outliers. Inspired by the COPA statistic, other authors have proposed methods for detecting cancer-related genes with cancer outlier profiles in the framework of multiple testing (Tibshirani and Hastie, Biostatistics 8:2–8, 2007; Wu, Biostatistics 8:566–575, 2007; Lian, Biostatistics 9:411–418, 2008; Wang and Rekaya, Biomarker Insights 5:69–78, 2010). Such cancer outlier analyses suffer from several problems; in particular, if there is an outlier in the dataset, classical measures of location and scale are seriously affected, so a test statistic based on these measures might not be appropriate for detecting outliers. In this study, we try to robustify one existing method and propose a new technique, the expressed robust t-statistic (ERT), for the identification of outliers. The usefulness of the proposed method is then investigated through a Monte Carlo simulation.
Md. Manzur Rahman Farazi, A. H. M. Rahmatullah Imon
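As a rough sketch of the kind of outlier-profile statistic discussed in the chapter above, the function below computes a generic COPA-like score for each gene: expression values are median-centred and scaled by the median absolute deviation (robust counterparts of the mean and standard deviation), and the r-th percentile of the cancer samples is used as the score. This is an illustration written for this summary, not the authors' expressed robust t-statistic (ERT).

```python
# Generic COPA-like outlier-profile score (illustration only; not the ERT method).
import numpy as np

def copa_scores(expr, is_cancer, r=90):
    """expr: genes x samples matrix; is_cancer: boolean mask over samples."""
    med = np.median(expr, axis=1, keepdims=True)
    mad = np.median(np.abs(expr - med), axis=1, keepdims=True) * 1.4826
    mad = np.where(mad == 0, 1e-12, mad)              # guard against zero spread
    z = (expr - med) / mad                            # robust standardisation
    return np.percentile(z[:, is_cancer], r, axis=1)  # r-th percentile per gene

# Toy data: 1000 genes, 40 samples, one gene over-expressed in a few cancer samples.
rng = np.random.default_rng(0)
expr = rng.normal(0, 1, (1000, 40))
is_cancer = np.arange(40) >= 20
expr[0, 35:] += 6.0                                   # outlier profile in gene 0
scores = copa_scores(expr, is_cancer)
print("gene 0 score:", scores[0], " typical score:", np.median(scores))
```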
Selecting Covariance Structure to Analyze Longitudinal Data: A Study to Model the Body Mass Index of Primary School-Going Children in Bangladesh
Abstract
In a longitudinal study, data are collected from the same subjects over time, and hence the data are correlated. To analyze such data, selecting an efficient covariance structure is very important for obtaining better results. This article therefore aims to select an efficient covariance structure to model the body mass index (BMI) of primary school-going children in Bangladesh. We first conducted a longitudinal survey to build a cohort of 100 primary school-going children in Sylhet city, Bangladesh, collecting information from the same children at the initial time (T0) and after 6 months (T6), 12 months (T12) and 18 months (T18). A linear mixed model (LMM) is applied to select an efficient covariance structure and then to model the body mass index. To find a better covariance structure for the collected longitudinal data, we compared the diagonal, unstructured (UN), first-order autoregressive (AR1) and compound symmetry (CS) covariance structures. Considering all criteria, the compound symmetry (CS) covariance structure gives better results for the LMM. Finally, using the CS covariance structure, we initially observed that the BMI of male students is comparatively smaller than that of female students (Estimate = −0.04, P-value = 0.03), but over time a reverse result is observed at T12 and T18. Taken together, we may conclude that compound symmetry (CS) gives better output for modeling the body mass index of primary school-going children. As female students are becoming more obese, and today's female children are the mothers of the future, parents should pay attention to reducing their daughters' body weight. This study may be useful for researchers in the public health sector when selecting a proper covariance structure to analyze longitudinal data.
Mohammad Ohid Ullah, Mst. Farzana Akter
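A minimal sketch of how a compound-symmetry covariance structure arises in practice: a linear mixed model with a random intercept per child implies a compound-symmetry marginal covariance for the repeated BMI measurements. The data, variable names, and covariates below are hypothetical and simulated; the chapter's actual model and software may differ.

```python
# Sketch: random-intercept LMM, which induces a compound-symmetry (CS) marginal
# covariance for the repeated BMI measurements. Simulated, hypothetical data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_children, times = 100, [0, 6, 12, 18]             # T0, T6, T12, T18 in months
child = np.repeat(np.arange(n_children), len(times))
time = np.tile(times, n_children)
sex = np.repeat(rng.integers(0, 2, n_children), len(times))   # 0=female, 1=male
b_i = np.repeat(rng.normal(0, 0.8, n_children), len(times))   # child-level intercept
bmi = (15.5 - 0.4 * sex + 0.03 * time + 0.02 * sex * time
       + b_i + rng.normal(0, 0.5, len(child)))

df = pd.DataFrame({"bmi": bmi, "time": time, "sex": sex, "child": child})

# Random intercept per child => Cov(y_ij, y_ik) = sigma_b^2 for j != k,
# i.e., the compound-symmetry structure compared in the chapter.
model = smf.mixedlm("bmi ~ time * sex", data=df, groups=df["child"])
result = model.fit()
print(result.summary())
```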
Statistical Analysis of Various Optimal Latin Hypercube Designs
Abstract
Among several Designs of Experiments (DoEs), the Latin Hypercube Design (LHD) is one of the most frequently used methods in physical experiments and in computer simulations for exploring the behavior of the response surface of a surrogate model with respect to the design points. A good experimental design should have three important characteristics, namely (i) non-collapsing, (ii) space-filling and (iii) orthogonality properties. Though the LHD inherently preserves the non-collapsing property, randomly generated LHDs have poor space-filling in terms of minimum pairwise distance. To ensure the last two properties, researchers frequently look for optimal LHDs in the sense of space-filling and orthogonality. Moreover, researchers frequently encounter the question of which distance measure is best for optimal designs. In the literature, several types of optimal LHDs are available, such as the Maximin LHD, Orthogonal LHD and Uniform LHD. On the other hand, two distance measures, the Euclidean and Manhattan distances, are frequently used to find optimal DoEs, but which of the two is better is still unknown. In this article, an intensive statistical analysis is carried out on numerical instances to explore the behavior of each optimal LHD. The main goal of this research is to characterize the well-known optimal designs from a statistical point of view. From this elementary experimental study, it appears that, in the sense of space-filling, the Euclidean distance-based Maximin LHD is the best; but if one needs space-filling along with a better orthogonality property, then the multi-objective (Maximin with approximate orthogonality) optimal LHD is relatively better than the Maximin LHD.
A. R. M. Jalal Uddin Jamali, Md. Asadul Alam, Abdul Aziz
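The maximin space-filling criterion and the Euclidean-versus-Manhattan comparison discussed in the LHD chapter above can be sketched as follows. This naive random-search routine (generate many random Latin hypercube designs and keep the one with the largest minimum pairwise distance) is an illustration of the criterion only, not the chapter's optimization algorithm.

```python
# Sketch (not the chapter's algorithm): pick the best of many random LHDs under
# the maximin criterion, for Euclidean and Manhattan distances.
import numpy as np
from scipy.stats import qmc
from scipy.spatial.distance import pdist

def best_maximin_lhd(n_points=20, dim=2, n_candidates=500,
                     metric="euclidean", seed=0):
    sampler = qmc.LatinHypercube(d=dim, seed=seed)
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        X = sampler.random(n=n_points)             # one random LHD in [0, 1]^dim
        score = pdist(X, metric=metric).min()      # maximin criterion
        if score > best_score:
            best, best_score = X, score
    return best, best_score

for metric in ("euclidean", "cityblock"):          # cityblock == Manhattan
    _, score = best_maximin_lhd(metric=metric)
    print(f"{metric:10s} best minimum pairwise distance: {score:.4f}")
```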
Erlang Loss Formulas: An Elementary Derivation
Abstract
The celebrated Erlang loss formulas, which express the probability that exactly j of c available channels/servers are busy serving customers, were discovered about 100 years ago. Today we ask: “What is the simplest proof of these formulas?” As an alternative to more advanced methods, we derive the Erlang loss formulas using (1) an intuitive limit theorem of an alternating renewal process and (2) recursive relations that are solved using mathematical induction. Thus, we make the Erlang loss formulas comprehensible to beginning college mathematics students. We illustrate decision making in some practical problems using these formulas and other quantities derived from them.
Jyotirmoy Sarkar
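For readers meeting these formulas for the first time: with offered load a = λ/μ and c channels, the Erlang loss (Erlang B) probabilities are P(j) = (a^j / j!) / Σ_{k=0}^{c} a^k / k!, and the blocking probability B(c, a) = P(c) satisfies the numerically stable recursion B(0, a) = 1, B(c, a) = a·B(c−1, a) / (c + a·B(c−1, a)). The short function below implements that standard recursion; the example numbers are illustrative and not from the chapter, which derives the formulas from first principles.

```python
# Erlang B blocking probability via the standard recursion
#   B(0, a) = 1,   B(c, a) = a * B(c-1, a) / (c + a * B(c-1, a)).
def erlang_b(c, a):
    """Probability that all c channels are busy, with offered load a = lambda/mu."""
    b = 1.0
    for k in range(1, c + 1):
        b = a * b / (k + a * b)
    return b

# Illustrative example: 10 channels, offered load of 7 erlangs.
print(f"Blocking probability: {erlang_b(10, 7.0):.4f}")
```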
Machine Learning, Regression and Optimization
Abstract
Machine learning is a subfield of artificial intelligence (AI). While AI is the ability of a machine to think like humans, machine learning is the ability of a machine to learn from data without explicit instructions. Applications of machine learning are abundant: stock-price forecasting; face, speech and handwriting recognition; medical diagnosis of conditions such as cancer, high blood pressure, diabetes, and neurological disorders including autism and spinal stenosis; and health monitoring, to name a few. Potential applications of machine learning to many other complex practical problems are currently being investigated. An ultimate goal of machine learning is to make predictions based on a properly trained model. Two major techniques of supervised machine learning are statistical regression and classification. For the best prediction, the parameters of the model need to be optimized, which is an optimization task. After giving a brief introduction to machine learning and describing the role of regression and optimization, the paper discusses in some detail the basics of the regression and optimization methods that are commonly used in machine learning. The paper is interdisciplinary, blending machine learning with statistical regression, numerical linear algebra and optimization. Thus, it will be of interest to a wide variety of audiences, ranging from mathematics, statistics and computer science to various branches of engineering.
Biswa Nath Datta, Biswajit Sahoo
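As a small companion to the regression-and-optimization theme of the chapter above, the sketch below fits the same least-squares line two ways: by solving the normal equations with numerical linear algebra, and by gradient descent, the prototypical optimization routine behind many machine-learning training loops. It is a generic illustration on simulated data, not code from the paper.

```python
# Least-squares regression solved two ways: normal equations vs. gradient descent.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # intercept + one feature
beta_true = np.array([2.0, -3.0])
y = X @ beta_true + rng.normal(0, 0.3, n)

# (1) Closed form: solve the normal equations X^T X beta = X^T y.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# (2) Gradient descent on the mean squared error (1/2n) * ||y - X beta||^2.
beta_gd = np.zeros(2)
lr = 0.5
for _ in range(2000):
    grad = X.T @ (X @ beta_gd - y) / n
    beta_gd -= lr * grad

print("normal equations:", beta_ne)
print("gradient descent:", beta_gd)
```

Both routes recover essentially the same coefficients; the closed form is exact but requires solving a linear system, while gradient descent scales to models where no closed form exists.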
Metadata
Title
Data Science and SDGs
Editors
Prof. Bikas Kumar Sinha
Md. Nurul Haque Mollah
Copyright Year
2021
Publisher
Springer Singapore
Electronic ISBN
978-981-16-1919-9
Print ISBN
978-981-16-1918-2
DOI
https://doi.org/10.1007/978-981-16-1919-9