
## About this book

This book constitutes revised selected papers from the 9th International Conference on Computational Advances in Bio and Medical Sciences, ICCABS 2019, held in Miami, Florida, USA in November 2019.

The 15 papers presented in this volume were carefully reviewed and selected from 30 submissions. They deal with topics such as computational biology; biomedical image analysis; biological networks; cancer genomics; gene enrichment analysis; functional genomics; interaction networks; protein structure prediction; dynamic programming; and microbiome analysis.

## Table of Contents

### Detecting De Novo Plasmodesmata Targeting Signals and Identifying PD Targeting Proteins

Abstract
Subcellular localization plays an important role in protein function. In this paper, we developed a hidden Markov model (HMM) to detect de novo signals in protein sequences that target a particular cellular location: plasmodesmata (PD). We also developed a support vector machine (SVM) to classify plasmodesmata-located proteins (PDLPs) in Arabidopsis, and devised a decision-tree approach to combine the SVM and HMM for better classification performance. The methods achieved high performance, with an ROC score of 0.99 in a cross-validation test on a set of 360 type I transmembrane proteins in Arabidopsis. The predicted PD targeting signals in one PDLP have been experimentally verified.
Jiefu Li, Jung-Youn Lee, Li Liao

### The Agility of a Neuron: Phase Shift Between Sinusoidal Current Input and Firing Rate Curve

Abstract
The response of a neuron to a periodic input current signal is a periodic spike firing rate signal. The frequency of an input sinusoidal current and the surrounding environment, such as background noise, are two important factors that affect the firing rate output signal of a neuron model. This study focuses on the phase shift between input and output signals, and here we present a new concept, the agility of a neuron, to describe how fast a neuron can respond to a periodic input signal. By applying the agility score, we can characterize the surrounding environment; once the frequency of the periodic input signal is given, the actual angle of phase shift can then be determined, and therefore different neuron models can be normalized and compared with one another.
Chu-Yu Cheng, Chung-Chin Lu

### Efficient Sequential and Parallel Algorithms for Incremental Record Linkage

Abstract
Given a collection of records, the problem of record linkage is to cluster them such that each cluster contains all the records of one and only one individual. Existing algorithms for this important problem have large run times, especially when the number of records is large. Often, a small number of new records have to be linked with a large number of existing records. Linking the old and new records together might call for large run times. We refer to any algorithm that efficiently links the new records with the existing ones as an incremental record linkage (IRL) algorithm, and in this paper we offer novel IRL algorithms. Clustering is the basic approach we employ. Our algorithms use a novel random sampling technique to compute the distance between a new record and any cluster, and associate the new record with the cluster to which it is closest. The idea is to compute the distance between the new record and only a random subset of the cluster records. We can use a sampling lemma to show that this computation is very accurate. We have developed both sequential and parallel implementations of our algorithms. They outperform the best-known prior algorithm (called RLA). For example, one of our algorithms takes 71.22 s to link 100,000 records with a database of 1,000,000 records. In comparison, the current best algorithm takes 140.91 s to link 1,100,000 records. We achieve a very nearly linear speedup in parallel; e.g., we obtain a speedup of 28.28 with 32 cores. To the best of our knowledge, we are the first to propose parallel IRL algorithms. Our algorithms offer state-of-the-art solutions to the IRL problem.
Abdullah Baihan, Reda Ammar, Robert Aseltine, Mohammed Baihan, Sanguthevar Rajasekaran
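The sampling idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name a distance function or say how per-record distances are aggregated, so the edit distance, the averaging over the sample, and the names `edit_distance`, `sampled_distance`, and `assign_record` are all assumptions made for illustration.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sampled_distance(record, cluster, sample_size, rng):
    """Average distance from `record` to a random subset of `cluster`.

    Averaging is an assumption; the abstract only says that a random
    subset of the cluster records is used.
    """
    k = min(sample_size, len(cluster))
    sample = rng.sample(cluster, k)
    return sum(edit_distance(record, member) for member in sample) / k


def assign_record(record, clusters, sample_size=3, seed=0):
    """Return the index of the cluster with the smallest sampled distance."""
    rng = random.Random(seed)
    distances = [sampled_distance(record, c, sample_size, rng) for c in clusters]
    return min(range(len(clusters)), key=distances.__getitem__)


# Toy clusters (hypothetical records, one cluster per individual).
clusters = [
    ["john smith", "jon smith", "john smyth"],
    ["mary jones", "mary jonas", "marie jones"],
]
print(assign_record("john smithe", clusters))
```

The cost of placing a new record then depends on the sample size and the number of clusters, not on the total number of stored records, which is the property the abstract relies on.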

### Autoencoder Based Methods for Diagnosis of Autism Spectrum Disorder

Abstract
Autism Spectrum Disorder (ASD) is a neurological disorder that affects a person’s behavior and social interaction. By integrating machine learning algorithms with neuroimages, a diagnosis method can be established to distinguish ASD subjects from typical control (TC) subjects. In this study, we develop autoencoder-based ASD diagnosis methods. Firstly, we design an autoencoder to extract high-level features from raw features, which are defined based on eigenvalues and centralities of functional brain networks constructed with the entire Autism Brain Imaging Data Exchange 1 (ABIDE 1) dataset. Secondly, we use these high-level features to train several traditional machine learning methods (SVM, KNN, and subspace discriminant), which achieve a classification accuracy of 72.6% and an area under the receiver operating characteristic curve (AUC) of 79.0%. We also use these high-level features to train a deep neural network (DNN), which achieves a classification accuracy of 76.2% and an AUC of 79.7%. Thirdly, we combine the pre-trained autoencoder with the DNN and train the combined model, which achieves a classification accuracy of 79.2% and an AUC of 82.4%. Finally, we also train SVM, KNN, and subspace discriminant with the features extracted from the combination of the pre-trained autoencoder and the DNN, which achieves a classification accuracy of 74.6% and an AUC of 78.7%. These results show that our proposed methods for diagnosis of ASD outperform state-of-the-art studies.
Sakib Mostafa, Wutao Yin, Fang-Xiang Wu

### FastFeatGen: Faster Parallel Feature Extraction from Genome Sequences and Efficient Prediction of DNA $$N^6$$-Methyladenine Sites

Abstract
$$N^6$$-methyladenine is widely found in both prokaryotes and eukaryotes. It is involved in many biological processes, including the prokaryotic defense system and human diseases. It is therefore important to know its correct location in the genome, which may play a significant role in different biological functions. A few computational tools exist to serve this purpose, but they are computationally expensive and there is still scope to improve accuracy. An informative feature extraction pipeline from genome sequences is at the heart of these tools, as well as of many other bioinformatics tools, but it becomes expensive for sequential approaches when the data are large. Hence, a scalable parallel approach is highly desirable. In this paper, we have developed a new tool, called FastFeatGen, emphasizing both a parallel feature extraction technique and improved accuracy using machine learning methods. We have implemented our feature extraction approach using shared-memory parallelism, which achieves around a 10$$\times$$ speedup over the sequential one. We have then employed an exploratory feature selection technique, which helps to find more relevant features that can be fed to machine learning methods. We have employed the Extra-Tree Classifier (ETC) in FastFeatGen and performed experiments on rice and mouse genomes. Our experimental results achieve accuracies of 85.57% and 96.64%, respectively, which are better than or competitive with current state-of-the-art methods. Our shared-memory-based tool can also serve queries much faster than the sequential technique. All source code and datasets are available at https://github.com/khaled-rahman/FastFeatGen.
Md. Khaledur Rahman

### Optimized Multiple Fluorescence Based Detection in Single Molecule Synthesis Process Under High Noise Level Environment

Abstract
Single molecule sequencing contributes to overall human advancement in areas including, but not limited to, genomics, transcriptomics, clinical testing, drug development, and cancer screening. Among the available methods, fluorescence-based sequencing is the most widely employed in single molecule sequencing, specifically in the field of DNA sequencing. Contemporary fluorescence labeling methods utilize a charge-coupled device (CCD) camera to capture snapshots of multiple pixels during single molecule sequencing. We propose a method for fluorescence labeling detection with a single pixel, which offers high accuracy and low resource requirements in low signal-to-noise ratio conditions. Such a method also benefits from higher throughput compared to others. This study models the single molecule synthesis process using negative binomial distributions. In addition, including the method of maximum likelihood and the Viterbi algorithm in this modeling improves signal detection accuracy. The fluorescence-based model is most beneficial for simulating actual experimental processes and for understanding the relation between fluorescence emission and signal receiving events.
Hsin-Hao Chen, Chung-Chin Lu

### Deep Learning of CTCF-Mediated Chromatin Loops in 3D Genome Organization

Abstract
The three-dimensional organization of the human genome is of crucial importance for gene regulation. Results from high-throughput chromosome conformation capture techniques show that the CCCTC-binding factor (CTCF) plays an important role in chromatin interactions, and CTCF-mediated chromatin loops mostly occur between convergent CTCF-binding sites. However, it is still unclear whether and what sequence patterns in addition to the convergent CTCF motifs contribute to the formation of chromatin loops. To discover the complex sequence patterns for chromatin loop formation, we have developed a deep learning model, called DeepCTCFLoop, to predict whether a chromatin loop can be formed between a pair of convergent CTCF motifs using only the DNA sequences of the motifs and their flanking regions. Our results suggest that DeepCTCFLoop can accurately distinguish the convergent CTCF motif pairs forming chromatin loops from the ones not forming loops. It significantly outperforms CTCF-MP, a machine learning model based on word2vec and boosted trees, when using DNA sequences only. Moreover, we show that DNA motifs binding to ASCL1, SP2 and ZNF384 may facilitate the formation of chromatin loops in addition to convergent CTCF motifs. To our knowledge, this is the first published study of using deep learning techniques to discover the sequence motif patterns underlying CTCF-mediated chromatin loop formation. Our results provide useful information for understanding the mechanism of 3D genome organization. The source code and datasets used in this study for model construction are freely available at https://github.com/BioDataLearning/DeepCTCFLoop.
Shuzhen Kuang, Liangjiang Wang

### Effects of Various Alpha-1 Antitrypsin Supplement Dosages on the Lung Microbiome and Metabolome

Abstract
Patients with Alpha-1 Antitrypsin Deficiency (A1AD) have abnormally low levels of the protein Alpha-1 Antitrypsin (AAT) in their blood, because of a double mutation that makes the protein misfold and instead collect in the liver (sometimes even causing cirrhosis). The currently accepted single dosage (SD) of AAT supplements does not produce AAT blood concentrations anywhere near normal levels; they typically only reach the effect of having a single mutation. Some have therefore advocated for a double dosage (DD) of these treatments, which generally would be enough to approach these normal concentrations. Levels of cytokines, produced by the immune system in response to an attack, have already been observed to drop dramatically when A1AD patients consuming single dosage started taking double dosage, and then either remain the same or increase again upon return to a single dosage regimen. In this study we administer the same dosage sequence to A1AD patients (SD, DD, SD) for one month each and view the effects on their lung microbiome and metabolome. We analyze both at the end of each stage, comparing and contrasting and discovering potential biomarkers for each stage, and concluding with a discussion of potential implications.
Trevor Cickovski, Astrid Manuel, Kalai Mathee, Michael Campos, Giri Narasimhan

### A Multi-hypothesis Learning Algorithm for Human and Mouse miRNA Target Prediction

Abstract
MicroRNAs (miRNAs) are small non-coding RNAs that play a key role in regulating gene expression and thus in many cellular activities. Dysfunction of cells in these tasks is correlated with the development of several kinds of cancer. As the functionality of miRNAs depends on the location of their binding on their targets, binding site prediction has received a lot of attention in the last several years. Despite its importance, the mechanisms of miRNA targeting are still unknown. In this paper, we introduce an algorithm that partitions miRNA target duplexes according to hypotheses that each represents a different mechanism of targeting. The algorithm, called multi-hypothesis learner, examines all possible hypotheses to find out the optimum data partitions according to the performance of these hypotheses for miRNA target prediction. These hypotheses were then utilized to build a superior target predictor for miRNAs. Our method exploited biologically meaningful features for recognizing targets, which enables establishment of hypotheses that can be correlated with target recognition mechanisms. Test results show that the algorithm can provide comparable performance to state-of-the-art machine learning tools such as RandomForest in predicting miRNA binding sites. Moreover, feature selection on the partitions in our method confirms that the partitioning mechanism is closely related to biological mechanisms of miRNA targeting. The resulting data partitions can potentially be used for in vivo experiments to aid in discovery of the targeting mechanisms.
Mohammad Mohebbi, Liang Ding, Russell L. Malmberg, Liming Cai

### RiboSimR: A Tool for Simulation and Power Analysis of Ribo-seq Data

Abstract
RNA-seq and Ribo-seq are widespread quantitative methods for assessing transcription and translation. They can be used to detect differential expression, differential translation, and differential translation efficiency between conditions. The statistical power to detect differential genes is affected by multiple factors, such as the number of replicates, sequencing depth, magnitude of differential expression and translation, distribution of gene counts, and method for estimating biological variance. As power estimation of translational efficiency involves the combination of both RNA-seq measurements and Ribo-seq measurements, this task is particularly challenging. Here we propose a power assessment tool, called RiboSimR, based purely on data simulation. RiboSimR produces semi-parametric simulations that generate data based on real RNA-seq and Ribo-seq experiments, with customizable choices on baseline parameters and tool configurations. We demonstrate the usefulness of our tool by simulating data based on two published Ribo-seq datasets and analyzing various aspects of experimental design.
Patrick Perkins, Anna Stepanova, Jose Alonso, Steffen Heber

### Treatment Practice Analysis of Intermediate or High Risk Localized Prostate Cancer: A Multi-center Study with Veterans Health Administration Data

Abstract
Khajamoinuddin Syed, William Sleeman IV, Joseph Nalluri, Payal Soni, Michael Hagan, Jatinder Palta, Rishabh Kapoor, Preetam Ghosh

### Forecasting Model for the Annual Growth of Cryogenic Electron Microscopy Data

Abstract
In this paper, we develop a forecasting model for the growth of Cryogenic Electron Microscopy (Cryo-EM) experimental data time series using an autoregressive (AR) model. We employ the optimal modeling order that maximizes the estimation accuracy while maintaining the least normalized prediction error. The proposed model has been used to forecast the growth of cryo-EM data for the next 10 years, 2019–2028. Two time series are used: the annual number of released three-dimensional Electron Microscopy (3DEM) images, and the annual number of 3DEM images achieving a resolution of 10 Å or better. The data were collected from the public Electron Microscopy Data Bank (EMDB). The simulation results showed that the optimal model orders to estimate the two datasets are $$AR(5)$$ and $$AR(6)$$, respectively. Consequently, the optimal models obtained estimation accuracies of $$96.8\%$$ and $$85\%$$ for the 3DEM experiments time series and the 3DEM resolutions time series, respectively. Hence, the forecasting results reveal exponentially increasing behavior in the future growth of the annual number of released 3DEM images and, similarly, of the annual number of 3DEM images achieving a resolution of 10 Å or better.
Qasem Abu Al-Haija, Kamal Al Nasr
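As a rough illustration of the autoregressive forecasting described in the abstract above, the following Python sketch fits an AR(p) model by ordinary least squares and iterates it forward. The abstract does not state the estimation method or any data values, so the least-squares fit, the intercept term, the function names `fit_ar` and `forecast`, and the synthetic series are all assumptions, not the authors' procedure or EMDB data.

```python
import numpy as np


def fit_ar(series, order):
    """Least-squares fit of x[t] = c + a_1*x[t-1] + ... + a_p*x[t-p].

    Returns the coefficient vector [c, a_1, ..., a_p].
    """
    x = np.asarray(series, dtype=float)
    # Each row holds the `order` values preceding x[t], most recent first.
    rows = [x[t - order:t][::-1] for t in range(order, len(x))]
    design = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coeffs, *_ = np.linalg.lstsq(design, x[order:], rcond=None)
    return coeffs


def forecast(series, coeffs, steps):
    """Iterate the fitted model forward, feeding predictions back in."""
    order = len(coeffs) - 1
    history = [float(v) for v in series]
    for _ in range(steps):
        lags = history[-1:-order - 1:-1]  # last `order` values, most recent first
        history.append(coeffs[0] + float(np.dot(coeffs[1:], lags)))
    return history[len(series):]


# Synthetic annual counts for illustration only (not EMDB data).
annual_counts = [3, 5, 9, 14, 24, 38, 60, 97, 155, 250]
coeffs = fit_ar(annual_counts, order=2)
print(forecast(annual_counts, coeffs, steps=3))
```

Selecting the order, as the abstract describes, would amount to repeating the fit for several values of `order` and comparing prediction error on held-out years.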

### Local and Global Stratification Analysis in Whole Genome Sequencing (WGS) Studies Using LocStra

Abstract
We are interested in the analysis of local and global population stratification in WGS studies. We present a new R package (locStra) that utilizes the covariance matrix, the genomic relationship matrix, and the unweighted/weighted genetic Jaccard similarity matrix in order to assess population substructure. The package allows one to use a tailored sliding window approach, for instance using user-defined window sizes and metrics, in order to compare local and global similarity matrices. A technique to select the window size is proposed. Population stratification with locStra is efficient due to its C++ implementation, which fully exploits sparse matrix algebra. The runtime for the genome-wide computation of all local similarity matrices typically does not exceed one hour for realistic study sizes. This makes an unprecedented investigation of local stratification across the entire genome possible. We apply our package to the 1000 Genomes Project.
Georg Hahn, Sharon Marie Lutz, Julian Hecker, Dmitry Prokopenko, Christoph Lange
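To make the sliding-window comparison in the abstract above concrete, here is a small Python sketch of an unweighted Jaccard similarity matrix computed globally and per window. It does not use or reproduce the locStra API (locStra is an R package). The row-per-individual layout, the convention that two individuals with no variants get similarity 0, the Frobenius norm as comparison metric, and the names `jaccard_matrix` and `sliding_windows` are assumptions for illustration.

```python
import numpy as np


def jaccard_matrix(genotypes):
    """Pairwise Jaccard similarity between individuals (rows).

    A variant counts as present for an individual when its entry is > 0.
    """
    g = (np.asarray(genotypes) > 0).astype(np.int64)
    inter = g @ g.T                       # shared variants per pair
    counts = g.sum(axis=1)                # variants per individual
    union = counts[:, None] + counts[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        # Assumption: an empty union is assigned similarity 0.
        return np.where(union > 0, inter / union, 0.0)


def sliding_windows(n_variants, window_size):
    """Yield half-open [start, stop) index ranges of non-overlapping windows."""
    for start in range(0, n_variants, window_size):
        yield start, min(start + window_size, n_variants)


# Toy genotype matrix: 3 individuals x 6 variants (hypothetical data).
genotypes = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
])
global_sim = jaccard_matrix(genotypes)
for start, stop in sliding_windows(genotypes.shape[1], window_size=3):
    local_sim = jaccard_matrix(genotypes[:, start:stop])
    gap = np.linalg.norm(local_sim - global_sim)
    print(f"variants [{start}, {stop}): Frobenius gap to global = {gap:.3f}")
```

Because the matrix products only touch nonzero genotype entries, the same computation maps naturally onto sparse matrices, which is the efficiency argument the abstract makes for the C++ implementation.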

### A New Graph Database System for Multi-omics Data Integration and Mining Complex Biological Information

Abstract
Due to the advancement in high throughput technologies and robust experimental designs, many recent studies attempt to incorporate heterogeneous data obtained from multiple technologies to improve our understanding of the molecular dynamics associated with biological processes. Currently available technologies produce a wide variety and large amounts of data spanning genomics, transcriptomics, proteomics, and epigenetics. Because such multi-omics data are very diverse and come from different biological levels, it has been a major research challenge to develop a model that properly integrates all available and relevant data to advance biomedical research. Many researchers have argued that the integration of multi-omics data to extract relevant biological information is currently one of the major biomedical informatics challenges. This paper proposes a new graph database model to efficiently store and mine multi-omics data. We show a working model of this graph database with transcriptomics, genomics, epigenetics and clinical data for three cancer types from the Cancer Genome Atlas. Moreover, we highlight the usefulness of graph database mining to extract relevant biological interpretations and also to find novel relationships between different data levels.
Ishwor Thapa, Hesham Ali

### SMART2: Multi-library Statistical Mitogenome Assembly with Repeats

Abstract
SMART2 is an enhanced version of the SMART pipeline for mitogenome assembly from low-coverage whole-genome sequencing (WGS) data. Novel features include automatic selection of the optimal number of read pairs used for assembly and the ability to assemble multiple sequencing libraries when available. SMART2 succeeded in generating mitochondrial sequences for 26 metazoan species with WGS data but no previously published mitogenomes in NCBI databases. The SMART2 pipeline is publicly available via a user-friendly Galaxy interface at https://neo.engr.uconn.edu/?tool_id=SMART2.