n-Gram-based classification and unsupervised hierarchical clustering of genome sequences
Introduction
The number and sizes of genome databases have grown rapidly over the last few years. A huge amount of information requires new ways for processing them and using them in efficient ways. Two of the most important problems are classification and clustering of genomes, i.e., automatically determining a group to which a previously unseen genome sequence belongs and grouping the genome sequences into a tree structure according to their similarity. For example, distinguishing virus subspecies, strains and isolates is important in vaccine development, diagnostics, and other fields of biological and medical research and practice.
The genetic information of every organism is written in the universal code of DNA sequences, and the DNA sequence of any given organism can be obtained by the standard biochemical techniques. Using these sequences, it is now possible to catalogue and characterize any set of living organisms. Using such comparisons we can estimate the place of each organism in the family tree of living species—the “tree of life”. Phylogenetic inference is the process of developing hypothesis about evolutionary relatedness of organisms based on their observable characteristics. There are several techniques for constructing phylogenetic trees used in bioinformatics, including techniques on neighbor joining (e.g. [1], [2], [3]), maximum parsimony (e.g. [3], [4], [5]), maximum likelihood estimation (e.g. [6], [7], [8]), and others. Most of them are, directly or indirectly, based on multiple sequence alignment. Multiple alignment of complete large genomes can be very expensive and, in addition, it is practically impossible to align some highly plastic genomes to each other, since they can significantly differ in size, gene number and gene order. Therefore, there is a need for classification and clustering techniques that do not rely on sequence alignments. It is worth pointing that a technique for classification and clustering (like the one presented in this paper) may not be based upon biological models, but can still give very good potential for handling these problems. A recent method presented in [9] does not require multiple alignment and, in that sense, is related to the methods we propose in this paper, which also rely on n-gram analysis rather than on sequence alignments.1
In this paper we address the following two problems:
- •
classification: given several families of genomes and a genome, determine the family to which it most likely belongs;
- •
clustering: define a procedure for genome clustering using exclusively statistical substring properties of their nucleotide bases; such procedure should be effective, unsupervised, and should not require any expert knowledge for using it, which implies that it can be fully automated.
This work follows some of the ideas from [10], which includes results on using a character n-gram technique for the problem of authorship attribution, i.e., the problem of identifying the author of an anonymous text, or text whose authorship is in doubt. We address the problem of genome sequence classification using a similar method and extend the approach and ideas reported in [10].
The results obtained following the proposed technique are very positive and encouraging. We believe that the technique can find many applications, both in academic research and in medicine and industry.
Overview of the paper. In Section 2 we give some background information and basic notions. In Section 3 we introduce the notion of dissimilarity measures and present several dissimilarity functions. In Section 4 we report on our experimental results that led us to good dissimilarity functions. In Section 5 we discuss how the proposed technique can be used for genome sequence classification and in Section 6 we discuss how the proposed technique can be used for genome clustering and we present some experimental results. In Section 7 we briefly discuss the related work. In Section 8 we compare algorithms proposed in this paper with other previously published methods. In Section 9 we present some plans for future work and in Section 10 we draw final conclusions.
Section snippets
n-Grams
Definition 1 Given a sequence of tokens over the token alphabet , where N and n are positive integers, an n-gram of the sequence S is any n-long subsequence of consecutive tokens. The i th n-gram of S is the sequence [11].
Note that there are N such n-grams in S. There are different n-grams over the alphabet ( is the size of ).
For example, if is English alphabet, and l string on alphabet , ” then 1-grams are: l, i, f, , i, s, a, m,
Dissimilarity functions
Dissimilarity measure d is a function on two sets of sequences and (defining specific profiles) and it should reflect the dissimilarity between these two, i.e., it should meet the following conditions:
- •
;
- •
;
- •
the value should be small if and are similar;
- •
the value should be large if and are not similar.
The last two conditions are informal as the notion of similarity is not strictly defined. In the following text, by similarity of sequences
Evaluation of dissimilarity functions
Following the successful approach of the authorship attribution method presented in [10], we explore the use of the same or a similar technique to the analogous problem of classifying genome sequences. Given several groups of genome sequence and a genome sequence, the task is to determine a group to which the sequence most likely belongs. The method can be described in the following way: for the given set of families , and the given genome sequence g, compute the dissimilarity
Genome sequence classification: experimental results
Experiment 4 Randomly select11
Hierarchical clustering problem
With positive results in genome sequence classification (Section 5), now we address a related, but more complex problem: hierarchical clustering of genome sequences. Our objective is to define an algorithm that can provide fully unsupervised hierarchical clustering of genome sequences. This clustering method would be based on pure statistical n-gram information, without using any additional domain knowledge, and it would rely on dissimilarity functions described in the previous text. Since the
Related work
This work follows some of the ideas from [10]. That paper reports on using n-grams for authorship attribution, i.e., for identifying the author of an anonymous text, or text whose authorship is in doubt. In that work, there is proposed a novel method for computer-assisted authorship attribution based on character level n-gram author profiles, which is motivated by a pioneering method in 1976 [31]. We follow ideas from [10], but apply them to another domain and also change the dissimilarity
Comparison to other methods
In this section we report on comparison results between our methods and some other methods. For all used test corpora our method gave very good results, better than the compared methods.
Future work
For our future work, we are planning to further develop techniques presented in this paper: to further investigate and improve the presented dissimilarity functions and the classification and clustering methods. We will try to explore if (and how) optimal length of n-grams depends on the size of genomes and cardinality of alphabet. Also, we are planning to apply the technique to other corpora and domains (not only in bioinformatics). We have already preformed preliminary experiments on three
Conclusions
In this paper we addressed the problems of automatic isolate classification, and clustering, i.e., unsupervised genome tree generation. For both of these problems we use techniques based on n-grams. For the classification problem, we follow some ideas from [10], while we changed the key ingredient of the technique—the dissimilarity function. For the clustering problem we presented two novel algorithms.
We tested the techniques on the corpus of 445 HIV-1, 18 HIV-2 isolates and 8 SHIV isolates
Acknowledgments
We are grateful to Prof. Dr. Gordana Pavlović-Lažetić, Prof. Dr. Christian Blouin, Dr. Miloš Beljanski, Dr. Michael Rebhan and Aleksandra Nestorović, for very useful suggestions and feedback on the preliminary version of this paper.
References (42)
- et al.
Automatic spelling correction using trigram similarity measure
Inform. Process. Manage.
(1983) - et al.
The use of trigram analysis for spelling error detection
Inform. Process. Manage.
(1981) Experiments with syntactic traces in information retrieval
Inform. Storage Retrieval
(1974)- et al.
The neighbor-joining method: a new method for reconstructing phylogenetic trees
Mol. Biol. Evol.
(1987) - et al.
Multiple sequence alignment with the clustal series of programs
Nucleic Acids Res.
(2003) PHYLIP: phylogeny inference package (version 3.2)
Cladistics
(1989)- et al.
The reconstruction of evolution
Ann. Hum. Genetics
(1963) - et al.
A method for deducing branching sequences in phylogeny
Evolution
(1965) - et al.
New phylogenetic venues opened by a novel implementation of the dnaml algorithm
Bioinformatics
(1998) Taking variation of evolutionary rates between sites into account in inferring phylogenies
J. Mol. Evol.
(2001)
Fastdnaml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood
Comput. Appl. Biosci.
CVTree: a phylogenetic tree reconstruction tool based on whole genomes
Nucleic Acids Res.
n-Gram-based author profiles for authorship attribution
Effective text compression with simultaneous digram and trigram encoding
J. Inform. Sci.
Handwriting recognition using position sensitive letter n-gram matching
n-Gram-based text categorization
Characterizing the behavior of a program using multiple-length n-grams
Comparative n-gram analysis of whole-genome protein sequences
Cited by (89)
A novel alignment-free DNA sequence similarity analysis approach based on top-k n-gram match-up
2020, Journal of Molecular Graphics and ModellingCitation Excerpt :In addition, several approaches have been proposed for finding the optimal range of n [26–28]. Some methods directly use the n-gram frequency by measuring the distance between DNA sequences using similarity functions such as Jensen–Shannon divergence and Kullback–Leibler divergence [29–31]. In order to reduce the computational cost in various text operations using the n-gram method, some parts of the n-grams are ignored.
Accurate Open-Set Recognition for Memory Workload
2023, ACM Transactions on Knowledge Discovery from DataGenome classification with deep learning using heuristic algorithms for hyper-parameter optimization
2023, 17th International Conference on INnovations in Intelligent SysTems and Applications, INISTA 2023 - ProceedingsAutomated Authorship Attribution using CNG Distance on Blog Posts in the Serbian Language
2023, 2023 22nd International Symposium INFOTEH-JAHORINA, INFOTEH 2023Grouping of Mushroom 5.8s rRNA Sequences by Implementing Hierarchical Clustering Algorithm
2023, Applications of Machine Learning and Deep Learning on Biological Data