n-Gram-based classification and unsupervised hierarchical clustering of genome sequences

https://doi.org/10.1016/j.cmpb.2005.11.007Get rights and content

Abstract

In this paper we address the problem of automated classification of isolates, i.e., the problem of determining the family of genomes to which a given genome belongs. Additionally, we address the problem of automated unsupervised hierarchical clustering of isolates according only to their statistical substring properties. For both of these problems we present novel algorithms based on nucleotide n-grams, with no required preprocessing steps such as sequence alignment. Results obtained experimentally are very positive and suggest that the proposed techniques can be successfully used in a variety of related problems. The reported experiments demonstrate better performance than some of the state-of-the-art methods. We report on a new distance measure between n-gram profiles, which shows superior performance compared to many other measures, including commonly used Euclidean distance.

Introduction

The number and sizes of genome databases have grown rapidly over the last few years. A huge amount of information requires new ways for processing them and using them in efficient ways. Two of the most important problems are classification and clustering of genomes, i.e., automatically determining a group to which a previously unseen genome sequence belongs and grouping the genome sequences into a tree structure according to their similarity. For example, distinguishing virus subspecies, strains and isolates is important in vaccine development, diagnostics, and other fields of biological and medical research and practice.

The genetic information of every organism is written in the universal code of DNA sequences, and the DNA sequence of any given organism can be obtained by the standard biochemical techniques. Using these sequences, it is now possible to catalogue and characterize any set of living organisms. Using such comparisons we can estimate the place of each organism in the family tree of living species—the “tree of life”. Phylogenetic inference is the process of developing hypothesis about evolutionary relatedness of organisms based on their observable characteristics. There are several techniques for constructing phylogenetic trees used in bioinformatics, including techniques on neighbor joining (e.g. [1], [2], [3]), maximum parsimony (e.g. [3], [4], [5]), maximum likelihood estimation (e.g. [6], [7], [8]), and others. Most of them are, directly or indirectly, based on multiple sequence alignment. Multiple alignment of complete large genomes can be very expensive and, in addition, it is practically impossible to align some highly plastic genomes to each other, since they can significantly differ in size, gene number and gene order. Therefore, there is a need for classification and clustering techniques that do not rely on sequence alignments. It is worth pointing that a technique for classification and clustering (like the one presented in this paper) may not be based upon biological models, but can still give very good potential for handling these problems. A recent method presented in [9] does not require multiple alignment and, in that sense, is related to the methods we propose in this paper, which also rely on n-gram analysis rather than on sequence alignments.1

In this paper we address the following two problems:

  • classification: given several families of genomes and a genome, determine the family to which it most likely belongs;

  • clustering: define a procedure for genome clustering using exclusively statistical substring properties of their nucleotide bases; such procedure should be effective, unsupervised, and should not require any expert knowledge for using it, which implies that it can be fully automated.

This work follows some of the ideas from [10], which includes results on using a character n-gram technique for the problem of authorship attribution, i.e., the problem of identifying the author of an anonymous text, or text whose authorship is in doubt. We address the problem of genome sequence classification using a similar method and extend the approach and ideas reported in [10].

The results obtained following the proposed technique are very positive and encouraging. We believe that the technique can find many applications, both in academic research and in medicine and industry.

Overview of the paper. In Section 2 we give some background information and basic notions. In Section 3 we introduce the notion of dissimilarity measures and present several dissimilarity functions. In Section 4 we report on our experimental results that led us to good dissimilarity functions. In Section 5 we discuss how the proposed technique can be used for genome sequence classification and in Section 6 we discuss how the proposed technique can be used for genome clustering and we present some experimental results. In Section 7 we briefly discuss the related work. In Section 8 we compare algorithms proposed in this paper with other previously published methods. In Section 9 we present some plans for future work and in Section 10 we draw final conclusions.

Section snippets

n-Grams

Definition 1

Given a sequence of tokens S=(s1,s2,,sN+(n1)) over the token alphabet A, where N and n are positive integers, an n-gram of the sequence S is any n-long subsequence of consecutive tokens. The i th n-gram of S is the sequence (si,si+1,,si+n1) [11].

Note that there are N such n-grams in S. There are (|A|)n different n-grams over the alphabet A (|A| is the size of A).

For example, if A is English alphabet, and l string on alphabet A, l=life_is_a_miracle” then 1-grams are: l, i, f, _, i, s, a, m,

Dissimilarity functions

Dissimilarity measure d is a function on two sets of sequences P1 and P2 (defining specific profiles) and it should reflect the dissimilarity between these two, i.e., it should meet the following conditions:

  • d(P,P)=0;

  • d(P1,P2)=d(P2,P1);

  • the value d(P1,P2) should be small if P1 and P2 are similar;

  • the value d(P1,P2) should be large if P1 and P2 are not similar.

The last two conditions are informal as the notion of similarity is not strictly defined. In the following text, by similarity of sequences

Evaluation of dissimilarity functions

Following the successful approach of the authorship attribution method presented in [10], we explore the use of the same or a similar technique to the analogous problem of classifying genome sequences. Given several groups of genome sequence and a genome sequence, the task is to determine a group to which the sequence most likely belongs. The method can be described in the following way: for the given set of families Pi, i=1,2,,k and the given genome sequence g, compute the dissimilarity

Genome sequence classification: experimental results

Experiment 4

Randomly select11

Hierarchical clustering problem

With positive results in genome sequence classification (Section 5), now we address a related, but more complex problem: hierarchical clustering of genome sequences. Our objective is to define an algorithm that can provide fully unsupervised hierarchical clustering of genome sequences. This clustering method would be based on pure statistical n-gram information, without using any additional domain knowledge, and it would rely on dissimilarity functions described in the previous text. Since the

Related work

This work follows some of the ideas from [10]. That paper reports on using n-grams for authorship attribution, i.e., for identifying the author of an anonymous text, or text whose authorship is in doubt. In that work, there is proposed a novel method for computer-assisted authorship attribution based on character level n-gram author profiles, which is motivated by a pioneering method in 1976 [31]. We follow ideas from [10], but apply them to another domain and also change the dissimilarity

Comparison to other methods

In this section we report on comparison results between our methods and some other methods. For all used test corpora our method gave very good results, better than the compared methods.

Future work

For our future work, we are planning to further develop techniques presented in this paper: to further investigate and improve the presented dissimilarity functions and the classification and clustering methods. We will try to explore if (and how) optimal length of n-grams depends on the size of genomes and cardinality of alphabet. Also, we are planning to apply the technique to other corpora and domains (not only in bioinformatics). We have already preformed preliminary experiments on three

Conclusions

In this paper we addressed the problems of automatic isolate classification, and clustering, i.e., unsupervised genome tree generation. For both of these problems we use techniques based on n-grams. For the classification problem, we follow some ideas from [10], while we changed the key ingredient of the technique—the dissimilarity function. For the clustering problem we presented two novel algorithms.

We tested the techniques on the corpus of 445 HIV-1, 18 HIV-2 isolates and 8 SHIV isolates

Acknowledgments

We are grateful to Prof. Dr. Gordana Pavlović-Lažetić, Prof. Dr. Christian Blouin, Dr. Miloš Beljanski, Dr. Michael Rebhan and Aleksandra Nestorović, for very useful suggestions and feedback on the preliminary version of this paper.

References (42)

  • R.C. Angell et al.

    Automatic spelling correction using trigram similarity measure

    Inform. Process. Manage.

    (1983)
  • E.M. Zamora et al.

    The use of trigram analysis for spelling error detection

    Inform. Process. Manage.

    (1981)
  • T. De Heer

    Experiments with syntactic traces in information retrieval

    Inform. Storage Retrieval

    (1974)
  • N. Saitou et al.

    The neighbor-joining method: a new method for reconstructing phylogenetic trees

    Mol. Biol. Evol.

    (1987)
  • Chenna et al.

    Multiple sequence alignment with the clustal series of programs

    Nucleic Acids Res.

    (2003)
  • J. Felsenstein

    PHYLIP: phylogeny inference package (version 3.2)

    Cladistics

    (1989)
  • A. Edwards et al.

    The reconstruction of evolution

    Ann. Hum. Genetics

    (1963)
  • J.H. Camin et al.

    A method for deducing branching sequences in phylogeny

    Evolution

    (1965)
  • O. Trelles et al.

    New phylogenetic venues opened by a novel implementation of the dnaml algorithm

    Bioinformatics

    (1998)
  • J. Felsenstein

    Taking variation of evolutionary rates between sites into account in inferring phylogenies

    J. Mol. Evol.

    (2001)
  • G.J. Olsen et al.

    Fastdnaml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood

    Comput. Appl. Biosci.

    (1994)
  • J. Qi et al.

    CVTree: a phylogenetic tree reconstruction tool based on whole genomes

    Nucleic Acids Res.

    (2004)
  • V. Kešelj et al.

    n-Gram-based author profiles for authorship attribution

  • D. Tauritz, Application of n-Grams, Department of Computer Science, University of Missouri-Rolla,...
  • J.L. Wisniewski

    Effective text compression with simultaneous digram and trigram encoding

    J. Inform. Sci.

    (1997)
  • El-N. Adnan et al.

    Handwriting recognition using position sensitive letter n-gram matching

  • J.C. Schmitt, Trigram-based method of language identification, U.S. Patent 5,062,143 (October...
  • W.B. Cavnar et al.

    n-Gram-based text categorization

  • J.S. Downie, Evaluating a simple approach to musical information retrieval: conceiving melodic n-grams as text, PhD...
  • C. Marceau

    Characterizing the behavior of a program using multiple-length n-grams

  • M. Ganapathiraju et al.

    Comparative n-gram analysis of whole-genome protein sequences

  • Cited by (89)

    View all citing articles on Scopus
    View full text