Confidence limits for genome DNA copy number variations in HR-CGH array measurements

https://doi.org/10.1016/j.bspc.2013.11.007

Highlights

  • Genome copy number variations are estimated in the presence of large noise.

  • The estimation accuracy is limited by jitter in the breakpoints.

  • The approximate jitter distribution is shown to be the discrete skew Laplace.

  • Lower and upper bounds on the estimates are derived.

  • UB and LB masks for the estimates are suggested for medical applications.

Abstract

Estimation of genome copy number variations (CNVs) measured using the high-resolution array-comparative genomic hybridization (HR-CGH) microarray is commonly carried out in the presence of large white Gaussian noise with different segmental variances. Medical experts must therefore pay close attention to the confidence limits for CNVs in order to make correct decisions about genomic changes. We carry out a probabilistic analysis of CNVs in HR-CGH microarray measurements and show that jitter in the breakpoints can be approximated with the discrete skew Laplace distribution. Using this distribution, we find the upper and lower confidence boundaries within which the genomic changes exist with a confidence probability of 99.73%. We suggest combining these boundaries with the estimates to give medical experts more information about actual CNVs. The theory is verified experimentally by simulation and with real HR-CGH microarray-based measurements.

Introduction

A disease such as cancer is often accompanied by structural changes, called copy-number variations (CNVs), in the deoxyribonucleic acid (DNA) of a genome essential for human life. A cell typically carries a varying number of copies of one or more sections of its DNA, which results in structural chromosomal rearrangements: deletions, duplications, inversions and translocations of certain parts [1], [2], [3]. Small CNVs of this kind are present in many forms in the human genome, including single-nucleotide polymorphisms, small insertion–deletion polymorphisms, variable numbers of repetitive sequences, and genomic structural alterations [4]. When genomic aberrations involve large CNVs, the process has been shown to be directly coupled with cancer, and the relevant structural changes are called copy-number alterations (CNAs) [5]. A brief survey of the types of chromosome alterations involving copy number changes is given in [6]. The copy number represents the number of DNA molecules in a cell and can be defined as the number of times a given segment of DNA is present in a cell. Because DNA is usually double-stranded, the size of a gene or chromosome is often measured in base pairs. A commonly accepted unit of measurement in molecular biology is the kilobase (kb), equal to 1000 base pairs of DNA [7]. The human genome with 23 chromosomes is estimated to be about 3.2 billion base pairs long and to contain 20,000–25,000 distinct genes [8]. Each CNV may range from about 1 kb to several megabases (Mb) in size [1].

Array-comparative genomic hybridization (aCGH) is one of the most modern techniques employing chromosomal microarray analysis to detect CNVs at a resolution of 5–10 kb [9]. It was reported in [10] that high-resolution CGH (HR-CGH) arrays can accurately detect structural variations (SVs) at a resolution of 200 bp. In the microarray technique, the CNVs are often normalized and plotted as $\log_2 (R/G) = \log_2 \text{Ratio}$, where $R$ and $G$ are the fluorescent Red and Green intensities, respectively [11]. A troublesome feature of such measurements is that the ratio is heavily contaminated by noise whose intensity does not always allow for correct visual identification of the breakpoints and copy numbers, and which makes most estimation techniques inefficient if the number of probes per CNV segment is small. Deletions as small as 300 bp should also be detected in some cases. For instance, arrays with a 9-bp tiling path were used in [10] to map a 622-bp heterozygous deletion. The referenced studies did not use a genome-wide platform, and the high resolution reported is limited to targeted studies using custom platforms. Further progress in ultra-high probe resolution is therefore desirable, as shown, for example, in [12].

Based on the physical nature of genome changes in human cells [1], the following distinct properties of the CNV function were recognized [6]. It is piecewise-constant (PWC) and sparse, with a small number of alterations over a long base-pair length. Its constant values are integers, although this property does not survive in the $\log_2$ ratio. Moreover, the constant value may not be an integer for somatic alterations, such as those seen in cancer, where a mixture of normal and affected cells might coexist. The measurement noise in the $\log_2$ ratio is intense and can be modeled as additive white Gaussian.
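To illustrate why integer copy numbers do not stay integer on the log2-ratio scale, the following minimal sketch computes the ideal (noise-free) log2 ratio of a segment. The function name `log2_ratio`, the diploid reference of 2, and the linear mixture of affected and normal cells controlled by `tumor_fraction` are assumptions made here for illustration; they are not taken from the paper.

```python
import math

def log2_ratio(copy_number, reference=2, tumor_fraction=1.0):
    """Ideal (noise-free) log2 ratio of one segment.

    Assumption: the measured intensity is a linear mixture of affected
    cells (copy_number) and normal cells (reference). The mixture must be
    positive, otherwise math.log2 raises ValueError.
    """
    mixed = tumor_fraction * copy_number + (1.0 - tumor_fraction) * reference
    return math.log2(mixed / reference)

# Integer copy numbers map to non-integer log2 ratios (except 1, 2 and 4).
for c in (1, 2, 3, 4):
    print(c, round(log2_ratio(c), 3))   # -1.0, 0.0, 0.585, 1.0

# A one-copy loss present in only half of the cells is shifted toward zero.
print(round(log2_ratio(1, tumor_fraction=0.5), 3))  # -0.415
```

Under these assumptions, a mixture of normal and affected cells pulls every segmental level toward zero, which is one reason small changes are hard to separate from noise.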

A typical picture of the HR-CGH array measurement of the copy numbers generated with 8000 probes having resolution of 620 bp and obeying the above-listed properties is sketched in Fig. 1. One recognizes here the breakpoints (edges) and segments with constant copy numbers. Typically, a chromosome section is observed with some average resolution $\bar{r}$, bp, and $M$ probes in the genomic location scale. The copy numbers change at $L$ breakpoints, $0 < i_1 < \cdots < i_L < \bar{r}M$, which can be united in a vector

$$I = [i_1 \; i_2 \; \ldots \; i_L]^T \in \mathbb{R}^L,$$

where $i_l$, $l \in [1, L]$, is the $l$th breakpoint location.

To facilitate the algorithm design, the genomic location scale is often represented in the number of probes $n \in [1, M]$, ignoring “bad” or empty measurements. In such a scale, the $n_l$th discrete point corresponds to the $i_l$th breakpoint in the genomic location scale. The points placed as $0 < n_1 < \cdots < n_L < M$ can also be united in a vector

$$N = [n_1 \; n_2 \; \ldots \; n_L]^T \in \mathbb{R}^L.$$

Although $N$ facilitates the algorithm design, the final estimate is typically represented in the genomic location scale as in Fig. 1.

For $L$ breakpoints, there are $L + 1$ segmental changes $a_j$, $j \in [1, L + 1]$, which can be united in a vector

$$a = [a_1 \; a_2 \; \ldots \; a_{L+1}]^T \in \mathbb{R}^{L+1},$$

where $a_j$ characterizes a segment between $i_{j-1}$ and $i_j$ on an interval $[i_{j-1}, i_j - 1]$.
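The signal model described above (breakpoints on the probe scale, one constant level per segment, additive white Gaussian noise with segment-dependent variance) can be sketched as a small simulator. The function name `simulate_cnv`, its arguments, and the boundary convention (the first segment starts at probe 1 and the last ends at probe M) are illustrative assumptions, not the authors' code.

```python
import random

def simulate_cnv(M, breakpoints, levels, sigmas, seed=0):
    """Return M noisy samples of a piecewise-constant profile, probes 1..M.

    breakpoints: [n_1, ..., n_L] with 0 < n_1 < ... < n_L < M
    levels:      [a_1, ..., a_{L+1}], one constant per segment
    sigmas:      noise standard deviation of each segment
    Assumed convention: segment j covers probes n_{j-1} .. n_j - 1,
    with n_0 = 1 and n_{L+1} = M + 1.
    """
    assert len(levels) == len(sigmas) == len(breakpoints) + 1
    assert all(0 < b < M for b in breakpoints)
    assert all(x < y for x, y in zip(breakpoints, breakpoints[1:]))
    rng = random.Random(seed)
    edges = [1] + list(breakpoints) + [M + 1]
    y = []
    for j in range(len(levels)):
        for _ in range(edges[j], edges[j + 1]):
            y.append(levels[j] + rng.gauss(0.0, sigmas[j]))
    return y

# Example: a gain followed by a one-copy loss, with a noisier last segment.
y = simulate_cnv(M=200, breakpoints=[50, 120],
                 levels=[0.0, 0.58, -1.0], sigmas=[0.3, 0.3, 0.5])
print(len(y))  # 200
```

Working on the probe scale keeps the indexing simple; mapping the estimated breakpoints back to base pairs would require the probe positions, which this sketch does not model.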

The CNV estimation problem is thus to estimate the breakpoint locations $\hat{I}$ and the segmental changes $\hat{a}$ with the maximum possible accuracy and a precision acceptable for medical applications. Those who are familiar with estimation theory can instantly characterize each of the specific regions numbered in Fig. 1. The relevant components of both $I$ and $a$ can easily be estimated in case (1), because the number of probes is large in each neighboring segment and the edges are sharp. In case (2), the component of $I$ is well detectable, but the estimate of the relevant component of $a$ may be imprecise owing to a small number of probes. Reasoning similarly, one may conclude that it is hard to estimate the components of $I$ and $a$ in case (3). Finally, cases (4) and (5) suggest that the segmental estimates may have enough precision, whereas the estimates of the edges may not. An analysis of the estimation errors caused by the segmental noise and jitter in the breakpoints is thus required.

Let us consider the microarray-based measurement of the CNVs in more detail. Fig. 2 gives several simulated examples around the $l$th breakpoint with different realizations of the measurement white Gaussian noise having, for simplicity, equal segmental variances. The threshold (dashed) is placed equidistantly between the segmental changes. The breakpoint location is found by maximum likelihood (ML). Case (a) is ideal in the sense that, with such locations of the measured points, the ML estimate will be jitter-free. If some points immediately to the left of $i_l$ lie below the threshold (dashed), then the ML estimate will be found to the left of $i_l$: four points to the left in case (b). We call this the left jitter. If some right-neighboring points lie as in Fig. 2c, then the ML estimate will be found to the right of $i_l$: two points to the right in case (c). We call this the right jitter. Ambiguities may also be observed, as in case (d), when the estimator gives two or more possible locations for the same breakpoint.
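The left and right jitter described above can be reproduced with a minimal sketch. It assumes a single breakpoint, known segmental levels, and equal noise variance on both sides, in which case the ML estimate is the split that minimizes the residual sum of squares. The names `ml_breakpoint` and `jitter_samples` are hypothetical, and ties (the ambiguities of case (d)) are resolved here by simply taking the leftmost minimizer.

```python
import random

def ml_breakpoint(y, a_left, a_right):
    """ML breakpoint for known levels and equal-variance Gaussian noise.

    Returns k, the 0-based index of the first sample assigned to the
    right segment, chosen to minimize the residual sum of squares.
    """
    # Cost for k = 0: every sample is assigned to the right segment.
    cost = sum((v - a_right) ** 2 for v in y)
    best_k, best_cost = 0, cost
    for k in range(1, len(y) + 1):
        v = y[k - 1]  # sample k-1 moves from the right to the left segment
        cost += (v - a_left) ** 2 - (v - a_right) ** 2
        if cost < best_cost:  # strict '<' keeps the leftmost minimizer
            best_k, best_cost = k, cost
    return best_k

def jitter_samples(runs, half_width, a_left, a_right, sigma, seed=0):
    """Monte Carlo jitter (estimate minus truth) around one breakpoint.

    Negative values are left jitter, positive values are right jitter.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(runs):
        y = [a_left + rng.gauss(0.0, sigma) for _ in range(half_width)]
        y += [a_right + rng.gauss(0.0, sigma) for _ in range(half_width)]
        out.append(ml_breakpoint(y, a_left, a_right) - half_width)
    return out

# Two left neighbors below the threshold 0.5 pull the estimate two points left.
print(ml_breakpoint([1.0, 1.0, 0.2, 0.3, 0.0, 0.0], 1.0, 0.0) - 4)  # -2
```

A histogram of `jitter_samples(...)` for a chosen signal-to-noise ratio gives an empirical jitter distribution that can be compared with an analytical approximation.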

Over the decades, many estimation approaches have been developed for PWC signals such as those generated by chromosomal changes: wavelet-based [13], [14], robust [14], adaptive kernel smoothers [15], ML based on Gauss's ordinary least squares (OLS), the penalized bridge estimator [16] and ridge regression [17] (also known as Tikhonov regularization), the fused least-absolute shrinkage and selection operator (Lasso) [18], the Schwarz information criterion-based estimator [19], [20], and forward-backward smoothers [21], [22], [23]. Some of these methods were adopted and developed in bioinformatics for the conditions of PWC chromosomal changes. Efficient algorithms for filtering, smoothing and detection were proposed in [24], [25], [13], [14], [26], [20]. Methods for segmentation and modeling were developed in [27], [28], [29], [19], [30], [31], [24]. Sparse representation based on penalized optimization and Bayesian learning was provided in [32], [33], [34], [35].

Applying different estimation techniques to measurements such as those in Figs. 1 and 2 leads to the following conclusion. In view of the large segmental noise, no single estimator is able to provide jitter-free breakpoint detection and error-free segmental estimation. The estimation errors may be large and even unacceptable. Medical experts should thus pay close attention to the confidence limits within which the true CNVs exist with high probability.

In this paper, we provide a probabilistic analysis of CNVs in HR-CGH microarray measurements. We derive an approximate jitter distribution, employ the segmental Gaussian distribution, and find the confidence upper boundary (UB) and lower boundary (LB) in order to outline the confidence interval for genome changes in the three-sigma sense or with the probability of 99.73%. We suggest using these limits along with the estimates in order for medical experts to make more correct decisions about real structural changes in the DNA. The rest of the paper is organized as follows. Section 2 discusses jitter in the breakpoints along with the approximate jitter distribution. Section 3 presents the confidence limits. The results of this study are discussed in Section 4. Finally, concluding remarks can be found in Section 5.

Section snippets

Jitter in the breakpoints

Pursuing the aim of determining the confidence limits for CNVs measured using HR-CGH array, we first derive the approximate jitter distribution for segmental Gaussian distribution. We then show that this approximation is the discrete skew Laplace distribution and verify it by simulation using multiple Monte Carlo runs.
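The jitter law itself (equation (15) of the full paper) is not reproduced in this excerpt, so the following is only a sketch of a commonly used two-parameter form of the discrete skew Laplace probability mass function. The parameter names `p` and `q`, and their assignment to right and left jitter, are assumptions made for illustration and may differ from the authors' parameterization.

```python
def dsl_pmf(k, p, q):
    """Discrete skew Laplace pmf on the integers (assumed parameterization).

    P(K = k) = c * p**k      for k >= 0  (right side)
    P(K = k) = c * q**(-k)   for k <  0  (left side)
    with c = (1 - p)(1 - q) / (1 - p q) and 0 < p, q < 1.
    """
    if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
        raise ValueError("p and q must lie strictly between 0 and 1")
    c = (1.0 - p) * (1.0 - q) / (1.0 - p * q)
    return c * (p ** k if k >= 0 else q ** (-k))

# The pmf sums to one; p > q puts more mass on positive (right) jitter.
total = sum(dsl_pmf(k, 0.6, 0.3) for k in range(-200, 201))
print(abs(total - 1.0) < 1e-12)                     # True
print(dsl_pmf(2, 0.6, 0.3) > dsl_pmf(-2, 0.6, 0.3))  # True
```

With `p == q` the law reduces to a symmetric discrete Laplace; unequal values model the case where the noise conditions to the left and right of a breakpoint differ.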

Confidence limits

Given the jitter distribution (15), the confidence limits for CNVs can be found in a chosen probabilistic sense. From the standpoint of practical usefulness, it is always desirable to find the confidence endpoints such that the true changes exist within the confidence interval with high probability. In this paper, we specify the confidence UB and LB in the three-sigma sense to guarantee a probability of 99.73%, or an error probability of 0.27%, for CNVs measured using the HR-CGH microarray.
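The 99.73% figure is the probability that a Gaussian variable falls within three standard deviations of its mean. The sketch below checks that number and applies the same three-sigma rule to a segmental level estimated by simple averaging, whose standard deviation is the noise standard deviation divided by the square root of the number of probes. The function names `coverage` and `segment_bounds` are hypothetical, the noise standard deviation is assumed known, and breakpoint jitter is deliberately left out, so this is not the paper's full UB/LB construction.

```python
import math

def coverage(theta):
    """P(|X - mu| < theta * sigma) for a Gaussian X."""
    return math.erf(theta / math.sqrt(2.0))

def segment_bounds(samples, sigma, theta=3.0):
    """Return (LB, estimate, UB) for one segment level.

    The estimate is the sample mean; its standard deviation is
    sigma / sqrt(N) for N independent probes with noise std sigma.
    """
    n = len(samples)
    a_hat = sum(samples) / n
    half = theta * sigma / math.sqrt(n)
    return a_hat - half, a_hat, a_hat + half

print(round(100.0 * coverage(3.0), 2))  # 99.73

# Nine probes with noise std 0.3 give a half-width of 3 * 0.3 / 3 = 0.3.
lb, a_hat, ub = segment_bounds([1.0] * 9, sigma=0.3)
```

The half-width shrinks only as the square root of the number of probes, which is why short segments remain uncertain even when their breakpoints are well located.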

Discussion

Some generalizations can now be provided regarding errors in the estimation of the CNVs using the HR-CGH microarray.

Segmental errors: Inaccuracies in the segmental estimates may lead to wrong conclusions about genome changes. More effort is thus required to increase the probe resolution. Because each change $a_j$ is constant, simple averaging is best for noise reduction. An averaging filter produces the required unbiased estimate, where variance is defined by the segmental noise variance $\sigma_j^2$

Conclusions

Measurements of genome changes using the HR-CGH array are available with a probe resolution of 0.2, …, 40 kb. They are provided in the presence of large white Gaussian noise that causes the segmental SNRs to range around unity, from about 0.1 to 100. Under such conditions, estimates of segmental changes and breakpoint locations are often accompanied by large and even unacceptable errors. We have shown that jitter in the CNV breakpoint locations can be approximated with the discrete skew

References (41)

  • International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature (2004).
  • H. Ren et al. BAC-based PCR fragment microarray: high-resolution detection of chromosomal deletion and duplication breakpoints. Human Mutation (2005).
  • A.E. Urban et al. High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America (2006).
  • Y.H. Yang et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research (2002).
  • D.F. Conrad et al. Origins and functional impact of copy number variation in the human genome. Nature (2010).
  • L. Hsu et al. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (2005).
  • E. Ben-Yaacov et al. A fast and flexible method for the segmentation of aCGH data. Biostatistics (2008).
  • A. Goldenshluger et al. Adaptive de-noising of signals satisfying differential inequalities. IEEE Transactions on Information Theory (1997).
  • I.E. Frank et al. A statistical view of some chemometrics regression tools. Technometrics (1993).
  • A.E. Hoerl et al. Ridge regression: biased estimation for nonorthogonal problems. Technometrics (1970).