A transcription factor affinity-based code for mammalian transcription initiation

Molly Megraw; Fernando Pereira; Shane T. Jensen; Uwe Ohler; Artemis G. Hatzigeorgiou

doi:10.1101/gr.085449.108

¹ Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina 27708, USA;
² Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA;
³ Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA;
⁴ Institute of Molecular Oncology, Biomedical Sciences Research Center “Alexander Fleming,” Athens, Greece

Abstract

The recent arrival of large-scale cap analysis of gene expression (CAGE) data sets in mammals provides a wealth of quantitative information on coding and noncoding RNA polymerase II transcription start sites (TSS). Genome-wide CAGE studies reveal that a large fraction of TSS exhibit peaks where the vast majority of associated tags map to a particular location (∼45%), whereas other active regions contain a broader distribution of initiation events. The presence of a strong single peak suggests that transcription at these locations may be mediated by position-specific sequence features. We therefore propose a new model for single-peaked TSS based solely on known transcription factors (TFs) and their respective regions of positional enrichment. This probabilistic model leads to near-perfect classification results in cross-validation (auROC = 0.98), and performance in genomic scans demonstrates that TSS prediction with both high accuracy and spatial resolution is achievable for a specific but large subgroup of mammalian promoters. The interpretable model structure suggests a DNA code in which canonical sequence features such as TATA-box, Initiator, and GC content do play a significant role, but many additional TFs show distinct spatial biases with respect to TSS location and are important contributors to the accurate prediction of single-peak transcription initiation sites. The model structure also reveals that CAGE tag clusters distal from annotated gene starts have distinct characteristics compared to those close to gene 5′-ends. Using this high-resolution single-peak model, we predict TSS for ∼70% of mammalian microRNAs based on currently available data.

Footnotes

⁵ Corresponding authors.

E-mail uwe.ohler@duke.edu; fax (919) 668-0795.

E-mail artemis@fleming.gr; +30-210-965-3934.
[Supplemental material is available online at www.genome.org. The annotation-supported classifier is publicly available as an Open Source command-line tool at http://tools.igsp.duke.edu/generegulation/S-Peaker.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.085449.108.
- Received August 26, 2008.
- Accepted December 31, 2008.