1 Introduction
2 Materials and methods
- Step 1: Extracting human promoter sequences. A complete list of human genes and related transcripts, linked to the different isoforms of gene products, is selected. A total of 23,459 genes and 73,432 transcripts are collected. Promoter sequences of those genes/transcripts (2000 base pairs upstream of the transcription start site, following Cumboo et al. 2018) are retrieved through the R package “TxDb.Hsapiens.UCSC.hg19.knownGene” version 3.2.2.
- Step 2: Collecting human TFs and related PWMs. The set of 626 available human TFs is selected, and the related consensus binding patterns, expressed as position weight matrices (PWMs), are retrieved from the JASPAR database (Fornes et al. 2020).
- Step 3: Computing TFBSs. Transcription factor binding sites (TFBSs) associated with each considered human TF are computed with the matchPWM() function of the Biostrings R library, setting a threshold of 0.90.
- Step 4: TFBS enrichment for each TF and hypergeometric test. Statistical tests are performed to assess the association between a given TF and the input gene set. The input file must contain a list of Entrez gene IDs, one ID per line; a sample input file is included in the testset directory of the package. A hypergeometric test is performed for each considered TF, comparing the number of genes in the pool showing at least one TFBS in the promoter region (over all transcripts) with the expected number computed on the whole gene set. The resulting P values are then adjusted using Bonferroni’s correction. A complete list of TFs, with their P values and adjusted P values, is made available in the output directory.
- Step 5: Identification of significant TFs. TFs with P values below a threshold set by the user are identified as potential regulatory factors of the genes in the pool, since they show a significant TFBS enrichment in the promoter sequences of those genes. A list of significant TFs is made available in the output directory, together with the list of genes showing TFBSs related to each TF.
- Step 6: Designing TF network. A link to the STRING database (Szklarczyk et al. 2021) visualizing the network of significant TFs is provided in the output directory. The default view uses a stringent interaction threshold, which the user can change in the STRING database. The STRING visualization gives an at-a-glance view of the connected significant TFs associated with the considered gene pool. The network of TFs together with the genes initially submitted by the user is also available through the STRING database (when the -l flag is set).
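The enrichment test of Steps 4 and 5 can be illustrated with a minimal Python sketch. It is not the package's own implementation (the tool itself relies on R), and it rests on several assumptions: the test is taken to be one-sided (upper tail), the user threshold is applied to the Bonferroni-adjusted P value, and the function names `hypergeom_upper_tail` and `tfbs_enrichment`, the parameter `alpha`, and the toy gene IDs are all hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric variable X.

    N: genes in the whole gene set, K: genes with at least one TFBS,
    n: genes in the pool, k: pool genes with at least one TFBS.
    """
    upper = min(K, n)
    # Exact integer arithmetic avoids overflow for large gene sets.
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def tfbs_enrichment(pool, background, tf_targets, alpha=0.05):
    """Return (tf, p, p_adjusted, significant) tuples sorted by P value.

    tf_targets maps each TF to the genes whose promoter carries at least
    one predicted binding site for it (assumed to be precomputed).
    """
    background = set(background)
    pool = set(pool) & background
    n_tests = len(tf_targets)  # Bonferroni factor: one test per TF
    results = []
    for tf, targets in tf_targets.items():
        targets = set(targets) & background
        k = len(targets & pool)
        p = hypergeom_upper_tail(k, len(pool), len(targets), len(background))
        p_adj = min(1.0, p * n_tests)  # Bonferroni correction, capped at 1
        # Assumption: significance is decided on the adjusted P value.
        results.append((tf, p, p_adj, p_adj <= alpha))
    return sorted(results, key=lambda r: r[1])


# Toy example with hypothetical gene IDs (not real Entrez IDs).
background = [f"g{i}" for i in range(1, 21)]
pool = ["g1", "g2", "g3", "g4", "g20"]
tf_targets = {
    "TF_A": ["g1", "g2", "g3", "g4", "g5"],
    "TF_B": ["g4", "g20"] + [f"g{i}" for i in range(6, 14)],
}
for tf, p, p_adj, significant in tfbs_enrichment(pool, background, tf_targets):
    print(f"{tf}\tP={p:.3g}\tadjusted P={p_adj:.3g}\tsignificant={significant}")
```

In this toy setting, four of the five pool genes carry a site for the hypothetical TF_A while only five of the twenty background genes do, so TF_A is the one that stands out after correction.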
3 Results
3.1 First case study: schizophrenia disorder
- Schizophrenia disorder (8 TFs, P value \(6.9\times 10^{-2}\), considering the best 76 TFs)
- Parkinson’s disease (6 TFs, P value \(2\times 10^{-3}\))
- Depression (5 TFs, P value \(1.7\times 10^{-2}\))
- Neurological Disease Class (27 TFs, P value \(3.1\times 10^{-2}\))
- Antisocial behavioral traits (2 TFs, P value \(6.1\times 10^{-2}\))
- PSYCH disease Class (19 TFs, P value \(6.6\times 10^{-2}\))
- Schizophrenia/bipolar disorder (2 TFs, P value \(9.1\times 10^{-2}\)), considering the best 101 TFs.
Gene symbol | Gene name |
---|---|
ASCL1 | Achaete-scute family bHLH transcription factor 1 |
FOXP2 | Forkhead box P2 |
KLF5 | Kruppel-like factor 5 |
RUNX2 | Runt-related transcription factor 2 |
TBX3 | T-box 3 |
TCF4 | Transcription factor 4 |
TFAP2A | Transcription factor AP-2 alpha |
TFAP2B | Transcription factor AP-2 beta |
- MAZ (adjusted P value \(< 10^{-21}\))
- KLF5 (adjusted P value \(< 10^{-18}\))
- KLF15 (adjusted P value \(< 10^{-17}\))
- VEZF1 (adjusted P value \(< 10^{-16}\))
- ZNF148 (adjusted P value \(< 10^{-15}\)).
3.2 Second case study: autism disorder
- Autism (8 TFs, P value \(< 9.5\times 10^{-2}\), considering the best 181 TFs)
- Neurodevelopmental psychiatric disorders (3 TFs, P value \(< 1.7\times 10^{32}\))
- Parkinson’s Disease (7 TFs, P value \(< 1.2\times 10^{-2}\))
- Depression (7 TFs, P value \(< 1.6\times 10^{-2}\)), considering the best 214 TFs.
Gene symbol | Gene name |
---|---|
CUX2 | Cut-like homeobox 2 |
EN2 | Engrailed homeobox 2 |
FOXP2 | Forkhead box P2 |
HES1 | Hes family bHLH transcription factor 1 |
HOXA1 | Homeobox A1 |
MAZ | Myc-associated zinc finger protein |
NFIL3 | Nuclear factor, interleukin 3 regulated |
POU6F2 | POU class 6 homeobox 2 |
- VEZF1 (P value \(< 10^{-17}\))
- MZF1 (P value \(< 10^{-17}\))
- KLF15 (P value \(< 10^{-15}\))
- MAZ (P value \(< 10^{-15}\))
- ZNF148 (P value \(< 10^{-14}\)).