TFBS Prediction in early D. melanogaster embryos
Research Timeline
Prior to Summer: ATAC-seq data obtained from Bozek et al. was run through TOBIAS and yielded initial TFBS locations
During Summer:
-
Re-ran TOBIAS with an increased number of TFs
-
Familiarized with all tools from bedtools suite
-
Optimized CRM list
-
Ran initial intersect experiments (observing overlaps between predicted bound TFBS and CRMs)
-
Ran clustering experiments upon analysis of TFBS profiles in CRMs (TFBS spacing and TFBS/region). Aim was to find new TFBS clusters that were similar to those found in CRMs
-
Re-centered project approach: Genome wide sampling of TFBS and analysis
Discussion
Limited attention has been directed towards conducting comprehensive surveys of the TFBS landscape within specific genomes. Traditionally, projects involving such assays primarily aimed at predicting CRMs, fostering a bias towards TFBS solely within CRMs or areas suggestive of CRM potential. Consequently, the prevailing notion has been that TF activity predominantly occurs within CRMs. However, our findings challenge this assumption by revealing that the majority of TFBS are situated outside CRMs.
When scrutinizing TFBS distribution across genomic features, we observed intergenic and intronic regions harboring the highest TFBS counts, albeit with low density. This sparsity, particularly within non-CRM regions, defies precise interpretation. Unexpectedly dense TFBS occurrences within 5’UTRs raised questions. However, this phenomenon was recently observed in the human genome by Chen et al. (2023), where nearly half of exonic lengths were enveloped by TFBS. Furthermore, 5’UTR containing TFBS were correlated with elevated expression levels in adjacent genes, indicating a potential role in gene expression.
The unexpected high-density TFBS within fruit fly 5’UTRs hints at potential dual-use exons. 5'UTR regions in these dual-use exons fulfill a double purpose: regulation of mRNA stability and transcriptional regulation. This speculation aligns with the "Dose of Activation" model articulated by Spivakov (2014). This model envisions cumulative "hit-and-run" interactions between TFs and nearby promoters, thus quantitatively contributing to transcription activation. Such a mechanism implies collaborative TF action surpassing an activation threshold. TFBS in 5’UTR regions, proximal to promoters, could serve this function. Similarly, we postulate that some sparsely distributed TFBS in intergenic regions might share this role.
Candidate CRMs derived from this study possess high confidence due to stringent clustering parameters, surpassing prior computational approaches. Despite this, there may be compact TFBS that merit consideration that may have not appeared because of our stringent paremeters. The association of the identified clusters with highly expressed genes during early embryonic stages bolsters our findings' credibility. Our manual CRM identification approach, when coupled with experimental validation, could considerably benefit future CRM prediction algorithms, counteracting the unreliability of ChIP-seq and TF motif scanning data.
Results
1. Majority of TF binding occurs outside known CRMs
TOBIAS revealed a total of 34,315 unique binding sites throughout the genome. It was initially expected for most TFBS to be located within regulatory regions. To test this we intersected these binding sites with the known CRM list (n = 1232) retrieved from Redfly and the Stark Vienna Tile database. Unexpectedly, only 4579 TFBS overlapped with these CRMs. This project terms these TFBS as within-bound binding sites (WBS). The remaining 29736 TFBS were named out-of-bound binding sites (OBS). For quality control, it was necessary to test if CRMs had an enrichment of TFBS. A two-proportion z-test was conducted to compare the ratio of accessible bp in CRMs (bps) over accessible bps and the ratio of WBS over total TFBS. The statistical comparison revealed a significant difference between these two ratios (Z-statistic: 40.47, p-value: <1e-4) with WBS over total TFBS as the largest ratio (Fig 1). This meant that TFBS were indeed enriched in CRMs when compared to the total bp that these regions occupied. However, there were still a substantial number of TFBS whose location is still unaccounted for.
Figure 1) Ratio of WBS and total TFBS compared to ratio of peaks in CRMs and peaks not in CRMs. Two-proportion z-test: 40.47, p-value: <1e-4 where WBS/ total TFBS > peaks in CRM/peaks not in CRM.
To further investigate the genomic locations of these OBS, additional intersections were conducted. This time, the objective was to observe other genomic features with which these OBS might overlap: promoter regions, coding sequences (CDS), 5’UTR, 3’UTR, introns, and intergenic regions. Promoter regions were defined as sections 200 bp upstream of 5’UTRs. The distribution of TFBS is pictured in figures 2 and 3. TFBS count was highest in intergenic regions, followed by introns and 5’UTR. When looking at the density of the TFBS in the overall genomic feature, calculated by dividing TFBS count by total bp of the feature, we conclude that introns and intergenic regions had their TFBS sparsely located. The features with highest density of TFBS were in CRMs and surprisingly in 5’UTR regions (Fig 2).
Figure 2) Distribution of TFBS throughout the genome (release 5) of D.melanogaster. Top: count of TFBS per genomic feature Bottom: density of TFBS per genomic feature
These intersections were done individually and due to the overlapping nature of these genomic features, it is likely that the TFBS that were found in them did so as well. To test this, the tool intervene upset was utilized to visualize how much overlap there was between TFBS in different genomic features. Results show that there isn’t much overlap between 5’UTR TFBS and WBS (in CRMs) at 596 shared TFBS. The largest overlap was between WBS and TFBS in introns at around 78% (n= 3590) as seen on Fig 3.
Future Directions and Implications
The outcomes of this project have significantly enhanced our comprehension of the TFBS landscape in D. melanogaster. Should we proceed with further testing of the dose of activation model to elucidate TFBS within 5’UTRs, it could introduce an additional layer of intricacy to gene regulation and its networks. The robust in-silico identification of candidate CRMs instills confidence in our pipeline's potential integration with machine learning algorithms for CRM prediction.
Figure 3) Intervene upset results indicating overlaps between TFBS in different genomic features
Acquired Skills
-
Bacterial Transformation
-
Fly Stock selection and management
-
BEDTools suite analysis
-
Graphing data through python
-
Data Selection and Experimental Design
-
Journal Club hosting
This general survey of TFBS landscape in the genome of D.melanogaster generates two hypotheses: 1) Some of these OBS belong to uncharacterized CRMs 2) They indicate evidence of a dosage-based activation mechanism. Results also prompts the question as to why there is enrichment of TFBS in 5’UTR regions.
2. OBS locations and clusters could identify new CRMs
The observation of OBS clusters on the Integrative Genomics Viewer (IGV) leads us to believe that these groupings of binding sites could aid in exploring the hypothesis of these TFBS clusters as previously uncharacterized CRMs. To test this hypothesis, we first profiled TFBS patterns in CRMs to find the parameters for TFBS cluster identification. Second, we tried to align any candidate clusters with active enhancer histone marks (H3K4me1 and H3K27ac).
Parameters obtained from CRM TFBS profiling focused on TFBS per region and average median spacing. Median TFBS per region in CRMs was 12 and the average median spacing between TFBS was 19 bp. Computing these values into a bash script based on bedtools' cluster tool, a total of 18 clusters were identified.
Histone marks associated with active enhancers typically include H3K4me1 and H3K27ac, with H3K4me1 often serving as a key indicator of enhancer regions in general. By utilizing ChIP-seq data from Li et al. (2019), which focused on these specific histone marks across various embryonic stages, we were able to intersect our TFBS clusters with them. This intersection provided us with two distinct sets of clusters: those exclusively containing H3K4me1 and those featuring both H3K4me1 and H3K27ac. Among these, clusters exclusively overlapping with H3K4me1 totaled 12, while clusters bearing both histone marks totaled 7.
Figure 4) Euler diagram showing number of clusters satisfying criteria found in each round of selection.
Figure 5) Examples of clusters satisfying criteria falling in regions of H3K4me1 and H3K27ac. Manual selection of closest target genes was done by looking at the UCSC Reference sequence.
Interested in the possible target genes of these candidate active enhancer TFBS clusters, we manually searched for them using a genome viewer on the D.melanogaster reference genome (Fig 5). Results are shown in the following table: