How is the TE enrichment analysis performed by TEENA?

TE enrichment analysis is performed using a Fisher's Exact Test (FET)-based procedure. Of note, we and other labs have also used the fisher function of BEDtools for TE analysis (e.g. Imbeault et al., Nature, 2017; Sun et al., Mol Biol Evol, 2021; Du et al., Genome Res, 2023; Yu et al., NAR, 2023). However, BEDtools fisher has several limitations for TE analysis in terms of speed and flexibility, and is therefore not suitable for the web server.
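As an illustration of the statistical test only (this is not the TEENA source code), the following minimal sketch computes a one-sided Fisher's Exact Test for enrichment on a 2x2 table using the Python standard library. The table layout (rows: query vs. background regions; columns: overlapping vs. not overlapping a given TE) and the function name are assumptions made for this example; how TEENA builds its contingency table is not described here.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with all margins fixed,
    i.e. the probability of seeing at least `a` overlaps by chance.
    """
    row1 = a + b              # size of the first row (e.g. query regions)
    col1 = a + c              # size of the first column (e.g. overlapping a TE)
    total = a + b + c + d
    upper = min(row1, col1)   # largest value the top-left cell can take
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, row1)


# Hypothetical counts, for illustration only.
p_value = fisher_exact_greater(30, 70, 10, 190)
print(f"one-sided P = {p_value:.3g}")
```

For real data sets, scipy.stats.fisher_exact(table, alternative="greater") performs the same one-sided test and is the more practical choice.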

For the TEENA web server, the Sweep Line algorithm is used to speed up the overlap analysis between TEs and the given genomic regions, making it at least ten times faster than the BEDtools fisher-based pipeline. TEENA also provides multiple parameters that control how overlaps are determined (i.e. whether to use the mid-point or not) and whether genomic gaps and promoter regions are excluded. Finally, it applies FET to calculate P-values, which are then adjusted to FDR values.
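The general idea of a sweep line over genomic intervals can be sketched as follows. This is a generic illustration rather than TEENA's implementation: the function name, the half-open coordinate convention, and the restriction to a single chromosome are assumptions made for the example.

```python
def count_queries_overlapping_tes(queries, tes):
    """Count query intervals that overlap at least one TE interval.

    Both inputs are lists of half-open (start, end) intervals with
    start < end, all on the same chromosome.
    """
    # Build open/close events. At equal positions, closes (0) sort before
    # opens (1), so intervals that merely touch are not counted as overlapping.
    events = []
    for idx, (start, end) in enumerate(queries):
        events.append((start, 1, "q", idx))
        events.append((end, 0, "q", idx))
    for start, end in tes:
        events.append((start, 1, "t", -1))
        events.append((end, 0, "t", -1))
    events.sort()

    open_tes = 0      # number of TEs currently under the sweep line
    waiting = set()   # open queries that have not met a TE yet
    hits = set()      # queries known to overlap a TE
    for _pos, kind, source, idx in events:
        if source == "t":
            if kind == 1:
                open_tes += 1
                hits.update(waiting)   # every waiting query overlaps this TE
                waiting.clear()
            else:
                open_tes -= 1
        elif kind == 1:
            if open_tes > 0:
                hits.add(idx)
            else:
                waiting.add(idx)
        else:
            waiting.discard(idx)
    return len(hits)


# Three query regions; the second only touches a TE and is not counted.
queries = [(0, 10), (20, 30), (40, 50)]
tes = [(5, 8), (30, 35), (45, 60)]
print(count_queries_overlapping_tes(queries, tes))
```

After sorting, each interval is visited a constant number of times in a single pass, which is why this approach scales well compared with testing every region against every TE.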

How do the TEENA results compare with those of previous studies?

TE enrichment analysis is a relatively straightforward task that can technically be performed in different ways. For example, many previous studies used custom scripts relying on random genome shuffling or on statistical tests such as the Binomial Test or Fisher's Exact Test. Many other studies performed such analysis by invoking command-line software (e.g. BEDtools, GIGGLE) designed for the generic purpose of genomic interval comparison.

To evaluate the performance of TEENA, we compared it with BEDtools fisher and GIGGLE, which have been adopted for TE analysis by many studies. The comparison confirmed that TEENA and BEDtools fisher produce almost identical results in terms of both the calculated fold enrichment and the P-values. We also performed multiple case studies using public data covering different data types (TF binding sites, histone marks) and species (human, pig). These results also confirmed that TEENA is capable of detecting significantly enriched TEs with good accuracy.

Of note, TEENA is faster than BEDtools fisher but not as fast as GIGGLE. However, it provides extra features supported by neither BEDtools fisher nor GIGGLE, such as parameters for excluding given genomic regions (i.e. promoters and genomic gaps) from the analysis.

How is the TEENA web server implemented?

The core pipeline of TEENA is written in Python. The web server is built on the Python web framework Django v4.2.5. The backend database is powered by PostgreSQL 15.0, and the frontend interface is built with HTML5, JavaScript and Bootstrap v5.3.1.

Can the TEENA pipeline be downloaded for custom analysis?

Yes. We provide the core scripts of the TEENA pipeline on GitHub: https://github.com/Yuzhuo-li/TEENA, which can be downloaded for custom analysis. The scripts should be useful for large-scale analysis or for the analysis of organisms unsupported by the TEENA web server.

How to analyze organisms unsupported by the TEENA web server?

For the analysis of currently unsupported organisms, we suggest that users download the TEENA pipeline for custom analysis. Alternatively, users can email us to request support for their genomes of interest. To keep the web server robust, we prefer to add only well-annotated genomes.

What type of data can be uploaded for analysis?

TEENA is designed to take genomic interval data as input, such as a set of ChIP-seq peaks, annotated cis-elements, TF binding sites, accessible regions, or any other genomic regions.

Ideally, the uploaded file should be in BED format, but only the first three columns (chrom, chromStart, chromEnd) are used; all other columns are ignored. In this sense, any text file whose first three columns represent genomic coordinates can be submitted for TEENA analysis.
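To check an input file before uploading, the first three columns can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the handling of whitespace-separated columns and of header lines (lines starting with "#", "track" or "browser") is an assumption of this example, not a description of how the web server parses uploads.

```python
import io


def read_regions(handle):
    """Yield (chrom, start, end) from the first three columns of each line."""
    for line in handle:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not enough columns to define a region
        # Columns beyond the third are ignored.
        yield fields[0], int(fields[1]), int(fields[2])


example = io.StringIO(
    "# example peaks\n"
    "chr1\t100\t200\tpeak1\t5\n"
    "chr2\t300\t400\n"
)
for chrom, start, end in read_regions(example):
    print(chrom, start, end)
```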

How to select the parameters for TEENA analysis?

TEENA provides several parameters related to the exclusion of genomic gaps (i.e. the Ns in the genome sequence), the exclusion of promoter regions, and the use of the mid-point or the whole interval for analysis. These parameters are designed to support specialized analyses, such as those of enhancers (i.e. promoter regions should be excluded) and broad peaks (i.e. the whole interval instead of the mid-point should be used).

Regarding the use of "mid-point", we suggest choosing "Yes" for narrow peaks (e.g. cis-elements, TF binding sites, narrow histone marks like H3K27ac/H3K4me3), and "No" for broad domains (e.g. broad histone marks like H3K27me3, H3K9me3).
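The difference between the two modes can be illustrated with a short sketch. The helper names and the exact mid-point definition (integer mid-point of a half-open interval, represented as a 1-bp interval) are assumptions of this example; TEENA's own definition may differ in detail.

```python
def to_midpoint(start: int, end: int) -> tuple[int, int]:
    """Reduce a half-open interval to a 1-bp interval at its mid-point."""
    mid = (start + end) // 2
    return mid, mid + 1


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


# A hypothetical broad domain and a TE located near its edge.
broad_domain = (0, 10_000)
te = (9_000, 9_500)

print(overlaps(broad_domain, te))                # whole-interval mode
print(overlaps(to_midpoint(*broad_domain), te))  # mid-point mode
```

In this example the TE is detected only in whole-interval mode, which is why the whole interval is the safer choice for broad domains, whereas the mid-point gives a more specific signal for narrow peaks.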

Regarding the exclusion of genomic gaps, we suggest choosing "Yes" in most cases, unless the genomic intervals you provide can overlap genomic gaps (i.e. Ns in the genome). If the gaps are not taken into account, the background may be estimated incorrectly, leading to more false positives.
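The effect of gaps on the background can be illustrated with simple arithmetic. The sketch below assumes a deliberately simplified background model, in which each region lands uniformly at random in the usable genome; this model, the function name, and all numbers except the ~3.2 Gb genome length are hypothetical and serve only to show the direction of the bias.

```python
def expected_overlaps(n_regions: int, te_bp: int, genome_bp: int, gap_bp: int = 0) -> float:
    """Expected number of regions hitting a TE under a uniform background.

    The usable genome is the total length minus the gap (N) bases.
    """
    usable_bp = genome_bp - gap_bp
    return n_regions * te_bp / usable_bp


n_regions = 20_000           # number of query regions
te_bp = 1_000_000            # bases covered by one TE type (hypothetical)
genome_bp = 3_200_000_000    # ~3.2 Gb genome
gap_bp = 150_000_000         # bases in gaps (hypothetical)

with_gaps_ignored = expected_overlaps(n_regions, te_bp, genome_bp)
with_gaps_excluded = expected_overlaps(n_regions, te_bp, genome_bp, gap_bp)
print(with_gaps_ignored < with_gaps_excluded)
```

Ignoring the gaps makes the genome look larger than it effectively is, so the expected overlap is underestimated and the fold enrichment (observed / expected) is inflated.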

Regarding the exclusion of promoter regions, we suggest choosing "Yes" if your data are not expected to overlap promoter regions, as is the case for enhancers, which are the focus of many TE studies. Otherwise, choose "No" if your data may overlap promoters.

How long does a routine analysis job usually take?

In our tests, a routine TEENA job with ~20,000 genomic regions from the human genome (~3.2 Gb) finishes in ~3 minutes: the enrichment analysis takes about 2 minutes and the TE-associated region annotation about 1 minute. The running time depends on the genome size and the number of genomic regions. We estimate that most jobs finish within 2-5 minutes.

How to download the analysis results?

The tables and figures generated by TEENA can be downloaded by clicking the download button next to each of them. Alternatively, users can save the result web page to their computer; opening the saved HTML file later allows all output tables and figures to be downloaded again. All results are stored on the server for one month.

How to interpret the TE enrichment tables?

Two tables of TE enrichment results are provided: one (*_TEENA_result.xlsx) contains the full results, and the other (*_TEENA_significant.xlsx) contains only the significantly enriched TEs. TEs with adjusted_p_value<0.05 & fold_enrichment>2 are defined as significantly enriched.
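The significance criteria can be reproduced on the full result table with a short script. The sketch below is an illustration under stated assumptions: it uses the Benjamini-Hochberg procedure for the FDR adjustment, which is an assumption because the adjustment method is not named here; the function names and the TE names in the example are hypothetical, while the column names follow the result tables.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(rows, alpha=0.05, min_fold=2.0):
    """Keep rows with adjusted_p_value < alpha and fold_enrichment > min_fold."""
    return [
        row for row in rows
        if row["adjusted_p_value"] < alpha and row["fold_enrichment"] > min_fold
    ]


# Hypothetical rows, for illustration only.
rows = [
    {"TE_name": "TE_A", "p_value": 0.0001, "fold_enrichment": 3.2},
    {"TE_name": "TE_B", "p_value": 0.0002, "fold_enrichment": 1.4},
    {"TE_name": "TE_C", "p_value": 0.4, "fold_enrichment": 5.0},
]
for row, adj in zip(rows, bh_adjust([r["p_value"] for r in rows])):
    row["adjusted_p_value"] = adj
print([row["TE_name"] for row in significant(rows)])
```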

Both tables have the same format: columns 1-3 give TE information (TE_name, TE_family, TE_class) and columns 4-7 give enrichment statistics (number of overlapped regions, p_value, adjusted_p_value, fold_enrichment). In addition, columns 8-9 provide external links to the Dfam and Repbase databases, which may facilitate further interpretation of the results.

How to interpret the TE enrichment figures?

Three types of figures are provided to facilitate the interpretation of the TE enrichment results. All are in SVG format, a widely used vector image format, so the figures may be used directly in publications. The results for each of the four major TE classes (i.e. DNA, LINE, SINE, LTR) are visualized separately.

The scatter plot (*_scatter_plot.svg) shows the observed vs. expected overlap frequency of each type of TE in the user-provided genomic regions. TEs on the top-left side of the diagonal are enriched, while those on the bottom-right side are depleted. The -log10(P) values are indicated by a color gradient. This figure thus provides a global view of TE enrichment. The volcano plot (*_volcanoplot.svg) provides an alternative view of the global enrichment profile, with the x-axis representing log2(foldEnrich) and the y-axis representing -log10(P). The barplot (*_barplot.svg) shows the top 20 enriched TEs. If fewer than 20 types of TEs are significantly enriched, the TEs ranked within the top 20 are still shown.
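To redraw the volcano plot from the result table, the two axes can be computed from the fold_enrichment and p_value columns. This is a minimal sketch: the function name and the p_floor parameter (a hypothetical lower bound that avoids taking the logarithm of a P-value reported as zero) are assumptions of this example, and the fold enrichment must be greater than zero.

```python
import math


def volcano_xy(fold_enrichment: float, p_value: float, p_floor: float = 1e-300):
    """Return (log2 fold enrichment, -log10 P) for one TE."""
    x = math.log2(fold_enrichment)             # > 0 enriched, < 0 depleted
    y = -math.log10(max(p_value, p_floor))     # larger means more significant
    return x, y


# Hypothetical values: 4-fold enrichment with P = 0.001.
print(volcano_xy(4.0, 0.001))
```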

How are the TE-associated regions determined?

TE-associated regions are defined as the user-provided genomic regions that overlap TEs. The genomic distribution of these regions is then annotated using the annotatePeaks.pl program of HOMER, and the results are reported in the annotation table and figures described below.

How to interpret and use the annotation table for TE-associated regions?

We also provide a table (*_annotation.xlsx) of TE-associated regions and their genomic annotations, to facilitate the interpretation of TE function and the screening of candidate regions for in-depth analysis. In brief, TE-associated regions are defined as regions that overlap TEs. The genomic distribution of these regions is then annotated using the annotatePeaks.pl program of HOMER.

The annotation table provides several kinds of information about the TE-associated regions, including their genomic coordinates, associated TEs (i.e. genomic position and TE name) and adjacent genes (i.e. gene name, gene ID, distance to TSS and genomic distribution). Using this information, users may perform further analyses, such as GO enrichment analysis or motif analysis of the genomic regions that overlap each type of TE. This table may also facilitate the selection of candidate TE-associated regions for in-depth bioinformatic or experimental analysis.
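As a starting point for such follow-up analyses, the annotation table can be grouped by TE to obtain one gene list per TE type, for example as input to a GO enrichment tool. The sketch below assumes the table has already been loaded as a list of dictionaries; the function name and the column names "TE_name" and "gene_name" are hypothetical and should be replaced with the actual column headers of the downloaded table.

```python
from collections import defaultdict


def genes_by_te(rows, te_col="TE_name", gene_col="gene_name"):
    """Group adjacent genes by TE name, dropping duplicates and empty cells."""
    groups = defaultdict(set)
    for row in rows:
        gene = row.get(gene_col)
        if gene:
            groups[row[te_col]].add(gene)
    return {te: sorted(genes) for te, genes in groups.items()}


# Hypothetical rows, for illustration only.
rows = [
    {"TE_name": "TE_A", "gene_name": "GENE1"},
    {"TE_name": "TE_A", "gene_name": "GENE2"},
    {"TE_name": "TE_A", "gene_name": "GENE1"},
    {"TE_name": "TE_B", "gene_name": "GENE3"},
    {"TE_name": "TE_B", "gene_name": ""},
]
for te, genes in genes_by_te(rows).items():
    print(te, genes)
```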

How to interpret the genomic annotation figures?

Based on the genomic annotation results, pie plots (*_annotation.svg) are generated to show the genomic distribution of TE-associated regions for each major TE class (i.e. DNA, LINE, SINE and LTR). In brief, the pies summarize the percentage of regions that reside within promoters, exons, introns, 5'UTRs, 3'UTRs, TSS and non-coding regions, respectively.