Parameters for filtering HERV-TFBSs and HSREs

Unified pipeline ?
Use only uniquely mapped reads ?

Statistical Thresholds: ?
Count-based permutation test
Depth-based permutation test
z score >=
The upper limit of TFs shown in the graphs:

Parameters for filtering HERV-DHSs

Only tier 1 and 2 cells ?
z score >=

Parameters for download

Merge cell types ?

Download all datasets. (Size: 979 MiB; MD5: 90ae21fa75febff25862b64d5821dc5f; last update: Jun. 27, 2017)

We recommend using the wget command to download our data.
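After downloading, the archive can be checked against the MD5 value listed above. The following is a minimal Python sketch; the helper names `md5_of_file` and `is_intact` and the local file name used in the usage note are hypothetical, since the actual archive name is not stated here.

```python
import hashlib

# MD5 value stated on this page for the full dataset.
EXPECTED_MD5 = "90ae21fa75febff25862b64d5821dc5f"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact("dbherv_all.tar.gz")` (hypothetical file name) would return True only if the download completed without corruption.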

Frequently Asked Questions (FAQs)

What kinds of data are available?

  • TFBSs on HERV/LTRs (HERV-TFBSs). Positions of HERV-TFBSs in the consensus sequence of the HERV/LTRs and in the human reference genome are available. TFBSs were determined from ChIP-Seq datasets provided by the ENCODE and Roadmap Epigenomics projects. To determine the TFBSs, 519 ChIP-Seq datasets for 97 sequence-specific and pol II-associated TFs were analyzed using the unified analytical pipeline. Two types of TFBS datasets were generated: all-read and unique-read TFBSs. Please see "What are all-read and unique-read TFBSs?".
  • DHSs on HERV/LTRs (HERV-DHSs). Positions of HERV-DHSs in the consensus sequence of the HERV/LTRs and in the human reference genome are available. DHSs for 124 cell types were provided by ENCODE.
  • HERV/LTR-shared regulatory elements (HSREs) identified by our original pipeline. Please see "What are HSREs?". Positions of HSREs in the consensus sequence of the HERV/LTRs and in the human reference genome are available.
  • General information on HERV/LTR types: family classification, copy number, and taxa that share orthologous copies of the HERV/LTRs.
  • Comparison between the phylogenetic relationships among HERV/LTR copies and the presence of orthologous copies, TFBSs, and TF-binding motifs.

    What are all-read and unique-read TFBSs?

    When focusing on repetitive elements such as HERV/LTRs, it is important to check whether multi-mapped reads (reads that can be mapped to multiple genomic regions) are excluded from the analysis of next-generation sequencing data. If multi-mapped reads are not excluded, false-positive peaks may be detected at regions whose sequences are similar to those authentically bound by the TF. If they are excluded, it becomes infeasible to identify ChIP-Seq peaks on recently integrated HERV/LTRs, which show low sequence divergence among copies. Therefore, we generated two types of ChIP-Seq peak (TFBS) datasets: all-read and unique-read TFBSs (Fig. 1). All-read TFBSs are ChIP-Seq peaks that were determined with all reads mapped to the human reference genome; in our analytical pipeline, each multi-mapped read was randomly assigned to one genomic position chosen from its candidate positions. Unique-read TFBSs are ChIP-Seq peaks that were determined with only the reads uniquely mapped to the reference genome; in other words, multi-mapped reads were excluded before ChIP-Seq peak calling.
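The difference between the two read-handling modes can be sketched in Python as follows. This is an illustration only, not the pipeline's actual code: the function `assign_reads`, the input layout (a dict from read ID to candidate positions), and the example reads are all hypothetical.

```python
import random


def assign_reads(alignments, mode="all", seed=0):
    """Pick at most one genomic position per read.

    alignments: dict mapping read ID -> list of candidate (chrom, pos) hits.
    mode="all":    a multi-mapped read gets one candidate chosen at random.
    mode="unique": multi-mapped reads are discarded.
    """
    if mode not in ("all", "unique"):
        raise ValueError("mode must be 'all' or 'unique'")
    rng = random.Random(seed)
    assigned = {}
    for read_id, hits in alignments.items():
        if len(hits) == 1:
            # Uniquely mapped reads are kept in both modes.
            assigned[read_id] = hits[0]
        elif len(hits) > 1 and mode == "all":
            # Multi-mapped read: random choice among candidate positions.
            assigned[read_id] = rng.choice(hits)
    return assigned


# Hypothetical example: read2 maps equally well to two loci.
example = {
    "read1": [("chr1", 1000)],
    "read2": [("chr3", 500), ("chr7", 9000)],
}
all_read = assign_reads(example, mode="all")
unique_read = assign_reads(example, mode="unique")
```

In this sketch, `all_read` keeps both reads (with `read2` placed at one of its two candidates), whereas `unique_read` keeps only `read1`.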

    Fig. 1. An analytical pipeline for peak calling of ChIP-Seq in this study

    What are HSREs (HERV/LTR shared regulatory elements)?

    When a regulatory element is observed on a HERV/LTR sequence, two evolutionary scenarios (Fig. 2A, 2B) could explain how the regulatory element arose.

    Fig. 2A. Scenario 1: The regulatory element originally existed in the HERV/LTR before the insertion

    Fig. 2B. Scenario 2: The regulatory element arose through mutations after the insertion

    We can probably distinguish the two scenarios by examining many of the HERV/LTR copies interspersed in the genome. Namely, the former scenario is more likely if the regulatory elements are shared/conserved among HERV/LTR copies. For this purpose, we defined HSREs as regulatory elements (TF-binding motifs in TFBSs) that are shared/conserved within a substantial fraction of HERV/LTR copies. We identified HSREs by the following procedure (Fig. 3).

    Fig. 3. Scheme for identification of HSREs

    1. HERV-TFBS overlaps (HERV-TFBSs) were identified in each cell type. The HERV-TFBSs from distinct cell types were then merged (merged HERV-TFBSs).
    2. A multiple sequence alignment (MSA) of HERV/LTR copies was constructed. The position of each merged HERV-TFBS was mapped onto the sequences in the MSA.
    3. TF-binding motifs were scanned in the HERV-TFBSs and mapped onto each sequence in the MSA.
    4. HERV-TFBSs and TF-binding motifs were counted at each consensus position in the MSA. If the peak of TF-binding motifs was higher than 60% of the peak of HERV-TFBSs, we regarded the set of TF-binding motifs as an HSRE.
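The 60% criterion in step 4 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the original pipeline: intervals are given in 0-based, end-exclusive MSA consensus coordinates, each interval comes from a different HERV/LTR copy, and "peak" is interpreted as the maximum count over all consensus positions. The function names are hypothetical.

```python
def count_per_position(intervals, msa_length):
    """Count how many intervals cover each consensus position of the MSA.

    intervals: list of (start, end) in 0-based, end-exclusive coordinates.
    """
    depth = [0] * msa_length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, msa_length)):
            depth[pos] += 1
    return depth


def is_hsre(tfbs_intervals, motif_intervals, msa_length, ratio=0.6):
    """Return True if the motif peak exceeds `ratio` times the TFBS peak."""
    tfbs_peak = max(count_per_position(tfbs_intervals, msa_length), default=0)
    motif_peak = max(count_per_position(motif_intervals, msa_length), default=0)
    return tfbs_peak > 0 and motif_peak > ratio * tfbs_peak


# Hypothetical example: 10 copies carry a TFBS over positions 100-299.
tfbs = [(100, 300)] * 10
# Seven copies share a motif at positions 180-189: 7 > 0.6 * 10.
shared_motifs = [(180, 190)] * 7
# Five copies share it: 5 is not greater than 0.6 * 10.
weak_motifs = [(180, 190)] * 5
```

With these inputs, `is_hsre(tfbs, shared_motifs, 500)` is True and `is_hsre(tfbs, weak_motifs, 500)` is False.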

    We investigated HSREs for two purposes: (i) to infer the properties of regulatory elements that were anciently present in HERV/LTRs or in their ancestral exogenous retroviruses, and (ii) to identify sets of regulatory elements that work coordinately in the genome. HSREs are shared among many HERV/LTR copies that are interspersed in the genome, so a set of HSREs can modulate several genes in a coordinated manner (Fig. 4).

    Fig. 4. Coordinated gene regulation by a set of HSREs

    For this reason, many researchers consider that HERV/LTRs sharing regulatory elements have contributed to the evolution of the host's gene regulatory networks.

    What kind of statistical threshold is used for filtering HERV-TFBSs or HERV-DHSs?

    It is important to check whether a TF binds to a type of HERV/LTR significantly more often than expected. Because HERV/LTRs occupy a large fraction of the genome, some TF binding would be observed on HERV/LTRs even in the absence of any specific association between the HERV/LTRs and the TF. Therefore, we evaluated the statistical enrichment of TF binding in each type of TE relative to random expectation. For this purpose, we performed two kinds of permutation tests for filtering HERV-TFBSs: count-based and depth-based permutation tests. In both tests, the merged TFBSs were used.

    1. Count-based permutation test

      We generated 100 randomized datasets of TFBSs by permuting the genomic positions of the TFBSs. HERV-TFBS overlaps were counted in the observed and randomized datasets. A standardized score (z score) was calculated from the counts in the observed and randomized datasets. The count-based test is available for all HERV-TFBS combinations.

    2. Depth-based permutation test

      We generated 500 randomized datasets of TFBSs by permuting the genomic positions of the TFBSs. In the observed and randomized datasets, the depth of TFBSs in the MSA (Fig. 3-4) was measured at each consensus position. A z score was calculated at each consensus position from the depths in the observed and randomized datasets (referred to as the base-wise z score). After smoothing the base-wise z scores with a sliding-window algorithm (window size: 50 bp), the maximum base-wise z score among all positions was defined as the depth-based z score. For each HERV/LTR type, the depth-based test is performed only for TFBSs whose maximum depth in the observed dataset is greater than or equal to 10.
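The two scores described above can be sketched in Python as follows. This is a minimal sketch, not the original implementation. Several details are assumptions because they are not specified here: the population standard deviation is used, smoothing is a plain moving average over 50 consecutive consensus positions, and positions where the randomized depths do not vary contribute a base-wise z score of 0. All function names are hypothetical.

```python
import math
import statistics


def z_score(observed, randomized):
    """Standardized score of an observed value against randomized values."""
    mean = statistics.mean(randomized)
    sd = statistics.pstdev(randomized)  # assumption: population SD
    if sd == 0:
        return float("nan")  # undefined when randomized values do not vary
    return (observed - mean) / sd


def smooth(values, window=50):
    """Moving average over `window` consecutive positions (assumed method)."""
    if len(values) <= window:
        return [sum(values) / len(values)]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def depth_based_z(observed_depth, randomized_depths, window=50, min_depth=10):
    """Maximum smoothed base-wise z score, or None if the test is skipped.

    observed_depth:    TFBS depth at each consensus position.
    randomized_depths: one depth list per randomized dataset.
    """
    if max(observed_depth) < min_depth:
        return None  # test is only done when the maximum depth is >= 10
    base_wise = []
    for pos, obs in enumerate(observed_depth):
        z = z_score(obs, [rand[pos] for rand in randomized_depths])
        base_wise.append(0.0 if math.isnan(z) else z)  # assumption, see above
    return max(smooth(base_wise, window))
```

The count-based z score corresponds to calling `z_score(observed_count, randomized_counts)` with the overlap counts from the 100 randomized datasets; the depth-based z score corresponds to `depth_based_z` with depth profiles from the 500 randomized datasets.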

    For filtering HERV-DHSs, the count-based permutation test is available. A depth-based z score can distinguish the two situations shown in Fig. 5, whereas a count-based z score cannot. Please note that a depth-based z score tends to be more sensitive than a count-based z score, particularly when a HERV/LTR type has a long consensus sequence (such as the internal sequence of HERV/LTRs).

    Fig. 5. Two possible situations when HERV-TFBS overlaps were observed five times
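Why a count alone can hide a difference that depth reveals can be sketched with a small Python example. It assumes that the two situations differ in whether the five overlaps pile up at the same consensus position or are scattered along the consensus sequence; this reading of the figure, the interval coordinates, and the function name `max_depth` are all assumptions made for illustration.

```python
def max_depth(intervals, length):
    """Maximum number of intervals covering any single consensus position.

    intervals: list of (start, end), 0-based and end-exclusive.
    """
    depth = [0] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += 1
    return max(depth)


# Hypothetical situation A: five overlaps stacked on the same region.
clustered = [(100, 150)] * 5
# Hypothetical situation B: five overlaps scattered along the consensus.
scattered = [(0, 50), (200, 250), (400, 450), (600, 650), (800, 850)]

same_count = len(clustered) == len(scattered)  # both have five overlaps
clustered_depth = max_depth(clustered, 1000)
scattered_depth = max_depth(scattered, 1000)
```

Both situations have the same overlap count, but the maximum depth is 5 for the stacked overlaps and 1 for the scattered ones, which is the kind of difference a depth-based score is sensitive to.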

    How are genomic positions described in files downloaded from this website?

    These files are written in a BED-like format. Namely, each row indicates a particular genomic feature (e.g., a HERV-TFBS) and its genomic position: chromosome name, chromStart position, and chromEnd position. The first base in a chromosome is numbered 0, and the chromEnd base is not included in the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
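The coordinate convention can be illustrated with a short Python sketch. It assumes tab-separated rows whose first three fields are the chromosome name, chromStart, and chromEnd; any further columns are passed through untouched because their layout is not described here. The function names and the example row are hypothetical.

```python
def parse_bed_line(line):
    """Parse the first three tab-separated fields of a BED-like row."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return chrom, start, end, fields[3:]


def to_one_based(start, end):
    """Convert 0-based, end-exclusive to 1-based, end-inclusive coordinates."""
    return start + 1, end


# Hypothetical row covering the first 100 bases of chr1.
chrom, start, end, extra = parse_bed_line("chr1\t0\t100\texample_feature")
length = end - start                         # number of bases in the feature
first_base, last_base = start, end - 1       # 0-based bases spanned
one_based = to_one_based(start, end)         # same span in 1-based coordinates
```

For this row the feature length is 100, the spanned 0-based bases are 0 to 99, and the equivalent 1-based inclusive interval is 1 to 100.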