Prepare input data • TOP

TOP predicts the quantitative TF occupancy from ChIP-seq around candidate TF binding sites using the site-centric approach.

We first identified candidate binding sites by motif scanning with a permissive threshold (using FIMO) (Grant et al., 2011).

Then, for each cell type, we considered (normalized) DNase and/or ATAC cleavage events occurring within 100 bp of the candidate binding site.

Similarly, we quantified TF occupancy in terms of ChIP-seq read counts within 100 bp of the candidate binding site, and this served as the target of our regression when training TOP.

We simplified the chromatin accessibility data into predictive features using five bins that aggregate the number of cleavage events occurring within the motif itself, as well as within two non-overlapping flanking regions upstream and downstream, using the same binning scheme used in the MILLIPEDE model (Luo and Hartemink, 2013).

Input data

The input data for TOP includes a data frame for each TF x cell type of interest:

The input data frame (see example below) should contains:

Columns of the candidate binding sites: chr, start, end, site name, PWM score, strand, p-value from (FIMO) motif scanning.
Columns of DNase-seq or ATAC-seq bin counts: Five bins (MILLIPEDE M5 bins) around motif matches.
optional: one column of ChIP-seq measurement, if you want to train your own models. It could be quantitative TF occupancy (asinh transformed ChIP-seq read counts) or binary TF binding labels (obtained using ChIP-seq peaks).

You can follow the procedure below or use your own data processing scripts/pipelines to prepare the input data.

You can also include additional steps to correct DNase-seq or ATAC-seq bias. Yet, our binning approach (with M5 bins) should make TOP robust to DNase-seq or ATAC-seq cleavage bias at base-pair resolution.

We recommend using Snakemake to automate the whole procedure, especially when you have data for many TFs in many cell types or conditions. We provided some example pipelines that we used in our paper.

In order to prepare input data using TOP functions, please have the following R packages installed: GenomicRanges, Rsamtools, data.table, ggplot2, as well as the following command line tools: bedtools, bwtool, and fimo from the MEME suite as well as bedGraphToBigWig and bigWigAverageOverBed from UCSC binary utilities directory.

Example data preparation procedure

Here, we show an example procedure with several steps for preparing input data for CTCF in K562 cell type using the human genome assembly version hg38.

Load TOP R package

library(TOP)

Step 1: Find TF motif matches using FIMO software

To scan for TF motif matches, we use the FIMO software from the MEME suite.

Download hg38 reference genome FASTA file and save it as hg38.fa.

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.fa.gz
gunzip -c hg38.analysisSet.fa.gz > hg38.fa

Generate the chrom.sizes file which will be needed later.

index_fa('hg38.fa', chromsize_file='hg38.chrom.sizes')

Download the CTCF motif file MA0139.1.meme (in MEME format) from JASPAR.

wget https://jaspar.genereg.net/api/v1/matrix/MA0139.1.meme

We use FIMO with the threshold p-value < 1e-5 in most of our analyses. You can set thresh = "1e-4" (FIMO’s default threshold), if you need more motif matches.

You can also set different background model in bfile, for example, bfile='--uniform--' would use uniform nucleotide frequencies.

Save results in MA0139.1_1e-5.fimo.txt and MA0062.1_1e-5.fimo.txt, which will be used in the next step.

fimo_motif_matches(motif_file='MA0139.1.meme', 
                   sequence_file='hg38.fa', 
                   thresh_pValue=1e-5, 
                   outname='MA0139.1_1e-5.fimo.txt')

Step 2: Get candidate TF binding sites

We take motif matches obtained from FIMO as candidate binding sites, and add 100 bp flanking regions on both sides of the motifs, then filter candidate sites by FIMO p-value, and filter the candidate sites falling in ENCODE blacklist regions.

Download ENCODE blacklist from ENCODE portal and save as blacklist.hg38.bed.gz.

wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
mv ENCFF356LFX.bed.gz blacklist.hg38.bed.gz

Obtain the candidate sites

# fimo_file: FIMO result file.
# thresh_pValue: FIMO p-value threshold.
# blacklist_file: file with ENOCDE blacklist regions.
sites <- process_candidate_sites(fimo_file='MA0139.1_1e-5.fimo.txt', 
                                 thresh_pValue=1e-5, 
                                 blacklist_file='blacklist.hg38.bed.gz')

Step 3: Count DNase- or ATAC-seq genome-wide cleavage

We use ATAC-seq reads from K562 cell line (ENCODE ID: ENCSR868FGK) for example.

There are three replicate samples in this study. Here we only use one replicate sample (ENCFF534DCE.bam, and we renamed it as K562.ATAC.bam) as an example.

We first sort and index the BAM file, and obtain the total number of mapped reads from the idxstats file, which will be used later when normalizing read counts by library sizes.

# Download the BAM file from ENCODE
wget https://www.encodeproject.org/files/ENCFF534DCE/@@download/ENCFF534DCE.bam
# Rename the bam file
mv ENCFF534DCE.bam K562.ATAC.bam

# This BAM file has already been sorted, so we skip the sorting step. 
sort_index_idxstats_bam('K562.ATAC.bam', sort=FALSE, index=TRUE, idxstats=TRUE)

Next, we count the DNase or ATAC cuts along the genome, and save the genome counts in BigWig files. This step often takes a while especially for large BAM files, but it only needs to be done once.

The resulting BigWig files will be used later for extracting the count matrices around the candidate sites for different motifs.

We need the command line tools: bedtools and bedGraphToBigWig from UCSC binary utilities for this step.

For ATAC-seq, when , reads were shifted so as to address offsets and align the signal across strands.

# bam_file: sorted BAM file.
# chrom_size_file: file of genome sizes by chromosomes.
# shift_ATAC: If TRUE, shift ATAC-seq reads.
# outdir: directory for saving the BigWig files of genome counts, same as outdir in get_sites_counts().
# outname: prefix for the BigWig files, same as genomecount_name in get_sites_counts().
count_genome_cuts(bam_file='K562.ATAC.bam', 
                  chrom_size_file='hg38.chrom.sizes', 
                  shift_ATAC=TRUE,
                  outdir='processed_data',
                  outname='K562.ATAC')

Step 4: Get DNase- or ATAC-seq count matrices around candidate sites, then normalize, bin and transform the counts

Get DNase- or ATAC-seq read counts matrix around candidate sites:

We utilize bwtool to extract read counts from the BigWig files generated earlier.

# sites: data frame containing candidate sites.
# genomecount_dir: directory for genome counts, same as outdir in count_genome_cuts().
# genomecount_name: file prefix for genome counts, same as outname in count_genome_cuts().
sites_counts.mat <- get_sites_counts(sites,
                                     genomecount_dir='processed_data',
                                     genomecount_name='K562.ATAC')
saveRDS(sites_counts.mat, "processed_data/CTCF.K562.ATAC.counts.mat.rds")

Normalize, bin and transform counts:

# count_matrix: DNase (or ATAC) read counts matrix.
# idxstats_file: the idxstats file generated by samtools.
# ref_size: scale to reference library size.
# (default: 50 million for ATAC-seq, 100 million for DNase-seq)
# bin_method: MILLIPEDE binning scheme (default: 'M5').
# transform: type of transformation for DNase (or ATAC) counts (default: 'asinh').
bins <- normalize_bin_transform_counts(sites_counts.mat, 
                                       idxstats_file='K562.ATAC.bam.idxstats.txt', 
                                       ref_size=5e7,
                                       bin_method='M5',
                                       transform='asinh')

Make a data frame for candidate sites combining motif match information and transformed ATAC (or DNase) counts in five MILLIPEDE bins:

combined_data <- data.frame(sites, bins)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value', 
                                paste0('bin', 1:ncol(bins)))

saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.combined.data.rds')

head(combined_data, 3)
#>    chr start   end  name pwm.score strand  p.value      bin1      bin2 bin3
#> 1 chr1 11122 11341 site1   22.4839      - 9.17e-09 0.2232134 0.0000000    0
#> 2 chr1 11180 11399 site2   20.7581      - 5.22e-08 0.0000000 0.2232134    0
#> 3 chr1 24681 24900 site3   16.5645      - 1.31e-06 0.0000000 0.0000000    0
#>   bin4 bin5
#> 1    0    0
#> 2    0    0
#> 3    0    0

We can apply TOP models to data frames like this to make predictions.

Prepare ChIP-seq data (optional if you want to train your own model)

If you want to train your own model, you would also need to prepare ChIP data and add those to your input data.

Download CTCF K562 ChIP-seq BAM files (ENCODE ID: ENCSR000EGM, two replicates: ENCFF172KOJ and ENCFF265ZSP).

# Download the ChIP-seq BAM files
wget https://www.encodeproject.org/files/ENCFF172KOJ/@@download/ENCFF172KOJ.bam
wget https://www.encodeproject.org/files/ENCFF265ZSP/@@download/ENCFF265ZSP.bam

# Rename the BAM files
mv ENCFF172KOJ.bam CTCF.K562.ChIPseq.rep1.bam
mv ENCFF265ZSP.bam CTCF.K562.ChIPseq.rep2.bam

Sort and index the BAM files and obtain the number of mapped reads.

# The BAM files have already been sorted, so we skip the sorting step. 
bam_sort_index_stats('CTCF.K562.ChIPseq.rep1.bam', sort=FALSE, index=TRUE, idxstats=TRUE)
bam_sort_index_stats('CTCF.K562.ChIPseq.rep2.bam', sort=FALSE, index=TRUE, idxstats=TRUE)

Count ChIP-seq reads around candidate sites (merge ChIP-seq replicates), and normalize (scale) to the same reference library size (20 million).

# sites: data frame containing the candidate sites.
# chip_bam_files: ChIP-seq bam files (may include multiple replicates).
# chrom_size_file: file of genome sizes by chromosomes.
# ref_size: normalize to ChIP-Seq reference library size (default: 20 million).
sites_chip <- count_normalize_chip(sites,
                                   chip_bam_files=c('CTCF.K562.ChIPseq.rep1.bam',
                                                    'CTCF.K562.ChIPseq.rep2.bam'),
                                   chrom_size_file='hg38.chrom.sizes', 
                                   ref_size=2e7)

head(sites_chip, 3)

Combine motif match information, ATAC (or DNase) bins, and ChIP-seq counts into a data frame.

combined_data <- data.frame(sites, bins, chip = sites_chip$chip)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value', 
                                paste0('bin', 1:ncol(bins)), 'chip')
saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.ChIP.combined.data.rds')

head(combined_data, 3)
#>    chr start   end  name pwm.score strand  p.value      bin1      bin2 bin3
#> 1 chr1 11122 11341 site1   22.4839      - 9.17e-09 0.2232134 0.0000000    0
#> 2 chr1 11180 11399 site2   20.7581      - 5.22e-08 0.0000000 0.2232134    0
#> 3 chr1 24681 24900 site3   16.5645      - 1.31e-06 0.0000000 0.0000000    0
#>   bin4 bin5 chip
#> 1    0    0    0
#> 2    0    0    0
#> 3    0    0    0

Replace ChIP-seq counts with binary ChIP labels (from ChIP-seq peaks) if you want to train TOP logistic version to predict TF binding probability.

Download CTCF K562 ChIP-seq peaks

# Download ChIP-seq peaks
wget https://www.encodeproject.org/files/ENCFF660GHM/@@download/ENCFF660GHM.bed.gz
# Rename the BAM files
mv ENCFF660GHM.bed.gz CTCF.K562.ChIPseq.peaks.bed.gz

# We can directly supply the ChIP-seq peaks file name in 'chip_peak_file',
# alternatively, we can set 'chip_peak_sampleID'. Then, it will download the 
# ChIP-seq peaks file from ENCODE website.
sites_chip_labels <- add_chip_peak_labels_to_sites(sites, 
                                                   chip_peak_file='CTCF.K562.ChIPseq.peaks.bed.gz')

combined_data <- data.frame(sites, bins, chip_label = sites_chip_labels$chip_label)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value', 
                             paste0('bin', 1:ncol(bins)), 'chip_label')
saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.ChIPlabels.combined.data.rds')

We can plot the footprint profiles around the motif matches

bound_sites_counts.mat <- sites_counts.mat[which(combined_data$chip_label == 1), ]
plot_profile(bound_sites_counts.mat, title='ATAC-seq profiles around CTCF binding sites')