Prepares the training data for fitting TOP models. It splits the training data into n_partitions partitions (10 by default) and, for each partition, assembles the training data for all TF x cell type combinations.

assemble_training_data(
  tf_cell_table,
  logistic_model = FALSE,
  chip_col = "chip",
  training_chrs = paste0("chr", seq(1, 21, 2)),
  n_partitions = 10,
  n_cores = n_partitions,
  max_sites = 50000,
  seed = 1
)

Arguments

tf_cell_table

A data frame listing all TF x cell type combinations and the training data for each combination. It should have at least three columns: TF names, cell types, and file names of the individual training data for each TF x cell type combination. The individual training data should be in .rds or text (.txt or .csv) format.
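
As an illustration, a table of this shape could be built by hand as sketched below. The column names (tf_name, cell_type, data_file) and the file names are taken from the commented table in the Examples section and are placeholders; this page does not state whether the function matches columns by name or by position.

```r
# Hypothetical inputs: file names follow the pattern shown in the Examples
# and do not refer to real files.
tf_cell_table <- data.frame(
  tf_name   = c("CTCF", "CTCF", "CTCF"),
  cell_type = c("K562", "A549", "GM12878"),
  data_file = c("CTCF.K562.data.rds",
                "CTCF.A549.data.rds",
                "CTCF.GM12878.data.rds"),
  stringsAsFactors = FALSE
)

# One row per TF x cell type combination.
print(tf_cell_table)
```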

logistic_model

Logical. If logistic_model = TRUE, prepare assembled data for the logistic version of the TOP model. If logistic_model = FALSE (default), prepare assembled data for the quantitative occupancy model.

chip_col

The column name of ChIP data in the individual training data (default: "chip").

training_chrs

Chromosomes used for training the model (default: odd chromosomes, chr1, chr3, ..., chr21).
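
For reference, the default expression from the usage above expands to the eleven odd-numbered autosomes. The snippet below only evaluates that expression; the variable name is chosen for illustration.

```r
# Default value of training_chrs, as written in the usage section.
training_chrs <- paste0("chr", seq(1, 21, 2))

# Print the resulting chromosome names and how many there are.
print(training_chrs)
print(length(training_chrs))
```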

n_partitions

Number of partitions to split the training data into (default: 10).
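
This page does not describe how rows are assigned to partitions. As a rough mental model only, the sketch below assumes a random, near-equal assignment; assign_partitions and the toy data are hypothetical and are not part of the package.

```r
# Illustrative helper (an assumption, not the package's implementation):
# give each row a partition label in 1..n_partitions, in random order.
assign_partitions <- function(n_rows, n_partitions = 10, seed = 1) {
  set.seed(seed)
  sample(rep_len(seq_len(n_partitions), n_rows))
}

# Toy data standing in for assembled candidate sites.
toy_sites <- data.frame(site_id = seq_len(25))

# split() returns a list with one data frame per partition label.
labels <- assign_partitions(nrow(toy_sites), n_partitions = 10)
partitions <- split(toy_sites, labels)

print(length(partitions))
```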

n_cores

Number of cores to run in parallel (default: equal to n_partitions).

max_sites

Maximum number of candidate sites to keep for each TF x cell type combination (default: 50000). If a combination has more than max_sites candidate sites, max_sites of them are randomly selected to reduce computation time.

seed

A number for the random seed used when sampling sites (default: 1).
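
To make the interaction between max_sites and seed concrete, here is a minimal sketch of seeded subsampling. The helper subsample_sites and the toy data are hypothetical; the package's internal sampling may differ in detail.

```r
# Illustrative helper (not part of the package): keep at most max_sites rows,
# sampling reproducibly when the input is larger than that.
subsample_sites <- function(sites, max_sites = 50000, seed = 1) {
  if (nrow(sites) <= max_sites) {
    return(sites)  # nothing to drop
  }
  set.seed(seed)
  keep <- sort(sample.int(nrow(sites), max_sites))
  sites[keep, , drop = FALSE]
}

# Toy data: 20 candidate sites with a placeholder ChIP column.
toy_sites <- data.frame(site_id = seq_len(20), chip = seq_len(20) / 10)

# Capping at 5 sites keeps 5 rows; the same seed gives the same rows.
capped <- subsample_sites(toy_sites, max_sites = 5, seed = 1)
print(nrow(capped))
```

With a fixed seed, repeated calls select the same sites, which keeps downstream model fits comparable across runs.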

Value

A list of n_partitions data frames (10 by default), each containing one partition of the training data for all TF x cell type combinations.

Examples

if (FALSE) {

#  tf_cell_table should have at least three columns:
#  TF names, cell types, and paths to the training data files, like:
#  |   tf_name    |   cell_type   |        data_file         |
#  |:------------:|:-------------:|:------------------------:|
#  |     CTCF     |     K562      |   CTCF.K562.data.rds     |
#  |     CTCF     |     A549      |   CTCF.A549.data.rds     |
#  |     CTCF     |    GM12878    |   CTCF.GM12878.data.rds  |
#  |     ...      |     ...       |   ...                    |

# Assembles training data for the quantitative occupancy model,
# uses odd chromosomes for training, keeps at most 50000 candidate sites for
# each TF x cell type combination, and splits training data into 10 partitions.
assembled_training_data <- assemble_training_data(tf_cell_table,
                                                  logistic_model = FALSE,
                                                  chip_col = 'chip',
                                                  training_chrs = paste0('chr', seq(1, 21, 2)),
                                                  n_partitions = 10,
                                                  max_sites = 50000)

# Assembles training data for the logistic version of the model,
# reading the ChIP data from the 'chip_label' column.
assembled_training_data <- assemble_training_data(tf_cell_table,
                                                  logistic_model = TRUE,
                                                  chip_col = 'chip_label',
                                                  training_chrs = paste0('chr', seq(1, 21, 2)),
                                                  n_partitions = 10,
                                                  max_sites = 50000)

}