R/assemble_training_data.R
assemble_training_data.Rd
Prepares the training data for fitting TOP models. It splits the training data into partitions (10 by default) and assembles the training data for all TF x cell type combinations in each partition.
A data frame listing all TF x cell type combinations and the training data for each combination. It should have at least three columns: TF names, cell types, and file names of the individual training data for each TF x cell type combination. The individual training data should be in .rds or text (.txt or .csv) format.
Logical. If logistic_model = TRUE, prepare assembled data for the logistic version of the TOP model. If logistic_model = FALSE, prepare assembled data for the quantitative occupancy model (default).
The column name of ChIP data in the individual training data (default: ‘chip’).
Chromosomes used for training the model (default: odd chromosomes, chr1, chr3, ..., chr21).
Number of partitions to split the training data (default: 10).
Number of cores to run in parallel (default: equal to n_partitions).
Maximum number of candidate sites to keep for each TF x cell type combination (default: 50000). To reduce computation time, max_sites candidate sites are randomly selected for each TF x cell type combination if the number of candidate sites exceeds max_sites.
A number for the seed used when sampling sites.
A list of data frames, one per partition (10 by default), each containing that partition of the training data for all TF x cell type combinations.
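The subsampling and partitioning described above can be pictured with a small base R sketch. This is only an illustration under assumptions: the toy `sites` data frame, its `site_id` column, and the random assignment of rows to partitions are hypothetical and are not taken from the package's internal code, which may split the data differently.

```r
# Hypothetical illustration only; not the package's internal implementation.
set.seed(1)  # fixed seed so the random subsample is reproducible

# Toy candidate sites for one TF x cell type combination (made-up data).
sites <- data.frame(site_id = seq_len(120),
                    chip = rpois(120, lambda = 5))

max_sites <- 100     # cap on candidate sites (small here for illustration)
n_partitions <- 10   # number of partitions

# Randomly keep at most max_sites rows when there are more candidates.
if (nrow(sites) > max_sites) {
  keep <- sort(sample(nrow(sites), max_sites))
  sites <- sites[keep, , drop = FALSE]
}

# Randomly assign each retained row to one of n_partitions groups
# of near-equal size, then split into a list of data frames.
partition_id <- sample(rep(seq_len(n_partitions), length.out = nrow(sites)))
partitions <- split(sites, partition_id)

length(partitions)  # number of list elements, one data frame per partition
```

Setting the seed before sampling is what makes the selected sites repeatable between runs, which is the role the seed argument plays in the description above.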
if (FALSE) {
# tf_cell_table should have three columns with:
# TF names, cell types, and paths to the training data files, like:
# | tf_name | cell_type | data_file |
# |:------------:|:-------------:|:------------------------:|
# | CTCF | K562 | CTCF.K562.data.rds |
# | CTCF | A549 | CTCF.A549.data.rds |
# | CTCF | GM12878 | CTCF.GM12878.data.rds |
# | ... | ... | ... |
# Assembles training data for the quantitative occupancy model,
# uses odd chromosomes for training, keeps at most 50000 candidate sites for
# each TF x cell type combination, and splits training data into 10 partitions.
assembled_training_data <- assemble_training_data(tf_cell_table,
                                                  logistic_model = FALSE,
                                                  chip_col = 'chip',
                                                  training_chrs = paste0('chr', seq(1, 21, 2)),
                                                  n_partitions = 10,
                                                  max_sites = 50000)
# Assembles training data for the logistic version of the model
assembled_training_data <- assemble_training_data(tf_cell_table,
                                                  logistic_model = TRUE,
                                                  chip_col = 'chip_label',
                                                  training_chrs = paste0('chr', seq(1, 21, 2)),
                                                  n_partitions = 10,
                                                  max_sites = 50000)
}
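The input table sketched in the comments above could be built in base R roughly as follows. The column names `tf_name`, `cell_type`, and `data_file` are copied from that commented table; whether the function requires exactly these names or only the column order is an assumption not confirmed here, and the file names are the placeholder ones from the example.

```r
# A minimal sketch of the TF x cell type table from the example comments.
# Column names follow the commented table; treat them as an assumption.
tf_cell_table <- data.frame(
  tf_name   = c("CTCF", "CTCF", "CTCF"),
  cell_type = c("K562", "A549", "GM12878"),
  stringsAsFactors = FALSE
)

# Derive the training data file name for each combination,
# e.g. "CTCF.K562.data.rds" (placeholder paths from the example).
tf_cell_table$data_file <- paste(tf_cell_table$tf_name,
                                 tf_cell_table$cell_type,
                                 "data.rds",
                                 sep = ".")

print(tf_cell_table)
```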