test SCEPTRE on example data

Last updated: 2022-05-26

Checks: 7 0

Knit directory: GSFA_analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it's best to always run the code in an empty environment.

Seed: set.seed(20220524)

The command set.seed(20220524) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 4628ab4

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 4628ab4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/figure/

Untracked files:
    Untracked:  code/run_sceptre_LUHMES_data.sbatch
    Untracked:  code/run_sceptre_Tcells_stimulated_data.sbatch
    Untracked:  code/run_sceptre_Tcells_unstimulated_data.sbatch
    Untracked:  code/sceptre_LUHMES_data.R
    Untracked:  code/sceptre_Tcells_stimulated_data.R
    Untracked:  code/sceptre_Tcells_unstimulated_data.R

Unstaged changes:
    Modified:   analysis/index.Rmd
    Deleted:    code/README.md

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/test_sceptre.Rmd) and HTML (docs/test_sceptre.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	4628ab4	kevinlkx	2022-05-26	test SCEPTRE

This vignette illustrates application of sceptre to an example high-multiplicity-of-infection (MOI) single-cell CRISPR screen dataset. We begin by installing and loading all required packages, including the up-to-date version of sceptre.

install.packages("devtools")
devtools::install_github("katsevich-lab/sceptre")
install.packages("tidyverse")
install.packages("cowplot")

library(tidyverse)
library(cowplot)
library(Matrix)
library(sceptre)

We proceed in seven steps.

Step 1: Prepare the data

The first step is to prepare the data to pass to sceptre. We must prepare three separate data objects: the gene-by-cell expression matrix, the gRNA-by-cell expression matrix, and the cell-specific matrix of covariates.

Gene and gRNA expression matrices

We load the example gene-by-cell and gRNA-by-cell expression matrices that are included in the sceptre package.

data(gene_matrix)
data(gRNA_matrix)

Briefly, gene_matrix (respectively, gRNA_matrix) is a 20 x 40,000 (respectively, 50 x 40,000) matrix of gene (respectively, gRNA) unique molecular identifier (UMI) counts. The data are taken from the paper "A genome-wide framework for mapping gene regulation via cellular genetic screens" by Gasperini et al., 2019. The authors used a CRISPRi-based assay to target 5,000+ putative enhancers in a population of K562 cells. The authors additionally targeted 200+ gene transcription start sites (TSSs) and designed a library of 50 non-targeting gRNAs to serve as negative controls. Genes, gRNAs, and cells are all down-sampled to reduce the size of the data. One can use the commands ?gene_matrix and ?gRNA_matrix to read more about these data.

The row names of gene_matrix and gRNA_matrix are the gene IDs and gRNA IDs, respectively. The gRNA IDs are gRNA-specific oligonucleotide barcodes. In general the gRNA IDs must be strings that uniquely identify each gRNA. The column names, meanwhile, are the cell barcodes. Next, gene_matrix and gRNA_matrix are sparse matrices (as implemented by the Matrix package). In general these matrices can be either sparse matrices or standard (dense) R matrices. We print a few rows and columns of gene_matrix and gRNA_matrix to get a sense of what the data look like.

options(width = 300)
# gene_matrix; rows are genes, columns are cells
gene_matrix[1:10, 1:3]
#> 10 x 3 sparse Matrix of class "dgTMatrix"
#>                 CTCGAAAGTCCATGAT-1_2B_7 GACACGCTCAAAGACA-1_2B_4 GGCTCGATCTGTGCAA-1_1B_1
#> ENSG00000111057                       1                       5                       1
#> ENSG00000121211                       .                       1                       .
#> ENSG00000134809                       3                      10                       1
#> ENSG00000166902                       1                       3                       .
#> ENSG00000172809                      12                      19                      33
#> ENSG00000105298                       .                       .                       .
#> ENSG00000108389                       .                       1                       .
#> ENSG00000157985                       .                       .                       .
#> ENSG00000107731                       .                       .                       .
#> ENSG00000184357                       .                       1                       .

# gRNA matrix; rows are gRNAs, columns are cells
gRNA_matrix[1:10, 1:3]
#> 10 x 3 sparse Matrix of class "dgTMatrix"
#>                      CTCGAAAGTCCATGAT-1_2B_7 GACACGCTCAAAGACA-1_2B_4 GGCTCGATCTGTGCAA-1_1B_1
#> TGTCAGTCCTCCCTCCCCCA                       .                       .                       .
#> CTAAACAGTGAAAGTCACAG                       .                       .                       .
#> GCTCCAATCATATTCTAGAG                       .                       .                       .
#> GATGAAACCTAAGGGCACAC                       .                       .                       .
#> TGCAGACGAGTGTCTCAGAG                       .                       .                       .
#> TCATCTTGAAGTCAGCTCCA                       .                       .                       3
#> GAGAATGGCATGGAGCTCAA                       .                       .                       .
#> AGAGCAAGAGAGCAACTCCG                       .                       .                       .
#> CAGCAATTCTACCTTCAGGT                       .                       .                       .
#> TTTGGTTGTTGCAAATGAGG                       .                       .                       .

We also plot a histogram of the counts of an arbitrarily selected gene ("ENSG00000111057") and gRNA ("TGTCAGTCCTCCCTCCCCCA") to visualize the data.

example_gene <- gene_matrix["ENSG00000111057",]
example_gRNA <- gRNA_matrix["TGTCAGTCCTCCCTCCCCCA",]

hist_gene <- ggplot(data = example_gene %>% tibble(count = .) %>% filter(count >= 1, count <= 40), mapping = aes(x = count)) +
  geom_histogram(binwidth = 3, col = "black", fill = "dodgerblue3", alpha = 0.7) +
  scale_y_continuous(trans='log10', expand = c(0, NA)) + xlab("Gene expressions") + ylab("") + theme_bw(base_size = 10)

hist_gRNA <- ggplot(data = example_gRNA %>% tibble(count = .) %>% filter(count >= 1, count <= 40), mapping = aes(x = count)) +
  geom_histogram(binwidth = 3, col = "black", fill = "orchid4", alpha = 0.7) +
  scale_y_continuous(trans='log10', expand = c(0, NA)) + xlab("gRNA expressions") + ylab("") + theme_bw(base_size = 10)

plot_grid(hist_gene, hist_gRNA, labels = c("a", "b"))

As expected, the data are highly discrete counts. Note that we do not normalize either the gene or gRNA expression matrices, opting instead to work directly with the raw counts.

Cell-wise covariate matrix

Next, we load the cell-wise covariate matrix, covariate_matrix.

data(covariate_matrix)
head(covariate_matrix)
#>                         lg_gRNA_lib_size lg_gene_lib_size     p_mito        batch
#> CTCGAAAGTCCATGAT-1_2B_7         6.023448         9.276783 0.03861979 prep_batch_2
#> GACACGCTCAAAGACA-1_2B_4         6.561031        10.274982 0.04499379 prep_batch_2
#> GGCTCGATCTGTGCAA-1_1B_1         8.196988         9.744375 0.02460745 prep_batch_1
#> CTGGTCTGTGGGTATG-1_2B_7         5.929589         9.260368 0.03326996 prep_batch_2
#> TCTGGAAGTACATGTC-1_2A_5         7.134094        10.002880 0.07319942 prep_batch_2
#> AACTGGTGTAGAGCTG-1_1A_4         5.655992         8.875567 0.01327186 prep_batch_1

covariate_matrix is a 40,000 x 4 data frame of "technical factors," or covariates. The row names of this data frame are the cell barcodes. The covariates are as follows:

Log-transformed gene library size (lg_gene_lib_size); this vector can be computed via log(colSums(gene_matrix))
Log-transformed gRNA library size (lg_gRNA_lib_size); this vector can be computed via log(colSums(gRNA_matrix))
Sequencing batch (batch)
Percentage of gene transcripts that map to mitochondrial genes (p_mito)

We strongly recommend that users include the same four covariates (i.e., lg_gene_lib_size, lg_gRNA_lib_size, batch, and p_mito) in their own cell-wise covariate matrix.

Step 2: Assign perturbation identities to cells

The second step is to impute perturbation assignments onto the cells. For a given gRNA and cell, we label the cell as having been perturbed by the gRNA if the UMI count of the gRNA within the cell exceeds a user-specified, integer-valued threshold; otherwise, we label the cell has having been unperturbed by the gRNA. The default threshold that we use for this purpose is 3, which we have found to work well in practice. This simple thresholding routine is implemented by the function threshold_gRNA_matrix.

perturbation_matrix <- threshold_gRNA_matrix(gRNA_matrix)

# `perturbation_matrix` is a thresholded version of `gRNA_matrix`
perturbation_matrix[1:10, 1:3]
#> 10 x 3 sparse Matrix of class "lgTMatrix"
#>                      CTCGAAAGTCCATGAT-1_2B_7 GACACGCTCAAAGACA-1_2B_4 GGCTCGATCTGTGCAA-1_1B_1
#> TGTCAGTCCTCCCTCCCCCA                       .                       .                       .
#> CTAAACAGTGAAAGTCACAG                       .                       .                       .
#> GCTCCAATCATATTCTAGAG                       .                       .                       .
#> GATGAAACCTAAGGGCACAC                       .                       .                       .
#> TGCAGACGAGTGTCTCAGAG                       .                       .                       .
#> TCATCTTGAAGTCAGCTCCA                       .                       .                       |
#> GAGAATGGCATGGAGCTCAA                       .                       .                       .
#> AGAGCAAGAGAGCAACTCCG                       .                       .                       .
#> CAGCAATTCTACCTTCAGGT                       .                       .                       .
#> TTTGGTTGTTGCAAATGAGG                       .                       .                       .

Note: We draw a terminological distinction between a "gRNA count" and a "perturbation." A "gRNA count" is the number of transcripts (in UMIs) that we observe for a given gRNA in a given cell; a "perturbation" is an assignment of a gRNA to a cell, obtained by thresholding a gRNA count. A "perturbation matrix" is a binary matrix of gRNA-to-cell assignments.

Note: Step 2 is optional; users instead can pass the raw gRNA count matrix to the sceptre function, in which case the matrix is thresholded internally to produce a perturbation matrix. (This option, though convenient, precludes the possibility of combining gRNAs and thus is not recommended; see below.)

Note: Users optionally can implement their own strategy for converting the gRNA count matrix into a perturbation matrix.

Step 3: Combine perturbations

The third step is to combine gRNAs that target the same chromosomal location. Information about the genomic position that each gRNA targets is available in the gRNA_groups_table, which we explore below.

data("gRNA_groups_table")
gRNA_groups_table[c(1, 2, 31, 32, 41, 42),] 
#> # A tibble: 6 × 3
#>   gRNA_id              gRNA_group gRNA_type 
#>   <chr>                <chr>      <chr>     
#> 1 TGTCAGTCCTCCCTCCCCCA chr10.2250 enh_target
#> 2 CTAAACAGTGAAAGTCACAG chr10.2250 enh_target
#> 3 AGAGGTAACCAAAATAGCAA NTC_2      non_target
#> 4 ATATGTAACCTCCAGAATGA NTC_2      non_target
#> 5 CAGGCTTTGCGGACGACGGG KRT18_TSS  tss_target
#> 6 GCTTTGCGGACGACGGTGGG KRT18_TSS  tss_target

The gRNA_groups_table data frame has three columns:

gRNA_id: the ID of an individual gRNA
gRNA_group: the "group" to which a given gRNA belongs. Gasperini et al. designed exactly two gRNAs to target each putative enhancer or TSS; gRNAs that target the same putative enhancer or TSS are paired to form groups of size two. Nontargeting gRNAs, meanwhile, are paired randomly, also forming groups of size two.

Note that in general, a gRNA group can contain any nonzero number of gRNAs.
target_type: one of enh_target, tss_target, and non_target, indicating whether a given gRNA is enhancer-targeting, TSS-targeting, and non-targeting.

For example, the gRNAs "GCTCTTGGCTGGAGAATGCA" and "GTTGCAGATGAGGCAACCGA" target the putative enhancer located at chr10.1318 and are grouped.

gRNA_groups_table[1:2,]
#> # A tibble: 2 × 3
#>   gRNA_id              gRNA_group gRNA_type 
#>   <chr>                <chr>      <chr>     
#> 1 TGTCAGTCCTCCCTCCCCCA chr10.2250 enh_target
#> 2 CTAAACAGTGAAAGTCACAG chr10.2250 enh_target

Similarly, the gRNAs "CCCGCGCGCCGCACGGACGG" and "GCGGATCGGGGCAAGGCTCG" target the TSS of gene HDGF and are grouped.

gRNA_groups_table[41:42,]
#> # A tibble: 2 × 3
#>   gRNA_id              gRNA_group gRNA_type 
#>   <chr>                <chr>      <chr>     
#> 1 CAGGCTTTGCGGACGACGGG KRT18_TSS  tss_target
#> 2 GCTTTGCGGACGACGGTGGG KRT18_TSS  tss_target

Finally, the non-targeting gRNAs "AATATTCTCCCTCATTCTGG" and "TTAAAATTGATTCTGCCACT" are paired randomly to form a group called "NTC_2."

gRNA_groups_table[31:32,]
#> # A tibble: 2 × 3
#>   gRNA_id              gRNA_group gRNA_type 
#>   <chr>                <chr>      <chr>     
#> 1 AGAGGTAACCAAAATAGCAA NTC_2      non_target
#> 2 ATATGTAACCTCCAGAATGA NTC_2      non_target

The function combine_perturbations takes as arguments a perturbation_matrix (i.e., the thresholded gRNA_matrix) and a gRNA_groups_table and collapses gRNAs in the same group into a single "combined" gRNA. The data frame gRNA_groups_table must have columns gRNA_id and gRNA_group, as above.

combined_perturbation_matrix <- combine_perturbations(perturbation_matrix = perturbation_matrix,
                                                      gRNA_groups_table = gRNA_groups_table)

The ouput matrix, combined_perturbation_matrix, is a 25 x 40,000 binary matrix, with rows corresponding to the 25 "gRNA groups" within gRNA_groups_table. The column names (i.e., the cell barcodes) remain unchanged.

dim(combined_perturbation_matrix)
#> [1]    25 40000
combined_perturbation_matrix[1:10,1:3]
#> 10 x 3 sparse Matrix of class "lgCMatrix"
#>            CTCGAAAGTCCATGAT-1_2B_7 GACACGCTCAAAGACA-1_2B_4 GGCTCGATCTGTGCAA-1_1B_1
#> chr10.2250                       .                       .                       .
#> chr14.2077                       .                       .                       .
#> chr1.6211                        .                       .                       |
#> chr17.1748                       .                       .                       .
#> chr17.3995                       .                       .                       .
#> chr17.5367                       .                       .                       .
#> chr17.875                        .                       .                       .
#> chr18.248                        .                       .                       .
#> chr19.2680                       .                       .                       .
#> chr19.633                        .                       .                       .

Note: Step 3 is optional but recommended; sceptre works on either combined or uncombined perturbation matrices.

Note: The gRNA_type column in gRNA_groups_table is optional but useful for bookkeeping purposes.

Step 4: Determine which pairs to analyze

Step 4 is to determine the pairs of genes and gRNA groups to analyze. It is common to analyze pairs of genes and gRNA groups that are in close physical proximity to uncover cis-regulatory relationships. The data frame gene_gRNA_group_pairs contains the 120 pairs of genes and gRNA groups that we will analyze in our analysis.

data(gene_gRNA_group_pairs)
nrow(gene_gRNA_group_pairs)
#> [1] 120
gene_gRNA_group_pairs[c(1:2, 6:7, 21:22),]
#> # A tibble: 6 × 3
#>   gene_id         gRNA_group pair_type       
#>   <chr>           <chr>      <fct>           
#> 1 ENSG00000111057 KRT18_TSS  positive_control
#> 2 ENSG00000121211 MND1_TSS   positive_control
#> 3 ENSG00000105298 chr19.633  candidate       
#> 4 ENSG00000108389 chr17.3995 candidate       
#> 5 ENSG00000111057 NTC_2      negative_control
#> 6 ENSG00000121211 NTC_2      negative_control

The gene_gRNA_group_pairs data frame contains three columns: gene_id, gRNA_group, and pair_type. The gene IDs (respectively, gRNA groups) in gene_gRNA_group_pairs must be a subset of the gene IDs (respectively, gRNA groups) in gene_matrix (respectively, combined_perturbation_matrix). The pair_type column gives the "type" of each pair, one of positive_control (for TSS-targeting gRNA groups paired to their target gene), candidate (for putative enhancer-targeting gRNA groups paired to a nearby gene), and negative_control (for pairs of genes and gRNA groups that consist of a non-targeting gRNA group).

The negative control pairs were constructed by pairing each non-targeting gRNA group to the entire set of genes, as is standard. When a large number of genes and non-targeting gRNA groups is available, it is common to analyze a random subset of the non-targeting gRNA group-gene pairs to reduce computational cost. We recommend analyzing at least as many negative control pairs as there are candidate pairs.

Note: The pair_type column, while helpful for bookkeeping purposes, is optional.

Note: Negative control and positive control pairs are not required to run sceptre; however, negative and positive control pairs are extremely useful for checking the calibration and power of sceptre (see Step 7, below) and therefore are strongly recommended.

Step 5: Determine the sidedness of the test

The fourth step is to determine the sideness of the statistical test. If we are testing for an increase (respectively, decrease) in gene expression in response to the perturbation, then we should use a right-sided (respectively, left-sided) test. On the other hand, if we are testing for an increase or decrease in gene expression, then we should use a two-sided test.

Whether we seek to test for an increase or decrease in gene expression (or either and increase or decrease) depends on (i) the genomic element being targeted, (ii) the CRISPR modality being used, and (iii) whether we are testing an association in cis or in trans. For example, if we perturb a putative enhancer via CRISPRi and test for an association between that enhancer and a gene in cis, then we should should test for a decrease in expression and therefore use a left-sided test. The following table summarizes whether a left-, right-, or two-tailed tailed test is appropriate as a function of the aforementioned variables.

Target element	CRISPR modality	Cis or trans	Testing for...	Sidedness
Enhancer or TSS	CRISPRi or CRISPRko	Cis	Decrease in expression	Left
Silencer	CRISPRi or CRISPRko	Cis	Increase in expression	Right
Enhancer or TSS	CRISPRa	Cis	Increase in expression	Right
Silencer	CRISPRa	Cis	Decrease in expression	Left
Any	Any	Trans	Increase or Decrease	Both

The direction of trans relationships is less certain than that of cis relationships, as trans relationships typically are mediated by one or more transcription factors. We therefore recommend using a two-sided test for all tests of trans relationships.

We use a left-tailed test for the example data, as suggested by the first row of the table.

side <- "left"

Step 6: Run the method

The sixth step is to call the function run_sceptre_high_moi on the data. The most important arguments to this function are gene_matrix, combined_perturbation_matrix, covariate_matrix, gene_gRNA_group_pairs, and side, all of which we prepared above. run_sceptre_high_moi has several additional, optional arguments, which are set to reasonable defaults. One can read more about run_sceptre_high_moi by checking the documentation (?run_sceptre_high_moi). The function takes about 40 second to run on the example data on an 8-core Macbook Pro.

cat(sprintf('Dimenstion of gene expression matrix: %d rows %d columns.\n', nrow(gene_matrix), ncol(gene_matrix)))
#> Dimenstion of gene expression matrix: 20 rows 40000 columns.
cat(sprintf('Dimenstion of combined perturbation matrix: %d rows %d columns.\n', nrow(combined_perturbation_matrix), ncol(combined_perturbation_matrix)))
#> Dimenstion of combined perturbation matrix: 25 rows 40000 columns.
cat(sprintf('Dimenstion of covariate matrix: %d rows %d columns.\n', nrow(covariate_matrix), ncol(covariate_matrix)))
#> Dimenstion of covariate matrix: 40000 rows 4 columns.
cat(sprintf('Dimenstion of gene gRNA-group pairs: %d rows %d columns.\n', nrow(gene_gRNA_group_pairs), ncol(gene_gRNA_group_pairs)))
#> Dimenstion of gene gRNA-group pairs: 120 rows 3 columns.
cat(sprintf('Number of gene gRNA-group pairs included: %d rows \n', nrow(gene_gRNA_group_pairs)))
#> Number of gene gRNA-group pairs included: 120 rows

system.time(
  result <- run_sceptre_high_moi(gene_matrix = gene_matrix,
                               combined_perturbation_matrix = combined_perturbation_matrix,
                               covariate_matrix = covariate_matrix,
                               gene_gRNA_group_pairs = gene_gRNA_group_pairs,
                               side = side))
#> Check /tmp/RtmpDV97Gi/logs for more detailed status updates.
#> Running checks and setting up directory structure.  ✓
#> Running gene precomputations.  ✓
#> Running perturbation precomputations.  ✓
#> Running perturbation-to-gene association tests. ✓
#> Collecting and returning results.
#> Joining, by = c("gene_id", "gRNA_id")
#>  ✓
#>     user   system  elapsed 
#> 9714.885   84.206  252.229

Note: Users can supply the raw gRNA expression matrix they obtained from Step 1 or the perturbation matrix they obtained from Step 2 to the combined_perturbation_matrix argument, in which case sceptre treats each gRNA as its own group of one. We recommend against this usage unless the data consist of only a single gRNA per target.

Step 7. Analyze the results

The output of run_sceptre_high_moi is a data frame called result, which is simply gene_gRNA_pairs with two additional columns: p_value, and z_value. p_value is the SCEPTRE p-value for a given pair, and z_value is the z-value of a negative binomial GLM fitted to the data for a given pair. Positive z-values indicate increased expression, and negative z-values indicate decreased expression. We examine several rows of result below.

head(result, 10)
#> # A tibble: 10 × 5
#>    gene_id         gRNA_id    pair_type         p_value z_value
#>    <chr>           <chr>      <fct>               <dbl>   <dbl>
#>  1 ENSG00000111057 KRT18_TSS  positive_control 2.22e-16 -12.5  
#>  2 ENSG00000121211 MND1_TSS   positive_control 2.22e-16  -9.74 
#>  3 ENSG00000134809 TIMM10_TSS positive_control 2.22e-16 -14.0  
#>  4 ENSG00000166902 MRPL16_TSS positive_control 2.22e-16  -9.80 
#>  5 ENSG00000172809 RPL38_TSS  positive_control 2.21e-13 -14.3  
#>  6 ENSG00000105298 chr19.633  candidate        4.29e- 1  -0.217
#>  7 ENSG00000108389 chr17.3995 candidate        8.36e- 1   0.936
#>  8 ENSG00000157985 chr2.7132  candidate        1.62e- 1  -0.984
#>  9 ENSG00000107731 chr10.2250 candidate        4.39e- 2  -1.27 
#> 10 ENSG00000184357 chr6.1291  candidate        8.49e- 1   0.797

Negative control pairs

We must verify that the test is correctly calibrated. To this end we create a QQ-plot of the p-values corresponding to the negative control pairs. We extract the negative control p-values from the result data frame and pass them to the function make_qq_plot.

neg_control_p_vals <- result %>% filter(pair_type == "negative_control") %>% pull(p_value)
qq_plot <- make_qq_plot(neg_control_p_vals)
plot(qq_plot)

The QQ-plot is a visual diagnostic that displays the extent to which the negative control p-values deviate from the expected uniform distribution. The points, each of which represents a p-value, should lie roughly along the diagonal line. If the points do not lie along the diagonal, then the test is poorly calibrated and the results are unreliable. The gray region is a 95% pointwise confidence band.

A reasonable rule of thumb for interpreting the QQ-plot is as follows: if all (or nearly all) of the points with an expected null p-value of 0.1 or less fall within the confidence band, then we can be confident that the test is correctly calibrated. In other words, if nearly all the points to the right of 0.1 (dashed red line) on the x-axis fall within the gray region, then we have good reason to trust the results. This rule of thumb is satisfied on the example data, as seen above.

A forthcoming article will give recommendations for what to do if the QQ-plot reveals that the test is miscalibrated. We do not expect this to happen commonly.

Positive control pairs

Next, to verify that the test is powerful (i.e., that the test is capable of detecting true relationships), we assess the power of the test on the positive control pairs. We extract the positive control p-values and examine their values.

pos_control_p_vals <- result %>% filter(pair_type == "positive_control") %>% pull(p_value)
pos_control_p_vals
#> [1] 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.208234e-13

The positive control p-values are all small, indicating that the test is powerful.

Candidate pairs

Finally, we produce the discovery set. We extract the p-values corresponding to the candidate pairs and apply a Benjamini-Hochberg (BH) correction to adjust for multiple testing.

candidate_pair_results <- result %>% filter(pair_type == "candidate")
candidate_pair_results_p_adj <- candidate_pair_results %>%
  mutate(p_val_adj = p.adjust(p_value, method = "BH"))
head(candidate_pair_results_p_adj)
#> # A tibble: 6 × 6
#>   gene_id         gRNA_id    pair_type p_value z_value p_val_adj
#>   <chr>           <chr>      <fct>       <dbl>   <dbl>     <dbl>
#> 1 ENSG00000105298 chr19.633  candidate  0.429   -0.217     0.737
#> 2 ENSG00000108389 chr17.3995 candidate  0.836    0.936     0.906
#> 3 ENSG00000157985 chr2.7132  candidate  0.162   -0.984     0.537
#> 4 ENSG00000107731 chr10.2250 candidate  0.0439  -1.27      0.431
#> 5 ENSG00000184357 chr6.1291  candidate  0.849    0.797     0.906
#> 6 ENSG00000090060 chr14.2077 candidate  0.179   -1.02      0.537

We call pairs with an adjusted p-value of less or equal than 0.1 significant; the discovery set (i.e., the set of significant pairs) has a false discovery rate (FDR) of 10%.

discovery_set <- candidate_pair_results_p_adj %>% filter(p_val_adj <= 0.1)

The discovery set in this case is empty, as we tested only fifteen candidate pairs.

Note: It is crucial to apply BH only to the pairs in the discovery set (i.e., the candidate pairs). Including positive control or negative control pairs in the discovery set causes BH to become excessively liberal or conservative, respectively.

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Scientific Linux 7.4 (Nitrogen)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] fstcore_0.9.12  sceptre_0.1.0   Matrix_1.4-1    cowplot_1.1.1   forcats_0.5.1   stringr_1.4.0   dplyr_1.0.8     purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.6    ggplot2_3.3.5   tidyverse_1.3.1 workflowr_1.7.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.2        sass_0.4.1        jsonlite_1.8.0    foreach_1.5.2     modelr_0.1.8      bslib_0.3.1       assertthat_0.2.1  getPass_0.2-2     highr_0.9         cellranger_1.1.0  yaml_2.3.5        pillar_1.7.0      backports_1.4.1   lattice_0.20-45   glue_1.6.2        digest_0.6.29    
#> [17] promises_1.2.0.1  rvest_1.0.2       colorspace_2.0-3  htmltools_0.5.2   httpuv_1.6.5      pkgconfig_2.0.3   broom_0.8.0       haven_2.5.0       scales_1.2.0      processx_3.5.3    whisker_0.4       later_1.3.0       tzdb_0.3.0        git2r_0.30.1      generics_0.1.2    farver_2.1.0     
#> [33] ellipsis_0.3.2    withr_2.5.0       fst_0.9.8         cli_3.2.0         magrittr_2.0.3    crayon_1.5.1      readxl_1.4.0      evaluate_0.15     ps_1.6.0          fs_1.5.2          fansi_1.0.3       doParallel_1.0.17 xml2_1.3.3        data.table_1.14.2 tools_4.0.4       hms_1.1.1        
#> [49] lifecycle_1.0.1   munsell_0.5.0     reprex_2.0.1      callr_3.7.0       compiler_4.0.4    jquerylib_0.1.4   rlang_1.0.2       grid_4.0.4        iterators_1.0.14  rstudioapi_0.13   labeling_0.4.2    rmarkdown_2.13    codetools_0.2-18  gtable_0.3.0      DBI_1.1.2         R6_2.5.1         
#> [65] lubridate_1.8.0   knitr_1.38        fastmap_1.1.0     utf8_1.2.2        rprojroot_2.0.2   stringi_1.7.6     parallel_4.0.4    Rcpp_1.0.8.3      vctrs_0.4.1       dbplyr_2.1.1      tidyselect_1.1.2  xfun_0.30