Last updated: 2022-09-16

Checks: 7 0

Knit directory: GSFA_analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20220524) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e5f4b4a. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  Rplots.pdf
    Untracked:  analysis/check_Tcells_datasets.Rmd
    Untracked:  analysis/fscLVM_analysis.Rmd
    Untracked:  analysis/spca_LUHMES_data.Rmd
    Untracked:  analysis/test_seurat.Rmd
    Untracked:  code/gsfa_negctrl_job.sbatch
    Untracked:  code/music_LUHMES_Yifan.R
    Untracked:  code/plotting_functions.R
    Untracked:  code/run_fscLVM_LUHMES_data.R
    Untracked:  code/run_gsfa_2groups_negctrl.R
    Untracked:  code/run_gsfa_negctrl.R
    Untracked:  code/run_music_LUHMES.R
    Untracked:  code/run_music_LUHMES_data.sbatch
    Untracked:  code/run_music_LUHMES_data_20topics.R
    Untracked:  code/run_music_LUHMES_data_20topics.sbatch
    Untracked:  code/run_sceptre_Tcells_data.sbatch
    Untracked:  code/run_sceptre_Tcells_stimulated_data.sbatch
    Untracked:  code/run_sceptre_Tcells_test_data.sbatch
    Untracked:  code/run_sceptre_Tcells_unstimulated_data.sbatch
    Untracked:  code/run_sceptre_permuted_data.sbatch
    Untracked:  code/run_spca_LUHMES.R
    Untracked:  code/run_spca_TCells.R
    Untracked:  code/run_twostep_clustering_LUHMES_data.sbatch
    Untracked:  code/run_twostep_clustering_Tcells_data.sbatch
    Untracked:  code/run_unguided_gsfa_LUHMES.R
    Untracked:  code/run_unguided_gsfa_LUHMES.sbatch
    Untracked:  code/run_unguided_gsfa_Tcells.R
    Untracked:  code/run_unguided_gsfa_Tcells.sbatch
    Untracked:  code/sceptre_LUHMES_data.R
    Untracked:  code/sceptre_Tcells_stimulated_data.R
    Untracked:  code/sceptre_Tcells_unstimulated_data.R
    Untracked:  code/sceptre_permutation_analysis.R
    Untracked:  code/sceptre_permute_analysis.R
    Untracked:  code/seurat_sim_fpr_tpr.R
    Untracked:  code/unguided_GFSA_mixture_normal_prior.cpp

Unstaged changes:
    Modified:   analysis/index.Rmd
    Modified:   analysis/music_LUHMES_data.Rmd
    Modified:   analysis/sceptre_TCells_data.Rmd
    Modified:   analysis/twostep_clustering_LUHMES_data.Rmd
    Modified:   code/run_sceptre_LUHMES_data.R
    Modified:   code/run_sceptre_LUHMES_data.sbatch
    Modified:   code/run_sceptre_LUHMES_permuted_data.R
    Modified:   code/run_sceptre_Tcells_permuted_data.R
    Modified:   code/run_sceptre_cropseq_data.sbatch
    Modified:   code/run_twostep_clustering_LUHMES_data.R
    Modified:   code/sceptre_analysis.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/sceptre_LUHMES_data.Rmd) and HTML (docs/sceptre_LUHMES_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd e5f4b4a kevinlkx 2022-09-16 updated results with new version and added permutation result
html 0b325fb kevinlkx 2022-05-26 Build site.
Rmd 4b70cfd kevinlkx 2022-05-26 updated code links
html a38de80 kevinlkx 2022-05-26 Build site.
Rmd 144346c kevinlkx 2022-05-26 updated document format
html dab64ad kevinlkx 2022-05-26 Build site.
Rmd c21a39f kevinlkx 2022-05-26 SCEPTRE analysis on LUHMES data

About the data sets

CROP-seq datasets: /project2/xinhe/yifan/Factor_analysis/shared_data/ The data are Seurat objects, with raw gene counts stored in \(RNA@counts, and cell meta data stored in obj@meta.data. Normalized and scaled data used for GSFA are stored in obj@assays\) , the rownames of which are the 6k genes used for GSFA.

Analysis

Scripts for running SCEPTRE:

mkdir -p /project2/xinhe/kevinluo/GSFA/sceptre_analysis/log
cd /project2/xinhe/kevinluo/GSFA/sceptre_analysis/log

sbatch --mem=50G --cpus-per-task=10 --partition=xinhe ~/projects/GSFA_analysis/code/run_sceptre_LUHMES_data.sbatch

Permutation analysis

mkdir -p /project2/xinhe/kevinluo/GSFA/sceptre_analysis/log
cd /project2/xinhe/kevinluo/GSFA/sceptre_analysis/log

for i in {1..10}
do
  sbatch --mem=50G --cpus-per-task=10 ~/projects/GSFA_analysis/code/run_sceptre_LUHMES_permuted_data.sbatch $i
done

# run_sceptre_LUHMES_permuted_data.sbatch runs run_sceptre_LUHMES_permuted_data.R

Load packages

dyn.load('/software/geos-3.7.0-el7-x86_64/lib64/libgeos_c.so') # attach the geos lib for Seurat
suppressPackageStartupMessages(library(tidyverse))
library(cowplot)
library(Matrix)
library(sceptre)
library(Seurat)
source("code/plotting_functions.R")

Prepare input data for SCEPTRE

Load the Seurat object of LUHMES data

LUHMES_data <- readRDS('/project2/xinhe/yifan/Factor_analysis/shared_data/LUHMES_cropseq_data_seurat.rds')
datadir <- '/project2/xinhe/kevinluo/GSFA/sceptre_analysis/LUHMES_cropseq_data'

We first prepare three separate data objects required to run SCEPTRE: the gene expression matrix, the perturbation matrix, and the covariate matrix.

  1. Gene expression matrices Gene expression (gene-by-cell) raw count matrix
gene_matrix <- LUHMES_data@assays$RNA@counts
# gene-by-cell expression matrix
gene_matrix[1:10, 1:3]
dim(gene_matrix)
# 10 x 3 sparse Matrix of class "dgCMatrix"
#                 Run1_AAACCTGAGATCGATA Run1_AAACCTGCAGCTCGAC
# ENSG00000243485                     .                     .
# ENSG00000237613                     .                     .
# ENSG00000186092                     .                     .
# ENSG00000238009                     .                     .
# ENSG00000239945                     .                     .
# ENSG00000239906                     .                     .
# ENSG00000241599                     .                     .
# ENSG00000279928                     .                     .
# ENSG00000279457                     1                     .
# ENSG00000228463                     1                     .
#                 Run1_AAACCTGCATTAGCCA
# ENSG00000243485                     .
# ENSG00000237613                     .
# ENSG00000186092                     .
# ENSG00000238009                     .
# ENSG00000239945                     .
# ENSG00000239906                     .
# ENSG00000241599                     .
# ENSG00000279928                     .
# ENSG00000279457                     1
# ENSG00000228463                     .
# [1] 33694  8708
  1. Cell meta data and covariate matrix
metadata <- LUHMES_data@meta.data
metadata[1:5, ]
covariate_matrix <- metadata[,c('orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent_mt')]
covariate_matrix[1:5,]
dim(covariate_matrix)
#                        orig.ident nCount_RNA nFeature_RNA ADNP ARID1B ASH1L
# Run1_AAACCTGAGATCGATA LUHMES_Run1      10655         3287    0      0     0
# Run1_AAACCTGCAGCTCGAC LUHMES_Run1       8648         2791    0      0     0
# Run1_AAACCTGCATTAGCCA LUHMES_Run1       9015         2922    0      0     0
# Run1_AAACCTGGTAAGGGCT LUHMES_Run1       9090         2925    0      0     0
# Run1_AAACCTGGTTCCGTCT LUHMES_Run1      10182         3052    0      0     0
#                       CHD2 CHD8 CTNND2 DYRK1A HDAC5 MECP2 MYT1L Nontargeting
# Run1_AAACCTGAGATCGATA    0    1      0      0     0     0     0            0
# Run1_AAACCTGCAGCTCGAC    0    0      0      0     0     0     0            0
# Run1_AAACCTGCATTAGCCA    1    0      0      0     0     0     0            0
# Run1_AAACCTGGTAAGGGCT    0    0      0      1     0     0     0            0
# Run1_AAACCTGGTTCCGTCT    0    0      0      0     1     0     0            0
#                       POGZ PTEN RELN SETD5 percent_mt
# Run1_AAACCTGAGATCGATA    0    0    0     0   3.594557
# Run1_AAACCTGCAGCTCGAC    0    0    1     0   2.717391
# Run1_AAACCTGCATTAGCCA    0    0    0     0   2.995008
# Run1_AAACCTGGTAAGGGCT    0    0    0     0   3.080308
# Run1_AAACCTGGTTCCGTCT    0    0    0     0   3.142801
#                        orig.ident nCount_RNA nFeature_RNA percent_mt
# Run1_AAACCTGAGATCGATA LUHMES_Run1      10655         3287   3.594557
# Run1_AAACCTGCAGCTCGAC LUHMES_Run1       8648         2791   2.717391
# Run1_AAACCTGCATTAGCCA LUHMES_Run1       9015         2922   2.995008
# Run1_AAACCTGGTAAGGGCT LUHMES_Run1       9090         2925   3.080308
# Run1_AAACCTGGTTCCGTCT LUHMES_Run1      10182         3052   3.142801
# [1] 8708    4
  1. Perturbation matrix (a binary matrix of perturbations, rows are gRNA groups and columns are cell barcodes)
combined_perturbation_matrix <- t(metadata[,4:18])
dim(combined_perturbation_matrix)
combined_perturbation_matrix[1:10,1:3]
range(combined_perturbation_matrix)
# [1]   15 8708
#        Run1_AAACCTGAGATCGATA Run1_AAACCTGCAGCTCGAC Run1_AAACCTGCATTAGCCA
# ADNP                       0                     0                     0
# ARID1B                     0                     0                     0
# ASH1L                      0                     0                     0
# CHD2                       0                     0                     1
# CHD8                       1                     0                     0
# CTNND2                     0                     0                     0
# DYRK1A                     0                     0                     0
# HDAC5                      0                     0                     0
# MECP2                      0                     0                     0
# MYT1L                      0                     0                     0
# [1] 0 1
  1. Specify the gene-gRNA group pairs to test for association

Include the 6k genes used for GSFA as candidates

# Normalized and scaled data used for GSFA, the rownames of which are the 6k genes used for GSFA
scaled_gene_matrix <- LUHMES_data@assays$RNA@scale.data
dim(scaled_gene_matrix)
selected_gene_id <- rownames(scaled_gene_matrix)
all(selected_gene_id %in% rownames(gene_matrix))
# [1] 6000 8708
# [1] TRUE
gRNA_group <- rownames(combined_perturbation_matrix)
pairs <- expand.grid(selected_gene_id, gRNA_group)
gene_gRNA_group_pairs <- data.frame(gene_id = pairs$Var1, gRNA_group = pairs$Var2, pair_type = "candidate")
gene_gRNA_group_pairs[gene_gRNA_group_pairs$gRNA_group == "Nontargeting", "pair_type"] <- "negative_control"
table(gene_gRNA_group_pairs$pair_type)
table(gene_gRNA_group_pairs$gRNA_group)
dim(gene_gRNA_group_pairs)
# 
#        candidate negative_control 
#            84000             6000 
# 
#         ADNP       ARID1B        ASH1L         CHD2         CHD8       CTNND2 
#         6000         6000         6000         6000         6000         6000 
#       DYRK1A        HDAC5        MECP2        MYT1L Nontargeting         POGZ 
#         6000         6000         6000         6000         6000         6000 
#         PTEN         RELN        SETD5 
#         6000         6000         6000 
# [1] 90000     3
save(list = c("gene_matrix", "combined_perturbation_matrix", "covariate_matrix"),
     file = file.path(datadir, 'data.matrices.RData'))
saveRDS(gene_gRNA_group_pairs, file.path(datadir, "gene.gRNA.group.pairs.rds"))

Analyze the SCEPTRE results

outdir <- '/project2/xinhe/kevinluo/GSFA/sceptre_analysis/LUHMES_data_updated/sceptre_output'
result <- readRDS(file.path(outdir, 'sceptre.result.rds'))
head(result, 10)
#            gene_id gRNA_id pair_type      p_value     z_value log_fold_change
# 1  ENSG00000251562    ADNP candidate 8.276198e-04  -2.7128544    -0.042399626
# 2  ENSG00000167552    ADNP candidate 5.699734e-05   5.6529476     0.090952484
# 3  ENSG00000205542    ADNP candidate 1.175300e-03  -3.2572089    -0.077037279
# 4  ENSG00000104722    ADNP candidate 5.498633e-07 -11.9722455    -0.218557354
# 5  ENSG00000156508    ADNP candidate 6.863815e-01  -0.4263849    -0.040725865
# 6  ENSG00000034510    ADNP candidate 5.594500e-02   2.2087809     0.025101774
# 7  ENSG00000198712    ADNP candidate 8.261282e-01   0.1660938    -0.005275184
# 8  ENSG00000213190    ADNP candidate 1.769592e-01   1.4053725     0.012649649
# 9  ENSG00000184009    ADNP candidate 5.329976e-01   0.7832493    -0.007898580
# 10 ENSG00000198668    ADNP candidate 7.860573e-03   4.6174578     0.042193477

Negative control pairs

neg_control_p_vals <- result %>% filter(pair_type == "negative_control") %>% pull(p_value)
make_qq_plot(neg_control_p_vals)

Candidate pairs We extract the p-values corresponding to the candidate pairs and apply a Benjamini-Hochberg (BH) correction to adjust for multiple testing.

candidate_pair_results <- result %>% filter(pair_type == "candidate") %>%
  mutate(p_val_adj = p.adjust(p_value, method = "BH"))
head(candidate_pair_results)

candidate_pvals <- candidate_pair_results %>% pull(p_value)
make_qq_plot(candidate_pvals)

#           gene_id gRNA_id pair_type      p_value     z_value log_fold_change
# 1 ENSG00000251562    ADNP candidate 8.276198e-04  -2.7128544     -0.04239963
# 2 ENSG00000167552    ADNP candidate 5.699734e-05   5.6529476      0.09095248
# 3 ENSG00000205542    ADNP candidate 1.175300e-03  -3.2572089     -0.07703728
# 4 ENSG00000104722    ADNP candidate 5.498633e-07 -11.9722455     -0.21855735
# 5 ENSG00000156508    ADNP candidate 6.863815e-01  -0.4263849     -0.04072587
# 6 ENSG00000034510    ADNP candidate 5.594500e-02   2.2087809      0.02510177
#      p_val_adj
# 1 0.1089391252
# 2 0.0160663656
# 3 0.1220336025
# 4 0.0003446905
# 5 0.9638421091
# 6 0.6058244221

We call pairs with an adjusted p-value of less or equal than 0.1 significant; the discovery set (i.e., the set of significant pairs) has a false discovery rate (FDR) of 10%.

discovery_set <- candidate_pair_results %>% filter(p_val_adj <= 0.1)
head(discovery_set)
#           gene_id gRNA_id pair_type      p_value    z_value log_fold_change
# 1 ENSG00000167552    ADNP candidate 5.699734e-05   5.652948      0.09095248
# 2 ENSG00000104722    ADNP candidate 5.498633e-07 -11.972246     -0.21855735
# 3 ENSG00000124766    ADNP candidate 2.428708e-06  -5.419528     -0.08567983
# 4 ENSG00000173267    ADNP candidate 2.220446e-16  12.720042      0.17604320
# 5 ENSG00000131711    ADNP candidate 1.852806e-05   4.984110      0.08523870
# 6 ENSG00000277586    ADNP candidate 6.380460e-05  -6.488863     -0.15174885
#      p_val_adj
# 1 1.606637e-02
# 2 3.446905e-04
# 3 1.214354e-03
# 4 1.434750e-12
# 5 6.539316e-03
# 6 1.763022e-02

Permutation analysis

Pool results from 10 permuted data sets

outdir <- "/project2/xinhe/kevinluo/GSFA/sceptre_analysis/LUHMES_data_updated/sceptre_output/permutation"

sceptre_res <- data.frame()
for(permute_num in 1:10){
  cat("permutation set: ",permute_num, "\n")
  res_dir <- paste0(outdir, "/perm_", permute_num)
  res <- readRDS(file.path(res_dir, 'sceptre.result.rds'))
  sceptre_res <- rbind(sceptre_res, data.frame(permute_num = permute_num, res))
}

nrow(sceptre_res)
anyNA(sceptre_res$p_value)
# permutation set:  1 
# permutation set:  2 
# permutation set:  3 
# permutation set:  4 
# permutation set:  5 
# permutation set:  6 
# permutation set:  7 
# permutation set:  8 
# permutation set:  9 
# permutation set:  10 
# [1] 900000
# [1] FALSE

QQ-plot for candidate pairs

candidate_pair_results <- sceptre_res %>% filter(pair_type == "candidate") %>%
  mutate(p_val_adj = p.adjust(p_value, method = "BH"))

candidate_p_vals <- candidate_pair_results %>% pull(p_value)
# pdf(file.path(outdir, "LUHMES_permutation_qqplot.pdf"), width = 5, height = 5)
qqunif.plot(candidate_p_vals, main = "SCEPTRE")
# Loading required package: grid

# dev.off()

sessionInfo()
# R version 4.2.0 (2022-04-22)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)
# 
# Matrix products: default
# BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
#  [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
#  [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
# [10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 
# 
# attached base packages:
# [1] grid      stats     graphics  grDevices utils     datasets  methods  
# [8] base     
# 
# other attached packages:
#  [1] lattice_0.20-45    sp_1.4-7           SeuratObject_4.1.0 Seurat_4.1.1      
#  [5] sceptre_0.1.0      Matrix_1.4-1       cowplot_1.1.1      forcats_0.5.1     
#  [9] stringr_1.4.0      dplyr_1.0.9        purrr_0.3.4        readr_2.1.2       
# [13] tidyr_1.2.0        tibble_3.1.7       ggplot2_3.3.6      tidyverse_1.3.1   
# [17] workflowr_1.7.0   
# 
# loaded via a namespace (and not attached):
#   [1] readxl_1.4.0          backports_1.4.1       plyr_1.8.7           
#   [4] igraph_1.3.4          lazyeval_0.2.2        splines_4.2.0        
#   [7] listenv_0.8.0         scattermore_0.8       digest_0.6.29        
#  [10] htmltools_0.5.2       fansi_1.0.3           magrittr_2.0.3       
#  [13] tensor_1.5            cluster_2.1.3         ROCR_1.0-11          
#  [16] tzdb_0.3.0            globals_0.15.0        modelr_0.1.8         
#  [19] matrixStats_0.62.0    spatstat.sparse_2.1-1 colorspace_2.0-3     
#  [22] rvest_1.0.2           ggrepel_0.9.1         haven_2.5.0          
#  [25] xfun_0.30             callr_3.7.0           crayon_1.5.1         
#  [28] jsonlite_1.8.0        progressr_0.10.0      spatstat.data_2.2-0  
#  [31] survival_3.3-1        zoo_1.8-10            glue_1.6.2           
#  [34] polyclip_1.10-0       gtable_0.3.0          leiden_0.4.2         
#  [37] future.apply_1.9.0    abind_1.4-5           scales_1.2.0         
#  [40] DBI_1.1.3             spatstat.random_2.2-0 miniUI_0.1.1.1       
#  [43] Rcpp_1.0.8.3          viridisLite_0.4.0     xtable_1.8-4         
#  [46] reticulate_1.24       spatstat.core_2.4-2   htmlwidgets_1.5.4    
#  [49] httr_1.4.3            RColorBrewer_1.1-3    ellipsis_0.3.2       
#  [52] ica_1.0-2             farver_2.1.0          pkgconfig_2.0.3      
#  [55] uwot_0.1.11           sass_0.4.1            dbplyr_2.1.1         
#  [58] deldir_1.0-6          utf8_1.2.2            tidyselect_1.1.2     
#  [61] rlang_1.0.2           reshape2_1.4.4        later_1.3.0          
#  [64] munsell_0.5.0         cellranger_1.1.0      tools_4.2.0          
#  [67] cli_3.3.0             generics_0.1.2        broom_0.8.0          
#  [70] ggridges_0.5.3        evaluate_0.15         fastmap_1.1.0        
#  [73] yaml_2.3.5            goftest_1.2-3         processx_3.5.3       
#  [76] knitr_1.39            fs_1.5.2              fitdistrplus_1.1-8   
#  [79] RANN_2.6.1            pbapply_1.5-0         future_1.25.0        
#  [82] nlme_3.1-157          whisker_0.4           mime_0.12            
#  [85] xml2_1.3.3            compiler_4.2.0        rstudioapi_0.13      
#  [88] plotly_4.10.0         png_0.1-7             spatstat.utils_2.3-1 
#  [91] reprex_2.0.1          bslib_0.3.1           stringi_1.7.6        
#  [94] highr_0.9             ps_1.7.0              rgeos_0.5-9          
#  [97] vctrs_0.4.1           pillar_1.7.0          lifecycle_1.0.1      
# [100] spatstat.geom_2.4-0   lmtest_0.9-40         jquerylib_0.1.4      
# [103] RcppAnnoy_0.0.19      data.table_1.14.2     irlba_2.3.5          
# [106] httpuv_1.6.5          patchwork_1.1.1       R6_2.5.1             
# [109] promises_1.2.0.1      KernSmooth_2.23-20    gridExtra_2.3        
# [112] parallelly_1.31.1     codetools_0.2-18      MASS_7.3-56          
# [115] assertthat_0.2.1      rprojroot_2.0.3       withr_2.5.0          
# [118] sctransform_0.3.3     mgcv_1.8-40           parallel_4.2.0       
# [121] hms_1.1.1             rpart_4.1.16          rmarkdown_2.14       
# [124] Rtsne_0.16            git2r_0.30.1          getPass_0.2-2        
# [127] shiny_1.7.1           lubridate_1.8.0