Tutorial for data preparation

In this tutorial, we prepare input data for the analysis.

Load the package.

library(mapgen)

GWAS summary statistics

We need a data frame of GWAS summary statistics with the following columns:

chr: chromosome
pos: position
beta: effect size (if you have Odds Ratio, you will need to transform it to log(Odds Ratio))
se: standard error (SE)
a0: reference allele
a1: association/effect allele
snp: SNP ID (e.g. rsID)
pval: p-value

Let’s load a small example GWAS dataset (test.sumstats.txt.gz), and process the data (see below).

We can use the vroom package to read the GWAS summary statistics.

library(vroom)
gwas.file <- system.file('extdata', 'test.sumstats.txt.gz', package='mapgen')
sumstats <- vroom(gwas.file, col_names = TRUE, show_col_types = FALSE)
head(as.data.frame(sumstats),3)

##                           var_name                              phase3_1kg_id
## 1 1_10616_CCGCCGTTGCAAAGGCGCGCCG_C rs376342519:10616:CCGCCGTTGCAAAGGCGCGCCG:C
## 2                      1_11008_C_G                                1:11008:C:G
## 3                      1_11012_C_G                                1:11012:C:G
##   chr position_b37                     a0 a1 bcac_onco2_r2
## 1   1        10616 CCGCCGTTGCAAAGGCGCGCCG  C      0.837659
## 2   1        11008                      C  G      0.580657
## 3   1        11012                      C  G      0.580657
##   bcac_onco2_eaf_controls bcac_onco2_beta bcac_onco2_se bcac_onco2_P1df_Wald
## 1                  0.9925         0.00009       0.05784             0.998700
## 2                  0.0979        -0.04365       0.02002             0.029280
## 3                  0.0979        -0.04365       0.02002             0.029279
##   bcac_onco2_P1df_LRT bcac_icogs2_r2       rsIDs
## 1            0.998700       0.262460 rs376342519
## 2            0.029368       0.335425           1
## 3            0.029367       0.335425           1

LD blocks

For the LD blocks, we need a data frame with four columns: chromosome, start and end positions, and the indices of the LD blocks.

We provided the LD blocks from LDetect for the 1000 Genomes (1KG) European population (in hg19).

LD_Blocks <- readRDS(system.file('extdata', 'LD.blocks.EUR.hg19.rds', package='mapgen'))
head(LD_Blocks, 3)

##   chr   start     end locus
## 1   1   10583 1892607     1
## 2   1 1892607 3582736     2
## 3   1 3582736 4380811     3

* You can skip this if you only need to run enrichment analysis.

Reference panel

Fine-mapping requires linkage-disequilibrium (LD) information.

You could either provide LD matrices or use a reference genotype panel, which will compute LD matrices internally.

To use the reference genotype panel, we utilize the R package bigsnpr to read in PLINK files (bed/bim/fam) into R and match alleles between GWAS summary statistics and reference genotype panel.

If you have reference genotypes in PLINK format, you can use bigsnpr::snp_readBed() to read bed/bim/fam files into a bigSNP object and save as a ‘.rds’ file. This ‘.rds’ file could be used for downstream analyses when attached with .

We provided a bigSNP object of the reference genotype panel from the 1000 Genomes (1KG) European population. If you are in the He lab at UChicago, you can load the bigSNP object from RCC as below.

library(bigsnpr)
bigSNP <- snp_attach(rdsfile = '/project2/xinhe/1kg/bigsnpr/EUR_variable_1kg.rds')

We also have pre-computed LD matrices of European samples from UK Biobank. They can be downloaded [here][UKBB_LD]. If you are at UChicago, you can directly access those LD matrices from RCC at /project2/mstephens/wcrouse/UKB_LDR_0.1_b37/.

We create a data frame region_info, with filenames of UKBB reference LD matrices and SNP info adding to the LD_Blocks.

region_info <- get_UKBB_region_info(LD_Blocks,
                                    LDREF.dir = "/project2/mstephens/wcrouse/UKB_LDR_0.1_b37", 
                                    prefix = "ukb_b37_0.1")

* You can skip this if you only need to run enrichment analysis, or if you only need to run downstream analysis (e.g. gene mapping) with precomputed fine-mapping result.

Process GWAS summary statistics

We run process_gwas_sumstats() to process the summary statistics. This checks and cleans GWAS summary statistics, add locus ID from the LD_Blocks if available, match alleles in GWAS SNPs with the reference panel from the bigSNP object if bigSNP is specified, or harmonize GWAS SNPs with the reference LD information from the precopmuted LD matrices if region_info is specified.

gwas.sumstats <- process_gwas_sumstats(sumstats, 
                                       chr='chr', 
                                       pos='position_b37', 
                                       beta='bcac_onco2_beta', 
                                       se='bcac_onco2_se',
                                       a0='a0', 
                                       a1='a1', 
                                       snp='phase3_1kg_id', 
                                       pval='bcac_onco2_P1df_Wald',
                                       LD_Blocks=LD_Blocks,
                                       bigSNP=bigSNP)

* You do not need to specify bigSNP, region_info, or LD_Blocks if you only need to run enrichment analysis.

Check that the summary statistics is processed and has appropriate columns:

head(as.data.frame(gwas.sumstats),3)

##   chr   pos a0 a1     beta      se                   snp    pval     zscore
## 1   1 13110  A  G -0.02066 0.03231           1:13110:G:A 0.52245 -0.6394305
## 2   1 13116  G  T -0.01255 0.01895 rs201725126:13116:T:G 0.50772 -0.6622691
## 3   1 13118  G  A -0.01255 0.01895 rs200579949:13118:A:G 0.50772 -0.6622691
##   locus ss_index bigSNP_index
## 1     1        3            4
## 2     1        4            5
## 3     1        5            6

Kaixuan Luo

GWAS summary statistics

LD blocks

Reference panel

Process GWAS summary statistics