Finding the significance of the overlap between 2 or more gene sets using simulation in R.

TLDR: Example R function to calculate significance of overlap of 2 or more gene sets. genes_all is a vector that contains all genes, and gene_sets takes a list of vectors for each gene set. I encourage people to read the full tutorial and attempt to reproduce the code themselves (especially since there are quicker ways to do the simulation).

library("purrr")

overlap_significance <- function(genes_all, gene_sets, iterations) {
  observed <- length(reduce(gene_sets, intersect))
  simulated <- map_dbl(seq_len(iterations), function(x) {
    sim <- map(lengths(gene_sets), ~sample(genes_all, .x))
    sim <- length(reduce(sim, intersect))
    return(sim)
  })
  pval <- (sum(simulated >= observed) + 1) / (iterations + 1)
  return(list(pval=pval, simulated_values=simulated, observed=observed))
}

I often see the question asked about calculating the significance of overlap of 2 or more gene sets. There are indeed tools available for this with 2 to 3 gene sets. However, I wanted to take this opportunity to introduce the power of simulation and resampling to easily tackle this question for 2 or more gene sets in R.

First I generate a set of 10,000 genes for use in this tutorial, and then select at random 10 genes that will be guaranteed to overlap in all example datasets.

all_genes <- sprintf("ENSG%08d", seq_len(10000))
common_genes <- sample(all_genes, 10)

> head(all_genes, 5)
[1] "ENSG00000001" "ENSG00000002" "ENSG00000003" "ENSG00000004" "ENSG00000005"

For the significance of overlap between just 2 gene sets, the question is often modeled using a hypergeometric distribution. Using the example given in R, the hypergeometric distribution is concerned with modeling the probability of getting x white marbles if you randomly pick (without replacement) k marbles from a jar with m white marbles and n black marbles. This tutorial will not go over this concept in detail, since I only bring it up to point out that this distribution is used to generate a null hypothesis that the overlap between the gene sets could plausibly occur through random sampling. For the case of 2 gene sets, I will show you that our simulated null distribution will match the hypergeometric null distribution.

From the example gene set I made above, I’ll make two gene sets that are guaranteed to have an overlap of at least 10 genes.

library("tidyverse")

# Create 2 gene sets.
sets <- map(seq_len(2), function(x) {
  c(common_genes, sample(all_genes, sample(seq(200, 500), 1)))
})

> lengths(sets)
[1] 487 240

# Get the overlapping number of genes.
observed <- length(reduce(sets, intersect))

> observed
[1] 16

To simulate the null distribution for the overlap of two genes sets, for 10,000 iterations I will create two gene sets the same size as the original gene sets by randomly sampling genes without replacement from the full gene list, and then compute the number of overlapping genes. I will also generate a hypergeometric null distribution for comparison based on sampling 10,000 values from a hypergeometric distribution.

# Hypergeometric null distribution.
hyper <- rhyper(
  nn=length(all_genes), m=length(sets[[1]]),
  n=length(all_genes) - length(sets[[1]]),
  k=length(sets[[2]])
)

# Simulated null distribution.
simulated <- map_dbl(seq_len(10000), function(x) {
   sim <- map(lengths(sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))
   return(sim)
})

Here is a plot that shows the null distribution methodologies side to side.

data.frame(Simulated=simulated, Hypergeometric=hyper) %>%
  pivot_longer(everything(), names_to="Method", values_to="Overlap") %>%
  ggplot(aes(x=Overlap, fill=Method)) +
    geom_histogram(binwidth=1) +
    geom_vline(xintercept=observed, lty=2, color="grey", size=1) +
    theme_bw() +
    theme(text=element_text(size=12), legend.position="none") +
    facet_grid(Method~.) +
    scale_fill_manual(values=c("dodgerblue", "seagreen")) +
    ylab("Count")

null set of 2

The plot of the two distributions look nearly identical. The grey dotted line in the observed value, and by cursory inspection of its location, it can be assumed that the p-value will be fairly high.

The p-values for the simulated method is simply the number of simulated values that were greater than or equal to the observed value (plus one), over the number of iterations (plus one). I’ve also included the p-value for the hypergeometric test, which is the probability of getting the observed value or greater under the hypergeometric null distribution.

sim_pval <- (sum(simulated >= observed) + 1) / (10000 + 1)

> sim_pval
[1] 0.1330867

hyper_pval <- phyper(
  q=observed, m=length(sets[[1]]),
  n=length(all_genes) - length(sets[[1]]),
  k=length(sets[[2]]), lower.tail=FALSE
)

> hyper_pval
[1] 0.07750621

If we were to step up to comparing the overlap of 3 or more gene sets, the problem becomes more difficult since we would need to use a multivariate hypergeomtric distribution. However, the code we wrote for the simulation can used as is with as many gene sets as we want. Here is an example of using 3 gene sets.

# Generate 3 random gene sets with at least 10 genes overlapping.
sets <- map(seq_len(3), function(x) {
  c(common_genes, sample(all_genes, sample(seq(200, 500), 1)))
})

> lengths(sets)
[1] 375 341 305

# Observed overlap.  
observed <- length(reduce(sets, intersect))

> observed
[1] 10

# Simulated overlap.
simulated <- map_dbl(seq_len(10000), function(x) {
   sim <- map(lengths(sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))
   return(sim)
})

For the case of 3 gene sets, I’ve generated another plot below. Comparing this to the example with 2 gene sets from above, the observed value is much higher than most of the simulated values, so you would expect a fairly low p-value.

ggplot() +
  geom_histogram(mapping=aes(x=simulated), binwidth=1, fill="dodgerblue") +
  geom_vline(xintercept=observed, lty=2, color="grey", size=1) +
  theme_bw() +
  theme(text=element_text(size=12)) +
  xlab("Overlap") +
  ylab("Count")

null set 3

pval <- (sum(simulated >= observed) + 1) / (10000 + 1)

> pval
[1] 9.999e-05

As expected the p-value is low.

Session info provided for posterity.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/R/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.12.8 forcats_0.5.0     stringr_1.4.0     dplyr_1.0.2      
 [5] purrr_0.3.4       readr_1.3.1       tidyr_1.1.2       tibble_3.0.3     
 [9] ggplot2_3.3.2     tidyverse_1.3.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     pillar_1.4.6     compiler_4.0.2   cellranger_1.1.0
 [5] dbplyr_1.4.4     tools_4.0.2      digest_0.6.25    jsonlite_1.7.0  
 [9] lubridate_1.7.9  lifecycle_0.2.0  gtable_0.3.0     pkgconfig_2.0.3 
[13] rlang_0.4.7      reprex_0.3.0     cli_2.0.2        rstudioapi_0.11 
[17] DBI_1.1.0        haven_2.3.1      withr_2.2.0      xml2_1.3.2      
[21] httr_1.4.2       fs_1.5.0         generics_0.0.2   vctrs_0.3.4     
[25] hms_0.5.3        grid_4.0.2       tidyselect_1.1.0 glue_1.4.2      
[29] R6_2.4.1         fansi_0.4.1      readxl_1.3.1     farver_2.0.3    
[33] modelr_0.1.8     blob_1.2.1       magrittr_1.5     backports_1.1.9 
[37] scales_1.1.1     ellipsis_0.3.1   rvest_0.3.6      assertthat_0.2.1
[41] colorspace_1.4-1 labeling_0.3     stringi_1.4.6    munsell_0.5.0   
[45] broom_0.7.0      crayon_1.3.4

Read more here: Source link