Genomic ranges of problematic genomic regions that should be avoided
when working with genomic data. For human, mouse, and selected model
organisms.
TL;DR – For human hg38 genome assembly,
Anshul
recommends
ENCFF356LFX exclusion list
regions.
BED files of exclusion regions are available on the ENCODE
project
website (Amemiya, Kundaje, and Boyle 2019). Human (hg19, hg38) and mouse
(mm9, mm10) exclusion regions are available. However, exclusion lists
generated by multiple labs often create uncertainty what to use. The
purpose of this package is to provide a unified place for informed
retrieval of exclusion regions.
Naming convention: <genome assembly>.<lab>.<original file name>
, e.g.,
hg19.Birney.wgEncodeDacMapabilityConsensusExcludable
.
Download all data from the Google Drive
folder
See make-data.R how to create the
excluderanges GRanges objects.
excluderanges
Install if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
# Install the development version of Bioconductor (need 3.14 and above)
# BiocManager::install(version = "devel")
# Check that you have a valid Bioconductor installation
# BiocManager::valid()
BiocManager::install("mdozmorov/excluderanges")
Use excluderanges
# hg38 excluderanges coordinates
download.file(url = "https://drive.google.com/uc?export=download&id=1Jy_7cvFj5ZHUUwHgGUKZeFk8lFCalpKk", destfile = "hg38.Kundaje.GRCh38_unified_Excludable.rds")
excludeGR.hg38.Kundaje.1 <- readRDS(file = "hg38.Kundaje.GRCh38_unified_Excludable.rds")
excludeGR.hg38.Kundaje.1
> excludeGR.hg38.Kundaje.1
GRanges object with 910 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 628903-635104 *
[2] chr1 5850087-5850571 *
[3] chr1 8909610-8910014 *
[4] chr1 9574580-9574997 *
[5] chr1 32043823-32044203 *
... ... ... ...
[906] chrY 11290797-11334278 *
[907] chrY 11493053-11592850 *
[908] chrY 11671014-11671046 *
[909] chrY 11721528-11749472 *
[910] chrY 56694632-56889743 *
-------
seqinfo: 24 sequences from hg38 genome
Save the data in a BED file, if needed.
rtracklayer::export(excludeGR.hg38.Kundaje.1, "hg38.Kundaje.GRCh38_unified_Excludable.bed", format = "bed")
We can load other excludable regions for the hg38 genome assembly and
compare them.
download.file(url = "https://drive.google.com/uc?export=download&id=13RQmvqwC9qaYE65i74U1N-Rb4R-Sy2ud", destfile = "hg38.Bernstein.Mint_Excludable_GRCh38.rds")
excludeGR.hg38.Bernstein <- readRDS(file = "hg38.Bernstein.Mint_Excludable_GRCh38.rds")
download.file(url = "https://drive.google.com/uc?export=download&id=1ULI6uvs0vTlPcZ5yyLdsSgX1mIqnnlpp", destfile = "hg38.Kundaje.GRCh38.Excludable.rds")
excludeGR.hg38.Kundaje.2 <- readRDS(file = "hg38.Kundaje.GRCh38.Excludable.rds")
download.file(url = "https://drive.google.com/uc?export=download&id=1TA5Y8nmutoAyhL8XEl2qQTnmW6gl8CAg", destfile = "hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38.rds")
excludeGR.hg38.Reddy <- readRDS(file = "hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38.rds")
download.file(url = "https://drive.google.com/uc?export=download&id=1psVWkiaLy4nxAWssqv5dEZyNUrDIm005", destfile = "hg38.Wold.hg38mitoExcludable.rds")
excludeGR.hg38.Wold <- readRDS(file = "hg38.Wold.hg38mitoExcludable.rds")
download.file(url = "https://drive.google.com/uc?export=download&id=1pwbBEOHFfnVb6I9UljF3bJQLF3et4WV7", destfile = "hg38.Yeo.eCLIP_Excludableregions.hg38liftover.bed.fixed.rds")
excludeGR.hg38.Yeo <- readRDS(file = "hg38.Yeo.eCLIP_Excludableregions.hg38liftover.bed.fixed.rds")
Compare the number of excludable regions.
library(ggplot2)
mtx_to_plot <- data.frame(Count = c(length(excludeGR.hg38.Bernstein),
length(excludeGR.hg38.Kundaje.1),
length(excludeGR.hg38.Kundaje.2),
length(excludeGR.hg38.Reddy),
length(excludeGR.hg38.Wold),
length(excludeGR.hg38.Yeo)),
Source = c("Bernstein.Mint_Excludable_GRCh38",
"Kundaje.GRCh38_unified_Excludable",
"Kundaje.GRCh38.Excludable",
"Reddy.wgEncodeDacMapabilityConsensusExcludable",
"Wold.hg38mitoExcludable",
"Yeo.eCLIP_Excludableregions.hg38liftover.bed"))
# Order Source by the number of regions
mtx_to_plot$Source <- factor(mtx_to_plot$Source, levels = mtx_to_plot$Source[order(mtx_to_plot$Count)])
ggplot(mtx_to_plot, aes(x = Source, y = Count, fill = Source)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw() + theme(legend.position = "none")
ggsave("man/figures/excluderanges_hg38_count.png", width = 5.5, height = 2)
Compare the width of excludable regions. log2 scale because of heavy
right tail distributions.
library(ggridges)
mtx_to_plot <- data.frame(Width = c(width(excludeGR.hg38.Bernstein),
width(excludeGR.hg38.Kundaje.1),
width(excludeGR.hg38.Kundaje.2),
width(excludeGR.hg38.Reddy),
width(excludeGR.hg38.Wold),
width(excludeGR.hg38.Yeo)),
Source = c(rep("Bernstein.Mint_Excludable_GRCh38", length(excludeGR.hg38.Bernstein)),
rep("Kundaje.GRCh38_unified_Excludable", length(excludeGR.hg38.Kundaje.1)),
rep("Kundaje.GRCh38.Excludable", length(excludeGR.hg38.Kundaje.2)),
rep("Reddy.wgEncodeDacMapabilityConsensusExcludable", length(excludeGR.hg38.Reddy)),
rep("Wold.hg38mitoExcludable", length(excludeGR.hg38.Wold)),
rep("Yeo.eCLIP_Excludableregions.hg38liftover.bed", length(excludeGR.hg38.Yeo))))
ggplot(mtx_to_plot, aes(x = log2(Width), y = Source, fill = Source)) +
geom_density_ridges() +
theme_bw() + theme(legend.position = "none")
ggsave("man/figures/excluderanges_hg38_width.png", width = 5.5, height = 2)
We can investigate the total width of each set of excludable ranges.
mtx_to_plot <- data.frame(TotalWidth = c(sum(width(excludeGR.hg38.Bernstein)),
sum(width(excludeGR.hg38.Kundaje.1)),
sum(width(excludeGR.hg38.Kundaje.2)),
sum(width(excludeGR.hg38.Reddy)),
sum(width(excludeGR.hg38.Wold)),
sum(width(excludeGR.hg38.Yeo))),
Source = c("Bernstein.Mint_Excludable_GRCh38",
"Kundaje.GRCh38_unified_Excludable",
"Kundaje.GRCh38.Excludable",
"Reddy.wgEncodeDacMapabilityConsensusExcludable",
"Wold.hg38mitoExcludable",
"Yeo.eCLIP_Excludableregions.hg38liftover"))
ggplot(mtx_to_plot, aes(x = TotalWidth, y = Source, fill = Source)) +
geom_bar(stat="identity") + scale_x_log10() + scale_y_discrete(label=abbreviate) +
xlab("log10 total width")
ggsave("man/figures/excluderanges_hg38_sumwidth.png", width = 6.5, height = 2)
We can compare Jaccard overlap between those sets of excludable regions.
library(pheatmap)
library(stringr)
# Jaccard calculations
jaccard <- function(gr_a, gr_b) {
intersects <- GenomicRanges::intersect(gr_a, gr_b, ignore.strand = TRUE)
intersection <- sum(width(intersects))
union <- sum(width(GenomicRanges::union(gr_a, gr_b, ignore.strand = TRUE)))
DataFrame(intersection, union,
jaccard = intersection/union,
n_intersections = length(intersects))
}
# List and names of all excludable regions
all_excludeGR_list <- list(excludeGR.hg38.Bernstein,
excludeGR.hg38.Kundaje.1,
excludeGR.hg38.Kundaje.2,
excludeGR.hg38.Reddy,
excludeGR.hg38.Wold,
excludeGR.hg38.Yeo)
all_excludeGR_name <- c("Bernstein.Mint_Excludable_GRCh38",
"Kundaje.GRCh38_unified_Excludable",
"Kundaje.GRCh38.Excludable",
"Reddy.wgEncodeDacMapabilityConsensusExcludable",
"Wold.hg38mitoExcludable",
"Yeo.eCLIP_Excludableregions.hg38liftover")
# Correlation matrix, empty
mtx_to_plot <- matrix(data = 0, nrow = length(all_excludeGR_list), ncol = length(all_excludeGR_list))
# Fill it in
for (i in 1:length(all_excludeGR_list)) {
for (j in 1:length(all_excludeGR_list)) {
# If diagonal, set to zero
if (i == j) mtx_to_plot[i, j] <- 0
# Process only one half, the other is symmetric
if (i > j) {
mtx_to_plot[i, j] <- mtx_to_plot[j, i] <- jaccard(all_excludeGR_list[[i]], all_excludeGR_list[[j]])[["jaccard"]]
}
}
}
# Trim row/colnames
rownames(mtx_to_plot) <- colnames(mtx_to_plot) <- str_trunc(all_excludeGR_name, width = 25)
# Save the plot
png("man/figures/excluderanges_hg38_jaccard.png", width = 1000, height = 900, res = 200)
pheatmap(data.matrix(mtx_to_plot))
dev.off()
Note that some excludable ranges objects contain six columns, implying
there may be some interesting metadata. Let’s explore one.
mcols(excludeGR.hg38.Reddy)
mtx_to_plot <- as.data.frame(table(mcols(excludeGR.hg38.Reddy)[["name"]]))
colnames(mtx_to_plot) <- c("Type", "Number")
mtx_to_plot <- mtx_to_plot[order(mtx_to_plot$Number), ]
mtx_to_plot$Type <- factor(mtx_to_plot$Type, levels = mtx_to_plot$Type)
ggplot(mtx_to_plot, aes(x = Number, y = Type, fill = Type)) +
geom_bar(stat="identity") +
theme_bw() + theme(legend.position = "none")
ggsave("man/figures/excluderanges_hg38_Reddy_metadata.png", width = 5, height = 2.5)
One may decide to combine the excludable ranges from all labs, although
from previous results we may decide to follow Anshul’s
advice
advice about the ENCFF356LFX exclusion list
regions and use the
excludeGR.hg38.Kundaje.1
object.
excludeGR.hg38.all <- reduce(c(excludeGR.hg38.Bernstein, excludeGR.hg38.Kundaje.1, excludeGR.hg38.Kundaje.2, excludeGR.hg38.Reddy, excludeGR.hg38.Wold, excludeGR.hg38.Yeo))
# Keep only standard chromosomes
excludeGR.hg38.all <- keepStandardChromosomes(excludeGR.hg38.all, pruning.mode = "coarse")
print(length(excludeGR.hg38.all))
# [1] 13239
summary(width(excludeGR.hg38.all))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 5 1778 2306 8153 2859 5407757
Centromeres, telomeres, etc.
Besides the ENCODE-produced excludable regions, we may want to exclude
centromeres, telomeres, and other gap locations. The “Gap Locations”
track for Homo Sapiens is available for the GRcH37/hg19 genome assembly
as a UCSC ‘gap’
table.
It can be retrieved from
AnnotationHub,
but lacks the metadata columns needed to decide the type of gaps.
suppressPackageStartupMessages(library(AnnotationHub))
ah <- AnnotationHub()
# Search for the gap track
# ahData <- query(ah, c("gap", "Homo sapiens", "hg19"))
# ahData[ctcfData$title == "Gap"]
gaps <- ahData[["AH6444"]]
> gaps
UCSC track 'gap'
UCSCData object with 457 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 124535435-142535434 *
[2] chr1 121535435-124535434 *
[3] chr1 3845269-3995268 *
[4] chr1 13219913-13319912 *
[5] chr1 17125659-17175658 *
... ... ... ...
[453] chr6_ssto_hap7 4639807-4661556 *
[454] chr6_ssto_hap7 4723510-4774367 *
[455] chr6_ssto_hap7 4783799-4836049 *
[456] chr17_ctg5_hap1 1256795-1306794 *
[457] chr17_ctg5_hap1 1588969-1638968 *
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
The UCSC ‘gap’
table
provides better granularity about the types of gaps available. E.g., for
human, hg19, we have the following types and the number of gaps.
Those objects are provided as individual GRanges.
Naming convention: <genome assembly>.UCSC.<gap type>
, e.g.,
hg19.UCSC.gap_centromere
.
Download all data from the Google Drive
folder
We can similarly load any gap type object.
download.file(url = "https://drive.google.com/uc?export=download&id=1uno7XzgCdFJeMjYI6GMfU5bAwEBWaUlR", destfile = "hg19.UCSC.centromere.rds")
gapsGR_hg19_centromere <- readRDS(file = "hg19.UCSC.centromere.rds")
gapsGR_hg19_centromere
> gapsGR_hg19_centromere
GRanges object with 24 ranges and 6 metadata columns:
seqnames ranges strand | bin ix n size type bridge
<Rle> <IRanges> <Rle> | <numeric> <numeric> <character> <numeric> <character> <character>
2 chr1 121535434-124535434 * | 23 1270 N 3000000 centromere no
184 chr21 11288129-14288129 * | 10 22 N 3000000 centromere no
199 chr22 13000000-16000000 * | 10 3 N 3000000 centromere no
206 chr19 24681782-27681782 * | 1 410 N 3000000 centromere no
224 chrY 10104553-13104553 * | 10 105 N 3000000 centromere no
... ... ... ... . ... ... ... ... ... ...
439 chr6 58830166-61830166 * | 16 628 N 3000000 centromere no
453 chr5 46405641-49405641 * | 14 452 N 3000000 centromere no
460 chr4 49660117-52660117 * | 1 447 N 3000000 centromere no
476 chr3 90504854-93504854 * | 2 784 N 3000000 centromere no
481 chr2 92326171-95326171 * | 20 770 N 3000000 centromere no
-------
seqinfo: 24 sequences from hg19 genome
Centromeres for the hg38 genome assembly
Note that the UCSC ‘gap’ table for the hg38 human genome assembly does
not contain genomic coordinates for the “centromere” gap type. These can
be obtained from the
rCGH package as
follows:
suppressPackageStartupMessages(library(rCGH))
suppressPackageStartupMessages(library(GenomicRanges))
# hg38 # data.frame
# Adjust chromosome names
hg38$chrom[hg38$chrom == 23] <- "X"
hg38$chrom[hg38$chrom == 24] <- "Y"
hg38$chrom <- paste0("chr", hg38$chrom)
# Make GRanges object
hg38.UCSC.centromere <- makeGRangesFromDataFrame(hg38, seqnames.field = "chrom", start.field = "centromerStart", end.field = "centromerEnd")
# Assign seqinfo data
seqlengths(hg38.UCSC.centromere) <- hg38$length
genome(hg38.UCSC.centromere) <- "hg38"
# Resulting object
hg38.UCSC.centromere
#> GRanges object with 24 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 121535434-124535434 *
#> [2] chr2 92326171-95326171 *
#> [3] chr3 90504854-93504854 *
#> [4] chr4 49660117-52660117 *
#> [5] chr5 46405641-49405641 *
#> ... ... ... ...
#> [20] chr20 26369569-29369569 *
#> [21] chr21 11288129-14288129 *
#> [22] chr22 13000000-16000000 *
#> [23] chrX 58632012-61632012 *
#> [24] chrY 10104553-13104553 *
#> -------
#> seqinfo: 24 sequences from hg38 genome
The rCGH package also
contains data for the hg19
and hg18
genomes. The hg19
centromere
data is equivalent to the hg19.UCSC.centromere
object provided in our
excluderanges
package.
Source data for the excludable regions
mtx <- read.csv("inst/extdata/table_excluderanges.csv")
knitr::kable(mtx)
Object | Number.of.regions | Assembly | Lab | Number.of.columns | Source |
---|---|---|---|---|---|
ce10.Kundaje.ce10-Excludable.rds | 122 | ce10 | Anshul Kundaje, Stanford | 3 | mitra.stanford.edu/kundaje/akundaje/release/Excludables/ce10-C.elegans |
dm3.Kundaje.dm3-Excludable.rds | 492 | dm3 | Anshul Kundaje, Stanford | 3 | mitra.stanford.edu/kundaje/akundaje/release/Excludables/dm3-D.melanogaster/ |
hg19.Bernstein.Mint_Excludable_hg19.rds | 9035 | hg19 | Bradley Bernstein, Broad | 6 | www.encodeproject.org/files/ENCFF200UUD/ |
hg19.Birney.wgEncodeDacMapabilityConsensusExcludable.rds | 411 | hg19 | Ewan Birney, EBI | 6 | www.encodeproject.org/files/ENCFF001TDO/ |
hg19.Crawford.wgEncodeDukeMapabilityRegionsExcludable.rds | 1649 | hg19 | Gregory Crawford, Duke | 6 | www.encodeproject.org/files/ENCFF001THR/ |
hg19.Wold.hg19mitoExcludable.rds | 295 | hg19 | Barbara Wold, Caltech | 3 | www.encodeproject.org/files/ENCFF055QTV/ |
hg19.Yeo.eCLIP_Excludableregions.hg19.rds | 57 | hg19 | Gene Yeo, UCSD | 6 | www.encodeproject.org/files/ENCFF039QTN/ |
hg38.Bernstein.Mint_Excludable_GRCh38.rds | 12052 | hg38 | Bradley Bernstein, Broad | 6 | www.encodeproject.org/files/ENCFF023CZC/ |
hg38.Kundaje.GRCh38.Excludable.rds | 38 | hg38 | Anshul Kundaje, Stanford | 3 | www.encodeproject.org/files/ENCFF356LFX/ |
hg38.Kundaje.GRCh38_unified_Excludable.rds | 910 | hg38 | Anshul Kundaje, Stanford | 3 | www.encodeproject.org/files/ENCFF419RSJ/ |
hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38.rds | 401 | hg38 | Tim Reddy, Duke | 6 | www.encodeproject.org/files/ENCFF220FIN/ |
hg38.Wold.hg38mitoExcludable.rds | 299 | hg38 | Barbara Wold, Caltech | 3 | www.encodeproject.org/files/ENCFF940NTE/ |
hg38.Yeo.eCLIP_Excludableregions.hg38liftover.bed.fixed.rds | 56 | hg38 | Gene Yeo, UCSD | 6 | www.encodeproject.org/files/ENCFF269URO/ |
mm10.Hardison.Excludable.full.rds | 7865 | mm10 | Ross Hardison, PennState | 3 | www.encodeproject.org/files/ENCFF790DJT/ |
mm10.Hardison.psuExcludable.mm10.rds | 5552 | mm10 | Ross Hardison, PennState | 3 | www.encodeproject.org/files/ENCFF226BDM/ |
mm10.Kundaje.anshul.Excludable.mm10.rds | 3010 | mm10 | Anshul Kundaje, Stanford | 3 | www.encodeproject.org/files/ENCFF999QPV/ |
mm10.Kundaje.mm10.Excludable.rds | 164 | mm10 | Anshul Kundaje, Stanford | 3 | www.encodeproject.org/files/ENCFF547MET/ |
mm10.Wold.mm10mitoExcludable.rds | 123 | mm10 | Barbara Wold, Caltech | 3 | www.encodeproject.org/files/ENCFF759PJK/ |
mm9.Wold.mm9mitoExcludable.rds | 123 | mm9 | Barbara Wold, Caltech | 3 | www.encodeproject.org/files/ENCFF299EZH/ |
mtx <- read.csv("inst/extdata/table_gap.csv")
knitr::kable(mtx)
Citation
Below is the citation output from using citation('excluderanges')
in
R. Please run this yourself to check for any updates on how to cite
excluderanges.
print(citation("excluderanges"), bibtex = TRUE)
#>
#> Dozmorov MG, Davis E, Mu W, Lee S, Triche T, Phanstiel D, Love M
#> (2021). _excluderanges_.
#> https://github.com/mdozmorov/excluderanges/excluderanges - R package
#> version 0.99.3, <URL: https://github.com/mdozmorov/excluderanges>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {excluderanges},
#> author = {Mikhail G. Dozmorov and Eric Davis and Wancen Mu and Stuart Lee and Tim Triche and Douglas Phanstiel and Michael Love},
#> year = {2021},
#> url = {https://github.com/mdozmorov/excluderanges},
#> note = {https://github.com/mdozmorov/excluderanges/excluderanges - R package version 0.99.3},
#> }
Code of Conduct
Please note that the excluderanges
project is released with a
Contributor Code of
Conduct. By
contributing to this project, you agree to abide by its terms.
This package was developed using
biocthis.
References
Amemiya, Haley M, Anshul Kundaje, and Alan P Boyle. 2019. “The ENCODE
Blacklist: Identification of Problematic Regions of the Genome.” Sci
Rep 9 (1): 9354. doi.org/10.1038/s41598-019-45839-z.
Read more here: Source link