I was also interested in this problem and I tried to find a solution in R:
We can take advantage of the regioneR
package and its createRandomRegions
function: it lets you sample regions and you control region size, variation in size, and overlapping output regions or not. The function allows you to specify a genome and also mask any region from which you do not want to sample regions.
The easiest solution would be to input your genes as the background genome from which to sample the regions, but the genome input can only have one region per chromosome. Thus, the final solution is to input the genome and mask all regions which are not genes. For example:
# Create GRanges for the whole genome and the genes in the genome
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
# Genes, or any BED of regions you like
txdb.genes <- genes(txdb)
txdb.genes <- reduce(txdb.genes, ignore.strand = T)
# Genome
genome <- GRanges(
seqnames=seqnames(BSgenome.Hsapiens.UCSC.hg38),
ranges=IRanges(start=rep(1, length(BSgenome.Hsapiens.UCSC.hg38)), end=seqlengths(BSgenome.Hsapiens.UCSC.hg38)),
strand="*")
# Mask anything in the genome which are not genes
mask <- setdiff(genome, txdb.genes, ignore.strand = T)
table(countOverlaps(txdb.genes, mask)) # check no genes are in the mask
# Create random regions
library(regioneR)
regions_in_genes <- createRandomRegions(nregions=1000, length.mean=1000, length.sd=0, genome = genome, mask = mask, non.overlapping = T)
table(countOverlaps(regions_in_genes, txdb.genes)) # Check that all your regions overlap with genes
table(countOverlaps(regions_in_genes, txdb.genes, type = "within")) # Check that all your regions are WITHIN genes
Read more here: Source link