I was also interested in this problem and I tried to find a solution in R:
We can take advantage of the
regioneR package and its
createRandomRegions function: it lets you sample regions and you control region size, variation in size, and overlapping output regions or not. The function allows you to specify a genome and also mask any region from which you do not want to sample regions.
The easiest solution would be to input your genes as the background genome from which to sample the regions, but the genome input can only have one region per chromosome. Thus, the final solution is to input the genome and mask all regions which are not genes. For example:
# Create GRanges for the whole genome and the genes in the genome library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(BSgenome.Hsapiens.UCSC.hg38) # Genes, or any BED of regions you like txdb.genes <- genes(txdb) txdb.genes <- reduce(txdb.genes, ignore.strand = T) # Genome genome <- GRanges( seqnames=seqnames(BSgenome.Hsapiens.UCSC.hg38), ranges=IRanges(start=rep(1, length(BSgenome.Hsapiens.UCSC.hg38)), end=seqlengths(BSgenome.Hsapiens.UCSC.hg38)), strand="*") # Mask anything in the genome which are not genes mask <- setdiff(genome, txdb.genes, ignore.strand = T) table(countOverlaps(txdb.genes, mask)) # check no genes are in the mask # Create random regions library(regioneR) regions_in_genes <- createRandomRegions(nregions=1000, length.mean=1000, length.sd=0, genome = genome, mask = mask, non.overlapping = T) table(countOverlaps(regions_in_genes, txdb.genes)) # Check that all your regions overlap with genes table(countOverlaps(regions_in_genes, txdb.genes, type = "within")) # Check that all your regions are WITHIN genes
Read more here: Source link