Given a set of genomic sequences, find potentially enriched genes?

I’m not sure there is any one tool that will do all of this for you. Perhaps some of the following might help.

After downloading genes for mm10, construct a list of windows upstream or centered on TSSs, and overlap them or associate them with your coordinates:

$ wget -qO- \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    > gencode.vM25.genes.bed


$ bedmap --echo --echo-map-id --skip-unmapped windowsAroundMySequences.bed gencode.vM25.genes.bed > answer.bed

How you define windowsAroundMySequences.bed is up to you. You could do something like the following, say, to make strand-specific 5kb proximal promoter windows:

$ awk -v FS="\t" -v OFS="\t" '($6=="+"){ s = ($2 > 5000) ? $2 - 5000 : 0; print $1, s, $2, $4, $5, $6 }($6=="-"){ print $1, $3, $3+5000, $4, $5, $6 }' mySequences.bed | sort-bed - > windowsAroundMySequences.bed
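
If you would rather build windows centered on the TSSs of the genes themselves, a minimal sketch might look like the following. The function name tss_windows and the 2.5 kb default half-width are placeholders of mine, not anything from BEDOPS, and the sketch assumes a tab-delimited BED file with at least six columns and the strand in column 6.

```shell
# tss_windows [HALF_WIDTH] < genes.bed > tss_windows.bed
# Hypothetical helper: emits one window per stranded gene, centered on its TSS.
tss_windows() {
    awk -v FS="\t" -v OFS="\t" -v h="${1:-2500}" '
        {
            # 0-based position of the TSS base:
            # first base of a "+" gene, last base of a "-" gene
            if ($6 == "+")      tss = $2
            else if ($6 == "-") tss = $3 - 1
            else next                      # skip records without a strand

            start = tss - h
            if (start < 0) start = 0       # clamp at the chromosome start

            # half-open BED interval covering h bases on each side of the TSS
            print $1, start, tss + h + 1, $4, $5, $6
        }'
}
```

Since bedmap expects sorted input, you would probably want to pipe the result through sort-bed before using it:

$ tss_windows 2500 < gencode.vM25.genes.bed | sort-bed - > tssWindows.bed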

Depending on your mouse experiment, the Gorkin et al. fetal dataset housed on the epilogos site might be of interest for demarcating enhancers. There is a tabix-based data download available from the top-right corner of the page for doing queries for columns that have locally- or globally-high surprisal values for enhancer chromatin states (columns 5-9 in the score data portion of the query result).

Or perhaps get the database for mm9 at and use liftOver to get mm10 enhancer regions. Once you have those, you can use bedops or bedmap to do overlap or association queries between enhancers and your windows-of-interest.

As to enrichment, you could use all genes as background and count overlap events over a subset of genes of interest and over the background, using a hypergeometric test to calculate the probability of observing at least that many overlaps by chance. You’d need to decide which genes are interesting, however.
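
To make the hypergeometric calculation concrete, here is a rough sketch in plain awk. The function name hypergeom_pvalue and its argument order are my own invention: N is the number of background genes, K the number of genes of interest, n the number of genes your windows overlap, and k the number of overlapped genes that are also genes of interest. It returns the upper-tail probability P(X >= k) and assumes sane non-negative integer counts with K <= N and n <= N.

```shell
# hypergeom_pvalue N K n k
# Hypothetical helper: probability of seeing at least k genes of interest
# among n genes drawn without replacement from N genes, K of which are of interest.
hypergeom_pvalue() {
    awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
        # log of the binomial coefficient C(a, b), from the log-factorial table lf
        function lchoose(a, b) { return lf[a] - lf[b] - lf[a - b] }
        BEGIN {
            # log-factorials keep the arithmetic stable for genome-sized counts
            lf[0] = 0
            for (i = 1; i <= N; i++) lf[i] = lf[i - 1] + log(i)

            hi = (n < K) ? n : K          # X cannot exceed min(n, K)
            p = 0
            for (i = k; i <= hi; i++) {
                if (n - i > N - K) continue   # not enough other genes to fill the draw
                p += exp(lchoose(K, i) + lchoose(N - K, n - i) - lchoose(N, n))
            }
            printf "%.6g\n", p
        }'
}
```

With made-up counts, a call would look like this:

$ hypergeom_pvalue 20000 300 150 12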

Or perhaps you’d synthesize a population of sequences with a similar distribution to what you are starting with, and you would count how many times such random sequences overlap your TSS windows, to measure the probability that your specific sequences overlap their TSSs by chance.
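
A crude sketch of that shuffling idea, again in plain awk, might look like this. It is a simplification under assumptions I am making up for illustration: each of your intervals keeps its length and chromosome and is dropped at a uniformly random position, which ignores things like mappability or GC content that a proper null model might need to match. The function name shuffle_pvalue, the tab-delimited chromosome sizes file (name and length per line), and the iteration and seed defaults are all hypothetical, and the overlap check is a brute-force loop that will be slow on large inputs.

```shell
# shuffle_pvalue QUERIES.bed TSS_WINDOWS.bed CHROM_SIZES.txt [ITERATIONS] [SEED]
# Hypothetical helper: empirical p-value that randomly placed intervals overlap
# the TSS windows at least as often as the observed intervals do.
shuffle_pvalue() {
    awk -v FS="\t" -v R="${4:-1000}" -v seed="${5:-1}" '
        # number of query intervals (starts in s[], lengths in len[]) that
        # overlap at least one window on the same chromosome
        function hits(s,    i, j, c, total) {
            total = 0
            for (i = 1; i <= nq; i++) {
                c = qc[i]
                for (j = 1; j <= nw[c]; j++)
                    if (s[i] < we[c, j] && s[i] + len[i] > ws[c, j]) { total++; break }
            }
            return total
        }
        # first file: query intervals
        FILENAME == ARGV[1] { nq++; qc[nq] = $1; qs[nq] = $2; len[nq] = $3 - $2; next }
        # second file: TSS windows, grouped by chromosome
        FILENAME == ARGV[2] { j = ++nw[$1]; ws[$1, j] = $2; we[$1, j] = $3; next }
        # third file: chromosome sizes
        { size[$1] = $2 }
        END {
            srand(seed)
            observed = hits(qs)
            extreme = 0
            for (r = 1; r <= R; r++) {
                # new random start for every interval, kept inside its chromosome
                for (i = 1; i <= nq; i++)
                    rs[i] = int(rand() * (size[qc[i]] - len[i] + 1))
                if (hits(rs) >= observed) extreme++
            }
            # add-one correction so the estimate is never exactly zero
            printf "%.6g\n", (extreme + 1) / (R + 1)
        }' "$1" "$2" "$3"
}
```

The three input files need distinct names, since the sketch tells them apart by filename, and every chromosome in your intervals needs an entry in the sizes file.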
