My lab has some RNA-seq data from cyanobacteria, and I have been asked to look for motifs within the promoters of de novo gene clusters. The goal is to identify potential regulatory sequences that we could then use in DNA-protein affinity chromatography to identify potential regulatory proteins.
I was able to create clusters, extract promoter regions, and run GimmeMotifs without trouble, but now I am at an impasse: hundreds of motifs are identified at each of my cluster depths. Making the problem more difficult, I have struggled to find sufficient documentation online for GimmeMotifs' output statistics.
From my basic research, it seems that wholesale computational de novo motif scanning is generally frowned upon, but this approach was suggested by a collaborator who found a motif in a manually curated cluster.
My questions to you all:
Does anyone know what the statistics mean, or how I should threshold them to retain reliable motifs? (The statistics are listed below.)
Is this methodology misguided / is there a better way to do this?
Values:
Motif, best_match, best_match_pvalue, enr_at_fpr, fraction_fpr, ks_pvalue, ks_significance, max_enrichment, max_fmeasure, mncp, num_cluster, phyper_at_fpr, pr_auc, recall_at_fdr, roc_auc, roc_auc_xlim, score_at_fpr, stars
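To make the thresholding question concrete, here is a minimal sketch of the kind of filter I have in mind. It assumes the statistics are in a tab-separated table with the column names above, possibly preceded by comment lines starting with `#`. The function name `filter_motifs`, the two cutoffs, and the example rows are all hypothetical placeholders, not recommended values or real results.

```python
import csv
import io

# Hypothetical cutoffs: placeholders only, not recommended values.
MIN_ROC_AUC = 0.7
MIN_ENRICHMENT = 2.0


def filter_motifs(handle, min_roc_auc=MIN_ROC_AUC, min_enrichment=MIN_ENRICHMENT):
    """Return names of motifs whose roc_auc and enr_at_fpr pass both cutoffs."""
    # Assumption: any comment lines in the table start with '#'.
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    kept = []
    for row in reader:
        passes_auc = float(row["roc_auc"]) >= min_roc_auc
        passes_enrichment = float(row["enr_at_fpr"]) >= min_enrichment
        if passes_auc and passes_enrichment:
            kept.append(row["Motif"])
    return kept


# Made-up rows for illustration; a real table would have all the columns listed above.
example = io.StringIO(
    "Motif\troc_auc\tenr_at_fpr\n"
    "motif_1\t0.85\t4.2\n"
    "motif_2\t0.55\t1.1\n"
    "motif_3\t0.72\t1.5\n"
)
print(filter_motifs(example))  # ['motif_1']
```

What I do not know is which of these columns are the right ones to filter on, or what sensible cutoffs would be.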