Dealing with P value inflation of SNPs identified by qqplot

Dealing with P value inflation of SNPs identified by qqplot


I’m attempting to call SNPs that changed in allele frequency in a population. The population was sequenced at two times (before and after a drought) using pooled-sequencing of the whole genome. Fisher exact tests identified 7 million SNPs, which was comparable to a past study in this species which found 6 million SNPs across the whole genome. When I generate a Manhattan plot using the P values from fishers exact test and a Bonferroni corrected P value for 7M SNPs, about 4k SNPs are significant. This amount seems reasonable considering the population adapted to strong drought in between samplings and that I had several-hundred times coverage of the genome, which was greater than the previous study that found 6M SNPs. However, when I make a qqplot of these P values versus expected P values, most of the observed P values are well above the expected 1:1 line (image below). I calculated the genomic inflation factor lambda to be 2.06 by dividing the median observed chi-squared statistic by the expected median. One of the ways I’ve came across to correct for this inflation is to divide all observed test statistics by lambda and use the resulting “genomic corrected” P values. However, using this method seems overly conservative and results in only 23 SNPs reaching statistical significance.

qqplot image

I’m interested in any thoughts or suggestions for dealing with the issue of P value inflation. On the one hand, I could probably just filter the 4k P values for those in exons and have a reasonable number to proceed. On the other hand, the qqplot indicates something else is going on that I should probably control for, like population structure. I’m aware of a few methods that can control for population structure in GWAS, but my data is pooled by population and so my understanding is those methods wouldn’t apply with out individual-level data. Is there way I can easily correct for P value inflation? Or, do I need to analyze the data with a different approach/toolkit that can take population structure into account?









Read more here: Source link