Difference in Bismark output methylation call files and coverage files
Hi biostars!
I am working with RRBS data and have used Bismark for methylation calls.
I have compared the two output files with cytosines in a CpG-context and the coverage files (which only consider cytosines in a CpG context), but I can´t get my head around whether they should have the similar number of cytosines. For my data, there are 10 times more cytosines in the CpG-context files (giving me info if the cytosine is methylated or not) than in the coverage files. I guess the coverage files do not report cytosines that are not covered by any reads, but I still think it is a little weird that the difference between the number of cytosine is that great. This is probably due to some lack of understanding on my part, so I really do appreciate any attempt at explaining what is going on here for me
Thanks in advance
Best,
Line
• 810 views
It seems you are comparing the “genome-wide cytosine report” files and the “coverage” files outputted by Bismark (you can probably recognize them because the cytosine report files usually contain a “CpG_report” string and the coverage files usually contain “cov” in their names).
By “number” of cytosines I suppose you’re counting the lines in each file. It is as you say, the cytosine report contains the coordinates for all of the genomic cytosines (moreover, it provides strand-specific information so a typical human RRBS should have around 58 million lines). This is not the case for the coverage file so your differences are not unexpected. For example, I’m currently handling RRBS data for which I can have ~5-7 M lines in the coverage file and 58 M in the cytosine report for a certain sample. In fact, all of the cytosine report files have the same number of lines!
Keep in mind that RRBS in a targeted sequencing which looks at a fraction of the total genomic CpGs, so do not expect to have coverage at the majority of genomic CpGs. I usually see RRBS data with 2-4 million CpGs detected with >10 coverage when sequencing 30-50M reads per sample.
If you do a quick check removing the lines with 0 counts for cytosines and thymines (say using awk '($4 != 0 || $5 != 0) {print $0}'
) you will see that you lose a lot of lines in the cytosine report.
Traffic: 2229 users visited in the last hour
Read more here: Source link