I am using Homer to identify peaks in RNA-seq data and then test for differential expression by counting reads per peak. Homer has a lovely script that does just this: getDifferentialPeaksReplicates.pl. The issue is that, for some reason, Homer returns the same peak multiple times in its final output. (Bonus question: how does Homer produce different statistics for the same peak?) Here is the command I am using:
getDifferentialPeaksReplicates.pl \
    -genome mm10 \
    -style factor \
    -size 25 \
    -minDist 1 \
    -fdr 0.05 \
    -P 0.1 \
    -all \
    -t \
    ${dir}CS1/CS1_R1_TagDirectory \
    ${dir}CS2/CS2_R1_TagDirectory \
    ${dir}CS5/CS5_R1_TagDirectory \
    ${dir}CS6/CS6_R1_TagDirectory \
    -b \
    ${dir}CS3/CS3_R1_TagDirectory \
    ${dir}CS4/CS4_R1_TagDirectory \
    ${dir}CS7/CS7_R1_TagDirectory \
    ${dir}CS8/CS8_R1_TagDirectory \
    -i \
    ${dir}CS9/CS9_R1_TagDirectory \
    ${dir}CS10/CS10_R1_TagDirectory \
    > ${dir}/CS_SampleTagDirectories/difPeaks.txt
Here is an example of the output:
#cmd=getDifferentialPeaksReplicates.pl -genome mm10 -style factor -size 25 -minDist 1-FDR 0.05 -P 0.1 -all -t /media/sf_UbuntuSharing/2104UNHX-0846/CS3/CS3_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS4/CS4_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS7/CS7_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS8/CS8_R1_TagDirectory -b /media/sf_UbuntuSharing/2104UNHX-0846/CS1/CS1_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS2/CS2_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS5/CS5_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS6/CS6_R1_TagDirectory -i /media/sf_UbuntuSharing/2104UNHX-0846/CS9/CS9_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS10/CS10_R1_TagDirectory|PeakID (cmd=annotatePeaks.pl 0.943238134802417.peaks mm10 -d /media/sf_UbuntuSharing/2104UNHX-0846/CS1/CS1_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS2/CS2_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS5/CS5_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS6/CS6_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS3/CS3_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS4/CS4_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS7/CS7_R1_TagDirectory /media/sf_UbuntuSharing/2104UNHX-0846/CS8/CS8_R1_TagDirectory -raw) (cmd=getDiffExpression.pl 0.943238134802417.raw.txt bg bg bg bg target target target target -norm2total -DESeq2 -fdr 0.05 -log2fold 1 -export 0.943238134802417) Chr Start End Strand Peak Score Focus Ratio/Region Size Annotation Detailed Annotation Distance to TSS Nearest PromoterID Entrez ID Nearest Unigene Nearest Refseq Nearest Ensembl Gene Name
chr9-1263 chr9 108945410 108945434 + 126.8 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-5451 chr9 108945410 108945434 + 102.8 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-1527 chr9 108945410 108945434 + 126.8 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-2841 chr9 108945410 108945434 + 123.5 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-2268 chr9 108945410 108945434 + 125.7 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-1475 chr9 108945410 108945434 + 122.5 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-1476 chr9 108945410 108945434 + 121.4 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-3571 chr9 108945410 108945434 + 118.1 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-7461 chr9 108945410 108945434 + 83.1 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-2842 chr9 108945410 108945434 + 120.3 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-1370 chr9 108945410 108945434 + 123.5 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-4379 chr9 108945410 108945434 + 112.6 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-1325 chr9 108945410 108945434 + 126.8 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-1369 chr9 108945410 108945434 + 120.3 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-1791 chr9 108945410 108945434 + 125.7 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr9-6481 chr9 108945410 108945434 + 101.7 0.988 exon (NM_025407, exon 6 of 13) exon (NM_025407, exon 6 of 13) -8164 NM_007738 12836 Mm.6200 NM_007738 ENSMUSG00000025650 Col7a1
chr5-1382 chr5 125387930 125387954 + 100.6 0.706 exon (NM_019639, exon 2 of 2) exon (NM_019639, exon 2 of 2) 2075 NM_019639 22190 Mm.331 NM_019639 ENSMUSG00000008348 Ubc
chr5-1298 chr5 125387930 125387954 + 107.1 0.734 exon (NM_019639, exon 2 of 2) exon (NM_019639, exon 2 of 2) 2075 NM_019639 22190 Mm.331 NM_019639 ENSMUSG00000008348 Ubc
chr5-1225 chr5 125387930 125387954 + 103.9 0.715 exon (NM_019639, exon 2 of 2) exon (NM_019639, exon 2 of 2) 2075 NM_019639 22190 Mm.331 NM_019639 ENSMUSG00000008348 Ubc
chr5-1226 chr5 125387930 125387954 + 107.1 0.718 exon (NM_019639, exon 2 of 2) exon (NM_019639, exon 2 of 2) 2075 NM_019639 22190 Mm.331 NM_019639 ENSMUSG00000008348 Ubc
chr5-39105 chr5 121286308 121286332 + 103.9 0.988 exon (NM_181421, exon 12 of 76) exon (NM_181421, exon 12 of 76) 66101 NM_181421 269700 Mm.184589 NM_181421 ENSMUSG00000042744 Hectd4
chr3-1127 chr3 88747924 88747948 + 91.8 0.79 exon (NM_018804, exon 4 of 4) exon (NM_018804, exon 4 of 4) 24663 NM_018804 229521 Mm.379376 NM_018804 ENSMUSG00000068923 Syt11
chr3-1100 chr3 88747924 88747948 + 94 0.792 exon (NM_018804, exon 4 of 4) exon (NM_018804, exon 4 of 4) 24663 NM_018804 229521 Mm.379376 NM_018804 ENSMUSG00000068923 Syt11
As you can see, some of the peaks are listed multiple times with slightly different statistics. This is problematic because DESeq2 then performs more tests than it should, which makes the multiple-testing correction harsher and worsens the adjusted p-values.
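To put a number on the duplication, something like the following could be run on the output file. It is only a sketch: the function name count_duplicate_peaks is made up for illustration, and it assumes the file is tab-delimited with a single header line and with Chr, Start, End and Strand in columns 2 to 5, as in the excerpt above.

```shell
# Hypothetical helper: list every coordinate (chr, start, end, strand)
# that appears on more than one row, with the number of rows it appears on.
# Assumes a tab-delimited table with one header line and the coordinates
# in columns 2-5 (column 1 is the PeakID).
count_duplicate_peaks() {
    awk -F'\t' '
        NR == 1 { next }                    # skip the header line
        { n[$2 FS $3 FS $4 FS $5]++ }       # count rows per coordinate key
        END {
            for (k in n)
                if (n[k] > 1) print n[k] "\t" k
        }
    ' "$1" | LC_ALL=C sort -k1,1nr          # most-duplicated coordinates first
}
```

Calling count_duplicate_peaks on difPeaks.txt prints one line per duplicated interval: the row count followed by the coordinates. An empty result would mean no interval is repeated.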
How do you prevent Homer from listing the same peak twice?
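For reference, a post hoc workaround (not a way to stop Homer from emitting the duplicates) would be to collapse rows that share the same coordinates. The sketch below keeps the row with the highest Peak Score per interval; the function name collapse_duplicate_peaks is hypothetical, and it assumes a tab-delimited file with one header line, coordinates in columns 2 to 5 and Peak Score in column 6. It only tidies the table: any adjusted p-values already in the file were computed over the full set of rows and are not changed by this.

```shell
# Hypothetical helper: keep one row per (chr, start, end, strand),
# choosing the row with the highest Peak Score (column 6).
# Assumes a tab-delimited table with one header line.
collapse_duplicate_peaks() {
    awk -F'\t' '
        NR == 1 { print; next }             # pass the header through unchanged
        {
            key = $2 FS $3 FS $4 FS $5
            if (!(key in best)) {
                order[++m] = key            # remember first-seen order of intervals
                best[key] = $0
                score[key] = $6 + 0
            } else if ($6 + 0 > score[key]) {
                best[key] = $0              # a higher-scoring row replaces the kept one
                score[key] = $6 + 0
            }
        }
        END { for (i = 1; i <= m; i++) print best[order[i]] }
    ' "$1"
}
```

The output goes to standard output, so it can be redirected to a new file and the original table is left untouched. Ties keep the first row encountered.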