Finding 16 mer not present in GRCh38

Thanks for the question – it has kept me busy this Sunday morning / afternoon. As implied by others, this poses a computational challenge but is not insurmountable.

For motif searching generally, I usually use AWK. My approach here was to:

  1. generate all possible k-mers of the chosen size (run once only)
  2. separately, tabulate k-mers of the chosen size in the hg38 sequence
  3. get the difference between #1 and #2

I should also note that I have only done this for 12-mers and for just chr22. For 16-mers, I’d likely have to use a cluster environment. Over the entire genome, I would continue to break the analysis up by chromosome.

# generate all possible 12-mer
# 'expected' must be set as maximum possible (for 12-mer, expected = 4^12)
awk -v atgc=ATGC -v expected=16777216 -v range=4 'BEGIN{
  srand(565447);
  do {
    seq = substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)
    if (!(seq in motifs)) {
      print seq
      motifs[seq] = 1
      count++
    }
  } while (count < expected)
}' > all.12mer.out

NB – each random base is produced by substr(atgc,int(1+rand()*range),1). In this case, for a 12-mer, there are 12 of these assigned to seq.



For 16-mers, you need 16:

    awk -v atgc=ATGC -v expected=4294967296 -v range=4 'BEGIN{
      srand(565447);
      do {
        seq = substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)
        if (!(seq in motifs)) {
          print seq
          motifs[seq] = 1
          count++
        }
      } while (count < expected)
    }' > all.16mer.out

all.12mer.out looks like:

head all.12mer.out 
CACCCTGATCCT
ACAATGTTCGTT
GTGACGGTCGCT
TTAGGCACCGAT
GCTCATATTTAT
GCGGGTTAGATG
GTGGTTGCGCAA
GATCTAAAGTGG

There are 16,777,216 possible 12-mers. This step takes a while (20 minutes) and could be improved in efficiency. The output file is ~220MB. The good thing is that this step only needs to be run once only.

In this step, the entire target sequence is first flattened and converted to upper case. Then, each found 12-mer is added to an indexed AWK array, with the entry’s value incremented when the same 12-mer is found again. It moves in a sliding window of 1bp:

 ATGCATGCATGCATGCATGCATGC
 ATGCATGCAT
 |_________|
   TGCATGCATG
   |_________|
     GCATGCATGC
     |_________|

samtools faidx hg38.fasta chr22 | 
awk '{if (NR>1) printf toupper($0)}' | 
awk '{
  for (i=1; i<=NF; i+=1)
    if (i+11<=NF) {
      m12[$(i)$(i+1)$(i+2)$(i+3)$(i+4)$(i+5)$(i+6)$(i+7)$(i+8)$(i+9)$(i+10)$(i+11)]+=1
    }
  } END {
    for (motif in m12)
      print motif","m12[motif]
  }' FS='' > hg38.12mer.out

hg38.12mer.out contains each identified 12-mer and its frequency

head hg38.12mer.out
GCTCTCCCGCAG,2
TCGCTATATTGT,1
GAAACACGCCCT,2
AACCCTGGGAGT,2
GGCAGTCAGGAT,7
TCCCACTCGGCA,2
GTGAATTCTTCT,5
TCCCACTCGGCC,5


For 16-mer, you would need to modify to:

samtools faidx hg38.fasta chr22 | 
awk '{if (NR>1) printf toupper($0)}' | 
awk '{
  for (i=1; i<=NF; i+=1)
    if (i+15<=NF) {
      m16[$(i)$(i+1)$(i+2)$(i+3)$(i+4)$(i+5)$(i+6)$(i+7)$(i+8)$(i+9)$(i+10)$(i+11)$(i+12)$(i+13)$(i+14)$(i+15)]+=1
    }
  } END {
    for (motif in m16)
      print motif","m16[motif]
  }' FS='' > hg38.16mer.out
# identify difference between all possible 12-mers and those found in the target sequence
awk 'NR==FNR {hgmers[$1]; next} !($0 in hgmers)' FS="," hg38.12mer.out all.12mer.out

With that, there are, apparently, 7,228,791 12-mers that are not present in chr22 in hg38:

awk 'NR==FNR {hgmers[$1]; next} !($0 in hgmers)' FS="," hg38.12mer.out all.12mer.out | head
ACAATGTTCGTT
GTGACGGTCGCT
GCGGGTTAGATG
GTGGTTGCGCAA
GCACGACATTGA
GCACGAGTCAAG
GTCCGCGGGATT
GCCGTGTAAAAG
CGCTTGTCACAC
GGTACCGTTTGA

I originally checked for 8-mers but all 8-mers are accounted for in chr22 in hg38.

If I manage to do 16-mer, I’ll post the solution…

Kevin

Read more here: Source link