How to get the sequence differences between multiple bacterial genomes

How to get the sequence differences between multiple bacterial genomes

1

I am working on some closely related bacterial species (complete genomes from NCBI). I would like to extract the sequence differences between them. To be more specific, I want to find unique sequences (50 -100 nts) in each of the bacterial species (n=6) under my study.
I found many other related posts suggesting tools like Mauve, Mummer, Artemis ACT etc.. and I tried all of them. Mauve gave pretty good results, but when I extracted the sequences in the range as suggested to be different from other genomes under comparison, and performed blast, I got multiple hits to other bacterial species.

Thanks in advance.


genome


alignment

• 518 views

Here’s one solution with unikmer, for limited number of input files.
for loop can be replaced with parallel for saving time.

  1. Generating k-mers.

    for f in *.fa.gz; do
        unikmer count -k 31 -K -s -o $f.k31 $f;
    done
    
  2. Computing k-mers shared by two or more strains. (updated here)

    unikmer common -n 2 *.k31.unik -o shared
    
    # method 2, faster
    unikmer sort  *.k31.unik --repeated -o shared
    
  3. Removing k-mers shared by two or more strainss, left are unique k-mers.

    for f in *.k31.unik; do 
        unikmer diff $f shared.unik -s -o $f.uniq;
    done
    
  4. Retrieving unique sequences from genome with unique k-mers.
    Output format can be in BED (default) or FASTA (-a),
    minimum lengths (-m) are configurable.

    for f in *.fa.gz; do
        unikmer uniqs -g $f $f.k31.unik.uniq.unik -M -m 50 -a -o $f.uniq.fa;
    done
    
  5. Stats of result.

    unikmer stats *.uniq.unik -a 
    file                                            k  canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version  number
    1773-GCA_000706665.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  16,217
    1773-GCA_000738445.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  11,939
    1773-GCA_000738475.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1   7,074
    1773-GCA_000756525.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1   9,679
    1773-GCA_000756545.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  72,867
    1773-GCA_000934585.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  20,135
    
    seqkit stats *.uniq.fa
    file                                format  type  num_seqs  sum_len  min_len  avg_len  max_len
    1773-GCA_000706665.1.fa.gz.uniq.fa  FASTA   DNA        358   28,030       50     78.3    1,005
    1773-GCA_000738445.1.fa.gz.uniq.fa  FASTA   DNA        379   23,088       50     60.9       89
    1773-GCA_000738475.1.fa.gz.uniq.fa  FASTA   DNA        220   13,486       50     61.3       99
    1773-GCA_000756525.1.fa.gz.uniq.fa  FASTA   DNA        239   16,306       50     68.2      354
    1773-GCA_000756545.1.fa.gz.uniq.fa  FASTA   DNA      1,888  129,254       50     68.5    6,865
    1773-GCA_000934585.1.fa.gz.uniq.fa  FASTA   DNA        479  211,054       50    440.6    9,400
    


Login
before adding your answer.

Traffic: 1198 users visited in the last hour

Read more here: Source link