How To Extract A Sequence From A Big (6Gb) Multifasta File ?

I want to extract some sequences using ID from a multifasta file.

I tried Perl, but it gave an error when indexing the database, possibly because of its size. Is there a way to do this differently? Thanks for your help!


How about samtools faidx? It seems to be less well-known, but I think it should work for you.

If you have these reads in test.fa:

>one
AAAAA
>two
CCCCC
>three
GGGGG

Then just run:

samtools faidx test.fa

This will create an index file that allows sublinear queries for each name lookup (faster than the awk/grep solutions elsewhere in this thread and easier than rolling your own script).

For example:

samtools faidx test.fa two

Will output:

>two
CCCCC

It works for multiple regions (samtools faidx test.fa one two three) and that lets you play tricks like:

xargs samtools faidx test.fa < names.txt
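
If you would rather do the lookups from a script, the pysam package wraps the same .fai index. A minimal sketch, assuming pysam is installed and test.fa is the file above:

import pysam

fa = pysam.FastaFile("test.fa")     # builds test.fa.fai if it does not exist yet
print(fa.fetch("two"))              # CCCCC
print(fa.fetch("three", 0, 3))      # GGG (0-based, end-exclusive coordinates)
fa.close()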

You can use BLAST to extract sequences from a large file; see below.
That being said, I am not sure how well this works if the file is large solely because
it contains a very large number of sequences. I think BLAST's original use case is optimized for retrieving locations from long sequences.

# look at the file
$ head EC4115.fa 
>NC_011353.1 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome.
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTC
TGACAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC

# generate the blast database
$ makeblastdb -dbtype nucl -out EC -in EC4115.fa -parse_seqids

# retrieve an entry by id
$ blastdbcmd -db EC -entry 'NC_011353.1' | head
>lcl|NC_011353.1 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
....

# query the blast database by id and coordinates
$ blastdbcmd -db EC -range 100-105 -entry 'NC_011353.1'
>lcl|NC_011353.1:100-105 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome
TTAAAA
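
If you have a whole list of IDs to pull, blastdbcmd can also read them in one call with -entry_batch. Below is a minimal Python wrapper as a sketch (EC is the database built above; ids.txt is a hypothetical file with one accession per line; subset.fa is just an example output name):

import subprocess

# write all requested entries to subset.fa in a single blastdbcmd call
with open("subset.fa", "w") as out:
    subprocess.run(
        ["blastdbcmd", "-db", "EC", "-entry_batch", "ids.txt"],
        stdout=out,
        check=True,
    )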

I believe this perl script will do what you need (just whipped it up – excuse any typos)

use strict;
use warnings;
use IO::File;

my $inseq = 0;

my $inFh = IO::File->new("myfile.fa") || die "can't open file\n";
while( my $line = $inFh->getline )
{
    chomp($line);

    if($inseq){
        if($line =~ /^>/){
            $inseq = 0;
        } else {
            print $line . "\n";
            next;
        }
    }

    if($line =~ /^>/ && $line =~ /$ARGV[0]/){
        $inseq = 1;
        print $line . "\n";
    }
}
close($inFh);

Run it like:

$ perl myscript.pl SequenceName1234 > output

Here is a bash script to extract multiple sequences from a fasta file.

#!/usr/bin/bash

# extract multiple sequences from a large fasta file

while read p; do
    echo ">$p" >> contig_out.txt
    grep -A 10000 -w "$p" fasta_file.fa | sed -n '1,/>/{/>/!p;}' >> contig_out.txt
done < contig_list.txt

Here contig_list.txt is a list of the sequence IDs of interest (one sequence ID per row) and contig_out.txt will contain the sequence IDs followed by their sequences in fasta format.
Note: increase the grep -A parameter if any sequence spans more than 10000 lines.

3.5 years ago by Kzra
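
For very long sequences the fixed grep -A window above can truncate records, so here is a single-pass Python alternative as a rough sketch. It reuses the same file names (contig_list.txt, fasta_file.fa, contig_out.txt) and assumes each ID matches the first word of its header:

# read the IDs of interest into a set
wanted = set()
with open("contig_list.txt") as ids:
    for line in ids:
        wanted.add(line.strip())

# stream the fasta once, copying only the wanted records
keep = False
with open("fasta_file.fa") as fasta, open("contig_out.txt", "w") as out:
    for line in fasta:
        if line.startswith(">"):
            # keep this record only if the first word of the header is in the list
            keep = line[1:].split()[0] in wanted
        if keep:
            out.write(line)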

A different approach uses Heng Li's awk (invoked below as hawk), like so:

# extract the sequence matching a read id
$ hawk -c fastx ' $name ~ /NC_011353.1/ {print $seq }' < EC4115.fa

This example will read through the entire file upon each invocation.

If you are going to retrieve sequences multiple times, I recommend building an SQL database. I suggest SQLite, as it is a very light yet powerful program: you don't need to run a server, the database lives in a single file, and it has bindings for Python (and probably other languages too).
We have successfully used SQLite with sets of several million sequences (fasta files of several GB). The query time is well below 1 second if you properly index your sequence table.
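
A minimal sketch of that idea with Python's built-in sqlite3 module (the file names seqs.db and multifasta.fa and the single-table layout are made up for this example; the PRIMARY KEY on the id column provides the index, and NC_011353.1 is the example accession used earlier in the thread):

import sqlite3

con = sqlite3.connect("seqs.db")
con.execute("CREATE TABLE IF NOT EXISTS seqs (id TEXT PRIMARY KEY, seq TEXT)")

def load_fasta(path):
    # insert every record, keyed by the first word of its header
    seq_id, chunks = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if seq_id is not None:
                    con.execute("INSERT OR REPLACE INTO seqs VALUES (?, ?)",
                                (seq_id, "".join(chunks)))
                seq_id, chunks = line[1:].split()[0], []
            else:
                chunks.append(line.strip())
    if seq_id is not None:
        con.execute("INSERT OR REPLACE INTO seqs VALUES (?, ?)",
                    (seq_id, "".join(chunks)))
    con.commit()

load_fasta("multifasta.fa")

# indexed lookup by ID
row = con.execute("SELECT seq FROM seqs WHERE id = ?", ("NC_011353.1",)).fetchone()
print(row[0] if row else "not found")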

One of the difficulties is the lack of a standard fasta header format. I assume you are interested in the first element of the header as the ID. Here is my Python proposal (change the IDclean line if you want another part of the header as the ID). I used a dictionary to index the fasta first: it makes this step and the subsequent ID extraction very fast 😉

#!/usr/bin/python

### extracts sequences from a fasta file (arg 1)
### whose id is in the IDs file (arg 2)

import string
import sys
ListOfIds = sys.argv[1]
fastafile = sys.argv[2]

try:
    ids=open(ListOfIds, 'r')
except IOError, e:
    print "File error: ",ListOfIds
    pass


lignes = ids.readlines()
req=[]
for ligne in lignes:
    req.append(ligne.strip())


#### reading the fasta file to cut
handle = open(fastafile)

bank={}
seqIDmap={}
seq_id = handle.next()
while (seq_id[0]!=">"):
    seq_id = handle.next()
while True:
    try:
        seq = handle.next()
        line = handle.next()
        while (line[0]!=">"):
            seq = seq+line
            line = handle.next()
        bank[seq_id]=seq
        IDclean=string.split(seq_id, " ")[0][1:].strip()
        seqIDmap[IDclean]=seq_id
        seq_id = line # for the next
    except StopIteration:
        break
# last line
bank[seq_id]=seq
seqIDmap[string.split(seq_id, " ")[0][1:].strip()]=seq_id

handle.close()

######## end reading the potentially big fasta file

faName=fastafile.split("/")[-1]
listName=ListOfIds.split("/")[-1]
subsetName=listName+"-"+faName
subset = open(subsetName,"w")
nbNF=0
for i in req:
    try:
        subset.write(seqIDmap[i].strip()+"\n")
        subset.write(bank[seqIDmap[i]].strip()+"\n")
    except KeyError:
        print i, "not found in fasta"
        nbNF+=1

subset.close()

print
print nbNF, "IDs (listed above) from",listName, "have not been found in", faName
print
print "the Subset fasta file", subsetName, "is now created"

I ran into this problem and developed a small Python library to solve it. Like the other answers it uses an index, but it is pure Python. Because it sorts the fasta file first, extraction is extremely fast, even faster than samtools faidx, although the indexing itself takes longer.

As a rule of thumb, an index becomes worthwhile if you need to extract sequences more than once.
Here is the link; if you have any questions feel free to ask.

gitlab.com/lannes/large_fasta_handler/tree/master

I will use some Perl code; I'm assuming that your fasta file has the ID as the only text in the header line.

#!/usr/bin/perl

# extractSeq.pl -> extract fasta sequences from a list of IDs
use strict;
use warnings;
$ARGV[2] or die "use: perl extractSeq.pl LIST FASTA OUTPUT\n";
my ($list, $fasta, $out) = @ARGV;
my %list;
open L, "$list" or die;
while (<L>) {
    chomp;
    $list{$_} = 1;
}
close L;

open O, ">$out" or die;
open F, "$fasta" or die;
$/ = "\n>";
while (<F>) {
    s/>//g;
    my @seq = split (/\n/, $_);
    my $id = shift @seq;
    next unless (defined $list{$id});
    print O join("\n", ">$id", @seq), "\n";
}
close O;
close F;


9.2 years ago by JC

Here is a Perl script to extract sequences by their IDs (assuming that you have one sequence identifier per line in the file “ids.txt”) from a multi-fasta file:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.txt multifasta.file

Another way of doing the same:

$ seqtk subseq multifasta.file ids.txt

