Linearize fasta files

Program versions used:

BBMap – v. 38.32
Seqtk – v. 1.3-r106
Seqkit – v. 0.8.1
Perl – v. 5.16.3
Python – v. 3.6.6
sed – v. 2.2.2

$ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m1.050s
user    0m0.002s
sys     0m1.045s

With BBMap – reformat.sh

$ time reformat.sh -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0)
java -ea -Xmx40g -cp bbmap/current/ jgi.ReformatReads -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0
Executing jgi.ReformatReads [-Xmx40g, in=Homo_sapiens.GRCh38.dna.primary_assembly.fa, fastawrap=0]

No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Input is being processed as unpaired
Input:                          194 reads               3099750718 bases
Output:                         194 reads (100.00%)     3099750718 bases (100.00%)

Time:                           14.052 seconds.
Reads Processed:         194    0.01k reads/sec
Bases Processed:       3099m    220.59m bases/sec

real    0m14.216s
user    0m44.320s
sys     0m4.096s

To provide a reference other program options discussed above (on same hardware):

From @shenwei

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m4.022s
user    0m2.387s
sys     0m1.868s

$ time (seqkit seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m5.135s
user    0m3.530s
sys     0m1.843s

From @Heng Li

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m4.682s
user    0m3.016s
sys     0m1.417s

From @Pierre

$ time (awk '/^>/ {printf("%s%st",(N>0?"n":""),$0);N++;next;} {printf("%s",$0);} END {printf("n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m28.235s
user    0m26.903s
sys     0m1.331s

$ time (LC_ALL=C awk '/^>/ {printf("%s%st",(N>0?"n":""),$0);N++;next;} {printf("%s",$0);} END {printf("n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m27.308s
user    0m26.014s
sys     0m1.288s

$ time (LC_ALL=C awk '/^>/ {printf("%s%st",(N>0?"n":""),$0);N++;next;} {printf("%s",$0);} END {printf("n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa | tr "t" "n"  > /dev/null)


Note: This is done to convert linearized fasta back into proper format.


real    0m33.568s
user    0m33.478s
sys     0m4.654s

From @finswimmer

$  time (awk -v RS=">" -v FS="n" -v ORS="n" -v OFS="" '$0 {$1=">"$1"n"; print}'  Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    1m3.129s
user    0m59.192s
sys     0m3.928s

$  time (sed -e 's/(^>.*$)/#1#/' Homo_sapiens.GRCh38.dna.primary_assembly.fa | tr -d "r" | tr -d "n" | sed -e 's/$/#/' | tr "#" "n" | sed -e '/^$/d' > /dev/null)

real    0m27.813s
user    0m28.834s
sys     0m18.141s

From @Jorge Amigo

$ time (perl -pe 'chomp unless /^>/' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m17.720s
user    0m15.785s
sys     0m1.925s

From @jrj.healey (needs python v.3.6+)

$ time (python3 -c 'import sys;from Bio import SeqIO; [print(f">{r.id}n{r.seq}") for r in SeqIO.parse(sys.argv[1], "fasta")];' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m40.869s
user    0m30.741s
sys     0m9.864s

Read more here: Source link