Program versions used:
BBMap – v. 38.32
Seqtk – v. 1.3-r106
Seqkit – v. 0.8.1
Perl – v. 5.16.3
Python – v. 3.6.6
sed – v. 2.2.2
$ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 0m1.050s
user 0m0.002s
sys 0m1.045s
With BBMap – reformat.sh
$ time reformat.sh -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0)
java -ea -Xmx40g -cp bbmap/current/ jgi.ReformatReads -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0
Executing jgi.ReformatReads [-Xmx40g, in=Homo_sapiens.GRCh38.dna.primary_assembly.fa, fastawrap=0]
No output stream specified. To write to stdout, please specify 'out=stdout.fq' or similar.
Input is being processed as unpaired
Input: 194 reads 3099750718 bases
Output: 194 reads (100.00%) 3099750718 bases (100.00%)
Time: 14.052 seconds.
Reads Processed: 194 0.01k reads/sec
Bases Processed: 3099m 220.59m bases/sec
real 0m14.216s
user 0m44.320s
sys 0m4.096s
To provide a reference other program options discussed above (on same hardware):
From @shenwei
$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 0m4.022s
user 0m2.387s
sys 0m1.868s
$ time (seqkit seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 0m5.135s
user 0m3.530s
sys 0m1.843s
From @Heng Li
$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 0m4.682s
user 0m3.016s
sys 0m1.417s
From @Pierre
$ time (awk '/^>/ {printf("%s%st",(N>0?"n":""),$0);N++;next;} {printf("%s",$0);} END {printf("n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 0m28.235s
user 0m26.903s
sys 0m1.331s
$ time (LC_ALL=C awk '/^>/ {printf("%s%st",(N>0?"n":""),$0);N++;next;} {printf("%s",$0);} END {printf("n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 0m27.308s
user 0m26.014s
sys 0m1.288s
$ time (LC_ALL=C awk '/^>/ {printf("%s%st",(N>0?"n":""),$0);N++;next;} {printf("%s",$0);} END {printf("n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa | tr "t" "n" > /dev/null)
Note: This is done to convert linearized fasta back into proper format.
real 0m33.568s
user 0m33.478s
sys 0m4.654s
From @finswimmer
$ time (awk -v RS=">" -v FS="n" -v ORS="n" -v OFS="" '$0 {$1=">"$1"n"; print}' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 1m3.129s
user 0m59.192s
sys 0m3.928s
$ time (sed -e 's/(^>.*$)/#1#/' Homo_sapiens.GRCh38.dna.primary_assembly.fa | tr -d "r" | tr -d "n" | sed -e 's/$/#/' | tr "#" "n" | sed -e '/^$/d' > /dev/null)
real 0m27.813s
user 0m28.834s
sys 0m18.141s
From @Jorge Amigo
$ time (perl -pe 'chomp unless /^>/' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 0m17.720s
user 0m15.785s
sys 0m1.925s
From @jrj.healey (needs python v.3.6+)
$ time (python3 -c 'import sys;from Bio import SeqIO; [print(f">{r.id}n{r.seq}") for r in SeqIO.parse(sys.argv[1], "fasta")];' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real 0m40.869s
user 0m30.741s
sys 0m9.864s
Read more here: Source link