Split Fasta file and rename output files with contig names
Hello!
I am trying to split a large fasta file (19,336 lines) into individual contigs. The file set up is as follows:
>k141_284136 flag=1 multi=3.0000 len=1875
AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT
>k141_407195 flag=1 multi=5.0000 len=1723
GCCAGTAGTTTTCAGATTTTCAATTACTTTCTTTGCTTCTTTTAACGCAGCCGCAAAGTTGTCATCAAGTTCTCCACCCTGTGCAATATGTTTATATAGAATGCTGCTTACTTTGTCAGCAA
>k141_169332 flag=1 multi=3.0000 len=20
ATTATCCATCCTATTCATCGCTTGATGAAATGTTGCAAAATTCCAAAGATTTTCAGCGTCAAATCGTTCGTATATCCTAATTAAACACCGCTAAAAGTTATGTCTAAGCAATCTTTAA
I am able to split the file but the output files names are meaning less (xaa, xab, xac etc.).
I am trying to split the fasta file so each contig is in an individual fasta file names with the contig name following the >.
For example file one would be titled “k141_284136.fa” and include:
>k141_284136 flag=1 multi=3.0000 len=1875
AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT
My input file is called vDNA-S1S1.fa! Thank you for any help.
• 702 views
You can use this shell approach:
numseqs=$(grep -c ">" "$1");
numlines=$(wc -l < "$1");
if (( "$numlines" > $(( 2*$numseqs )) )); then
echo "The fasta file needs to be linearised before this function will work.";
return 1;
fi;
while read line; do
if [ "${line:0:1}" == ">" ]; then
header="$line";
filename=$(echo "${line#>}" | sed 's/ .*//g');
touch "$filename".fasta
echo "$header" >> "${filename}".fasta;
else
seq="$line";
echo "$seq" >> "${filename}".fasta;
fi;
done < $1
Note the first block is optional, as it just checks that the sequences are linear which yours appear to be.
Put it in a script and just run: bash scriptname.sh contigs.fasta
. The new files will be spit out in the same place.
Perl one liner:
$ perl -lane '$id = $1 if (/>(w+)s/); `echo "$_" >> "$id.fa"`' < seq.fa
$ more k141_*
::::::::::::::
k141_169332.fa
::::::::::::::
>k141_169332 flag=1 multi=3.0000 len=20
ATTATCCATCCTATTCATCGCTTGATGAAATGTTGCAAAATTCCAAAGATTTTCAGCGTCAAATCGTTCGTATATCCTAATTAAACACCGCTAAAAGTTATGTCTAAGCAATCTTTAA
::::::::::::::
k141_284136.fa
::::::::::::::
>k141_284136 flag=1 multi=3.0000 len=1875
AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT
::::::::::::::
k141_407195.fa
::::::::::::::
>k141_407195 flag=1 multi=5.0000 len=1723
GCCAGTAGTTTTCAGATTTTCAATTACTTTCTTTGCTTCTTTTAACGCAGCCGCAAAGTTGTCATCAAGTTCTCCACCCTGTGCAATATGTTTATATAGAATGCTGCTTACTTTGTCAGCAA
Traffic: 2303 users visited in the last hour
Read more here: Source link