Split Fasta file and rename output files with contig names

Split Fasta file and rename output files with contig names

2

Hello!

I am trying to split a large fasta file (19,336 lines) into individual contigs. The file set up is as follows:

 >k141_284136 flag=1 multi=3.0000 len=1875
 AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT
 >k141_407195 flag=1 multi=5.0000 len=1723
 GCCAGTAGTTTTCAGATTTTCAATTACTTTCTTTGCTTCTTTTAACGCAGCCGCAAAGTTGTCATCAAGTTCTCCACCCTGTGCAATATGTTTATATAGAATGCTGCTTACTTTGTCAGCAA
 >k141_169332 flag=1 multi=3.0000 len=20
ATTATCCATCCTATTCATCGCTTGATGAAATGTTGCAAAATTCCAAAGATTTTCAGCGTCAAATCGTTCGTATATCCTAATTAAACACCGCTAAAAGTTATGTCTAAGCAATCTTTAA

I am able to split the file but the output files names are meaning less (xaa, xab, xac etc.).

I am trying to split the fasta file so each contig is in an individual fasta file names with the contig name following the >.

For example file one would be titled “k141_284136.fa” and include:

>k141_284136 flag=1 multi=3.0000 len=1875
 AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT

My input file is called vDNA-S1S1.fa! Thank you for any help.


contig


fasta


split

• 702 views

updated 2 hours ago by

0

written 12 months ago by

0

You can use this shell approach:

numseqs=$(grep -c ">" "$1");
numlines=$(wc -l < "$1");
if (( "$numlines" > $(( 2*$numseqs )) )); then
    echo "The fasta file needs to be linearised before this function will work.";
    return 1;
fi;

while read line; do
    if [ "${line:0:1}" == ">" ]; then
        header="$line";
        filename=$(echo "${line#>}" | sed 's/ .*//g');
        touch "$filename".fasta
        echo "$header" >> "${filename}".fasta;
    else
        seq="$line";
        echo "$seq" >> "${filename}".fasta;
    fi;
done < $1

Note the first block is optional, as it just checks that the sequences are linear which yours appear to be.

Put it in a script and just run: bash scriptname.sh contigs.fasta. The new files will be spit out in the same place.

Perl one liner:

$ perl -lane '$id = $1 if (/>(w+)s/); `echo "$_" >> "$id.fa"`' < seq.fa
$ more k141_*
::::::::::::::
k141_169332.fa
::::::::::::::
>k141_169332 flag=1 multi=3.0000 len=20
ATTATCCATCCTATTCATCGCTTGATGAAATGTTGCAAAATTCCAAAGATTTTCAGCGTCAAATCGTTCGTATATCCTAATTAAACACCGCTAAAAGTTATGTCTAAGCAATCTTTAA
::::::::::::::
k141_284136.fa
::::::::::::::
>k141_284136 flag=1 multi=3.0000 len=1875
AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT
::::::::::::::
k141_407195.fa
::::::::::::::
>k141_407195 flag=1 multi=5.0000 len=1723
GCCAGTAGTTTTCAGATTTTCAATTACTTTCTTTGCTTCTTTTAACGCAGCCGCAAAGTTGTCATCAAGTTCTCCACCCTGTGCAATATGTTTATATAGAATGCTGCTTACTTTGTCAGCAA

12 months ago by


JC

12k


Login
before adding your answer.

Traffic: 2303 users visited in the last hour

Read more here: Source link