Since this is still unanswered, I’ll post an R solution until someone is kind enough to answer with the requested biopython solution.
Read the fasta file in.
library("Biostrings")
library("tidyverse")
fasta <- readDNAStringSet("test.fasta")
You’ll end up with a Biostrings object of the fasta file.
> fasta
DNAStringSet object of length 7:
width seq names
[1] 175 AGACCATTTAAATTCTAACTACC...GGCGTCGCGGCGCTGGCTGTGGG "gene_id:XLOC_000...
[2] 62 CGCAACCCATGGGAGGGGGAATG...GTGGCATGGGGATGGCTCCTGGT "gene_id:XLOC_000...
[3] 144 GGGTTGTTGCTCCTCGCGTTAGG...AAGAAGTGAATAATGTATCATCC "gene_id:XLOC_000...
[4] 97 GCTGATCAGAAGTTTGCAGATGA...TCGGTAGGTATCTGGTACACGGC "gene_id:XLOC_001...
[5] 108 CTTCGGAAATACCGTGTGCAATA...CCGTCTCCTTCTTTGAGTCATAA "gene_id:XLOC_001...
[6] 183 TTACAAAGTTATGCCATAGTACA...GGCAATGAACCGATGGTGCGCGT "gene_id:XLOC_001...
[7] 147 ACTACACACAAAGTTCGTAGGAT...TTTGGCGGCGAATTGGAGTTCCC "gene_id:XLOC_001...
You can then concatenate the sequences with the same gene id.
new_fasta <- fasta %>%
{data.frame(seq=as.character(.), name=names(.))} %>%
mutate(name=str_extract(name, "(?<=")gene_id:[[[:alnum:]]_]+")) %>%
group_by(name) %>%
summarize(seq=str_c(seq, collapse="")) %>%
{set_names(.$seq, .$name)} %>%
DNAStringSet
And now you should have a Biostrings object of the new sequences, where the sequences for each gene_id were concatenated based on the order they appeared in the fasta file.
> new_fasta
DNAStringSet object of length 2:
width seq names
[1] 381 AGACCATTTAAATTCTAACTACC...AAGAAGTGAATAATGTATCATCC gene_id:XLOC_000656
[2] 535 GCTGATCAGAAGTTTGCAGATGA...TTTGGCGGCGAATTGGAGTTCCC gene_id:XLOC_001528
You can then save this back to a fasta.
writeXStringSet(new_fasta, "new.fasta")
Read more here: Source link