How to merge exons based on gene ids?

Since this is still unanswered, I’ll post an R solution until someone is kind enough to answer with the requested biopython solution.

Read the fasta file in.

library("Biostrings")
library("tidyverse")

fasta <- readDNAStringSet("test.fasta")

You’ll end up with a Biostrings object of the fasta file.

> fasta
DNAStringSet object of length 7:
    width seq                                               names               
[1]   175 AGACCATTTAAATTCTAACTACC...GGCGTCGCGGCGCTGGCTGTGGG "gene_id:XLOC_000...
[2]    62 CGCAACCCATGGGAGGGGGAATG...GTGGCATGGGGATGGCTCCTGGT "gene_id:XLOC_000...
[3]   144 GGGTTGTTGCTCCTCGCGTTAGG...AAGAAGTGAATAATGTATCATCC "gene_id:XLOC_000...
[4]    97 GCTGATCAGAAGTTTGCAGATGA...TCGGTAGGTATCTGGTACACGGC "gene_id:XLOC_001...
[5]   108 CTTCGGAAATACCGTGTGCAATA...CCGTCTCCTTCTTTGAGTCATAA "gene_id:XLOC_001...
[6]   183 TTACAAAGTTATGCCATAGTACA...GGCAATGAACCGATGGTGCGCGT "gene_id:XLOC_001...
[7]   147 ACTACACACAAAGTTCGTAGGAT...TTTGGCGGCGAATTGGAGTTCCC "gene_id:XLOC_001...

You can then concatenate the sequences with the same gene id.

new_fasta <- fasta %>%
  {data.frame(seq=as.character(.), name=names(.))} %>%
  mutate(name=str_extract(name, "(?<=")gene_id:[[[:alnum:]]_]+")) %>%
  group_by(name) %>%
  summarize(seq=str_c(seq, collapse="")) %>%
  {set_names(.$seq, .$name)} %>%
  DNAStringSet

And now you should have a Biostrings object of the new sequences, where the sequences for each gene_id were concatenated based on the order they appeared in the fasta file.

> new_fasta
DNAStringSet object of length 2:
    width seq                                               names               
[1]   381 AGACCATTTAAATTCTAACTACC...AAGAAGTGAATAATGTATCATCC gene_id:XLOC_000656
[2]   535 GCTGATCAGAAGTTTGCAGATGA...TTTGGCGGCGAATTGGAGTTCCC gene_id:XLOC_001528

You can then save this back to a fasta.

writeXStringSet(new_fasta, "new.fasta")

Read more here: Source link