How to automat the calculation of alignment scores (BLOSUMxx matrix based) between one short peptide sequence and 10000 other peptide sequences (using biopython)

Hi everyone and thanks in advance for any of your help.

I am running into an issue recently when trying to calculate the scores of a high number of short peptide alignments (10000).

I have to calculate all the alignement scores (calculation based on the BLOSUM62 matrix, but I could also want to use other BLOSUM matrix) of a set of 10000 alignments.

Those alignments are between one “reference” sequence (that does not vary) and 10000 other (almost random) sequences. Those sequences are all very short (9 to 20aa — my example bellow is with 9aa).

First, I already generated the list of 10000 (almost random) peptide sequences in an .xls (or .cvs) table. No issue for that.

Then, I found this biopython script to calculate the score of one alignement (here i put one example with 2 of my peptides): (the script come from this forum)

from Bio.SubsMat.MatrixInfo import blosum62 as blosum

blosum.update(((b,a),val) for (a,b),val in list(blosum.items()))

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
for A,B in zip(seq1, seq2):
    diag = ('-'==A) or ('-'==B)
    yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
    gap = diag

seq1 = 'ALNSSGTLT'
seq2 = 'AVNDSG-LT'

print(sum(score_pairwise(seq1, seq2, blosum, -3, -1)))

—> this give me the correct score of 27. All perfect.

The issue I am facing now is that although this script is perfect for me to calculate the score of the 2 sequences aligned (seq1 and seq2), I am struggling to find a neat way to apply this Python script on 10000 alignements.

As I said, seq1 does not vary so nothing to change here. Now, i wanted python to retrieve the 10000 other seq2 sequences from my xls/cvs sheet into a list so that the variable seq2 could take the various string values from that list. For that I did:

import pandas as pd
all_seq2_list=pd.read_csv('10000seq2.csv')
all_seq2_list.values.T[0].tolist()

Which return me the list [‘randompeptide1’, ‘randompeptide2’,…, ‘randompeptide10000’]

So far so good.

I have a vague idea that I should find a way to make a loop where the seq2 variable changes each time, assigning it the new string value (peptide sequence) corresponding to the next sequence in my all_seq2_list but I don’t have enough technical knowledge in python to understand how to do that.

Bottom line is:
I am struggling with the construction of the loop and how to (for each iteration) give the next value from my all_seq2_list to my seq2 variable; and then calculate the 10000 resulting scores
I think it is probably not very difficult for someone who use python every day but as a newbie I am stuck.

Can anyone help me ?
Thank you!

Read more here: Source link