Hi everyone and thanks in advance for any of your help.
I am running into an issue recently when trying to calculate the scores of a high number of short peptide alignments (10000).
I have to calculate all the alignement scores (calculation based on the BLOSUM62 matrix, but I could also want to use other BLOSUM matrix) of a set of 10000 alignments.
Those alignments are between one “reference” sequence (that does not vary) and 10000 other (almost random) sequences. Those sequences are all very short (9 to 20aa — my example bellow is with 9aa).
First, I already generated the list of 10000 (almost random) peptide sequences in an .xls (or .cvs) table. No issue for that.
Then, I found this biopython script to calculate the score of one alignement (here i put one example with 2 of my peptides): (the script come from this forum)
from Bio.SubsMat.MatrixInfo import blosum62 as blosum
blosum.update(((b,a),val) for (a,b),val in list(blosum.items()))
def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
for A,B in zip(seq1, seq2):
diag = ('-'==A) or ('-'==B)
yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
gap = diag
seq1 = 'ALNSSGTLT'
seq2 = 'AVNDSG-LT'
print(sum(score_pairwise(seq1, seq2, blosum, -3, -1)))
—> this give me the correct score of 27. All perfect.
The issue I am facing now is that although this script is perfect for me to calculate the score of the 2 sequences aligned (seq1 and seq2), I am struggling to find a neat way to apply this Python script on 10000 alignements.
As I said, seq1 does not vary so nothing to change here. Now, i wanted python to retrieve the 10000 other seq2 sequences from my xls/cvs sheet into a list so that the variable seq2 could take the various string values from that list. For that I did:
import pandas as pd
all_seq2_list=pd.read_csv('10000seq2.csv')
all_seq2_list.values.T[0].tolist()
Which return me the list [‘randompeptide1’, ‘randompeptide2’,…, ‘randompeptide10000’]
So far so good.
I have a vague idea that I should find a way to make a loop where the seq2 variable changes each time, assigning it the new string value (peptide sequence) corresponding to the next sequence in my all_seq2_list but I don’t have enough technical knowledge in python to understand how to do that.
Bottom line is:
I am struggling with the construction of the loop and how to (for each iteration) give the next value from my all_seq2_list to my seq2 variable; and then calculate the 10000 resulting scores
I think it is probably not very difficult for someone who use python every day but as a newbie I am stuck.
Can anyone help me ?
Thank you!
Read more here: Source link