PROVEAN input script

PROVEAN input script


Hi everyone,

I would like to write a python script to extract amino acid substitutions and indels from several protein alignments, and use it as an input for variant effect prediction with PROVEAN. So far I have modified an initial script (C: How to extract polymorphic positions from several alignments?) to extract only amino acid substitutions, but I haven’t been able to extract correctly the indels according to PROVEAN input format ( I have been able to extract indels position by position, but this doesn’t work for PROVEAN. Any help would be very welcome!!

Here is the script:

for sequence in sequences: 
    print('Alignment:    ' + sequence.split("")[-1].split('_')[0])
    alignment =, 'fasta')
    align_array_trim = prune_aln(alignment) #prune_aln is a function that I've created.
    align_array = np.array([list(rec) for rec in alignment], np.character)
    single = {}

    if ((align_array_trim.shape[1] / alignment.get_alignment_length()) > 0.7): 
    #Only alignments over 70% of initial length after gaps are removed
        for record in alignment:

        refseq = list(alignment)[-1]
        file1 ='|')[1] + '.fasta'
        f = open(filepath1,'w')
        f.write('>' + + 'n' + str(refseq.seq) + 'n')

        for column in range(align_array.shape[1]):
            c = str(align_array[:,column])
            counter = Counter(align_array[:,column])
            if (len(counter) == 2):
            # single polymorphism
                main_letter = counter.most_common(1)[0][0]
                letter = counter.most_common(2)[1][0]
                for pos in np.where(align_array[:,column] == letter)[0]:
                    change = main_letter.decode('UTF-8') + str(column + 1) + letter.decode('UTF-8')
                    single.setdefault(change, [])

        file2 ='|')[1] + '_variants.txt'
        f2 = open(filepath2,'w')              
        for change in single:
            if not ('-' in change): #Here I'm avoiding indels because I don't know how to include them
                number_spp = len(single.get(change))
                if (number_spp == 1) and (single.get(change)[0] == '<Species_of_interest>'): #I only want changes that are otherwise conserved and just change in my focus species.
                    f2.write(str(re.findall('d+', change)[0]) + ',' + change[0] + ',' + change[-1] + 'n')


       #Input: a protein sequence and amino acid variants
       #Provean input format per alignment: <position>,<reference_amino_acids>,<variant_amino_acids>



variant effect prediction


So,you got the problem solved in the end ?

before adding your answer.

Traffic: 2415 users visited in the last hour

Read more here: Source link