I am working on aligning protein orthologs from different species, using the Ensembl API. Strangely, some protein sequences from non-human species contain a lot of X's. What does that mean? In theory, if the genome sequence is known, the protein sequence should be known too, right? And how should I score these X's when I calculate conservation scores? Thanks a lot. One example is ENSMEUP00000002410 from Notamacropus eugenii.
If I remember correctly, X is the protein equivalent of N in nucleotide sequences, in other words an unknown amino acid (unknown as in “it couldn’t be determined”, not as in “new, never seen before”).
This can happen if the genome on which the gene/protein was annotated still has (quite a few) Ns in the genomic sequence. If an N appears in the ‘wrong’ position in a codon, you can’t determine which amino acid it codes for, and so that codon is ‘translated’ as an X.
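To make the codon argument concrete, here is a minimal sketch in plain Python. It assumes the standard genetic code and a hypothetical helper called `translate`; it is not Ensembl's actual translation code, and whether Ensembl resolves partially ambiguous codons the same way is an assumption. A codon containing N is expanded to every possible codon; if all of them encode the same amino acid, that amino acid is kept, otherwise the result is X.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_codon(codon):
    """Translate one codon; return 'X' when an N makes the result ambiguous."""
    codon = codon.upper()
    # Replace every N by all four bases and enumerate the resulting codons.
    options = [BASES if base == "N" else base for base in codon]
    amino_acids = {CODON_TABLE.get("".join(c), "X") for c in product(*options)}
    # Only keep the amino acid if every possible codon agrees on it.
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate(cds):
    """Translate an in-frame CDS (hypothetical helper, illustration only)."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(translate_codon(cds[i:i + 3]) for i in range(0, usable, 3))


print(translate("ATGGCNNNNTAA"))  # GCN is always Ala, NNN is undetermined
print(translate("ATGNTTCTN"))     # NTT could be F, L, I or V, so it becomes X
```

Note that an N in the third codon position is often harmless (GCN is alanine whatever the N is), which is why only Ns in the ‘wrong’ position produce an X. Long runs of X's usually correspond to assembly gaps, where whole codons are NNN.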
As lieven.sterck said, X is often used to denote an unknown amino acid, and Ensembl certainly seems to follow this convention, as evidenced by the long stretches of X's in some sequences. However, I’ve also noticed cases where an X appears in the protein sequence even though directly translating the corresponding Ensembl coding sequence (CDS) would give a stop codon at that position. This happens in multiple CDS/protein pairs (e.g. ENST00000673047.2 and ENST00000229022.9 in the human CDS/protein files, Ensembl release 104).
I think it is possible that Ensembl is using X to signify something else (in addition to an unknown amino acid), but I have yet to identify a pattern or find any info on this.
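For anyone who wants to look for these cases systematically, the check can be sketched as follows. This is only an illustration: `find_stop_as_x` is a hypothetical helper, and it assumes that the CDS and protein strings have already been loaded from the Ensembl files, that the CDS starts in frame at its first base, and that the standard stop codons apply.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_stop_as_x(cds, protein):
    """Return (1-based protein position, codon) for each X that sits on a stop codon.

    Assumes the CDS is in frame from its first base (hypothetical helper).
    """
    hits = []
    for i, amino_acid in enumerate(protein):
        codon = cds[3 * i:3 * i + 3].upper()
        # An X from an unknown base would have an N in its codon;
        # here we only report X's whose codon is a clean stop codon.
        if amino_acid == "X" and codon in STOP_CODONS:
            hits.append((i + 1, codon))
    return hits


# Toy example (made-up sequence, not real Ensembl data):
# codons are ATG TGA GCN NNN AAA TAA
print(find_stop_as_x("ATGTGAGCNNNNAAATAA", "MXAXK"))
```

Collecting the reported codons and positions across many transcripts might help reveal a pattern, for example whether the X's always fall on the same stop codon.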