minimum number of protein sequences for a sequence logo


Hi,
I’m interested in generating sequence logos for a series of defensin-related proteins that I’ve clustered using CD-HIT. There are approximately 7600 sequences in about 2500 clusters, but many clusters contain only a few sequences. What is the minimum number of sequences a cluster should contain for me to generate a reliable sequence logo from it?
Thanks



It depends on what you are trying to achieve. I think that as long as there are at least 3 sequences in the alignment (which seems to be your average cluster size), the logo will contain some useful information. After all, the job of a logo is to convey the diversity of a protein group, and smaller alignments are likely to be less diverse. There is also a small-sample correction that brings down the information content for small alignments. From the WebLogo paper:

Limited sequence data results in a systematic underestimation of the entropy, which becomes significant if the multiple alignment contains fewer than about 20 nucleotide or 40 protein sequences. By default, WebLogo incorporates a small sample correction (Schneider et al. 1986), which can, in part, ameliorate this bias. In addition, WebLogo can optionally display error bars with heights twice this correction, which gives some idea of the sampling errors made. Note that the error bars may not have uniform height across the logo, as the magnitude of the small sample correction depends on the number of symbols observed at each position. This will vary due to the presence of gaps in the alignment.
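To get a feel for how large this correction is at different alignment depths, here is a minimal sketch. It assumes the commonly used approximate form of the small-sample correction, e(n) = (s - 1) / (2 ln 2 · n), where s is the alphabet size (20 for proteins) and n is the number of non-gap symbols in the column. The function name `column_information` and the choice to clamp negative values at zero are my own for illustration; WebLogo's actual implementation may differ.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Return (uncorrected, corrected) information content in bits for one
    alignment column.

    Assumption: the correction is the approximate form
    e(n) = (s - 1) / (2 * ln(2) * n), with n counted over non-gap symbols.
    """
    symbols = [c for c in column if c not in "-."]  # gaps do not count toward n
    n = len(symbols)
    if n == 0:
        return 0.0, 0.0
    counts = Counter(symbols)
    # Shannon entropy of the observed residue frequencies
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    correction = (alphabet_size - 1) / (2 * math.log(2) * n)
    uncorrected = math.log2(alphabet_size) - entropy
    # Clamp at zero so a column never reports negative information (my choice)
    corrected = max(0.0, uncorrected - correction)
    return uncorrected, corrected


if __name__ == "__main__":
    for depth in (3, 10, 40):
        raw, adj = column_information("C" * depth)
        print(depth, round(raw, 3), round(adj, 3))
```

Under this approximation, a perfectly conserved protein column in a 3-sequence alignment gets a correction of about 4.57 bits, which is larger than the 4.32-bit maximum, so nothing is left after correction. At 40 sequences the correction drops to about 0.34 bits. Because n is counted per column, gappy columns are penalized more, which matches the note about non-uniform error bars.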

From the WebLogo website:

Secondly, the background composition is used in the small sample correction of information content. Briefly, if only a few sequences are available in the multiple sequence alignment, then sites typically appear more conserved than they really are. Small samples bias the relative entropy upwards. To compensate, we add pseudocounts to the actual counts, proportional to the expected background composition. These pseudocounts smooth the data for small samples, but become irrelevant for large samples. The proportionality constant is set to 4 for nucleic acid sequences, and 20 for proteins (these numbers have been found to give reasonable results in practice).
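The pseudocount idea described there can be sketched as follows. This is only an illustration, not WebLogo's code: the function name `relative_entropy`, the uniform background, and the handling of non-standard residues are assumptions of mine. The only number taken from the quote is the total pseudocount weight of 20 for proteins.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def relative_entropy(column, pseudocount_total=20.0, background=None):
    """Relative entropy (bits) of one alignment column after adding
    pseudocounts proportional to the background composition.

    Assumptions: uniform background unless one is supplied; gaps and
    non-standard residues are ignored.
    """
    if background is None:
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(c for c in column.upper() if c in background)
    total = sum(counts.values()) + pseudocount_total
    if total == 0:
        return 0.0
    info = 0.0
    for aa, q in background.items():
        # Smoothed frequency: observed count plus this residue's share of pseudocounts
        p = (counts[aa] + pseudocount_total * q) / total
        if p > 0:
            info += p * math.log2(p / q)
    return info


if __name__ == "__main__":
    for depth in (3, 10, 40, 400):
        print(depth, round(relative_entropy("C" * depth), 3))
```

With 20 pseudocounts, three identical residues are heavily outweighed by the background, so a fully conserved column in a 3-sequence cluster shows only a small fraction of a bit. As the number of sequences grows, the pseudocounts matter less and the value approaches the log2(20) maximum.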

