CD-HIT Clustering evaluation

I am using CD-HIT to cluster some protein sequences, and I would like to evaluate the performance of the clustering on my dataset. Is there a tool for this, given that I have benchmark clustering results for those sequences?

Also, is there a script available to collect the actual sequences from the CD-HIT result file, i.e. the actual sequences instead of the names in the following results?

>Cluster 0 
0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *
>Cluster 1 
0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80%
1 2215aa, >PF06317.1|O09705_9VIRU/1-2215... at 84% 
2 2217aa, >PF06317.1|Q6Y630_9VIRU/1-2217... * 
3 2216aa, >PF06317.1|Q6GWS6_9VIRU/1-2216... at 84% 
4 527aa, >PF06317.1|Q67E14_9VIRU/6-532... at 63%

UPDATE: for clustering performance evaluation, I am using scikit






Yeah, one of the problems of bioinformatics is these odd data formats that do not lend themselves to automation. Here it looks like you would need to write a little data parser in a programming language. Here is a beast in awk:

cat out.clstr | awk ' /Cluster/ { no+=1;}; !/Cluster/ { id=substr($3, 2, length($3)-4); printf("%s\t%s\n", no, id) } '

This will print the cluster number (counted from 1, so CD-HIT's Cluster 0 becomes 1) and the sequence ID, which you can then use to extract the sequences:

1   PF04998.6|RPOC2_CHLRE/275-3073
2   PF06317.1|Q6Y625_9VIRU/1-2214
2   PF06317.1|O09705_9VIRU/1-2215
2   PF06317.1|Q6Y630_9VIRU/1-2217
2   PF06317.1|Q6GWS6_9VIRU/1-2216
2   PF06317.1|Q67E14_9VIRU/6-532
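
To go from that table to the sequences themselves, one option is a second awk pass over the FASTA file that was clustered. The following is only a sketch under some assumptions: the function name clstr_to_seqs and the file name seqs.fasta are made up for illustration, and it assumes the names in the .clstr file are not truncated, so that each one equals the first word of the matching FASTA header. If your names are cut off, CD-HIT's -d option controls how much of the name is written to the .clstr file (check the help of your version).

```shell
# Hypothetical helper: clstr_to_seqs CLSTR_FILE FASTA_FILE
# Prints "cluster<TAB>id<TAB>sequence" for every FASTA record whose id
# appears in the CD-HIT .clstr file. Output follows the FASTA record order.
# Assumption: ids in the .clstr file are complete (not truncated).
clstr_to_seqs() {
  awk '
    # Print the current FASTA record if its id belongs to a cluster.
    function emit() {
      if (cur != "" && (cur in cluster))
        printf("%s\t%s\t%s\n", cluster[cur], cur, seq)
    }

    # First file (.clstr): remember the cluster number of each id.
    NR == FNR {
      if ($1 == ">Cluster") { no += 1; next }   # same 1-based count as above
      if (NF < 3) next                          # skip blank or odd lines
      id = substr($3, 2, length($3) - 4)        # drop leading ">" and trailing "..."
      cluster[id] = no
      next
    }

    # Second file (FASTA): a header line closes the previous record.
    /^>/ {
      emit()
      cur = substr($1, 2)                       # id = first word without ">"
      seq = ""
      next
    }

    # Sequence lines may be wrapped, so join them into one string.
    { seq = seq $0 }

    END { emit() }
  ' "$1" "$2"
}

# Example call (seqs.fasta is a placeholder for your input FASTA):
#   clstr_to_seqs out.clstr seqs.fasta | sort -n > clusters_with_seqs.tsv
```

Sequences that are in the FASTA file but not in the .clstr file are silently skipped, and the whole .clstr table is held in memory, which should be fine for typical Pfam-sized inputs.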
