CD-HIT Clustering evaluation
I am using CD-HIT to cluster some protein sequences and would like to evaluate how well the clustering performs on my dataset. Is there a tool for this, given that I have benchmark clustering results for those sequences?
Also, is there a script that collects the actual sequences from a CD-HIT result file, i.e. the sequences themselves instead of the names in the following results?
>Cluster 0
0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *
>Cluster 1
0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80%
1 2215aa, >PF06317.1|O09705_9VIRU/1-2215... at 84%
2 2217aa, >PF06317.1|Q6Y630_9VIRU/1-2217... *
3 2216aa, >PF06317.1|Q6GWS6_9VIRU/1-2216... at 84%
4 527aa, >PF06317.1|Q67E14_9VIRU/6-532... at 63%
UPDATE: for clustering performance evaluation, I am using scikit-learn.
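For readers wondering what such an evaluation computes, here is a minimal sketch of the adjusted Rand index, one common way to score a clustering against a benchmark. It assumes both clusterings are given as label lists in the same sequence order; the function name `adjusted_rand_index` is made up for this illustration. scikit-learn provides the same measure as `sklearn.metrics.adjusted_rand_score`, which is the better choice in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(reference, predicted):
    """Adjusted Rand index between two labelings of the same sequences.

    reference[i] and predicted[i] are the cluster labels of sequence i
    in the benchmark and in the CD-HIT result (assumed input layout).
    """
    if len(reference) != len(predicted):
        raise ValueError("both labelings must cover the same sequences")
    total_pairs = comb(len(reference), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two sequences: nothing to disagree on

    # Count sequence pairs that sit together in both, or in each, labeling.
    together_both = sum(comb(c, 2) for c in Counter(zip(reference, predicted)).values())
    together_ref = sum(comb(c, 2) for c in Counter(reference).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())

    expected = together_ref * together_pred / total_pairs  # agreement expected by chance
    maximum = (together_ref + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one single cluster
    return (together_both - expected) / (maximum - expected)
```

A score of 1.0 means the two clusterings group the sequences identically (cluster numbering does not matter), and values near 0 mean the agreement is no better than chance.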
Yeah, one of the problems of bioinformatics is these odd data formats that do not lend themselves to automation. Here it looks like you would need to fashion a little data parser in a programming language. Here is a beast in awk:
cat out.clstr | awk ' /Cluster/ { no+=1;}; !/Cluster/ { id=substr($3, 2, length($3)-4); printf("%s\t%s\n", no, id) } '
This will print the cluster number (counted from 1, so CD-HIT's Cluster 0 becomes 1) and the sequence ID, which you can then use to extract the sequence:
1 PF04998.6|RPOC2_CHLRE/275-3073
2 PF06317.1|Q6Y625_9VIRU/1-2214
2 PF06317.1|O09705_9VIRU/1-2215
2 PF06317.1|Q6Y630_9VIRU/1-2217
2 PF06317.1|Q6GWS6_9VIRU/1-2216
2 PF06317.1|Q67E14_9VIRU/6-532
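To go from those IDs to the actual sequences, something like the following Python sketch might do. It assumes the IDs in the .clstr file are complete (CD-HIT can truncate them, depending on its description-length setting) and that each ID matches the first word of a header in the FASTA file that was clustered. The function names are made up for this example, and it keeps CD-HIT's own cluster numbering, which starts at 0.

```python
import re
from collections import defaultdict

# Captures the id in a member line such as
# "0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *"
MEMBER = re.compile(r">(\S+?)\.\.\.")


def read_clusters(clstr_lines):
    """Map each CD-HIT cluster number to the list of sequence ids in it."""
    clusters = defaultdict(list)
    current = None
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        else:
            match = MEMBER.search(line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)


def read_fasta(fasta_lines):
    """Map each sequence id (first word of the header) to its sequence."""
    parts = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            parts[name] = []
        elif line and name is not None:
            parts[name].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def sequences_by_cluster(clusters, sequences):
    """Pair every clustered id with its sequence, skipping ids not in the FASTA."""
    return {
        number: [(sid, sequences[sid]) for sid in ids if sid in sequences]
        for number, ids in clusters.items()
    }
```

Both readers accept any iterable of lines, so open file handles can be passed in directly, e.g. read_clusters(open("out.clstr")) together with read_fasta on the FASTA file that was given to CD-HIT.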