Clustering of highly homologous sequences to reduce the size of large protein databases

Li, Weizhong
Jaroszewski, Lukasz
Godzik, Adam

¹San Diego Supercomputer Center, La Jolla, CA 92093, USA
²The Burnham Institute, La Jolla, CA 92037, USA

*To whom correspondence should be addressed.

Received on October 4, 2000; revised on November 1, 2000; accepted on November 6, 2000

Bioinformatics 17(3):282-283, March 2001.

Summary

We present a fast and flexible program for clustering large protein databases at different sequence identity levels. The all-against-all sequence comparison and clustering of the non-redundant protein database of over 560 000 sequences takes less than 2 h on a high-end PC. The output database, which includes only the representative sequences, can be used for more efficient and sensitive database searches.
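The summary does not spell out the clustering procedure, so the following is only a toy sketch of what clustering at a chosen identity level and keeping one representative per cluster could look like. It assumes a simple greedy scheme (longest sequence first, each sequence joins the first representative it matches) and uses longest-common-subsequence length divided by the shorter sequence length as a stand-in identity measure. The function names, the identity definition, and the example sequences are all hypothetical and are not taken from the program itself.

```python
from typing import Dict, List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Stand-in identity (an assumption): LCS length over the shorter length."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


def cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative name to its member names."""
    clusters: Dict[str, List[str]] = {}
    # Process longest sequences first; ties are broken by name for determinism.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        for rep in clusters:
            if identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    toy = {
        "a": "MKTAYIAKQR",
        "b": "MKTAYIAKQ",
        "c": "GGGGGGGG",
        "d": "MKTAYLAKQR",
    }
    for level in (0.95, 0.85):
        reps = cluster(toy, level)
        print(level, sorted(reps))
```

The keys of the returned dictionary play the role of the reduced, representative-only database. Lowering the threshold merges more sequences into each cluster and so leaves fewer representatives. Note that this sketch compares every sequence against every representative with a quadratic-time measure, so it is not suitable for databases of the size discussed above.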

Availability

The program is available from http://bioinformatics.burnham-inst.org/cd-hi

Contact

[email protected] or [email protected]