Dataset

Benchmark Dataset

Several benchmark datasets of protein structures were constructed in the course of this work to evaluate domain assignment algorithms. All of the datasets are based on a consensus principle, i.e. agreement among several expert methods on how the structure is partitioned into domains (see Expert consensus approach).

Benchmark_1 is known as the "set of 467 chains" in Veretnik et al. This is the largest set of structures for which agreement between the CATH, SCOP and AUTHORS methods was available at the time. Not all of the 467 chains have an expert consensus for domain assignments, as can be seen in the analysis. Expert consensus is achieved in 375 chains (80% of all the chains). This dataset is now called Benchmark_1_consensus. While this is the largest consensus benchmark dataset, it is severely biased toward single-domain chains: 318 (85%) are one-domain chains, 40 (10.7%) are two-domain chains, 15 (4%) are three-domain chains, and 1 (0.3%) is a four-domain chain. The original Benchmark_1 dataset is still available.
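The proportions quoted above can be recomputed from the chain counts. The following is a minimal sketch in Python; the variable names and the `percent` helper are illustrative and not part of the dataset distribution. Note that the four per-category counts listed in the text add up to 374, while the percentages are taken relative to the stated 375 consensus chains.

```python
# Chain counts by number of domains, as quoted in the text for Benchmark_1_consensus.
domain_counts = {1: 318, 2: 40, 3: 15, 4: 1}
total_chains = 467        # all chains in Benchmark_1
consensus_chains = 375    # chains with expert consensus, as stated in the text


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of Benchmark_1 chains for which the expert methods agree.
consensus_share = percent(consensus_chains, total_chains)

# Per-category share, relative to the stated number of consensus chains.
breakdown = {n: percent(c, consensus_chains) for n, c in domain_counts.items()}

# Combined share of all multi-domain chains (two or more domains).
multi_domain = sum(c for n, c in domain_counts.items() if n > 1)
multi_domain_share = percent(multi_domain, consensus_chains)

print("consensus share of Benchmark_1:", consensus_share)
for n, share in sorted(breakdown.items()):
    print(f"{n}-domain chains: {domain_counts[n]} ({share}%)")
print("multi-domain share:", multi_domain_share)
```

Under these counts, multi-domain chains make up roughly 15% of Benchmark_1_consensus, which illustrates the bias toward single-domain chains.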

Balanced_Domain_Benchmark_2, known as Benchmark_2 in Holland et al., is based on the same principle as the Benchmark_1_consensus dataset. However, to construct a comprehensive benchmark in which the majority of the structures are multi-domain proteins and each combination of structural topologies is represented only once, the AUTHORS assignments dataset was dramatically expanded to include structures with every type of domain combination. This resulted in 315 protein structures, of which over 66% are multi-domain proteins (see Table 1 for the detailed breakdown). Furthermore, this is a non-redundant set of structures in which each combination of topologies is represented only once in each of the 2-, 3-, ..., 6-domain subsets (see Table 1).
Half of this dataset is available for download, while the other half is reserved for independent evaluation of domain assignment methods.

Balanced_Domain_Benchmark_3, known as Benchmark_3 in Holland et al., is based on a more stringent definition of consensus among expert methods, which requires a significant match between the boundaries of the assigned domains. Forty-four chains in which the overlap between the domains was below 90% were removed from Balanced_Domain_Benchmark_2, so the entire Balanced_Domain_Benchmark_3 consists of 271 chains (see Table 1 for details).
Half of this dataset is available for download, while the other half is reserved for independent evaluation of domain assignment methods.
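The text does not specify how the overlap between domain assignments is measured, so the sketch below is only one plausible reading, not the procedure used to build the benchmark. It assumes that each assignment is a list of residue-number sets (one set per domain) and that overlap is the fraction of residues placed in matching domains under the best one-to-one pairing of domains. The function names `domain_overlap` and `meets_boundary_consensus` are hypothetical; only the 90% threshold comes from the text.

```python
from itertools import combinations, permutations


def domain_overlap(assignment_a, assignment_b):
    """Fraction of residues that fall in matching domains (assumed definition).

    Each assignment is a list of sets of residue numbers, one set per domain.
    Assignments with different numbers of domains score 0.0.
    """
    if len(assignment_a) != len(assignment_b):
        return 0.0
    all_residues = set().union(*assignment_a, *assignment_b)
    if not all_residues:
        return 0.0
    best_shared = 0
    # Domain order is arbitrary, so try every pairing (at most 6 domains here).
    for paired in permutations(assignment_b):
        shared = sum(len(a & b) for a, b in zip(assignment_a, paired))
        best_shared = max(best_shared, shared)
    return best_shared / len(all_residues)


def meets_boundary_consensus(assignments, threshold=0.9):
    """True if every pair of expert assignments overlaps by at least threshold."""
    return all(
        domain_overlap(a, b) >= threshold
        for a, b in combinations(assignments, 2)
    )


# Made-up two-domain chain of 200 residues, for illustration only.
cath = [set(range(1, 101)), set(range(101, 201))]
scop = [set(range(1, 96)), set(range(96, 201))]
authors = [set(range(1, 71)), set(range(71, 201))]

print(domain_overlap(cath, scop))                       # boundaries differ by 5 residues
print(meets_boundary_consensus([cath, scop]))           # passes at 90%
print(meets_boundary_consensus([cath, scop, authors]))  # fails: a boundary shifted by 30
```

In this reading, a chain is kept only when all pairs of expert assignments agree on at least 90% of residues; a different overlap measure (for example, a per-domain score) would change which chains pass.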

This work is sponsored by the National Institutes of Health (NIH), Grant Number GM63208 (NIH/NIGMS).