Advantages of complete cross-validation in paired refinement

M. Malý1,2, K. Diederichs3, J. Dohnálek1, P. Kolenko2,1

1Institute of Biotechnology of the Czech Academy of Sciences, Biocev, Průmyslová 595, 252 50 Vestec, Czech Republic

2Czech Technical University in Prague, Faculty of Nuclear Sciences and Physical Engineering, Břehová 7, 115 19 Prague, Czech Republic

3University of Konstanz, Box M647, 78457 Konstanz, Germany

malymar9@fjfi.cvut.cz

In macromolecular crystallography, a high-resolution cutoff is usually applied to the diffraction data to avoid including noisy information in structure refinement. The suitability of the chosen cutoff can later be checked with the paired refinement protocol [1]. Refinement is usually carried out against the majority of reflections, the working set, whereas a small fraction (often 5 %) is randomly selected and set aside as a free set. The free reflections are not used in the refinement calculations but provide cross-validation to monitor the process [2].

In our calculations, we analysed whether paired refinement results depend on the particular selection of free reflections. For this purpose, we chose the crystal structure of the cysteine-bound complex of cysteine dioxygenase from Rattus norvegicus (CDO) [1]. Diffraction data were processed up to 1.42 Å resolution, and the atomic coordinates and atomic displacement parameters (ADPs) of the input structure model were perturbed. Then, paired refinement using REFMAC5 [3] was performed for each of the 20 free sets individually, i.e. following the complete cross-validation protocol. The following high-resolution cutoffs were analysed: 2.0, 1.9, 1.8, 1.7, 1.6, 1.5, and 1.42 Å.
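The partitioning that underlies complete cross-validation can be illustrated with a minimal Python sketch. It assumes reflections are identified only by an integer index and are split at random into 20 disjoint free sets of about 5 % each; the function names `make_free_sets` and `working_set` are hypothetical, and the sketch does not reproduce how any particular crystallographic program assigns free flags.

```python
import random


def make_free_sets(n_reflections, n_sets=20, seed=0):
    """Randomly partition reflection indices into n_sets disjoint free sets."""
    indices = list(range(n_reflections))
    random.Random(seed).shuffle(indices)
    # Set k takes every n_sets-th shuffled index, so set sizes differ by at most one.
    return [sorted(indices[k::n_sets]) for k in range(n_sets)]


def working_set(n_reflections, free_set):
    """Return the indices refined against when free_set is held out."""
    free = set(free_set)
    return [i for i in range(n_reflections) if i not in free]


# Illustrative count only, not taken from the CDO data set.
free_sets = make_free_sets(1000)
work = working_set(1000, free_sets[0])
```

In complete cross-validation, each free set in turn is held out and the full paired refinement is repeated against the corresponding working set, so every reflection serves as a free reflection exactly once.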

The results confirmed that the outcome of paired refinement depends on the free reflection set. The differences between the free sets were considerable, especially at high resolution. The estimated high-resolution cutoff varied from 1.7 Å (one free set) through 1.6 Å (a few free sets) and 1.5 Å (half of the free sets) to 1.42 Å (a few free sets). The averaged results, which are statistically more significant, suggested cutting the data at 1.5 Å resolution. This case demonstrates that the complete cross-validation protocol provides more relevant information across the whole data set than the commonly used single cross-validation protocol.
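How per-free-set results might be averaged into a single cutoff suggestion can be sketched as follows. This is a simplified illustration, not the procedure used in this work: it assumes that each paired refinement step from a lower-resolution cutoff to a higher-resolution one is summarised by one number, the change in Rfree evaluated at the lower-resolution cutoff, and that a step is accepted when the mean change over all free sets is not positive. The function names and the acceptance rule are assumptions.

```python
from statistics import mean


def average_delta_rfree(delta_by_free_set, steps):
    """Average the Rfree change of each cutoff step over all free sets.

    delta_by_free_set: one dict per free set, mapping (d_low, d_high) to the
    Rfree change when the cutoff is extended from d_low to d_high.
    """
    return {step: mean(d[step] for d in delta_by_free_set) for step in steps}


def suggest_cutoff(mean_deltas, steps):
    """Walk the steps in order and stop at the first one that worsens mean Rfree."""
    cutoff = steps[0][0]  # start from the most conservative cutoff
    for step in steps:
        if mean_deltas[step] > 0:  # assumed rule: reject a step that raises mean Rfree
            break
        cutoff = step[1]
    return cutoff
```

Applying `suggest_cutoff` to the result of a single free set instead of the averaged values gives the per-set estimates, which is where the spread between free sets becomes visible.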

1. P. A. Karplus & K. Diederichs, Science, 336, (2012), pp. 1030–1033.

2. A. Brünger, Nature, 355, (1992), pp. 472–475.

3. G. N. Murshudov, A. A. Vagin, E. J. Dodson, Acta Cryst. D, 53, (1997), pp. 240–255.

This publication was supported by the MEYS CR (projects CAAS – CZ.02.1.01/0.0/0.0/16_019/0000778 and BIOCEV – CZ.1.05/1.1.00/02.0109) from the ERDF fund, by the Czech Science Foundation (18-10687S), and by the GA CTU in Prague (SGS19/189/OHK4/3T/14).