research papers\(\def\hfill{\hskip 5em}\def\hfil{\hskip 3em}\def\eqno#1{\hfil {#1}}\)

Journal logoSTRUCTURAL
BIOLOGY
ISSN: 2059-7983

SAD phasing of XFEL data depends critically on the error model

CROSSMARK_Color_square_no_text.svg

aMolecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA, and bCenter for Advanced Mathematics for Energy Research Applications, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
*Correspondence e-mail: asbrewster@lbl.gov, nksauter@lbl.gov

Edited by R. J. Read, University of Cambridge, England (Received 7 March 2019; accepted 17 September 2019; online 30 October 2019)

A nonlinear least-squares method for refining a parametric expression describing the estimated errors of reflection intensities in serial crystallographic (SX) data is presented. This approach, which is similar to that used in the rotation method of crystallographic data collection at synchrotrons, propagates error estimates from photon-counting statistics to the merged data. Here, it is demonstrated that the application of this approach to SX data provides better SAD phasing ability, enabling the autobuilding of a protein structure that had previously failed to be built. Estimating the error in the merged reflection intensities requires the understanding and propagation of all of the sources of error arising from the measurements. One type of error, which is well understood, is the counting error introduced when the detector counts X-ray photons. Thus, if other types of random errors (such as readout noise) as well as uncertainties in systematic corrections (such as from X-ray attenuation) are completely understood, they can be propagated along with the counting error, as appropriate. In practice, most software packages propagate as much error as they know how to model and then include error-adjustment terms that scale the error estimates until they explain the variance among the measurements. If this is performed carefully, then during SAD phasing likelihood-based approaches can make optimal use of these error estimates, increasing the chance of a successful structure solution. In serial crystallography, SAD phasing has remained challenging, with the few examples of de novo protein structure solution each requiring many thousands of diffraction patterns. Here, the effects of different methods of treating the error estimates are estimated and it is shown that using a parametric approach that includes terms proportional to the known experimental uncertainty, the reflection intensity and the squared reflection intensity to improve the error estimates can allow SAD phasing even from weak zinc anomalous signal.

1. Introduction

Solving a novel protein structure using X-ray crystallography typically involves either a reliance on a similar structure from which molecular replacement (MR) can be used to derive phasing information, or the presence of heavy atoms that can provide anomalous differences for use in SAD (single-wavelength anomalous dispersion) or MAD (multiple-wavelength anomalous dispersion) phasing (among other methods). In SAD phasing, X-ray anomalous scattering by heavy atoms in the protein structure breaks inversion/Friedel symmetry in the diffraction pattern, with otherwise equivalent reflections typically exhibiting 3–4% differences in intensity. This information can be used to determine the heavy-atom substructure in the protein, which is then used to solve the phasing problem. This approach requires highly accurately measured intensities, and the analysis of such data has been shown to benefit from maximum-likelihood methods (de La Fortelle & Bricogne, 1997[La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472-494.]; McCoy et al., 2004[McCoy, A. J., Storoni, L. C. & Read, R. J. (2004). Acta Cryst. D60, 1220-1228.]), with the caveat that maximum-likelihood methods also require accurate estimates of the merged intensity errors.

In serial crystallography (SX), determining the reflection intensities with the required accuracy and estimating their error is challenging, which has made phasing new structures from SX data difficult. Typically, 102–107 crystals are exposed to either synchrotron or X-ray free-electron laser (XFEL) radiation. Each crystal is exposed once in a random orientation using liquid-stream injection, grid-based raster scanning or acoustic droplet injection (reviewed in Bergmann et al., 2017[Bergmann, U., Yachandra, V. & Yano, J. (2017). X-ray Free Electron Lasers. Cambridge: The Royal Society of Chemistry.]). Individual diffraction patterns are indexed to determine the crystal orientation and unit-cell dimensions, and reflection locations are then predicted and integrated. Because the crystals are not rotated the reflections are only partially recorded, and therefore a post-refinement algorithm is used to apply a partiality correction factor in order to re-express the summed intensity in terms of the structure-factor equivalent. Finally, the redundantly measured reflections are merged together using either a simple average or a weighted average (White, 2014[White, T. A. (2014). Philos. Trans. R. Soc. Lond. B Biol. Sci. 369, 20130330.]; Kabsch, 2014[Kabsch, W. (2014). Acta Cryst. D70, 2204-2216.]; Sauter, 2015[Sauter, N. K. (2015). J. Synchrotron Rad. 22, 239-248.]; Uervirojnangkoorn et al., 2015[Uervirojnangkoorn, M., Zeldin, O. B., Lyubimov, A. Y., Hattne, J., Brewster, A. S., Sauter, N. K., Brunger, A. T. & Weis, W. I. (2015). Elife, 4, e05421.]; Ginn et al., 2015[Ginn, H. M., Brewster, A. S., Hattne, J., Evans, G., Wagner, A., Grimes, J. M., Sauter, N. K., Sutton, G. & Stuart, D. I. (2015). Acta Cryst. D71, 1400-1410.]).

In crystallographic experiments, the error estimates from photon-counting statistics alone do not explain the variance observed in the measurements, always underestimating the variance owing to the presence of other sources of error. In 1985, an IUCr subcommittee on statistical descriptors was tasked to evaluate the validity of the statistical approaches used at the time to determine variances and provide recommendations (Schwarzenbach et al., 1989[Schwarzenbach, D., Abrahams, S. C., Flack, H. D., Gonschorek, W., Hahn, T., Huml, K., Marsh, R. E., Prince, E., Robertson, B. E., Rollett, J. S. & Wilson, A. J. C. (1989). Acta Cryst. A45, 63-75.]). In their report, they suggested that if the multiplicity of the measurements was high enough then simply the spread of the measurements is sufficient to estimate the error. Otherwise, they recommended that crystallographic methods developers use error-propagation approaches to combine uncertainty in photon counting with random and systematic sources of error. Random error sources include readout noise and dark current. Systematic sources of error include X-ray attenuation from air, sample or water, detector misalignment and errors in estimating the wavelength or flux and, in the case of SX, the partiality. Because the reflections are only partially recorded, every measurement is reduced by anywhere from 0% to 100% of its full intensity, depending on the crystal orientation, mosaicity and the spectral characteristics of the beam. It is likely that partiality is the dominant source of error for SX data, where reflection tails touching the Ewald sphere introduce orders of magnitude more uncertainty than reflections that directly intersect the Ewald sphere.

The full list of sources of error is extensive and it is difficult to ensure that all sources of error have been accounted for. To this end, procedures have been developed to adjust error estimates, usually inflating them to larger values, using intensity-dependent and intensity-independent factors, after applying any other known corrections (Leslie, 1999[Leslie, A. G. W. (1999). Acta Cryst. D55, 1696-1702.], 2006[Leslie, A. G. W. (2006). Acta Cryst. D62, 48-57.]; Otwinowski & Minor, 2001[Otwinowski, Z. & Minor, W. (2001). International Tables for Crystallography, Vol. F, edited by M. G. Rossmann & E. Arnold, pp. 226-235. Dordrecht: Kluwer Academic Publishers.]; Kabsch, 2010a[Kabsch, W. (2010a). Acta Cryst. D66, 125-132.],b[Kabsch, W. (2010b). Acta Cryst. D66, 133-144.]; Evans, 2006[Evans, P. (2006). Acta Cryst. D62, 72-82.], 2011[Evans, P. R. (2011). Acta Cryst. D67, 282-292.]). For a full set of references, see Rossmann & Arnold (2001[Rossmann, M. G. & Arnold, E. (2001). Editors. International Tables for Crystallography, Vol. F, ch. 11. Dordrecht: Kluwer Academic Publishers.]).

In the present study, we have found that how the error estimates are obtained directly affects our ability to use SAD phasing to solve an XFEL structure de novo. We examined three methods for treating error and show that only some of them allowed us to find the Zn sites of a thermolysin data set using SAD and subsequently autobuild the structure. We also show that with better error treatment, interpretable maps can be obtained even with fewer measurements.

2. Methods

This work follows directly from the work reported in Brewster et al. (2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]). The data set can be downloaded from cxi.db entry 81 (https://www.cxidb.org/id-81.html), and after indexing and integration consists of over 160 000 crystals from a thermolysin data set collected at the CXI endstation of LCLS on a CSPAD detector (Kern et al., 2014[Kern, J., Tran, R., Alonso-Mori, R., Koroidov, S., Echols, N., Hattne, J., Ibrahim, M., Gul, S., Laksmono, H., Sierra, R. G., Gildea, R. J., Han, G., Hellmich, J., Lassalle-Kaiser, B., Chatterjee, R., Brewster, A. S., Stan, C. A., Glöckner, C., Lampe, A., DiFiore, D., Milathianaki, D., Fry, A. R., Seibert, M. M., Koglin, J. E., Gallo, E., Uhlig, J., Sokaras, D., Weng, T. C., Zwart, P. H., Skinner, D. E., Bogan, M. J., Messerschmidt, M., Glatzel, P., Williams, G. J., Boutet, S., Adams, P. D., Zouni, A., Messinger, J., Sauter, N. K., Bergmann, U., Yano, J. & Yachandra, V. K. (2014). Nature Commun. 5, 4371.]; Hart et al., 2012[Hart, P., Boutet, S., Carini, G., Dubrovin, M., Duda, B., Fritz, D., Haller, G., Herbst, R., Herrmann, S., Kenney, C., Kurita, N., Lemke, H., Messerschmidt, M., Nordby, M., Pines, J., Schafer, D., Swift, M., Weaver, M., Williams, G., Zhu, D., Van Bakel, N. & Morse, J. (2012). Proc. SPIE, 8504, 85040C.]). After indexing, time-dependent ensemble refinement was applied, in which the data were grouped into batches of images and the detector models were then refined to account for the time-dependent shifts in sample position that are likely to arise from instability in the liquid-jetting system (Brewster et al., 2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]). The expected Bijvoet ratio for this system (〈|F+F|〉/〈F〉), comprising two Zn2+ and four Ca2+ atoms in a total of 2561 non-H atoms, is 2.1% (Terwilliger et al., 2016[Terwilliger, T. C., Bunkóczi, G., Hung, L.-W., Zwart, P. H., Smith, J. L., Akey, D. L. & Adams, P. D. (2016). Acta Cryst. D72, 346-358.]; Hendrickson & Teeter, 1981[Hendrickson, W. A. & Teeter, M. M. (1981). Nature (London), 290, 107-113.]).

Unlike in Brewster et al. (2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]), the images were first converted from measured pixel values to photon units, dividing them by an estimated value of 25, as reported by the beamline staff. This experiment used an early-generation CSPAD with a non-uniform gain response; therefore, using a single gain-correction constant greatly oversimplifies the physics of the detector (Hart et al., 2012[Hart, P., Boutet, S., Carini, G., Dubrovin, M., Duda, B., Fritz, D., Haller, G., Herbst, R., Herrmann, S., Kenney, C., Kurita, N., Lemke, H., Messerschmidt, M., Nordby, M., Pines, J., Schafer, D., Swift, M., Weaver, M., Williams, G., Zhu, D., Van Bakel, N. & Morse, J. (2012). Proc. SPIE, 8504, 85040C.]). With these gain-corrected pixel values, we also needed to modify the merging protocol described in Brewster et al. (2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]). We apply a per-image resolution filter during merging, in which the resolution cutoff of each image is determined by the point at which the signal-to-noise ratio (I/σ) falls below a given threshold. To compensate for the fact that I/σ decreases with the square root of the gain, we decreased the threshold from 0.5 to 0.1 [0.1 = 0.5/(25)1/2].

We analyzed three methods for the treatment of error from SX data, as described in Sections 2.1[link]–2.3[link]. After the integrated intensity error estimates had been treated using one of these methods, we used them to create merged intensities Ih and merged error estimates σh according to the following procedure. Given a Miller index h with n measurements of the intensity of h, we define the jth measurement of h as IPhj and the associated photon-counting error as σPhj [referred to as σc(Ihj) in Brewster et al. (2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.])]. The superscript P means that the reflection is only partially observed owing to the measurement being from a still image. The intensity and estimated error are both scaled to their full equivalent values, Ihj and σhj, using a per-image scale factor Gc, a Wilson B factor Bc and a per-reflection partiality correction Phj, all of which were determined during scaling and post-refinement according to Sauter (2015[Sauter, N. K. (2015). J. Synchrotron Rad. 22, 239-248.]),

[I_{hj} = {{I^{\rm P}_{hj}} \over {K_{hj}}}, \eqno (1)]

[\sigma_{hj} = {{\sigma^{\rm P}_{hj}} \over {K_{hj}}}, \eqno (2)]

[K_{hj} = P_{hj}G_{c}\exp\left[-2B_{c}\left({{\sin\theta_{h}} \over {\lambda_{c}}}\right)^{2}\right], \eqno (3)]

where θh is the Bragg angle for Miller index h, λc is the incident wavelength and the subscript c denotes the crystal which gave rise to reflection hj. Phj is the partiality-correction factor for this measurement [see equation (14)[link] of Uervirojnang­koorn et al. (2015[Uervirojnangkoorn, M., Zeldin, O. B., Lyubimov, A. Y., Hattne, J., Brewster, A. S., Sauter, N. K., Brunger, A. T. & Weis, W. I. (2015). Elife, 4, e05421.])], which depends on λc, mosaicity estimates and the unit-cell dimensions and orientation of crystal c. Importantly, the post-refinement of Sauter (2015[Sauter, N. K. (2015). J. Synchrotron Rad. 22, 239-248.]) is similar to the post-refinement described in Winkler et al. (1979[Winkler, F. K., Schutt, C. E. & Harrison, S. C. (1979). Acta Cryst. A35, 901-911.]) and Rossmann et al. (1979[Rossmann, M. G., Leslie, A. G. W., Abdel-Meguid, S. S. & Tsukihara, T. (1979). J. Appl. Cryst. 12, 570-581.]) in that the target function refines the difference between the observed and predicted intensity values. However, the choice of the parameters being refined differs. Here, we refine the misorientation angles of the crystals, Gc, and Bc for each frame, but not the mosaicity itself, which is instead derived from empirically examining which reflections are observed on the image (Sauter et al., 2014[Sauter, N. K., Hattne, J., Brewster, A. S., Echols, N., Zwart, P. H. & Adams, P. D. (2014). Acta Cryst. D70, 3299-3309.]).

After frame-by-frame post-refinement, scaling and partiality correction, we merge the corrected intensities and error estimates according the three protocols detailed below and summarized in Table 1[link].

Table 1
Summary of error-modeling methods

Protocol Weight Description
1 Unweighted error estimates
2 σhj Photon-counting error estimates as weights
3 σEv11 Refine SDFAC terms to inflate photon-counting error estimates
†These are the weights used in (7)[link] and (8)[link], such that the weight w = 1/σ2.

2.1. Protocol 1: unweighted mean

We begin with the suggestion of Schwarzenbach et al. (1989[Schwarzenbach, D., Abrahams, S. C., Flack, H. D., Gonschorek, W., Hahn, T., Huml, K., Marsh, R. E., Prince, E., Robertson, B. E., Rollett, J. S. & Wilson, A. J. C. (1989). Acta Cryst. A45, 63-75.]), in which we use the mean of the measurements to estimate the reflection intensity,

[I_{h} = {{\sum_{j=1}^{n}I_{hj}}\over n}, \eqno (4)]

and we use the observed spread of the measurements to determine the error estimates,

[\sigma_{\rm res} = \left[{{\sum _{j=1}^{n}(I_{hj}-\langle I_{h}\rangle)^{2}} \over {n-1}}\right]^{1/2}, \eqno (5)]

[\sigma_{h} = {{\sigma_{\rm res}} \over {n^{1/2}}}, \eqno (6)]

where σres refers to the residual differences between the measurements and their mean (i.e. the standard deviation), and σh refers to the merged error estimate for reflection h and is the standard error of the mean. Protocol 1 does not use the information in original error estimates from photon counting (σhj), and assumes that a large enough sample of the reflections is available to reliably estimate the uncertainty. This formulation is similar to that in Chapman et al. (2011[Chapman, H. N., Fromme, P., Barty, A., White, T. A., Kirian, R. A., Aquila, A., Hunter, M. S., Schulz, J., DePonte, D. P., Weierstall, U., Doak, R. B., Maia, F. R. N. C., Martin, A. V., Schlichting, I., Lomb, L., Coppola, N., Shoeman, R. L., Epp, S. W., Hartmann, R., Rolles, D., Rudenko, A., Foucar, L., Kimmel, N., Weidenspointner, G., Holl, P., Liang, M., Barthelmess, M., Caleman, C., Boutet, S., Bogan, M. J., Krzywinski, J., Bostedt, C., Bajt, S., Gumprecht, L., Rudek, B., Erk, B., Schmidt, C., Hömke, A., Reich, C., Pietschner, D., Strüder, L., Hauser, G., Gorke, H., Ullrich, J., Herrmann, S., Schaller, G., Schopper, F., Soltau, H., Kühnel, K.-U., Messer­schmidt, M., Bozek, J. D., Hau-Riege, S. P., Frank, M., Hampton, C. Y., Sierra, R. G., Starodub, D., Williams, G. J., Hajdu, J., Timneanu, N., Seibert, M. M., Andreasson, J., Rocker, A., Jönsson, O., Svenda, M., Stern, S., Nass, K., Andritschke, R., Schröter, C.-D., Krasniqi, F., Bott, M., Schmidt, K. E., Wang, X., Grotjohann, I., Holton, J. M., Barends, T. R. M., Neutze, R., Marchesini, S., Fromme, R., Schorb, S., Rupp, D., Adolph, M., Gorkhover, T., Andersson, I., Hirsemann, H., Potdevin, G., Graafsma, H., Nilsson, B. & Spence, J. C. H. (2011). Nature (London), 470, 73-77.]) and White et al. (2012[White, T. A., Kirian, R. A., Martin, A. V., Aquila, A., Nass, K., Barty, A. & Chapman, H. N. (2012). J. Appl. Cryst. 45, 335-341.]), differing slightly in the denominator of σres by using n − 1 instead of n.

2.2. Protocol 2: weighted mean

The distribution of measurement intensities from still images does not follow a Gaussian distribution because every intensity is measured only partially. The reflection partiality is a function of crystal orientation, unit-cell dimensions, wavelength spectrum and crystal mosaicity. Difficulty in estimating these parameters results in integrating weak and highly partial reflections that skew the distribution towards zero. Because of the skewed distribution, the mean is not an ideal estimator of the structure-factor intensity, and so protocol 2 uses a weighted mean and a weighted standard error of the mean to estimate the reflection intensity and the uncertainty in that estimation,

[I_{h} = {{ \sum _{j=1}^{n}w_{hj}I_{hj}} \over {\sum _{j=1}^{n}w_{hj}}}, \eqno (7)]

[\sigma_{h} = \left({{1} \over {\sum _{j=1}^{n}w_{hj}}}\right)^{1/2}, \eqno (8)]

where the weights whj are variance weights derived from the photon-counting error estimates σhj, i.e. the estimated error derived from summing photons as described in Leslie (1999[Leslie, A. G. W. (1999). Acta Cryst. D55, 1696-1702.]), which should follow a Poisson distribution:

[w_{hj} = {{1} \over {\sigma_{{hj}}^{2}}}. \eqno (9)]

2.3. Protocol 3: Ev11

Protocol 3 adjusts the error estimates using terms from Evans (2006[Evans, P. (2006). Acta Cryst. D62, 72-82.]) and Evans (2011[Evans, P. R. (2011). Acta Cryst. D67, 282-292.]): sfac, sB and sadd.1 In Brewster et al. (2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]) we showed that applying these factors to the non-gain-corrected thermolysin data brings down the final merged I/σ estimate to around 30, which is more reasonable for protein crystallography (Diederichs, 2010[Diederichs, K. (2010). Acta Cryst. D66, 733-740.]). We also showed that applying these factors greatly increased the anomalous peak height of the Zn atom (from 44.6σ to 74.0σ). In Brewster et al. (2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]), following the example of Evans (2011[Evans, P. R. (2011). Acta Cryst. D67, 282-292.]), our implementation used a simplex minimizer to refine these terms. In this work, we instead used a gradient-based nonlinear least-squares minimization procedure.

The equation to inflate the estimated error of the individual measurements is

[\sigma_{\rm Ev11}^{2} = s_{\rm fac}^{2}[\sigma_{hj}^{2}+s_{\rm B}^{2}\langle I_{h}\rangle + s_{\rm add}^{2}\langle I_{h}\rangle^{2}], \eqno (10)]

where 〈Ih〉 is the mean of the measurements of h after correcting by the factor Khj. This equation is similar to error propagation, in which additional errors proportional to the intensity, likely derived from instrument instability (sadd), are added in quadrature to the counting-error estimates σhj. In Evans (2011[Evans, P. R. (2011). Acta Cryst. D67, 282-292.]), the sfac term is considered to account for effects such as errors in the gain, converting detector counts to photon counts. The sB term was included to better fit the observed error estimates to a normal distribution, but in Evans (2011[Evans, P. R. (2011). Acta Cryst. D67, 282-292.]) the term was given no physical meaning. Here, we first show how we compute initial estimates of sfac, sB and sadd using normal probability analysis, following Evans (2006[Evans, P. (2006). Acta Cryst. D62, 72-82.]). We then use a limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS; Liu & Nocedal, 1989[Liu, D. C. & Nocedal, J. (1989). Math. Program. 45, 503-528.]) minimizer to refine these parameters until the deviation of normalized error estimates best approaches 1.

After refinement of the sfac, sB and sadd terms, 1/σ2Ev11 is used as a weight in (7)[link] and (8)[link] to compute the weighted mean and weighted standard error of the mean of each reflection, as in protocol 2.

2.3.1. Initial parameter estimates

Estimates of error such as σhj represent the deviation of the measurements Ihj from the unknown population mean value. If these deviations from the mean are normally distributed then the normalized deviations will follow a standard normal distribution, i.e. a Gaussian distribution centered on zero with a standard deviation of 1. We choose initial values of sfac, sB and sadd that best adjust the original deviations such that the normalized deviations approach a standard normal distribution, according to the following procedure.

Normalized deviations. This formulation of the normalized deviations of a set of intensities and sigmas is similar to that described in Evans (2011[Evans, P. R. (2011). Acta Cryst. D67, 282-292.]), but includes the (n − 1)/n factor as currently implemented by AIMLESS. The normalized deviation δhjnorm for Ihj is

[\delta_{{hj{\rm norm}}} = \left({{n-1} \over {n}}\right)^{1/2}{{I_{hj}-\langle I'_{hj}\rangle} \over {\sigma_{hj}}}, \eqno (11)]

where 〈Ihj〉 is the mean of the measurements of h except for Ihj. In the special case where n = 1, 〈Ihj〉 = 0, and since in that case n − 1 = 0, δ2hjnorm = 0. These observations are not included in the normal probability analysis below for the initial parameter estimates.

Normal probability analysis. Using normalized deviations, we can initialize the sfac, sB and sadd parameters using a graphical technique called a `normal probability plot', as suggested by Evans (2006[Evans, P. (2006). Acta Cryst. D62, 72-82.]) (see also Chambers et al., 1983[Chambers, J. M., Cleveland, W. S., Kleiner, B. & Tukey, P. A. (1983). Graphical Methods for Data Analysis, ch. 6. Belmont: Wadsworth.]). A normal probability plot helps to determine how near a sampling of data approaches a normal distribution. Given a sampling of m observations, we sort them and then plot them versus a set of m theoretical or expected values. The theoretical values are perfectly distributed according to the normal distribution. If our observations are indeed normally distributed then the plot will be a straight line with slope 1 and offset 0. The `perfect' theoretical values are normal order statistic medians, also referred to as rankits. In the simple case of m = 5 total observations, the second, third and fourth rankits are equal to the first quartile, median and third quartile of a normal distribution. We compute the rankits in the same way as qqnorm does in R (R Core Team, 2017[R Core Team, (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/.]). The rankit zi for the ith value in m is

[z_{i} = \Phi^{-1}\left({{i-a} \over {m+1-2a}}\right), \eqno (12)]

where Φ−1 is the standard normal quantile function (the inverse of the cumulative distribution function) and where a = 3/8 if m ≤ 10 and 0.5 if m > 10. The expression (ia)/(m + 1 − 2a) in (12)[link] converts i to a number between 0 and 1; therefore, zi is the expected value of the ranked ith sample from a normal distribution. Again, the normal probability plot, or the plot of the rankits versus δhjnorm (where all δhjnorm are first sorted by value), will have a slope of 1 with an offset of 0 if the error estimates are normally distributed. To determine an initial set of parameters, we determine the slope and offset of a line fitted to the central area of this plot (using the area between −0.5 and 0.5 to avoid fitting outliers). sfac is initialized to the slope, as is performed in Evans (2006[Evans, P. (2006). Acta Cryst. D62, 72-82.]). In Evans (2006[Evans, P. (2006). Acta Cryst. D62, 72-82.]), sadd is set to 0.02. As we did not know whether this value was applicable to XFEL data, we experimented with initializing sadd to the normal probability plot offset and sB to sadd1/2. This seemed to give reasonable results. As refinement proceeds, the normal probability plot becomes more linear and the slope approaches 1 as the parameters better correct the estimated errors to approach those derived from sampling a normal distribution (Fig. 1[link]). Note that the normal probability analysis is only used to initialize the parameters; the refinement of the parameters is outlined below.

[Figure 1]
Figure 1
Normal probability plots for 5000 images. A 5000-image subset of the data was merged using protocol 3. During each step of the parameter refinement, a normal probability plot was generated (a). The rankits (equation 12[link]) are plotted versus the sorted normalized deviations from the mean (equation 11[link]). Each line represents one step during refinement and is colored using a rainbow color map from red (early steps) to blue (late steps). This is a non-gain-corrected data set. (b) Enlargement of the central area of (a) used to compute the slope and offset for initialization of the parameters. (c) As (a) but with a gain-corrected data set, in which each pixel was divided by 25. (d) As (b) for the central area of (c).

2.3.2. Parameter refinement

We refine the sfac, sB and sadd parameters using the LBFGS quasi-Newton minimizer requiring only first derivatives. For each step, we evaluate (10[link]) for each σhj and then compute the normalized deviations using (11[link]). The target function fσ minimizes the deviation of the root-mean-squared deviation (r.m.s.d.) of the normalized deviations from 1, as determined over 100 intensity bins. We bin the intensities as follows. For each Miller index h, determine the mean intensity 〈Ih〉 of the measurements of h. The bin width will be the maximum of all 〈Ih〉 for all h minus the minimum 〈Ih〉 for all h divided by 100. For each h, all the measurements of h will be assigned to a single bin based on 〈Ih〉. There will be mb measurements in intensity bin b. Call all the measurements in bin b Ibk, where k ranges from k = 1 to mb. Each Ibk is associated with a normalized deviation, δbknorm, computed using the adjusted error estimate for that measurement of h,

[\delta_{bk{\rm norm}}^{2} = {{n-1} \over {n}} {{(I_{bk}-\langle I'_{hk}\rangle)^{2}} \over {\sigma_{\rm Ev11}^{2}}}, \eqno (13)]

where 〈Ihk〉 is the mean of all measurements of Miller index h except for Ibk. Here, σEv11 is the corrected error estimate for measurement Ibk using (10[link]) (note that the subscripts b and k are suppressed in this reference to σEv11). The target function is then

[f_{\sigma} = \sum \limits_{b=1}^{100} w_{b}\left[1-\left({{\sum _{k=1}^{m_{b}}\delta_{bk{\rm norm}}^{2}} \over {m_{b}}}\right)^{1/2}\right]^{2}, \eqno (14)]

where b iterates over the 100 intensity bins. The term for each bin is weighted by wb = mb1/2. After refinement of the sfac, sB and sadd parameters, we apply them to each σhj to compute the final estimated error for each measurement, σEv11.

The derivatives of the target function (14[link]) with respect to the parameters are shown in Appendix A[link]. The refinement of these terms using LBFGS is protocol 3.

3. Results

We reprocessed the data files from cxi.db entry 81 (Brewster et al., 2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]) comprising 160 000 lattices, including a gain correction (division of the pixel values by 25) prior to integration, and merged them using cxi.merge. In Brewster et al. (2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]) the initial scale factors were derived from the known structure of thermolysin. Here, in contrast, we wanted to solve the structure de novo, so we used an alternate merging protocol. We first averaged all of the data without post-refinement using the cxi.merge default of weighted means and weighted standard errors of the mean (protocol 2). We then used this averaged data set as a scaling reference and merged again, applying post-refinement to each frame, refining the misorientation angles of the crystal, the scale factor and a Wilson B factor (Sauter, 2015[Sauter, N. K. (2015). J. Synchrotron Rad. 22, 239-248.]), but again using the cxi.merge default of weighted means and weighted standard errors of the mean (protocol 2). We then re-merged a third time, using this post-refined data set as a reference for scaling. During this third merging, each of the three error models were applied. This bootstrapping approach to obtain a reference from the unscaled data is similar to how we have merged data before without a reference (Uervirojnangkoorn et al., 2015[Uervirojnangkoorn, M., Zeldin, O. B., Lyubimov, A. Y., Hattne, J., Brewster, A. S., Sauter, N. K., Brunger, A. T. & Weis, W. I. (2015). Elife, 4, e05421.]). For protocol 3, the final values after refinement were sfac = 1.32, sB = 0.71 and sadd = 0.51.

As mentioned above, we applied a gain correction to all images prior to integration, dividing the pixel values by 25 to convert to units of photons. As expected, correcting for gain also had a dramatic effect on the refinement of the SDFAC parameters for Ev11 (protocol 3). We processed a 5000-image subset without gain correction and found that refinement of the SDFAC parameters drove the functional (14[link]) from 3306 to 122, driving the parameters from sfac = 7.47, sB = 0.72 and sadd = 0.52 to sfac = 4.14, sB = 0.00 and sadd = 0.52 over 66 steps. However, for the gain-corrected data, the refinement drove the functional from 156 to 149, driving the parameters from sfac = 1.44, sB = 0.67 and sadd = 0.45 to sfac = 1.43, sB = 0.96 and sadd = 0.45 over 14 steps. The difference between the two refinements can be seen in Fig. 1[link]. Not only did a more substantial minimization need to be performed on the non-gain-corrected data, but the final sfac parameter is quite a bit larger in magnitude, indicating a compensation for the absence of a gain correction. It is also worth noting that the difference between the two initial sfac values is related to the gain ratio (7.472/1.442 = 26.9), again indicating the relationship between sfac and the uncertainty in the gain estimate.

Properly scaled, partiality-corrected and merged intensities reported in units of photons from XFELs should be comparable to the full reflection intensities measured at synchrotrons if all systematic effects have been accounted for. One measure of comparison for the two techniques is the signal to noise, or the I/σ ratio. Fig. 2[link] shows I, σ and I/σ versus resolution plots for the merged data, which show that the error estimates for protocol 2 are orders of magnitude lower than for protocols 1 and 3. Fig. 3[link] shows I/σ versus I plots for all three data sets, as presented in Diederichs (2010[Diederichs, K. (2010). Acta Cryst. D66, 733-740.]) (note that these are for the unmerged data). While Diederichs (2010[Diederichs, K. (2010). Acta Cryst. D66, 733-740.]) was working with reflections that were much better measured and had much lower redundancy, I/σ in photons should be comparable (between 20–40), and indeed we see the overall I values are of the order that is expected (100–104). Protocols 1 and 3 show I/σ values of the order that would be expected from protein crystallography, while protocol 2 has I/σ values that are higher than expected. We also see that data sets 1 and 3 do not display the sigmoidal shape demonstrated in Diederichs (2010[Diederichs, K. (2010). Acta Cryst. D66, 733-740.]), indicating that the signal-to-noise ratio has not reached its limit for this system. This implies there is further work to be performed to remove systematic errors. Finally, note that the scattered data points in protocol 1 (upper left of Fig. 3[link]) that have high I/σ but low I come from reflections with low redundancy (≤2–4). These error estimates, which for protocol 1 come only from the standard error of the mean of the observations, become unreliable with low redundancy. It is likely that a redundancy of at least 5 is required for SX data to be reliable using protocol 1.

[Figure 2]
Figure 2
Intensity and σ versus resolution. 2D histograms of I, σ and I/σ (top, middle and bottom) versus resolution for the three error models. Data are for merged values. Note that the y axes and the color are on a logarithmic scale.
[Figure 3]
Figure 3
I/σ versus I plots with different error models. 2D histograms of I/σ versus I for the three error models. Unmerged intensities and error estimates are shown. In the top and bottom plots the same data are presented but with different scales for the y axis. Note that the color is on a logarithmic scale.

We also examined the overall I/σ trends in the data set. In Hattne et al. (2014[Hattne, J., Echols, N., Tran, R., Kern, J., Gildea, R. J., Brewster, A. S., Alonso-Mori, R., Glöckner, C., Hellmich, J., Laksmono, H., Sierra, R. G., Lassalle-Kaiser, B., Lampe, A., Han, G., Gul, S., DiFiore, D., Milathianaki, D., Fry, A. R., Miahnahri, A., White, W. E., Schafer, D. W., Seibert, M. M., Koglin, J. E., Sokaras, D., Weng, T. C., Sellberg, J., Latimer, M. J., Glatzel, P., Zwart, P. H., Grosse-Kunstleve, R. W., Bogan, M. J., Messerschmidt, M., Williams, G. J., Boutet, S., Messinger, J., Zouni, A., Yano, J., Bergmann, U., Yachandra, V. K., Adams, P. D. & Sauter, N. K. (2014). Nature Methods, 11, 545-548.]), we observed numerous intensities at large negative multiples of I/σ, and we used these negative measurements to compute an additional error-adjustment term to account for this extra uncertainty. To determine whether this approach (termed Ha14; see also Brewster et al., 2018[Brewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877-894.]) was applicable to the data in this work, we examined the distribution of I/σ in a subset of images. We selected integration regions void of Bragg spots and compared the distribution of I/σ between these empty measurements and the measurements where signal is predicted. We found that the large negative intensity outliers seen in Hattne et al. (2014[Hattne, J., Echols, N., Tran, R., Kern, J., Gildea, R. J., Brewster, A. S., Alonso-Mori, R., Glöckner, C., Hellmich, J., Laksmono, H., Sierra, R. G., Lassalle-Kaiser, B., Lampe, A., Han, G., Gul, S., DiFiore, D., Milathianaki, D., Fry, A. R., Miahnahri, A., White, W. E., Schafer, D. W., Seibert, M. M., Koglin, J. E., Sokaras, D., Weng, T. C., Sellberg, J., Latimer, M. J., Glatzel, P., Zwart, P. H., Grosse-Kunstleve, R. W., Bogan, M. J., Messerschmidt, M., Williams, G. J., Boutet, S., Messinger, J., Zouni, A., Yano, J., Bergmann, U., Yachandra, V. K., Adams, P. D. & Sauter, N. K. (2014). Nature Methods, 11, 545-548.]) are absent from our data and that the negative intensities have a similar distribution to the empty measurements (see Fig. 4[link]). Therefore, Ha14 methods do not seem to apply.

[Figure 4]
Figure 4
Histogram of I/σ for signal versus noise. (a) A random subset of 3800 images from one processing run of thermolysin was re-integrated, including the prediction of non-existent reflections at the halfway positions along the c* axis. These predictions, which are halfway between observed reflections, are composed of only noise. (b) Example of reflections labeled with integer L and fractional L indices.

Phasing and autobuilding was performed using phenix.autosol (Adams et al., 2010[Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213-221.]), supplying the thermolysin amino-acid sequence from PDB entry 4tnl (Kern et al., 2014[Kern, J., Tran, R., Alonso-Mori, R., Koroidov, S., Echols, N., Hattne, J., Ibrahim, M., Gul, S., Laksmono, H., Sierra, R. G., Gildea, R. J., Han, G., Hellmich, J., Lassalle-Kaiser, B., Chatterjee, R., Brewster, A. S., Stan, C. A., Glöckner, C., Lampe, A., DiFiore, D., Milathianaki, D., Fry, A. R., Seibert, M. M., Koglin, J. E., Gallo, E., Uhlig, J., Sokaras, D., Weng, T. C., Zwart, P. H., Skinner, D. E., Bogan, M. J., Messerschmidt, M., Glatzel, P., Williams, G. J., Boutet, S., Adams, P. D., Zouni, A., Messinger, J., Sauter, N. K., Bergmann, U., Yano, J. & Yachandra, V. K. (2014). Nature Commun. 5, 4371.]), one NCS copy and using all defaults, except for specifying two Zn atoms as the search target, using a thorough HySS search to 4.0 Å and using a solvent fraction of 0.467 with extreme density modification. Phasing results are shown in Table 2[link].

Table 2
SAD phasing results for different error models

Values in parentheses are for the highest resolution bin.

Protocol 1 2 3
Weight σhj σEv11
Resolution (Å) 80.78–1.80 (1.86–1.80)
I/σ 13.8 (2.7) 59.7 (2.0) 14.0 (1.4)
CC1/2 (%) 99.9 (73.8) 99.8 (63.3) 99.9 (81.4)
Zn2+ peak height (σ) 53.1 50.5 67.0
No. of sites found by HySS§ 6 ± 0 6 ± 0 6 ± 0
No. of residues built (of 316)§ 252.4 ± 15.2 104.1 ± 1.4 297.2 ± 6.5
Model–map CC§ (%) 71.0 ± 0.4 30.3 ± 0.2 80.0 ± 0.1
Rwork§ (%) 27.4 ± 1.8 54.8 ± 0.3 21.2 ± 1.3
Rfree§ (%) 29.9 ± 2.0 57.3 ± 0.8 23.7 ± 1.7
†As in Table 1[link], for a given weight w where w = 1/σ2.
‡These are higher than in Fig. 3[link] because the higher intensity observations are given a higher weight during merging.
§Numbers are mean ± standard deviation over ten trials with differing random-number seeds.
¶Phased map correlation to the known structure.

While all protocols were able to find the six heavy-atom sites, protocol 2 essentially failed during SAD phasing and autobuilding, while the unweighted protocol 1 partially succeeded. Over two thirds of the structure was built with protocol 1 and it is likely that the model could be finished manually. Phasing and autobuilding were successful using error estimates inflated by SDFAC parameters (protocol 3). This protocol also showed an improved ability to phase and autobuild the structure compared with using the unweighted variance (protocol 1). The LBFGS version of SDFAC refinement shows nearly the same results as a simplex minimizer (not shown), but importantly LBFGS is deterministic, does not rely on the randomness in the initialization inherent to simplex minimization and converges in less time and in fewer steps than the simplex minimizer (see below).

To determine whether these algorithms improve the number of images needed for phasing, for each of the three protocols we re-merged the data using increasing numbers of images from 1000 images to the full data set (160 000+ images; Fig. 5[link]). In addition, since we used random sampling to create these subsets, we repeated this sampling ten times for each subset. For the full data set, which could not be sub-sampled, we instead ran the auto-solver ten times with random seeds, as recommended by Bunkóczi et al. (2015[Bunkóczi, G., McCoy, A. J., Echols, N., Grosse-Kunstleve, R. W., Adams, P. D., Holton, J. M., Read, R. J. & Terwilliger, T. C. (2015). Nature Methods, 12, 127-130.]).

[Figure 5]
Figure 5
Effect of image count on autobuilding success. For each of the three protocols, increasing numbers of images were processed. The anomalous peak height for the Zn2+ atom (a), the number of heavy-atom sites found (out of six) (b), the known model-to-map CC (c) and the number of residues built (d) are shown versus the number of images in the data set. In each case, shaded areas indicate the standard deviation of either the ten subsamples (data sets 1000–100 000) or the ten random seeds (full data set, 164 063 images). Note that for (b) certain data points have the same number of sites found in all trials and hence have no standard deviation.

We found that for this zinc SAD phasing experiment we still needed nearly all of the images to autobuild the structure. Autobuilding built about half of the structure with 100 000 images with protocol 3 (Fig. 5d), but failed with fewer images and with the other protocols. However, we can still examine the phasing ability of the data by examining the Zn2+ anomalous peak height (Fig. 5[link]a), the CCmap statistic, which is the correlation of the phased map with the known structure with PDB code 1lnd (Holland et al., 1995[Holland, D. R., Hausrath, A. C., Juers, D. & Matthews, B. W. (1995). Protein Sci. 4, 1955-1965.]) (Fig. 5[link]c), and the number of sites found by phenix.hyss out of the six possible, as determined by using phenix.emma to match sites between the known structure and the SAD-determined sites (Fig. 5[link]b). SDFAC treatment improves the result over the residual-only treatment (compare protocols 1 and 3). Protocol 2 consistently underperformed.

Finally, a note on the performance of simplex versus LBFGS. Using derivative-based minimization drives the optimization to a similar solution using fewer steps. During one trial (not shown) with 10 000 images, the simplex refiner took 88 steps over 932.8 s. However, the LBFGS minimizer took 51 steps over 444.5 s. Both implementations are in Python with C++ sections for the computing-intensive portions. The further addition of OpenMP multiprocessing during the C++ computation of the normalized deviations and derivatives reduced the LBFGS runtime to 322 s (64 cores, accelerating equations 10[link], 11[link] and 15[link]).

4. Discussion

The phasing of serial crystallographic data has been notoriously difficult. Unlike in the rotation method, where the integration, scaling, error-treatment and merging protocols have been well studied, in SX the algorithms continue to be refined to account for the sparseness of the data recorded from each crystal. Most of the few de novo XFEL structures in the PDB required tens to hundreds of thousands of images to solve (Barends et al., 2014[Barends, T. R. M., Foucar, L., Botha, S., Doak, R. B., Shoeman, R. L., Nass, K., Koglin, J. E., Williams, G. J., Boutet, S., Messerschmidt, M. & Schlichting, I. (2014). Nature (London), 505, 244-247.]; Nakane et al., 2015[Nakane, T., Song, C., Suzuki, M., Nango, E., Kobayashi, J., Masuda, T., Inoue, S., Mizohata, E., Nakatsu, T., Tanaka, T., Tanaka, R., Shimamura, T., Tono, K., Joti, Y., Kameshima, T., Hatsui, T., Yabashi, M., Nureki, O., Iwata, S. & Sugahara, M. (2015). Acta Cryst. D71, 2519-2525.], 2016[Nakane, T., Hanashima, S., Suzuki, M., Saiki, H., Hayashi, T., Kakinouchi, K., Sugiyama, S., Kawatake, S., Matsuoka, S., Matsumori, N., Nango, E., Kobayashi, J., Shimamura, T., Kimura, K., Mori, C., Kunishima, N., Sugahara, M., Takakyu, Y., Inoue, S., Masuda, T., Hosaka, T., Tono, K., Joti, Y., Kameshima, T., Hatsui, T., Yabashi, M., Inoue, T., Nureki, O., Iwata, S., Murata, M. & Mizohata, E. (2016). Proc. Natl Acad. Sci. USA, 113, 13039-13044.]; Colletier et al., 2016[Colletier, J.-P., Sawaya, M. R., Gingery, M., Rodriguez, J. A., Cascio, D., Brewster, A. S., Michels-Clark, T., Hice, R. H., Coquelle, N., Boutet, S., Williams, G. J., Messerschmidt, M., DePonte, D. P., Sierra, R. G., Laksmono, H., Koglin, J. E., Hunter, M. S., Park, H.-W., Uervirojnangkoorn, M., Bideshi, D. K., Brunger, A. T., Federici, B. A., Sauter, N. K. & Eisenberg, D. S. (2016). Nature (London), 539, 43-47.]; Nass et al., 2016[Nass, K., Meinhart, A., Barends, T. R. M., Foucar, L., Gorel, A., Aquila, A., Botha, S., Doak, R. B., Koglin, J., Liang, M., Shoeman, R. L., Williams, G., Boutet, S. & Schlichting, I. (2016). IUCrJ, 3, 180-191.]; Hunter et al., 2016[Hunter, M. S., Yoon, C. H., DeMirci, H., Sierra, R. G., Dao, E. H., Ahmadi, R., Aksit, F., Aquila, A. L., Ciftci, H., Guillet, S., Hayes, M. J., Lane, T. J., Liang, M., Lundström, U., Koglin, J. E., Mgbam, P., Rao, Y., Zhang, L., Wakatsuki, S., Holton, J. M. & Boutet, S. (2016). Nature Commun. 7, 13388.]; Yamashita et al., 2015[Yamashita, K., Pan, D., Okuda, T., Sugahara, M., Kodan, A., Yamaguchi, T., Murai, T., Gomi, K., Kajiyama, N., Mizohata, E., Suzuki, M., Nango, E., Tono, K., Joti, Y., Kameshima, T., Park, J., Song, C., Hatsui, T., Yabashi, M., Iwata, S., Kato, H., Ago, H., Yamamoto, M. & Nakatsu, T. (2015). Sci. Rep. 5, 14017.]; Gorel et al., 2017[Gorel, A., Motomura, K., Fukuzawa, H., Doak, R. B., Grünbein, M. L., Hilpert, M., Inoue, I., Kloos, M., Kovácsová, G., Nango, E., Nass, K., Roome, C. M., Shoeman, R. L., Tanaka, R., Tono, K., Joti, Y., Yabashi, M., Iwata, S., Foucar, L., Ueda, K., Barends, T. R. M. & Schlichting, I. (2017). Nature Commun. 8, 1170.]). In this work, we have shown that some of the difficulty arises from how the error estimates are treated and used when merging SX data. In addition to affecting the final merged intensity by being used in the weighted sum, the merged error estimates themselves are used extensively in the maximum-likelihood techniques used by phasing algorithms. For example, in McCoy et al. (2004[McCoy, A. J., Storoni, L. C. & Read, R. J. (2004). Acta Cryst. D60, 1220-1228.]), equation (2) describes the probability of the magnitudes of a set of unphased structure factors given a set of phased structure-factor vectors. While this equation uses only the intensities as inputs, a set of adjustments propagated from the estimated experimental error is presented in Appendix B[link] as initial values used in maximum-likelihood refinement. With this, it is not surprising that accurate estimates of the errors are useful, but it is notable here how striking the difference improving the estimates of errors in the measured reflections makes in the ability to phase XFEL data.

It also interesting to note how important using a good set of weights is when merging the data using a weighted mean. Protocol 1 (unweighted mean) performed consistently better than protocol 2 (weighted mean), even though a weighted mean should be a better estimator of a population mean (especially for the left-skewed intensity distributions seen in XFEL data). Stated differently, using the photon-counting error estimates alone as weights is not optimal, at least in terms of this anomalous phasing exercise. It is only after adjusting the individual measurement error estimates such that they better explain the observed variance through the Ev11 approach that using the error estimates as weights improves the results over the unweighted mean (protocol 3, Ev11).

By no means do we assert that the methods presented here are an exhaustive list of possible ways of treating the errors in XFEL data collection. While we want to note that using the initial estimates of error from integration and applying adjustments to bring them closer to explaining the observed error is important for de novo phasing success, at least when using a weighted sum for averaging intensities together, none of the methods here propagate the error from the partiality correction itself. From (1)[link]–(3)[link] we see that the intensities are corrected using a partiality term, a scale factor and a Wilson B factor. (2)[link] propagates the error in inflating Ihj by Khj, assuming that Khj is a constant, but in reality the parameters comprising Khj are refined quantities. The partiality is dependent on the crystal orientation, unit-cell dimensions, wavelength spectrum and estimated mosaicity (Sauter, 2015[Sauter, N. K. (2015). J. Synchrotron Rad. 22, 239-248.]), and while the true errors in these terms are unknown, they could be estimated from the population of crystals used for merging and then propagated to the factor Khj (see Appendix B[link]). Likewise, the estimated error estimates of these terms could be refined. Initial efforts in this direction have been made and can be accessed using experimental parameters in cctbx.xfel. While this still will not account for the full set of unknown random and systematic errors present in SX data collection, any error propagation in this manner should reduce the reliance on inflationary terms to account for the observed variance in the sample.

5. Software availability

Instructions for downloading and using cctbx.xfel are available at the cctbx.xfel wiki at https://cci.lbl.gov/xfel. See also Brewster et al. (2019[Brewster, A. S., Young, I. D., Lyubimov, A., Bhowmick, A. & Sauter, N. K. (2019). Comput. Crystallogr. Newslett. 10, 22-39.]) for instructions on using the cctbx.xfel graphical user interface (GUI).

APPENDIX A

Derivatives of the target function

As part of least-squares minimization we take the partial derivative of (14[link]) with respect to each of the sfac, sB and sadd parameters, which we refer collectively here as the parameters p. We can perform this via (10)[link], (13)[link], (14)[link] and the chain rule. We first take derivatives with respect to the square of each parameter:

[\eqalignno {{{\partial f_{\sigma}} \over {\partial p^{2}}} & = 2{\sum _{b=1}^{100}}w_{b}\left\{ \left[1-\left({{\sum _{k=1}^{m_{b}} \delta_{bk{\rm norm}}^{2}} \over {m_{b}}}\right)^{1/2}\right]\right. \cr & \times \left. \left[-{{1} \over {2}} \left({{ \sum_{k=1}^{m_{b}} \delta_{bk{\rm norm}}^{2}} \over {m_{b}}}\right)^{-1/2}\right] {{\sum _{k=1}^{m_{b}} {{{\partial\delta_{bk{\rm norm\phantom{_a}}}^{2}}\over {\partial p^{2\phantom{^a}}}}} }\over {m_{b}}}\right\}. & (15)}]

Since the intensity value as used in the computation of δbknorm does not depend on the parameters being refined,

[{{\partial\delta_{bk{\rm norm}}^{2}} \over {\partial p^{2}}} = {{\partial\delta_{bk{\rm norm}}^{2}} \over {\partial\sigma_{\rm Ev11}^{2}}} {{\partial\sigma_{\rm Ev11}^{2}} \over {\partial p^{2}}} = - {{n-1}\over {n}} {{(I_{bk}-\langle I'_{hk}\rangle)^{2}} \over {(\sigma_{\rm Ev11}^{2})^{2}}} {{\partial\sigma_{\rm Ev11}^{2}} \over {\partial p^{2}}}. \eqno (16)]

We can now compute the partial derivatives of (10[link]) with respect to the parameters p. Note that the minimizer refines the terms themselves instead of the squares of the terms, and that (∂p2/∂p) = 2p. Therefore,

[\eqalignno {{{\partial\sigma_{\rm Ev11}^{2}} \over {\partial s_{\rm fac}}} & = {{\partial\sigma_{\rm Ev11}^{2}} \over {\partial s_{\rm fac}^{2}}} {{\partial s_{\rm fac}^{2}} \over {\partial s_{\rm fac}}} = (\sigma_{hj}^{2}+s_{\rm B}^{2}\langle I_{h}\rangle+s_{\rm add}^{2}\langle I_{h}\rangle^{2})2s_{\rm fac}, & (17) \cr{{\partial\sigma_{\rm Ev11}^{2}} \over {\partial s_{\rm B}}} &= {{\partial\sigma_{\rm Ev11}^{2}} \over {\partial s_{\rm B}^{2}}} {{\partial s_{\rm B}^{2}} \over {\partial s_{\rm B}}} = s_{\rm fac}^{2}\langle I_{h}\rangle 2s_{\rm B}, & (18)\cr {{\partial\sigma_{\rm Ev11}^{2}} \over {\partial s_{\rm add}}} &= {{\partial\sigma_{\rm Ev11}^{2}} \over {\partial s_{\rm add}^{2}}} {{\partial s_{\rm fac}^{2}} \over {\partial s_{\rm add}}} = s_{\rm fac}^{2}\langle I_{h}\rangle^{2}2s_{\rm add}. & (19)}]

APPENDIX B

Error propagation for partial reflections

In this work, for each reflection we compute a scaling term Khj that includes the partiality correction, the Wilson B factor and the scaling factor G. Khj depends on the crystal orientation, unit-cell parameters, wavelength, mosaicity and so forth (equations 1[link] and 3[link]). A simple error propagation including the photon-counting error σPhj and assuming no error in Khj is shown in (2)[link]. However, if estimates of the errors in terms comprising Khj were available they could be propagated, and the first few steps of this process are shown here. Given parameters p (p1, p2, …) that contribute to Khj, the propagated error is

[\sigma_{hj}^{2} = (\sigma_{hj}^{\rm P})^{2} \left({{\partial\sigma_{hj}^{\rm P}} \over {\partial p_{1}}}\right)^{2} + (\sigma_{hj}^{\rm P})^{2} \left({{\partial\sigma_{hj}^{\rm P}} \over {\partial p_{2}}}\right)^{2} + \ldots, \eqno (20)]

where again σPhj is the photon-counting error and σ2hj is the propagated error. By the chain rule,

[\eqalignno {\sigma_{hj}^{2} & = (\sigma_{hj}^{\rm P})^{2} \left({{\partial\sigma_{hj}^{\rm P}} \over {\partial K_{hj}}}\right)^{2} \left({{\partial K_{hj}} \over {\partial p_{1}}}\right)^{2} + (\sigma_{hj}^{\rm P})^{2}\left({{\partial\sigma_{hj}^{\rm P}} \over {\partial K_{hj}}}\right)^{2} \left({{\partial K_{hj}} \over {\partial p_{2}}}\right)^{2} + \ldots \cr & = {{(\sigma _{hj}^{\rm P})^{2}} \over {K_{hj}^{2}}}\left({{\partial K_{hj}} \over {\partial p_{1}}}\right)^{2} + {{(\sigma_{hj}^{\rm P})^{2}} \over {K_{hj}^{2}}} \left({{\partial K_{hj}} \over {\partial p_{2}}}\right)^{2}+ \ldots, & (21)}]

which reduces to (2)[link] if the errors in the parameters p are ignored.

Initial implementation of these and further derivatives of Khj with respect to the parameters p as well as refinement of the associated error terms is available through experimental options in cctbx.xfel.

Footnotes

1These terms are re-expressed from the Sdfac, SdB and Sdadd terms in Evans (2011[Evans, P. R. (2011). Acta Cryst. D67, 282-292.]), such that sfac = Sdfac, sB = (SdB)1/2 and sadd = Sdadd.

Acknowledgements

We thank Phil Evans, James Holton and Jan Kern for productive conversations regarding error modeling.

Funding information

This research was supported by NIH grant GM117126. NKS acknowledges support from the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy (DOE) Office of Science and the National Nuclear Security Administration. Further support originates from the National Institute of General Medical Sciences of the National Institutes of Health under Award R01GM109019 and the Advanced Scientific Computing Research and the Basic Energy Sciences programs, which are supported by the Office of Science of the US Department of Energy (DOE) under Contract DE-AC02-05CH11231. Portions of this research were carried out at LCLS at the SLAC National Accelerator Laboratory, supported by the DOE Office of Science, OBES under Contract No. DE-AC02-76SF00515. Data processing was performed in part at the National Energy Research Scientific Computing Center, supported by the DOE Office of Science, Contract No. DEAC02-05CH11231.

References

First citationAdams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213–221.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationBarends, T. R. M., Foucar, L., Botha, S., Doak, R. B., Shoeman, R. L., Nass, K., Koglin, J. E., Williams, G. J., Boutet, S., Messerschmidt, M. & Schlichting, I. (2014). Nature (London), 505, 244–247.  Web of Science CrossRef CAS PubMed Google Scholar
First citationBergmann, U., Yachandra, V. & Yano, J. (2017). X-ray Free Electron Lasers. Cambridge: The Royal Society of Chemistry.  Google Scholar
First citationBrewster, A. S., Waterman, D. G., Parkhurst, J. M., Gildea, R. J., Young, I. D., O'Riordan, L. J., Yano, J., Winter, G., Evans, G. & Sauter, N. K. (2018). Acta Cryst. D74, 877–894.  Web of Science CrossRef IUCr Journals Google Scholar
First citationBrewster, A. S., Young, I. D., Lyubimov, A., Bhowmick, A. & Sauter, N. K. (2019). Comput. Crystallogr. Newslett. 10, 22–39.  Google Scholar
First citationBunkóczi, G., McCoy, A. J., Echols, N., Grosse-Kunstleve, R. W., Adams, P. D., Holton, J. M., Read, R. J. & Terwilliger, T. C. (2015). Nature Methods, 12, 127–130.  Web of Science PubMed Google Scholar
First citationChambers, J. M., Cleveland, W. S., Kleiner, B. & Tukey, P. A. (1983). Graphical Methods for Data Analysis, ch. 6. Belmont: Wadsworth.  Google Scholar
First citationChapman, H. N., Fromme, P., Barty, A., White, T. A., Kirian, R. A., Aquila, A., Hunter, M. S., Schulz, J., DePonte, D. P., Weierstall, U., Doak, R. B., Maia, F. R. N. C., Martin, A. V., Schlichting, I., Lomb, L., Coppola, N., Shoeman, R. L., Epp, S. W., Hartmann, R., Rolles, D., Rudenko, A., Foucar, L., Kimmel, N., Weidenspointner, G., Holl, P., Liang, M., Barthelmess, M., Caleman, C., Boutet, S., Bogan, M. J., Krzywinski, J., Bostedt, C., Bajt, S., Gumprecht, L., Rudek, B., Erk, B., Schmidt, C., Hömke, A., Reich, C., Pietschner, D., Strüder, L., Hauser, G., Gorke, H., Ullrich, J., Herrmann, S., Schaller, G., Schopper, F., Soltau, H., Kühnel, K.-U., Messer­schmidt, M., Bozek, J. D., Hau-Riege, S. P., Frank, M., Hampton, C. Y., Sierra, R. G., Starodub, D., Williams, G. J., Hajdu, J., Timneanu, N., Seibert, M. M., Andreasson, J., Rocker, A., Jönsson, O., Svenda, M., Stern, S., Nass, K., Andritschke, R., Schröter, C.-D., Krasniqi, F., Bott, M., Schmidt, K. E., Wang, X., Grotjohann, I., Holton, J. M., Barends, T. R. M., Neutze, R., Marchesini, S., Fromme, R., Schorb, S., Rupp, D., Adolph, M., Gorkhover, T., Andersson, I., Hirsemann, H., Potdevin, G., Graafsma, H., Nilsson, B. & Spence, J. C. H. (2011). Nature (London), 470, 73–77.  Web of Science CrossRef CAS PubMed Google Scholar
First citationColletier, J.-P., Sawaya, M. R., Gingery, M., Rodriguez, J. A., Cascio, D., Brewster, A. S., Michels-Clark, T., Hice, R. H., Coquelle, N., Boutet, S., Williams, G. J., Messerschmidt, M., DePonte, D. P., Sierra, R. G., Laksmono, H., Koglin, J. E., Hunter, M. S., Park, H.-W., Uervirojnangkoorn, M., Bideshi, D. K., Brunger, A. T., Federici, B. A., Sauter, N. K. & Eisenberg, D. S. (2016). Nature (London), 539, 43–47.  Web of Science CrossRef CAS PubMed Google Scholar
First citationDiederichs, K. (2010). Acta Cryst. D66, 733–740.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationEvans, P. (2006). Acta Cryst. D62, 72–82.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationEvans, P. R. (2011). Acta Cryst. D67, 282–292.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationGinn, H. M., Brewster, A. S., Hattne, J., Evans, G., Wagner, A., Grimes, J. M., Sauter, N. K., Sutton, G. & Stuart, D. I. (2015). Acta Cryst. D71, 1400–1410.  Web of Science CrossRef IUCr Journals Google Scholar
First citationGorel, A., Motomura, K., Fukuzawa, H., Doak, R. B., Grünbein, M. L., Hilpert, M., Inoue, I., Kloos, M., Kovácsová, G., Nango, E., Nass, K., Roome, C. M., Shoeman, R. L., Tanaka, R., Tono, K., Joti, Y., Yabashi, M., Iwata, S., Foucar, L., Ueda, K., Barends, T. R. M. & Schlichting, I. (2017). Nature Commun. 8, 1170.  CrossRef Google Scholar
First citationHart, P., Boutet, S., Carini, G., Dubrovin, M., Duda, B., Fritz, D., Haller, G., Herbst, R., Herrmann, S., Kenney, C., Kurita, N., Lemke, H., Messerschmidt, M., Nordby, M., Pines, J., Schafer, D., Swift, M., Weaver, M., Williams, G., Zhu, D., Van Bakel, N. & Morse, J. (2012). Proc. SPIE, 8504, 85040C.  CrossRef Google Scholar
First citationHattne, J., Echols, N., Tran, R., Kern, J., Gildea, R. J., Brewster, A. S., Alonso-Mori, R., Glöckner, C., Hellmich, J., Laksmono, H., Sierra, R. G., Lassalle-Kaiser, B., Lampe, A., Han, G., Gul, S., DiFiore, D., Milathianaki, D., Fry, A. R., Miahnahri, A., White, W. E., Schafer, D. W., Seibert, M. M., Koglin, J. E., Sokaras, D., Weng, T. C., Sellberg, J., Latimer, M. J., Glatzel, P., Zwart, P. H., Grosse-Kunstleve, R. W., Bogan, M. J., Messerschmidt, M., Williams, G. J., Boutet, S., Messinger, J., Zouni, A., Yano, J., Bergmann, U., Yachandra, V. K., Adams, P. D. & Sauter, N. K. (2014). Nature Methods, 11, 545–548.  CrossRef CAS PubMed Google Scholar
First citationHendrickson, W. A. & Teeter, M. M. (1981). Nature (London), 290, 107–113.  CrossRef CAS PubMed Web of Science Google Scholar
First citationHolland, D. R., Hausrath, A. C., Juers, D. & Matthews, B. W. (1995). Protein Sci. 4, 1955–1965.  CrossRef CAS PubMed Web of Science Google Scholar
First citationHunter, M. S., Yoon, C. H., DeMirci, H., Sierra, R. G., Dao, E. H., Ahmadi, R., Aksit, F., Aquila, A. L., Ciftci, H., Guillet, S., Hayes, M. J., Lane, T. J., Liang, M., Lundström, U., Koglin, J. E., Mgbam, P., Rao, Y., Zhang, L., Wakatsuki, S., Holton, J. M. & Boutet, S. (2016). Nature Commun. 7, 13388.  CrossRef Google Scholar
First citationKabsch, W. (2010a). Acta Cryst. D66, 125–132.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationKabsch, W. (2010b). Acta Cryst. D66, 133–144.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationKabsch, W. (2014). Acta Cryst. D70, 2204–2216.  Web of Science CrossRef IUCr Journals Google Scholar
First citationKern, J., Tran, R., Alonso-Mori, R., Koroidov, S., Echols, N., Hattne, J., Ibrahim, M., Gul, S., Laksmono, H., Sierra, R. G., Gildea, R. J., Han, G., Hellmich, J., Lassalle-Kaiser, B., Chatterjee, R., Brewster, A. S., Stan, C. A., Glöckner, C., Lampe, A., DiFiore, D., Milathianaki, D., Fry, A. R., Seibert, M. M., Koglin, J. E., Gallo, E., Uhlig, J., Sokaras, D., Weng, T. C., Zwart, P. H., Skinner, D. E., Bogan, M. J., Messerschmidt, M., Glatzel, P., Williams, G. J., Boutet, S., Adams, P. D., Zouni, A., Messinger, J., Sauter, N. K., Bergmann, U., Yano, J. & Yachandra, V. K. (2014). Nature Commun. 5, 4371.  CrossRef Google Scholar
First citationLa Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472–494.  PubMed Web of Science Google Scholar
First citationLeslie, A. G. W. (1999). Acta Cryst. D55, 1696–1702.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationLeslie, A. G. W. (2006). Acta Cryst. D62, 48–57.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationLiu, D. C. & Nocedal, J. (1989). Math. Program. 45, 503–528.  CrossRef Web of Science Google Scholar
First citationMcCoy, A. J., Storoni, L. C. & Read, R. J. (2004). Acta Cryst. D60, 1220–1228.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationNakane, T., Hanashima, S., Suzuki, M., Saiki, H., Hayashi, T., Kakinouchi, K., Sugiyama, S., Kawatake, S., Matsuoka, S., Matsumori, N., Nango, E., Kobayashi, J., Shimamura, T., Kimura, K., Mori, C., Kunishima, N., Sugahara, M., Takakyu, Y., Inoue, S., Masuda, T., Hosaka, T., Tono, K., Joti, Y., Kameshima, T., Hatsui, T., Yabashi, M., Inoue, T., Nureki, O., Iwata, S., Murata, M. & Mizohata, E. (2016). Proc. Natl Acad. Sci. USA, 113, 13039–13044.  Web of Science CrossRef CAS PubMed Google Scholar
First citationNakane, T., Song, C., Suzuki, M., Nango, E., Kobayashi, J., Masuda, T., Inoue, S., Mizohata, E., Nakatsu, T., Tanaka, T., Tanaka, R., Shimamura, T., Tono, K., Joti, Y., Kameshima, T., Hatsui, T., Yabashi, M., Nureki, O., Iwata, S. & Sugahara, M. (2015). Acta Cryst. D71, 2519–2525.  Web of Science CrossRef IUCr Journals Google Scholar
First citationNass, K., Meinhart, A., Barends, T. R. M., Foucar, L., Gorel, A., Aquila, A., Botha, S., Doak, R. B., Koglin, J., Liang, M., Shoeman, R. L., Williams, G., Boutet, S. & Schlichting, I. (2016). IUCrJ, 3, 180–191.  Web of Science CrossRef CAS PubMed IUCr Journals Google Scholar
First citationOtwinowski, Z. & Minor, W. (2001). International Tables for Crystallography, Vol. F, edited by M. G. Rossmann & E. Arnold, pp. 226–235. Dordrecht: Kluwer Academic Publishers.  Google Scholar
First citationR Core Team, (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/Google Scholar
First citationRossmann, M. G. & Arnold, E. (2001). Editors. International Tables for Crystallography, Vol. F, ch. 11. Dordrecht: Kluwer Academic Publishers.  Google Scholar
First citationRossmann, M. G., Leslie, A. G. W., Abdel-Meguid, S. S. & Tsukihara, T. (1979). J. Appl. Cryst. 12, 570–581.  CrossRef CAS IUCr Journals Web of Science Google Scholar
First citationSauter, N. K. (2015). J. Synchrotron Rad. 22, 239–248.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationSauter, N. K., Hattne, J., Brewster, A. S., Echols, N., Zwart, P. H. & Adams, P. D. (2014). Acta Cryst. D70, 3299–3309.  Web of Science CrossRef IUCr Journals Google Scholar
First citationSchwarzenbach, D., Abrahams, S. C., Flack, H. D., Gonschorek, W., Hahn, T., Huml, K., Marsh, R. E., Prince, E., Robertson, B. E., Rollett, J. S. & Wilson, A. J. C. (1989). Acta Cryst. A45, 63–75.  CrossRef CAS Web of Science IUCr Journals Google Scholar
First citationTerwilliger, T. C., Bunkóczi, G., Hung, L.-W., Zwart, P. H., Smith, J. L., Akey, D. L. & Adams, P. D. (2016). Acta Cryst. D72, 346–358.  Web of Science CrossRef IUCr Journals Google Scholar
First citationUervirojnangkoorn, M., Zeldin, O. B., Lyubimov, A. Y., Hattne, J., Brewster, A. S., Sauter, N. K., Brunger, A. T. & Weis, W. I. (2015). Elife, 4, e05421.  Web of Science CrossRef Google Scholar
First citationWhite, T. A. (2014). Philos. Trans. R. Soc. Lond. B Biol. Sci. 369, 20130330.  Web of Science CrossRef PubMed Google Scholar
First citationWhite, T. A., Kirian, R. A., Martin, A. V., Aquila, A., Nass, K., Barty, A. & Chapman, H. N. (2012). J. Appl. Cryst. 45, 335–341.  Web of Science CrossRef CAS IUCr Journals Google Scholar
First citationWinkler, F. K., Schutt, C. E. & Harrison, S. C. (1979). Acta Cryst. A35, 901–911.  CrossRef CAS IUCr Journals Web of Science Google Scholar
First citationYamashita, K., Pan, D., Okuda, T., Sugahara, M., Kodan, A., Yamaguchi, T., Murai, T., Gomi, K., Kajiyama, N., Mizohata, E., Suzuki, M., Nango, E., Tono, K., Joti, Y., Kameshima, T., Park, J., Song, C., Hatsui, T., Yabashi, M., Iwata, S., Kato, H., Ago, H., Yamamoto, M. & Nakatsu, T. (2015). Sci. Rep. 5, 14017.  Web of Science CrossRef PubMed Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.

Journal logoSTRUCTURAL
BIOLOGY
ISSN: 2059-7983
Follow Acta Cryst. D
Sign up for e-alerts
Follow Acta Cryst. on Twitter
Follow us on facebook
Sign up for RSS feeds