computer programs

JOURNAL OF APPLIED CRYSTALLOGRAPHY
ISSN: 1600-5767

SNAP-1D: a computer program for qualitative and quantitative powder diffraction pattern analysis using the full pattern profile

Department of Chemistry, University of Glasgow, Glasgow G12 8QQ, Scotland
*Correspondence e-mail: gbarr@chem.gla.ac.uk

(Received 19 January 2004; accepted 14 May 2004)

SNAP-1D is a computer program for the qualitative and quantitative analysis of powder diffraction data using the full measured data set. As measures of similarity between patterns, non-parametric statistical tests based on Spearman's correlation coefficient and the Kolmogorov–Smirnov test are used. Traditional correlation coefficients based on the Pearson formalism are also employed. This combination, suitably weighted, gives a reliable measure of qualitative pattern similarity. The method can be extended to the quantitative analysis of mixtures by using the above methods in conjunction with singular value decomposition techniques. A full description of the theory with suitable examples has been published elsewhere [Gilmore et al. (2004). J. Appl. Cryst. 37, 231–242]; here the focus is on the computer software itself. The program is commercially available, and runs on PCs under the Windows 2000 and XP operating systems with modest hardware requirements. An easy-to-use graphical interface is supplied.

1. Introduction

Pattern-matching software for X-ray powder diffraction has, until recently, relied on simplified patterns in which the full diffraction profile is reduced to a set of the strongest peaks, each of which is usually further reduced to a d-spacing (or 2θ value) and the corresponding intensity (the dI system). This simplified approach to the analysis of powder diffraction patterns has advantages primarily in computer storage requirements and the speed of the associated search algorithms, especially when handling very large databases. SNAP-1D, in contrast, is a computer program that employs every measured data point for both qualitative pattern matching ('What is most like a given pattern?') and quantitative calculations ('What are the components of this mixture?'). The theory and several examples have been published (Gilmore et al., 2004); here we present a detailed description of the software and its options.

2. Importing and pre-processing data

On opening the program, the user can either select an existing database or create a new one into which a set of patterns is incorporated. Data import and pre-processing proceeds as follows.

(a) Data are imported as ASCII xy data (2θ, intensity) with comma or tab delimiters, in CIF format (Hall et al., 1991), in MDI ASCII format or in Bruker raw data format. CIF files are the preferred option; their entries are scanned for unit-cell information, cell contents, formula etc., which can be examined later in the program. A platform-independent binary format is also employed for these data and is used internally in the associated software. The ASCII format can also be used to import other data types, such as IR or Raman data, which can, with modification, be used with this software.

(b) The intensity data are normalized.

(c) The pattern is interpolated or extrapolated if necessary to give increments of 0.02° in 2θ. Neville's algorithm is used (Press et al., 1992). It is important that all patterns have the same constant data step size.

(d) Background removal is optional. When requested, local nth-order polynomial functions are fitted to the data and then subtracted to remove the background. Three independent 2θ domains are usually defined, but this can be modified for difficult cases.

(e) Background removal is followed by the optional smoothing of the data using wavelets via the SURE (Stein's Unbiased Risk Estimate) thresholding procedure (Donoho & Johnstone, 1995).

(f) Peak positions are also optionally found using Savitzky–Golay filtering (Savitzky & Golay, 1964). Only two of the four matching techniques use the peak positions; if these tests are not used, peak positions are not needed.
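Steps (b) and (c) can be illustrated with a short Python sketch. It is not the SNAP-1D code: the function names, the choice of scaling the strongest point to 1, and the use of four neighbouring points for each local Neville interpolation are assumptions made purely for illustration.

```python
from bisect import bisect_left


def normalize(intensity):
    """Scale intensities so that the strongest point equals 1 (assumed convention)."""
    peak = max(intensity)
    return [v / peak for v in intensity]


def neville(xs, ys, x):
    """Evaluate at x the polynomial through the points (xs[i], ys[i])."""
    p = list(ys)
    n = len(xs)
    for level in range(1, n):
        for i in range(n - level):
            p[i] = ((x - xs[i + level]) * p[i] + (xs[i] - x) * p[i + 1]) / (
                xs[i] - xs[i + level]
            )
    return p[0]


def resample(two_theta, intensity, step=0.02, points=4):
    """Interpolate a pattern onto a constant 2-theta step with local Neville fits.

    Assumes two_theta is sorted and holds at least `points` values.
    """
    count = int((two_theta[-1] - two_theta[0]) / step + 1e-9) + 1
    grid = [two_theta[0] + i * step for i in range(count)]
    resampled = []
    for t in grid:
        j = bisect_left(two_theta, t)
        # Centre a window of `points` measured values on the target angle.
        lo = min(max(j - points // 2, 0), len(two_theta) - points)
        window = slice(lo, lo + points)
        resampled.append(neville(two_theta[window], intensity[window], t))
    return grid, resampled


def cubic(x):
    """Synthetic test profile; a four-point fit reproduces a cubic."""
    return (x - 10.0) ** 3 - 2.0 * (x - 10.0) + 1.0


raw_x = [10.0, 10.03, 10.05, 10.09, 10.12, 10.16]  # irregular steps
raw_y = [cubic(x) for x in raw_x]
grid, values = resample(raw_x, raw_y)
```

Using a small local window rather than one polynomial through every point keeps the interpolation stable for long patterns. The window size is a tunable assumption of this sketch.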

Fig. 1 shows the Pattern Editor window of SNAP-1D, in which all the options described above are set. Processing may be applied to all patterns in a database at once, or individually as required.

Figure 1
The Pattern Editor window in SNAP-1D. The options to subtract the background, find the peaks, set the peak level and smooth the data using wavelets are all set here. If CIF or raw files are used as the data source, extra data fields can be examined. The Advanced tab allows the input of unit-cell dimensions and contents for quantitative analysis to obtain the weight percentage. Multiple excluded regions in 2θ can also be defined here.

2.1. Qualitative pattern matching

The sample pattern to be matched against the database is selected, pre-processed as necessary and then compared automatically in turn to each of the database patterns, data point by data point. For each sample pattern, a comparison is made as follows.

(i) The intersecting 2θ range of the two data sets is calculated, and each of the pattern-matching tests is performed using only that region.

(ii) A minimum intensity is set, below which profile data are set to zero. This eliminates noise and does not reduce the discriminating power of the method. By default, this is set to 0.1Imax, where Imax is the maximum measured intensity.

(iii) The full profiles of the patterns are compared on a point-by-point basis using the non-parametric Spearman rank-order coefficient test (Spearman, 1904; Conover, 1998). A score of 1.0 represents a perfect match, 0.0 no correlation, and −1.0 an anti-correlation (which is highly unusual).

(iv) A parametric Pearson equivalent of the Spearman test is then applied, as in (iii) above, again to all intersecting data points.

(v) If any peaks have been marked in either the sample or a particular database pattern and have the same value of 2θmax within a user-specified tolerance, the correlation between the two peaks and its associated probability is calculated using the Kolmogorov–Smirnov (KS) test (Smirnov, 1939; Steck & Smirnov, 1969; Conover, 1998). The range of each peak to be tested is taken to be the intersection of the two peak ranges, calculated by tracing their shoulders until either the intensity falls below a set threshold, or the intensity of either starts to increase. The pattern with the greater number of peaks is taken as a reference. The KS test is then performed on each of these peaks and an associated probability, pi, is returned for each. This has a value of 1.0 when the peaks are identical and zero when a peak is matched against no peak. The overall KS value, pKS, is

\[ p_{\rm KS} = \frac{1}{m} \sum_{i = 1}^{m} p_i \qquad (1) \]

for m peaks in the reference sample; pKS takes the values 0 ≤ pKS ≤ 1.0.

(vi) The parametric equivalent of equation (1) is also computed in the same way except that the Pearson correlation coefficient is used instead of the non-parametric KS test.

(vii) Finally, a rank value, rw, comprising a weighted mean of each of the available statistics, is calculated for each sample. These weights are user-definable and default to equal weighting for the Spearman and Pearson tests and zero for the KS test and its parametric equivalent.

(viii) An optimal shift in 2θ between patterns is often required, arising from equipment settings, sample preparation and data collection protocols. SNAP-1D provides three possible corrections, although these by no means encompass all the possible correction geometries that can arise. These take the form

\[ \Delta(2\theta) = a_0 + a_1 \cos\theta , \qquad (2) \]

which corrects for varying sample heights in reflection mode, or

\[ \Delta(2\theta) = a_0 + a_1 \sin\theta , \qquad (3) \]

which corrects for transparency errors or, for example, transmission geometry with constant specimen–detector distance, and

\[ \Delta(2\theta) = a_0 + a_1 \sin 2\theta , \qquad (4) \]

which provides transparency and thick-specimen error corrections. The parameters a0 and a1 are constants that can be determined by maximizing the pattern–pattern correlation. It is difficult to obtain analytic expressions for the derivatives ∂rw/∂a0 and ∂rw/∂a1 for use in the optimization, so we use the downhill simplex method (Nelder & Mead, 1965), which does not require the calculation of derivatives.

(ix) It is possible to define multiple 2θ regions that are excluded from the calculations.
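Steps (ii), (iii), (iv) and (vii) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program's implementation: the function names are hypothetical, the two patterns are assumed to be already restricted to their common 2θ range on the same grid, and only the Spearman and Pearson coefficients are combined (the default weighting described above).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a flat profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: the Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


def clip_noise(intensity, fraction=0.1):
    """Set points below fraction * Imax to zero, as in step (ii)."""
    cutoff = fraction * max(intensity)
    return [v if v >= cutoff else 0.0 for v in intensity]


def rank_value(sample, reference, w_spearman=1.0, w_pearson=1.0):
    """Weighted mean of the two full-profile coefficients, as in step (vii)."""
    s, r = clip_noise(sample), clip_noise(reference)
    total = w_spearman * spearman(s, r) + w_pearson * pearson(s, r)
    return total / (w_spearman + w_pearson)


reference = [0.0, 1.0, 5.0, 20.0, 100.0, 20.0, 5.0, 1.0, 0.0, 0.0, 3.0, 30.0, 3.0, 0.0]
scaled = [2.0 * v for v in reference]  # same phase, different scale
other = [0.0, 30.0, 3.0, 0.0, 0.0, 1.0, 5.0, 20.0, 100.0, 20.0, 5.0, 1.0, 0.0, 0.0]
```

Because both coefficients are insensitive to a uniform intensity scale, `rank_value(reference, scaled)` is close to 1, whereas the pattern with displaced peaks scores much lower.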

Fig. 2 shows a typical window display for qualitative pattern matching.

Figure 2
The Qualitative Analysis window. The patterns are sorted in descending rw value and listed in the column labelled Rank. Patterns 1 and 2 are superimposed in the graphics pane. The individual correlation coefficients are in the next four columns. Only the Spearman and Pearson coefficients were calculated for this data set. The calculation of optimal 2θ offsets can be initiated here, and the maximum value specified. The 2θ ranges can also be set. The Quantitative Analysis tab opens the window shown in Fig. 3.
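The 2θ offset calculation of step (viii) can be illustrated as follows. The sketch evaluates the three correction forms of equations (2) to (4) and searches for the constant term a0 that maximizes the Pearson correlation between two patterns. Note the substitutions: a one-parameter grid search stands in for the downhill simplex method, a1 is held fixed, and linear interpolation is used to put the corrected pattern back on the grid. All function names are hypothetical.

```python
import math


def delta_2theta(two_theta, a0, a1, form):
    """Correction of equation (2) 'cos', (3) 'sin' or (4) 'sin2'; angles in degrees."""
    theta = math.radians(two_theta / 2.0)
    terms = {"cos": math.cos(theta), "sin": math.sin(theta), "sin2": math.sin(2.0 * theta)}
    return a0 + a1 * terms[form]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0.0 and syy > 0.0 else 0.0


def value_at(start, step, intensity, t):
    """Linearly interpolated intensity at angle t (zero outside the grid)."""
    pos = (t - start) / step
    i = math.floor(pos)
    if i < 0 or i + 1 >= len(intensity):
        return 0.0
    frac = pos - i
    return (1.0 - frac) * intensity[i] + frac * intensity[i + 1]


def corrected(start, step, intensity, a0, a1=0.0, form="cos"):
    """Pattern with each angle moved by delta_2theta, resampled on the same grid.

    The correction is evaluated at the grid angle, a small-shift approximation.
    """
    out = []
    for k in range(len(intensity)):
        t = start + k * step
        out.append(value_at(start, step, intensity, t - delta_2theta(t, a0, a1, form)))
    return out


def best_constant_offset(start, step, reference, sample, max_steps=10):
    """Grid search over a0 in multiples of the step size (simplex stand-in)."""
    best_a0, best_score = 0.0, -2.0
    for k in range(-max_steps, max_steps + 1):
        a0 = k * step
        score = pearson(reference, corrected(start, step, sample, a0))
        if score > best_score:
            best_a0, best_score = a0, score
    return best_a0, best_score


def gaussian_pattern(centre, start=15.0, step=0.02, count=501, sigma=0.1):
    """Synthetic single-peak pattern used only to exercise the search."""
    return [math.exp(-0.5 * ((start + k * step - centre) / sigma) ** 2) for k in range(count)]


reference = gaussian_pattern(20.0)
sample = gaussian_pattern(20.1)  # same peak, displaced by +0.1 degrees
a0, score = best_constant_offset(15.0, 0.02, reference, sample)
```

For this synthetic pair the search recovers a correction of about −0.1° in 2θ. A derivative-free simplex search over both a0 and a1 would replace the grid loop in a fuller treatment.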

2.2. Generating a correlation matrix

Instead of selecting a single pattern and matching it against every entry in the database, it is possible to match every pattern against every other. If there are n patterns, this generates a symmetric (n × n) correlation matrix, which can be exported to other statistics packages, e.g. for principal-component analysis, cluster analysis, etc. The use of the correlation matrix forms the basis of the PolySNAP computer program, which is discussed elsewhere (Barr et al., 2004).
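A minimal Python sketch of this all-against-all comparison is given below. For brevity it uses only the Pearson coefficient as the similarity measure, whereas the program combines several statistics into rw. The function names and the comma-separated export format are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0.0 and syy > 0.0 else 0.0


def correlation_matrix(patterns):
    """Symmetric n x n matrix; only the upper triangle is actually computed."""
    n = len(patterns)
    matrix = [[1.0] * n for _ in range(n)]  # a pattern matches itself perfectly
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = pearson(patterns[i], patterns[j])
    return matrix


def to_csv(matrix):
    """Plain-text export, one matrix row per line (hypothetical format)."""
    return "\n".join(",".join(f"{v:.4f}" for v in row) for row in matrix)


patterns = [
    [0.0, 1.0, 8.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 16.0, 2.0, 0.0, 0.0],  # scaled copy of the first pattern
    [0.0, 0.0, 1.0, 2.0, 9.0, 1.0],   # peak in a different position
]
matrix = correlation_matrix(patterns)
```

Exploiting the symmetry halves the number of pattern comparisons, which matters because the cost grows as n squared.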

3. Quantitative analysis

If patterns corresponding to all pure phases in the mixture are present in an associated database, quantitative analysis can be carried out. The method used is an alternative to Rietveld refinement (e.g. Hill & Howard, 1987) and other methods. The Rietveld approach requires crystal structures to be known for all individual phases in the mixture; the method used here does not require knowledge of the atomic coordinates in the unit cell or data of great accuracy; it is, however, less accurate.

The method has been fully described by Gilmore et al. (2004) and employs full-matrix least squares with every measured data point, with singular value decomposition (SVD) (Press et al., 1992) for the matrix inversion procedure. A brief summary, however, may be useful.

Assume we have a sample pattern, S, which is considered to be a mixture of up to N components. S comprises m data points, S1, S2, …, Sm. The N patterns can be considered to make up fractions p1, p2, …, pN of the sample pattern. The required equation to solve for pi takes the form

\[
\begin{pmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1N} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mN}
\end{pmatrix}
\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_N \end{pmatrix}
=
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_m \end{pmatrix}
\qquad (5)
\]

where xij is the ith measured data point for the jth pattern. Writing equation (5) in matrix form,

\[ {\bf x} \cdot {\bf p} = {\bf S} . \qquad (6) \]

The SVD method allows x to be decomposed into the product of three matrices, U, W and the transpose of V, and gives the solution

\[ {\bf p} = {\bf V} \cdot {\rm diag}(1/w_j) \cdot {\bf U}^{T} \cdot {\bf S} . \qquad (7) \]

W is a diagonal matrix with positive or zero elements, wj. We accept the top min(15, N) components of the mixture, ranked on rw. We also examine the elements of W, exclude any contributors with small values, rebuild the matrix and repeat the entire procedure. Finally, the top j patterns (where j is an integer, 1 ≤ j ≤ 15) are processed via the matrix decomposition once more. The results returned are the fractions of each pattern included in the mixture pattern. These are scaled to a percentage, and the number of possible phases is limited to j.

The composition is normally displayed as a scale percentage, i.e. the percentage of the mixture pattern accounted for by each individual phase. If the unit-cell dimensions and contents for each component are available, the program converts this scale percentage to a weight percentage (Leroux et al., 1953). The estimated error is also reported for each component. Additional feedback on the reliability of the results is given by these estimated errors, and by how good the matching results of the Spearman, Pearson and KS tests are for each phase. Occasionally, if an incorrect pattern has been suggested by the program, this may be indicated by abnormally low values of the Spearman, Pearson and KS tests, and such patterns can be marked as ignored during subsequent runs of the procedure.
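Equations (5) to (7) can be made concrete with a small self-contained Python sketch. It is an illustration only: a one-sided Jacobi SVD written in plain Python stands in for the SVD routine cited above, and the function names and the relative cut-off used to discard small singular values are assumptions.

```python
import math


def jacobi_svd(a, sweeps=50, eps=1e-12):
    """One-sided Jacobi SVD of an m x n matrix given as a list of rows (m >= n).

    Returns (U, w, V) such that a = U . diag(w) . V^T.
    """
    m, n = len(a), len(a[0])
    u = [row[:] for row in a]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(u[k][p] ** 2 for k in range(m))
                beta = sum(u[k][q] ** 2 for k in range(m))
                gamma = sum(u[k][p] * u[k][q] for k in range(m))
                if abs(gamma) <= eps * math.sqrt(alpha * beta):
                    continue  # columns p and q are already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to the columns of u and v.
                for mat, rows in ((u, m), (v, n)):
                    for k in range(rows):
                        x, y = mat[k][p], mat[k][q]
                        mat[k][p] = c * x - s * y
                        mat[k][q] = s * x + c * y
        if not rotated:
            break
    w = [math.sqrt(sum(u[k][j] ** 2 for k in range(m))) for j in range(n)]
    for j in range(n):
        if w[j] > 0.0:
            for k in range(m):
                u[k][j] /= w[j]
    return u, w, v


def svd_solve(x, s, rel_cutoff=1e-8):
    """Least-squares fractions p = V . diag(1/w_j) . U^T . S, as in equation (7).

    Terms with very small w_j are dropped (the cut-off is an assumed value).
    """
    u, w, v = jacobi_svd(x)
    m, n = len(x), len(x[0])
    w_max = max(w)
    coeff = []
    for j in range(n):
        if w[j] <= rel_cutoff * w_max:
            coeff.append(0.0)
        else:
            coeff.append(sum(u[k][j] * s[k] for k in range(m)) / w[j])
    return [sum(v[i][j] * coeff[j] for j in range(n)) for i in range(n)]


# Two synthetic pure-phase patterns (the columns of x) and a 70:30 mixture.
phase_a = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
phase_b = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
x = [[a, b] for a, b in zip(phase_a, phase_b)]
mixture = [0.7 * a + 0.3 * b for a, b in zip(phase_a, phase_b)]
fractions = svd_solve(x, mixture)
```

For this noise-free example the recovered fractions are 0.7 and 0.3 to within rounding error. With real data the fractions would then be scaled to percentages as described above.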

One drawback with the SVD procedure is that, because of its power and stability, it is almost always possible to decompose a matrix. This can mean that in some situations, for example, if the actual phases contained within a mixture are not present in the database, the method will give an incorrect solution. In these cases there are several signs available to warn the user to be cautious of an answer, such as abnormally high error values on the reported fractions, and/or large residuals when comparing the mixture and simulated patterns.

Other options for quantitative analysis are also available, as follows.

(a) Offsets. A 2θ offset can be refined to optimize rw as described in §2.

(b) Residuals. To see if the suggested results are correct, or if they include a pattern not present in the mixture, or if they miss a phase that should be present, the Residual window constructs a calculated pattern made up from the individual patterns suggested as mixture components, in the proportions calculated. A difference plot between this and the sample pattern is available. The simulated mixture pattern can be saved as an ASCII text file, as can the difference plot. Fig. 4 shows the Residual window corresponding to Fig. 3.

Figure 4
The residuals following the quantitative analysis displayed in Fig. 3. The component patterns are superimposed in the upper pane to give a resultant, while the residual intensity is plotted in the lower pane. Both of these can be exported as ASCII files and re-imported into SNAP-1D or other software.

Figure 3
The Quantitative Analysis window. The mixture comprises lactose as entry 1 and paracetamol as entry 2. The weight percentages are 85.2 and 14.8%, respectively, with estimated errors of 1.7 and 4.0%. Just as in the Qualitative Analysis window, the calculation of optimal 2θ offsets can be initiated here, the maximum shift specified, and the 2θ ranges set.

(c) Automatic missing-phase detection. The program examines the results from the analysis and, using an algorithm based on the calculated error and the residual, can suggest if the resulting composition does not account sufficiently for all of the unknown pattern intensity. This can occur, for example, when not all of the phases present in a mixture pattern were in the pure-phase database. The quantitative analysis is then re-run to include a simulated pattern of the missing fraction as a known phase.

(d) Pattern exclusion. It can be useful to narrow down the number of patterns to be considered as components of the mixture. This is done by excluding patterns that are below user-set thresholds on any of the correlation coefficients included in the quantitative calculation. Generally, the best approach is to perform an initial standard analysis with defaults, and see if any poorly matching patterns have been included. The results from this will then give a feel for suitable cut-off values, and the analysis can be re-run.

(e) Limiting the 2θ range. It is also possible to limit the analysis to subsets of the 2θ range of the unknown pattern. This can be useful if a particular feature of the pattern is causing problems, e.g. the presence of standards.

(f) Ignoring patterns. If a particular pattern included in the list of suggested results is known to be incorrect, it can be excluded from the calculation. It is possible to mark multiple patterns in this way. One can also ignore all patterns except those in a selected list if knowledge of the component phases is available.
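The residual calculation of option (b) amounts to building a simulated mixture from the suggested components and subtracting it from the sample. The sketch below illustrates this in Python. The function names are hypothetical, and the 'unexplained fraction' is an illustrative indicator, not the criterion the program uses for missing-phase detection.

```python
def simulate_mixture(components, fractions):
    """Calculated pattern: the sum of component patterns weighted by their fractions."""
    n = len(components[0])
    return [sum(f * c[i] for f, c in zip(fractions, components)) for i in range(n)]


def residual(sample, simulated):
    """Point-by-point difference between the sample and the simulated mixture."""
    return [s - c for s, c in zip(sample, simulated)]


def unexplained_fraction(sample, simulated):
    """Share of the sample intensity left in the residual (illustrative measure)."""
    total = sum(abs(s) for s in sample)
    return sum(abs(d) for d in residual(sample, simulated)) / total


phase_a = [0.0, 10.0, 40.0, 10.0, 0.0, 0.0]
phase_b = [0.0, 0.0, 10.0, 30.0, 10.0, 0.0]
simulated = simulate_mixture([phase_a, phase_b], [0.5, 0.5])
# The sample carries extra intensity at the last point that neither phase explains.
sample = [0.0, 5.0, 25.0, 20.0, 5.0, 10.0]
difference = residual(sample, simulated)
```

Here the residual is zero everywhere except at the final point, which is the kind of localized unexplained intensity that suggests a phase absent from the database.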

A typical output display window is shown in Fig. 3.

4. Program details

The program is written in a mixture of C++ and Visual Basic. It runs on a PC using the Windows XP SP1 or Windows 2000 SP2 operating system or better. A minimal system requires a P4 processor (or AMD equivalent) operating at above 1 GHz, and 128 MByte of memory. Graphics and disk-storage needs are modest. A complete online and printed manual, a tutorial and test data are provided.

Up to 1000 patterns can be imported. In general, the manipulation of 100 patterns takes a matter of seconds, and matching 1000 patterns takes less than 1 min on a PC with a 2.0 GHz processor and 256 MByte of memory. These timings increase by a factor of ten if optimal 2θ shifts are calculated.

The program is available commercially from Bruker-AXS.

Acknowledgements

The authors would like to thank the Ford Motor Company, Detroit, for funding this work, and especially Charlotte Lowe-Ma whose input and support have been invaluable.

References

Barr, G., Dong, W. & Gilmore, C. J. (2004). J. Appl. Cryst. 37, 243–252.
Barr, G., Gilmore, C. J. & Paisley, J. (2003). SNAP-1D: Systematic Non-Parametric Analysis of Patterns – a Computer Program to Perform Full-Profile Qualitative and Quantitative Analysis of Powder Diffraction Patterns, University of Glasgow. (See also http://www.chem.gla.ac.uk/staff/chris/snap.html .)
Conover, W. J. (1998). Practical Nonparametric Statistics. 3rd ed. New York: John Wiley.
Donoho, D. L. & Johnstone, I. M. (1995). J. Am. Stat. Assoc. 90, 1200–1224.
Gilmore, C. J., Barr, G. & Paisley, J. (2004). J. Appl. Cryst. 37, 231–242.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). Acta Cryst. A47, 655–685.
Hill, R. J. & Howard, C. J. (1987). J. Appl. Cryst. 20, 467–474.
Leroux, J., Lennox, D. H. & Kay, K. (1953). Anal. Chem. 25, 740–743.
Nelder, J. A. & Mead, R. (1965). Comput. J. 7, 308–313.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992). Numerical Recipes in C. Cambridge University Press.
Savitzky, A. & Golay, M. J. E. (1964). Anal. Chem. 36, 1627–1639.
Smirnov, N. V. (1939). Bull. Moscow Univ. 2, 3–16.
Spearman, C. (1904). Am. J. Psychol. 15, 72–101.
Steck, G. P. & Smirnov, G. N. (1969). Annal. Math. Stat. 40, 1449–1466.

© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.