A correctly solved crystal structure should agree with the experimental data, and its geometry should lie in a local minimum of the potential energy surface (PES). The idea of verifying crystal structures by comparison with DFT calculations was introduced 12 years ago [1,2]. Machine learning interatomic potentials (MLIPs) offer a significantly faster alternative to DFT for theoretical crystal structure calculations, with comparable accuracy. Geometry-optimized structures obtained with the UMA(S) MLIP (parameterised on the OMC25 dataset) [3] were used to verify 216 911 selected structures from the Cambridge Structural Database (CSD). The aim was to detect problems both in the interpretation of the original X-ray experiments and in the MLIP itself.
Before validation, we pre-filtered CSD 6.0 to the subset of structures for which the test is meaningful (see Table 1). We rejected structures with disorder, elements not supported by the MLIP, voids or an incorrect space group, as well as older structures measured on historical instruments. Current MLIPs do not handle long-range electrostatic forces well, so structures containing charged atoms were rejected too. Because of GPU memory limits, structures with a unit cell volume above 4000 Å³ were not investigated.
Table 1. Results of structure pre-filtering
| Rejection reason | Structures rejected by the test | Structures left after the test |
|---|---|---|
| Total structures in CSD 6.0 dataset | | 1371757 |
| Presence of elements other than the supported ones (H, C, N, O, S, P, F, Cl, Br, I) | 853820 | 517937 |
| Disordered structure | 90765 | 427172 |
| Feature not satisfied: no errors, 3D coordinates present, single-crystal data | 32241 | 394931 |
| Not published after 2004 | 95698 | 299233 |
| Ionized molecules or charged atoms present | 41768 | 257465 |
| Volume over 4000 Å³ | 17725 | 239740 |
| 3D coordinates not present (not covered by previous check) | 4 | 239736 |
| Solvent accessible volume over 40 Å³ | 16463 | 223273 |
| Incorrect space group or supercell detected | 1128 | 222145 |
| Incorrect element valence detected | 3981 | 218164 |
| Detected atoms marked by "?" indicating disorder | 1253 | 216911 |
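The bookkeeping behind Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the checkCIF-DFT implementation: the record fields (elements, disordered, year, charged, volume, void) and the toy records are hypothetical, and only a subset of the tests is shown. The point is that the tests run in sequence, so each structure is counted under the first test it fails.

```python
# Elements supported by the MLIP, as listed in Table 1.
SUPPORTED = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"}

# (label, predicate) pairs; a predicate returns True when the structure is rejected.
# Field names are hypothetical and chosen for this illustration only.
FILTERS = [
    ("unsupported element", lambda s: not set(s["elements"]) <= SUPPORTED),
    ("disordered structure", lambda s: s["disordered"]),
    ("not published after 2004", lambda s: s["year"] <= 2004),
    ("charged atoms present", lambda s: s["charged"]),
    ("volume over 4000 A^3", lambda s: s["volume"] > 4000.0),
    ("solvent accessible volume over 40 A^3", lambda s: s["void"] > 40.0),
]


def prefilter(structures):
    """Apply the filters in order; return (label, rejected, left) rows and the survivors."""
    rows = []
    remaining = list(structures)
    for label, reject in FILTERS:
        kept = [s for s in remaining if not reject(s)]
        rows.append((label, len(remaining) - len(kept), len(kept)))
        remaining = kept
    return rows, remaining


def record(ident, elements, year, volume, void=0.0, disordered=False, charged=False):
    """Build one toy record (hypothetical layout)."""
    return {"id": ident, "elements": elements, "year": year, "volume": volume,
            "void": void, "disordered": disordered, "charged": charged}


# Toy data: B fails two tests but is counted only under the first one.
toy = [
    record("A", ["C", "H", "O"], 2015, 1200.0),
    record("B", ["C", "H", "Fe"], 1999, 900.0, disordered=True),
    record("C", ["C", "H", "N"], 2001, 800.0),
    record("D", ["C", "H", "Cl"], 2020, 5200.0),
    record("E", ["C", "H", "S"], 2018, 1500.0, void=55.0),
]

rows, remaining = prefilter(toy)
for label, rejected, left in rows:
    print(f"{label:40s} {rejected:3d} {left:3d}")
```

Because the counts depend on the order of the tests, reordering FILTERS changes the individual rows but not the final set of surviving structures.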
The final MLIP calculations ran for 4 months in parallel on 2 computers equipped with NVIDIA RTX PRO 4000 Blackwell graphics cards with 24 GB of GDDR7 memory. Execution took place almost exclusively on the GPU, on 8 960 CUDA cores. Processing and descriptor calculation were driven by our proprietary software checkCIF-DFT.
The original verification method [1,2] was based primarily on comparing the experimental and DFT results using a single descriptor, the root-mean-square Cartesian displacement (RMSCD). RMSCD is in effect an RMSD modified so that atomic positions in different unit cells can be compared. It was already noted in [1] that this descriptor cannot clearly separate, for example, entirely artificial fraudulent structures from correctly solved ones. We tested several other descriptors for separating problematic structures from correct ones. Seriously problematic structures can be detected efficiently by bond breaking (see Table 2 and Figure 1).
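To make the RMSCD descriptor concrete, the following Python sketch shows one possible way to compare atomic positions given in two different unit cells. It rests on assumptions that are not taken from the source: atoms are supplied in the same order in both structures, the fractional shift of each atom is wrapped to the nearest periodic image, and the shift is converted to Cartesian space with both cells and the two results are averaged. The exact definition used in [1,2] may differ in detail.

```python
import math


def to_cartesian(cell, frac):
    """Map a fractional vector to Cartesian; rows of `cell` are a, b, c in angstrom."""
    return [sum(frac[i] * cell[i][k] for i in range(3)) for k in range(3)]


def rmscd(cell_exp, frac_exp, cell_opt, frac_opt):
    """Root-mean-square Cartesian displacement between two matched atom lists."""
    if len(frac_exp) != len(frac_opt) or not frac_exp:
        raise ValueError("atom lists must be non-empty and of equal length")
    total = 0.0
    for f_exp, f_opt in zip(frac_exp, frac_opt):
        # Fractional shift, wrapped so that crossing a cell face is not penalised.
        shift = [b - a for a, b in zip(f_exp, f_opt)]
        shift = [x - round(x) for x in shift]
        # Assumption: convert the shift with both cells and average the results.
        d_exp = to_cartesian(cell_exp, shift)
        d_opt = to_cartesian(cell_opt, shift)
        total += sum(((p + q) / 2.0) ** 2 for p, q in zip(d_exp, d_opt))
    return math.sqrt(total / len(frac_exp))


def cubic(a):
    """Cell matrix of a cubic cell with edge `a` (helper for the toy example)."""
    return [[a, 0.0, 0.0], [0.0, a, 0.0], [0.0, 0.0, a]]


# Toy input: two atoms; the second one shifts by 0.01 in fractional x
# while the cubic cell expands from 10.0 to 10.2 angstrom.
value = rmscd(cubic(10.0), [[0.0, 0.0, 0.0], [0.25, 0.25, 0.25]],
              cubic(10.2), [[0.0, 0.0, 0.0], [0.26, 0.25, 0.25]])
print(f"RMSCD = {value:.4f} angstrom")
```

To reproduce a hydrogen-free RMSCD, the caller would pass only the non-hydrogen atoms to rmscd.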
Table 2. Issues detected during the test of 216911 CSD structures
| Problem detection reason | Structures rejected by the test | Structures left after the test |
|---|---|---|
| Initial dataset | | 216911 |
| Non-existing (unknown) element present | 1 | 216910 |
| CIF space group does not correspond to original CIF | 1 | 216909 |
| Bond breaking (not including H atom) | 466 | 216443 |
| Bond breaking (H atom involved) | 1400 | 215043 |
| RMSCD over 1.0 Å | 167 | 214876 |
| RMSCD over 0.25 Å | 3167 | 211709 |

Figure 1. R factor versus RMSCD (hydrogen atoms excluded). Green: no bond breaking; blue: only bonds involving hydrogen break; red: bonds not involving hydrogen break.
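The test order of Table 2 and the colour classes of Figure 1 can be illustrated with a short Python sketch. The bonding criterion is an assumption, since the source does not state how bond breaking is detected: here a pair of atoms counts as bonded when its distance does not exceed the sum of approximate covalent radii plus a hypothetical tolerance of 0.4 Å, and a bond is broken when it is present in the experimental structure but absent after optimization. Only the RMSCD thresholds of 1.0 Å and 0.25 Å come from Table 2.

```python
# Approximate single-bond covalent radii in angstrom (illustrative values).
RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "F": 0.57,
         "P": 1.07, "S": 1.05, "Cl": 1.02, "Br": 1.20, "I": 1.39}
TOLERANCE = 0.4  # hypothetical bonding tolerance in angstrom


def is_bonded(el_a, el_b, distance):
    """Assumed criterion: distance within the radii sum plus the tolerance."""
    return distance <= RADII[el_a] + RADII[el_b] + TOLERANCE


def bond_breaking(bonds):
    """Classify bonds given as (el_a, el_b, d_experiment, d_optimized).

    Returns "heavy" if a bond without hydrogen breaks (red in Figure 1),
    "hydrogen" if only bonds involving hydrogen break (blue),
    and "none" otherwise (green).
    """
    result = "none"
    for el_a, el_b, d_exp, d_opt in bonds:
        broken = is_bonded(el_a, el_b, d_exp) and not is_bonded(el_a, el_b, d_opt)
        if not broken:
            continue
        if "H" in (el_a, el_b):
            result = "hydrogen"
        else:
            return "heavy"
    return result


def triage(bonds, rmscd):
    """Assign a structure to the first Table 2 test it fails."""
    breaking = bond_breaking(bonds)
    if breaking == "heavy":
        return "bond breaking (no H atom)"
    if breaking == "hydrogen":
        return "bond breaking (H atom involved)"
    if rmscd > 1.0:
        return "RMSCD over 1.0 A"
    if rmscd > 0.25:
        return "RMSCD over 0.25 A"
    return "passed"


# Toy inputs (distances in angstrom, invented for the illustration).
proton_transfer = [("C", "O", 1.31, 1.27), ("O", "H", 0.98, 1.55)]
dissociated = [("C", "Br", 1.95, 3.10), ("O", "H", 0.98, 1.55)]
intact = [("C", "C", 1.52, 1.53)]

print(triage(proton_transfer, 0.12))
print(triage(dissociated, 0.80))
print(triage(intact, 0.31))
```

Checking bond breaking before RMSCD mirrors the order of the rows in Table 2, so a structure with a broken bond is reported as such even when its RMSCD is small.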
Results
We manually inspected the 466 structures with bond breaking (hydrogen not involved) and the 167 structures with RMSCD above 1.0 Å. The inspection was based on comparison with the original X-ray data, higher-level full DFT calculations and the original publications. We identified several types of errors in the data deposited in the CSD: element type mismatches, missing hydrogen atoms, incorrectly placed hydrogen atoms and incorrect experimental information in the CIF (typically pressure). We also identified problems with the PES generated by the MLIP: incorrect proton transfer and poor results for elements that are rare in the training dataset. Most of the PES-related issues can be traced to the PBE-D3 functional used to generate the OMC25 dataset; PBE-D3 has known errors related to ion transfer. A better training set, based for example on the r2SCAN-MBD meta-GGA functional, would be more suitable for future work. In rare cases, correctly solved structures with unusual features were flagged as problematic.
While examining our results we identified problems with multiple structures deposited in the CSD. Based on our findings, the use of the CSD as a reliable source of molecular geometry information is questionable. The problematic structures should be marked in the database with a flag other than "no error". On the other hand, we also identified problems with the MLIP used: incorrect proton transfer and incorrect geometry for bond types that occur rarely in the OMC25 dataset. The computational time required for a single structure is typically less than one minute, so the described methodology is suitable for routine structure checks.