The predictive power of data-processing statistics

IUCrJ ◽  
2020 ◽  
Vol 7 (2) ◽  
pp. 342-354
Author(s):  
Melanie Vollmar ◽  
James M. Parkhurst ◽  
Dominic Jaques ◽  
Arnaud Baslé ◽  
Garib N. Murshudov ◽  
...  

This study describes a method to estimate the likelihood of success in determining a macromolecular structure by X-ray crystallography and experimental single-wavelength anomalous dispersion (SAD) or multiple-wavelength anomalous dispersion (MAD) phasing based on initial data-processing statistics and sample crystal properties. Such a predictive tool can rapidly assess the usefulness of data and guide the collection of an optimal data set. The increase in data rates from modern macromolecular crystallography beamlines, together with a demand from users for real-time feedback, has led to pressure on computational resources and a need for smarter data handling. Statistical and machine-learning methods have been applied to construct a classifier that displays 95% accuracy for training and testing data sets compiled from 440 solved structures. Applying this classifier to new data achieved 79% accuracy. These scores already provide clear guidance as to the effective use of computing resources and offer a starting point for a personalized data-collection assistant.
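
As a rough illustration of the kind of classifier the abstract describes (the actual features, algorithm and training data are not specified here; everything below is an assumption made up for this example), a minimal sketch in Python might look like this:

```python
# Illustrative sketch only: the paper builds a classifier from data-processing
# statistics, but the features, algorithm and synthetic data below are
# assumptions, not the authors' actual pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 440                                   # number of solved structures, as in the abstract
# Hypothetical per-data-set statistics: resolution, multiplicity, <I/sigma>, anomalous signal.
X = rng.random((n, 4))
y = (X[:, 3] + 0.1 * rng.standard_normal(n) > 0.5).astype(int)  # 1 = phasing succeeded (synthetic label)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```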

2014 ◽  
Vol 70 (a1) ◽  
pp. C792-C792
Author(s):  
Kathryn Janzen ◽  
Michel Fodje ◽  
Shaun Labiuk ◽  
James Gorin ◽  
Pawel Grochulski

The Canadian Macromolecular Crystallography Facility (CMCF) is a suite of two beamlines, 08ID-1 and 08B1-1. Beamline 08B1-1 is a bending-magnet beamline for high-throughput macromolecular crystallography enabling Multiple-Wavelength Anomalous Dispersion (MAD) and Single-Wavelength Anomalous Dispersion (SAD) experiments with a high level of automation. We have developed an integrated software system with modules for beamline control, experiment management and automated data processing for both on-site and remote users. The experiment management module, also known as MxLIVE (Macromolecular Crystallography Laboratory Information Virtual Environment), is responsible for managing the storage of information about samples, sample shipments, experiment requests, experiment results and data sets. It provides a web-based interface for users to submit sample information and experiment requests, track shipments en route to the CLS and review experiment results and data sets as they are completed on site, and for beamline staff to manage Mail-In data acquisition sessions, reducing the need for user travel to the synchrotron. The beamline control module includes a user-friendly interface for data collection, MxDC (Macromolecular Crystallography Data Collector). MxDC is fully integrated with beamline hardware as well as software applications such as MxLIVE and AutoProcess, an innovative data-processing pipeline. This makes MxDC a hub for all experiment-focused activities at CMCF beamlines, including sample auto-mounting, centering and screening crystals, diffraction experiments, and automated data reduction.


1999 ◽  
Vol 55 (3) ◽  
pp. 327-332 ◽  
Author(s):  
M. Helliwell ◽  
J. R. Helliwell ◽  
V. Kaucic ◽  
N. Zabukovec Logar ◽  
L. Barba ◽  
...  

Data were collected from a crystal of CoZnPO-CZP {sodium cobalt–zinc phosphate hydrate, Na6[Co0.2Zn0.8PO4]6·6H2O} using synchrotron radiation at ELETTRA at the inflection point and 'white line' for both the cobalt and zinc K edges, and at 1.45 Å, a wavelength remote from the K edges of both metals. The data were processed using the programs DENZO and SCALEPACK. The CCP4 program suite was used for the scaling of data sets and the subsequent calculation of dispersive difference Fourier maps. Optimal scaling was achieved by using a subset of reflections with little or no contribution from the metal atoms (i.e. which were essentially wavelength independent in their intensities) and using weights based on the σ's to obtain an overall scale factor in each case. Phases were calculated with SHELXL97 from the refined structure, using a much higher resolution and complete Cu Kα data set. An occupancy of 100% by zinc at the two metal-atom sites was assumed. The dispersive difference Fourier map calculated for zinc gave two peaks of similar height above the background at the expected metal-atom sites; the peak at the Zn1 site was a little higher than at the Zn2 site. The dispersive difference Fourier map calculated for cobalt gave just one peak above the background, at the Zn1 site, and only a small peak at the Zn2 site, thus indicating that incorporation of cobalt takes place mainly at one site. Refinement of the zinc occupancies using MLPHARE reinforces this conclusion. The chemical environment of each site is discussed.
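
In general form, a dispersive difference Fourier synthesis of the kind described contrasts amplitudes measured at the remote wavelength with those at the inflection point, using phases from the refined model; a standard expression for such a map (given here for orientation, not quoted from the paper) is

\Delta\rho(\mathbf{r}) = \frac{1}{V} \sum_{\mathbf{h}} \left( |F_{\lambda_{\mathrm{remote}}}(\mathbf{h})| - |F_{\lambda_{\mathrm{edge}}}(\mathbf{h})| \right) \exp[i\varphi_{\mathrm{calc}}(\mathbf{h})] \exp(-2\pi i\, \mathbf{h} \cdot \mathbf{r}),

so that positive peaks appear at sites occupied by the element whose f' changes most strongly between the two wavelengths.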


2019 ◽  
Vol 52 (4) ◽  
pp. 854-863 ◽  
Author(s):  
Brendan Sullivan ◽  
Rick Archibald ◽  
Jahaun Azadmanesh ◽  
Venu Gopal Vandavasi ◽  
Patricia S. Langan ◽  
...  

Neutron crystallography offers enormous potential to complement structures from X-ray crystallography by clarifying the positions of low-Z elements, namely hydrogen. Macromolecular neutron crystallography, however, remains limited, in part owing to the challenge of integrating peak shapes from pulsed-source experiments. To advance existing software, this article demonstrates the use of machine learning to refine peak locations, predict peak shapes and yield more accurate integrated intensities when applied to whole data sets from a protein crystal. The artificial neural network, based on the U-Net architecture commonly used for image segmentation, is trained using about 100 000 simulated training peaks derived from strong peaks. After 100 training epochs (a round of training over the whole data set broken into smaller batches), training converges and achieves a Dice coefficient of around 65%, in contrast to just 15% for negative control data sets. Integrating whole peak sets using the neural network yields improved intensity statistics compared with other integration methods, including k-nearest neighbours. These results demonstrate, for the first time, that neural networks can learn peak shapes and be used to integrate Bragg peaks. It is expected that integration using neural networks can be further developed to increase the quality of neutron, electron and X-ray crystallography data.
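
The Dice coefficient quoted above measures the overlap between a predicted peak mask and a reference mask; a minimal sketch of that calculation (not the article's own code, and using random masks purely for illustration) is:

```python
# Minimal sketch (not the paper's code): Dice coefficient between a predicted
# peak mask and a reference mask, as used to score segmentation quality.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for boolean voxel masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    overlap = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 2.0 * overlap / total if total else 1.0

# Example with two random 3D masks standing in for predicted and true peaks.
rng = np.random.default_rng(0)
print(dice(rng.random((16, 16, 16)) > 0.5, rng.random((16, 16, 16)) > 0.5))
```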


2018 ◽  
Vol 29 (07) ◽  
pp. 1181-1201
Author(s):  
Shuo Yan ◽  
Yunyong Zhang ◽  
Binfeng Yan ◽  
Lin Yan ◽  
Jinfeng Kou

To study data association, a structure called a hierarchy tree is constructed. It is based on an approach to hierarchical data processing and is constituted by partitions of a data set at different levels. This leads to the definition of data association, which links two hierarchy trees together. The research on data association focuses on how to check whether data are associated with other data. The investigation covers the following issues: intuitive and formal methods for constructing hierarchy trees, the technique of making granules hierarchical, a sufficient and necessary condition for measuring data association, an analysis showing that closer data association rests on closer data identity, and a discussion connecting numerical information with association closeness. Crucially, hierarchical data processing and numerical information are important characteristics of the research. As an applied example, two hierarchy trees are set up, demonstrating the hierarchical granulation process of two actual data sets. Data associations between the data sets are characterized by the approach developed in this paper, which provides the basis of algorithm design for the actual problem. In particular, since the research is relevant to granules and alterations of granularity, it may offer an avenue of research on granular computing.
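
As a loose illustration of the hierarchy-tree idea (the paper's formal definitions are not reproduced here; the data and checks below are assumptions for the example), a data set can be stored as a sequence of partitions in which each level refines the one above:

```python
# Illustrative sketch (not the paper's formalism): a hierarchy tree stored as
# a list of partitions of the same data set, each level refining the one above.
from itertools import chain

hierarchy = [
    [{1, 2, 3, 4, 5, 6}],                 # level 0: one coarse granule
    [{1, 2, 3}, {4, 5, 6}],               # level 1
    [{1, 2}, {3}, {4}, {5, 6}],           # level 2: finest granules
]

def is_refinement(fine, coarse):
    """Every granule of the finer partition lies inside some coarser granule."""
    return all(any(block <= big for big in coarse) for block in fine)

assert all(is_refinement(hierarchy[i + 1], hierarchy[i]) for i in range(len(hierarchy) - 1))
assert set(chain.from_iterable(hierarchy[-1])) == {1, 2, 3, 4, 5, 6}
print("valid hierarchy tree over", sorted(chain.from_iterable(hierarchy[0])))
```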


1985 ◽  
Vol 12 (3) ◽  
pp. 464-471 ◽  
Author(s):  
B. G. Krishnappan

The MOBED and HEC-6 models of river flow were compared in this study. The comparison consisted of two steps. In step one, the major differences between the models were identified by examining the theoretical basis of each model. In step two, the predictive capabilities of the models were compared by applying them to identical data sets. The data set comes from the South Saskatchewan River reach below Gardiner Dam and relates to the degradation process that has taken place since the creation of Lake Diefenbaker. Comparison of model predictions with measurements reveals that MOBED has predictive capability superior to that of HEC-6 and that use of HEC-6 as a predictive tool requires extensive model calibration by adjustment of Manning's 'n' and the movable bed width. Key words: computers, models, sediment transport, river hydraulics, erosion.


2019 ◽  
Vol 8 (3) ◽  
pp. 177-186
Author(s):  
Rokas Jurevičius ◽  
Virginijus Marcinkevičius

Purpose: The purpose of this paper is to present a new data set of aerial imagery from a robotics simulator (AIR). The AIR data set aims to provide a starting point for localization system development and to become a typical benchmark for accuracy comparison of map-based localization algorithms, visual odometry and SLAM for high-altitude flights.
Design/methodology/approach: The presented data set contains over 100,000 aerial images captured from the Gazebo robotics simulator using orthophoto maps as a ground plane. Flights with three different trajectories are performed on maps from urban and forest environments at different altitudes, totaling over 33 kilometers of flight distance.
Findings: The review of previous research studies shows that the presented data set is the largest currently available public data set with downward-facing camera imagery.
Originality/value: This paper presents the problem of missing publicly available data sets for high-altitude (100‒3,000 meters) UAV flights; the current state-of-the-art research studies performed to develop map-based localization systems for UAVs depend on real-life test flights and custom-simulated data sets for accuracy evaluation of the algorithms. The presented new data set solves this problem and aims to help researchers to improve and benchmark new algorithms for high-altitude flights.


2020 ◽  
Vol 76 (11) ◽  
pp. 1134-1144 ◽  
Author(s):  
Helen M. Ginn

Drug and fragment screening at X-ray crystallography beamlines has been a huge success. However, it is inevitable that more high-profile biological drug targets will be identified for which high-quality, highly homogeneous crystal systems cannot be found. With increasing heterogeneity in crystal systems, the application of current multi-data-set methods becomes ever less sensitive to bound ligands. In order to ease the bottleneck of finding a well behaved crystal system, pre-clustering of data sets can be carried out using cluster4x after data collection to separate data sets into smaller partitions in order to restore the sensitivity of multi-data-set methods. Here, the software cluster4x is introduced for this purpose and validated against published data sets using PanDDA, showing an improved total signal from existing ligands and identifying new hits in both highly heterogeneous and less heterogeneous multi-data sets. cluster4x provides the researcher with an interactive graphical user interface with which to explore multi-data-set experiments.
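
As a sketch of the general pre-clustering idea (this is not cluster4x's actual algorithm, only an illustration with synthetic input), data sets can be grouped by hierarchical clustering on a pairwise correlation matrix before multi-data-set analysis:

```python
# Sketch of pre-clustering data sets before multi-data-set analysis; not
# cluster4x's algorithm, just an illustration using hierarchical clustering
# on pairwise correlations between merged intensities.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# intensities: rows = data sets, columns = common reflections (synthetic stand-in).
rng = np.random.default_rng(1)
intensities = rng.random((20, 500))

corr = np.corrcoef(intensities)                  # pairwise CC between data sets
dist = squareform(1.0 - corr, checks=False)      # convert to condensed distance matrix
labels = fcluster(linkage(dist, method="average"), t=4, criterion="maxclust")
print("partition of data sets into clusters:", labels)
```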


2013 ◽  
Vol 69 (7) ◽  
pp. 1215-1222 ◽  
Author(s):  
K. Diederichs ◽  
P. A. Karplus

In macromolecular X-ray crystallography, typical data sets have substantial multiplicity. This can be used to calculate the consistency of repeated measurements and thereby assess data quality. Recently, the properties of a correlation coefficient, CC1/2, that can be used for this purpose were characterized and it was shown that CC1/2 has superior properties compared with 'merging' R values. A derived quantity, CC*, links data and model quality. Using experimental data sets, the behaviour of CC1/2 and the more conventional indicators was compared in two situations of practical importance: merging data sets from different crystals and selectively rejecting weak observations or (merged) unique reflections from a data set. In these situations controlled 'paired-refinement' tests show that even though discarding the weaker data leads to improvements in the merging R values, the refined models based on these data are of lower quality. These results show the folly of such data-filtering practices aimed at improving the merging R values. Interestingly, in all of these tests CC1/2 is the one data-quality indicator for which the behaviour accurately reflects which of the alternative data-handling strategies results in the best-quality refined model. Its properties in the presence of systematic error are documented and discussed.
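
For reference, CC1/2 is the Pearson correlation coefficient between the average intensities of two randomly chosen half-data-sets, and the derived quantity CC*, which estimates the correlation of the merged data with the (unmeasurable) true signal, follows from it as

\mathrm{CC}^{*} = \sqrt{\frac{2\,\mathrm{CC}_{1/2}}{1 + \mathrm{CC}_{1/2}}}.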


2014 ◽  
Vol 70 (a1) ◽  
pp. C1195-C1195 ◽  
Author(s):  
Sven Hovmöller ◽  
Devinder SINGH ◽  
Wei Wan ◽  
Yifeng Yun ◽  
Benjamin Grushko ◽  
...  

We have developed single-crystal electron diffraction for powder-sized samples, i.e. <0.1 μm in all dimensions. Complete 3D electron diffraction data are collected by Rotation Electron Diffraction (RED) in about one hour. Data processing takes another hour. The crystal structures are solved by standard crystallographic techniques. X-ray crystallography requires crystals several micrometers big. For nanometer-sized crystals, electron diffraction and electron microscopy (EM) are the only possibilities. Modern transmission EMs are equipped with the two things necessary for turning them into automatic single-crystal diffractometers: CCD cameras, and computer control of all lenses and the sample stage. Two methods have been developed for collecting complete (except for a missing cone) 3D electron diffraction data: Rotation Electron Diffraction (RED) [1] and Automated Electron Diffraction Tomography (ADT) by Kolb et al. [2]. Because of the very strong interaction between electrons and matter, an electron diffraction pattern with visible spots is obtained in one second from a submicron-sized crystal in the EM. By collecting 1000-2000 electron diffraction patterns, a complete 3D data set is obtained. The geometry in RED is analogous to the rotation method in X-ray crystallography; the sample is rotated continuously about one rotation axis. The data processing results in a list of typically over 1000 reflections with h, k, l and intensity. The unit cell is typically obtained correctly to within 1%. Space-group determination is done as in X-ray crystallography from systematically absent reflections, but special care must be taken because occasionally multiple electron diffraction can give rise to very strong forbidden reflections. At ±60° tilt with 0.1° steps, a complete data collection will be some 1200 frames. With one-second exposures this takes about one hour. There is no need to align the crystal orientation. The reciprocal lattice can be rotated and displayed along any direction of view. Sections such as hk0, hk1, hk2, h0l and so on can easily be cut out and displayed. We have solved over 50 crystal structures by RED in one year. These include the most complex zeolites ever solved and quasicrystal approximants, such as the pseudo-decagonal approximants PD2 [3] and PD1 in AlCoNi. Observed and calculated sections of reciprocal space (cut at 1.0 Å) are shown in Fig. 1. Notice the 10-fold symmetry of strong reflections.
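
A quick check of the data-collection arithmetic quoted above (the per-frame overhead mentioned in the comments is an assumption, not a figure from the abstract):

```python
# Frame count for a RED data collection over a +/-60 degree tilt range in
# 0.1 degree steps, as stated in the abstract.
tilt_range_deg = 120.0      # total tilt range (+/-60 degrees)
step_deg = 0.1              # rotation step per frame
exposure_s = 1.0            # one-second exposures

frames = tilt_range_deg / step_deg
print(frames)                              # 1200 frames, as quoted
print(frames * exposure_s / 60, "min")     # bare exposure time only; readout and
# stage movement between frames (assumed here) account for the rest of the
# quoted ~1 hour collection time.
```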


2018 ◽  
Vol 44 (1) ◽  
pp. 52-73 ◽  
Author(s):  
Jean-Christophe Plantin

This article investigates the work of processors who curate and "clean" the data sets that researchers submit to data archives for archiving and further dissemination. Based on ethnographic fieldwork conducted at the data-processing unit of a major US social science data archive, I investigate how these data processors work, under what status, and how they contribute to data sharing. This article presents two main results. First, it contributes to the study of invisible technicians in science by showing that the same procedures can make technical work invisible outside and visible inside the archive, to allow peer review and quality control. Second, it contributes to the social study of scientific data sharing by showing that the organization of data processing directly stems from the conception that the archive promotes of a valid data set: a data set that must look "pristine" at the end of its processing. After critically interrogating this notion of pristineness, I show how it perpetuates a misleading conception of data as "raw" instead of acknowledging the important contribution of data processors to data sharing and social science.

