Avoidable errors in deposited macromolecular structures: an impediment to efficient data mining

Whereas the vast majority of the more than 85 000 crystal structures of macromolecules currently deposited in the Protein Data Bank are of high quality, some suffer from a variety of imperfections. Although this fact has been pointed out in the past, it is still worth periodic updates so that the metadata obtained by global analysis of the available crystal structures, as well as the utilization of the individual structures for tasks such as drug design, should be based on only the most reliable data. Here, selected abnormal deposited structures have been analysed based on the Bayesian reasoning that the correctness of a model must be judged against both the primary evidence as well as prior knowledge. These structures, as well as information gained from the corresponding publications (if available), have emphasized some of the most prevalent types of common problems. The errors are often perfect illustrations of the nature of human cognition, which is frequently influenced by preconceptions that may lead to fanciful results in the absence of proper validation. Common errors can be traced to negligence and a lack of rigorous verification of the models against electron density, creation of non-parsimonious models, generation of improbable numbers, application of incorrect symmetry, illogical presentation of the results, or violation of the rules of chemistry and physics. Paying more attention to such problems, not only in the final validation stages but during the structure-determination process as well, is necessary not only in order to maintain the highest possible quality of the structural repositories and databases but most of all to provide a solid basis for subsequent studies, including large-scale data-mining projects. For many scientists PDB deposition is a rather infrequent event, so the need for proper training and supervision is emphasized, as well as the need for constant alertness of reason and critical judgment as absolutely necessary safeguarding measures against such problems. Ways of identifying more problematic structures are suggested so that their users may be properly alerted to their possible shortcomings.

Download Full-text

Bond-valence analyses of the crystal structures of FeMo/V cofactors in FeMo/V proteins

Acta Crystallographica Section D Structural Biology ◽

10.1107/s2059798320003952 ◽

2020 ◽

Vol 76 (5) ◽

pp. 428-437 ◽

Cited By ~ 1

Author(s):

Wan-Ting Jin ◽

Min Yang ◽

Shuang-Shuang Zhu ◽

Zhao-Hui Zhou

Keyword(s):

Error Analysis ◽

Crystal Structures ◽

Protein Data Bank ◽

Data Bank ◽

Bond Valence ◽

Crystallographic Data ◽

Data Sets ◽

Electron Configuration ◽

The Individual ◽

Bond Valence Method

The bond-valence method has been used for valence calculations of FeMo/V cofactors in FeMo/V proteins using 51 crystallographic data sets of FeMo/V proteins from the Protein Data Bank. The calculations show molybdenum(III) to be present in MoFe7S9C(Cys)(HHis)[R-(H)homocit] (where H4homocit is homocitric acid, HCys is cysteine and HHis is histidine) in FeMo cofactors, while vanadium(III) with a more reduced iron complement is obtained for FeV cofactors. Using an error analysis of the calculated valences, it was found that in FeMo cofactors Fe1, Fe6 and Fe7 can be unambiguously assigned as iron(III), while Fe2, Fe3, Fe4 and Fe5 show different degrees of mixed valences for the individual Fe atoms. For the FeV cofactors in PDB entry 5n6y, Fe4, Fe5 and Fe6 correspond to iron(II), iron(II) and iron(III), respectively, while Fe1, Fe2, Fe3 and Fe7 exhibit strongly mixed valences. Special situations such as CO-bound and selenium-substituted FeMo cofactors and O(N)H-bridged FeV cofactors are also discussed and suggest rearrangement of the electron configuration on the substitution of the bridging S atoms.

Download Full-text

Continuous mutual improvement of macromolecular structure models in the PDB and of X-ray crystallographic software: the dual role of deposited experimental data

Acta Crystallographica Section D Biological Crystallography ◽

10.1107/s1399004714017040 ◽

2014 ◽

Vol 70 (10) ◽

pp. 2533-2543 ◽

Cited By ~ 21

Author(s):

Thomas C. Terwilliger ◽

Gerard Bricogne

Keyword(s):

Crystal Structures ◽

Paradigm Shift ◽

Large Scale ◽

Data Bank ◽

Small Scale ◽

Macromolecular Structure ◽

X Ray ◽

Rapid Pace ◽

Very High

Accurate crystal structures of macromolecules are of high importance in the biological and biomedical fields. Models of crystal structures in the Protein Data Bank (PDB) are in general of very high quality as deposited. However, methods for obtaining the best model of a macromolecular structure from a given set of experimental X-ray data continue to progress at a rapid pace, making it possible to improve most PDB entries after their deposition by re-analyzing the original deposited data with more recent software. This possibility represents a very significant departure from the situation that prevailed when the PDB was created, when it was envisioned as a cumulative repository of static contents. A radical paradigm shift for the PDB is therefore proposed, away from the static archive model towards a much more dynamic body of continuously improving results in symbiosis with continuously improving methods and software. These simultaneous improvements in methods and final results are made possible by the current deposition of processed crystallographic data (structure-factor amplitudes) and will be supported further by the deposition of raw data (diffraction images). It is argued that it is both desirable and feasible to carry out small-scale and large-scale efforts to make this paradigm shift a reality. Small-scale efforts would focus on optimizing structures that are of interest to specific investigators. Large-scale efforts would undertake a systematic re-optimization of all of the structures in the PDB, or alternatively the redetermination of groups of structures that are either related to or focused on specific questions. All of the resulting structures should be made generally available, along with the precursor entries, with various views of the structures being made available depending on the types of questions that users are interested in answering.

Download Full-text

Visual management of large scale data mining projects

Biocomputing 2000 ◽

10.1142/9789814447331_0026 ◽

1999 ◽

Author(s):

I. Shah ◽

L. Hunter

Keyword(s):

Data Mining ◽

Large Scale ◽

Large Scale Data ◽

Mining Projects ◽

Scale Data

Download Full-text

Prediction of models for ordered solvent in macromolecular structures by a classifier based upon resolution-independent projections of local feature data

Acta Crystallographica Section D Structural Biology ◽

10.1107/s2059798319008933 ◽

2019 ◽

Vol 75 (8) ◽

pp. 696-717

Author(s):

Laurel Jones ◽

Michael Tynes ◽

Paul Smith

Keyword(s):

Crystal Structures ◽

Electron Density ◽

Large Scale ◽

Data Bank ◽

Chemical Environment ◽

Water Molecules ◽

Model Errors ◽

Difference Density ◽

Solvent Models ◽

Molecule Scattering

Current software tools for the automated building of models for macromolecular X-ray crystal structures are capable of assembling high-quality models for ordered macromolecule and small-molecule scattering components with minimal or no user supervision. Many of these tools also incorporate robust functionality for modelling the ordered water molecules that are found in nearly all macromolecular crystal structures. However, no current tools focus on differentiating these ubiquitous water molecules from other frequently occurring multi-atom solvent species, such as sulfate, or the automated building of models for such species. PeakProbe has been developed specifically to address the need for such a tool. PeakProbe predicts likely solvent models for a given point (termed a `peak') in a structure based on analysis (`probing') of its local electron density and chemical environment. PeakProbe maps a total of 19 resolution-dependent features associated with electron density and two associated with the local chemical environment to a two-dimensional score space that is independent of resolution. Peaks are classified based on the relative frequencies with which four different classes of solvent (including water) are observed within a given region of this score space as determined by large-scale sampling of solvent models in the Protein Data Bank. Designed to classify peaks generated from difference density maxima, PeakProbe also incorporates functionality for identifying peaks associated with model errors or clusters of peaks likely to correspond to multi-atom solvent, and for the validation of existing solvent models using solvent-omit electron-density maps. When tasked with classifying peaks into one of four distinct solvent classes, PeakProbe achieves greater than 99% accuracy for both peaks derived directly from the atomic coordinates of existing solvent models and those based on difference density maxima. While the program is still under development, a fully functional version is publicly available. PeakProbe makes extensive use of cctbx libraries, and requires a PHENIX licence and an up-to-date phenix.python environment for execution.

Download Full-text

An efficient data preprocessing approach for large scale medical data mining

Technology and Health Care ◽

10.3233/thc-140887 ◽

2015 ◽

Vol 23 (2) ◽

pp. 153-160 ◽

Cited By ~ 1

Author(s):

Ya-Han Hu ◽

Wei-Chao Lin ◽

Chih-Fong Tsai ◽

Shih-Wen Ke ◽

Chih-Wen Chen

Keyword(s):

Data Mining ◽

Large Scale ◽

Data Preprocessing ◽

Medical Data ◽

Medical Data Mining ◽

Efficient Data

Download Full-text

Models, Practices and Methods of Reading: Evolution in Time and Space

Bibliotekovedenie [Library and Information Science (Russia)] ◽

10.25281/0869-608x-2009-0-1-59-64 ◽

2009 ◽

pp. 59-64

Author(s):

Yulia P. Melentyeva

Keyword(s):

Large Scale ◽

Social Groups ◽

National Program ◽

Time And Space ◽

Lack Of Information ◽

The Individual

In recent years as public in general and specialist have been showing big interest to the matters of reading. According to discussion and launch of the “Support and Development of Reading National Program”, many Russian libraries are organizing the large-scale events like marathons, lecture cycles, bibliographic trainings etc. which should draw attention of different social groups to reading. The individual forms of attraction to reading are used much rare. To author’s mind the main reason of such an issue has to be the lack of information about forms and methods of attraction to reading.

Download Full-text

Accelerated Discovery of High-Refractive-Index Polyimides via First-Principles Molecular Modeling, Virtual High-Throughput Screening, and Data Mining

10.26434/chemrxiv.7670903.v1 ◽

2019 ◽

Author(s):

Mohammad Atif Faiz Afzal ◽

Mojtaba Haghighatlari ◽

Sai Prasad Ganesh ◽

Chong Cheng ◽

Johannes Hachmann

Keyword(s):

Data Mining ◽

Refractive Index ◽

High Throughput ◽

First Principles ◽

High Throughput Screening ◽

Large Scale ◽

Computational Study ◽

High Refractive Index ◽

Structural Features ◽

Learning Program

<div>We present a high-throughput computational study to identify novel polyimides (PIs) with exceptional refractive index (RI) values for use as optic or optoelectronic materials. Our study utilizes an RI prediction protocol based on a combination of first-principles and data modeling developed in previous work, which we employ on a large-scale PI candidate library generated with the ChemLG code. We deploy the virtual screening software ChemHTPS to automate the assessment of this extensive pool of PI structures in order to determine the performance potential of each candidate. This rapid and efficient approach yields a number of highly promising leads compounds. Using the data mining and machine learning program package ChemML, we analyze the top candidates with respect to prevalent structural features and feature combinations that distinguish them from less promising ones. In particular, we explore the utility of various strategies that introduce highly polarizable moieties into the PI backbone to increase its RI yield. The derived insights provide a foundation for rational and targeted design that goes beyond traditional trial-and-error searches.</div>

Download Full-text

Multi-GPU approach to global induction of classification trees for large-scale data mining

Applied Intelligence ◽

10.1007/s10489-020-01952-5 ◽

2021 ◽

Author(s):

Krzysztof Jurczuk ◽

Marcin Czajkowski ◽

Marek Kretowski

Keyword(s):

Data Mining ◽

Large Scale ◽

Real Life ◽

Population Based ◽

Tree Structure ◽

Global Approach ◽

Data Parallel ◽

Large Scale Data ◽

The Impact ◽

Scale Data

AbstractThis paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to the top-down inducers. It searches for the tree structure and tests simultaneously and thus gives improvements in the prediction and size of resulting classifiers in many situations. However, it is the population-based and iterative approach that can be too computationally demanding to apply for big data mining directly. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask the question whether the global approach can truly compete with the greedy systems for large-scale data. For this purpose, we propose a novel multi-GPU approach. It incorporates the knowledge of global DT induction and evolutionary algorithm parallelization together with efficient utilization of memory and computing GPU’s resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. Data-parallel decomposition strategy and CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed what suggests that data size boundaries for evolutionary DT mining are fading.

Download Full-text

Selecting Optimal Combination of Data Channels for Semantic Segmentation in City Information Modelling (CIM)

Remote Sensing ◽

10.3390/rs13071367 ◽

2021 ◽

Vol 13 (7) ◽

pp. 1367

Author(s):

Yuanzhi Cai ◽

Hong Huang ◽

Kaiyang Wang ◽

Cheng Zhang ◽

Lei Fan ◽

...

Keyword(s):

Large Scale ◽

Semantic Segmentation ◽

Optimal Combination ◽

Reconstruction Technique ◽

Process Data ◽

Redundant Data ◽

Multiple Data ◽

Information Models ◽

Efficient Data

Over the last decade, a 3D reconstruction technique has been developed to present the latest as-is information for various objects and build the city information models. Meanwhile, deep learning based approaches are employed to add semantic information to the models. Studies have proved that the accuracy of the model could be improved by combining multiple data channels (e.g., XYZ, Intensity, D, and RGB). Nevertheless, the redundant data channels in large-scale datasets may cause high computation cost and time during data processing. Few researchers have addressed the question of which combination of channels is optimal in terms of overall accuracy (OA) and mean intersection over union (mIoU). Therefore, a framework is proposed to explore an efficient data fusion approach for semantic segmentation by selecting an optimal combination of data channels. In the framework, a total of 13 channel combinations are investigated to pre-process data and the encoder-to-decoder structure is utilized for network permutations. A case study is carried out to investigate the efficiency of the proposed approach by adopting a city-level benchmark dataset and applying nine networks. It is found that the combination of IRGB channels provide the best OA performance, while IRGBD channels provide the best mIoU performance.

Download Full-text

Algorithm for Efficient Data Mining

International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007) ◽

10.1109/iccima.2007.150 ◽

2007 ◽

Cited By ~ 2

Author(s):

S.P. Latha ◽

N. Ramaraj

Keyword(s):

Data Mining ◽

Efficient Data

Download Full-text