Serial Crystallography with Multi-stage Merging of 1000s of Images

2017 ◽  
Author(s):  
Herbert J. Bernstein ◽  
Lawrence C. Andrews ◽  
James Foadi ◽  
Martin R. Fuchs ◽  
Jean Jakoncic ◽  
...  

KAMO and Blend provide particularly effective tools for automatically managing the merging of large numbers of data sets from serial crystallography. The requirement for manual intervention in the process can be reduced by extending Blend to support additional clustering options that increase the sensitivity to differences in unit-cell parameters and allow nearly complete datasets to be clustered on the basis of intensity or amplitude differences. If the datasets are already sufficiently complete, KAMO is applied once, using the reflections alone. If starting from incomplete datasets, KAMO is applied twice: first clustering on cell parameters, using either the simple cell-vector distance of the original Blend or the more sensitive NCDist, to find clusters that can be merged to sufficient completeness for intensities or amplitudes to be compared; then clustering on the correlation between reflections at the common HKLs, which is sensitive to structural differences that may not perturb the cell parameters enough to yield meaningful clusters.

Many groups have developed effective clustering algorithms that use a measurable physical parameter from each diffraction still or wedge to sort the data into categories, which can then be merged in the hope of yielding the electron density of a single protein isoform. What is striking about many of these physical parameters is that they are largely independent of one another. Consequently, it should be possible to greatly improve the efficacy of data-clustering software by using a multi-stage partitioning strategy. Here we demonstrate one possible approach to multi-stage data clustering. Our strategy was to use unit-cell clustering until the merged data were sufficiently complete, and then to switch to intensity-based clustering. We have demonstrated that, using this strategy, we were able to accurately cluster data sets from crystals with subtle differences.
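
The two-stage strategy lends itself to a compact illustration. The sketch below is only a schematic of the idea (hierarchical clustering first on unit-cell parameters, then on intensity correlation at common HKLs); the function names, the plain Euclidean cell distance standing in for the Blend/NCDist measures, and the thresholds are assumptions, not the KAMO/Blend implementation.

```python
# Schematic two-stage clustering sketch (illustrative only, not KAMO/Blend code).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_by_cell(cells, threshold=1.0):
    """cells: (n_datasets, 6) array of (a, b, c, alpha, beta, gamma).
    A plain Euclidean distance stands in for the Blend / NCDist cell measures."""
    z = linkage(pdist(cells), method="ward")
    return fcluster(z, t=threshold, criterion="distance")

def cluster_by_intensity(intensities, threshold=0.2):
    """intensities: (n_datasets, n_common_hkl) array of merged intensities at
    HKLs common to all datasets; distance = 1 - Pearson correlation."""
    cc = np.corrcoef(intensities)
    d = 1.0 - cc[np.triu_indices_from(cc, k=1)]   # condensed 1 - CC distances
    z = linkage(d, method="average")
    return fcluster(z, t=threshold, criterion="distance")

# Stage 1: group incomplete datasets by cell similarity and merge within groups;
# Stage 2: re-cluster the (now more complete) merged sets by intensity correlation.
```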

1998 ◽  
Vol 54 (1) ◽  
pp. 111-113 ◽  
Author(s):  
Yu Luo ◽  
Min-yuan Chou ◽  
Su-chen Li ◽  
Yu-teh Li ◽  
Ming Luo

Functional monomeric 83 kDa sialidase L, a NeuAcα2→3Gal-specific sialidase from the leech Macrobdella, was expressed in Escherichia coli and readily crystallized by a macroseeding technique. The crystal belongs to space group P1 with unit-cell parameters a = 46.4, b = 69.3, c = 72.5 Å, α = 113.5, β = 95.4 and γ = 107.3°. There is one molecule per unit cell, giving a Vm = 2.4 Å³ Da⁻¹ and a solvent content of 40%. Native and mercury-derivative data sets were collected to 2.0 Å resolution. Threading and molecular-replacement calculations confirmed the existence of a bacterial sialidase-like domain.
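
As a sanity check of the quoted packing statistics, the triclinic cell volume and the Matthews coefficient can be recomputed from the reported cell parameters and the 83 kDa molecular weight; the short sketch below is an independent back-of-the-envelope calculation, not part of the original work.

```python
# Back-of-the-envelope check of the reported Matthews coefficient (illustrative only).
import math

a, b, c = 46.4, 69.3, 72.5                          # cell edges (Angstrom)
al, be, ga = map(math.radians, (113.5, 95.4, 107.3))
cos_a, cos_b, cos_g = math.cos(al), math.cos(be), math.cos(ga)

# Triclinic cell volume:
# V = abc * sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma) + 2 cos(alpha) cos(beta) cos(gamma))
volume = a * b * c * math.sqrt(1 - cos_a**2 - cos_b**2 - cos_g**2
                               + 2 * cos_a * cos_b * cos_g)

mw = 83_000   # molecular weight of sialidase L (Da)
z = 1         # one molecule per P1 cell, as reported
vm = volume / (z * mw)
print(f"V = {volume:,.0f} A^3, Vm = {vm:.2f} A^3/Da")   # ~2.4 A^3/Da, matching the abstract
```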


Author(s):  
Fang Lu ◽  
Bei Zhang ◽  
Yong Liu ◽  
Ying Song ◽  
Gangxing Guo ◽  
...  

Phytases are phosphatases that hydrolyze phytates to less phosphorylated myo-inositol derivatives and inorganic phosphate. β-Propeller phytases, which are very diverse phytases with improved thermostability that are active at neutral and alkaline pH and have absolute substrate specificity, are ideal substitutes for other commercial phytases. PhyH-DI, a β-propeller phytase from Bacillus sp. HJB17, was found to act synergistically with other single-domain phytases and can increase their efficiency in the hydrolysis of phytate. Crystals of native and selenomethionine-substituted PhyH-DI were obtained using the vapour-diffusion method in a condition consisting of 0.2 M sodium chloride, 0.1 M Tris pH 8.5, 25%(w/v) PEG 3350 at 289 K. X-ray diffraction data were collected to 3.00 and 2.70 Å resolution, respectively, at 100 K. Native PhyH-DI crystals belonged to space group C121, with unit-cell parameters a = 156.84, b = 45.54, c = 97.64 Å, α = 90.00, β = 125.86, γ = 90.00°. The asymmetric unit contained two molecules of PhyH-DI, with a corresponding Matthews coefficient of 2.17 Å³ Da⁻¹ and a solvent content of 43.26%. Crystals of selenomethionine-substituted PhyH-DI belonged to space group C2221, with unit-cell parameters a = 94.71, b = 97.03, c = 69.16 Å, α = β = γ = 90.00°. The asymmetric unit contained one molecule of the protein, with a corresponding Matthews coefficient of 2.44 Å³ Da⁻¹ and a solvent content of 49.64%. Initial phases for PhyH-DI were obtained from SeMet SAD data sets. These data will be useful for further studies of the structure–function relationship of PhyH-DI.


Author(s):  
B. K. Tripathy ◽  
Hari Seetha ◽  
M. N. Murty

Data clustering plays a very important role in data mining, machine learning and image processing. As modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed in this direction. These include fuzzy c-means, rough c-means and intuitionistic fuzzy c-means, as well as algorithms based on hybrid models such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions and possibilistic kernelised versions. However, none of the above algorithms is effective on big data, for various reasons. Researchers have therefore been trying for the past few years to improve these algorithms so that they can be applied to cluster big data; such algorithms are still relatively few in comparison with those for data sets of reasonable size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
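
For readers unfamiliar with the baseline that these uncertainty-based methods extend, the following minimal sketch shows the standard fuzzy c-means updates (membership and prototype steps); the rough, intuitionistic and hybrid variants discussed in the chapter modify exactly these two steps. The function name, parameters and defaults are illustrative assumptions.

```python
# Minimal fuzzy c-means sketch (standard update rules; illustrative only).
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    u = rng.random((n, c))
    u /= u.sum(axis=1, keepdims=True)          # fuzzy memberships sum to 1 per point
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]               # weighted prototypes
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)      # membership update from distances
    return centers, u
```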


2020 ◽  
Vol 53 (1) ◽  
pp. 277-281
Author(s):  
Lina Takemaru ◽  
Gongrui Guo ◽  
Ping Zhu ◽  
Wayne A. Hendrickson ◽  
Sean McSweeney ◽  
...  

Recent developments at microdiffraction X-ray beamlines are making microcrystals of macromolecules appealing subjects for routine structural analysis. Microcrystal diffraction data collected at synchrotron microdiffraction beamlines may be radiation damaged, incomplete for each individual microcrystal, and subject to unit-cell variations. A multi-stage data assembly method has previously been designed for microcrystal synchrotron crystallography. Here the strategy has been implemented as a Python program for microcrystal data assembly (PyMDA). PyMDA optimizes microcrystal data quality, including weak anomalous signals, through iterative crystal and frame rejections. Beyond microcrystals, PyMDA may be applicable to assembling data sets from larger crystals for improved data quality.
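
The iterative-rejection idea can be pictured with a short, hedged sketch: merge the currently accepted crystals into a reference, score each crystal by its correlation to that reference, and drop poorly correlating crystals until the selection stabilizes. The function, its inputs and the cutoff below are illustrative assumptions and do not reproduce PyMDA's actual code.

```python
# Hedged sketch of iterative crystal rejection against a merged reference.
import numpy as np

def iterative_rejection(stack, cc_cutoff=0.8, max_rounds=5):
    """stack: (n_crystals, n_common_hkl) array of per-crystal intensities on a
    shared HKL grid, with NaN where a reflection was not measured."""
    keep = np.ones(len(stack), dtype=bool)
    for _ in range(max_rounds):
        reference = np.nanmean(stack[keep], axis=0)            # current merged set
        cc = np.empty(len(stack))
        for i, crystal in enumerate(stack):
            ok = ~np.isnan(crystal) & ~np.isnan(reference)     # commonly measured HKLs
            cc[i] = np.corrcoef(crystal[ok], reference[ok])[0, 1]
        new_keep = cc >= cc_cutoff                             # reject poorly correlating crystals
        if np.array_equal(new_keep, keep) or not new_keep.any():
            break
        keep = new_keep
    return keep
```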


2019 ◽  
Vol 16 (2) ◽  
pp. 469-489 ◽  
Author(s):  
Piotr Lasek ◽  
Jarek Gryz

In this paper we present our ic-NBC and ic-DBSCAN algorithms for data clustering with constraints. The algorithms are based on the density-based clustering algorithms NBC and DBSCAN but allow users to incorporate background knowledge into the clustering process by means of instance constraints. Knowledge about anticipated groups can be applied by specifying so-called must-link and cannot-link relationships between objects or points. These relationships are then incorporated into the clustering process. In the proposed algorithms this is achieved by properly merging the resulting clusters and by introducing a new notion of deferred points, which are temporarily excluded from clustering and assigned to clusters based on their involvement in cannot-link relationships. To examine the algorithms, we carried out a number of experiments on benchmark data sets, testing the efficiency and the quality of the results, and we also measured the efficiency of the algorithms against their original versions. The experiments show that the introduction of instance constraints improves the quality of both algorithms, while the efficiency is only insignificantly reduced, owing to the extra computation related to the introduced constraints.
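
A minimal sketch of how instance constraints can enter a post-clustering merge step is given below. It illustrates must-link/cannot-link handling in general; it is not the ic-NBC/ic-DBSCAN implementation, which enforces the constraints (and handles deferred points) inside the density-based clustering loop itself.

```python
# Hedged sketch of constraint-aware cluster merging (illustrative only).
def violates_cannot_link(cluster_a, cluster_b, cannot_link):
    """cluster_a / cluster_b: sets of point ids; cannot_link: set of frozenset pairs."""
    return any(frozenset((p, q)) in cannot_link
               for p in cluster_a for q in cluster_b)

def merge_with_constraints(clusters, must_link, cannot_link):
    """Greedily merge clusters joined by a must-link pair, unless the merge would
    put two cannot-linked points into the same cluster."""
    merged = [set(c) for c in clusters]
    for p, q in must_link:
        ca = next((c for c in merged if p in c), None)
        cb = next((c for c in merged if q in c), None)
        if ca is None or cb is None or ca is cb:
            continue                                   # noise point, or already together
        if not violates_cannot_link(ca, cb, cannot_link):
            ca |= cb
            merged.remove(cb)
    return merged

# Example: the must-link (1, 4) merges {1, 2} and {3, 4}; the cannot-link (2, 7)
# would block any merge that brings points 2 and 7 together.
# merge_with_constraints([{1, 2}, {3, 4}, {7, 8}],
#                        must_link={(1, 4)}, cannot_link={frozenset((2, 7))})
```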


Author(s):  
Rina Refianti ◽  
Achmad Benny Mutiara ◽  
Asep Juarna ◽  
Adang Suhendra

In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP). AP is a data clustering technique that uses iterative message passing and considers all data points as potential exemplars. Two important inputs of AP are a similarity matrix (SM) of the data and the parameter "preference" p. Although the original AP algorithm has shown much success in data clustering, it still suffers from one limitation: it is not easy to determine the value of the parameter "preference" p that results in an optimal clustering solution. To resolve this limitation, we propose a new model of the parameter "preference" p based on the similarity distribution. Given the SM and p, the Modified Adaptive AP (MAAP) procedure is run; MAAP means that the adaptive p-scanning algorithm of the original Adaptive AP (AAP) procedure is omitted. Experimental results on random non-partition and partition data sets show that (i) the proposed algorithm, MAAP-DDP, is slower than the original AP for the random non-partition dataset, and (ii) for the random 4-partition dataset and the real datasets the proposed algorithm succeeds in identifying clusters according to the number of the datasets' true labels, with execution times comparable to those of the original AP. Moreover, the MAAP-DDP algorithm proves more feasible and effective than the original AAP procedure.
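
A hedged sketch of the general setup is shown below using scikit-learn's Affinity Propagation: the similarity matrix is built from negative squared distances and the preference is taken from the similarity distribution. The median heuristic used here is only a common stand-in; it is not the distribution-based model proposed for MAAP-DDP.

```python
# Affinity Propagation with a preference drawn from the similarity distribution
# (illustrative; the MAAP-DDP preference model is not reproduced here).
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

def run_ap(X):
    S = -pairwise_distances(X, metric="sqeuclidean")   # similarity = negative squared distance
    off_diag = S[~np.eye(len(S), dtype=bool)]
    p = np.median(off_diag)                            # preference from the similarity distribution
    ap = AffinityPropagation(affinity="precomputed", preference=p, random_state=0)
    labels = ap.fit_predict(S)
    return labels, ap.cluster_centers_indices_
```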


Author(s):  
Pimwadee Chaovalit

Biomedical signals, which help monitor patients' physical conditions, are a crucial part of the healthcare industry. Healthcare professionals' ability to monitor patients and detect early signs of conditions such as blocked arteries and abnormal heart rhythms can be supported by performing data clustering on biomedical signals. More importantly, clustering streams of biomedical signals makes it possible to look for patterns that may indicate developing conditions. While there are a number of algorithms that cluster data streams by example, few algorithms exist that perform clustering by variable. This paper presents POD-Clus, a clustering method that uses a model-based clustering principle and, in addition to clustering by example, also clusters data streams by variable. The clustering result from POD-Clus was superior to that from ODAC, a baseline algorithm, both with and without cluster evolutions.
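
The streaming ingredient of clustering by variable can be sketched as incrementally maintained sufficient statistics from which pairwise correlations between signal channels are recomputed at any time; the class below is an illustrative sketch of that idea under those assumptions, not the POD-Clus model itself.

```python
# Incremental pairwise-correlation statistics for clustering signals by variable
# (a sketch of the streaming idea only; not the POD-Clus model).
import numpy as np

class StreamingCorrelation:
    def __init__(self, n_signals):
        self.n = 0
        self.s = np.zeros(n_signals)                   # running sum of each signal
        self.ss = np.zeros(n_signals)                  # running sum of squares
        self.sp = np.zeros((n_signals, n_signals))     # running sum of pairwise products

    def update(self, x):
        """x: one stream sample with a value per signal channel."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.s += x
        self.ss += x * x
        self.sp += np.outer(x, x)

    def correlation(self):
        mean = self.s / self.n
        cov = self.sp / self.n - np.outer(mean, mean)
        std = np.sqrt(np.clip(self.ss / self.n - mean**2, 1e-12, None))
        return cov / np.outer(std, std)                # basis for variable clustering
```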


1985 ◽  
Vol 63 (1) ◽  
pp. 173-180 ◽  
Author(s):  
Eric S. Gardiner ◽  
Anatole Sarko

The crystal structures of cellulose polymorphs IVI and IVII have been determined by X-ray fiber diffraction analysis combined with stereochemical model refinement. Both structures crystallize in a two-chain unit cell of essentially identical parameters. The most probable space group in both cases is P1. The chain conformations, although close to two-fold helical, are marked by unequal rotational positions of the O(6) hydroxyl groups in adjacent residues. Despite identical unit cell parameters, the structures differ in chain polarity: in cellulose IVI both chains of the unit cell are parallel, whereas in cellulose IVII they are antiparallel. The difference in polarity is further substantiated by the results of chemical conversions which show that cellulose IVI is converted to cellulose I, and cellulose IVII is converted to cellulose II, via parallel and antiparallel cellulose triacetates, respectively. The reliability of the structure analyses is indicated by the residual R″ = 0.115 for cellulose IVI and 0.094 for cellulose IVII, for data sets of 41 and 43 reflections, respectively.


2013 ◽  
Vol 333-335 ◽  
pp. 1306-1309
Author(s):  
Zhen Dong Li ◽  
Fei Li

Clustering algorithms such as the K-means algorithm use distances in attribute space to cluster data; however, the way these distances are computed influences accuracy. The Variance-Similarity clustering algorithm therefore defines similarity as a function of attribute variance and clusters data by comparing similarities. In computer simulations, a comparison of the Variance-Similarity algorithm and the K-means algorithm on UCI data sets shows that the Variance-Similarity algorithm achieves better clustering accuracy than K-means.
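
The abstract does not spell out the exact Variance-Similarity function, so the sketch below only illustrates the general idea of letting attribute variance enter the comparison, using inverse-variance weighting ahead of a K-means baseline on a UCI data set; it is a stand-in, not the published algorithm.

```python
# Variance-aware weighting versus a plain K-means baseline (illustrative stand-in only).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)                       # a UCI benchmark data set

# Baseline: K-means on raw attribute-space distances.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Stand-in for a variance-aware approach: weight each attribute by the inverse of
# its variance before clustering, so high-variance attributes do not dominate.
Xw = X / X.var(axis=0)
vw_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xw)

print("K-means ARI:          ", adjusted_rand_score(y, km_labels))
print("Variance-weighted ARI:", adjusted_rand_score(y, vw_labels))
```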

