Extending the charge-flipping method towards structure solution from incomplete data sets

2007 ◽  
Vol 40 (3) ◽  
pp. 456-462 ◽  
Author(s):  
Lukáš Palatinus ◽  
Walter Steurer ◽  
Gervais Chapuis

The charge-flipping method tends to fail when applied to an incomplete diffraction data set. The reason is the artifacts induced in the density maps by Fourier transforming the incomplete data. It is shown that the missing data can be approximated sufficiently well on the basis of the Patterson map of the unknown structure optimized by the maximum entropy method (MEM). Structures that could not be solved by the original charge-flipping algorithm can be solved by the proposed method. The method has been tested on experimental data from one inorganic and two organic structures, and on several types of missing data. In many cases up to 50% of the reflections, or even more, can be missing and the structure can still be reconstructed by charge flipping.
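
For orientation, a minimal sketch of the basic charge-flipping cycle that the paper extends is given below, assuming a density sampled on an FFT grid. The handling of unmeasured reflections (here simply left free to evolve) is exactly the point where the MEM/Patterson-based estimates proposed by the paper would be substituted; all names and the treatment shown are illustrative, not the authors' code.

```python
import numpy as np

def charge_flip(F_obs, phases0, delta, n_iter=200):
    """One run of the basic charge-flipping cycle on an FFT grid.

    F_obs   : observed amplitudes on the grid, np.nan where unmeasured
    phases0 : starting phases (e.g. random)
    delta   : small positive flipping threshold
    """
    F = np.where(np.isnan(F_obs), 0.0, F_obs) * np.exp(1j * phases0)
    for _ in range(n_iter):
        rho = np.fft.ifftn(F).real                # density from current factors
        rho = np.where(rho < delta, -rho, rho)    # flip density below threshold
        G = np.fft.fftn(rho)                      # back to reciprocal space
        # Measured reflections: keep the computed phase, impose the observed
        # amplitude. Unmeasured reflections: left free to evolve here; the
        # paper instead estimates them from a MEM-optimized Patterson map.
        F = np.where(np.isnan(F_obs), G, F_obs * np.exp(1j * np.angle(G)))
    return rho
```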

2014 ◽  
Vol 70 (a1) ◽  
pp. C1467-C1467
Author(s):  
Masaki Takata ◽  
Eiji Nishibori ◽  
Yoshiki Kubota ◽  
Hiroshi Tanaka

Widespread functionalization research on Metal Organic Frameworks (MOFs) has brought a rapid increase in the variety of materials since the first structural studies of the nanopores of MOFs were made by synchrotron radiation (SR) powder diffraction using the MEM (Maximum Entropy Method)/Rietveld method (Kitaura et al., 2002). The MEM/Rietveld method has been successfully applied to refine the positions of adsorbed molecules and to investigate the bonding nature between the molecules and the pore walls of MOFs. Noise-resistant electron density mapping from incomplete data sets is a key advantage of the MEM for visualizing unmodelled features of molecules in nanopores. Since then, charge density studies by the MEM/Rietveld method have uncovered a growing variety of ordered structures of molecules adsorbed in nanopores (Takata, 2008). Those findings ignited a trend to design nanopores as spaces to be functionalized. Recently, the MEM/Rietveld method has been further developed into a method for mapping electrostatic potentials and electric fields (Tanaka, 2006). This technique is advancing the structural science of MOFs, since the electrostatic potential visualized in the nanopores ought to provide information on the interplay between the molecules and the pore walls. The talk will present recent progress and challenges of the MEM/Rietveld method in the structural science of MOFs.


Author(s):  
Sonia Goel ◽  
Meena Tushir

Introduction: Incomplete data sets containing missing attributes are a prevailing problem in many research areas. The reasons for missing attributes may be several: human error in tabulating/recording the data, machine failure, errors in data acquisition, or the refusal of a patient/customer to answer a few questions in a questionnaire or survey. Clustering of such data sets therefore becomes a challenge. Objective: In this paper, we present a critical review of various methodologies proposed for handling missing data in clustering. The focus of the paper is a comparison of various imputation-based FCM clustering techniques and the four clustering strategies proposed by Hathaway and Bezdek. Methods: We impute the missing values in incomplete data sets by various imputation/non-imputation techniques to complete the data set, and then a conventional fuzzy clustering algorithm is applied to obtain the clustering results. Results: Experiments on various synthetic data sets and real data sets from the UCI repository are carried out. To evaluate the performance of the various imputation/non-imputation based FCM clustering algorithms, several performance criteria and statistical tests are considered. Experimental results on the various data sets show that linear-interpolation-based FCM clustering performs significantly better than the other imputation and non-imputation techniques. Conclusion: It is concluded that clustering performance is data specific; no clustering technique gives good results on all data sets. It depends on both the data type and the percentage of missing attributes in the data set. Through this study, we have shown that the linear-interpolation-based FCM clustering algorithm can be used effectively for clustering incomplete data sets.
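
As a hedged illustration of the pipeline the paper evaluates, the sketch below imputes missing attributes by column-wise linear interpolation (pandas) and then runs a textbook fuzzy c-means loop on the completed matrix. It is a minimal re-implementation for clarity, not the authors' code, and the toy data are invented.

```python
import numpy as np
import pandas as pd

def fcm(X, c=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Textbook fuzzy c-means on a complete matrix X (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)              # memberships, rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))              # standard membership update
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

# Linear interpolation along each attribute column, then conventional FCM.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0, 5.0],
                   "y": [2.0, 2.5, np.nan, 4.0, 4.5]})
X = df.interpolate(method="linear", limit_direction="both").to_numpy()
centers, U = fcm(X, c=2)
```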


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Caglar Kosun

Abstract: A variety of approaches in the literature have been used to interpret vehicular speed characteristics. This study turns its attention to entropy-based approaches, focusing on the maximum entropy method of statistical mechanics and the Kullback–Leibler (KL) divergence for examining vehicular speeds. The vehicle speeds at the selected highway are analyzed in order to find the disparities among them. It turned out, however, that the speed dynamics could not be distinguished from the speed distributions alone; the maximization of Shannon entropy is therefore insufficient to compare the speed distributions of the individual data sets. For this reason, the KL divergence approach was employed. This approach compares the speed distributions against two prior distribution models, i.e., uniform and Gaussian. The trends of the KL divergences obtained with both priors were examined. The KL divergence values for the highway speed data sets ranged between about 0.53 and 0.70 for the uniform case, while for the Gaussian case the values lay between 0.16 and 0.33. The KL divergence trends for the real speeds were analogous in both cases, but they differed significantly when synthetic data sets were employed. As a result, the KL divergence approach proves to be an appropriate indicator for comparing speed distributions.
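
A minimal sketch of the comparison described, assuming binned speed histograms: the discrete KL divergence D(P‖Q) = Σ p log(p/q) is computed against a uniform prior over the observed range and a Gaussian fitted to the sample. The synthetic speeds are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(P || Q) = sum_i p_i * log(p_i / q_i)."""
    p = np.asarray(p, float); q = np.asarray(q, float)
    p /= p.sum(); q /= q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

speeds = np.random.default_rng(1).normal(90.0, 8.0, 1000)  # synthetic speeds, km/h
counts, edges = np.histogram(speeds, bins=30)
centers = 0.5 * (edges[:-1] + edges[1:])

q_uniform = np.ones_like(centers)                          # uniform prior over range
q_gauss = norm.pdf(centers, speeds.mean(), speeds.std())   # Gaussian prior fit

print(kl_divergence(counts, q_uniform))   # larger: data far from uniform
print(kl_divergence(counts, q_gauss))     # smaller: data close to Gaussian
```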


Author(s):  
Malcolm J. Beynon

The essence of data mining is to search for pertinent information that may exist in data (often large data sets). The immeasurably large amount of data present in the world, due to the increasing capacity of storage media, raises the issue of missing values (Olinsky et al., 2003; Brown and Kros, 2003). This encyclopaedia article considers the general issue of the presence of missing values in data mining, and demonstrates the effect of managing, or not managing, their presence through the use of a data mining technique. The issue of missing values was first exposited over forty years ago in Afifi and Elashoff (1966). Since then it has continually been the focus of study and explanation (El-Masri and Fox-Wasylyshyn, 2005), covering issues such as the nature of their presence and their management (Allison, 2000). With this in mind, a consistent aspect of the missing value debate is the limited number of general strategies available for their management, the main two being either the simple deletion of cases with missing data or some form of imputation of the missing values (see Elliott and Hawthorne, 2005). Examples of specific investigations of missing data (and data quality) include data warehousing (Ma et al., 2000) and customer relationship management (Berry and Linoff, 2000). An alternative strategy is the retention of the missing values, and their subsequent 'ignorance'-based contribution to any data mining undertaken on the associated original incomplete data set. A consequence of this retention is that full interpretability can be placed on the results found from the original incomplete data set. This strategy can be followed when using the nascent CaRBS technique for object classification (Beynon, 2005a, 2005b). CaRBS analyses are presented here to illustrate that data mining can manage the presence of missing values far more effectively than the more inhibitory traditional strategies. An example data set is considered, with a noticeable level of missing values present in the original data set. A further increase in the number of missing values present in the data set further illustrates the benefit of 'intelligent' data mining (in this case using CaRBS).
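
For concreteness, a minimal pandas sketch of the two traditional strategies named above, case deletion and imputation, is given below; CaRBS itself is not reproduced, and the toy data are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, np.nan, 51, 29, np.nan],
    "income": [48000, 52000, np.nan, 39000, 61000],
})

# Strategy 1: simple deletion of cases holding any missing attribute.
deleted = df.dropna()

# Strategy 2: imputation, here with column means; median, regression or
# interpolation-based imputation follow the same pattern.
imputed = df.fillna(df.mean())
```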


Author(s):  
Lukáš Palatinus ◽  
Sander van Smaalen

Abstract: This paper presents the application of the Maximum Entropy Method (MEM) to the structure solution of incommensurately modulated structures within the superspace formalism. The basic principles of the MEM are outlined, and its generalization towards superspace is discussed. Possible problems in MEM reconstructions and their solutions are summarized. They include series-termination errors in the reconstructed electron density, the effect of insufficient constraints, and the effect of missing data. The use of the MEM in superspace is illustrated by three examples: the structure of the misfit-layer compound (LaS)


Polymers ◽  
2021 ◽  
Vol 13 (21) ◽  
pp. 3811
Author(s):  
Iosif Sorin Fazakas-Anca ◽  
Arina Modrea ◽  
Sorin Vlase

This paper proposes a new method for calculating the monomer reactivity ratios for binary copolymerization based on the terminal model. The proposed optimization method combines a numerical integration algorithm with an optimization algorithm based on k-nearest-neighbour non-parametric regression. The calculation method has been tested on simulated and experimental data sets at low (<10%), medium (10–35%) and high (>40%) conversions, yielding reactivity ratios in good agreement with the usual methods such as intersection, Fineman–Ross, reverse Fineman–Ross, Kelen–Tüdös, extended Kelen–Tüdös and the error-in-variables method. The experimental data sets used in this comparative analysis are the copolymerization of 2-(N-phthalimido)ethyl acrylate with 1-vinyl-2-pyrrolidone for low conversion, of isoprene with glycidyl methacrylate for medium conversion, and of N-isopropylacrylamide with N,N-dimethylacrylamide for high conversion. The possibility of estimating experimental errors from a single experimental data set of n points is also shown.
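
As a point of reference for the comparison methods named above, the sketch below implements the classical Fineman–Ross linearization, G = r1·H − r2 with G = x(y − 1)/y and H = x²/y, where x is the monomer feed ratio and y the copolymer composition ratio. The feed/composition values are illustrative, and the authors' k-nearest-neighbour optimizer is not reproduced.

```python
import numpy as np

def fineman_ross(x, y):
    """Fineman-Ross estimate of reactivity ratios (low-conversion data).

    x : feed ratios [M1]/[M2];  y : copolymer ratios d[M1]/d[M2]
    Linearized form: G = r1 * H - r2, with G = x(y - 1)/y and H = x**2 / y.
    """
    x = np.asarray(x, float); y = np.asarray(y, float)
    G = x * (y - 1.0) / y
    H = x ** 2 / y
    slope, intercept = np.polyfit(H, G, 1)   # slope = r1, intercept = -r2
    return slope, -intercept

# Illustrative values only (not data from the paper):
r1, r2 = fineman_ross([0.25, 0.5, 1.0, 2.0], [0.40, 0.72, 1.25, 2.10])
print(r1, r2)
```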


2017 ◽  
Author(s):  
Alexander P. Browning ◽  
Scott W. McCue ◽  
Rachelle N. Binny ◽  
Michael J. Plank ◽  
Esha T. Shah ◽  
...  

Abstract: Collective cell spreading takes place in spatially continuous environments, yet it is often modelled using discrete lattice-based approaches. Here, we use data from a series of cell proliferation assays, with a prostate cancer cell line, to calibrate a spatially continuous individual-based model (IBM) of collective cell migration and proliferation. The IBM explicitly accounts for crowding effects by modifying the rate of movement, the direction of movement, and the rate of proliferation according to pairwise interactions. Taking a Bayesian approach, we estimate the free parameters in the IBM using rejection sampling on three separate, independent experimental data sets. Since the posterior distributions for each experiment are similar, we perform simulations with parameters sampled from a new posterior distribution generated by combining the three data sets. To explore the predictive power of the calibrated IBM, we forecast the evolution of a fourth experimental data set. Overall, we show how to calibrate a lattice-free IBM to experimental data, and our work highlights the importance of interactions between individuals. Despite great care taken experimentally to distribute cells as uniformly as possible, we find evidence of significant spatial clustering over short distances, suggesting that standard mean-field models could be inappropriate.
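
A minimal sketch of rejection-sampling calibration of the kind described, under the usual approximate Bayesian computation (ABC) scheme: draw parameters from the prior, forward-simulate, and accept draws whose summary statistics fall within a tolerance of the observed summaries. The simulator, prior and summary statistic used here are placeholders, not the authors' IBM.

```python
import numpy as np

def abc_rejection(simulate, summarize, data, prior_sampler, n_draws=10000, eps=0.05):
    """Rejection-sampling ABC: keep prior draws whose simulated summary
    statistics land within eps of the observed summaries."""
    s_obs = summarize(data)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler()                   # draw parameters from the prior
        s_sim = summarize(simulate(theta))        # forward-simulate and summarize
        if np.linalg.norm(s_sim - s_obs) < eps:
            accepted.append(theta)                # approximate posterior sample
    return np.array(accepted)

# Placeholder model; a real study would call the calibrated IBM here.
rng = np.random.default_rng(0)
observed = rng.normal(1.0, 0.2, 100)
posterior = abc_rejection(
    simulate=lambda th: rng.normal(th, 0.2, 100),
    summarize=lambda d: np.array([d.mean(), d.std()]),
    data=observed,
    prior_sampler=lambda: rng.uniform(0.0, 2.0),
)
```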


2011 ◽  
Vol 44 (4) ◽  
pp. 865-872 ◽  
Author(s):  
Ludmila Urzhumtseva ◽  
Alexandre Urzhumtsev

Crystallographic Fourier maps may contain barely interpretable or non-interpretable regions if these maps are calculated with an incomplete set of diffraction data. Even a small percentage of missing data may be crucial if these data are distributed non-uniformly and form connected regions of reciprocal space. Significant time and effort can be lost trying to interpret poor maps, in improving them by phase refinement or in fighting against artefacts, whilst the problem could in fact be solved by completing the data set. To characterize the distribution of missing reflections, several types of diagrams have been suggested in addition to the usual plots of completeness in resolution shells and cumulative data completeness. A computer program, FOBSCOM, has been developed to analyze the spatial distribution of unmeasured diffraction data, to search for connected regions of unmeasured reflections and to obtain numeric characteristics of these regions. By performing this analysis, the program can help to save time during structure solution for a number of projects. It can also provide information about possible overestimation of the map quality and about model-biased features when calculated values are used to replace unmeasured data.
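
For contrast with FOBSCOM's spatial analysis, the sketch below computes the standard completeness-by-resolution-shell diagnostic mentioned above. The d-spacings and measured flags are synthetic, and the binning convention (equal-count shells in 1/d²) is one common choice, not necessarily the program's.

```python
import numpy as np

def shell_completeness(d_spacing, measured, n_shells=10):
    """Fraction of measured reflections per resolution shell.

    d_spacing : d-spacing of every reflection in the full set (angstroms)
    measured  : boolean flags, True where the reflection was measured
    """
    s2 = 1.0 / np.asarray(d_spacing, float) ** 2      # bin on 1/d^2 so shells
    measured = np.asarray(measured, bool)             # hold comparable counts
    edges = np.quantile(s2, np.linspace(0.0, 1.0, n_shells + 1))
    idx = np.clip(np.digitize(s2, edges) - 1, 0, n_shells - 1)
    return np.array([measured[idx == k].mean() for k in range(n_shells)])

# Synthetic set in which completeness falls off at high resolution (small d).
rng = np.random.default_rng(3)
d = rng.uniform(0.8, 20.0, 5000)
meas = rng.random(5000) > 1.5 / d
print(shell_completeness(d, meas))
```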


Author(s):  
Hai Wang ◽  
Shouhong Wang

Surveys are among the most common data acquisition methods for data mining (Brin, Rastogi & Shim, 2003). In data mining one can rarely find a survey data set that contains complete entries for every observation on all of the variables. Commonly, surveys and questionnaires are only partially completed by respondents. The possible reasons for incomplete data are numerous, including negligence, deliberate avoidance for privacy, ambiguity of the survey question, and aversion. The extent of the damage caused by missing data is unknown when it is virtually impossible to return the survey or questionnaire to the data source for completion, yet it is one of the most important pieces of knowledge for data mining to discover. In fact, missing data is an important and debatable issue in the knowledge engineering field (Tseng, Wang, & Lee, 2003).


1988 ◽  
Vol 32 (17) ◽  
pp. 1183-1187
Author(s):  
J. G. Kreifeldt ◽  
S. H. Levine ◽  
M. C. Chuang

Sensory modalities exhibit a characteristic known as Weber's ratio, which states that when two stimuli are compared for a difference: (1) there is some minimal nonzero difference that can be differentiated, and (2) this minimal difference is a nearly constant proportion of the magnitude of the stimuli. Both of these would, in a typical measurement context, appear to be system defects. We have found through simulation explorations that these are in fact apparently the characteristics required of a system designed to extract an adequate amount of information from an incomplete observation data set, according to a new approach to measurement.

