ccbmlib – a Python package for modeling Tanimoto similarity value distributions

F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 100
Author(s):  
Martin Vogt ◽  
Jürgen Bajorath

The ccbmlib Python package is a collection of modules for modeling similarity value distributions based on Tanimoto coefficients for fingerprints available in RDKit. It can be used to assess the statistical significance of Tanimoto coefficients and evaluate how molecular similarity is reflected when different fingerprint representations are used. Significance measures derived from p-values allow a quantitative comparison of similarity scores obtained from different fingerprint representations that might have very different value ranges. Furthermore, the package models conditional distributions of similarity coefficients for a given reference compound. The conditional significance score estimates where a test compound would be ranked in a similarity search. The models are based on the statistical analysis of feature distributions and feature correlations of fingerprints of a reference database. The resulting models have been evaluated for 11 RDKit fingerprints, taking a collection of ChEMBL compounds as a reference data set. For most fingerprints, highly accurate models were obtained, with differences of 1% or less for Tanimoto coefficients indicating high similarity.
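As a rough illustration of the quantities discussed above, the following sketch computes a Tanimoto coefficient and a p-value-style significance for it. It does not use ccbmlib or RDKit, and it is not the package's method: the package models the distribution from feature statistics, whereas this sketch swaps in a brute-force empirical estimate over a toy reference set. Fingerprints are assumed to be sets of on-bit indices, and the names `tanimoto` and `empirical_p_value` are hypothetical.

```python
import random


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # assumed convention for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def empirical_p_value(score: float, background_scores: list) -> float:
    """Fraction of background similarity values that are >= score."""
    hits = sum(1 for s in background_scores if s >= score)
    return hits / len(background_scores)


# Toy reference "database": random 64-bit fingerprints with 12 bits set each.
# A real reference set (e.g. ChEMBL compounds) would replace this.
rng = random.Random(0)
reference = [frozenset(rng.sample(range(64), 12)) for _ in range(100)]

# Background distribution: all pairwise similarities within the reference set.
background = [
    tanimoto(reference[i], reference[j])
    for i in range(len(reference))
    for j in range(i + 1, len(reference))
]

query = frozenset(range(12))
test_compound = frozenset(range(4, 16))
score = tanimoto(query, test_compound)
p_value = empirical_p_value(score, background)
print(f"Tc = {score:.3f}, empirical p-value = {p_value:.4f}")
```

A small p-value here means that few random reference pairs reach the observed similarity, which is what makes scores from fingerprints with different value ranges comparable.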


2020 ◽  
Vol 501 (1) ◽  
pp. 994-1001
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey ◽  
Snehasish Bhattacharjee

ABSTRACT We use an information theoretic framework to analyse data from the Galaxy Zoo 2 project and study whether there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume limited sample (Mr ≤ −21) and compare it with the same in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies is shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that neither randomization of morphological classifications nor shuffling of the spatial distribution alters the mutual information in a statistically significant way. The non-zero mutual information between the barredness and environment arises due to the finite and discrete nature of the data set and can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at the 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
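The idea of comparing an observed mutual information against randomized labels can be sketched as follows. This is a minimal illustration under assumed inputs (a binary barred/unbarred label and a discretized environment bin per galaxy), not the authors' pipeline; the function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys) -> float:
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def shuffled_baseline(xs, ys, n_shuffles=200, seed=0):
    """Mutual information values after randomly permuting the labels xs."""
    rng = random.Random(seed)
    shuffled = list(xs)
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        values.append(mutual_information(shuffled, ys))
    return values


# Toy data: labels and environment bins drawn independently of each other.
rng = random.Random(1)
barred = [rng.randint(0, 1) for _ in range(500)]
environment = [rng.randint(0, 3) for _ in range(500)]

observed = mutual_information(barred, environment)
baseline = shuffled_baseline(barred, environment)
print(f"observed MI = {observed:.5f} bits")
print(f"mean shuffled MI = {sum(baseline) / len(baseline):.5f} bits")
```

Even for independent variables the estimate is generally slightly above zero on a finite sample, which is why a comparison against shuffled data is needed before reading anything into a non-zero value.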


2017 ◽  
Vol 3 (5) ◽  
pp. e192 ◽  
Author(s):  
Corina Anastasaki ◽  
Stephanie M. Morris ◽  
Feng Gao ◽  
David H. Gutmann

Objective: To ascertain the relationship between the germline NF1 gene mutation and glioma development in patients with neurofibromatosis type 1 (NF1). Methods: The relationship between the type and location of the germline NF1 mutation and the presence of a glioma was analyzed in 37 participants with NF1 from one institution (Washington University School of Medicine [WUSM]) with a clinical diagnosis of NF1. Odds ratios (ORs) were calculated using both unadjusted and weighted analyses of this data set in combination with 4 previously published data sets. Results: While no statistical significance was observed between the location and type of the NF1 mutation and glioma in the WUSM cohort, power calculations revealed that a sample size of 307 participants would be required to determine the predictive value of the position or type of the NF1 gene mutation. Combining our data set with 4 previously published data sets (n = 310), children with glioma were found to be more likely to harbor 5′-end gene mutations (OR = 2; p = 0.006). Moreover, while not clinically predictive due to insufficient sensitivity and specificity, this association with glioma was stronger for participants with 5′-end truncating (OR = 2.32; p = 0.005) or 5′-end nonsense (OR = 3.93; p = 0.005) mutations relative to those without glioma. Conclusions: Individuals with NF1 and glioma are more likely to harbor nonsense mutations in the 5′ end of the NF1 gene, suggesting that the NF1 mutation may be one predictive factor for glioma in this at-risk population.
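For readers unfamiliar with odds ratios, an unadjusted OR from a 2×2 table can be computed as in the sketch below. The counts are invented purely for illustration and are not the study's data; the Woolf (log-based) confidence interval is one common choice and is an assumption here, as the abstract does not state which interval or weighting scheme was used.

```python
import math


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Unadjusted odds ratio for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    return (a * d) / (b * c)


def woolf_confidence_interval(a, b, c, d, z: float = 1.96):
    """Approximate 95% confidence interval on the log-odds scale (Woolf method)."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts: 5'-end mutation (exposed) versus glioma (outcome).
a, b, c, d = 20, 10, 10, 10
low, high = woolf_confidence_interval(a, b, c, d)
print(f"OR = {odds_ratio(a, b, c, d):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

An interval that includes 1 would indicate that the association is not statistically significant at the chosen level.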


Author(s):  
Emilia Mendes

Although numerous studies on Web effort estimation have been carried out to date, there is no consensus on what constitutes the best effort estimation technique to be used by Web companies. It seems that not only the effort estimation technique itself can influence the accuracy of predictions, but also the characteristics of the data set used (e.g., skewness, collinearity; Shepperd & Kadoda, 2001). Therefore, it is often necessary to compare different effort estimation techniques, looking for those that provide the best estimation accuracy for the data set being employed. With this in mind, the use of graphical aids such as boxplots is not always enough to assess the existence of significant differences between effort prediction models. The same applies to measures of prediction accuracy such as the mean magnitude of relative error (MMRE), median magnitude of relative error (MdMRE), and prediction at level l (Pred[25]). Other techniques, which correspond to the group of statistical significance tests, need to be employed to check if the different residuals obtained for each of the effort estimation techniques compared come from the same population. This chapter details how to use such techniques and how their results should be interpreted.
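The accuracy measures named above follow directly from the magnitude of relative error, MRE = |actual − predicted| / actual. The sketch below uses their standard definitions; the effort values are made up for illustration, and the function names are hypothetical.

```python
from statistics import mean, median


def magnitude_of_relative_error(actual, predicted):
    """MRE for each project: |actual - predicted| / actual."""
    return [abs(a - p) / a for a, p in zip(actual, predicted)]


def mmre(actual, predicted) -> float:
    """Mean magnitude of relative error."""
    return mean(magnitude_of_relative_error(actual, predicted))


def mdmre(actual, predicted) -> float:
    """Median magnitude of relative error."""
    return median(magnitude_of_relative_error(actual, predicted))


def pred(actual, predicted, level: float = 0.25) -> float:
    """Pred(l): fraction of projects whose MRE is at most `level`."""
    errors = magnitude_of_relative_error(actual, predicted)
    return sum(1 for e in errors if e <= level) / len(errors)


# Illustrative effort values (person-hours), not taken from any real data set.
actual_effort = [100, 200, 400, 80]
estimated_effort = [110, 150, 400, 160]

print(f"MMRE     = {mmre(actual_effort, estimated_effort):.3f}")
print(f"MdMRE    = {mdmre(actual_effort, estimated_effort):.3f}")
print(f"Pred(25) = {pred(actual_effort, estimated_effort):.2f}")
```

Two techniques can produce similar summary values on the same data set, which is why a significance test on the residuals is still needed to decide whether they differ.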


Author(s):  
German Perlovich ◽  
Artem Surov

In this work, a database containing thermochemical and structural information about 208 monotropic polymorphic forms has been created and analyzed. Most of the identified compounds (77 cases) have been found to have two polymorphs, 14 compounds have three forms and there are only three examples of systems with four polymorphs. The analysis of density distribution within the database has revealed that only 62 out of 114 metastable polymorphs (referred to as group I) obey the 'density rule' proposed by Burger and Ramberger [(1979), Mikrochim. Acta, 72, 259–271], while the remaining 45% of the monotropic systems (group II) violate the rule. A number of physicochemical, structural and molecular descriptors have been used to find and highlight the differences between group I and group II of the polymorphs. Group II is characterized (on average) by higher values of descriptors, which are responsible for conformational flexibility of molecules. An algorithm has been proposed for carrying out bivariate statistical analysis. It implies partitioning the database into structurally related clusters based on Tanimoto similarity coefficients and subsequent analysis of each cluster in terms of the number of hydrogen bonds per molecule.


2016 ◽  
Vol 5 (5) ◽  
pp. 16 ◽  
Author(s):  
Guolong Zhao

To evaluate a drug, statistical significance alone is insufficient; clinical significance is also necessary. This paper explains how to analyze clinical data while considering both statistical and clinical significance. The analysis is carried out by combining a confidence interval under the null hypothesis with one under a non-null hypothesis. The combination yields one of four possible results: (i) both significant, (ii) only significant in the former, (iii) only significant in the latter or (iv) neither significant. The four results constitute a quadripartite procedure. Corresponding tests are described to characterize Type I error rates and power. The empirical coverage is demonstrated by Monte Carlo simulations. In superiority trials, the four results are interpreted as clinical superiority, statistical superiority, non-superiority and indeterminate, respectively. The interpretation is reversed in inferiority trials. The combination entails a deflated Type I error rate, decreased power and an increased sample size. The four results may be helpful for a meticulous evaluation of drugs. Of these, non-superiority is another profile of equivalence and so it can also be used to interpret equivalence. This approach may make it easier to interpret discordant cases. Nevertheless, a larger data set is usually needed. An example is taken from a real trial in naturally acquired influenza.


2018 ◽  
Vol 57 (6) ◽  
pp. 773-780 ◽  
Author(s):  
Elizabet D’hooge ◽  
Pierre Becker ◽  
Dirk Stubbe ◽  
Anne-Cécile Normand ◽  
Renaud Piarroux ◽  
...  

ABSTRACT Aspergillus section Nigri is a taxonomically difficult but medically and economically important group. In this study, an update of the taxonomy of A. section Nigri strains within the BCCM/IHEM collection has been conducted. The identification accuracy of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) was tested and the antifungal susceptibilities of clinical isolates were evaluated. A total of 175 strains were molecularly analyzed. Three regions were amplified (ITS, benA, and caM) and a multi-locus phylogeny of the combined loci was created by using maximum likelihood analysis. The in-house MALDI-TOF MS reference database was extended and an identification data set of 135 strains was run against a reference data set. Antifungal susceptibility was tested for voriconazole, itraconazole, and amphotericin B, using the EUCAST method. Phylogenetic analysis revealed 18 species in our data set. MALDI-TOF MS was able to distinguish between A. brasiliensis, A. brunneoviolaceus, A. neoniger, A. niger, A. tubingensis, and A. welwitschiae of A. sect. Nigri. In the routine clinical lab, isolates of A. sect. Nigri are often identified as A. niger. However, in the clinical isolates of our data set, A. tubingensis (n = 35) and A. welwitschiae (n = 34) are more common than A. niger (n = 9). Decreased antifungal susceptibility to azoles was observed in clinical isolates of the /tubingensis clade. This emphasizes the importance of identification up to species level or at least up to clade level in the clinical lab. Our results indicate that MALDI-TOF MS can be a powerful tool to replace classical morphology.


2019 ◽  
Vol 18 ◽  
pp. 117693511983554 ◽  
Author(s):  
Ophir Gal ◽  
Noam Auslander ◽  
Yu Fan ◽  
Daoud Meerzaman

Machine learning (ML) is a useful tool for advancing our understanding of the patterns and significance of biomedical data. Given the growing trend in the application of ML techniques in precision medicine, here we present an ML technique which predicts the likelihood of complete remission (CR) in patients diagnosed with acute myeloid leukemia (AML). In this study, we explored the question of whether ML algorithms designed to analyze gene-expression patterns obtained through RNA sequencing (RNA-seq) can be used to accurately predict the likelihood of CR in pediatric AML patients who have received induction therapy. We employed tests of statistical significance to determine which genes were differentially expressed in the samples derived from patients who achieved CR after 2 courses of treatment and the samples taken from patients who did not benefit. We tuned classifier hyperparameters to optimize performance and used multiple methods to guide our feature selection as well as our assessment of algorithm performance. To identify the model which performed best within the context of this study, we plotted receiver operating characteristic (ROC) curves. Using the top 75 genes from the k-nearest neighbors algorithm (K-NN) model (K = 27) yielded the best area-under-the-curve (AUC) score that we obtained: 0.84. When we finally tested the previously unseen test data set, the top 50 genes yielded the best AUC = 0.81. Pathway enrichment analysis for these 50 genes showed that the guanosine diphosphate fucose (GDP-fucose) biosynthesis pathway is the most significant with an adjusted P value = .0092, which may suggest the vital role of N-glycosylation in AML.
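The AUC values quoted above summarize a ROC curve in a single number. One way to see what that number means is its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes AUC that way in plain Python; it is an illustration of the metric only, not the study's evaluation code, and the labels and scores are invented.

```python
def roc_auc(labels, scores) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels (1 = positive, e.g. complete remission)
    scores: classifier scores, higher meaning "more likely positive"
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Invented example: four patients with true outcome and predicted score.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(f"AUC = {roc_auc(labels, scores):.2f}")
```

The pairwise form is quadratic in the number of samples, which is fine for small cohorts; sorting-based implementations are preferable for large ones.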


2020 ◽  
Vol 492 (3) ◽  
pp. 4469-4476 ◽  
Author(s):  
E de Carvalho ◽  
A Bernui ◽  
H S Xavier ◽  
C P Novaes

ABSTRACT The clustering properties of the Universe at large scales are currently being probed at various redshifts through several cosmological tracers and with diverse statistical estimators. Here we use the three-point angular correlation function (3PACF) to probe the baryon acoustic oscillation (BAO) features in the quasars catalogue from the Sloan Digital Sky Survey Data Release 12, with mean redshift $\overline{z} = 2.225$, detecting the BAO imprint with a statistical significance of $2.9 \sigma$, obtained using lognormal mocks. Following a quasi-model-independent approach for the 3PACF, we find the BAO transversal signature for triangles with sides $\theta_1 = 1.0^{\circ}$ and $\theta_2 = 1.5^{\circ}$ and the angle between them of $\alpha = 1.59 \pm 0.17$ rad, a value that corresponds to the angular BAO scale $\theta_{\rm BAO} = 1.82^{\circ} \pm 0.21^{\circ}$, in excellent agreement with the value found in a recent work ($\theta_{\rm BAO} = 1.77^{\circ} \pm 0.31^{\circ}$) applying the two-point angular correlation function (2PACF) to similar data. Moreover, we performed two types of test: one to confirm the robustness of the BAO signal in the 3PACF through random displacements in the data set, and the other to verify the suitability of our random samples, a null test that in fact does not show any signature that could bias our results.


Symmetry ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1832
Author(s):  
Tomasz Hachaj ◽  
Patryk Mazurek

Deep learning-based feature extraction methods and transfer learning have become common approaches in the field of pattern recognition. Deep convolutional neural networks trained using triplet-based loss functions allow for the generation of face embeddings, which can be directly applied to face verification and clustering. Knowledge about the ground truth of face identities might improve the effectiveness of the final classification algorithm; however, it is also possible to use ground truth clusters previously discovered using an unsupervised approach. The aim of this paper is to evaluate the potential improvement of classification results of state-of-the-art supervised classification methods trained with and without ground truth knowledge. In this study, we use two sufficiently large data sets containing more than 200,000 “taken in the wild” images, each with various resolutions, visual quality, and face poses which, in our opinion, guarantee the statistical significance of the results. We examine several clustering and supervised pattern recognition algorithms and find that knowledge about the ground truth has a very small influence on the Fowlkes–Mallows score (FMS) of the classification algorithm. In the case of the classification algorithm that obtained the highest accuracy in our experiment, the FMS improved by only 5.3% (from 0.749 to 0.791) in the first data set and by 6.6% (from 0.652 to 0.718) in the second data set. Our results show that, except in highly secure systems in which face verification is a key component, face identities discovered by unsupervised approaches can be safely used for training supervised classifiers. We also found that the Silhouette Coefficient (SC) of unsupervised clustering is positively correlated with the Adjusted Rand Index, V-measure score, and Fowlkes–Mallows score, so the SC can be used as an indicator of clustering performance when the ground truth of face identities is not known. All of these conclusions are important findings for large-scale face verification problems, because skipping the verification of people’s identities before supervised training saves a lot of time and resources.
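The Fowlkes–Mallows score used above can be computed by counting pairs of samples: it is the geometric mean of pairwise precision and recall, FMS = TP / sqrt((TP + FP)(TP + FN)). The following is a small pair-counting sketch of that standard definition, with invented cluster labels; it is not the paper's code and is quadratic in the number of samples, so it is only suitable for small examples.

```python
import math
from itertools import combinations


def fowlkes_mallows(true_labels, predicted_labels) -> float:
    """Fowlkes-Mallows score from pair counts.

    TP: pairs in the same cluster in both labelings
    FP: pairs together only in the predicted labeling
    FN: pairs together only in the true labeling
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))


# Invented identities for six face images and two candidate clusterings.
identities = [0, 0, 0, 1, 1, 2]
clustering_a = [5, 5, 5, 7, 7, 9]   # same grouping, different label names
clustering_b = [5, 5, 7, 7, 7, 9]   # one image assigned to the wrong group

print(f"FMS (a) = {fowlkes_mallows(identities, clustering_a):.3f}")
print(f"FMS (b) = {fowlkes_mallows(identities, clustering_b):.3f}")
```

Because only co-membership of pairs matters, the score is unaffected by how the clusters are named, which is what allows clusters found without supervision to be compared against known identities.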

