Prediction of Eye Colour in Scandinavians Using the EyeColour 11 (EC11) SNP Set

Description of a perpetrator’s eye colour can be an important investigative lead in a forensic case with no apparent suspects. Herein, we present 11 SNPs (Eye Colour 11-EC11) that are important for eye colour prediction and eye colour prediction models for a two-category reporting system (blue and brown) and a three-category system (blue, intermediate, and brown). The EC11 SNPs were carefully selected from 44 pigmentary variants in seven genes previously found to be associated with eye colours in 757 Europeans (Danes, Swedes, and Italians). Mathematical models using three different reporting systems: a quantitative system (PIE-score), a two-category system (blue and brown), and a three-category system (blue, intermediate, brown) were used to rank the variants. SNPs with a sufficient mean variable importance (above 0.3%) were selected for EC11. Eye colour prediction models using the EC11 SNPs were developed using leave-one-out cross-validation (LOOCV) in an independent data set of 523 Norwegian individuals. Performance of the EC11 models for the two- and three-category system was compared with models based on the IrisPlex SNPs and the most important eye colour locus, rs12913832. We also compared model performances with the IrisPlex online tool (IrisPlex Web). The EC11 eye colour prediction models performed slightly better than the IrisPlex and rs12913832 models in all reporting systems and better than the IrisPlex Web in the three-category system. Three important points to consider prior to the implementation of eye colour prediction in a forensic genetic setting are discussed: (1) the reference population, (2) the SNP set, and (3) the reporting strategy.

Download Full-text

Evaluation for estimating of the PDF and the CDF of Generalized Inverted Exponential Distribution with Application in Industry

Advances in Mathematics: Scientific Journal ◽

10.37418/amsj.9.1.39 ◽

2020 ◽

pp. 507-522

Author(s):

Parisa Torkaman

Keyword(s):

Least Squares ◽

Exponential Distribution ◽

Mean Squared Error ◽

Weighted Least Squares ◽

Real Data ◽

Minimum Variance ◽

Cumulative Distribution ◽

Estimation Methods ◽

Data Set ◽

Better Than

The generalized inverted exponential distribution is introduced as a lifetime model with good statistical properties. This paper, the estimation of the probability density function and the cumulative distribution function of with five different estimation methods: uniformly minimum variance unbiased(UMVU), maximum likelihood(ML), least squares(LS), weighted least squares (WLS) and percentile(PC) estimators are considered. The performance of these estimation procedures, based on the mean squared error (MSE) by numerical simulations are compared. Simulation studies express that the UMVU estimator performs better than others and when the sample size is large enough the ML and UMVU estimators are almost equivalent and efficient than LS, WLS and PC. Finally, the result using a real data set are analyzed.

Download Full-text

A Computational Method for the Identification of Endolysins and Autolysins

Protein and Peptide Letters ◽

10.2174/0929866526666191002104735 ◽

2020 ◽

Vol 27 (4) ◽

pp. 329-336 ◽

Cited By ~ 1

Author(s):

Lei Xu ◽

Guangmin Liang ◽

Baowen Chen ◽

Xu Tan ◽

Huaikun Xiang ◽

...

Keyword(s):

Support Vector Machine ◽

Cell Wall ◽

Experimental Results ◽

Computational Method ◽

Lytic Enzyme ◽

Support Vector ◽

Lytic Enzymes ◽

Data Set ◽

Optimal Feature ◽

Better Than

Background: Cell lytic enzyme is a kind of highly evolved protein, which can destroy the cell structure and kill the bacteria. Compared with antibiotics, cell lytic enzyme will not cause serious problem of drug resistance of pathogenic bacteria. Thus, the study of cell wall lytic enzymes aims at finding an efficient way for curing bacteria infectious. Compared with using antibiotics, the problem of drug resistance becomes more serious. Therefore, it is a good choice for curing bacterial infections by using cell lytic enzymes. Cell lytic enzyme includes endolysin and autolysin and the difference between them is the purpose of the break of cell wall. The identification of the type of cell lytic enzymes is meaningful for the study of cell wall enzymes. Objective: In this article, our motivation is to predict the type of cell lytic enzyme. Cell lytic enzyme is helpful for killing bacteria, so it is meaningful for study the type of cell lytic enzyme. However, it is time consuming to detect the type of cell lytic enzyme by experimental methods. Thus, an efficient computational method for the type of cell lytic enzyme prediction is proposed in our work. Method: We propose a computational method for the prediction of endolysin and autolysin. First, a data set containing 27 endolysins and 41 autolysins is built. Then the protein is represented by tripeptides composition. The features are selected with larger confidence degree. At last, the classifier is trained by the labeled vectors based on support vector machine. The learned classifier is used to predict the type of cell lytic enzyme. Results: Following the proposed method, the experimental results show that the overall accuracy can attain 97.06%, when 44 features are selected. Compared with Ding's method, our method improves the overall accuracy by nearly 4.5% ((97.06-92.9)/92.9%). The performance of our proposed method is stable, when the selected feature number is from 40 to 70. The overall accuracy of tripeptides optimal feature set is 94.12%, and the overall accuracy of Chou's amphiphilic PseAAC method is 76.2%. The experimental results also demonstrate that the overall accuracy is improved by nearly 18% when using the tripeptides optimal feature set. Conclusion: The paper proposed an efficient method for identifying endolysin and autolysin. In this paper, support vector machine is used to predict the type of cell lytic enzyme. The experimental results show that the overall accuracy of the proposed method is 94.12%, which is better than some existing methods. In conclusion, the selected 44 features can improve the overall accuracy for identification of the type of cell lytic enzyme. Support vector machine performs better than other classifiers when using the selected feature set on the benchmark data set.

Download Full-text

In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Taking advantage of finding non-intuitive regularities on high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM) were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods such as yrandomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is full of help for designing novel thrombin inhibitors.

Download Full-text

CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores

Genome Medicine ◽

10.1186/s13073-021-00835-9 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Philipp Rentzsch ◽

Max Schubach ◽

Jay Shendure ◽

Martin Kircher

Keyword(s):

Prediction Models ◽

Splice Variants ◽

Superior Performance ◽

Data Set ◽

Pathogenic Variants ◽

Genome Wide ◽

Donor And Acceptor ◽

Human Proteins ◽

Variant Effect ◽

Variant Effect Prediction

Abstract Background Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. Methods It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. Results We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. Conclusions While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

Download Full-text

Applying Artificial Neural Networks. I. Estimating Nicotine in Tobacco from near Infrared Data

Journal of Near Infrared Spectroscopy ◽

10.1255/jnirs.64 ◽

1995 ◽

Vol 3 (3) ◽

pp. 133-142 ◽

Cited By ~ 10

Author(s):

M. Hana ◽

W.F. McClure ◽

T.B. Whitaker ◽

M. White ◽

D.R. Bahler

Keyword(s):

Linear Regression ◽

Regression Model ◽

Linear Regression Model ◽

Near Infrared ◽

Back Propagation ◽

Linear Network ◽

Data Set ◽

Input Layer ◽

Propagation Network ◽

Better Than

Two artificial neural network models were used to estimate the nicotine in tobacco: (i) a back-propagation network and (ii) a linear network. The back-propagation network consisted of an input layer, an output layer and one hidden layer. The linear network consisted of an input layer and an output layer. Both networks used the generalised delta rule for learning. Performances of both networks were compared to the multiple linear regression method MLR of calibration. The nicotine content in tobacco samples was estimated for two different data sets. Data set A contained 110 near infrared (NIR) spectra each consisting of reflected energy at eight wavelengths. Data set B consisted of 200 NIR spectra with each spectrum having 840 spectral data points. The Fast Fourier transformation was applied to data set B in order to compress each spectrum into 13 Fourier coefficients. For data set A, the linear regression model gave better results followed by the back-propagation network which was followed by the linear network. The true performance of the linear regression model was better than the back-propagation and the linear networks by 14.0% and 18.1%, respectively. For data set B, the back-propagation network gave the best result followed by MLR and the linear network. Both the linear network and MLR models gave almost the same results. The true performance of the back-propagation network model was better than the MLR and linear network by 35.14%.

Download Full-text

A simple clustering technique to extract subsets of data for function approximation

Journal of Hydroinformatics ◽

10.2166/hydro.2015.065 ◽

2015 ◽

Vol 17 (5) ◽

pp. 719-732

Author(s):

Dulakshi Santhusitha Kumari Karunasingha ◽

Shie-Yui Liong

Keyword(s):

Function Approximation ◽

Prediction Models ◽

Data Extraction ◽

Single Parameter ◽

Subtractive Clustering ◽

Data Sets ◽

Clustering Methods ◽

Clustering Method ◽

Data Set ◽

Functional Relationships

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset of data is to use it to build prediction models (of the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets of data are often required in exploratory analysis stages of studies that involve resource consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method, which requires only a single parameter to be specified, yet it is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Due to having only a single parameter, using the proposed clustering method is shown to be orders of magnitudes more efficient than using SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied for data extraction are identified, and the proposed method is shown to be a solution for them.

Download Full-text

Initializing numerical weather prediction models with satellite-derived surface soil moisture: Data assimilation experiments with ECMWF's Integrated Forecast System and the TMI soil moisture data set

Journal of Geophysical Research Atmospheres ◽

10.1029/2006jd007478 ◽

2007 ◽

Vol 112 (D3) ◽

Cited By ~ 126

Author(s):

M. Drusch

Keyword(s):

Soil Moisture ◽

Numerical Weather Prediction ◽

Prediction Models ◽

Surface Soil ◽

Weather Prediction ◽

Surface Soil Moisture ◽

Forecast System ◽

Data Set ◽

Soil Moisture Data ◽

Numerical Weather Prediction Models

Download Full-text

A new approach to categorising continuous variables in prediction models: Proposal and validation

Statistical Methods in Medical Research ◽

10.1177/0962280215601873 ◽

2015 ◽

Vol 26 (6) ◽

pp. 2586-2602 ◽

Cited By ~ 30

Author(s):

Irantzu Barrio ◽

Inmaculada Arostegui ◽

María-Xosé Rodríguez-Álvarez ◽

José-María Quintana

Keyword(s):

Prediction Models ◽

Characteristic Curve ◽

Prediction Rule ◽

Real Data ◽

Clinical Variable ◽

Chronic Obstructive ◽

Continuous Variables ◽

Discriminative Ability ◽

Data Set ◽

Statistical Point

When developing prediction models for application in clinical practice, health practitioners usually categorise clinical variables that are continuous in nature. Although categorisation is not regarded as advisable from a statistical point of view, due to loss of information and power, it is a common practice in medical research. Consequently, providing researchers with a useful and valid categorisation method could be a relevant issue when developing prediction models. Without recommending categorisation of continuous predictors, our aim is to propose a valid way to do it whenever it is considered necessary by clinical researchers. This paper focuses on categorising a continuous predictor within a logistic regression model, in such a way that the best discriminative ability is obtained in terms of the highest area under the receiver operating characteristic curve (AUC). The proposed methodology is validated when the optimal cut points’ location is known in theory or in practice. In addition, the proposed method is applied to a real data-set of patients with an exacerbation of chronic obstructive pulmonary disease, in the context of the IRYSS-COPD study where a clinical prediction rule for severe evolution was being developed. The clinical variable PCO2 was categorised in a univariable and a multivariable setting.

Download Full-text

Building prediction models for coronary heart disease by synthesizing multiple longitudinal research findings

European Journal of Cardiovascular Prevention & Rehabilitation ◽

10.1097/01.hjr.0000173109.14228.71 ◽

2005 ◽

Vol 12 (5) ◽

pp. 459-464 ◽

Cited By ~ 4

Author(s):

Guizhou Hu ◽

Martin M. Root

Keyword(s):

Coronary Heart Disease ◽

Heart Disease ◽

Prediction Model ◽

Empirical Model ◽

Complex Disease ◽

Prediction Models ◽

Longitudinal Research ◽

Study Data ◽

Individual Risk ◽

Data Set

Background No methodology is currently available to allow the combining of individual risk factor information derived from different longitudinal studies for a chronic disease in a multivariate fashion. This paper introduces such a methodology, named Synthesis Analysis, which is essentially a multivariate meta-analytic technique. Design The construction and validation of statistical models using available data sets. Methods and results Two analyses are presented. (1) With the same data, Synthesis Analysis produced a similar prediction model to the conventional regression approach when using the same risk variables. Synthesis Analysis produced better prediction models when additional risk variables were added. (2) A four-variable empirical logistic model for death from coronary heart disease was developed with data from the Framingham Heart Study. A synthesized prediction model with five new variables added to this empirical model was developed using Synthesis Analysis and literature information. This model was then compared with the four-variable empirical model using the first National Health and Nutrition Examination Survey (NHANES I) Epidemiologic Follow-up Study data set. The synthesized model had significantly improved predictive power ( x2 = 43.8, P < 0.00001). Conclusions Synthesis Analysis provides a new means of developing complex disease predictive models from the medical literature.

Download Full-text

Fast effect size shrinkage software for beta-binomial models of allelic imbalance

F1000Research ◽

10.12688/f1000research.20916.2 ◽

2020 ◽

Vol 8 ◽

pp. 2024

Author(s):

Joshua P. Zitovsky ◽

Michael I. Love

Keyword(s):

Allelic Imbalance ◽

Real Data ◽

Shrinkage Estimators ◽

Data Set ◽

Bayesian Shrinkage ◽

In Cis ◽

Posterior Estimation ◽

Binomial Models ◽

Better Than ◽

Diploid Organism

Allelic imbalance occurs when the two alleles of a gene are differentially expressed within a diploid organism and can indicate important differences in cis-regulation and epigenetic state across the two chromosomes. Because of this, the ability to accurately quantify the proportion at which each allele of a gene is expressed is of great interest to researchers. This becomes challenging in the presence of small read counts and/or sample sizes, which can cause estimators for allelic expression proportions to have high variance. Investigators have traditionally dealt with this problem by filtering out genes with small counts and samples. However, this may inadvertently remove important genes that have truly large allelic imbalances. Another option is to use pseudocounts or Bayesian estimators to reduce the variance. To this end, we evaluated the accuracy of four different estimators, the latter two of which are Bayesian shrinkage estimators: maximum likelihood, adding a pseudocount to each allele, approximate posterior estimation of GLM coefficients (apeglm) and adaptive shrinkage (ash). We also wrote C++ code to quickly calculate ML and apeglm estimates and integrated it into the apeglm package. The four methods were evaluated on two simulations and one real data set. Apeglm consistently performed better than ML according to a variety of criteria, and generally outperformed use of pseudocounts as well. Ash also performed better than ML in one of the simulations, but in the other performance was more mixed. Finally, when compared to five other packages that also fit beta-binomial models, the apeglm package was substantially faster and more numerically reliable, making our package useful for quick and reliable analyses of allelic imbalance. Apeglm is available as an R/Bioconductor package at http://bioconductor.org/packages/apeglm.

Download Full-text