Development of Prediction Models Using Machine Learning Algorithms for Girls with Suspected Central Precocious Puberty: Retrospective Study

Liyan Pan; Guangjian Liu; Xiaojian Mao; Huixian Li; Jiexin Zhang; Huiying Liang; Xiuzhen Li

doi:10.2196/11728

Development of Prediction Models Using Machine Learning Algorithms for Girls with Suspected Central Precocious Puberty: Retrospective Study (Preprint)

10.2196/preprints.11728 ◽

2018 ◽

Author(s):

Liyan Pan ◽

Guangjian Liu ◽

Xiaojian Mao ◽

Huixian Li ◽

Jiexin Zhang ◽

...

Keyword(s):

Machine Learning ◽

Retrospective Study ◽

Random Forest ◽

Precocious Puberty ◽

Prediction Models ◽

Central Precocious Puberty ◽

Machine Learning Algorithms ◽

Stimulation Test ◽

Gnrh Analogue ◽

Prediction Probability

BACKGROUND Central precocious puberty (CPP) in girls seriously affects their physical and mental development in childhood. The method of diagnosis—gonadotropin-releasing hormone (GnRH)–stimulation test or GnRH analogue (GnRHa)–stimulation test—is expensive and makes patients uncomfortable due to the need for repeated blood sampling. OBJECTIVE We aimed to combine multiple CPP–related features and construct machine learning models to predict response to the GnRHa-stimulation test. METHODS In this retrospective study, we analyzed clinical and laboratory data of 1757 girls who underwent a GnRHa test in order to develop XGBoost and random forest classifiers for prediction of response to the GnRHa test. The local interpretable model-agnostic explanations (LIME) algorithm was used with the black-box classifiers to increase their interpretability. We measured sensitivity, specificity, and area under receiver operating characteristic (AUC) of the models. RESULTS Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with the AUC ranging from 0.88 to 0.90, sensitivity ranging from 77.91% to 77.94%, and specificity ranging from 84.32% to 87.66%. Basal serum luteinizing hormone, follicle-stimulating hormone, and insulin-like growth factor-I levels were found to be the three most important factors. In the interpretable models of LIME, the abovementioned variables made high contributions to the prediction probability. CONCLUSIONS The prediction models we developed can help diagnose CPP and may be used as a prescreening tool before the GnRHa-stimulation test.

Download Full-text

Prediction Models for Girls with Suspected Central Precocious Puberty Using Machine Learning Algorithms

SSRN Electronic Journal ◽

10.2139/ssrn.3215549 ◽

2018 ◽

Author(s):

Liyan Pan ◽

Guangjian Liu ◽

Xiaojian Mao ◽

Huixian Li ◽

Jiexin Zhang ◽

...

Keyword(s):

Machine Learning ◽

Precocious Puberty ◽

Prediction Models ◽

Learning Algorithms ◽

Central Precocious Puberty ◽

Machine Learning Algorithms

Download Full-text

Predicting hospitalization following psychiatric crisis care using machine learning

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-020-01361-1 ◽

2020 ◽

Vol 20 (1) ◽

Author(s):

Matthijs Blankers ◽

Louk F. M. van der Post ◽

Jack J. M. Dekker

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Prediction Models ◽

Learning Algorithms ◽

Nearest Neighbors ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Ensemble Model ◽

K Nearest Neighbors ◽

Crisis Care

Abstract Background Accurate prediction models for whether patients on the verge of a psychiatric criseis need hospitalization are lacking and machine learning methods may help improve the accuracy of psychiatric hospitalization prediction models. In this paper we evaluate the accuracy of ten machine learning algorithms, including the generalized linear model (GLM/logistic regression) to predict psychiatric hospitalization in the first 12 months after a psychiatric crisis care contact. We also evaluate an ensemble model to optimize the accuracy and we explore individual predictors of hospitalization. Methods Data from 2084 patients included in the longitudinal Amsterdam Study of Acute Psychiatry with at least one reported psychiatric crisis care contact were included. Target variable for the prediction models was whether the patient was hospitalized in the 12 months following inclusion. The predictive power of 39 variables related to patients’ socio-demographics, clinical characteristics and previous mental health care contacts was evaluated. The accuracy and area under the receiver operating characteristic curve (AUC) of the machine learning algorithms were compared and we also estimated the relative importance of each predictor variable. The best and least performing algorithms were compared with GLM/logistic regression using net reclassification improvement analysis and the five best performing algorithms were combined in an ensemble model using stacking. Results All models performed above chance level. We found Gradient Boosting to be the best performing algorithm (AUC = 0.774) and K-Nearest Neighbors to be the least performing (AUC = 0.702). The performance of GLM/logistic regression (AUC = 0.76) was slightly above average among the tested algorithms. In a Net Reclassification Improvement analysis Gradient Boosting outperformed GLM/logistic regression by 2.9% and K-Nearest Neighbors by 11.3%. GLM/logistic regression outperformed K-Nearest Neighbors by 8.7%. Nine of the top-10 most important predictor variables were related to previous mental health care use. Conclusions Gradient Boosting led to the highest predictive accuracy and AUC while GLM/logistic regression performed average among the tested algorithms. Although statistically significant, the magnitude of the differences between the machine learning algorithms was in most cases modest. The results show that a predictive accuracy similar to the best performing model can be achieved when combining multiple algorithms in an ensemble model.

Download Full-text

Software Maintainability: Systematic Literature Review and Current Trends

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194016500431 ◽

2016 ◽

Vol 26 (08) ◽

pp. 1221-1253 ◽

Cited By ~ 16

Author(s):

Ruchika Malhotra ◽

Anuradha Chug

Keyword(s):

Machine Learning ◽

Prediction Model ◽

Evaluation System ◽

Prediction Models ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Design Metrics ◽

Current Trends ◽

Software Maintainability ◽

Early Phases

Software maintenance is an expensive activity that consumes a major portion of the cost of the total project. Various activities carried out during maintenance include the addition of new features, deletion of obsolete code, correction of errors, etc. Software maintainability means the ease with which these operations can be carried out. If the maintainability can be measured in early phases of the software development, it helps in better planning and optimum resource utilization. Measurement of design properties such as coupling, cohesion, etc. in early phases of development often leads us to derive the corresponding maintainability with the help of prediction models. In this paper, we performed a systematic review of the existing studies related to software maintainability from January 1991 to October 2015. In total, 96 primary studies were identified out of which 47 studies were from journals, 36 from conference proceedings and 13 from others. All studies were compiled in structured form and analyzed through numerous perspectives such as the use of design metrics, prediction model, tools, data sources, prediction accuracy, etc. According to the review results, we found that the use of machine learning algorithms in predicting maintainability has increased since 2005. The use of evolutionary algorithms has also begun in related sub-fields since 2010. We have observed that design metrics is still the most favored option to capture the characteristics of any given software before deploying it further in prediction model for determining the corresponding software maintainability. A significant increase in the use of public dataset for making the prediction models has also been observed and in this regard two public datasets User Interface Management System (UIMS) and Quality Evaluation System (QUES) proposed by Li and Henry is quite popular among researchers. Although machine learning algorithms are still the most popular methods, however, we suggest that researchers working on software maintainability area should experiment on the use of open source datasets with hybrid algorithms. In this regard, more empirical studies are also required to be conducted on a large number of datasets so that a generalized theory could be made. The current paper will be beneficial for practitioners, researchers and developers as they can use these models and metrics for creating benchmark and standards. Findings of this extensive review would also be useful for novices in the field of software maintainability as it not only provides explicit definitions, but also lays a foundation for further research by providing a quick link to all important studies in the said field. Finally, this study also compiles current trends, emerging sub-fields and identifies various opportunities of future research in the field of software maintainability.

Download Full-text

Targeting Virus-host Protein Interactions: Feature Extraction and Machine Learning Approaches

Current Drug Metabolism ◽

10.2174/1389200219666180829121038 ◽

2019 ◽

Vol 20 (3) ◽

pp. 177-184 ◽

Cited By ~ 16

Author(s):

Nantao Zheng ◽

Kairou Wang ◽

Weihua Zhan ◽

Lei Deng

Keyword(s):

Machine Learning ◽

Computational Methods ◽

Protein Interactions ◽

Prediction Models ◽

Learning Algorithms ◽

Biological Data ◽

Machine Learning Algorithms ◽

Host Protein ◽

Protein Protein Interactions ◽

Protein Motifs

Background:Targeting critical viral-host Protein-Protein Interactions (PPIs) has enormous application prospects for therapeutics. Using experimental methods to evaluate all possible virus-host PPIs is labor-intensive and time-consuming. Recent growth in computational identification of virus-host PPIs provides new opportunities for gaining biological insights, including applications in disease control. We provide an overview of recent computational approaches for studying virus-host PPI interactions.Methods:In this review, a variety of computational methods for virus-host PPIs prediction have been surveyed. These methods are categorized based on the features they utilize and different machine learning algorithms including classical and novel methods.Results:We describe the pivotal and representative features extracted from relevant sources of biological data, mainly include sequence signatures, known domain interactions, protein motifs and protein structure information. We focus on state-of-the-art machine learning algorithms that are used to build binary prediction models for the classification of virus-host protein pairs and discuss their abilities, weakness and future directions.Conclusion:The findings of this review confirm the importance of computational methods for finding the potential protein-protein interactions between virus and host. Although there has been significant progress in the prediction of virus-host PPIs in recent years, there is a lot of room for improvement in virus-host PPI prediction.

Download Full-text

Mapping (un)certainty of machine learning-based spatial prediction models based on predictor space distances

10.5194/egusphere-egu2020-8492 ◽

2020 ◽

Author(s):

Hanna Meyer ◽

Edzer Pebesma

Keyword(s):

Machine Learning ◽

Spatial Patterns ◽

Environmental Science ◽

Prediction Models ◽

Learning Algorithms ◽

Predictor Variable ◽

Spatial Prediction ◽

Machine Learning Algorithms ◽

Training Data ◽

Field Samples

Spatial mapping is an important task in environmental science to reveal spatial patterns and changes of the environment. In this context predictive modelling using flexible machine learning algorithms has become very popular. However, looking at the diversity of modelled (global) maps of environmental variables, there might be increasingly the impression that machine learning is a magic tool to map everything. Recently, the reliability of such maps have been increasingly questioned, calling for a reliable quantification of uncertainties.Though spatial (cross-)validation allows giving a general error estimate for the predictions, models are usually applied to make predictions for a much larger area or might even be transferred to make predictions for an area where they were not trained on. But by making predictions on heterogeneous landscapes, there will be areas that feature environmental properties that have not been observed in the training data and hence not learned by the algorithm. This is problematic as most machine learning algorithms are weak in extrapolations and can only make reliable predictions for environments with conditions the model has knowledge about. Hence predictions for environmental conditions that differ significantly from the training data have to be considered as uncertain.To approach this problem, we suggest a measure of uncertainty that allows identifying locations where predictions should be regarded with care. The proposed uncertainty measure is based on distances to the training data in the multidimensional predictor variable space. However, distances are not equally relevant within the feature space but some variables are more important than others in the machine learning model and hence are mainly responsible for prediction patterns. Therefore, we weight the distances by the model-derived importance of the predictors.&#160;As a case study we use a simulated area-wide response variable for Europe, bio-climatic variables as predictors, as well as simulated field samples. Random Forest is applied as algorithm to predict the simulated response. The model is then used to make predictions for entire Europe. We then calculate the corresponding uncertainty and compare it to the area-wide true prediction error.&#160;The results show that the uncertainty map reflects the patterns in the true error very well and considerably outperforms ensemble-based standard deviations of predictions as indicator for uncertainty.The resulting map of uncertainty gives valuable insights into spatial patterns of prediction uncertainty which is important when the predictions are used as a baseline for decision making or subsequent environmental modelling. Hence, we suggest that a map of distance-based uncertainty should be given in addition to prediction maps.

Download Full-text

Prediction models with multiple machine learning algorithms for POPs: the calculation of PDMS-air partition coefficient from molecular descriptor

Journal of Hazardous Materials ◽

10.1016/j.jhazmat.2021.127037 ◽

2021 ◽

pp. 127037

Author(s):

Tengyi Zhu ◽

Cuicui Tao

Keyword(s):

Machine Learning ◽

Partition Coefficient ◽

Prediction Models ◽

Molecular Descriptor ◽

Learning Algorithms ◽

Machine Learning Algorithms

Download Full-text

Enhancing Generalizability of Predictive Models with Synergy of Data and Physics

Measurement Science and Technology ◽

10.1088/1361-6501/ac3944 ◽

2021 ◽

Author(s):

Yingjun Shen ◽

Zhe Song ◽

Andrew Kusiak

Keyword(s):

Machine Learning ◽

Data Mining ◽

Predictive Models ◽

Prediction Models ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Predictive Maintenance ◽

Divide And Conquer ◽

Less Is More ◽

Industrial Big Data

Abstract Wind farm needs prediction models for predictive maintenance. There is a need to predict values of non-observable parameters beyond ranges reflected in available data. A prediction model developed for one machine many not perform well in another similar machine. This is usually due to lack of generalizability of data-driven models. To increase generalizability of predictive models, this research integrates the data mining with first-principle knowledge. Physics-based principles are combined with machine learning algorithms through feature engineering, strong rules and divide-and-conquer. The proposed synergy concept is illustrated with the wind turbine blade icing prediction and achieves significant prediction accuracy across different turbines. The proposed process is widely accepted by wind energy predictive maintenance practitioners because of its simplicity and efficiency. Furthermore, the testing scores of KNN, CART and DNN algorithm are increased by 44.78%, 32.72% and 9.13% with our proposed process. We demonstrated the importance of embedding physical principles within the machine learning process, and also highlight an important point that the need for more complex machine learning algorithms in industrial big data mining is often much less than it is in other applications, making it essential to incorporate physics and follow “Less is More” philosophy.

Download Full-text

Machine Learning Housing Price Prediction in Petaling Jaya, Selangor, Malaysia

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1084.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 542-546

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Prediction Models ◽

Learning Algorithms ◽

Housing Price ◽

Machine Learning Algorithms ◽

Selling Price ◽

Prediction Problem ◽

Real Dataset ◽

Price Prediction

This paper demonstrates the utilization of machine learning algorithms in the prediction of housing selling prices on real dataset collected from the Petaling Jaya area, Selangor, Malaysia. To date, literature about research on machine learning prediction of housing selling price in Malaysia is scarce. This paper provides a brief review of the existing machine learning algorithms for the prediction problem and presents the characteristics of the collected datasets with different groups of feature selection. The findings indicate that using irrelevant features from the dataset can decrease the accuracy of the prediction models.

Download Full-text

Comparative analysis of multiple classification models to improve PM10 prediction performance

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v11i3.pp2500-2507 ◽

2021 ◽

Vol 11 (3) ◽

pp. 2500

Author(s):

Yong-Jin Jung ◽

Kyoung-Woo Cho ◽

Jong-Sung Lee ◽

Chang-Heon Oh

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Particulate Matter ◽

Prediction Models ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Occurrence Rate ◽

Classification Models ◽

Comparison Results ◽

Multiple Classification

With the increasing requirement of high accuracy for particulate matter prediction, various attempts have been made to improve prediction accuracy by applying machine learning algorithms. However, the characteristics of particulate matter and the problem of the occurrence rate by concentration make it difficult to train prediction models, resulting in poor prediction. In order to solve this problem, in this paper, we proposed multiple classification models for predicting particulate matter concentrations required for prediction by dividing them into AQI-based classes. We designed multiple classification models using logistic regression, decision tree, SVM and ensemble among the various machine learning algorithms. The comparison results of the performance of the four classification models through error matrices confirmed the f-score of 0.82 or higher for all the models other than the logistic regression model.

Download Full-text