Machine Learning in Updating Predictive Models of Planning and Scheduling Transportation Projects

A method combining machine learning and regression analysis to automatically and intelligently update predictive models used in the Kansas Department of Transportation’s (KDOT’s) internal management system is presented. The predictive models used by KDOT consist of planning factors (mathematical functions) and base quantities (constants). The duration of a functional unit (defined as a subactivity) is determined by the product of a planning factor and its base quantity. The availability of a large data base on projects executed over the past decade provided the opportunity to develop an automated process for updating predictive models by extracting information from historical data through machine learning. The learning process updates the predictive models in three stages. The first stage derives the numerical relationship between the duration of a functional unit and the project attributes recorded in the data base. The second stage finds the functional units with similar behavior; that is, it identifies functional units that can be described by the same shared planning factor scaled in terms of their own base quantities. The third stage generates new planning factors and base quantities. A system called PFactor, built on the three-stage learning process, shows good performance in updating KDOT’s predictive models.
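The product relationship described in this abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the linear form of `planning_factor`, the attribute `project_length_km`, the function names, and the numbers are hypothetical and are not taken from KDOT's system or from PFactor. The second function shows one plausible way to scale a shared planning factor to a functional unit's own base quantity, using a least-squares fit.

```python
from typing import Callable, Sequence


def planning_factor(project_length_km: float) -> float:
    """Hypothetical planning factor: a function of one project attribute."""
    return 1.0 + 0.05 * project_length_km


def predicted_duration(
    factor: Callable[[float], float], base_quantity: float, attribute: float
) -> float:
    """Duration of a functional unit = planning factor x base quantity."""
    return factor(attribute) * base_quantity


def fit_base_quantity(
    factor: Callable[[float], float],
    attributes: Sequence[float],
    durations: Sequence[float],
) -> float:
    """Least-squares scale b minimising sum((b * f(a) - d)^2) over the history."""
    f_values = [factor(a) for a in attributes]
    numerator = sum(f * d for f, d in zip(f_values, durations))
    denominator = sum(f * f for f in f_values)
    return numerator / denominator


# Made-up historical records for one functional unit (assumed data).
attributes = [2.0, 10.0, 20.0]
durations = [33.0, 45.0, 60.0]

base = fit_base_quantity(planning_factor, attributes, durations)
estimate = predicted_duration(planning_factor, base, 10.0)
```

With these made-up records the fitted base quantity is close to 30, so a 10 km project is estimated at roughly 45 time units. A functional unit with similar behavior would reuse `planning_factor` and differ only in its fitted base quantity.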

A Bayesian Machine Learning Approach for Efficient Integrity Management of Steel Lazy Wave Risers

Volume 4: Pipelines, Risers, and Subsea Systems ◽

10.1115/omae2020-18190 ◽

2020 ◽

Author(s):

Rasoul Hejazi ◽

Andrew Grime ◽

Mark Randolph ◽

Mike Efthymiou

Keyword(s):

Machine Learning ◽

Fatigue Failure ◽

Learning Process ◽

Predictive Models ◽

Structural Integrity ◽

Degradation Mechanism ◽

Data Driven ◽

Practical Implementation ◽

Integrity Management

In-service integrity management (IM) of steel lazy wave risers (SLWRs) can benefit significantly from quantitative assessment of the overall risk of system failure, as it can provide an effective tool for decision making. SLWRs are prone to fatigue failure within their touchdown zone (TDZ). This failure mode needs to be evaluated rigorously in riser IM processes because fatigue is an ongoing degradation mechanism threatening the structural integrity of risers throughout their service life. However, accurately evaluating the probability of fatigue failure for riser systems within a useful time frame is challenging due to the need to run a large number of nonlinear, dynamic numerical time domain simulations. Applying the Bayesian framework for machine learning, through the use of Gaussian Processes (GP) for regression, offers an attractive solution to overcome the burden of prohibitive simulation run times. GPs are stochastic, data-driven predictive models which incorporate the underlying physics of the problem in the learning process, and facilitate rapid probabilistic assessments with limited loss in accuracy. This paper proposes an efficient framework for practical implementation of a GP to create predictive models for the estimation of fatigue responses at SLWR hotspots. Such models are able to perform stochastic response prediction within a few milliseconds, thus enabling rapid prediction of the probability of SLWR fatigue failure. A realistic North West Shelf (NWS) case study is used to demonstrate the framework, comprising a 20” SLWR connected to a representative floating facility located in 950 m water depth. A full hindcast metocean dataset with associated statistical distributions is used for the riser long-term fatigue loading conditions. Numerical simulation and sampling techniques are adopted to generate a simulation-based dataset for training the data-driven model. In addition, a recently developed dimensionality reduction technique is employed to improve efficiency and reduce complexity of the learning process. The results show that the stochastic predictive models developed by the suggested framework can predict the long-term TDZ fatigue damage of SLWRs due to vessel motions with an acceptable level of accuracy for practical purposes.

SP-0551 Exploiting large data base to build robust predictive models: validation issues

Radiotherapy and Oncology ◽

10.1016/s0167-8140(19)30971-5 ◽

2019 ◽

Vol 133 ◽

pp. S290

Author(s):

T. Rancati

Keyword(s):

Data Base ◽

Predictive Models ◽

Large Data ◽

Large Data Base ◽

Models Validation

Benchmarking missing-values approaches for predictive models on health databases v2

10.17504/protocols.io.b3nfqmbn ◽

2022 ◽

Author(s):

Alexandre Perez-Lebel ◽

Gaël Varoquaux ◽

Marine Le Morvan ◽

Julie Josse ◽

Jean-Baptiste Poline

Keyword(s):

Machine Learning ◽

Predictive Models ◽

Missing Values ◽

State Of The Art ◽

Computational Cost ◽

Large Data ◽

Supervised Machine Learning ◽

Computational Time ◽

Generative Modeling ◽

Predictive Approaches

BACKGROUND As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative --rather than generative-- modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics. RESULTS Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain imaging one, a health survey and two intensive care ones. Using gradient-boosted trees, we compare native support for missing values with simple and state-of-the-art imputation prior to learning. We investigate prediction accuracy and computational time. For prediction after imputation, we find that adding an indicator to express which values have been imputed is important, suggesting that the data are missing not at random. Elaborate missing values imputation can improve prediction compared to simple strategies but requires longer computational time on large data. Learning trees that model missing values --with missing incorporated attribute-- leads to robust, fast, and well-performing predictive modeling. CONCLUSIONS Native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost. When using imputation, it is important to add indicator columns expressing which values have been imputed.
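The recommendation to add indicator columns when imputing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's code: column-mean imputation stands in for a "simple" strategy, missing entries are represented as `None`, and the function name `impute_with_indicator` is hypothetical.

```python
from typing import List, Optional


def impute_with_indicator(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Mean-impute each column and append one 0/1 missingness flag per column."""
    n_cols = len(rows[0])

    # Column means computed from observed values only (0.0 if a column is empty).
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)

    augmented = []
    for row in rows:
        values = [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        flags = [1.0 if row[j] is None else 0.0 for j in range(n_cols)]
        augmented.append(values + flags)
    return augmented


# Toy table with two features and two missing entries (assumed data).
rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
augmented = impute_with_indicator(rows)
```

Each output row holds the imputed features followed by the flags, so a downstream learner can still tell which values were originally missing.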

EP-1269: From datasets to predictive models in cervical cancer: an ontology to mine data for large data-base

Radiotherapy and Oncology ◽

10.1016/s0167-8140(15)41261-7 ◽

2015 ◽

Vol 115 ◽

pp. S685

Author(s):

R. Autorino ◽

M.A. Gambacorta ◽

L. Tagliaferri ◽

M. Campitelli ◽

E. Meldolesi ◽

...

Keyword(s):

Cervical Cancer ◽

Data Base ◽

Predictive Models ◽

Large Data ◽

Large Data Base

Machine-Learning-Guided Cocrystal Prediction Based on Large Data Base

Crystal Growth & Design ◽

10.1021/acs.cgd.0c00767 ◽

2020 ◽

Vol 20 (10) ◽

pp. 6610-6621

Author(s):

Dingyan Wang ◽

Zeen Yang ◽

Bingqing Zhu ◽

Xuefeng Mei ◽

Xiaomin Luo

Keyword(s):

Machine Learning ◽

Data Base ◽

Large Data ◽

Large Data Base

Epigenetic Target Prediction with Accurate Machine Learning Models

10.26434/chemrxiv.13522313 ◽

2021 ◽

Author(s):

Norberto Sánchez-Cruz ◽

Jose L. Medina-Franco

Keyword(s):

Machine Learning ◽

Small Molecules ◽

Predictive Models ◽

Large Scale ◽

Target Prediction ◽

Quantitative Measure ◽

Learning Models ◽

Discovery Research ◽

Drug Discovery Research ◽

Machine Learning Models

Epigenetic targets are a significant focus for drug discovery research, as demonstrated by the eight approved epigenetic drugs for treatment of cancer and the increasing availability of chemogenomic data related to epigenetics. These data represent a large number of structure-activity relationships that have not been exploited thus far for the development of predictive models to support medicinal chemistry efforts. Herein, we report the first large-scale study of 26318 compounds with a quantitative measure of biological activity for 55 protein targets with epigenetic activity. Through a systematic comparison of machine learning models trained on molecular fingerprints of different design, we built predictive models with high accuracy for the epigenetic target profiling of small molecules. The models were thoroughly validated, showing mean precisions up to 0.952 for the epigenetic target prediction task. Our results indicate that the herein reported models have considerable potential to identify small molecules with epigenetic activity. Therefore, our results were implemented as a freely accessible and easy-to-use web application.

In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, namely multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods, with R² = 0.84, MSE = 0.55 for the training set and R² = 0.83, MSE = 0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
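The y-randomization test mentioned in this abstract can be sketched as follows. This is a hedged illustration rather than the authors' procedure: a one-descriptor least-squares line stands in for the SVM model, the data are synthetic, and the function names `r_squared` and `y_randomization` are hypothetical. The idea is that a model fitted to shuffled responses should score far worse than the model fitted to the true responses; if it does not, the original fit is likely due to chance.

```python
import random
from typing import List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R^2 of an ordinary least-squares line fitted to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # For a single-predictor linear fit, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)


def y_randomization(
    x: Sequence[float], y: Sequence[float], n_rounds: int = 50, seed: int = 0
) -> List[float]:
    """Refit on shuffled responses and collect the resulting R^2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)
        scores.append(r_squared(x, shuffled))
    return scores


# Synthetic descriptor/response pairs with a small deterministic perturbation.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 + 0.5 * ((i * 7) % 5 - 2) for i, xi in enumerate(x)]

true_score = r_squared(x, y)
shuffled_scores = y_randomization(x, y)
```

Comparing `true_score` with the distribution of `shuffled_scores` gives a quick check on whether the fitted relationship is genuine.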

A comparison of the value of two machine learning predictive models to support bovine tuberculosis disease control in England

Preventive Veterinary Medicine ◽

10.1016/j.prevetmed.2021.105264 ◽

2021 ◽

Vol 188 ◽

pp. 105264

Author(s):

M. Pilar Romero ◽

Yu-Mei Chang ◽

Lucy A. Brunton ◽

Alison Prosser ◽

Paul Upton ◽

...

Keyword(s):

Machine Learning ◽

Disease Control ◽

Predictive Models ◽

Bovine Tuberculosis ◽

Tuberculosis Disease

Predicting the Appearance of Hypotension During Hemodialysis Sessions Using Machine Learning Classifiers

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18052364 ◽

2021 ◽

Vol 18 (5) ◽

pp. 2364

Author(s):

Juan A. Gómez-Pulido ◽

José M. Gómez-Pulido ◽

Diego Rodríguez-Puyol ◽

María-Luz Polo-Luque ◽

Miguel Vargas-Lombardo

Keyword(s):

Machine Learning ◽

Predictive Models ◽

Chronic Renal Disease ◽

Clinical Information ◽

Healthcare Personnel ◽

Dialysis Session ◽

Clinical Parameters ◽

Machine Learning Classifiers ◽

Learning Classifiers ◽

Gender And Age

A patient suffering from advanced chronic renal disease undergoes several dialysis sessions on different dates. Several clinical parameters are monitored at different hours during each session. These parameters, together with information from analytical (laboratory) parameters, can be very useful for determining the probability that a patient will suffer hypotension during the session, which should be closely watched since it is a proven risk factor for mortality. However, the analytical information is not always available to the healthcare personnel, or it dates from long before the session, so the clinical parameters monitored during the session become key to the prevention of hypotension. This article presents an investigation to predict the appearance of hypotension during a dialysis session, using predictive models trained on a large dialysis database containing the clinical information of 98,015 sessions corresponding to 758 patients. The prediction model takes into account up to 22 clinical parameters measured five times during the session, as well as the gender and age of the patient. The model was trained by means of machine learning classifiers and achieved a prediction success rate higher than 80%.

A two-level architecture for a large data base

ACM SIGIR Forum ◽

10.1145/1095286.1095295 ◽

1976 ◽

Vol 10 (4) ◽

pp. 23-23 ◽

Cited By ~ 1

Author(s):

Tomas Lang

Keyword(s):

Data Base ◽

Large Data ◽

Large Data Base
