Machine learning for Drug-Virus Prediction

Mapping Intimacies ◽

10.21203/rs.3.rs-910042/v4 ◽

2021 ◽

Author(s):

Milad Besharatifard ◽

Arshia Gharagozlou

Keyword(s):

Feature Vector ◽

Drug Repositioning ◽

Binary Classifier ◽

Negative Data ◽

Positive Data ◽

Factorization Model ◽

Forest Models ◽

The World ◽

Random Forest Models ◽

Classifier Learning

Abstract: The 2019 coronavirus (COVID-19) epidemic has hit most countries hard, and many researchers around the world are looking for ways to control the virus. Because designing and discovering a new drug is very time-consuming, repositioning existing medications can be an effective alternative. In this study, we used a binary classifier learning method to predict drug-virus associations. The feature vector for each drug-virus pair is based on the similarity between drugs and the similarity between viruses. We calculated drug similarities from their structural properties (fingerprints) and their phenotype, and virus similarities from their genome sequences and the vectors encoded by the BioBERT model. Using the HDVD dataset, we formed the similarity vector of each drug-virus pair and used it as input to neural network and random forest models. As test data, we randomly selected 20% of the positive data and the same amount of negative data. After five test runs, the proposed approach achieved AUC = 0.97 and AUPR = 0.96 on this test data. We also used the Compressed Sensing (CS) matrix factorization model to predict drug-virus associations. After that, we investigated the importance of drug features in predicting drug-virus associations by using an autoencoder to reduce the dimension of the drug properties.
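
The pair-feature construction and the balanced test split described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `pair_features` and `balanced_test_split`, the toy matrices, and the choice to concatenate similarity rows are all assumptions, since the abstract does not say how the two similarities are combined.

```python
import random

def pair_features(drug_sim, virus_sim, d, v):
    """Concatenate drug d's similarity row with virus v's similarity row.

    Concatenation is an assumption made for illustration only.
    """
    return list(drug_sim[d]) + list(virus_sim[v])

def balanced_test_split(assoc, test_frac=0.2, seed=0):
    """Hold out test_frac of the positive pairs plus an equal number of negatives."""
    rng = random.Random(seed)
    pos = [(d, v) for d, row in enumerate(assoc) for v, a in enumerate(row) if a == 1]
    neg = [(d, v) for d, row in enumerate(assoc) for v, a in enumerate(row) if a == 0]
    n_test = max(1, round(test_frac * len(pos)))
    test_pos = rng.sample(pos, n_test)   # known associations
    test_neg = rng.sample(neg, n_test)   # same number of unobserved pairs
    return test_pos, test_neg

# Toy data (hypothetical): 3 drugs, 2 viruses.
drug_sim = [[1.0, 0.2, 0.5],
            [0.2, 1.0, 0.1],
            [0.5, 0.1, 1.0]]
virus_sim = [[1.0, 0.3],
             [0.3, 1.0]]
assoc = [[1, 0],
         [0, 1],
         [1, 1]]   # 1 = known drug-virus association

print(pair_features(drug_sim, virus_sim, 0, 1))  # [1.0, 0.2, 0.5, 0.3, 1.0]
test_pos, test_neg = balanced_test_split(assoc)
print(len(test_pos), len(test_neg))
```

The remaining pairs would serve as training data for whichever classifier is used; treating every unobserved pair as a negative is itself a simplification.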


Machine learning for Drug-Virus Prediction

10.21203/rs.3.rs-910042/v3 ◽

2021 ◽

Author(s):

Milad Besharatifard ◽

Arshia Gharagozlou

Keyword(s):

Feature Vector ◽

Drug Repositioning ◽

Binary Classifier ◽

Negative Data ◽

Positive Data ◽

Factorization Model ◽

Forest Models ◽

The World ◽

Random Forest Models ◽

Classifier Learning

Abstract: The 2019 coronavirus (COVID-19) epidemic has hit most countries hard, and many researchers around the world are looking for ways to control the virus. Because designing and discovering a new drug is very time-consuming, repositioning existing medications can be an effective alternative. In this study, we used a binary classifier learning method to predict drug-virus associations. The feature vector for each drug-virus pair is based on the similarity between drugs and the similarity between viruses. We calculated drug similarities from their structural properties (fingerprints) and their phenotype, and virus similarities from their genome sequences and the vectors encoded by the BioBERT model. Using the HDVD dataset, we formed the similarity vector of each drug-virus pair and used it as input to neural network and random forest models. As test data, we randomly selected 20% of the positive data and the same amount of negative data. After five test runs, the proposed approach achieved AUC = 0.97 and AUPR = 0.96 on this test data. We also used the Compressed Sensing (CS) matrix factorization model to predict drug-virus associations. We also investigated the importance of drug features in predicting drug-virus associations by using an autoencoder to reduce the dimension of the drug properties.
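
For readers unfamiliar with the two metrics reported here, the sketch below shows one common way to compute them from labels and classifier scores. It is an assumption-laden illustration: the abstract does not state how AUC and AUPR were calculated, AUPR is approximated by average precision, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Approximate the area under the precision-recall curve.

    Averages the precision measured at the rank of each positive example.
    Tied scores are ordered arbitrarily in this simple version.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Made-up labels and scores for six test pairs.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(round(roc_auc(labels, scores), 3))            # 0.889
print(round(average_precision(labels, scores), 3))  # 0.917
```

With a balanced test set such as the one described in the abstract, both metrics have a chance-level baseline of roughly 0.5.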


Data-Driven Wildfire Risk Prediction in Northern California

Atmosphere ◽

10.3390/atmos12010109 ◽

2021 ◽

Vol 12 (1) ◽

pp. 109

Author(s):

Ashima Malik ◽

Megha Rajam Rao ◽

Nandini Puppala ◽

Prathusha Koouri ◽

Venkata Anil Kumar Thota ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Curves ◽

Data Driven ◽

Northern California ◽

Combined Model ◽

Wildfire Risk ◽

Study Results ◽

Forest Models ◽

Random Forest Models

Over the years, rampant wildfires have plagued the state of California, causing economic and environmental loss. In 2018, wildfires cost nearly 800 million dollars in economic loss and claimed more than 100 lives in California. Over 1.6 million acres of land burned, causing extensive environmental damage. Although researchers have recently introduced machine learning models and algorithms for predicting wildfire risk, these studies focused on specific perspectives and were restricted to a limited number of data parameters. In this paper, we propose two data-driven machine learning approaches based on random forest models to predict wildfire risk in areas near Monticello and Winters, California. This study demonstrates how the models were developed and applied with comprehensive data parameters such as powerlines, terrain, and vegetation, considered from different perspectives, which improved the spatial and temporal accuracy of predicting wildfire risk, including fire ignition. The combined model uses the spatial and temporal parameters as a single combined dataset to train and predict the fire risk, whereas the ensemble model was fed separate parameters that were later stacked to work as a single model. Our experiment shows that the combined model produced more accurate results than the ensemble of random forest models on separate spatial data. The models were validated with Receiver Operating Characteristic (ROC) curves, learning curves, and evaluation metrics such as accuracy, confusion matrices, and classification reports. The study achieved an accuracy of 92% in predicting wildfire risk, including ignition, by utilizing regional spatial and temporal data along with standard data parameters in Northern California.


Incorporating space and time into random forest models for analyzing geospatial patterns of drug-related crime incidents in a major U.S. metropolitan area

Computers Environment and Urban Systems ◽

10.1016/j.compenvurbsys.2021.101599 ◽

2021 ◽

Vol 87 ◽

pp. 101599

Author(s):

Zhiyue Xia ◽

Kathleen Stewart ◽

Junchuan Fan

Keyword(s):

Random Forest ◽

Metropolitan Area ◽

Space And Time ◽

Forest Models ◽

Random Forest Models


Landslide susceptibility assessment for a transmission line in Gansu Province, China by using a hybrid approach of fractal theory, information value, and random forest models

Environmental Earth Sciences ◽

10.1007/s12665-021-09737-w ◽

2021 ◽

Vol 80 (12) ◽

Author(s):

Binbin Zhao ◽

Yunfeng Ge ◽

Hongzhi Chen

Keyword(s):

Random Forest ◽

Landslide Susceptibility ◽

Fractal Theory ◽

Hybrid Approach ◽

Gansu Province ◽

Information Value ◽

Susceptibility Assessment ◽

Landslide Susceptibility Assessment ◽

Forest Models ◽

Random Forest Models


Random forest models of 305-days milk yield for Holstein cows in Bulgaria

10.1063/5.0034778 ◽

2020 ◽

Author(s):

A. Yordanova ◽

H. Kulina

Keyword(s):

Random Forest ◽

Milk Yield ◽

Holstein Cows ◽

Forest Models ◽

Random Forest Models


Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces

International Journal of Data Warehousing and Mining ◽

10.4018/jdwm.2012040103 ◽

2012 ◽

Vol 8 (2) ◽

pp. 44-63 ◽

Cited By ~ 30

Author(s):

Baoxun Xu ◽

Joshua Zhexue Huang ◽

Graham Williams ◽

Qiang Wang ◽

Yunming Ye

Keyword(s):

Random Forest ◽

High Dimensional Data ◽

Real Life ◽

Classification Performance ◽

Feature Weighting ◽

Random Forest Model ◽

High Dimensional ◽

Forest Model ◽

Forest Models ◽

Random Forest Models

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for the subspace is not suitable for high dimensional data consisting of thousands of features, because such data often contains many features that are uninformative for classification, and random sampling often fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model is significantly degraded. In this paper, the authors propose an improved random forest method which uses a novel feature weighting method for subspace selection and therefore enhances classification performance over high-dimensional data. A series of experiments on 9 real life high dimensional datasets demonstrated that using a subspace size of features where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
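
The idea of weighted subspace selection can be illustrated with a short sketch. It is not the authors' method: the abstract does not say how the feature weights are computed or how they drive the sampling, so the weights below are hypothetical informativeness scores, the function name `weighted_subspace` is invented, and the sampling scheme (Efraimidis-Spirakis weighted sampling without replacement) is swapped in as one plausible choice.

```python
import math
import random

def weighted_subspace(weights, k, seed=None):
    """Draw k distinct feature indices, favouring features with larger weights.

    Each positive-weight feature gets the key log(u) / w, which is the log of
    u ** (1 / w) for a uniform u in (0, 1]; the k largest keys are kept.
    Features with zero weight are never drawn.
    """
    rng = random.Random(seed)
    keyed = []
    for i, w in enumerate(weights):
        if w > 0:
            u = 1.0 - rng.random()          # uniform in (0, 1], avoids log(0)
            keyed.append((math.log(u) / w, i))
    if k > len(keyed):
        raise ValueError("k exceeds the number of positive-weight features")
    keyed.sort(reverse=True)                # largest keys first
    return [i for _, i in keyed[:k]]

# Hypothetical weights: features 0 and 2 are treated as uninformative.
weights = [0.0, 5.0, 0.0, 1.0, 3.0]
print(weighted_subspace(weights, k=2, seed=1))
```

In a forest, a fresh subspace would be drawn this way for each tree or each split, so informative features appear more often than under uniform sampling while less informative ones still get an occasional chance.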


Gully erosion zonation mapping using integrated geographically weighted regression with certainty factor and random forest models in GIS

Journal of Environmental Management ◽

10.1016/j.jenvman.2018.11.110 ◽

2019 ◽

Vol 232 ◽

pp. 928-942 ◽

Cited By ~ 46

Author(s):

Alireza Arabameri ◽

Biswajeet Pradhan ◽

Khalil Rezaei

Keyword(s):

Random Forest ◽

Geographically Weighted Regression ◽

Gully Erosion ◽

Weighted Regression ◽

Certainty Factor ◽

Forest Models ◽

Random Forest Models


Forest floor temperature and greenness link significantly to canopy attributes in South Africa’s fragmented coastal forests

10.7287/peerj.preprints.27168 ◽

2018 ◽

Author(s):

Marion Pfeifer ◽

Michael JW Boyle ◽

Stuart Dunning ◽

Pieter Olivier

Keyword(s):

Land Use ◽

Food Security ◽

Surface Temperature ◽

Habitat Quality ◽

Ground Surface ◽

Quality Metrics ◽

Ground Vegetation ◽

Ground Surface Temperature ◽

Forest Models ◽

Random Forest Models

Tropical landscapes are changing rapidly due to changes in land use and land management. Being able to predict and monitor the impacts of land use change on species, for conservation or food security purposes, requires habitat quality metrics that are consistent, can be mapped using above-ground sensor data, and are relevant for species performance. Here, we focus on ground surface temperature (Thermalground) and ground vegetation greenness (NDVIdown) as potentially suitable metrics of habitat quality. We measure both across habitats differing in tree cover (natural grassland to forest edges to forests and tree plantations) in the human-modified coastal forested landscapes of KwaZulu-Natal, South Africa. We show that both habitat quality metrics decline linearly as a function of increasing canopy closure (FCover, %) and canopy leaf area index (LAI). Opening canopies by about 20% or reducing canopy leaf area by 1% would result in an increase of temperatures on the ground by more than 1°C, and an increase in ground vegetation greenness by 0.2 and 0.14 respectively. Upscaling LAI and FCover to develop maps from Landsat imagery using random forest models allowed us to map Thermalground and NDVIdown using the linear relationships. However, map accuracy was constrained by the predictive capacity of the random forest models predicting canopy attributes and the linear models linking canopy attributes to the habitat quality metrics. Accounting for micro-scale variation in temperature is seen as essential to improve biodiversity impact predictions. Our upscaling approach suggests that mapping ground surface temperature based on radiation and vegetation properties might be possible, and that canopy cover maps could provide a useful tool for mapping habitat quality metrics that matter to species. However, we need to increase sampling of surface temperature spatially and temporally to improve and validate upscaled models. We also need to link surface temperature maps to demographic traits of species of different threat status or functions in landscapes with different disturbance and management histories, testing for generalities in relationships. The derived understanding could then be exploited for targeted landscape restoration that benefits biodiversity conservation and food security sustainably at the landscape scale.


Data Mining Crystallization Kinetics

10.26434/chemrxiv.11708286 ◽

2020 ◽

Author(s):

Cameron Brown ◽

Diego Maldonado ◽

Antony Vassileiou ◽

Blair Johnston ◽

Alastair Florence

Keyword(s):

Random Forest ◽

Kinetic Parameters ◽

Crystallization Kinetics ◽

Balance Model ◽

Forest Models ◽

Vast Literature ◽

Random Forest Models ◽

Kinetic Expression ◽

Population Balances ◽

Different Sources

The population balance model is a valuable modelling tool that facilitates the optimization and understanding of crystallization processes. However, using this tool requires prior knowledge of the crystallization kinetics, specifically crystal growth and nucleation. Most approaches to estimating kinetic parameters require experimental data. Over time, a vast literature on the estimation of kinetic parameters and population balances has been published. Drawing on the available data, this work built a database with information on solute, solvent, kinetic expression, parameters, crystallization method and seeding. Correlations were assessed and cluster structures identified by hierarchical clustering analysis. The final database contains 336 kinetic-parameter data entries from 185 different sources. The data were analysed using kinetic parameters of the most common expressions. Subsequently, clusters were identified for each kinetic model. With these clusters, classification random forest models were built using solute descriptors, seeding, solvent, and crystallization methods as predictor variables. The random forest models had an overall classification accuracy higher than 70%, making them useful for providing rough estimates of kinetic parameters, although these methods have some limitations.


Applicability of an Automated Model and Parameter Selection in the Prediction of Screening-Level PTSD in Danish Soldiers Following Deployment: Development Study of Transferable Predictive Models Using Automated Machine Learning (Preprint)

10.2196/preprints.17119 ◽

2019 ◽

Author(s):

Karen-Inge Karstoft ◽

Ioannis Tsamardinos ◽

Kasper Eskelund ◽

Søren Bo Andersen ◽

Lars Ravnborg Nissen

Keyword(s):

Machine Learning ◽

Operating Characteristic ◽

Linear Models ◽

Prediction Models ◽

Characteristic Curve ◽

Ptsd Symptoms ◽

Forest Models ◽

Random Forest Models ◽

Automated Machine Learning ◽

Military Rank

BACKGROUND: Posttraumatic stress disorder (PTSD) is a relatively common consequence of deployment to war zones. Early postdeployment screening that identifies those at risk for PTSD in the years following deployment would help deliver interventions to those in need, but such screening has so far proved unsuccessful. OBJECTIVE: This study aimed to test the applicability of automated model selection and the ability of automated machine learning prediction models to transfer across cohorts and predict screening-level PTSD 2.5 years and 6.5 years after deployment. METHODS: Automated machine learning was applied to data routinely collected 6-8 months after return from deployment from 3 different cohorts of Danish soldiers deployed to Afghanistan in 2009 (cohort 1, N=287 or N=261 depending on the timing of the outcome assessment), 2010 (cohort 2, N=352), and 2013 (cohort 3, N=232). RESULTS: Models transferred well between cohorts. For screening-level PTSD 2.5 and 6.5 years after deployment, random forest models provided the highest accuracy as measured by area under the receiver operating characteristic curve (AUC): 2.5 years, AUC=0.77, 95% CI 0.71-0.83; 6.5 years, AUC=0.78, 95% CI 0.73-0.83. Linear models performed equally well. Military rank, hyperarousal symptoms, and total level of PTSD symptoms were highly predictive. CONCLUSIONS: Automated machine learning provided validated models that can be readily implemented in future deployment cohorts in the Danish Defense with the aim of targeting postdeployment support interventions to those at highest risk for developing PTSD, provided the cohorts are deployed on similar missions.
