Intelligent Molecular Identification for High Performance Organosulfide Capture Using Active Machine Learning Algorithm

Author(s):  
Yuxiang Chen ◽  
Chuanlei Liu ◽  
Yang An ◽  
Yue Lou ◽  
Yang Zhao ◽  
...  

Machine learning and computer-aided approaches significantly accelerate molecular design and discovery in scientific and industrial fields that increasingly rely on data science for efficiency. The typical method is supervised learning, which requires large labeled datasets. Semi-supervised machine learning approaches can exploit unlabeled data to improve modeling performance, but they are limited by the accumulation of prediction errors. Here, to screen solvents for the removal of methyl mercaptan (MeSH), a common organosulfur impurity in natural gas, we constructed a computational framework that integrates molecular similarity search and active learning, namely molecular active selection machine learning (MASML). This new framework identifies the optimal set of molecules through molecular similarity search and iterative addition to the training dataset. Among the 126,068 compounds in the initial dataset, three molecules were identified as promising for MeSH capture: benzylamine (BZA), p-methoxybenzylamine (PZM), and N,N-diethyltrimethylenediamine (DEAPA). Further experiments confirmed the effectiveness of our modeling framework for efficient molecular design and identification for MeSH capture, in which DEAPA presents a Henry's law constant 89.4% lower than that of methyl diethanolamine (MDEA).
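As a rough, self-contained illustration of the active-learning loop described above (not the authors' actual MASML implementation), the Python sketch below alternates between a similarity search seeded by the best molecule found so far and model-guided selection of new candidates to label and add to the training set. The binary fingerprints, the `tanimoto` helper, and the `score_candidate` labeling oracle are all illustrative stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return inter / union if union else 0.0

# Toy data: binary "fingerprints" and a stand-in property oracle
# (in the paper this would be a computed or measured MeSH affinity).
n_pool, n_bits = 500, 64
pool = rng.integers(0, 2, size=(n_pool, n_bits))

def score_candidate(fp):  # hypothetical labeling oracle
    return fp[:8].sum() + rng.normal(0, 0.1)

# Seed training set drawn at random from the candidate pool.
train_idx = [int(i) for i in rng.choice(n_pool, size=20, replace=False)]
labels = {i: score_candidate(pool[i]) for i in train_idx}

for round_ in range(5):
    X = pool[train_idx]
    y = np.array([labels[i] for i in train_idx])
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    # Similarity search: candidates closest to the best molecule found so far.
    best = pool[train_idx[int(np.argmax(y))]]
    unlabeled = [i for i in range(n_pool) if i not in labels]
    unlabeled.sort(key=lambda i: tanimoto(best, pool[i]), reverse=True)
    shortlist = unlabeled[:50]

    # Active selection: label the shortlisted candidates the model ranks
    # highest and add them to the training set for the next iteration.
    preds = model.predict(pool[shortlist])
    for i in np.array(shortlist)[np.argsort(preds)[-10:]]:
        labels[int(i)] = score_candidate(pool[i])
        train_idx.append(int(i))

print("labeled:", len(labels), "best score:", round(max(labels.values()), 2))
```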

Genes ◽  
2021 ◽  
Vol 12 (4) ◽  
pp. 527
Author(s):  
Eran Elhaik ◽  
Dan Graur

In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled “Soft sweeps are the dominant mode of adaptation in the human genome” (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863–1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366–1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern’s paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.


2021 ◽  
Author(s):  
Marian Popescu ◽  
Rebecca Head ◽  
Tim Ferriday ◽  
Kate Evans ◽  
Jose Montero ◽  
...  

Abstract: This paper presents advancements in machine learning and cloud deployment that enable rapid and accurate automated lithology interpretation. A supervised machine learning technique is described that enables rapid, consistent, and accurate lithology prediction, alongside quantitative uncertainty, from large wireline or logging-while-drilling (LWD) datasets. To leverage supervised machine learning, a team of geoscientists and petrophysicists made detailed lithology interpretations of wells to generate a comprehensive training dataset. Lithology interpretations were based on deterministic cross-plotting, utilizing and combining various raw logs. This training dataset was used to develop a model and test a machine learning pipeline. The pipeline was applied to a dataset previously unseen by the algorithm to predict lithology. A quality-checking process was performed by a petrophysicist to validate new predictions delivered by the pipeline against human interpretations. Confidence in the interpretations was assessed in two ways: the prior probability, a measure of confidence that the input data are recognized by the model, and the posterior probability, which quantifies the likelihood that a specified depth interval comprises a given lithology. The supervised machine learning algorithm ensured that the wells were interpreted consistently by removing interpreter biases and inconsistencies. The scalability of cloud computing enabled a large log dataset to be interpreted rapidly; >100 wells were interpreted consistently in five minutes, yielding a >70% lithological match to the human petrophysical interpretation. Supervised machine learning methods have strong potential for classifying lithology from log data because: 1) they can automatically define complex, non-parametric, multi-variate relationships across several input logs; and 2) they allow classification confidence to be quantified. Furthermore, this approach captured the knowledge and nuances of an interpreter's decisions by training the algorithm on human-interpreted labels. In the hydrocarbon industry, the quantity of generated data is predicted to increase by >300% between 2018 and 2023 (IDC, Worldwide Global DataSphere Forecast, 2019–2023). Additionally, the industry holds vast legacy data. This supervised machine learning approach can unlock the potential of some of these datasets by providing consistent lithology interpretations rapidly, allowing resources to be used more effectively.
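For readers wanting a concrete picture of the two confidence measures distinguished above, here is a minimal Python sketch, assuming scikit-learn and toy stand-in logs rather than the authors' pipeline: the posterior comes from the classifier's per-class probabilities, while the "prior" recognition score is approximated with a density model fitted to the training inputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Toy stand-ins for wireline logs (e.g., GR, RHOB, NPHI) and lithology labels.
X_train = rng.normal(size=(1000, 3))
y_train = rng.integers(0, 3, size=1000)   # 0=sand, 1=shale, 2=carbonate
X_new = rng.normal(size=(5, 3))

# Posterior: P(lithology | logs) from the supervised classifier.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
posterior = clf.predict_proba(X_new)

# "Prior" confidence: how well the incoming logs resemble the training data,
# approximated here by a density model fitted on the training inputs.
density = GaussianMixture(n_components=4, random_state=0).fit(X_train)
prior_score = density.score_samples(X_new)   # log-likelihood per sample

for p, s in zip(posterior, prior_score):
    print(f"posterior={np.round(p, 2)}  input log-likelihood={s:.2f}")
```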


2020 ◽  
Vol 11 (40) ◽  
pp. 8-23
Author(s):  
Pius MARTHIN ◽  
Duygu İÇEN

Online product reviews have become a valuable source of information that facilitates customer decisions about a particular product. With the wealth of information regarding users' satisfaction and experiences with a particular drug, pharmaceutical companies make use of online drug reviews to improve the quality of their products. Machine learning has enabled scientists to train more efficient models that facilitate decision making in various fields. In this manuscript we applied the drug review dataset used by Gräßer, Kallumadi, Malberg, and Zaunseder (2018), freely available from the University of California, Irvine (UCI) Machine Learning Repository, to identify the machine learning model that best predicts overall drug performance with respect to users' reviews. Apart from several manipulations done to improve model accuracy, all procedures required for text analysis were followed, including text cleaning and transformation of texts to numeric format for training machine learning models. Prior to modeling, we obtained overall sentiment scores for the reviews. Customers' reviews were summarized and visualized using a bar plot and word cloud to explore the most frequent terms. Due to scalability issues, we were able to use only a sample of the dataset: we randomly sampled 15,000 observations from the 161,297-observation training dataset and 10,000 observations from the 53,766-observation testing dataset. Several machine learning models were trained using 10-fold cross-validation performed under stratified random sampling. The trained models include Classification and Regression Trees (CART), a classification tree by C5.0, logistic regression (GLM), Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM) with both radial and linear kernels, and a classification tree ensemble using random forest (Random Forest). Model selection was done through a comparison of accuracies and computational efficiency. The SVM with a linear kernel was significantly better than the rest, with an accuracy of 83%. Using only a small portion of the dataset, we managed to attain reasonable accuracy in our models by applying the TF-IDF transformation and the Latent Semantic Analysis (LSA) technique to our term-document matrix (TDM).
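A minimal sketch of the winning pipeline described above, assuming scikit-learn and toy review texts in place of the UCI corpus: TF-IDF weighting of the term-document matrix, LSA via truncated SVD, and a linear-kernel SVM scored with stratified 10-fold cross-validation.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy reviews standing in for the UCI drug-review corpus.
texts = ["works great no side effects", "terrible nausea and headache",
         "helped my pain a lot", "made symptoms worse", "very effective drug",
         "awful experience would not recommend"] * 50
labels = [1, 0, 1, 0, 1, 0] * 50   # 1 = positive sentiment, 0 = negative

# TF-IDF weighting of the term-document matrix, then LSA (truncated SVD),
# then a linear-kernel SVM, mirroring the modeling steps described above.
pipe = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    TruncatedSVD(n_components=10, random_state=0),  # more components on real data
    LinearSVC(),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv)
print(f"10-fold CV accuracy: {scores.mean():.2%}")
```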


2021 ◽  
Author(s):  
Shahan Derkarabetian ◽  
James Starrett ◽  
Marshal Hedin

The diversity of biological and ecological characteristics of organisms, and the underlying genetic patterns and processes of speciation, makes the development of universally applicable genetic species delimitation methods challenging. Many approaches, like those incorporating the multispecies coalescent, sometimes delimit populations and overestimate species numbers. This issue is exacerbated in taxa with inherently high population structure due to low dispersal ability, and in cryptic species resulting from nonecological speciation. These taxa present a conundrum when delimiting species: analyses rely heavily, if not entirely, on genetic data, which oversplit species, while other lines of evidence lump them together. We showcase this conundrum in the harvestman Theromaster brunneus, a low-dispersal taxon with a wide geographic distribution and high potential for cryptic species. Integrating morphological, mitochondrial, and sub-genomic (double-digest RADSeq and ultraconserved elements) data, we find high discordance across analyses and data types in the number of inferred species, with further evidence that multispecies coalescent approaches oversplit. We demonstrate the power of a supervised machine learning approach in effectively delimiting cryptic species by creating a "custom" training dataset derived from a well-studied lineage with similar biological characteristics to Theromaster. This novel approach uses known taxa with particular biological characteristics to inform unknown taxa with similar characteristics, employing modern computational tools ideally suited for species delimitation while also considering the biology and natural history of organisms to make more biologically informed delimitation decisions. In principle, this approach is universally applicable for species delimitation of any taxon with genetic data, particularly cryptic species.
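To make the "custom training dataset" idea concrete, here is a hedged Python sketch, not the authors' code: a classifier is fitted on hypothetical population-genetic summary statistics from a well-studied reference lineage where species boundaries are known, then transferred to comparable statistics from the focal taxon.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Hypothetical summary statistics computed for pairs of lineages in a
# well-studied reference taxon (e.g., F_ST, dxy, shared-allele fraction);
# label 1 = distinct species, 0 = conspecific populations.
X_reference = rng.normal(size=(300, 3))
y_reference = (X_reference[:, 0] + 0.5 * X_reference[:, 1] > 0).astype(int)

# Train on the reference lineage with similar biology, then transfer the
# model to comparable statistics from the focal, unknown taxon.
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_reference, y_reference)

X_focal = rng.normal(size=(4, 3))  # pairs from a Theromaster-like focal taxon
proba = clf.predict_proba(X_focal)[:, 1]
for i, p in enumerate(proba):
    print(f"pair {i}: P(distinct species) = {p:.2f}")
```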


Author(s):  
Sylvia Aponte-Hao ◽  
Sabrina T. Wong ◽  
Manpreet Thandi ◽  
Paul Ronksley ◽  
Kerry McBrien ◽  
...  

Introduction: Frailty is a medical syndrome, commonly affecting people aged 65 years and over and is characterized by a greater risk of adverse outcomes following illness or injury. Electronic medical records contain a large amount of longitudinal data that can be used for primary care research. Machine learning can fully utilize this wide breadth of data for the detection of diseases and syndromes. The creation of a frailty case definition using machine learning may facilitate early intervention, inform advanced screening tests, and allow for surveillance. Objectives: The objective of this study was to develop a validated case definition of frailty for the primary care context, using machine learning. Methods: Physicians participating in the Canadian Primary Care Sentinel Surveillance Network across Canada were asked to retrospectively identify the level of frailty present in a sample of their own patients (total n = 5,466), collected from 2015-2019. Frailty levels were dichotomized using a cut-off of 5. Extracted features included previously prescribed medications, billing codes, and other routinely collected primary care data. We used eight supervised machine learning algorithms, with performance assessed using a hold-out test set. A balanced training dataset was also created by oversampling. Sensitivity analyses considered two alternative dichotomization cut-offs. Model performance was evaluated using area under the receiver-operating characteristic curve, F1, accuracy, sensitivity, specificity, negative predictive value and positive predictive value. Results: The prevalence of frailty within our sample was 18.4%. Of the eight models developed to identify frail patients, an XGBoost model achieved the highest sensitivity (78.14%) and specificity (74.41%). The balanced training dataset did not improve classification performance. Sensitivity analyses did not show improved performance for cut-offs other than 5. Conclusion: Supervised machine learning was able to create well-performing classification models for frailty. Future research is needed to assess frailty inter-rater reliability, and link multiple data sources for frailty identification.
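As an illustrative sketch of the modeling and evaluation steps described above (assuming the xgboost and scikit-learn packages, with synthetic features standing in for the real primary-care data), the code below oversamples the minority class to balance the training set, fits an XGBoost classifier, and reports AUROC, F1, sensitivity, and specificity on a hold-out set.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

rng = np.random.default_rng(3)

# Toy stand-in for routinely collected primary-care features
# (medications, billing codes, etc.); 1 = frail (minority class).
X = rng.normal(size=(5000, 20))
y = (X[:, 0] + rng.normal(0, 1, 5000) > 1.6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Simple random oversampling of the minority (frail) class.
frail = np.flatnonzero(y_tr == 1)
extra = rng.choice(frail, size=(y_tr == 0).sum() - frail.size)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

model = XGBClassifier(n_estimators=200, eval_metric="logloss")
model.fit(X_bal, y_bal)

prob = model.predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print(f"AUROC={roc_auc_score(y_te, prob):.3f}  F1={f1_score(y_te, pred):.3f}")
print(f"sensitivity={tp/(tp+fn):.3f}  specificity={tn/(tn+fp):.3f}")
```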


2021 ◽  
Author(s):  
Urmi Ghosh ◽  
Tuhin Chakraborty

Rapid technological improvements made in in-situ analysis techniques, including LA-ICPMS, have transformed the field of analytical geochemistry. This has a far-reaching impact on petrogenetic and ore-genetic studies, where minute major and trace element compositional changes between different mineral zones within a single crystal can now be demarcated. Minerals such as garnet, although robust, are quite sensitive to changing P-T and fluid conditions during their formation, and have become powerful tools to characterize mineralization types. Previously, Meinert (1992) used in-situ major element EPMA analyses to classify different skarn deposits based on the end-member composition of hydrothermal garnets. Alternatively, Tian et al. (2019) used garnet trace element compositions for a similar purpose. However, these discrimination plots/classification schemes show major overlap among different skarn deposit types, such as Fe, Cu, Zn, and Au. The present study is an attempt to use a machine learning approach on available garnet data to find a more potent classification scheme for skarn deposits, thus reaffirming garnet as a faithful indicator of hydrothermal ore deposits. We have meticulously collected major and trace element data of Ca-rich garnets associated with different skarn deposits worldwide from 40 publications. This collected dataset was then used to train a model for fingerprinting skarn deposits. A stratified random sampling method was applied, with 80% of the samples used as the training dataset and the remaining 20% as the test set. We used K-nearest neighbour (KNN), Support Vector Machine (SVM), and Random Forest algorithms on the data, using Python as the platform. These ML classification algorithms perform better than earlier models for classifying ore types based on garnet composition in skarn systems. Feature importance was calculated, showing which elements play a pivotal role in the classification of ore type. Our results show that multiple garnet-forming elements taken together can reliably discriminate between different ore formation settings.
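A minimal Python sketch of the workflow just described, with synthetic compositions standing in for the collected garnet data: a stratified 80/20 split, the three named classifiers, and feature importances from the random forest. The element list is illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)

# Toy stand-in for garnet major/trace element compositions; labels
# stand in for skarn ore types (e.g., Fe, Cu, Zn, Au).
elements = ["Al", "Fe3", "Mn", "Ti", "Sn", "W"]
X = rng.normal(size=(400, len(elements)))
y = rng.integers(0, 4, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # stratified 80/20 split

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(5)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RandomForest": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, m in models.items():
    print(name, f"accuracy: {m.fit(X_tr, y_tr).score(X_te, y_te):.2f}")

# Feature importance: which elements drive the ore-type classification.
rf = models["RandomForest"]
for el, imp in sorted(zip(elements, rf.feature_importances_),
                      key=lambda t: -t[1]):
    print(f"{el}: {imp:.3f}")
```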


2021 ◽  
Author(s):  
Göktug Diker ◽  
Herwig Frühbauer ◽  
Edna Michelle Bisso Bi Mba

Abstract: Wintershall Dea, together with partners, is developing a digital system to monitor and optimize electrical submersible pump (ESP) performance based on data from the Mittelplate oil field. The tool uses machine learning (ML) models fed with historical data and will notify engineers and operators when operating conditions are trending beyond the operating envelope, enabling an operator to mitigate upcoming performance problems. In addition to traditional engineering methods, such a system captures knowledge through continuous, ML-based improvement. With this approach the engineer has a system at hand to support day-to-day work: manual monitoring and on-demand investigations are now backed up by an intelligent system that permanently monitors the equipment. To create such a system, a proof of concept (PoC) study was initiated with industry partners and data scientists to evaluate historic events, which are used to train the ML systems. This phase aims to better understand the capabilities of machine learning and data science in the subsurface domain, as well as to build the engineers' trust in such systems. The concept evaluation showed that intensive collaboration between engineers and data scientists is essential. A continuous and structured exchange between engineering and data science resulted in a mutually developed product that fits the engineers' needs within the technical capabilities and limits set by the ML models. To organize such a development, new project management elements such as agile working methods, sprints, and scrum were utilized. During the development Wintershall Dea partnered with two organizations: one with a pure data science background, the other being the data science team of the ESP manufacturer. After the PoC period the following conclusions can be drawn: (1) data quality and format are key to success; (2) detailed knowledge of the equipment speeds up development and improves the quality of the results; (3) high model accuracy requires a high number of events in the training dataset. The overall conclusion of this PoC is that collaboration between engineers and data scientists, fostered by the agile project management toolkit and suitable datasets, leads to a successful development. Even when the limits of the ML algorithms are hit, the model forecast, in combination with traditional engineering methods, adds significant value to ESP performance. The novelty of such a system is that the production engineer is supported by trusted ML models and digital systems. Combined with traditional engineering tools, this system improves equipment monitoring and decision making, leading to increased equipment performance.
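As one hedged illustration of the envelope-monitoring idea (not Wintershall Dea's system), the sketch below learns a normal operating envelope from historical ESP telemetry with an isolation forest and flags new readings that fall outside it; all variable names and values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)

# Hypothetical historical ESP telemetry:
# columns ~ intake pressure, motor temperature, vibration, current.
normal_ops = rng.normal(loc=[100, 80, 0.2, 45], scale=[5, 2, 0.05, 3],
                        size=(5000, 4))

# Learn the normal operating envelope from history; flag departures.
envelope = IsolationForest(contamination=0.01, random_state=0).fit(normal_ops)

new_readings = np.array([
    [101, 81, 0.22, 44],   # within envelope
    [120, 95, 0.60, 60],   # trending outside -> notify the engineer
])
for reading, flag in zip(new_readings, envelope.predict(new_readings)):
    status = "OK" if flag == 1 else "ALERT: outside operating envelope"
    print(reading, "->", status)
```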


Data science in healthcare is an innovative and promising area for industrial application. Data analytics is a recent discipline for exploring medical datasets to discover and characterize disease. This work is an initial attempt to identify disease with the help of a large medical dataset. Using this data science methodology, users can screen for disease without the help of healthcare centres. Healthcare and data science are often linked through finances, as the industry attempts to reduce its expenses with the help of large amounts of data. Data science and medicine are developing rapidly, and it is important that they advance together. Healthcare information is highly valuable to society. The incidence of heart disease in everyday life has increased. To analyse and prevent heart disease, different factors in the human body are monitored; classifying these factors using machine learning algorithms and predicting the disease is the major task. This mainly involves supervised machine learning algorithms such as SVM, Naive Bayes, Decision Trees, and Random Forest.
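A short sketch of the comparison the paragraph above calls for, assuming scikit-learn and a synthetic stand-in for real heart-disease data: the four named classifiers evaluated by cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for heart-disease risk factors (age, blood pressure,
# cholesterol, etc.); 1 = disease present.
X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           random_state=0)

classifiers = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.2%} cross-validated accuracy")
```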


2021 ◽  
Author(s):  
Raghav Awasthi ◽  
Samprati Agrawal ◽  
Vaidehi Rakholia ◽  
Lovedeep Singh Dhingra ◽  
Aditya Nagori ◽  
...  

Background: Antimicrobial resistance (AMR) is a complex multifactorial outcome of health, socio-economic, and geopolitical factors. Therefore, tailored solutions for mitigation strategies could be more effective in dealing with this challenge. Knowledge synthesis and actionable models learned from large datasets are critical in order to diffuse the risk of entering a post-antimicrobial era. Objective: This work focuses on learning global determinants of AMR and predicting the susceptibility of antibiotics at the isolate level (local) for the WHO (World Health Organization)-declared critically important pathogens Pseudomonas aeruginosa, Klebsiella pneumoniae, Escherichia coli, Acinetobacter baumannii, Enterobacter cloacae, and Staphylococcus aureus. Methods: In this study, we used longitudinal AMR data (2004-2017) comprising 633,820 isolates from 72 middle- and high-income countries. We integrated the Global Burden of Disease (GBD), Governance (WGI), and finance datasets in order to find unbiased and actionable determinants of AMR. We chose a Bayesian Decision Network (BDN) approach within the causal modeling framework to quantify determinants of AMR. Finally, integrating Bayesian networks with classical machine learning approaches led to effective modeling of the level of AMR. Results: From MAR (Multiple Antibiotic Resistance) scores, we found that developing countries are at higher risk of AMR than developed countries for all the critically important pathogens. Principal Component Analysis (PCA) revealed that governance, finance, and disease burden variables have a strong association with AMR. We further quantified the impact of determinants in a probabilistic way and observed that health system access and government effectiveness are strong actionable factors in reducing AMR, which was in turn confirmed by what-if analysis. Finally, our supervised machine learning models showed decent performance, with the best results for Staphylococcus aureus: our model predicted susceptibility to Ceftaroline and Oxacillin with the highest AUROC, 0.94 and 0.89, respectively.
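For illustration only (this is not the authors' Bayesian Decision Network), the sketch below shows the two supporting analyses in scikit-learn with synthetic data: PCA over standardized determinant variables and a supervised susceptibility model evaluated by AUROC.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)

# Hypothetical country-level determinants: governance, finance, and
# disease-burden indicators, plus an isolate-level susceptibility label.
X = rng.normal(size=(2000, 6))
y = (0.8 * X[:, 0] - 0.5 * X[:, 3] + rng.normal(0, 1, 2000) > 0).astype(int)

# PCA on standardized determinants to examine their joint structure.
pcs = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print("explained variance ratio:", np.round(pcs.explained_variance_ratio_, 2))

# Supervised susceptibility model evaluated by AUROC, as reported above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"AUROC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.2f}")
```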


Author(s):  
Eran Elhaik ◽  
Dan Graur

Supervised machine learning (SML) is a powerful method for predicting a small number of well-defined output groups (e.g., potential buyers of a certain product) by taking as input a large number of known well-defined measurements (e.g., past purchases, income, ethnicity, gender, credit record, age, favorite color, favorite chewing gum). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known to be true. SML has had enormous success in the world of commerce, and this success has prompted a few scientists to employ it in the study of molecular and genome evolution. Here, we list the properties of SML that make it an unsuitable tool in evolutionary studies. In particular, we argue that SML cannot be used in an evolutionary exploratory context for the simple reason that training datasets that are known to be a priori true do not exist. As a case study, we use an SML study in which it was concluded that most human genomes evolve by positive selection through soft selective sweeps (Schrider and Kern 2017). We show that in the absence of legitimate training datasets, Schrider and Kern (2017) used (1) simulations that employ many manipulatable variables and (2) a system of cherry-picking data that would put to shame most modern evangelical exegeses of the Bible. These two factors, in addition to the lack of methodological detail and the lack of either negative controls or corrections for multiple comparisons, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., discoal) should be taken with a huge shovel of salt.

