Interpretable Machine Learning for Perturbation Biology

Abstract Systematic perturbation of cells followed by comprehensive measurements of molecular and phenotypic responses provides an informative data resource for constructing computational models of cell biology. Models that generalize well beyond training data can be used to identify combinatorial perturbations of potential therapeutic interest. Major challenges for machine learning on large biological datasets are to find global optima in an enormously complex multi-dimensional solution space and to mechanistically interpret the solutions. To address these challenges, we introduce a hybrid approach that combines explicit mathematical models of dynamic cell biological processes with a machine learning framework, implemented in TensorFlow. We tested the modeling framework on a perturbation-response dataset for a melanoma cell line after drug treatments. The models can be efficiently trained to accurately describe cellular behavior, as tested by cross-validation. Even though completely data-driven and independent of prior knowledge, the resulting de novo network models recapitulate some known interactions. The main predictive application is the identification of combinatorial candidates for cancer therapy. The approach is readily applicable to a wide range of kinetic models of cell biology.

EXPRESS: Overcoming the Cold Start Problem of CRM using a Probabilistic Machine Learning Approach

Journal of Marketing Research ◽

10.1177/00222437211032938 ◽

2021 ◽

pp. 002224372110329

Author(s):

Nicolas Padilla ◽

Eva Ascarza

Keyword(s):

Machine Learning ◽

Cold Start ◽

Customer Relationship ◽

Exponential Families ◽

Modeling Framework ◽

Wide Range ◽

The Moment ◽

Probabilistic Machine Learning ◽

Marketing Actions ◽

Cold Start Problem

The success of Customer Relationship Management (CRM) programs ultimately depends on the firm's ability to identify and leverage differences across customers, a very difficult task when firms attempt to manage new customers, for whom only the first purchase has been observed. For those customers, the lack of repeated observations poses a structural challenge to inferring unobserved differences across them. This is what we call the “cold start” problem of CRM, whereby companies have difficulties leveraging existing data when they attempt to make inferences about customers at the beginning of their relationship. We propose a solution to the cold start problem by developing a probabilistic machine learning modeling framework that leverages the information collected at the moment of acquisition. The main aspect of the model is that it flexibly captures latent dimensions that govern the behaviors observed at acquisition as well as future propensities to buy and to respond to marketing actions using deep exponential families. The model can be integrated with a variety of demand specifications and is flexible enough to capture a wide range of heterogeneity structures. We validate our approach in a retail context and empirically demonstrate the model's ability to identify high-value customers as well as those most sensitive to marketing actions, right after their first purchase.

Development of Digital Twins for Drilling Fluids: Local Velocities for Hole Cleaning and Rheology Monitoring

10.1115/omae2021-62987 ◽

2021 ◽

Author(s):

Mehrdad Gharib Shirangi ◽

Roger Aragall ◽

Reza Ettehadi ◽

Roland May ◽

Edward Furlong ◽

...

Keyword(s):

Machine Learning ◽

Rheological Properties ◽

Drilling Fluid ◽

Training Data ◽

Drilling Fluids ◽

Hole Cleaning ◽

Digital Twins ◽

Wide Range ◽

Drilling Operations ◽

Cuttings Bed

Abstract In this work, we present our advances to develop and apply digital twins for drilling fluids and associated wellbore phenomena during drilling operations. A drilling fluid digital twin is a series of interconnected models that incorporate the learning from past historical data in a wide range of operational settings to determine the fluid properties in real-time operations. From several drilling fluid functionalities and operational parameters, we describe advancements to improve hole cleaning predictions and high-pressure high-temperature (HPHT) rheological properties monitoring. In the hole cleaning application, we consider the Clark and Bickham (1994) approach, which requires the prediction of the local fluid velocity above the cuttings bed as a function of operating conditions. We develop accurate computational fluid dynamics (CFD) models to capture the effects of rotation, eccentricity and bed height on local fluid velocities above the cuttings bed. We then run 55,000 CFD simulations for a wide range of operational settings to generate training data for machine learning. For rheology monitoring, thousands of lab experiment records are collected as training data for machine learning. In this case, the HPHT rheological properties are determined based on rheological measurements under the American Petroleum Institute (API) condition together with the fluid type and composition data. We compare the results of applying several machine learning algorithms to represent CFD simulations (for the hole cleaning application) and lab experiments (for monitoring HPHT rheological properties). A rotating cross-validation method is applied to ensure accurate and robust results. In both cases, models from the Gradient Boosting and the Artificial Neural Network algorithms provided the highest accuracy (about 0.95 in terms of R-squared) for test datasets.
With the developments presented in this paper, the hole cleaning calculations can be performed more accurately in real time, and the HPHT rheological properties of drilling fluids can be estimated at the rigsite before performing the lab experiments. These contributions advance the digital transformation of drilling operations.
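
The evaluation described in this abstract (cross-validation, with accuracy reported as R-squared) can be illustrated with a minimal sketch. The abstract does not specify the exact rotation scheme, so a plain k-fold split is assumed here. The names `r_squared`, `k_fold_indices` and `fit_slope` are hypothetical, the data are made up, and a one-parameter least-squares fit stands in for the Gradient Boosting and neural network models.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_slope(x, y):
    """Least-squares slope through the origin (stand-in for a real model)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Toy data, invented for illustration: y is roughly 2 * x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 2.0, 3.9, 6.1, 8.0, 9.9]

scores = []
for train_idx, test_idx in k_fold_indices(len(xs), k=3):
    slope = fit_slope([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    preds = [slope * xs[i] for i in test_idx]
    scores.append(r_squared([ys[i] for i in test_idx], preds))

mean_r2 = mean(scores)  # average held-out R-squared across the folds
```

Averaging the held-out score over all folds is what makes the reported figure less dependent on any single train/test split.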

Detecting Pressure Anomalies While Drilling Using a Machine Learning Hybrid Approach

10.2118/204035-ms ◽

2021 ◽

Author(s):

Aurore Lafond ◽

Maurice Ringer ◽

Florian Le Blay ◽

Jiaxu Liu ◽

Ekaterina Millan ◽

...

Keyword(s):

Machine Learning ◽

Data Quality ◽

Real Time ◽

Large Scale ◽

Hybrid Approach ◽

Physical Models ◽

Training Data ◽

Digital Data ◽

Machine Learning Techniques ◽

New System

Abstract Abnormal surface pressure is typically the first indicator of a number of problematic events, including kicks, losses, washouts and stuck pipe. These events account for 60–70% of all drilling-related nonproductive time, so their early and accurate detection has the potential to save the industry billions of dollars. Detecting these events today requires an expert user watching multiple curves, which can be costly and subject to human error. The solution presented in this paper aims to augment traditional models with new machine learning techniques, which make it possible to detect these events automatically and help monitor the well being drilled. Today’s real-time monitoring systems employ complex physical models to estimate surface standpipe pressure while drilling. These require many inputs and are difficult to calibrate. Machine learning is an alternative method to predict pump pressure, but on its own it needs significant labelled training data, which is often lacking in the drilling world. The new system combines these approaches: a machine learning framework is used to enable automated learning while the physical models compensate for any gaps in the training data. The system uses only standard surface measurements, is fully automated, and is continuously retrained while drilling to ensure the most accurate pressure prediction. In addition, a stochastic (Bayesian) machine learning technique is used, which provides not only a prediction of the pressure but also the uncertainty and confidence of this prediction. Last, the new system includes a data quality control workflow. It excludes periods of low data quality from the pressure anomaly detection and enables smarter real-time event analysis. The new system has been tested on historical wells using a new test and validation framework.
The framework runs the system automatically on large volumes of both historical and simulated data, to enable cross-referencing the results with observations. In this paper, we show the results of the automated test framework as well as the capabilities of the new system in two specific case studies, one on land and another offshore. Moreover, large-scale statistics demonstrate the reliability and the efficiency of this new detection workflow. The new system builds on the trend in our industry to better capture and utilize digital data for optimizing drilling.
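
One way to picture how a pressure prediction with an uncertainty estimate could drive anomaly detection is sketched below. This is not the authors' system: the `Sample` fields, the `is_anomaly` function and the three-sigma threshold are all assumptions made for illustration, and the data quality flag simply stands in for the quality control workflow mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    measured: float        # measured standpipe pressure
    predicted_mean: float  # model's predicted pressure
    predicted_std: float   # model's predictive standard deviation
    good_quality: bool     # outcome of a (hypothetical) data quality check


def is_anomaly(sample: Sample, n_sigma: float = 3.0) -> Optional[bool]:
    """Return True/False for an anomaly, or None if the sample is skipped."""
    if not sample.good_quality or sample.predicted_std <= 0.0:
        return None  # low-quality data is excluded from detection
    deviation = abs(sample.measured - sample.predicted_mean)
    return deviation > n_sigma * sample.predicted_std


# Made-up readings: in band, out of band, and flagged as low quality.
readings = [
    Sample(measured=100.0, predicted_mean=101.0, predicted_std=2.0, good_quality=True),
    Sample(measured=112.0, predicted_mean=101.0, predicted_std=2.0, good_quality=True),
    Sample(measured=150.0, predicted_mean=101.0, predicted_std=2.0, good_quality=False),
]
flags = [is_anomaly(r) for r in readings]
```

Scaling the tolerance by the predictive standard deviation means the detector is stricter where the model is confident and more lenient where it is not.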

Arabic tweets sentiment analysis – a hybrid scheme

Journal of Information Science ◽

10.1177/0165551515610513 ◽

2016 ◽

Vol 42 (6) ◽

pp. 782-797 ◽

Cited By ~ 42

Author(s):

Haifa K. Aldayel ◽

Aqil M. Azmi

Keyword(s):

Machine Learning ◽

Saudi Arabia ◽

Hybrid Approach ◽

Training Data ◽

Machine Learning Techniques ◽

Good Source ◽

Learning Classifier ◽

Learning Techniques ◽

Semantic Orientation ◽

F Measure

The fact that people freely express their opinions and ideas in no more than 140 characters makes Twitter one of the most prevalent social networking websites in the world. Because Twitter is popular in Saudi Arabia, we believe that tweets are a good source to capture the public’s sentiment, especially since the country is in a fractious region. Going over the challenges and the difficulties that Arabic tweets present – using Saudi Arabia as a basis – we propose our solution. A typical problem is the practice of tweeting in dialectal Arabic. Based on our observation, we recommend a hybrid approach that combines semantic orientation and machine learning techniques. Through this approach, the lexical-based classifier labels the training data, a time-consuming task that is often done manually. The output of the lexical classifier is then used as training data for the SVM machine learning classifier. The experiments show that our hybrid approach improved the F-measure of the lexical classifier by 5.76% while the accuracy jumped by 16.41%, achieving an overall F-measure and accuracy of 84 and 84.01%, respectively.
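
The two-stage idea in this abstract, where a lexicon-based classifier labels the data and a learned classifier is trained on those labels, can be sketched as follows. This is a toy illustration under stated assumptions: the lexicon and sentences are invented (and in English for readability), and a small Naive Bayes model is swapped in for the SVM used in the paper so that the example needs only the standard library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy polarity lexicon.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}


def lexicon_label(text):
    """Semantic-orientation step: sum word polarities; None if undecided."""
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score == 0:
        return None
    return "pos" if score > 0 else "neg"


def train_naive_bayes(labelled):
    """Collect per-class word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


unlabelled = [
    "great service and friendly staff",
    "love the friendly staff",
    "awful delays and rude staff",
    "hate the rude delays",
]
# Step 1: the lexicon labels the training data automatically.
auto_labelled = [(t, lexicon_label(t)) for t in unlabelled]
auto_labelled = [(t, y) for t, y in auto_labelled if y is not None]
# Step 2: the learned classifier picks up words the lexicon does not contain.
model = train_naive_bayes(auto_labelled)
```

In this toy setup the lexicon cannot label a text such as "friendly staff", but the trained classifier can, because it has learned which non-lexicon words co-occur with each label.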

Prediction of Lung Cancer Risk using Random Forest Algorithm Based on Kaggle Data Set

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f7879.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 1623-1630

Keyword(s):

Machine Learning ◽

Lung Cancer ◽

Random Forest ◽

Naive Bayes ◽

Early Stage ◽

Naïve Bayes ◽

Training Data ◽

Random Forest Algorithm ◽

Data Set ◽

Wide Range

As huge amounts of data accumulate, extracting the required data from the available information becomes a challenge. Machine learning contributes to various fields. The fast-growing population has led to the emergence of a wide range of diseases. This in turn has created the need for machine learning models that use patient datasets. Analysis of datasets from different sources indicates that cancer is the most hazardous disease and may cause the death of the patient. The conducted surveys indicate that cancer can be nearly cured in its initial stages, whereas it may cause the death of the affected person in later stages. One of the major types of cancer is lung cancer. Its prediction depends heavily on past data, and it requires detection at an early stage. The proposed work uses a machine learning algorithm that groups individuals' details into categories to predict, at an early stage, whether they are at risk of cancer. A random forest algorithm is implemented; it achieves a higher efficiency, 97%, than KNN and Naive Bayes. Further, the KNN algorithm does not learn anything from the training data but uses it directly for classification. Naive Bayes yields inaccurate predictions. The proposed system predicts the chance of lung cancer at three levels: low, medium, and high. Thus, mortality rates can be reduced significantly.
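
The remark that KNN does not learn from the training data but uses it directly at classification time can be made concrete with a small sketch. The two numeric features, the data points and the function name `knn_predict` are invented for illustration; only the three risk levels (low, medium, high) come from the abstract.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up two-feature records; there is no fitting step, the data is just stored.
points = [
    (1, 1), (1, 2), (2, 1),     # low risk
    (5, 5), (5, 6), (6, 5),     # medium risk
    (9, 9), (9, 10), (10, 9),   # high risk
]
labels = ["low"] * 3 + ["medium"] * 3 + ["high"] * 3

risk = knn_predict(points, labels, query=(5.2, 5.2))
```

All of the work happens inside `knn_predict` at query time, which is why KNN is often described as a lazy learner, in contrast to a random forest that builds its trees during training.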

Bi-LSTM Sentiment Classifier for Climate Change Issues in South Korea

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1056.0782s619 ◽

2019 ◽

Vol 8 (2S6) ◽

pp. 295-299

Keyword(s):

Climate Change ◽

Machine Learning ◽

Big Data ◽

South Korea ◽

Sentiment Analysis ◽

Training Data ◽

Learning Models ◽

Wide Range ◽

Machine Learning Models ◽

Big Data Technology

Sentiment analysis of SNS data can reveal the thoughts of a wide variety of people. Such analysis can therefore help predict social problems and identify their complex causes more accurately. In addition, big data technology can capture SNS information as it is generated in real time, allowing a wide range of people’s opinions to be understood without delay. It can supplement traditional opinion surveys. The incumbent government mainly uses SNS to promote its policies. However, measures are needed to actively reflect SNS opinion in the process of carrying out policy. This paper therefore developed a sentiment classifier that can identify public feelings on SNS about climate change. To that end, based on a dictionary formulated on the theme of climate change, we collected climate change SNS data for learning and tagged seven sentiments. Using the training data, sentiment classifier models were developed with machine learning models. The analysis showed that the Bi-LSTM model performed better than the shallow models. It showed the highest accuracy (85.10%) across the seven sentiments classified, outperforming traditional machine learning (Naive Bayes and SVM) by approximately 34.53 and 7.14 percentage points, respectively. These findings substantiate the applicability of the proposed Bi-LSTM-based sentiment classifier to the analysis of sentiments relevant to diverse climate change issues.
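
Because the gains are stated in percentage points rather than as relative improvements, a short worked calculation may help. The baseline accuracies below are not reported in the abstract; they are only what the quoted (approximate) gaps would imply, and the variable names are hypothetical.

```python
bilstm_accuracy = 85.10  # percent, as reported in the abstract
gap_naive_bayes = 34.53  # percentage points, as reported (approximate)
gap_svm = 7.14           # percentage points, as reported (approximate)

# Implied baselines: a percentage-point gap is a plain difference.
implied_naive_bayes = round(bilstm_accuracy - gap_naive_bayes, 2)
implied_svm = round(bilstm_accuracy - gap_svm, 2)

# A relative improvement divides the gap by the baseline instead,
# so it is a different (and here larger) number than the gap itself.
relative_gain_naive_bayes = 100.0 * gap_naive_bayes / implied_naive_bayes
relative_gain_svm = 100.0 * gap_svm / implied_svm
```

Under these assumptions the implied baselines are about 50.57% for Naive Bayes and 77.96% for SVM, which shows why the two ways of quoting an improvement should not be mixed.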

Machine learning for laser-induced electron diffraction imaging of molecular structures

Communications Chemistry ◽

10.1038/s42004-021-00594-z ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Xinyao Liu ◽

Kasra Amini ◽

Aurelien Sanchez ◽

Blanca Belsa ◽

Tobias Steinle ◽

...

Keyword(s):

Machine Learning ◽

Electron Diffraction ◽

Learning Algorithm ◽

Structural Complexity ◽

Solution Space ◽

Molecular Structures ◽

Global Extremum ◽

Dimensional Solution ◽

Time Resolved ◽

Diffraction Imaging

Abstract Ultrafast diffraction imaging is a powerful tool to retrieve the geometric structure of gas-phase molecules with combined picometre spatial and attosecond temporal resolution. However, structural retrieval becomes progressively difficult with increasing structural complexity, given that a global extremum must be found in a multi-dimensional solution space. Worse, pre-calculating many thousands of molecular configurations for all orientations becomes simply intractable. As a remedy, here, we propose a machine learning algorithm with a convolutional neural network which can be trained with a limited set of molecular configurations. We demonstrate structural retrieval of a complex and large molecule, Fenchone (C10H16O), from laser-induced electron diffraction (LIED) data without fitting algorithms or ab initio calculations. Retrieval of such a large molecular structure is not possible with other variants of LIED or ultrafast electron diffraction. Combining electron diffraction with machine learning presents new opportunities to image complex and larger molecules in static and time-resolved studies.

Comprehensive Fitness Landscape of a Multi-Geometry Protein Capsid Informs Machine Learning Models of Assembly

10.1101/2021.12.21.473721 ◽

2021 ◽

Author(s):

Daniel D. Brauer ◽

Celine B. Santiago ◽

Zoe N. Merz ◽

Esther McCarthy ◽

Danielle Tullman-Ercek ◽

...

Keyword(s):

Machine Learning ◽

In Silico ◽

Quaternary Structure ◽

De Novo ◽

Fitness Landscape ◽

Machine Learning Algorithms ◽

Training Data ◽

Particle Assembly ◽

Complex Particle ◽

Self Assembled

Virus-like particles (VLPs) are non-infectious viral-derived nanomaterials poised for biotechnological applications due to their well-defined, modular self-assembling architecture. Although progress has been made in understanding the complex effects that mutations may have on VLPs, a nuanced understanding of the influence particle mutability has on quaternary structure has yet to be achieved. Here, we generate and compare the apparent fitness landscapes of two capsid geometries (T=3 and T=1 icosahedral) of the bacteriophage MS2 VLP. We find significant shifts in mutability at the symmetry interfaces of the T=1 capsid when compared to the wildtype T=3 assembly. Furthermore, we use the generated landscapes to benchmark the performance of in silico mutational scanning tools in capturing the effect of missense mutation on complex particle assembly. Finding that predicted stability effects correlated relatively poorly with assembly phenotype, we used a combination of de novo features in tandem with in silico results to train machine learning algorithms for the classification of variant effects on assembly. Our findings not only reveal ways that assembly geometry affects the mutable landscape of a self-assembled particle, but also establish a template for the generation of predictive mutational models of self-assembled capsids using minimal empirical training data.

HAMLET

Terminology ◽

10.1075/term.20017.rig ◽

2021 ◽

Author(s):

Ayla Rigouts Terryn ◽

Véronique Hoste ◽

Els Lefever

Keyword(s):

Machine Learning ◽

Language Processing ◽

Hybrid Approach ◽

Substantial Effect ◽

Training Data ◽

Supervised Machine Learning ◽

Learning Approach ◽

Term Extraction ◽

Machine Learning Approach ◽

Different Types

Abstract Automatic term extraction (ATE) is an important task within natural language processing, both separately, and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The ACTER Annotated Corpora for Term Extraction Research contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system (HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology) provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art, and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows how supervised machine learning is a promising methodology for ATE.

Design and Manufacture of a Multiband Rectangular Spiral-Shaped Microstrip Antenna Using EM-Driven and Machine Learning

Elektronika ir Elektrotechnika ◽

10.5755/j02.eie.27583 ◽

2021 ◽

Vol 27 (1) ◽

pp. 29-40

Author(s):

Ashrf Aoad

Keyword(s):

Machine Learning ◽

Microstrip Antenna ◽

Machine Learning Algorithms ◽

Mobile Systems ◽

Training Data ◽

Learning Models ◽

Prediction Ability ◽

Antenna Structure ◽

Wide Range ◽

Machine Learning Models

This paper presents a multiband rectangular microstrip antenna using spiral-shaped configurations. The antenna has been designed by combining two configurations, microstrip and spiral, with careful selection of the substrate material, the dimensions of the rectangular microstrip, the distance between the turns of the spiral, and the number of turns of the spiral. The efficiency and accuracy have also been improved using machine learning algorithms. Machine learning has been studied to model the proposed antenna based on the performance requirements, which requires sufficient training data to improve the accuracy. Three different machine learning models are applied to improve the accuracy and generalization performance and are compared with simulation and measurement results. Simulation, measurement, and machine learning results confirm that the proposed antenna is a new electrically small antenna operating over a wide range of high-frequency bands between 1 GHz and 4 GHz. The machine learning models with the best prediction ability have mean square errors (MSE) of 0.03 and 0.05. The antenna structure and size are compatible with and suitable for several multi-band wireless mobile systems operating in the L-band and S-band. The results, such as directivity, Half-Power Beamwidth, Voltage Standing Wave Ratio (VSWR), and S-parameter curves, are analysed and compared with the numerical formulation for both spiral and microstrip antennas.
