Synthesis of computer simulation and machine learning for achieving the best material properties of filled rubber

2020 · Vol 10 (1)
Author(s): Takashi Kojima, Takashi Washio, Satoshi Hara, Masataka Koishi

Molecular dynamics (MD) simulation is used to analyze the mechanical properties of polymerized, nanoscale filled rubber. Unfortunately, a single simulation can require several months of computing time, because the interactions of thousands of filler particles must be calculated. To alleviate this problem, we introduce a surrogate convolutional neural network model to achieve faster and more accurate predictions. The major difficulty when employing machine-learning-based surrogate models is the shortage of training data, which stems from the huge cost of generating it by simulation. To derive a highly accurate surrogate model from only a small amount of training data, we increase the number of training instances by dividing the large-scale simulation results into 3D images of middle-scale filler morphologies and their corresponding regional stresses. The images include fringe regions to reflect the influence of the filler constituents outside the core regions. The resulting surrogate model provides higher prediction accuracy than a model trained only on images of the entire region. Finally, we use the surrogate model to extract the fillers that dominate the mechanical properties and confirm their validity with MD.
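As a concrete illustration of the data-augmentation step, the sketch below divides a large 3D simulation box into middle-scale core regions, each padded with a fringe margin, and pairs each patch with the mean stress of its core. The array layout, the core/fringe sizes, and the use of mean core stress as the target are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def make_training_patches(filler_volume, stress_field, core=32, fringe=8):
    """Cut a cubic filler-occupancy volume into middle-scale patches.

    Each input patch is a core region plus a surrounding fringe, so the
    CNN sees the filler constituents just outside the core; the target
    is the mean stress over the core region only.
    """
    n = filler_volume.shape[0]
    inputs, targets = [], []
    for x0 in range(fringe, n - core - fringe + 1, core):
        for y0 in range(fringe, n - core - fringe + 1, core):
            for z0 in range(fringe, n - core - fringe + 1, core):
                patch = filler_volume[x0 - fringe:x0 + core + fringe,
                                      y0 - fringe:y0 + core + fringe,
                                      z0 - fringe:z0 + core + fringe]
                stress = stress_field[x0:x0 + core,
                                      y0:y0 + core,
                                      z0:z0 + core].mean()
                inputs.append(patch)
                targets.append(stress)
    return np.stack(inputs), np.array(targets)
```

One large simulation box thus yields many (patch, stress) pairs instead of a single training instance, which is what makes the surrogate trainable from few simulations.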

2021 · Vol 11 (2) · pp. 472
Author(s): Hyeongmin Cho, Sangkyun Lee

Machine learning has been proven effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training datasets, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale, high-dimensional data such as images and videos. This paper proposes two data quality measures that compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; we argue that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures, based on random projections and bootstrapping, with statistical benefits on large-scale, high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale, high-dimensional datasets.
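A minimal sketch of how such measures could be computed scalably follows. The specific definitions used here (between-class to within-class distance ratio for separability, mean distance to the class centroid for in-class variability) are illustrative stand-ins for the paper's measures; the random-projection and bootstrap structure follows the approach the abstract describes.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def quality_measures(X, y, dim=64, n_boot=20, sample=2000, seed=0):
    """Estimate class separability and in-class variability on a random
    low-dimensional projection, averaged over bootstrap resamples.
    Assumes each resample contains at least two classes."""
    rng = np.random.default_rng(seed)
    Z = GaussianRandomProjection(n_components=dim,
                                 random_state=0).fit_transform(X)
    seps, variabilities = [], []
    for _ in range(n_boot):
        idx = rng.choice(len(Z), size=min(len(Z), sample), replace=True)
        Zb, yb = Z[idx], y[idx]
        centroids = {c: Zb[yb == c].mean(axis=0) for c in np.unique(yb)}
        within = np.mean([np.linalg.norm(Zb[i] - centroids[yb[i]])
                          for i in range(len(Zb))])
        cents = list(centroids.values())
        between = np.mean([np.linalg.norm(a - b)
                           for i, a in enumerate(cents)
                           for b in cents[i + 1:]])
        seps.append(between / (within + 1e-12))
        variabilities.append(within)
    return np.mean(seps), np.mean(variabilities)
```

Projecting once and resampling small subsets keeps the cost independent of the original dimensionality and near-linear in the number of resampled points.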


2020 · Vol 8 (Suppl 3) · pp. A62-A62
Author(s): Dattatreya Mellacheruvu, Rachel Pyke, Charles Abbott, Nick Phillips, Sejal Desai, ...

Background: Accurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended this work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies.

Methods: In-house immunopeptidomic data was generated using stably transfected HLA-null K562 cell lines that express a single HLA allele of interest, followed by immunoprecipitation using the W6/32 antibody and LC-MS/MS. Public immunopeptidomics data was downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at a 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples on the ImmunoID NeXT Platform.

Results: We generated large-scale, high-quality immunopeptidomics data from approximately 60 mono-allelic cell lines, which unambiguously assign peptides to their presenting alleles, to create our primary models. Briefly, our primary 'binding' model captures MHC-peptide binding using peptide and binding-pocket features, while our primary 'presentation' model uses additional features to model antigen processing and presentation. Both primary models achieve significantly higher precision across all recall values on multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve performance, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data was integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples.

Conclusions: Improving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance than a state-of-the-art public algorithm and furthers this objective.
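The peptide-to-allele resolution step in the Results can be pictured as follows. Here `presentation_score` is a hypothetical stand-in for the primary presentation model, and the winner-takes-all assignment is an assumed rule for illustration, not necessarily the authors' exact procedure.

```python
def resolve_peptide_alleles(peptides, sample_alleles, presentation_score,
                            min_score=0.0):
    """Assign each peptide observed in a multi-allelic sample to the
    allele most likely to have presented it, per the primary model.

    peptides           : list of peptide sequences from one sample
    sample_alleles     : HLA alleles genotyped for that sample
    presentation_score : callable (peptide, allele) -> float; stands in
                         for the primary presentation model
    min_score          : peptides scoring below this on every allele
                         are left unassigned (ambiguous)
    """
    assignments = {}
    for pep in peptides:
        scores = {a: presentation_score(pep, a) for a in sample_alleles}
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:
            assignments[pep] = best
    return assignments
```

Resolved (peptide, allele) pairs can then be pooled with the mono-allelic data to train the secondary model.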


2019
Author(s): Mojtaba Haghighatlari, Gaurav Vishwakarma, Mohammad Atif Faiz Afzal, Johannes Hachmann

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: we bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as high-RI organic molecules for applications in opto-electronics.
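The abstract does not name its ranking metric; NDCG is the most common such metric in web search, so the sketch below uses it as an assumed example, treating the true RI as the gain of each ranked candidate.

```python
import numpy as np

def ndcg_at_k(true_ri, predicted_ri, k=1000):
    """Normalized discounted cumulative gain at rank k: candidates are
    ranked by predicted RI and scored by how much true high-RI value
    accumulates near the top, relative to the ideal ordering."""
    true_ri = np.asarray(true_ri, dtype=float)
    predicted_ri = np.asarray(predicted_ri, dtype=float)
    k = min(k, len(true_ri))
    order = np.argsort(predicted_ri)[::-1][:k]   # best predictions first
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(true_ri[order] * discounts)
    ideal = np.sort(true_ri)[::-1][:k]           # best possible ordering
    idcg = np.sum(ideal * discounts)
    return dcg / idcg
```

An NDCG near 1 over the top 1,000 compounds is exactly the "captures their ranking" claim: the model places the genuinely high-RI molecules at the head of the list.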


2021 · Vol 22 (1) · pp. 53-66
Author(s): D. Anand Joseph Daniel, M. Janaki Meena

Sentiment analysis of online product reviews has become a mainstream way for businesses on e-commerce platforms to promote their products and improve user satisfaction. Hence, it is necessary to construct an automatic sentiment analyser for identifying the sentiment polarity of online product reviews. Traditional lexicon-based approaches to sentiment analysis suffer from accuracy issues, while machine learning techniques require labelled training data. This paper introduces a hybrid sentiment analysis framework to bridge the gap between machine learning and lexicon-based approaches. A novel tunicate swarm algorithm (TSA) based feature reduction is integrated into the proposed hybrid method to solve the scalability issue that arises from a large feature set. It reduces the feature set to 43% of its original size without changing the accuracy (93%). It also improves scalability, reduces computation time and enhances the overall performance of the proposed framework. Experimental analysis shows that TSA outperforms existing feature selection techniques such as particle swarm optimization and genetic algorithms. Moreover, the proposed approach is analysed with performance metrics such as recall, precision, F1-score, feature size and computation time.
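A generic swarm-style wrapper for feature selection is sketched below to make the TSA step concrete. The actual tunicate swarm update equations are not reproduced here, and the fitness function (cross-validated accuracy with a small penalty on feature count) is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def swarm_feature_selection(X, y, model, n_agents=20, n_iter=30, seed=0):
    """Swarm-style wrapper feature selection: each agent is a binary
    feature mask; agents drift toward the best mask found so far while
    keeping some random exploration."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]

    def fitness(mask):
        if not mask.any():
            return -np.inf
        acc = cross_val_score(model, X[:, mask], y, cv=3).mean()
        return acc - 0.01 * mask.mean()   # prefer smaller feature sets

    agents = rng.random((n_agents, n_feat)) < 0.5
    best, best_fit = None, -np.inf
    for mask in agents:
        f = fitness(mask)
        if f > best_fit:
            best, best_fit = mask.copy(), f
    for _ in range(n_iter):
        for i in range(n_agents):
            # copy most bits from the best mask, re-randomize the rest
            follow = rng.random(n_feat) < 0.8
            agents[i] = np.where(follow, best, rng.random(n_feat) < 0.5)
            f = fitness(agents[i])
            if f > best_fit:
                best, best_fit = agents[i].copy(), f
    return best   # boolean mask of selected features
```

The returned mask plays the role of the reduced feature set fed to the hybrid sentiment classifier.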


2021
Author(s): Aurore Lafond, Maurice Ringer, Florian Le Blay, Jiaxu Liu, Ekaterina Millan, ...

Abnormal surface pressure is typically the first indicator of a number of problematic events, including kicks, losses, washouts and stuck pipe. These events account for 60-70% of all drilling-related nonproductive time, so their early and accurate detection has the potential to save the industry billions of dollars. Detecting these events today requires an expert user watching multiple curves, which is costly and subject to human error. The solution presented in this paper augments traditional models with machine learning techniques that detect these events automatically and support monitoring of the well while drilling. Today's real-time monitoring systems employ complex physical models to estimate surface standpipe pressure while drilling. These require many inputs and are difficult to calibrate. Machine learning is an alternative way to predict pump pressure, but on its own it needs significant labelled training data, which is often lacking in the drilling world. The new system combines the two approaches: a machine learning framework enables automated learning, while the physical models compensate for any gaps in the training data. The system uses only standard surface measurements, is fully automated, and is continuously retrained while drilling to ensure the most accurate pressure prediction. In addition, a stochastic (Bayesian) machine learning technique is used, which yields not only a prediction of the pressure but also the uncertainty and confidence of that prediction. Finally, the new system includes a data quality control workflow: it excludes periods of low data quality from the pressure anomaly detection, enabling smarter real-time event analysis. The new system has been tested on historical wells using a new test and validation framework. The framework runs the system automatically on large volumes of both historical and simulated data, enabling cross-referencing of the results with observations. In this paper, we show the results of the automated test framework as well as the capabilities of the new system in two specific case studies, one on land and one offshore. Moreover, large-scale statistics demonstrate the reliability and efficiency of this new detection workflow. The new system builds on the trend in our industry to better capture and utilize digital data for optimizing drilling.
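One plausible reading of the physics-plus-ML combination is a residual architecture, sketched below, with a bootstrap ensemble standing in for the paper's Bayesian uncertainty estimate. The `physics_model` callable, the choice of random forests, and all parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class HybridPressureModel:
    """Physics model supplies a baseline standpipe-pressure estimate;
    an ensemble of ML models learns the residual from surface
    measurements, and the ensemble spread gives an uncertainty band."""

    def __init__(self, physics_model, n_members=10):
        self.physics = physics_model   # callable: X -> baseline pressure
        self.members = [RandomForestRegressor(n_estimators=50,
                                              random_state=s)
                        for s in range(n_members)]

    def fit(self, X, measured_pressure):
        """X: numpy array of surface measurements (rows = time steps)."""
        residual = measured_pressure - self.physics(X)
        rng = np.random.default_rng(0)
        for m in self.members:          # bootstrap each ensemble member
            idx = rng.choice(len(X), size=len(X), replace=True)
            m.fit(X[idx], residual[idx])
        return self

    def predict(self, X):
        """Return (mean prediction, per-sample uncertainty)."""
        preds = np.stack([self.physics(X) + m.predict(X)
                          for m in self.members])
        return preds.mean(axis=0), preds.std(axis=0)
```

Because the ML part only corrects the physics baseline, the system degrades gracefully where labelled training data is sparse, which matches the gap-compensation role the abstract assigns to the physical models.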


Entropy · 2020 · Vol 22 (5) · pp. 544
Author(s): Emre Ozfatura, Sennur Ulukus, Deniz Gündüz

When gradient descent (GD) is scaled to many parallel workers for large-scale machine learning applications, its per-iteration computation time is limited by straggling workers. Stragglers can be tolerated by assigning redundant computations and/or coding across data and computations, but in most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations. This limitation has two drawbacks: over-computation due to inaccurate prediction of the straggling behavior, and under-utilization due to discarding the partial computations carried out by stragglers. To overcome these drawbacks, we consider multi-message communication (MMC), which allows multiple computations to be conveyed from each worker per iteration, and propose novel straggler avoidance techniques for both coded computation and coded communication with MMC. We analyze how the proposed designs can be employed efficiently to balance computation and communication latency. Furthermore, we identify the advantages and disadvantages of these designs in different settings through extensive simulations and a real implementation on Amazon EC2 servers, and demonstrate that the proposed schemes with MMC can improve upon existing straggler avoidance schemes.
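The benefit of MMC can be illustrated with a toy timing model for an uncoded scheme with replicated data partitions; all numbers and the round-robin assignment are illustrative, not the paper's analysis.

```python
import numpy as np

def iteration_time(n_workers=12, n_partitions=12, replication=3,
                   straggler_prob=0.2, slowdown=5.0, seed=0):
    """Each data partition is replicated on `replication` workers in
    round-robin order; the PS needs one gradient per partition.
    Without MMC a worker communicates only after finishing all of its
    partitions; with MMC it sends each partial gradient as soon as it
    is computed, so stragglers' partial work still counts."""
    rng = np.random.default_rng(seed)
    speed = np.where(rng.random(n_workers) < straggler_prob, slowdown, 1.0)
    assignment = [[(p + j) % n_workers for j in range(replication)]
                  for p in range(n_partitions)]
    queues = {w: [] for w in range(n_workers)}
    for p, workers in enumerate(assignment):
        for w in workers:
            queues[w].append(p)
    # per-partition and total completion times on each worker
    finish = {(w, p): speed[w] * (k + 1)
              for w, parts in queues.items() for k, p in enumerate(parts)}
    total = {w: speed[w] * len(parts) for w, parts in queues.items()}
    no_mmc = max(min(total[w] for w in assignment[p])
                 for p in range(n_partitions))
    mmc = max(min(finish[(w, p)] for w in assignment[p])
              for p in range(n_partitions))
    return no_mmc, mmc

print(iteration_time())   # MMC finishes the iteration no later
```

Because a straggler's first partial gradient arrives long before its last, MMC lets the PS assemble the full gradient from early messages instead of waiting for any worker to finish everything.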


2020
Author(s): Sorush Niknamian

To address the shortage of computing power available from a single machine or a small cluster in scientific research, we offer users a collaborative computing system with massive computational capacity. It introduces a scalable, mixed collaborative computing model: over the Internet, the system distributes work to heterogeneous computing equipment using a task decomposition model, thereby overcoming the capacity shortfall that hampers research and development. To test the model, a subtask decomposition example is used. The analysis of the example shows that computation time is shortest when the number of computing nodes exceeds the number of subtasks, and that computational efficiency is maximized when the number of computing nodes is close to the number of subtasks. Through joint collaborative computing, the extensible mixed collaborative computing model can effectively solve massive computing problems on systems with heterogeneous hardware and software. This paper provides a reference for systems that deliver large-scale computing power over the Internet and for research problems constrained by a lack of computing capability.
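The claim about node counts can be checked with a back-of-the-envelope schedule, assuming equal-sized subtasks and identical nodes (a simplification of the paper's heterogeneous setting):

```python
import math

def schedule(num_subtasks, num_nodes, subtask_time=1.0):
    """Subtasks run in waves of `num_nodes` at a time; efficiency is
    the fraction of total node-time spent doing useful work."""
    waves = math.ceil(num_subtasks / num_nodes)
    total_time = waves * subtask_time
    efficiency = (num_subtasks * subtask_time) / (num_nodes * total_time)
    return total_time, efficiency

for nodes in (4, 8, 12, 16, 24):
    t, e = schedule(num_subtasks=12, num_nodes=nodes)
    print(f"{nodes:2d} nodes: time={t:.0f}, efficiency={e:.2f}")
# Time bottoms out (one wave) once nodes >= subtasks; among those
# single-wave schedules, efficiency peaks when nodes == subtasks (12).
```

Extra nodes beyond the subtask count sit idle, which is why efficiency falls even though the total time no longer improves.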


2020
Author(s): Romain Gaillac, Siwar Chibani, François-Xavier Coudert

The characterization of the mechanical properties of crystalline materials is nowadays considered a routine computational task in DFT calculations. However, its high computational cost still prevents it from being used in high-throughput screening methodologies, where a cheaper estimate of the elastic properties of a material is required. In this work, we have investigated the accuracy of force field calculations for the prediction of mechanical properties, and in particular for the characterization of the directional Poisson's ratio. We analyze the behavior of about 600,000 hypothetical zeolitic structures at the classical level (a scale three orders of magnitude larger than previous studies), to highlight generic trends between mechanical properties and energetic stability. By comparing these results with DFT calculations on 991 zeolitic frameworks, we highlight the limitations of force field predictions, in particular for predicting auxeticity. We then used this reference DFT data as a training set for a machine learning algorithm, showing that it offers a way to build fast and reliable predictive models for anisotropic properties. The accuracies obtained are, in particular, much better than the current "cheap" approach for screening, which is the use of force fields. These results are a significant improvement over the previous work, due to the more difficult nature of the properties studied, namely the anisotropic elastic response. It is also the first time such a large training data set is used for zeolitic materials.
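A sketch of the final step, training a fast surrogate on the DFT reference data, is shown below with placeholder descriptors and targets; the feature set, target choice (minimal directional Poisson's ratio), and learning algorithm are all assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Placeholder data standing in for the 991 DFT-characterized zeolitic
# frameworks: X would hold cheap structural/energetic descriptors, and
# y a DFT-derived anisotropic target such as the minimal directional
# Poisson's ratio.
rng = np.random.default_rng(0)
X = rng.random((991, 12))          # placeholder descriptor matrix
y = rng.normal(0.2, 0.15, 991)     # placeholder DFT Poisson's ratios

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
# The trained surrogate can then score the ~600,000 hypothetical
# structures far more cheaply than either DFT or force fields.
```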

