AI for predicting chemical-effect associations at the universe level - deepFPlearn

2021 ◽  
Author(s):  
Jana Schor ◽  
Patrick Scheibe ◽  
Matthias Berndt ◽  
Wibke Busch ◽  
Chih Lai ◽  
...  

A plethora of chemical substances exists in our environment, and all living species, including humans, are exposed to various mixtures of them. Our society is accustomed to developing, producing, using, and dispersing a diverse and vast amount of chemicals with the original intention of improving our standard of living. However, many chemicals pose risks, for example of causing severe diseases, if they occur at the wrong time in the wrong place. For the majority of chemicals these risks are not known. Chemical risk assessment and the subsequent regulation of use require efficient and systematic strategies, which are so far unavailable. Experimental methods, even high-throughput ones, are still lab-based and therefore too slow to keep up with the pace of chemical innovation. Existing computational approaches, e.g. machine-learning-based ones, are powerful on specific chemical classes or sub-problems, but not applicable on a large scale. Their main limitations are a lack of applicability to chemicals outside the training data and the limited availability of sufficient training data. Here, we present the ready-to-use, stand-alone program deepFPlearn, which predicts the association between chemical structures and effects at the gene/pathway level using deep learning. We show good performance values for our trained models and demonstrate that our program can predict meaningful associations of chemicals and effects beyond the training range, owing to a sophisticated feature compression approach using a deep autoencoder. Further, it can be applied to hundreds of thousands of chemicals in seconds. We provide deepFPlearn as an open-source and flexible tool that can easily be retrained and customized to different application settings at https://github.com/yigbt/deepFPlearn.
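The feature compression step described above can be pictured with a toy single-hidden-layer autoencoder over binary fingerprint vectors. deepFPlearn itself uses a deep autoencoder on real molecular fingerprints, so the layer sizes, synthetic data, and training schedule below are purely illustrative:

```python
import numpy as np

# Toy illustration of fingerprint compression with an autoencoder:
# 128-bit synthetic "fingerprints" are squeezed into a 16-dim latent code.
rng = np.random.default_rng(0)
X = (rng.random((256, 128)) < 0.1).astype(float)  # sparse binary vectors

n_in, n_hid = X.shape[1], 16
W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_hid, n_in)); b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    H = sigmoid(X @ W1 + b1)    # compressed latent code
    Xr = sigmoid(H @ W2 + b2)   # reconstruction of the fingerprint
    return H, Xr

loss_start = np.mean((forward(X)[1] - X) ** 2)

lr = 0.5
for _ in range(300):
    H, Xr = forward(X)
    # backprop of the mean-squared reconstruction error
    dZ2 = 2 * (Xr - X) / X.size * Xr * (1 - Xr)
    dW2, db2 = H.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = dZ2 @ W2.T * H * (1 - H)
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

loss_end = np.mean((forward(X)[1] - X) ** 2)
```

After training, the 16-dimensional latent code `H` stands in for the compressed chemical representation that a downstream effect classifier would consume.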

2008 ◽  
Vol 8 (1) ◽  
pp. 995-1039 ◽  
Author(s):  
E. I. Chang ◽  
J. F. Pankow

Abstract. Secondary organic aerosol (SOA) formation in the atmosphere is currently often modeled using a multiple lumped "two-product" (N·2p) approach. The N·2p approach neglects: 1) variation of activity coefficient (ζi) values and of the mean molecular weight MW in the particulate matter (PM) phase; 2) water uptake into the PM; and 3) the possibility of phase separation in the PM. This study considers these effects by adopting an (N·2p)ζ,MW,θ approach (θ is a phase index). Specific chemical structures are assigned to 25 lumped SOA compounds and to 15 representative primary organic aerosol (POA) compounds to allow calculation of ζi and MW values. The SOA structure assignments are based on chamber-derived 2p gas/particle partition coefficient values coupled with known effects of structure on vapor pressure p°L,i (atm). To facilitate adoption of the (N·2p)ζ,MW,θ approach in large-scale models, this study also develops CP-Wilson.1, a group-contribution ζi-prediction method that is more computationally economical than the UNIFAC model of Fredenslund et al. (1975). Group parameter values required by CP-Wilson.1 are obtained by fitting ζi values to predictions from UNIFAC. The (N·2p)ζ,MW,θ approach is applied (using CP-Wilson.1) to several real α-pinene/O3 chamber cases at high reacted hydrocarbon levels (ΔHC ≈ 400 to 1000 μg m−3) and relative humidity (RH) ≈ 50%. Good agreement between the chamber and predicted results is obtained using both the (N·2p)ζ,MW,θ and N·2p approaches, indicating relatively small water effects under these conditions. However, for a hypothetical α-pinene/O3 case at ΔHC = 30 μg m−3 and RH = 50%, the (N·2p)ζ,MW,θ approach predicts that water uptake will lead to an organic PM level more than double that predicted by the N·2p approach. Adoption of the (N·2p)ζ,MW,θ approach using reasonable lumped structures for SOA and POA compounds is recommended for ambient PM modeling.
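For context, the 2p framework above rests on absorptive gas/particle partitioning (Pankow, 1994), where the partition coefficient depends on exactly the quantities this study refines: the mean molecular weight MW of the absorbing phase and the activity coefficient ζi. A minimal sketch with illustrative inputs (ζi would come from a method like CP-Wilson.1, not be supplied by hand):

```python
# Absorptive gas/particle partitioning: Kp,i in m^3/ug as a function of
# temperature T (K), mean molecular weight of the organic PM phase
# MW_om (g/mol), activity coefficient zeta, and pure-compound liquid
# vapor pressure pL (atm). f_om is the absorbing organic fraction of PM.
R = 8.206e-5  # gas constant, m^3 atm mol^-1 K^-1

def Kp(T, MW_om, zeta, pL, f_om=1.0):
    return 760.0 * R * T * f_om / (MW_om * 1e6 * zeta * pL)

def particle_fraction(Kp_i, M_pm):
    """Fraction of compound i in the PM phase, given M_pm ug/m^3 of
    absorbing organic material."""
    return Kp_i * M_pm / (1.0 + Kp_i * M_pm)
```

The sketch shows why neglecting ζi and MW variation matters: both sit in the denominator of Kp, so errors in either directly shift the predicted gas/particle split.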


2010 ◽  
Vol 10 (12) ◽  
pp. 5475-5490 ◽  
Author(s):  
E. I. Chang ◽  
J. F. Pankow

Abstract. Secondary organic aerosol (SOA) formation in the atmosphere is currently often modeled using a multiple lumped "two-product" (N·2p) approach. The N·2p approach neglects: 1) variation of activity coefficient (ζi) values and of the mean molecular weight MW in the particulate matter (PM) phase; 2) water uptake into the PM; and 3) the possibility of phase separation in the PM. This study considers these effects by adopting an (N·2p)ζ,MW,θ approach (θ is a phase index). Specific chemical structures are assigned to 25 lumped SOA compounds and to 15 representative primary organic aerosol (POA) compounds to allow calculation of ζi and MW values. The SOA structure assignments are based on chamber-derived 2p gas/particle partition coefficient values coupled with known effects of structure on vapor pressure p°L,i (atm). To facilitate adoption of the (N·2p)ζ,MW,θ approach in large-scale models, this study also develops CP-Wilson.1 (Chang-Pankow-Wilson.1), a group-contribution ζi-prediction method that is more computationally economical than the UNIFAC model of Fredenslund et al. (1975). Group parameter values required by CP-Wilson.1 are obtained by fitting ζi values to predictions from UNIFAC. The (N·2p)ζ,MW,θ approach is applied (using CP-Wilson.1) to several real α-pinene/O3 chamber cases at high reacted hydrocarbon levels (ΔHC ≈ 400 to 1000 μg m−3) and relative humidity (RH) ≈ 50%. Good agreement between the chamber and predicted results is obtained using both the (N·2p)ζ,MW,θ and N·2p approaches, indicating relatively small water effects under these conditions. However, for a hypothetical α-pinene/O3 case at ΔHC = 30 μg m−3 and RH = 50%, the (N·2p)ζ,MW,θ approach predicts that water uptake will lead to an organic PM level more than double that predicted by the N·2p approach. Adoption of the (N·2p)ζ,MW,θ approach using reasonable lumped structures for SOA and POA compounds is recommended for ambient PM modeling.


2020 ◽  
Vol 27 ◽  
Author(s):  
Zaheer Ullah Khan ◽  
Dechang Pi

Background: S-sulfenylation (S-sulphenylation, or sulfenic acid formation) is a special kind of post-translational modification that plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. Complementing existing wet-lab methods, several computational models have been developed for predicting sulfenylation cysteine (SC) sites. However, their performance has not been satisfactory, owing to inefficient feature schemes, severe class imbalance, and the lack of an intelligent learning engine. Objective: In this study, our motivation is to establish a strong, novel computational predictor that discriminates sulfenylation from non-sulfenylation sites. Methods: We report an innovative bioinformatics feature-encoding tool, named DeepSSPred, in which the encoded features are obtained via an n-segmented hybrid feature scheme; the resampling technique SMOTE (synthetic minority oversampling) is then employed to cope with the severe imbalance between SC sites (minority class) and non-SC sites (majority class). A state-of-the-art 2D convolutional neural network was employed, with rigorous 10-fold cross-validation and jackknife tests for model validation and authentication. Results: The proposed framework, combining a strong discrete representation of the feature space, a capable machine learning engine, and an unbiased presentation of the underlying training data, yielded an excellent model that outperforms all existing studies. The proposed approach is 6% higher in MCC than the first-best method; on an independent dataset, that study failed to provide sufficient details for comparison.
Compared with the second-best method, the model obtained an increase of 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp, and 13.12% in MCC on the training data, and 12.13% in ACC, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset. These empirical analyses show the superior performance of the proposed model over existing studies on both the training and independent datasets. Conclusion: In this research, we have developed DeepSSPred, a novel sequence-based automated predictor for SC sites. Empirical simulations on a training dataset and an independent validation dataset have revealed the efficacy of the proposed model. The good performance of DeepSSPred is due to several factors: novel discriminative feature-encoding schemes, the SMOTE technique, and careful construction of the prediction model through a tuned 2D-CNN classifier. We believe this work provides insight toward further prediction of S-sulfenylation characteristics and functionalities, and we hope the predictor will be significantly helpful for large-scale discrimination of unknown SC sites in particular and for designing new pharmaceutical drugs in general.
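The SMOTE resampling step described above can be sketched as follows: each synthetic minority sample is an interpolation between a real minority sample and one of its k nearest minority-class neighbours. The feature encoding, choice of k, and the downstream tuned 2D-CNN are specific to the paper and not reproduced here:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples (SMOTE)."""
    if rng is None:
        rng = np.random.default_rng(0)
    X_min = np.asarray(X_min, float)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]   # indices of k nearest neighbours
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # pick a minority sample
        j = nn[i, rng.integers(k)]              # and one of its neighbours
        lam = rng.random()                      # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the minority class's convex hull rather than duplicating points, which is what lets the classifier see a balanced yet non-degenerate training set.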


2021 ◽  
Vol 15 (3) ◽  
pp. 1-27
Author(s):  
Yan Liu ◽  
Bin Guo ◽  
Daqing Zhang ◽  
Djamal Zeghlache ◽  
Jingmin Chen ◽  
...  

Store site recommendation aims to predict the value of a store at candidate locations and then recommend the optimal location to the company for placing a new brick-and-mortar store. Most existing studies focus on learning machine learning or deep learning models from large-scale training data on existing chain stores in the same city. However, the expansion of chain enterprises into new cities suffers from data scarcity, and these models do not work in a new city where no chain store has yet been placed (i.e., the cold-start problem). In this article, we propose a unified approach for cold-start store site recommendation, the Weighted Adversarial Network with Transferability weighting scheme (WANT), to transfer knowledge learned from a data-rich source city to a target city with no labeled data. In particular, to promote positive transfer, we develop a discriminator to diminish the distribution discrepancy between the source and target cities, which plays a minimax game with the feature extractor to learn transferable representations across cities via adversarial learning. In addition, to further reduce the risk of negative transfer, we design a transferability weighting scheme that quantifies the transferability of examples in the source city and reweights the contribution of relevant source examples so as to transfer useful knowledge. We validate WANT on a real-world dataset, and experimental results demonstrate the effectiveness of our proposed model over several state-of-the-art baseline models.
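A static stand-in for the transferability weighting idea: train a discriminator to tell source-city from target-city features, then weight each source example by its estimated odds of being target-like (a density-ratio weight). WANT learns this jointly with a feature extractor via adversarial training; this sketch trains a plain logistic discriminator on fixed features:

```python
import numpy as np

def discriminator_weights(Xs, Xt, lr=0.1, epochs=500):
    """Weight source examples by how target-like a logistic
    discriminator finds them. Returns one weight per row of Xs."""
    X = np.vstack([Xs, Xt])
    y = np.r_[np.zeros(len(Xs)), np.ones(len(Xt))]   # 1 = target city
    X1 = np.c_[X, np.ones(len(X))]                   # append bias column
    w = np.zeros(X1.shape[1])
    for _ in range(epochs):                          # gradient descent
        p = 1.0 / (1.0 + np.exp(-X1 @ w))
        w -= lr * X1.T @ (p - y) / len(y)
    ps = 1.0 / (1.0 + np.exp(-np.c_[Xs, np.ones(len(Xs))] @ w))
    return ps / (1.0 - ps + 1e-12)   # odds: estimated p_target / p_source
```

Source examples that resemble the target distribution receive high weights and dominate training; clearly source-specific examples are down-weighted, which is the mechanism for reducing negative transfer.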


2021 ◽  
Vol 13 (3) ◽  
pp. 364
Author(s):  
Han Gao ◽  
Jinhui Guo ◽  
Peng Guo ◽  
Xiuwan Chen

Recently, deep learning has become the most innovative trend for a variety of high-spatial-resolution remote sensing imaging applications. However, large-scale land cover classification via traditional convolutional neural networks (CNNs) with sliding windows is computationally expensive and produces coarse results. Additionally, although such supervised learning approaches have performed well, collecting and annotating datasets for every task is extremely laborious, especially in fully supervised cases where dense pixel-level ground-truth labels are required. In this work, we propose a new object-oriented deep learning framework that leverages residual networks of different depths to learn adjacent feature representations by embedding a multibranch architecture in the deep learning pipeline. The idea is to exploit limited training data at different neighboring scales to make a tradeoff between weak semantics and strong feature representations for operational land cover mapping tasks. We draw on established geographic object-based image analysis (GEOBIA) as an auxiliary module to reduce the computational burden of spatial reasoning and optimize the classification boundaries. We evaluated the proposed approach on two subdecimeter-resolution datasets covering both urban and rural landscapes. It achieved better classification accuracy (88.9%) than traditional object-based deep learning methods and an excellent inference time (11.3 s/ha).
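The GEOBIA-assisted boundary refinement can be illustrated with a simple segment-level majority vote over pixel-wise predictions. The segment ids and pixel classes below are placeholders standing in for a real segmentation and real CNN output:

```python
import numpy as np

def object_majority_vote(pixel_classes, segment_ids):
    """Assign every segment the majority class of its pixels, which
    smooths noisy pixel-wise predictions along object boundaries."""
    refined = np.empty_like(pixel_classes)
    for seg in np.unique(segment_ids):
        mask = segment_ids == seg
        votes = np.bincount(pixel_classes[mask])  # class histogram per segment
        refined[mask] = np.argmax(votes)
    return refined
```

This kind of per-object aggregation is why an auxiliary GEOBIA module can clean up classification boundaries far more cheaply than pixel-level spatial reasoning inside the network.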


Symmetry ◽  
2021 ◽  
Vol 13 (5) ◽  
pp. 845
Author(s):  
Dongheun Han ◽  
Chulwoo Lee ◽  
Hyeongyeop Kang

The neural-network-based human activity recognition (HAR) technique is increasingly used for activity recognition in virtual reality (VR) users. The major issue with such techniques is the collection of the large-scale training datasets that are key to deriving a robust recognition model. However, collecting large-scale data is a costly and time-consuming process, and increasing the number of activities to be classified requires a much larger training dataset still. Since a sparse dataset can provide only limited features to recognition models, it can cause problems such as overfitting and suboptimal results. In this paper, we present a data augmentation technique named gravity-control-based data augmentation (GCDA) to alleviate the sparse-data problem by generating new training data from the existing data. A benefit of the symmetrical structure of the data is that augmentation increases the amount of data while preserving its properties. The core concept of GCDA is two-fold: (1) decomposing the acceleration data obtained from the inertial measurement unit (IMU) into zero-gravity acceleration and gravitational acceleration, and augmenting them separately, and (2) exploiting gravity as a directional feature and controlling it to augment training datasets. Through comparative evaluations, we validated that applying GCDA to training datasets yields a larger improvement in classification accuracy (96.39%) than typical data augmentation methods (92.29%) or no augmentation (85.21%).
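Step (1) of GCDA, splitting IMU acceleration into a slowly varying gravity component and a zero-gravity residual, plus a controlled rotation of the gravity direction, can be sketched as below. The low-pass filter constant and the rotation axis/angle are illustrative, not the paper's settings:

```python
import numpy as np

def decompose(acc, alpha=0.9):
    """acc: (T, 3) accelerometer samples. Returns (gravity, zero_g),
    with gravity estimated by an exponential low-pass filter."""
    g = np.zeros_like(acc)
    g[0] = acc[0]
    for t in range(1, len(acc)):
        g[t] = alpha * g[t - 1] + (1 - alpha) * acc[t]
    return g, acc - g          # zero-g residual = raw minus gravity

def rotate_gravity(g, angle):
    """Rotate the gravity component about the z-axis: 'controlling'
    gravity as a directional feature to synthesize new samples."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return g @ R.T
```

An augmented sample is then `rotate_gravity(g, angle) + zero_g` for a sampled angle, which changes the apparent device orientation while leaving the motion-induced signal intact.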


2021 ◽  
Vol 11 (2) ◽  
pp. 472
Author(s):  
Hyeongmin Cho ◽  
Sangkyun Lee

Machine learning has proven effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large amounts of training data, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale, high-dimensional data such as images and videos. This paper proposes two data quality measures that compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures, based on random projections and bootstrapping, with statistical benefits on large-scale, high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale, high-dimensional datasets.
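One plausible shape for a random-projection-based separability score is sketched below: project to a cheap low-dimensional space, then compare between-class to within-class spread. The paper's precise definitions of separability and in-class variability, and its bootstrapping procedure, differ in detail:

```python
import numpy as np

def project(X, out_dim, rng):
    """Gaussian random projection: cost scales with out_dim, not the
    original dimensionality, while roughly preserving distances."""
    P = rng.normal(size=(X.shape[1], out_dim)) / np.sqrt(out_dim)
    return X @ P

def separability(X, y, out_dim=8, seed=0):
    """Ratio of between-class to within-class spread after projection."""
    rng = np.random.default_rng(seed)
    Z = project(X, out_dim, rng)
    classes = np.unique(y)
    centroids = np.array([Z[y == c].mean(axis=0) for c in classes])
    within = np.mean([np.mean(np.linalg.norm(Z[y == c] - centroids[i], axis=1))
                      for i, c in enumerate(classes)])
    between = np.mean(np.linalg.norm(centroids - Z.mean(axis=0), axis=1))
    return between / (within + 1e-12)
```

Repeating the computation over several random projections (or bootstrap resamples) and averaging gives a stabilized estimate without ever touching the full-dimensional covariance.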


2020 ◽  
Vol 8 (Suppl 3) ◽  
pp. A62-A62
Author(s):  
Dattatreya Mellacheruvu ◽  
Rachel Pyke ◽  
Charles Abbott ◽  
Nick Phillips ◽  
Sejal Desai ◽  
...  

Background: Accurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended this work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies.
Methods: In-house immunopeptidomic data were generated using stably transfected, HLA-null K562 cell lines that express a single HLA allele of interest, followed by immunoprecipitation with the W6/32 antibody and LC-MS/MS. Public immunopeptidomics data were downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at a 1% false discovery rate. Other metrics (features) were either extracted from the source data or generated internally by re-processing samples with the ImmunoID NeXT Platform.
Results: We generated large-scale, high-quality immunopeptidomics data from approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles, and used these data to create our primary models. Briefly, our primary 'binding' model captures MHC-peptide binding using the peptide and binding pockets, while our primary 'presentation' model uses additional features to model antigen processing and presentation. Both primary models show significantly higher precision across all recall values on multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve performance, we expanded the diversity of our training set with high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data were integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model on the expanded training data with a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples.
Conclusions: Improving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) performs significantly better than a state-of-the-art public algorithm and furthers this objective.
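The 1% FDR filtering mentioned in the Methods can be illustrated with the standard target-decoy procedure: walk down the score-sorted PSM list and keep the largest prefix whose decoy/target ratio stays at or below the cutoff. The actual in-house pipelines and score definitions are not described in the abstract, so this is a generic sketch:

```python
def fdr_filter(psms, fdr=0.01):
    """psms: list of (score, is_decoy) pairs. Returns the accepted
    target PSMs after target-decoy FDR filtering."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    decoys = targets = best_cut = 0
    for i, (score, is_decoy) in enumerate(ranked, 1):
        decoys += is_decoy
        targets += not is_decoy
        # estimated FDR at this score threshold = decoys / targets
        if targets and decoys / targets <= fdr:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p[1]]
</n```

Decoy hits (e.g. reversed-sequence matches) estimate how many of the accepted target hits are spurious, which is what makes the 1% cutoff a calibrated error rate rather than an arbitrary score threshold.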


2017 ◽  
Vol 83 (24) ◽  
Author(s):  
Alistair H. Bishop

ABSTRACT Endospores of the genus Bacillus can be triggered to germinate by a limited number of chemicals. Mandelate had powerful additive effects on the levels and rates of germination produced in non-heat-shocked spores of Bacillus anthracis strain Sterne, Bacillus cereus, and Bacillus thuringiensis when combined with l-alanine and inosine. Mandelate had no germinant effect on its own but was active with these germinants in a dose-dependent manner at concentrations higher than 0.5 mM. The maximum rate and extent of germination were produced in B. anthracis by 100 mM l-alanine with 10 mM inosine; this was equaled by just 25% of these germinants when supplemented with 10 mM mandelate. Half the maximal germination rate was produced by 40% of the optimum germinant concentrations or 15% of them when supplemented with 0.8 mM mandelate. Germination rates in B. thuringiensis were highest around neutrality, but the potentiating effect of mandelate was maintained over a wider pH range than was germination with l-alanine and inosine alone. For all species, lactate also promoted germination in the presence of l-alanine and inosine; this was further increased by mandelate. Ammonium ions also enhanced l-alanine- and inosine-induced germination but only when mandelate was present. In spite of the structural similarities, mandelate did not compete with phenylalanine as a germinant. Mandelate appeared to bind to spores while enhancing germination. There was no effect when mandelate was used in conjunction with nonnutrient germinants. No effect was produced with spores of Bacillus subtilis, Clostridium sporogenes, or C. difficile. IMPORTANCE The number of chemicals that can induce germination in the species related to Bacillus cereus has been defined for many years, and they conform to specific chemical types. 
Although not a germinant itself, mandelate has a structure different from these germination-active compounds, and its addition to this list represents a significant discovery in the fundamental biology of spore germination. This novel activity may also have important applied relevance given the impact of spores of B. cereus in foodborne disease and of B. anthracis as a threat agent. The destruction of spores of B. anthracis, for example, particularly over large outdoor areas, poses significant scientific and logistical problems. The addition of mandelate and lactate to the established mixtures of l-alanine and inosine would decrease the amount of the established germinants required and increase the speed and level of germination achieved. The large-scale application of a "germinate to decontaminate" strategy may thus become more practicable.


2013 ◽  
Vol 2013 ◽  
pp. 1-10
Author(s):  
Lei Luo ◽  
Chao Zhang ◽  
Yongrui Qin ◽  
Chunyuan Zhang

With the explosive growth of data volumes in modern applications such as web search and multimedia retrieval, hashing is becoming increasingly important for efficient nearest neighbor (similar item) search. Recently, a number of data-dependent methods have been developed, reflecting the great potential of learning to hash. Inspired by the classic nonlinear dimensionality reduction algorithm maximum variance unfolding, we propose a novel unsupervised hashing method, named maximum variance hashing, in this work. The idea is to maximize the total variance of the hash codes while preserving the local structure of the training data. To solve the derived optimization problem, we propose a column generation algorithm, which directly learns the binary-valued hash functions. We then extend it using anchor graphs to reduce the computational cost. Experiments on large-scale image datasets demonstrate that the proposed method outperforms state-of-the-art hashing methods in many cases.
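For intuition only: PCA hashing is a much simpler baseline that also chooses projection directions by maximizing variance and then thresholds at zero to obtain binary codes. The paper's maximum variance hashing instead maximizes the variance of the hash codes themselves while preserving local structure, learned via column generation; that algorithm is not reproduced here:

```python
import numpy as np

def pca_hash(X, n_bits):
    """Binary codes from the top-variance principal directions:
    center, take the n_bits leading eigenvectors of the covariance,
    and threshold the projections at zero."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    W = vecs[:, -n_bits:]              # directions of maximum variance
    return (Xc @ W > 0).astype(np.uint8)
```

High-variance projections spread points apart before binarization, so each bit splits the data more informatively than a random direction would; the paper's method pushes this idea further by optimizing the variance of the binary codes directly.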

