Machine Learning of Reaction Properties via Learned Representations of the Condensed Graph of Reaction

Mapping Intimacies ◽

10.33774/chemrxiv-2021-frfhz-v2 ◽

2021 ◽

Author(s):

Esther Heid ◽

William H. Green

Keyword(s):

Machine Learning ◽

Chemical Reaction ◽

Expert Knowledge ◽

Activation Energies ◽

Large Set ◽

Learning Approaches ◽

Product Graphs ◽

Benchmark Datasets ◽

Future Work ◽

Condensed Graph Of Reaction

The estimation of chemical reaction properties such as activation energies, rates or yields is a central topic of computational chemistry. In contrast to molecular properties, where machine learning approaches such as graph convolutional neural networks (GCNNs) have excelled for a wide variety of tasks, no general and transferable adaptations of GCNNs for reactions have been developed yet. We therefore combined a popular cheminformatics reaction representation, the so-called condensed graph of reaction (CGR), with a recent GCNN architecture to arrive at a versatile, robust and compact deep learning model. The CGR is a superposition of the reactant and product graphs of a chemical reaction, and thus an ideal input for graph-based machine learning approaches. The model learns to create a data-driven, task-dependent reaction embedding that does not rely on expert knowledge, similar to current molecular GCNNs. Our approach outperforms current state-of-the-art models in accuracy, is applicable even to imbalanced reactions and possesses excellent predictive capabilities for diverse target properties, such as activation energies, reaction enthalpies, rate constants, yields or reaction classes. We furthermore curated a large set of atom-mapped reactions along with their target properties, which can serve as benchmark datasets for future work. All datasets and the developed reaction GCNN model are available online, free of charge and open-source.
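The idea of a CGR as a superposition of reactant and product graphs can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary-based graph encoding and the toy reaction are all assumptions made for illustration, and atoms are identified only by hypothetical atom-map numbers.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# A bond is an unordered pair of atom-map numbers; its value is the bond order.
Bond = FrozenSet[int]
BondTable = Dict[Bond, float]
CGREdges = Dict[Bond, Tuple[Optional[float], Optional[float]]]


def condensed_graph(reactant_bonds: BondTable, product_bonds: BondTable) -> CGREdges:
    """Superimpose two atom-mapped bond tables (hypothetical helper).

    Every edge of the result carries the pair
    (bond order in reactants, bond order in products); None means "no bond".
    """
    edges: CGREdges = {}
    for bond in set(reactant_bonds) | set(product_bonds):
        edges[bond] = (reactant_bonds.get(bond), product_bonds.get(bond))
    return edges


def changed_edges(cgr: CGREdges) -> CGREdges:
    """Keep only the edges whose bond order differs between the two sides."""
    return {bond: orders for bond, orders in cgr.items() if orders[0] != orders[1]}


# Toy reaction (assumed for illustration): the bond 1-2 breaks, the bond 2-3 forms,
# and the bond 3-4 is unchanged.
reactants = {frozenset({1, 2}): 1.0, frozenset({3, 4}): 1.0}
products = {frozenset({2, 3}): 1.0, frozenset({3, 4}): 1.0}

cgr = condensed_graph(reactants, products)
reaction_center = changed_edges(cgr)
```

In this encoding the edges returned by `changed_edges` mark the reaction center, while unchanged edges keep identical reactant and product labels. A graph network consuming such an input would additionally need atom features, which the sketch leaves out.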

To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering ◽

10.1017/s1351324917000298 ◽

2017 ◽

Vol 24 (1) ◽

pp. 3-37 ◽

Cited By ~ 5

Author(s):

SANDRA KÜBLER ◽

CAN LIU ◽

ZEESHAN ALI SAYYED

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Information Gain ◽

Binary Classification ◽

Small Subset ◽

Large Set ◽

Learning Approaches ◽

Selection Methods ◽

Data Set

We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
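As a rough illustration of what an inherently multi-class information gain score looks like, the sketch below computes the textbook information gain of a binary feature against labels with any number of classes. It is a generic formulation under assumed names and toy data, not the exact scoring used in the paper.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a non-empty label sequence."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_present: Sequence[bool], labels: Sequence[Hashable]) -> float:
    """Information gain of a binary feature with respect to multi-class labels."""
    with_feature = [y for x, y in zip(feature_present, labels) if x]
    without_feature = [y for x, y in zip(feature_present, labels) if not x]
    n = len(labels)
    # Weighted entropy of the labels after splitting on the feature;
    # empty partitions contribute nothing.
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_feature, without_feature) if part
    )
    return entropy(labels) - conditional


# Toy data (assumed): six reviews with ratings 1 to 3 and two candidate n-gram features.
ratings = [1, 1, 2, 2, 3, 3]
informative = [True, True, False, False, False, False]   # occurs only in rating-1 reviews
uninformative = [True, False, True, False, True, False]  # occurs equally in every class

ig_informative = information_gain(informative, ratings)
ig_uninformative = information_gain(uninformative, ratings)
```

Ranking all n-gram features by such a score and keeping the top few is one common way to reduce a large feature set to a small subset; the cut-off would have to be tuned on held-out data.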

Multiple Object Tracking in Deep Learning Approaches: A Survey

Electronics ◽

10.3390/electronics10192406 ◽

2021 ◽

Vol 10 (19) ◽

pp. 2406

Author(s):

Yesul Park ◽

L. Minh Dang ◽

Sujin Lee ◽

Dongil Han ◽

Hyeonjoon Moon

Keyword(s):

Deep Learning ◽

Object Tracking ◽

Multiple Object Tracking ◽

Motion Trajectory ◽

Learning Approaches ◽

Multiple Object ◽

Problems And Solutions ◽

Benchmark Datasets ◽

Future Work ◽

Appearance Changes

Object tracking is a fundamental computer vision problem that refers to a set of methods proposed to precisely track the motion trajectory of an object in a video. Multiple Object Tracking (MOT) is a subclass of object tracking that has received growing interest due to its academic and commercial potential. Although numerous methods have been introduced to cope with this problem, many challenges remain to be solved, such as severe object occlusion and abrupt appearance changes. This paper focuses on giving a thorough review of the evolution of MOT in recent decades, investigating the recent advances in MOT, and showing some potential directions for future work. The primary contributions include: (1) a detailed description of the MOT’s main problems and solutions, (2) a categorization of the previous MOT algorithms into 12 approaches and discussion of the main procedures for each category, (3) a review of the benchmark datasets and standard evaluation methods for evaluating the MOT, (4) a discussion of various MOT challenges and solutions by analyzing the related references, and (5) a summary of the latest MOT technologies and recent MOT trends using the mentioned MOT categories.

Harnessing machine learning to boost heuristic strategies for phylogenetic-tree search

10.21203/rs.3.rs-48247/v1 ◽

2020 ◽

Author(s):

Dana Azouri ◽

Shiran Abadi ◽

Yishay Mansour ◽

Itay Mayrose ◽

Tal Pupko

Keyword(s):

Machine Learning ◽

Phylogenetic Tree ◽

Learning Algorithm ◽

Search Space ◽

Large Set ◽

Tree Search ◽

Learning Approaches ◽

Tree Reconstruction ◽

Heuristic Strategies ◽

Tree Inference

Inferring a phylogenetic tree, which describes the evolutionary relationships among a set of organisms, genes, or genomes, is a fundamental step in numerous evolutionary studies. With the aim of making tree inference feasible for problems involving more than a handful of sequences, current algorithms for phylogenetic tree reconstruction utilize various heuristic approaches. Such approaches rely on performing costly likelihood optimizations, and thus evaluate only a subset of all potential trees. Consequently, all existing methods suffer from the known tradeoff between accuracy and running time. Here, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides a means to safely discard a large set of the search space, thus avoiding numerous expensive likelihood computations. Our analyses suggest that machine-learning approaches can make heuristic tree searches substantially faster without losing accuracy and thus could be incorporated for narrowing down the examined neighboring trees of each intermediate tree in any tree search methodology.

Investigating Machine Learning Techniques for User Sentiment Analysis

International Journal of Decision Support System Technology ◽

10.4018/ijdsst.2019070101 ◽

2019 ◽

Vol 11 (3) ◽

pp. 1-12 ◽

Cited By ~ 2

Author(s):

Nimesh V Patel ◽

Hitesh Chhinkaniwala

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Social Networking Sites ◽

Machine Learning Techniques ◽

Support Vector ◽

Product Reviews ◽

Learning Approaches ◽

Current Trends ◽

Learning Techniques ◽

Benchmark Datasets

Sentiment analysis identifies the opinions that users express in textual reviews available on social networking sites, in tweets, blog posts, forums and status updates, where they share their emotions or reviews. Market researchers use these reviews to learn about product reception and current trends in the market. Sentiment analysis is performed by two kinds of methods: machine learning approaches and lexicon methods, which are also known as the knowledge-based approach. In this article, the authors evaluate the performance of some machine learning techniques (Maximum Entropy, Naïve Bayes and Support Vector Machines) on two benchmark datasets, the positive-negative dataset and a Movie Review dataset, by measuring parameters such as accuracy, precision, recall and F-score. The authors present the performance of various sentiment analysis and classification methods by classifying the reviews into binary classes, positive or negative opinion, for reviews from different domains of the dataset. The results indicate that sentiment analysis using the Support Vector Machine outperforms the other machine learning techniques.
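The four evaluation measures named above can be computed from the counts of a binary confusion matrix. The following is a minimal sketch with an assumed function name, assumed label strings and toy predictions; it is not taken from the article.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "pos") -> Dict[str, float]:
    """Accuracy, precision, recall and F-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f_score": f_score}


# Toy example (assumed): five reviews with gold labels and classifier output.
gold = ["pos", "pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg"]
metrics = binary_metrics(gold, predicted)
```

Here the F-score is the balanced F1 variant, the harmonic mean of precision and recall; the abstract does not state which variant the authors report.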

Prioritizing bona fide bacterial small RNAs with machine learning classifiers

PeerJ ◽

10.7717/peerj.6304 ◽

2019 ◽

Vol 7 ◽

pp. e6304 ◽

Cited By ~ 2

Author(s):

Erik J.J. Eppenhof ◽

Lourdes Peña-Castillo

Keyword(s):

Machine Learning ◽

Bacterial Species ◽

Open Reading Frames ◽

Learning Approaches ◽

Genomic Context ◽

Cellular Processes ◽

Benchmark Datasets ◽

Bona Fide ◽

Wet Lab ◽

Independent Model

Bacterial small RNAs (sRNAs) are involved in the control of several cellular processes. Hundreds of putative sRNAs have been identified in many bacterial species through RNA sequencing. The existence of putative sRNAs is usually validated by Northern blot analysis. However, the large number of novel putative sRNAs reported in the literature makes it impractical to validate each of them in the wet lab. In this work, we applied five machine learning approaches to construct twenty models to discriminate bona fide sRNAs from random genomic sequences in five bacterial species. Sequences were represented using seven features including free energy of their predicted secondary structure, their distances to the closest predicted promoter site and Rho-independent terminator, and their distance to the closest open reading frames (ORFs). To automatically calculate these features, we developed an sRNA Characterization Pipeline (sRNACharP). All seven features used in the classification task contributed positively to the performance of the predictive models. The best performing model obtained a median precision of 100% at 10% recall and of 64% at 40% recall across all five bacterial species, and it outperformed previously published approaches on two benchmark datasets in terms of precision and recall. Our results indicate that even though there is limited sRNA sequence conservation across different bacterial species, there are intrinsic features in the genomic context of sRNAs that are conserved across taxa. We show that these features are utilized by machine learning approaches to learn a species-independent model to prioritize bona fide bacterial sRNAs.
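A figure such as "precision at 10% recall" can be read off a ranked list of predictions. The sketch below shows one plausible way to compute it, assuming candidates are ranked by classifier score and ties are broken arbitrarily; the function name and the toy scores are hypothetical, and the paper may define the measure differently.

```python
from typing import Sequence


def precision_at_recall(scores: Sequence[float], labels: Sequence[int], target_recall: float) -> float:
    """Precision at the first rank where recall reaches target_recall.

    labels are 1 for a true sRNA and 0 for a negative sequence.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive example is required")

    # Walk down the list from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= target_recall:
            return true_positives / rank
    raise ValueError("target_recall must not exceed 1.0")


# Toy ranking (assumed): five candidate sequences, three of them real sRNAs.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]

p_at_30 = precision_at_recall(toy_scores, toy_labels, 0.3)
p_at_60 = precision_at_recall(toy_scores, toy_labels, 0.6)
p_at_100 = precision_at_recall(toy_scores, toy_labels, 1.0)
```

A high precision at low recall means that the top of the ranked list is almost free of false positives, which is what matters when only a handful of candidates can be validated in the wet lab.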

Diagnosis of Leaf Disease in Cucurbita Gourd Family using Machine Learning Algorithms

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c5232.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 6800-6804

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Powdery Mildew ◽

Statistical Approach ◽

Analysis Data ◽

Machine Learning Algorithms ◽

Learning Approach ◽

Learning Approaches ◽

Time Frequency ◽

Future Work

The paper presents the implementation of various machine learning approaches for the diagnosis of leaf diseases. For the analysis, data were collected by capturing images of pumpkin leaves affected by different diseases. The samples correspond to blight and fungal problems such as Alternaria, Powdery Mildew, Anthracnose and Yellow Vine Disease. The analysis methods that were implemented and tested are based on time, frequency and statistical approaches. For classification, machine learning approaches such as neural networks, SVM and KNN were analyzed. The implementation issues are presented for future work.

Exploiting transfer learning for the reconstruction of the human gene regulatory network

Bioinformatics ◽

10.1093/bioinformatics/btz781 ◽

2019 ◽

Cited By ~ 5

Author(s):

Paolo Mignone ◽

Gianvito Pio ◽

Domenica D’Elia ◽

Michelangelo Ceci

Keyword(s):

Machine Learning ◽

Transfer Learning ◽

Regulatory Networks ◽

Supplementary Information ◽

Large Set ◽

Learning Approaches ◽

Expression Data ◽

Functional Relationships ◽

Learning Technique ◽

Gene Regulatory

Motivation: The reconstruction of gene regulatory networks (GRNs) from gene expression data has received increasing attention in recent years, due to its usefulness in the understanding of regulatory mechanisms involved in human diseases. Most of the existing methods reconstruct the network through machine learning approaches, by analyzing known examples of interactions. However, (i) they often produce poor results when the amount of labeled examples is limited, or when no negative example is available and (ii) they are not able to exploit information extracted from GRNs of other (better studied) related organisms, when this information is available. Results: In this paper, we propose a novel machine learning method that overcomes these limitations, by exploiting the knowledge about the GRN of a source organism for the reconstruction of the GRN of the target organism, by means of a novel transfer learning technique. Moreover, the proposed method is natively able to work in the positive-unlabeled setting, where no negative example is available, by fruitfully exploiting a (possibly large) set of unlabeled examples. In our experiments, we reconstructed the human GRN, by exploiting the knowledge of the GRN of Mus musculus. Results showed that the proposed method outperforms state-of-the-art approaches and identifies previously unknown functional relationships among the analyzed genes. Availability and implementation: http://www.di.uniba.it/~mignone/systems/biosfer/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

Interval Coded Scoring: a toolbox for interpretable scoring systems

PeerJ Computer Science ◽

10.7717/peerj-cs.150 ◽

2018 ◽

Vol 4 ◽

pp. e150 ◽

Cited By ~ 3

Author(s):

Lieven Billiet ◽

Sabine Van Huffel ◽

Vanya Van Belle

Keyword(s):

Machine Learning ◽

Decision Support ◽

Expert Knowledge ◽

Real Life ◽

Scoring Systems ◽

Training Data ◽

Legal Responsibility ◽

Support Vector ◽

Medical Setting ◽

Learning Approaches

Over the last decades, clinical decision support systems have been gaining importance. They help clinicians to make effective use of the overload of available information to obtain correct diagnoses and appropriate treatments. However, their power often comes at the cost of a black box model which cannot be interpreted easily. This interpretability is of paramount importance in a medical setting with regard to trust and (legal) responsibility. In contrast, existing medical scoring systems are easy to understand and use, but they are often a simplified rule-of-thumb summary of previous medical experience rather than a well-founded system based on available data. Interval Coded Scoring (ICS) connects these two approaches, exploiting the power of sparse optimization to derive scoring systems from training data. The presented toolbox interface makes this theory easily applicable to both small and large datasets. It contains two possible problem formulations based on linear programming or elastic net. Both allow the construction of a model for a binary classification problem and establish risk profiles that can be used for future diagnosis. All of this requires only a few lines of code. ICS differs from standard machine learning through its model consisting of interpretable main effects and interactions. Furthermore, insertion of expert knowledge is possible because the training can be semi-automatic. This allows end users to make a trade-off between complexity and performance based on cross-validation results and expert knowledge. Additionally, the toolbox offers an accessible way to assess classification performance via accuracy and the ROC curve, whereas the calibration of the risk profile can be evaluated via a calibration curve. Finally, the colour-coded model visualization has particular appeal if one wants to apply ICS manually on new observations, as well as for validation by experts in the specific application domains.
The validity and applicability of the toolbox are demonstrated by comparing it to standard machine learning approaches such as Naive Bayes and Support Vector Machines for several real-life datasets. These case studies on medical problems show its applicability as a decision support system. ICS performs similarly in terms of classification and calibration. Its slightly lower performance is countered by its model simplicity, which makes it the method of choice if interpretability is a key issue.

Ensemble learning from ensemble docking: revisiting the optimum ensemble size problem

Scientific Reports ◽

10.1038/s41598-021-04448-5 ◽

2022 ◽

Vol 12 (1) ◽

Author(s):

Sara Mohammadi ◽

Zahra Narimani ◽

Mitra Ashouri ◽

Rohoullah Firouzi ◽

Mohammad Hossein Karimi‐Jafari

Keyword(s):

Machine Learning ◽

Learning Strategies ◽

Ensemble Learning ◽

Computational Cost ◽

Large Set ◽

Learning Approaches ◽

Cyclin Dependent Kinase ◽

Ensemble Docking ◽

Learning Procedure ◽

Open Question

Despite considerable advances obtained by applying machine learning approaches in protein–ligand affinity predictions, the incorporation of receptor flexibility has remained an important bottleneck. While ensemble docking has been used widely as a solution to this problem, the optimum choice of receptor conformations is still an open question considering the issues related to the computational cost and false positive pose predictions. Here, a combination of ensemble learning and ensemble docking is suggested to rank different conformations of the target protein in light of their importance for the final accuracy of the model. Available X-ray structures of cyclin-dependent kinase 2 (CDK2) in complex with different ligands are used as an initial receptor ensemble, and its redundancy is removed through a graph-based redundancy removal, which is shown to be more efficient and less subjective than clustering-based representative selection methods. A set of ligands with available experimental affinity are docked to this nonredundant receptor ensemble, and the energetic features of the best scored poses are used in an ensemble learning procedure based on the random forest method. The importance of receptors is obtained through feature selection measures, and it is shown that a few of the most important conformations are sufficient to reach 1 kcal/mol accuracy in affinity prediction with considerable improvement of the early enrichment power of the models compared to the different ensemble docking without learning strategies. A clear strategy has been provided in which machine learning selects the most important experimental conformers of the receptor among a large set of protein–ligand complexes while simultaneously maintaining the final accuracy of affinity predictions at the highest level possible for available data. Our results could be informative for future attempts to design receptor-specific docking-rescoring strategies.

Machine learning approaches for parsing comorbidity/heterogeneity in antisociality and substance use disorders: A primer

Personality Neuroscience ◽

10.1017/pen.2021.2 ◽

2021 ◽

Vol 4 ◽

Author(s):

Matthew S. Shane ◽

William J. Denomme

Keyword(s):

Machine Learning ◽

Substance Use ◽

Data Driven ◽

Learning Approaches ◽

Learning Methods ◽

Machine Learning Methods ◽

High Level ◽

Future Work ◽

Interacting Features ◽

Research Domain

By some accounts, as many as 93% of individuals diagnosed with antisocial personality disorder (ASPD) or psychopathy also meet criteria for some form of substance use disorder (SUD). This high level of comorbidity, combined with an overlapping biopsychosocial profile, and potentially interacting features, has made it difficult to delineate the shared/unique characteristics of each disorder. Moreover, while rarely acknowledged, both SUD and antisociality exist as highly heterogeneous disorders in need of more targeted parcellation. While emerging data-driven nosology for psychiatric disorders (e.g., Research Domain Criteria (RDoC), Hierarchical Taxonomy of Psychopathology (HiTOP)) offers the opportunity for a more systematic delineation of the externalizing spectrum, the interrogation of large, complex neuroimaging-based datasets may require data-driven approaches that are not yet widely employed in psychiatric neuroscience. With this in mind, the proposed article sets out to provide an introduction to machine learning methods for neuroimaging that can help parse comorbid, heterogeneous externalizing samples. The modest machine learning work conducted to date within the externalizing domain demonstrates the potential utility of the approach but remains highly nascent. Within the paper, we make suggestions for how future work can make use of machine learning methods, in combination with emerging psychiatric nosology systems, to further diagnostic and etiological understandings of the externalizing spectrum. Finally, we briefly consider some challenges that will need to be overcome to encourage further progress in the field.
