Comparative analysis of machine learning methods by the example of the problem of determining muon decay

Author(s):  
Migran N. Gevorkyan ◽  
Anastasia V. Demidova ◽  
Dmitry S. Kulyabov

The history of using machine learning algorithms to analyze statistical models is quite long. The development of computer technology has given these algorithms new life. Nowadays, deep learning is the mainstream and most popular area of machine learning. However, the authors believe that many researchers try to apply deep learning methods beyond their domain of applicability, encouraged by the widespread availability of software systems that implement deep learning algorithms and by the apparent simplicity of such research. All this motivated the authors to compare deep learning algorithms with classical machine learning algorithms. A Large Hadron Collider experiment was chosen for this task, both because the authors are familiar with this scientific field and because the experimental data are openly available. The article compares various machine learning algorithms on the problem of recognizing the decay reaction τ⁺ → μ⁺μ⁺μ⁻ at the Large Hadron Collider. The authors use open-source implementations of the machine learning algorithms and compare them with one another on computed metrics. The research leads to the conclusion that all the considered machine learning methods are quite comparable with one another (with respect to the selected metrics), while different methods have different areas of applicability.
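As a rough illustration of this kind of comparison, the sketch below trains several open-source scikit-learn classifiers on a synthetic binary signal/background task and scores them on common metrics. The features, labels, and model settings are invented placeholders, not the authors' actual pipeline.

```python
# Hypothetical sketch: comparing classical ML classifiers and a small neural
# network on a binary signal/background task, then scoring each on shared
# metrics. Data and settings are placeholders, not the paper's code.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                 # stand-in kinematic features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in signal label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "grad_boost": GradientBoostingClassifier(random_state=0),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    pred = model.predict(X_te)
    print(f"{name}: AUC={roc_auc_score(y_te, proba):.3f} "
          f"F1={f1_score(y_te, pred):.3f}")
```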

2020 ◽  
Author(s):  
Thomas R. Lane ◽  
Daniel H. Foil ◽  
Eni Minerali ◽  
Fabio Urbina ◽  
Kimberley M. Zorn ◽  
...  

Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and applications beyond. In recent studies we have applied multiple machine learning algorithms and modeling metrics, and in some cases compared molecular descriptors, to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL in order to evaluate machine learning methods of interest to them. The largest of these types of studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from ChEMBL for use with the ECFP6 fingerprint and comparison of our proprietary software Assay Central™ with random forest, k-nearest neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (3 levels). Model performance was assessed using an array of five-fold cross-validation metrics including area under the curve, F1 score, Cohen's kappa and Matthews correlation coefficient. Based on ranked normalized scores for the metrics and datasets, all methods appeared comparable, while the distance from the top score indicated that Assay Central™ and support vector classification were closely matched. Unlike prior studies, which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case, where minimal tuning was performed for any of the methods. If anything, Assay Central™ may have been at a slight advantage, as the activity cutoff for each of the over 5000 datasets (representing over 570,000 unique compounds) was based on Assay Central™ performance, but support vector classification seems to be a strong competitor. We also apply Assay Central™ to prospective predictions for PXR and hERG to further validate these models. This work currently appears to be the largest comparison of machine learning algorithms to date. Future studies will likely evaluate additional databases, descriptors and algorithms, as well as further refine methods for evaluating and comparing models.
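A minimal sketch of this kind of benchmark is given below, assuming precomputed ECFP6-style fingerprint bits as features. The data is synthetic and the models are untuned scikit-learn defaults, so this only illustrates the evaluation scaffolding, not Assay Central™ itself.

```python
# Illustrative sketch (not the Assay Central pipeline): five-fold
# cross-validation of several scikit-learn classifiers on binary
# fingerprint features, scored with the metrics named in the abstract.
# X stands in for a precomputed ECFP6-style bit matrix.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import make_scorer, cohen_kappa_score, matthews_corrcoef

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 1024))   # stand-in fingerprint bits
y = rng.integers(0, 2, size=500)           # stand-in activity labels

scoring = {
    "auc": "roc_auc",
    "f1": "f1",
    "kappa": make_scorer(cohen_kappa_score),
    "mcc": make_scorer(matthews_corrcoef),
}
models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(),
    "svc": SVC(random_state=0),
    "nb": BernoulliNB(),
    "ada": AdaBoostClassifier(random_state=0),  # AdaBoosted decision trees
}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    means = {k[len("test_"):]: round(v.mean(), 3)
             for k, v in cv.items() if k.startswith("test_")}
    print(name, means)
```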


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Alan Brnabic ◽  
Lisa M. Hess

Background: Machine learning is a broad term encompassing a number of methods that allow the investigator to learn from the data. These methods may permit large real-world databases to be more rapidly translated into applications that inform patient-provider decision making. Methods: This systematic literature review was conducted to identify published observational research that employed machine learning to inform decision making at the patient-provider level. The search strategy was implemented and studies meeting eligibility criteria were evaluated by two independent reviewers. Relevant data related to study design, statistical methods, and strengths and limitations were identified; study quality was assessed using a modified version of the Luo checklist. Results: A total of 34 publications from January 2014 to September 2020 were identified and evaluated for this review. Diverse methods, statistical packages and approaches were used across the identified studies. The most common methods included decision tree and random forest approaches. Most studies applied internal validation, but only two conducted external validation. Most studies utilized one algorithm, and only eight studies applied multiple machine learning algorithms to the data. Seven items on the Luo checklist were not met by more than 50% of the published studies. Conclusions: A wide variety of approaches, algorithms, statistical software, and validation strategies were employed in the application of machine learning methods to inform patient-provider decision making. There is a need to ensure that multiple machine learning approaches are used, that the model selection strategy is clearly defined, and that both internal and external validation are performed, so that decisions for patient care are made with the highest quality evidence. Future work should routinely employ ensemble methods incorporating multiple machine learning algorithms.
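As a hedged illustration of the ensemble approach recommended in the conclusions, the sketch below combines several scikit-learn classifiers by soft voting and evaluates them with internal cross-validation; the data is synthetic and the estimator choices are illustrative, not drawn from any reviewed study.

```python
# Minimal sketch of a multi-algorithm ensemble of the kind the review
# recommends: several learners combined by soft voting, with internal
# validation via cross-validation. Data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("logit", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities across learners
)
# Internal validation; external validation would use an independent cohort.
print(cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc").mean())
```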


2019 ◽  
Vol 24 (34) ◽  
pp. 3998-4006
Author(s):  
Shijie Fan ◽  
Yu Chen ◽  
Cheng Luo ◽  
Fanwang Meng

Background: On a rising tide of big data, machine learning is coming into its own. Given the huge amounts of epigenetic data produced by biological experiments and the clinic, machine learning can help detect epigenetic features in the genome, find correlations between phenotypes and modifications in histones or genes, accelerate the screening of lead compounds targeting epigenetic diseases, and support many other aspects of epigenetics research, thereby advancing the promise of precision medicine. Methods: In this minireview, we focus on the fundamentals and applications of the machine learning methods regularly used in the epigenetics field and explain their features. Their advantages and disadvantages are also discussed. Results: Machine learning algorithms have accelerated precision medicine studies targeting epigenetic diseases. Conclusion: To make full use of machine learning algorithms, one should become familiar with their pros and cons, so as to benefit from big data by choosing the most suitable method(s).


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Duy Ngoc Nguyen ◽  
Tuoi Thi Phan ◽  
Phuc Do

Sentiment classification using deep learning algorithms has achieved good results when tested on popular datasets. However, building a corpus on a new topic with which to train machine learning algorithms for sentiment classification with high confidence remains challenging. This study proposes a method, called knowledge processing and representation based on ontology (KPRO), that embeds knowledge from the ontology of opinion datasets into the word embedding layer of deep learning algorithms for sentiment classification. Unlike methods that lexically encode the corpus or add information to it, this method adds a representation of the raw data based on expert knowledge captured in the ontology. Once the data carries rich knowledge of the topic, the efficiency of the machine learning algorithms is significantly enhanced. The method is therefore applicable to embedding knowledge in datasets in other languages. Test results show that deep learning methods achieved considerably higher accuracy when trained on datasets processed with the KPRO method than on datasets not processed by it. This method is thus a novel approach to improving the accuracy of deep learning algorithms and increasing the reliability of new datasets, making them ready for mining.
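The abstract does not publish KPRO's implementation details, so the sketch below only illustrates the general idea as described: enriching each token's embedding with expert-assigned ontology features before it enters the embedding layer. The vocabulary, ontology entries, and feature scheme are invented for illustration.

```python
# Hedged sketch of the general idea behind KPRO as described in the
# abstract: enrich each token's embedding with features derived from a
# domain ontology before it enters the network's embedding layer.
# The ontology lookup and feature scheme are invented placeholders.
import numpy as np

vocab = ["battery", "screen", "slow", "excellent"]
base_dim = 8
base_vectors = {w: np.random.default_rng(i).normal(size=base_dim)
                for i, w in enumerate(vocab)}

# Placeholder ontology: maps a term to expert-assigned aspect/polarity features.
ontology = {
    "battery": {"aspect": 1.0, "polarity": 0.0},
    "slow": {"aspect": 0.0, "polarity": -1.0},
    "excellent": {"aspect": 0.0, "polarity": 1.0},
}

def enriched_embedding(word):
    onto = ontology.get(word, {"aspect": 0.0, "polarity": 0.0})
    return np.concatenate([base_vectors[word],
                           [onto["aspect"], onto["polarity"]]])

# The resulting matrix would initialize the embedding layer of the
# sentiment classifier, making ontology knowledge available to the model.
embedding_matrix = np.stack([enriched_embedding(w) for w in vocab])
print(embedding_matrix.shape)  # (4, 10)
```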


2021 ◽  
Author(s):  
Dhairya Vyas

In terms of machine learning, the majority of data can be grouped into four categories: numerical data, categorical data, time-series data, and text. Different classifiers suit different data properties, and learning itself falls into paradigms such as supervised, unsupervised, and reinforcement learning. Each category has its own applicable classifiers; we have tested almost all machine learning methods and analyzed them comparatively.


2019 ◽  
Author(s):  
Levi John Wolf ◽  
Elijah Knaap

Dimension reduction is one of the oldest concerns in geographical analysis. Despite significant, longstanding attention in geographical problems, recent advances in non-linear techniques for dimension reduction, called manifold learning, have not been adopted in classic data-intensive geographical problems. More generally, machine learning methods for geographical problems often focus on applying standard machine learning algorithms to geographic data rather than on true "spatially-correlated learning," in the words of Kohonen. As such, we suggest a general way to incentivize geographical learning in machine learning algorithms and link it to many past methods that introduced geography into statistical techniques. We develop a specific instance of this by specifying two geographical variants of Isomap, a non-linear dimension reduction, or "manifold learning," technique. We also provide a method for assessing what is added by incorporating geography and for estimating the manifold's intrinsic geographic scale. To illustrate the concepts and provide interpretable results, we conduct a dimension reduction on the geographical and high-dimensional structure of social and economic data on Brooklyn, New York. Overall, this paper's main endeavor, defining and explaining a way to "geographize" many machine learning methods, yields interesting and novel results for manifold learning and for the estimation of intrinsic geographic scale in unsupervised learning.
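One plausible reading of a "geographical variant of Isomap" is sketched below: attribute-space and geographic distances are blended into a single precomputed dissimilarity matrix before the neighbor graph is built. The mixing weight alpha and the synthetic data are assumptions, not the paper's specification.

```python
# Hedged sketch of one way to "geographize" Isomap: blend attribute-space
# distances with geographic distances, then embed with a precomputed metric.
# The mixing parameter alpha and all data are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))    # stand-in lon/lat of tracts
attrs = rng.normal(size=(200, 15))     # stand-in socioeconomic variables

d_attr = cdist(attrs, attrs)           # attribute dissimilarity
d_geo = cdist(coords, coords)          # geographic distance
alpha = 0.5                            # weight on geography (assumed)
d_mixed = (1 - alpha) * d_attr / d_attr.max() + alpha * d_geo / d_geo.max()

embedding = Isomap(n_neighbors=10, n_components=2, metric="precomputed")
low_dim = embedding.fit_transform(d_mixed)
print(low_dim.shape)  # (200, 2)
```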


Sensors ◽  
2019 ◽  
Vol 19 (7) ◽  
pp. 1521 ◽  
Author(s):  
Tomasz Rymarczyk ◽  
Grzegorz Kłosowski ◽  
Edward Kozłowski ◽  
Paweł Tchórzewski

The main goal of this work was to compare selected machine learning methods with a classic deterministic method in the industrial application of electrical impedance tomography. The research focused on the development and comparison of algorithms and models for the analysis and reconstruction of data in electrical tomography. The novelty lies in the use of original machine learning algorithms whose characteristic feature is the use of many separately trained subsystems, each of which generates a single pixel of the output image. Artificial neural network (ANN), LARS and elastic net methods were used to solve the inverse problem. These algorithms were modified by correspondingly increasing the number of equations to match the finite element method grid used for electrical impedance tomography. The Gauss-Newton method was used as a deterministic reference for the machine learning methods. The algorithms were trained using learning data obtained through computer simulation based on real models. The results of the experiments showed that, in the considered cases, the best reconstruction quality was achieved by the ANN. At the same time, the ANN was the slowest in terms of both the training process and the speed of image generation. The other machine learning methods were comparable with the deterministic Gauss-Newton method and with each other.
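The "one separately trained subsystem per pixel" idea can be sketched as follows, with scikit-learn's ElasticNet standing in for the ANN/LARS/elastic net variants; the measurement and image data here are synthetic placeholders rather than simulated tomography data.

```python
# Illustrative sketch of the "one subsystem per pixel" idea: each output
# pixel gets its own separately trained regressor mapping boundary
# measurements to that pixel's value. ElasticNet stands in for the
# ANN/LARS/elastic net variants; all data is synthetic.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n_samples, n_measurements, n_pixels = 300, 96, 64
X = rng.normal(size=(n_samples, n_measurements))   # stand-in boundary voltages
Y = rng.normal(size=(n_samples, n_pixels))         # stand-in pixel images

# Train one model per pixel, as described in the abstract.
pixel_models = []
for j in range(n_pixels):
    m = ElasticNet(alpha=0.1)
    m.fit(X, Y[:, j])
    pixel_models.append(m)

def reconstruct(measurement):
    """Assemble the image pixel by pixel from the per-pixel models."""
    return np.array([m.predict(measurement[None, :])[0] for m in pixel_models])

image = reconstruct(X[0])
print(image.shape)  # (64,)
```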


Author(s):  
Hong Cui

Despite the sub-language nature of taxonomic descriptions of animals and plants, researchers have warned about the existence of large variations among different description collections in terms of information content and its representation. These variations pose a serious threat to the development of automatic tools for structuring large volumes of text-based descriptions. This paper presents a general approach to marking up different collections of taxonomic descriptions with XML, using two large-scale floras as examples. The markup system, MARTT, is based on machine learning methods and enhanced by machine-learned domain rules and conventions. Experiments show that our simple and efficient machine learning algorithms significantly outperform general-purpose algorithms, and that rules learned from one flora can be used when marking up a second flora, helping to improve markup performance, especially for elements that have sparse training examples.
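A hedged sketch of the kind of sentence-level classification that underlies markup systems like MARTT is given below; the element labels, training sentences, and model choice are invented for illustration and are not the MARTT implementation.

```python
# Hedged sketch of sentence-level markup classification: each clause of a
# taxonomic description is assigned an XML element label by a learned
# classifier. Labels and examples are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = [
    "Leaves opposite, ovate, 3-7 cm long.",
    "Flowers white, five-petaled.",
    "Stems erect, pubescent.",
    "Fruits red berries, 5 mm in diameter.",
]
train_labels = ["leaf", "flower", "stem", "fruit"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_sentences, train_labels)

# A predicted label would become the XML element wrapping the sentence,
# e.g. <leaf>Leaves alternate, lanceolate.</leaf>
print(model.predict(["Leaves alternate, lanceolate."]))
```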

