Machine learning for the extragalactic astronomy educational manual

2019 ◽  
Vol 15 (S367) ◽  
pp. 461-463
Author(s):  
Maksym Vasylenko ◽  
Daria Dobrycheva

Abstract We evaluated a new approach to the automated morphological classification of large galaxy samples based on supervised machine learning techniques (Naive Bayes, Random Forest, Support Vector Machine, Logistic Regression, and k-Nearest Neighbours) and deep learning, using the Python programming language. A representative sample of ∼315000 SDSS DR9 galaxies at z < 0.1 with stellar magnitudes r < 17.7m was used as the target sample of galaxies with indeterminate morphological types. Classical machine learning methods were used for the binary morphological classification of galaxies into early and late types (96.4% with Support Vector Machine). Deep learning methods were used to classify galaxy images into five visual types (completely rounded, rounded in-between, smooth cigar-shaped, edge-on, and spiral) with the Xception architecture (94% accuracy for four classes and 88% for cigar-like galaxies). These results formed the basis for an educational manual on processing large data sets in the Python programming language, intended for students of Ukrainian universities.

2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Hui Yu ◽  
Jian Deng ◽  
Ran Nathan ◽  
Max Kröschel ◽  
Sasha Pekarsky ◽  
...  

Abstract Background Our understanding of movement patterns and behaviours of wildlife has advanced greatly through the use of improved tracking technologies, including application of accelerometry (ACC) across a wide range of taxa. However, most ACC studies either use intermittent sampling, which hinders continuity, or continuous data logging that relies on tracker retrieval for data download, which is not feasible for long-term studies. To allow long-term, fine-scale behavioural research, we evaluated a range of machine learning methods for their suitability for continuous on-board classification of ACC data into behaviour categories prior to data transmission. Methods We tested six supervised machine learning methods, including linear discriminant analysis (LDA), decision tree (DT), support vector machine (SVM), artificial neural network (ANN), random forest (RF) and extreme gradient boosting (XGBoost) to classify behaviour using ACC data from three bird species (white stork Ciconia ciconia, griffon vulture Gyps fulvus and common crane Grus grus) and two mammals (dairy cow Bos taurus and roe deer Capreolus capreolus). Results Using a range of quality criteria, SVM, ANN, RF and XGBoost performed well in determining behaviour from ACC data, and their good performance appeared little affected when the number of input features for model training was greatly reduced. On-board runtime and storage-requirement tests showed that ANN, RF and XGBoost in particular would make suitable on-board classifiers. Conclusions Our identification of feature reduction in combination with ANN, RF and XGBoost as a suitable approach to on-board behavioural classification of continuous ACC data has considerable potential to benefit movement ecology and behavioural research, wildlife conservation and livestock husbandry.


PLoS ONE ◽  
2016 ◽  
Vol 11 (12) ◽  
pp. e0166898 ◽  
Author(s):  
Monique A. Ladds ◽  
Adam P. Thompson ◽  
David J. Slip ◽  
David P. Hocking ◽  
Robert G. Harcourt

2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Kerry E. Poppenberg ◽  
Vincent M. Tutino ◽  
Lu Li ◽  
Muhammad Waqas ◽  
Armond June ◽  
...  

Abstract Background Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate the effect of covariates.


2021 ◽  
Author(s):  
Qifei Zhao ◽  
Xiaojun Li ◽  
Yunning Cao ◽  
Zhikun Li ◽  
Jixin Fan

Abstract Collapsibility of loess is a significant factor affecting engineering construction in loess areas, and testing the collapsibility of loess is costly. In this study, a total of 4,256 loess samples were collected from the north, east, west and middle regions of Xining. 70% of the samples were used to generate the training data set, and the rest were used to generate the verification data set, so as to construct and validate the machine learning models. The six most important factors were selected from thirteen factors by using Grey Relational analysis and multicollinearity analysis: burial depth, water content, specific gravity of soil particles, void rate, geostatic stress and plasticity limit. In order to predict the collapsibility of loess, four machine learning methods were studied and compared: Support Vector Machine (SVM), Random Subspace Based Support Vector Machine (RSSVM), Random Forest (RF) and Naïve Bayes Tree (NBTree). The receiver operating characteristic (ROC) curve indicators, standard error (SD) and 95% confidence interval (CI) were used to verify and compare the models in different research areas. The results show that the RF model is the most efficient in predicting the collapsibility of loess in Xining; its average AUC is above 80%, so it can be used in engineering practice.
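The 70/30 split and the AUC comparison described in this abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the helper names `split_70_30` and `roc_auc`, the random seed, and the toy labels and scores are all assumptions made for illustration. The AUC is computed from its rank interpretation, the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle a copy of the samples and split it 70% / 30% (hypothetical helper)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# 4,256 samples as in the abstract; the integer IDs stand in for real records.
train, test = split_70_30(range(4256))
print(len(train), len(test))

# Toy labels (1 = collapsible) and model scores, invented for illustration.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; a sort-based rank sum would be the usual choice for a verification set of this size.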


Energies ◽  
2021 ◽  
Vol 14 (22) ◽  
pp. 7714
Author(s):  
Ha Quang Man ◽  
Doan Huy Hien ◽  
Kieu Duy Thong ◽  
Bui Viet Dung ◽  
Nguyen Minh Hoa ◽  
...  

The test study area is the Miocene reservoir of Nam Con Son Basin, offshore Vietnam. In the study we used unsupervised learning to automatically cluster hydraulic flow units (HU) based on flow zone indicators (FZI) in a core plug dataset. Then we applied supervised learning to predict HU by combining core and well log data. We tested several machine learning algorithms. In the first phase, we derived hydraulic flow unit clustering of porosity and permeability of core data using unsupervised machine learning methods such as Ward’s method, K-means, Self-Organizing Map (SOM) and Fuzzy C-means (FCM). Then we applied supervised machine learning methods including Artificial Neural Networks (ANN), Support Vector Machines (SVM), Boosted Tree (BT) and Random Forest (RF). We combined both core and log data to predict HU logs for the full well section of the wells without core data. We used four wells with six logs (GR, DT, NPHI, LLD, LSS and RHOB) and 578 cores from the Miocene reservoir to train, validate and test the models. Our goal was to show that the correct combination of core and well log data would provide reservoir engineers with a tool for HU classification and estimation of permeability in a continuous geological profile. Our research showed that machine learning effectively boosts the prediction of permeability, reduces uncertainty in reservoir modeling, and improves project economics.
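As a rough illustration of the first phase, the sketch below computes the flow zone indicator from core porosity and permeability using the standard relation FZI = RQI / φz, with RQI = 0.0314·sqrt(k/φ) (k in millidarcies, φ as a fraction) and φz = φ/(1 − φ), and then groups log10(FZI) with a small one-dimensional K-means. The abstract does not give the authors' exact formulation or settings, so the function names, the deterministic initialisation, the number of clusters and the four core plugs are assumptions, not values from the study.

```python
import math


def flow_zone_indicator(k_md, phi):
    """FZI in micrometres from permeability (mD) and fractional porosity."""
    rqi = 0.0314 * math.sqrt(k_md / phi)   # reservoir quality index
    phi_z = phi / (1.0 - phi)              # normalised porosity (pore/grain volume ratio)
    return rqi / phi_z


def kmeans_1d(values, k, iters=50):
    """Plain 1-D K-means with a deterministic start (hypothetical helper)."""
    ordered = sorted(values)
    # Initial centres: evenly spaced order statistics of the data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def assign(cs):
        return [min(range(k), key=lambda j: abs(v - cs[j])) for v in values]

    for _ in range(iters):
        labels = assign(centers)
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old centre if a cluster happens to be empty.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return assign(centers), centers


# Invented core plugs: (permeability in mD, porosity fraction).
plugs = [(1.0, 0.10), (2.0, 0.12), (150.0, 0.20), (300.0, 0.22)]
log_fzi = [math.log10(flow_zone_indicator(k, phi)) for k, phi in plugs]
hu_labels, hu_centers = kmeans_1d(log_fzi, k=2)
print(hu_labels)
```

Clustering on log10(FZI) rather than FZI itself is a common convention because FZI values tend to span orders of magnitude; whether the study did the same is not stated in the abstract.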


2021 ◽  
Vol 5 (3) ◽  
pp. 905
Author(s):  
Muhammad Afrizal Amrustian ◽  
Vika Febri Muliati ◽  
Elsa Elvira Awal

Japanese is one of the most difficult languages to understand and read, largely because its writing does not use the alphabet. There are three types of Japanese script: kanji, katakana, and hiragana. Hiragana is the most commonly used of the three. In addition, hiragana is cursive in nature, so each person's handwriting differs. Machine learning methods can be used to read Japanese letters by recognizing images of the letters. The Japanese letters used in this study are the hiragana vowels. This study presents a comparative study of machine learning methods for the image classification of Japanese letters. The machine learning methods compared are Naïve Bayes, Support Vector Machine, Decision Tree, Random Forest, and K-Nearest Neighbor. The results show that K-Nearest Neighbor is the best method for image classification of hiragana vowels, achieving an accuracy of 89.4% with a low error rate.
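To make the best-performing method concrete, here is a minimal K-Nearest Neighbor classifier written with the Python standard library only. It is a sketch under stated assumptions rather than the study's pipeline: the abstract does not say how images were turned into features, which distance was used, or which value of k was chosen, so the flattened four-pixel "images", the Euclidean distance, `k=3` and the labels `"a"` and `"i"` are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Pair each training label with its Euclidean distance to the query.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Invented, flattened 2x2 grey-scale "images" for two vowel classes.
train_x = [
    (0.9, 0.1, 0.8, 0.2),
    (0.8, 0.2, 0.9, 0.1),
    (1.0, 0.0, 0.7, 0.3),
    (0.1, 0.9, 0.2, 0.8),
    (0.2, 0.8, 0.1, 0.9),
    (0.0, 1.0, 0.3, 0.7),
]
train_y = ["a", "a", "a", "i", "i", "i"]

print(knn_predict(train_x, train_y, (0.85, 0.15, 0.75, 0.25)))
print(knn_predict(train_x, train_y, (0.1, 0.9, 0.1, 0.9)))
```

K-Nearest Neighbor has no training step beyond storing the examples, so every prediction scans the whole training set; that cost grows with the number of stored images.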


2021 ◽  
Author(s):  
Leonie Lampe ◽  
Sebastian Niehaus ◽  
Hans-Jürgen Huppertz ◽  
Alberto Merola ◽  
Janis Reinelt ◽  
...  

Abstract Importance The entry of artificial intelligence into medicine is pending. Several methods have been used for predictions of structured neuroimaging data, yet nobody has compared them in this context. Objective Multi-class prediction is key for building computational aid systems for differential diagnosis. We compared support vector machine, random forest, gradient boosting, and deep feed-forward neural networks for the classification of different neurodegenerative syndromes based on structural magnetic resonance imaging. Design, Setting, and Participants Atlas-based volumetry was performed on multi-centric T1-weighted MRI data from 940 subjects, i.e. 124 healthy controls and 816 patients with ten different neurodegenerative diseases, leading to a multi-diagnostic multi-class classification task with eleven different classes. Interventions n.a. Main Outcomes and Measures Cohen’s Kappa, Accuracy, and F1-score to assess model performance. Results Overall, the neural network produced both the best performance measures and the most robust results. The smaller classes, however, were better classified by either the ensemble learning methods or the support vector machine, while performance measures for small classes were comparatively low, as expected. Diseases with regionally specific and pronounced atrophy patterns were generally better classified than diseases with wide-spread and rather weak atrophy. Conclusions and Relevance Our study furthermore underlines the necessity of larger data sets but also calls for a careful consideration of different machine learning methods that can handle the type of data and the classification task best. Trial Registration n.a.
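The three outcome measures named in this abstract can be written out in a few lines of standard-library Python. This is an illustrative sketch, not the study's evaluation code: the abstract does not state how the F1-score was averaged across the eleven classes, so macro averaging is an assumption here, and the class labels `"A"`, `"B"`, `"C"` and the toy predictions are invented.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohens_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1,
    i.e. when both label lists contain a single identical class.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented three-class example; one "A" sample is misclassified as "B".
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
print(accuracy(y_true, y_pred), cohens_kappa(y_true, y_pred), macro_f1(y_true, y_pred))
```

With imbalanced classes such as 124 controls against 816 patients spread over ten diseases, accuracy alone can look good while small classes are missed, which is one reason to report kappa and a per-class measure alongside it.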


2016 ◽  
Vol 16 (13) ◽  
pp. 8181-8191 ◽  
Author(s):  
Jani Huttunen ◽  
Harri Kokkola ◽  
Tero Mielonen ◽  
Mika Esa Juhani Mononen ◽  
Antti Lipponen ◽  
...  

Abstract. In order to have a good estimate of the current forcing by anthropogenic aerosols, knowledge on past aerosol levels is needed. Aerosol optical depth (AOD) is a good measure for aerosol loading. However, dedicated measurements of AOD are only available from the 1990s onward. One option to lengthen the AOD time series beyond the 1990s is to retrieve AOD from surface solar radiation (SSR) measurements taken with pyranometers. In this work, we have evaluated several inversion methods designed for this task. We compared a look-up table method based on radiative transfer modelling, a non-linear regression method and four machine learning methods (Gaussian process, neural network, random forest and support vector machine) with AOD observations carried out with a sun photometer at an Aerosol Robotic Network (AERONET) site in Thessaloniki, Greece. Our results show that most of the machine learning methods produce AOD estimates comparable to the look-up table and non-linear regression methods. All of the applied methods produced AOD values that corresponded well to the AERONET observations with the lowest correlation coefficient value being 0.87 for the random forest method. While many of the methods tended to slightly overestimate low AODs and underestimate high AODs, neural network and support vector machine showed overall better correspondence for the whole AOD range. The differences in producing both ends of the AOD range seem to be caused by differences in the aerosol composition. High AODs were in most cases those with high water vapour content which might affect the aerosol single scattering albedo (SSA) through uptake of water into aerosols. Our study indicates that machine learning methods benefit from the fact that they do not constrain the aerosol SSA in the retrieval, whereas the LUT method assumes a constant value for it. This would also mean that machine learning methods could have potential in reproducing AOD from SSR even though SSA would have changed during the observation period.


Metabolites ◽  
2020 ◽  
Vol 10 (6) ◽  
pp. 243 ◽  
Author(s):  
Ulf W. Liebal ◽  
An N. T. Phan ◽  
Malvika Sudhakar ◽  
Karthik Raman ◽  
Lars M. Blank

The metabolome of an organism depends on environmental factors and intracellular regulation and provides information about the physiological conditions. Metabolomics helps to understand disease progression in clinical settings or estimate metabolite overproduction for metabolic engineering. The most popular analytical metabolomics platform is mass spectrometry (MS). However, MS metabolome data analysis is complicated, since metabolites interact nonlinearly, and the data structures themselves are complex. Machine learning methods have become immensely popular for statistical analysis due to their inherently nonlinear data representation and their ability to process large and heterogeneous data rapidly. In this review, we address recent developments in using machine learning for processing MS spectra and show how machine learning generates new biological insights. In particular, supervised machine learning has great potential in metabolomics research because of its ability to supply quantitative predictions. We review here commonly used tools, such as random forest, support vector machines, artificial neural networks, and genetic algorithms. During processing steps, the supervised machine learning methods help with peak picking, normalization, and missing data imputation. For knowledge-driven analysis, machine learning contributes to biomarker detection, classification and regression, biochemical pathway identification, and carbon flux determination. Of particular relevance is the combination of different omics data to identify the contributions of the various regulatory levels. Our overview of the recent publications also highlights that data quality determines analysis quality, but also adds to the challenge of choosing the right model for the data. Machine learning methods applied to MS-based metabolomics ease data analysis and can support clinical decisions, guide metabolic engineering, and stimulate fundamental biological discoveries.

