Compression and Machine Learning: A New Perspective on Feature Space Vectors

Predicting mechanical properties such as yield strength (YS) and ultimate tensile strength (UTS) is an intricate undertaking in practice, notwithstanding a plethora of well-established theoretical and empirical models. A data-driven approach should be a fundamental exercise when making YS/UTS predictions. For this study, we collected 16 descriptors (attributes) that capture the compositional and processing information, together with the corresponding YS/UTS values, for 5473 thermo-mechanically controlled processed (TMCP) steel alloys. We set up an integrated machine-learning (ML) platform consisting of 16 ML algorithms to predict the YS/UTS based on the descriptors. The integrated ML platform involved regularization-based linear regression algorithms, ensemble ML algorithms, and some non-linear ML algorithms. Despite the dirty nature of most real-world industry data, we obtained acceptable holdout dataset test results, such as R² > 0.6 and MSE < 0.01, for seven non-linear ML algorithms. The seven fully trained non-linear ML models were used for the ensuing ‘inverse design (prediction)’ based on an elitist-reinforced, non-dominated sorting genetic algorithm (NSGA-II). The NSGA-II enabled us to predict solutions that exhibit desirable YS/UTS values for each ML algorithm. In addition, the NSGA-II-driven solutions in the 16-dimensional input feature space were visualized using holographic research strategy (HRS) in order to systematically compare and analyze the inverse-predicted solutions for each ML algorithm.

A Machine Learning Approach to Reveal the NeuroPhenotypes of Autisms

International Journal of Neural Systems ◽ 10.1142/s0129065718500582 ◽ 2019 ◽ Vol 29 (07) ◽ pp. 1850058 ◽ Cited By ~ 8

Author(s): Juan M. Górriz ◽ Javier Ramírez ◽ F. Segovia ◽ Francisco J. Martínez ◽ Meng-Chuan Lai ◽ ...

Keyword(s): Machine Learning ◽ Brain Structure ◽ Feature Space ◽ Classification Problem ◽ Small Sample ◽ Biological Sex ◽ Machine Learning Approach ◽ Learning Machine ◽ Small Sample Sizes ◽ Low Dimensional

Although much research has been undertaken, the spatial patterns, developmental course, and sexual dimorphism of brain structure associated with autism remain enigmatic. One of the difficulties in investigating differences between the sexes in autism is the small sample sizes of available imaging datasets with mixed sex. Thus, the majority of the investigations have involved male samples, with females somewhat overlooked. This paper deploys machine learning on partial least squares feature extraction to reveal differences in regional brain structure between individuals with autism and typically developing participants. A four-class classification problem (sex and condition) is specified, with theoretical restrictions based on the evaluation of a novel upper bound in the resubstitution estimate. These conditions were imposed on the classifier complexity and feature space dimension to assure generalizable results from the training set to test samples. Accuracies above [Formula: see text] on gray and white matter tissues estimated from voxel-based morphometry (VBM) features are obtained in a sample of equal-sized high-functioning male and female adults with and without autism ([Formula: see text], [Formula: see text]/group). The proposed learning machine revealed how autism is modulated by biological sex using a low-dimensional feature space extracted from VBM. In addition, a spatial overlap analysis on reference maps partially corroborated predictions of the “extreme male brain” theory of autism in sexually dimorphic areas.

Exploiting node metadata to predict interactions in large networks using graph embedding and neural networks

10.1101/2021.06.10.447991 ◽ 2021

Author(s): Rogini Runghen ◽ Daniel B Stouffer ◽ Giulio Valentino Dalla Riva

Keyword(s): Machine Learning ◽ Neural Networks ◽ Link Prediction ◽ Graph Embedding ◽ Feature Space ◽ Machine Learning Techniques ◽ Large Networks ◽ Data Set ◽ Learning Techniques ◽ Low Dimensional

Collecting network interaction data is difficult. Non-exhaustive sampling and complex hidden processes often result in an incomplete data set. Thus, identifying potentially present but unobserved interactions is crucial both in understanding the structure of large-scale data and in predicting how previously unseen elements will interact. Recent studies in network analysis have shown that accounting for metadata (such as node attributes) can improve both our understanding of how nodes interact with one another and the accuracy of link prediction. However, the dimension of the object we need to learn to predict interactions in a network grows quickly with the number of nodes. Therefore, it becomes computationally and conceptually challenging for large networks. Here, we present a new predictive procedure combining a graph embedding method with machine learning techniques to predict interactions on the basis of nodes' metadata. Graph embedding methods project the nodes of a network onto a low-dimensional latent feature space. The position of the nodes in the latent feature space can then be used to predict interactions between nodes. Learning a mapping of the nodes' metadata to their position in a latent feature space corresponds to a classic, low-dimensional machine learning problem. In our current study we used the Random Dot Product Graph model to estimate the embedding of an observed network, and we tested different neural network architectures to predict the position of nodes in the latent feature space. Flexible machine learning techniques to map the nodes onto their latent positions allow us to account for multivariate and possibly complex node metadata. To illustrate the utility of the proposed procedure, we apply it to a large dataset of tourist visits to destinations across New Zealand.
We found that our procedure accurately predicts interactions for both existing nodes and nodes newly added to the network, while being computationally feasible even for very large networks. Overall, our study highlights that by exploiting the properties of a well-understood statistical model for complex networks and combining it with standard machine learning techniques, we can simplify the link prediction problem when incorporating multivariate node metadata. Our procedure can be immediately applied to different types of networks, and to a wide variety of data from different systems. As such, both from a network science and data science perspective, our work offers a flexible and generalisable procedure for link prediction.
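
The two-step idea in this abstract (embed the network, then learn a map from metadata to latent positions) can be illustrated with a minimal sketch. It assumes the embedding is estimated by a truncated SVD of the adjacency matrix, a common estimator for Random Dot Product Graph models, and it swaps in an ordinary least-squares map where the authors tested neural networks. The function names `rdpg_embed`, `predict_links`, `fit_metadata_map`, and `place_new_node` are hypothetical and not taken from the paper.

```python
import numpy as np

def rdpg_embed(adjacency, dim):
    """Estimate latent positions via a rank-`dim` truncated SVD (assumed estimator)."""
    u, s, vt = np.linalg.svd(adjacency, full_matrices=False)
    scale = np.sqrt(s[:dim])
    left = u[:, :dim] * scale        # latent positions of row nodes
    right = vt[:dim, :].T * scale    # latent positions of column nodes
    return left, right

def predict_links(left, right):
    """Interaction scores are dot products of latent positions."""
    return left @ right.T

def fit_metadata_map(metadata, positions):
    """Linear stand-in (NOT the paper's neural network): metadata -> latent position."""
    design = np.column_stack([metadata, np.ones(len(metadata))])
    coef, *_ = np.linalg.lstsq(design, positions, rcond=None)
    return coef

def place_new_node(metadata_row, coef):
    """Predict the latent position of an unseen node from its metadata alone."""
    return np.append(metadata_row, 1.0) @ coef

# Toy data: a rank-2 matrix of interaction probabilities between 6 and 5 nodes.
rng = np.random.default_rng(0)
true_left = rng.uniform(0.1, 0.7, size=(6, 2))
true_right = rng.uniform(0.1, 0.7, size=(5, 2))
prob = true_left @ true_right.T

left, right = rdpg_embed(prob, dim=2)
scores = predict_links(left, right)
```

Once `fit_metadata_map` has been trained on the embedded nodes, a node that was never observed in the network can be scored against existing nodes by `place_new_node(...) @ right.T`, which is the low-dimensional learning problem the abstract describes.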

New perspective? Comparing Frame Occurrence in Online and Traditional News Media Reporting on Europe’s “Migration Crisis”

10.31235/osf.io/h3tpy ◽ 2019

Author(s): Christian S. Czymara ◽ Marijn van Klingeren

Keyword(s): Machine Learning ◽ News Media ◽ Print Media ◽ Online News ◽ Online Media ◽ Media Reporting ◽ Machine Learning Methods ◽ Migration Crisis ◽ New Perspective ◽ Key Events

News media have shape-shifted over the last decades, with rising online news suppliers and an increase in online news consumption. We examine how reporting on immigration differs between popular German online and print media over three crucial years of the so-called immigration crisis, from 2015 to 2017. We extend knowledge on framing of the crisis by examining a period covering start, peak and the time after the intake of refugees. Moreover, we establish whether online and print reporting differs in terms of both frame occurrence and variability. Crises generally create an opening for the formation of new perspectives and frames. These conditions provide an ideal test to see whether the focus of media reporting differs between online and print sources. We extract the dominant frames in almost 18,500 articles using machine-learning methods. While results indicate that many frames are, on average, more visible in either online or print media, these differences do not appear to follow a systematic logic. Regarding diversity of frame usage, we find that online media are, on average, more dominated by particular frames compared to print and that frame diversity is largely independent of important key events happening during our period of investigation.

treeheatr: an R package for interpretable decision tree visualizations

10.1101/2020.07.10.196352 ◽ 2020

Author(s): Trang T. Le ◽ Jason H. Moore

Keyword(s): Machine Learning ◽ Decision Tree ◽ Feature Space ◽ R Package ◽ Tree Structure ◽ Decision Tree Model ◽ Teaching Tool ◽ Tree Model ◽ Machine Learning Methods ◽ Link Type

Summary: treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree’s leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and the importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students’ understanding of a simple decision tree model before diving into more complex tree-based machine learning methods. Availability: The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.

Broadening volcanic eruption forecasting using transfer machine learning

10.5194/egusphere-egu21-970 ◽ 2021

Author(s): David Dempsey ◽ Shane Cronin ◽ Andreas Kempa-Liehr ◽ Martin Letourneur

Keyword(s): Machine Learning ◽ Seismic Station ◽ Feature Space ◽ Forecast Model ◽ Linear Interpolation ◽ Lessons Learned ◽ Training Data ◽ Single Station ◽ Data Driven Approach ◽ Model Training

Sudden steam-driven eruptions at tourist volcanoes were the cause of 63 deaths at Mt Ontake (Japan) in 2014, and 22 deaths at Whakaari (New Zealand) in 2019. Warning systems that can anticipate these eruptions could provide crucial hours for evacuation or sheltering, but these require reliable forecasting. Recently, machine learning has been used to extract eruption precursors from observational data and train forecasting models. However, a weakness of this data-driven approach is its reliance on long observational records that span multiple eruptions. As many volcano datasets may only record one or no eruptions, there is a need to extend these techniques to data-poor locales. Transfer machine learning is one approach for generalising lessons learned at data-rich volcanoes and applying them to data-poor ones. Here, we tackle two problems: (1) generalising time series features between seismic stations at Whakaari to address recording gaps, and (2) training a forecasting model for Mt Ruapehu augmented using data from Whakaari. This required that we standardise data records at different stations for direct comparisons, devise an interpolation scheme to fill in missing eruption data, and combine volcano-specific feature matrices prior to model training. We trained a forecast model for Whakaari using tremor data from three eruptions recorded at one seismic station (WSRZ) and augmented by data from two other eruptions recorded at a second station (WIZ). First, the training data from both stations were standardised to a unit normal distribution in log space. Then, linear interpolation in feature space was used to infer missing eruption features at WSRZ. Under pseudo-prospective testing, the augmented model had similar forecasting skill to one trained using all five eruptions recorded at a single station (WIZ). However, extending this approach to Ruapehu, we saw reduced performance, indicating that more work is needed in standardisation and feature selection.
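
The standardisation step mentioned above can be sketched as follows. This is only an illustration under the assumption that "unit normal in log space" means a z-score of the base-10 logarithm of a positive-valued record; the function name `standardise_log` and the synthetic data are hypothetical, and the abstract does not give the authors' exact procedure.

```python
import numpy as np

def standardise_log(values):
    """Z-score a positive-valued record in log10 space (assumed interpretation)."""
    logged = np.log10(np.asarray(values, dtype=float))
    return (logged - logged.mean()) / logged.std()

# Synthetic tremor-like amplitudes at one station, and the same signal
# recorded with a different (hypothetical) gain at a second station.
rng = np.random.default_rng(0)
station_a = rng.lognormal(mean=0.0, sigma=1.0, size=500)
station_b = 3.5 * station_a

z_a = standardise_log(station_a)
z_b = standardise_log(station_b)
```

A constant gain difference between stations becomes a constant offset in log space, which the z-score removes, so the two standardised records coincide. That is what makes records from different stations directly comparable before feature matrices are combined.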

Empirical analysis of time series using feature selection algorithms

10.5194/egusphere-egu21-6697 ◽ 2021

Author(s): Mikhail Kanevski

Keyword(s): Machine Learning ◽ Time Series ◽ Feature Selection ◽ Dimensional Space ◽ Multivariate Time Series ◽ Feature Space ◽ Real Data ◽ Theoretical Concepts ◽ Wide Range ◽ Linear And Nonlinear

Nowadays a wide range of methods and tools to study and forecast time series is available. An important problem in forecasting concerns the embedding of time series, i.e. the construction of a high-dimensional space where the forecasting problem is treated as a regression task. There are several basic linear and nonlinear approaches to constructing such a space by defining an optimal delay vector using different theoretical concepts. Another way is to consider this space as an input feature space (IFS) and to apply machine learning feature selection (FS) algorithms to optimize the IFS according to the problem under study (analysis, modelling or forecasting). Such an approach is an empirical one: it is based on data and depends on the FS algorithms applied. In machine learning, features are generally classified as relevant, redundant and irrelevant. This gives a rich possibility to perform advanced multivariate time series exploration and to develop interpretable predictive models. Therefore, in the present research different FS algorithms are used to analyze fundamental properties of time series from an empirical point of view. Linear and nonlinear simulated time series are studied in detail to understand the advantages and drawbacks of the proposed approach. Real data case studies deal with air pollution and wind speed time series. Preliminary results are quite promising and more research is in progress.
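
As a rough illustration of treating a delay embedding as an input feature space, the sketch below builds lagged features from a series and ranks the lags with a simple correlation filter. The correlation filter is a stand-in chosen for brevity, not one of the FS algorithms used in the study, and the names `delay_embed` and `rank_lags_by_correlation` are hypothetical.

```python
import numpy as np

def delay_embed(series, n_lags):
    """Build the input feature space: column k holds the series delayed by k+1 steps."""
    series = np.asarray(series, dtype=float)
    features = np.column_stack(
        [series[n_lags - k - 1 : len(series) - k - 1] for k in range(n_lags)]
    )
    target = series[n_lags:]
    return features, target

def rank_lags_by_correlation(features, target):
    """Filter-style FS stand-in: order lags by absolute correlation with the target."""
    scores = [
        abs(np.corrcoef(features[:, k], target)[0, 1])
        for k in range(features.shape[1])
    ]
    return np.argsort(scores)[::-1] + 1  # lag numbers, most relevant first

# Simulated linear series in which each value depends on the value two steps back.
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 0.9 * x[t - 2] + noise[t]

features, target = delay_embed(x, n_lags=4)
ranking = rank_lags_by_correlation(features, target)
```

On this simulated series the lag-2 column should rank first, lag 4 is a redundant echo of lag 2, and the odd lags are irrelevant, which mirrors the relevant/redundant/irrelevant distinction in the abstract. A univariate filter like this cannot separate relevant from redundant features on its own, which is one reason more capable FS algorithms are of interest.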

Using Remote Monitoring and Machine Learning to Classify Slam Events of Wave Piercing Catamarans

The International Journal of Maritime Engineering ◽ 10.5750/ijme.v163ia3.797 ◽ 2021 ◽ Vol 163 (A3)

Author(s): B Shabani ◽ J Ali-Lavroff ◽ D S Holloway ◽ S Penev ◽ D Dessi ◽ ...

Keyword(s): Machine Learning ◽ Monitoring System ◽ Probability Distributions ◽ Feature Space ◽ Support Vector ◽ Acceleration Threshold ◽ Vector Machines ◽ Structural Responses ◽ As Stress

An onboard monitoring system can measure features such as stress cycle counts and provide warnings due to slamming. Considering current technology trends, there is an opportunity to incorporate machine learning methods into monitoring systems. A hull monitoring system has been developed and installed on a 111 m wave piercing catamaran (Hull 091) to remotely monitor the ship kinematics and hull structural responses. Parallel to that, an existing dataset of a similar vessel (Hull 061) was analysed using unsupervised and supervised learning models; these were found to be beneficial for the classification of bow entry events according to key kinematic parameters. A comparison of different algorithms, including linear support vector machines, naïve Bayes and decision trees, was conducted for the bow entry classification. In addition, using empirical probability distributions, the likelihood of wet-deck slamming was estimated given a vertical bow acceleration threshold of 1 in head seas, clustering the feature space with the approximate probabilities of 0.001, 0.030 and 0.25.

Sentiment Analysis on UAV-aided Product Comments Based on Machine Learning: From Sentence to Document Level

10.21203/rs.3.rs-104009/v1 ◽ 2020

Author(s): JINGYANG CAO ◽ Shirong Yin ◽ Guoxu Zhang

Keyword(s): Machine Learning ◽ Sentiment Analysis ◽ Accurate Result ◽ Supervised Machine Learning ◽ Hotel Management ◽ Novel Approach ◽ Online Comments ◽ New Perspective ◽ The Relationship ◽ Document Level

This paper presents a novel approach to analyzing the sentiment of product comments from sentence to document level and applies it to customer sentiment analysis of UAV-aided product comments for hotel management. In order to realize efficient sentiment analysis, a cascaded sentence-to-document sentiment classification method is investigated. Initially, a supervised machine learning method is applied to explore the sentiment polarity of the sentence (SPS). Afterward, the contribution of the sentence to the document (CSD) is calculated by using various statistical algorithms. Lastly, the sentiment polarity of the document (SPD) is determined by the SPS as well as its contribution. Comparative experiments have been established on the basis of hotel online comments, and the outcomes indicate that the proposed method not only raises the efficiency in attaining a more accurate result but also assists immensely in regards to the B5G wireless communication supported by the UAV. The findings provide a new perspective that sentence position and its sentiment similarity with the document (sentiment condition) dramatically disclose the relationship between sentence and document.
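
The final step of the cascade can be pictured with a small sketch in which the document polarity is a contribution-weighted average of sentence polarities. The weighting rule, the function name `document_polarity`, and the example values are assumptions made for illustration; the abstract does not specify how SPS and CSD are actually combined.

```python
def document_polarity(sentence_polarities, contributions):
    """Combine sentence polarities (SPS) into a document polarity (SPD).

    Assumed rule: a weighted average with the contributions (CSD) as weights.
    """
    total = sum(contributions)
    if total == 0:
        raise ValueError("contributions must not sum to zero")
    weighted = sum(p * w for p, w in zip(sentence_polarities, contributions))
    return weighted / total

# Hypothetical review with three sentences: positive, negative, positive.
sps = [1, -1, 1]
csd = [0.5, 0.2, 0.3]  # e.g. an opening sentence given the largest weight
spd = document_polarity(sps, csd)
```

With these made-up weights the single negative sentence is outweighed and the document comes out positive, which shows how sentence position or similarity could influence the document-level label.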

DeepAntigen: a novel method for neoantigen prioritization via 3D genome and deep sparse learning

Bioinformatics ◽ 10.1093/bioinformatics/btaa596 ◽ 2020 ◽ Vol 36 (19) ◽ pp. 4894-4901 ◽ Cited By ~ 1

Author(s): Yi Shi ◽ Zehua Guo ◽ Xianbin Su ◽ Luming Meng ◽ Mingxuan Zhang ◽ ...

Keyword(s): Machine Learning ◽ T Cell ◽ Distribution Patterns ◽ Supplementary Information ◽ Sparse Learning ◽ Mhc I ◽ 3D Genome ◽ Immunogenic Peptides ◽ New Perspective ◽ Personalized Cancer

Motivation: The mutations of cancers can encode the seeds of their own destruction, in the form of T-cell recognizable immunogenic peptides, also known as neoantigens. It is computationally challenging, however, to accurately prioritize the potential neoantigen candidates according to their ability to activate the T-cell immunoresponse, especially when the somatic mutations are abundant. Although a few neoantigen prioritization methods have been proposed to address this issue, an advanced machine learning model that is specifically designed to tackle this problem is still lacking. Moreover, none of the existing methods considers the original DNA loci of the neoantigens from the perspective of the 3D genome, which may provide key information for inferring neoantigens’ immunogenicity. Results: In this study, we discovered that DNA loci of the immunopositive and immunonegative MHC-I neoantigens have distinct spatial distribution patterns across the genome. We therefore used the 3D genome information along with an ensemble pMHC-I coding strategy, and developed a group feature selection-based deep sparse neural network model (DNN-GFS) that is optimized for neoantigen prioritization. DNN-GFS demonstrated increased neoantigen prioritization power compared to existing sequence-based approaches. We also developed a webserver named deepAntigen (http://yishi.sjtu.edu.cn/deepAntigen) that implements the DNN-GFS as well as other machine learning methods. We believe that this work provides a new perspective toward more accurate neoantigen prediction, which may eventually contribute to personalized cancer immunotherapy. Availability and implementation: Data and implementation are available on the webserver: http://yishi.sjtu.edu.cn/deepAntigen. Supplementary information: Supplementary data are available at Bioinformatics online.
