Image-Label Recovery on Fashion Data Using Image Similarity from Triple Siamese Network

Weakly labeled data are inevitable in many areas of artificial intelligence (AI) research, where one has only a modicum of knowledge about the complete dataset. One cause of weakly labeled data in AI is a shortage of accurately labeled data. Strict privacy controls or accidental loss may also cause missing-data problems. However, supervised machine learning (ML) requires accurately labeled data to solve a problem successfully. Data labeling is difficult and time-consuming because it requires manual work, perfect results, and sometimes the involvement of human experts (e.g., for medical data). In contrast, unlabeled data are inexpensive and easily available. Because labeled training data are scarce, researchers sometimes obtain only one or a few data points per category or label, and training a supervised ML model from such a small labeled set is challenging. The objective of this research is to recover missing labels in the dataset using state-of-the-art ML techniques within a semi-supervised approach. In this work, a novel convolutional neural network-based framework is trained with a few instances of a class to perform metric learning. The dataset is then converted into a graph signal, which is recovered using a recovery algorithm (RA) in the graph Fourier transform domain. The proposed approach was evaluated on a Fashion dataset for accuracy and precision and performed significantly better than graph neural networks and other state-of-the-art methods.
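
The abstract above does not state which loss the triple Siamese network is trained with. As a minimal sketch, assuming the commonly used hinge-style triplet loss on Euclidean distances (the function names, margin, and toy embeddings below are illustrative and not taken from the paper), metric learning of this kind can be pictured as follows:

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (an assumed choice, not confirmed by the abstract).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings with made-up values.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # same class as the anchor, distance 1
negative = [3.0, 4.0]  # different class, distance 5

loss_easy = triplet_loss(anchor, positive, negative)  # 1 - 5 + 1 < 0, so 0.0
loss_hard = triplet_loss(anchor, negative, positive)  # 5 - 1 + 1 = 5.0
```

Swapping the positive and negative roles shows how the loss penalizes an embedding in which a same-class item lies farther from the anchor than a different-class item.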


Iterative Metric Learning for Imbalance Data Classification

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/389 ◽

2018 ◽

Cited By ~ 1

Author(s):

Nan Wang ◽

Xibin Zhao ◽

Yu Jiang ◽

Yue Gao

Keyword(s):

Learning Strategy ◽

State Of The Art ◽

Metric Learning ◽

Training Data ◽

Imbalance Data ◽

Data Space ◽

Software Defect ◽

Training Samples ◽

Testing Data ◽

Data Program

In many classification applications, such as software defect prediction and medical diagnosis, the amount of data from different categories usually varies significantly. Under such circumstances, it is essential to propose a proper method to solve the imbalance issue among the data. However, most existing methods focus on improving the performance of classifiers rather than searching for an effective data space for classification. In this paper, we propose a method named Iterative Metric Learning (IML) to explore the correlations among imbalanced data and construct an effective data space for classification. Given imbalanced training data, it is important to select a subset of training samples for each test sample. Thus, we aim to find a more stable neighborhood for the test data using the iterative metric learning strategy. To evaluate the effectiveness of the proposed method, we conducted experiments on two groups of datasets, i.e., the NASA Metrics Data Program (NASA) dataset and the UCI Machine Learning Repository (UCI) dataset. Experimental results and comparisons with state-of-the-art methods show that our proposed method performs better.


Superscan: Supervised Single-Cell Annotation

10.1101/2021.05.20.445014 ◽

2021 ◽

Author(s):

Carolyn Shasha ◽

Yuan Tian ◽

Florian Mair ◽

Helen E Rodgers Miller ◽

Raphael Gottardo

Keyword(s):

Single Cell ◽

State Of The Art ◽

Marker Gene ◽

Surface Protein ◽

Cell Types ◽

Training Data ◽

Supervised Machine Learning ◽

Cell Type ◽

Surface Protein Expression ◽

Meta Analyses

Automated cell type annotation of single-cell RNA-seq data has the potential to significantly improve and streamline single cell data analysis, facilitating comparisons and meta-analyses. However, many of the current state-of-the-art techniques suffer from limitations, such as reliance on a single reference dataset or marker gene set, or excessive run times for large datasets. Acquiring high-quality labeled data to use as a reference can be challenging. With CITE-seq, surface protein expression of cells can be directly measured in addition to the RNA expression, facilitating cell type annotation. Here, we compiled and annotated a collection of 16 publicly available CITE-seq datasets. This data was then used as training data to develop Superscan, a supervised machine learning-based prediction model. Using our 16 reference datasets, we benchmarked Superscan and showed that it performs better in terms of both accuracy and speed when compared to other state-of-the-art cell annotation methods. Superscan is pre-trained on a collection of primarily PBMC immune datasets; however, additional data and cell types can be easily added to the training data for further improvement. Finally, we used Superscan to reanalyze a previously published dataset, demonstrating its applicability even when the dataset includes cell types that are missing from the training set.


Data-Adaptive Metric Learning with Scale Alignment

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013347 ◽

2019 ◽

Vol 33 ◽

pp. 3347-3354 ◽

Cited By ~ 1

Author(s):

Shuo Chen ◽

Chen Gong ◽

Jian Yang ◽

Ying Tai ◽

Le Hui ◽

...

Keyword(s):

Metric Learning ◽

Projection Matrix ◽

Training Data ◽

Data Pair ◽

Data Points ◽

Local Patterns ◽

Data Adaptive ◽

Thresholding Algorithm ◽

Projection Matrices ◽

Adaptive Metric

The central problem for most existing metric learning methods is to find a suitable projection matrix on the differences of all pairs of data points. However, a single unified projection matrix can hardly characterize all data similarities accurately because practical data are usually very complicated, and simply adopting one global projection matrix might ignore important local patterns hidden in the dataset. To address this issue, this paper proposes a novel method dubbed “Data-Adaptive Metric Learning” (DAML), which constructs a data-adaptive projection matrix for each data pair by selectively combining a set of learned candidate matrices. As a result, every data pair obtains a specific projection matrix, enabling the proposed DAML to flexibly fit the training data and produce discriminative projection results. The DAML model is formulated as an optimization problem that jointly learns candidate projection matrices and their sparse combination for every data pair. Nevertheless, over-fitting may occur because of the large number of parameters to be learned. To tackle this issue, we adopt the Total Variation (TV) regularizer to align the scales of the data embeddings produced by all candidate projection matrices, so that the metrics generated by these learned candidates are generally comparable. Furthermore, we extend the basic linear DAML model to a kernelized version (denoted “KDAML”) to handle non-linear cases, and the Iterative Shrinkage-Thresholding Algorithm (ISTA) is employed to solve the optimization model. Intensive experimental results on various applications, including retrieval, classification, and verification, clearly demonstrate the superiority of our algorithm over other state-of-the-art metric learning methodologies.


Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm

npj Computational Materials ◽

10.1038/s41524-020-00406-3 ◽

2020 ◽

Vol 6 (1) ◽

Cited By ~ 1

Author(s):

Alexander Dunn ◽

Qi Wang ◽

Alex Ganose ◽

Daniel Dopp ◽

Anubhav Jain

Keyword(s):

Crystal Structure ◽

Machine Learning ◽

Density Functional ◽

Supervised Machine Learning ◽

Test Suite ◽

Crystal Graph ◽

Learning Procedure ◽

Data Points ◽

Automated Machine Learning ◽

Graph Neural Networks

We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13 ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material’s composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm, namely that crystal graph methods appear to outperform traditional machine learning methods given ~10^4 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.


Automatic microseismic event picking via unsupervised machine learning

Geophysical Journal International ◽

10.1093/gji/ggaa186 ◽

2020 ◽

Vol 222 (3) ◽

pp. 1750-1764 ◽

Cited By ~ 1

Author(s):

Yangkang Chen

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Learning Algorithm ◽

State Of The Art ◽

The State ◽

Training Data ◽

Supervised Machine Learning ◽

Machine Learning Algorithm ◽

Unsupervised Machine Learning ◽

Earthquake Data

Effective and efficient arrival picking plays an important role in microseismic and earthquake data processing and imaging. Widely used arrival picking algorithms based on the short-term-average/long-term-average ratio (STA/LTA) are sensitive to moderate-to-strong random ambient noise. To make state-of-the-art arrival picking approaches effective, microseismic data need to be first pre-processed, for example by removing a sufficient amount of noise, and then analysed by arrival pickers. To overcome the noise issue in arrival picking for weak microseismic or earthquake events, I leverage machine learning techniques to help recognize seismic waveforms in microseismic or earthquake data. Because supervised machine learning algorithms depend on a large volume of well-designed training data, I use an unsupervised machine learning algorithm to cluster the time samples into two groups, that is, waveform points and non-waveform points. The fuzzy clustering algorithm has been demonstrated to be effective for this purpose. Tests on a group of synthetic, real microseismic, and earthquake data sets with different levels of complexity show that the proposed method is much more robust than the state-of-the-art STA/LTA method in picking microseismic events, even in the case of moderately strong background noise.
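
For readers unfamiliar with the STA/LTA baseline that the abstract compares against, the following is a minimal sketch of a classic trailing-window STA/LTA picker. It is not the author's fuzzy clustering method; the window lengths, threshold, and synthetic trace are illustrative assumptions.

```python
def sta_lta(trace, n_sta, n_lta):
    """Return the STA/LTA ratio per sample using trailing windows of
    squared amplitude. Assumes n_sta <= n_lta; samples that lack a
    full long window are left at 0.0."""
    energy = [x * x for x in trace]
    # Prefix sums make each window average a constant-time lookup.
    prefix = [0.0]
    for e in energy:
        prefix.append(prefix[-1] + e)
    ratio = [0.0] * len(trace)
    for i in range(n_lta - 1, len(trace)):
        sta = (prefix[i + 1] - prefix[i + 1 - n_sta]) / n_sta
        lta = (prefix[i + 1] - prefix[i + 1 - n_lta]) / n_lta
        ratio[i] = sta / lta if lta > 0 else 0.0
    return ratio


def first_pick(ratio, threshold):
    """Index of the first sample whose ratio exceeds the threshold, or None."""
    for i, r in enumerate(ratio):
        if r > threshold:
            return i
    return None


# Synthetic trace: low-amplitude background, then an arrival at sample 20.
trace = [0.1, -0.1] * 10 + [1.0, -1.0] * 5
ratio = sta_lta(trace, n_sta=2, n_lta=10)
pick = first_pick(ratio, threshold=3.0)
```

On this clean synthetic trace the ratio stays near 1 in the background and jumps at the arrival. Raising the background amplitude toward the signal amplitude flattens that jump, which is the noise sensitivity the abstract describes.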


Adversarial Metric Learning

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/279 ◽

2018 ◽

Cited By ~ 6

Author(s):

Shuo Chen ◽

Chen Gong ◽

Jian Yang ◽

Xiang Li ◽

Yang Wei ◽

...

Keyword(s):

State Of The Art ◽

Metric Learning ◽

Sampling Bias ◽

Training Data ◽

Loss Functions ◽

Original Training ◽

Training Set ◽

Learning Problem ◽

Optimization Framework ◽

The Past

Over the past decades, intensive efforts have been devoted to designing various loss functions and metric forms for the metric learning problem. These improvements have shown promising results when the test data are similar to the training data. However, the trained models often fail to produce reliable distances on ambiguous test pairs because the training set and test set are sampled differently. To address this problem, this paper proposes Adversarial Metric Learning (AML), which automatically generates adversarial pairs to remedy the sampling bias and facilitate robust metric learning. Specifically, AML consists of two adversarial stages, i.e., confusion and distinguishment. In the confusion stage, ambiguous but critical adversarial data pairs are adaptively generated to mislead the learned metric. In the distinguishment stage, a metric is exhaustively learned to distinguish both the adversarial pairs and the original training pairs as well as possible. Thanks to the challenges posed by the confusion stage in this competing process, the AML model is able to capture difficult knowledge that is not contained in the original training pairs, so the discriminability of AML can be significantly improved. The entire model is formulated as an optimization framework whose global convergence is theoretically proved. The experimental results on toy data and practical datasets clearly demonstrate the superiority of AML over representative state-of-the-art metric learning models.


Federated Meta-Learning for Fraudulent Credit Card Detection

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/642 ◽

2020 ◽

Author(s):

Wenbo Zheng ◽

Lan Yan ◽

Chao Gou ◽

Fei-Yue Wang

Keyword(s):

Data Security ◽

Credit Card ◽

State Of The Art ◽

Metric Learning ◽

Fraud Detection ◽

Training Data ◽

Security And Privacy ◽

Sensitive Information ◽

Detection Model ◽

Meta Learning

Credit card transaction fraud costs card issuers billions of dollars every year. Moreover, credit card transaction datasets are highly skewed: there are far fewer fraud samples than legitimate transactions. For data security and privacy reasons, different banks are usually not allowed to share their transaction datasets. These problems make it difficult for traditional models to learn the patterns of fraud and to detect it. In this paper, we introduce a novel framework, termed federated meta-learning, for fraud detection. Unlike traditional technologies trained with data centralized in the cloud, our model enables banks to learn a fraud detection model with the training data distributed across their own local databases. A shared whole model is constructed by aggregating locally computed updates of the fraud detection model. Banks can collectively reap the benefits of the shared model without sharing their datasets, thereby protecting the sensitive information of cardholders. To achieve good classification performance, we further formulate an improved triplet-like metric learning approach and design a novel meta-learning-based classifier, which allows joint comparison with K negative samples in each mini-batch. Experimental results demonstrate that the proposed approach achieves significantly higher performance than other state-of-the-art approaches.
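
The abstract does not specify how the locally computed updates are aggregated. As a hedged sketch, assuming a sample-count-weighted average in the style of federated averaging (the function name, bank parameter vectors, and dataset sizes below are hypothetical), the aggregation step might look like this:

```python
def aggregate(updates, sample_counts):
    """Weighted average of per-bank parameter vectors.

    Each bank is weighted by its number of local training samples.
    This weighting rule is an assumption, not taken from the paper.
    """
    total = sum(sample_counts)
    dim = len(updates[0])
    return [
        sum(u[j] * n for u, n in zip(updates, sample_counts)) / total
        for j in range(dim)
    ]


# Hypothetical parameter vectors from three banks. Only these vectors
# are exchanged; the raw transactions stay in each bank's database.
bank_updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
bank_sizes = [100, 100, 200]
shared = aggregate(bank_updates, bank_sizes)
```

The third bank holds half of all samples, so its parameters contribute half of the shared model, giving `[3.5, 4.5]` for these toy values.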


RS-SSKD: Self-Supervision Equipped with Knowledge Distillation for Few-Shot Remote Sensing Scene Classification

Sensors ◽

10.3390/s21051566 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1566

Author(s):

Pei Zhang ◽

Ying Li ◽

Dong Wang ◽

Jiyue Wang

Keyword(s):

Remote Sensing ◽

State Of The Art ◽

Ground Truth ◽

Training Data ◽

Scene Classification ◽

Training Time ◽

Shot Classification ◽

Meta Learning ◽

Knowledge Distillation ◽

Few Data

As a growing number of instruments generate more and more airborne or satellite images, the bottleneck in remote sensing (RS) scene classification has shifted from data limits toward a lack of ground truth samples. Many challenges remain when facing unknown environments, especially those with insufficient training data. Few-shot classification offers a different picture under the umbrella of meta-learning: extracting rich knowledge from only a few samples is possible. In this work, we propose a method named RS-SSKD for few-shot RS scene classification from the perspective of generating powerful representations for the downstream meta-learner. Firstly, we propose a novel two-branch network that takes three pairs of original-transformed images as inputs and incorporates Class Activation Maps (CAMs) to drive the network to mine the most relevant category-specific regions. This strategy ensures that the network generates discriminative embeddings. Secondly, we add a round of self-knowledge distillation to prevent overfitting and boost performance. Our experiments show that the proposed method surpasses current state-of-the-art approaches on two challenging RS scene datasets: NWPU-RESISC45 and RSD46-WHU. Finally, we conduct various ablation experiments to investigate the effect of each component of the proposed method and compare the training time of state-of-the-art methods with ours.


Identifying Physico-Chemical Laws from the Robotically Collected Data

10.26434/chemrxiv.8490149 ◽

2019 ◽

Author(s):

Liwei Cao ◽

Danilo Russo ◽

Vassilios S. Vassiliadis ◽

Alexei Lapkin

Keyword(s):

Experimental Data ◽

Numerical Models ◽

Predictor Variable ◽

Physical Models ◽

Training Data ◽

Mixed Integer ◽

Physico Chemical ◽

Data Points ◽

Future Work ◽

The Relationship

A mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed to identify physical models from noisy experimental data. The formulation was tested using numerical models and was found to be more efficient than the previous literature example with respect to the number of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the experimental data predictor variable. The methodology was coupled with the collection of experimental data in an automated fashion, and was proven to be successful in identifying the correct physical models describing the relationship between the shear stress and shear rate for both Newtonian and non-Newtonian fluids, and simple kinetic laws of reactions. Future work will focus on addressing the limitations of the formulation presented in this work, by extending it to be able to address larger complex physical models.


RLC-GNN: An Improved Deep Architecture for Spatial-Based Graph Neural Network with Application to Fraud Detection

Applied Sciences ◽

10.3390/app11125656 ◽

2021 ◽

Vol 11 (12) ◽

pp. 5656

Author(s):

Yufan Zeng ◽

Jiashan Tang

Keyword(s):

Numerical Experiments ◽

State Of The Art ◽

Single Layer ◽

Fraud Detection ◽

Layer By Layer ◽

Residual Structure ◽

Detection Algorithms ◽

Deep Architecture ◽

Graph Neural Networks ◽

Node Embeddings

Graph neural networks (GNNs) have been very successful at solving fraud detection tasks. GNN-based detection algorithms learn node embeddings by aggregating neighboring information. Recently, the CAmouflage-REsistant GNN (CARE-GNN) was proposed; this algorithm achieves state-of-the-art results on fraud detection tasks by dealing with relation camouflages and feature camouflages. However, stacking multiple layers in the traditional hop-defined way leads to a rapid performance drop. Because the single-layer CARE-GNN cannot extract more information to fix potential mistakes, its performance relies heavily on that single layer. To avoid single-layer learning, in this paper we consider a multi-layer architecture that forms a complementary relationship with a residual structure. We propose an improved algorithm named Residual Layered CARE-GNN (RLC-GNN). The new algorithm learns progressively, layer by layer, and corrects mistakes continuously. We evaluate the proposed algorithm with three metrics: recall, AUC, and F1-score. In numerical experiments, we obtain improvements of up to 5.66%, 7.72%, and 9.09% in recall, AUC, and F1-score, respectively, on the Yelp dataset. Moreover, we obtain improvements of up to 3.66%, 4.27%, and 3.25% in the same three metrics on the Amazon dataset.
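
Two of the three metrics named in the abstract, recall and F1-score, follow directly from confusion-matrix counts. The sketch below uses the standard definitions. The counts are made-up values for illustration, not results from the paper, and AUC is omitted because it requires ranked scores rather than counts.

```python
def precision(tp, fp):
    """Fraction of predicted frauds that are truly fraudulent."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true frauds that the detector catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 8 frauds caught, 2 false alarms, 2 frauds missed.
tp, fp, fn = 8, 2, 2
r = recall(tp, fn)
f1 = f1_score(tp, fp, fn)
```

Neither metric depends on the count of true negatives, which is why both remain informative on the heavily skewed class distributions typical of fraud data.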
