Cancer type classification in liquid biopsies based on sparse mutational profiles enabled through data augmentation and integration

Abstract:

Identifying the cell of origin of cancer is important to guide treatment decisions. However, in patients with ‘cancer of unknown primary’ (CUP), standard diagnostic tools often fail to identify the primary tumor. As an alternative, machine learning approaches have been proposed to classify the cell of origin based on somatic mutation profiles in the genome of solid tissue biopsies. However, solid biopsies can cause complications and certain tumors are not accessible. A promising alternative would be liquid biopsies, which contain ctDNA originating from the tumor. Problematically, somatic mutation profiles of tumors obtained from liquid biopsies are inherently extremely sparse, and current machine learning models fail to perform in this setting. Here we propose an improved machine learning method to deal with the sparse nature of liquid biopsy data. Firstly, we downsample the SNVs in the samples in order to mimic sparse data conditions. Then, extensive data augmentation is performed to artificially increase the number of training samples in order to enhance model robustness under sparse data conditions. Finally, we employ data integration to merge information from i) somatic single nucleotide variant (SNV) density across the genome, ii) somatic SNVs in driver genes and iii) trinucleotide motifs. Our adapted method achieves an average accuracy of 0.88 on the data where only 70% of SNVs are retained, which is comparable to an average accuracy of 0.87 with the original model on the full SNV data. Even when only 2% of the data is retained, the average accuracy is 0.65 compared to 0.41 with the original model. The method and results presented here open the way for application of machine learning in the detection of the cell of origin of cancer from sparse liquid biopsy data.

Author Summary:

The identification of the ‘cell of origin’ of cancer is an important step towards more personalized cancer care, but this remains a challenge for patients with ‘cancer of unknown primary’ (CUP), where the source of the malignancy cannot be identified even after extensive clinical assessment with standard diagnostic methods. Somatic mutation profile-based ‘cell of origin’ classification has emerged in recent years as a promising alternative diagnostic tool that could circumvent the issues of standard CUP diagnostics. In this approach the somatic mutations are obtained from whole genome sequencing (WGS) of solid tissue biopsies from the tumor. However, needle biopsies from tumor tissue can be challenging, as accessibility to the tumor can be limited and taking a biopsy can cause further complications. For these reasons, liquid biopsies have been proposed as a safer alternative to solid tissue biopsies. Problematically, the circulating tumor DNA fragments available in e.g. blood typically represent a much scarcer tumor source than conventional solid tissue biopsies, and therefore liquid biopsies give rise to sparse somatic mutation profiles. It is therefore crucial to investigate the applicability of sparse somatic mutation profiles in the identification of ‘cell of origin’ and to explore potential improvements of the data analysis and prediction models to overcome sparsity.
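
The downsampling and augmentation steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the augmented samples are generated, so this sketch assumes that each training sample is augmented by drawing several independent random subsets of its SNVs. The function names, the (chromosome, position) representation, and the toy sample are hypothetical and are not taken from the paper.

```python
import random

def downsample_snvs(snvs, keep_fraction, rng):
    """Return a random subset holding about keep_fraction of the SNVs."""
    n_keep = max(1, round(len(snvs) * keep_fraction))
    return rng.sample(snvs, n_keep)

def augment(snvs, keep_fraction, n_copies, rng):
    """Assumed augmentation: draw several independent sparse copies of one sample."""
    return [downsample_snvs(snvs, keep_fraction, rng) for _ in range(n_copies)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical sample: 200 SNVs stored as (chromosome, position) pairs.
sample = [("chr1", 1000 * i) for i in range(200)]

# Keep 2% of the SNVs, the sparsest setting mentioned in the abstract.
sparse_copies = augment(sample, keep_fraction=0.02, n_copies=5, rng=rng)
```

Each sparse copy keeps the cancer-type label of the sample it was drawn from, so one labeled tumor yields several sparse training examples.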

Cancer Type Classification in Liquid Biopsies Based on Sparse Mutational Profiles Enabled through Data Augmentation and Integration

Life ◽

10.3390/life12010001 ◽

2021 ◽

Vol 12 (1) ◽

pp. 1

Author(s):

Alexandra Danyi ◽

Myrthe Jager ◽

Jeroen de Ridder

Keyword(s):

Machine Learning ◽

Somatic Mutation ◽

Liquid Biopsy ◽

Data Augmentation ◽

Cancer Type ◽

Learning Approaches ◽

Cell Of Origin ◽

Liquid Biopsies ◽

Driver Genes ◽

Average Accuracy

Identifying the cell of origin of cancer is important to guide treatment decisions. Machine learning approaches have been proposed to classify the cell of origin based on somatic mutation profiles from solid biopsies. However, solid biopsies can cause complications and certain tumors are not accessible. Liquid biopsies are promising alternatives but their somatic mutation profile is sparse and current machine learning models fail to perform in this setting. We propose an improved method to deal with sparsity in liquid biopsy data. Firstly, data augmentation is performed on sparse data to enhance model robustness. Secondly, we employ data integration to merge information from: (i) SNV density; (ii) SNVs in driver genes and (iii) trinucleotide motifs. Our adapted method achieves an average accuracy of 0.88 and 0.65 on data where only 70% and 2% of SNVs are retained, compared to 0.83 and 0.41 with the original model, respectively. The method and results presented here open the way for application of machine learning in the detection of the cell of origin of cancer from liquid biopsy data.
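
As a rough illustration of the data integration step, the following Python sketch builds one feature vector from the three information sources named in the abstract. It assumes plain concatenation of the three feature groups, which is only one possible integration strategy and may differ from what the authors implemented. All function names, the bin size, the gene list, and the motif labels are illustrative assumptions.

```python
from collections import Counter

def snv_density(positions, genome_length, bin_size):
    """Count SNVs per fixed-size genomic bin."""
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in positions:
        counts[pos // bin_size] += 1
    return counts

def driver_gene_flags(mutated_genes, driver_genes):
    """1 if the driver gene carries an SNV in this sample, else 0."""
    return [1 if gene in mutated_genes else 0 for gene in driver_genes]

def motif_counts(contexts, motifs):
    """Count how often each trinucleotide motif occurs in the sample."""
    counter = Counter(contexts)
    return [counter[m] for m in motifs]

# Toy inputs (hypothetical values, chosen only to keep the example small).
density = snv_density([5, 15, 17, 42], genome_length=50, bin_size=10)
flags = driver_gene_flags({"KRAS"}, ["TP53", "KRAS", "BRAF"])
motifs = motif_counts(["A[C>T]G", "A[C>T]G", "T[C>A]A"],
                      ["A[C>T]G", "T[C>A]A", "G[T>C]C"])

# Assumed integration: concatenate the three feature groups into one vector.
features = density + flags + motifs
```

The resulting vector could then be passed to any classifier. Whether the groups should be scaled or weighted before merging is left open here.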

Mind wandering as data augmentation: How mental travel supports abstraction

Behavioral and Brain Sciences ◽

10.1017/s0140525x1900311x ◽

2020 ◽

Vol 43 ◽

Author(s):

Myrthe Faber

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Mental Content ◽

Mind Wandering ◽

Theoretical Framework ◽

Important Addition

Gilead et al. state that abstraction supports mental travel, and that mental travel critically relies on abstraction. I propose an important addition to this theoretical framework, namely that mental travel might also support abstraction. Specifically, I argue that spontaneous mental travel (mind wandering), much like data augmentation in machine learning, provides variability in mental content and context necessary for abstraction.

Enhancement of Image Classification through Data Augmentation using Machine Learning

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i9.220224 ◽

2018 ◽

Vol 6 (9) ◽

pp. 220-224

Author(s):

Th. S. Kumar

Keyword(s):

Machine Learning ◽

Image Classification ◽

Data Augmentation

Building Damage Detection from Post-Event Aerial Imagery Using Single Shot Multibox Detector

Applied Sciences ◽

10.3390/app9061128 ◽

2019 ◽

Vol 9 (6) ◽

pp. 1128 ◽

Cited By ~ 12

Author(s):

Yundong Li ◽

Wei Hu ◽

Han Dong ◽

Xueyan Zhang

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Hurricane Sandy ◽

Training Data ◽

Aerial Images ◽

Detection Methods ◽

Single Shot ◽

Data Set ◽

Augmentation Strategies ◽

Post Disaster

Aerial cameras, satellite remote sensing, and unmanned aerial vehicles (UAVs) equipped with cameras can facilitate search and rescue tasks after disasters. The traditional manual interpretation of huge aerial images is inefficient and could be replaced by machine learning-based methods combined with image processing techniques. Given the development of machine learning, researchers find that convolutional neural networks can effectively extract features from images. Some target detection methods based on deep learning, such as the single-shot multibox detector (SSD) algorithm, can achieve better results than traditional methods. However, the impressive performance of machine learning-based methods depends on large numbers of labeled samples. Given the complexity of post-disaster scenarios, obtaining many samples in the aftermath of disasters is difficult. To address this issue, the current study proposes a damaged building assessment method using SSD with pretraining and data augmentation, with the following highlights. (1) Objects can be detected and classified into undamaged buildings, damaged buildings, and ruins. (2) A convolution auto-encoder (CAE) that consists of VGG16 is constructed and trained using unlabeled post-disaster images. As a transfer learning strategy, the weights of the SSD model are initialized using the weights of the CAE counterpart. (3) Data augmentation strategies, such as image mirroring, rotation, Gaussian blur, and Gaussian noise processing, are utilized to augment the training data set. As a case study, aerial images of Hurricane Sandy in 2012 were used to validate the proposed method’s effectiveness. Experiments show that the pretraining strategy improves overall accuracy by 10% compared with the SSD trained from scratch. These experiments also demonstrate that using data augmentation strategies can improve mAP and mF1 by 72% and 20%, respectively. Finally, the method is further verified on another dataset from Hurricane Irma, and it is concluded that the proposed method is feasible.
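
Three of the augmentation strategies listed in point (3) can be sketched in plain Python on a toy grayscale image stored as a list of rows. This is a minimal illustration under simplifying assumptions, not the authors' pipeline: Gaussian blur is omitted for brevity, the function names are hypothetical, and a detection setting such as SSD would also need to apply the same geometric transforms to the bounding boxes.

```python
import random

def mirror(image):
    """Flip a 2D grayscale image left to right."""
    return [row[::-1] for row in image]

def rotate90(image):
    """Rotate a 2D grayscale image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clamp pixels to the 0..255 range."""
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
tile = [[1, 2],
        [3, 4]]  # toy 2x2 "image" with hypothetical pixel values

# Each transform yields one extra training image from the same tile.
augmented = [mirror(tile), rotate90(tile), add_gaussian_noise(tile, 5.0, rng)]
```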

Data Augmentation for Machine Learning-Based Hardware Trojan Detection at Gate-Level Netlists

2021 IEEE 27th International Symposium on On-Line Testing and Robust System Design (IOLTS) ◽

10.1109/iolts52814.2021.9486713 ◽

2021 ◽

Author(s):

Kento Hasegawa ◽

Seira Hidano ◽

Kohei Nozawa ◽

Shinsaku Kiyomoto ◽

Nozomu Togawa

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Hardware Trojan ◽

Hardware Trojan Detection ◽

Trojan Detection

Loss of Smell and Taste Can Accurately Predict COVID-19 Infection: A Machine-Learning Approach

Journal of Clinical Medicine ◽

10.3390/jcm10040570 ◽

2021 ◽

Vol 10 (4) ◽

pp. 570

Author(s):

María A Callejon-Leblic ◽

Ramon Moreno-Luna ◽

Alfonso Del Cuvillo ◽

Isabel M Reyes-Tejero ◽

Miguel A Garcia-Villaran ◽

...

Keyword(s):

Machine Learning ◽

Modelling Framework ◽

Visual Analog Scales ◽

Average Accuracy ◽

Machine Learning Approach ◽

Taste Disorders ◽

Polymerase Chain ◽

Fold Cross Validation ◽

Validation Scheme ◽

Control Study

The COVID-19 outbreak has spread extensively around the world. Loss of smell and taste have emerged as main predictors for COVID-19. The objective of our study is to develop a comprehensive machine learning (ML) modelling framework to assess the predictive value of smell and taste disorders, along with other symptoms, in COVID-19 infection. A multicenter case-control study was performed in which suspected COVID-19 cases, who were tested by real-time reverse-transcription polymerase chain reaction (RT-PCR), reported the presence and severity of their symptoms using visual analog scales (VAS). ML algorithms were applied to the collected data to predict a COVID-19 diagnosis using a 50-fold cross-validation scheme, randomly splitting the patients into training (75%) and testing (25%) datasets. A total of 777 patients were included. Loss of smell and taste were found to be the symptoms with higher odds ratios of 6.21 and 2.42 for COVID-19 positivity. The ML algorithms applied reached an average accuracy of 80%, a sensitivity of 82%, and a specificity of 78% when using VAS to predict a COVID-19 diagnosis. This study concludes that smell and taste disorders are accurate predictors, with ML algorithms constituting helpful tools for COVID-19 diagnostic prediction.
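
The validation scheme and the reported metrics can be made concrete with a short Python sketch. It assumes that the "50-fold" scheme means 50 independent random 75%/25% splits of the 777 patients, which is one reading of the abstract, and it leaves out the classifier itself. The function names and the toy label vectors are hypothetical.

```python
import random

def random_split(n_patients, test_fraction, rng):
    """Shuffle patient indices and return (train_indices, test_indices)."""
    indices = list(range(n_patients))
    rng.shuffle(indices)
    n_test = round(n_patients * test_fraction)
    return indices[n_test:], indices[:n_test]

def diagnostic_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    Assumes both classes occur in y_true, otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)  # share of true positives that are detected
    specificity = tn / (tn + fp)  # share of true negatives that are detected
    return accuracy, sensitivity, specificity

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# 50 random 75%/25% splits of 777 patients, as stated in the abstract.
splits = [random_split(777, 0.25, rng) for _ in range(50)]

# Toy labels (1 = RT-PCR positive); hypothetical, not study data.
metrics = diagnostic_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                             [1, 1, 0, 0, 0, 1, 1, 0])
```

In a full experiment, a model would be fitted on each training split and the three metrics averaged over the 50 test splits.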

DataLoc+: A Data Augmentation Technique for Machine Learning in Room-Level Indoor Localization

2021 IEEE Wireless Communications and Networking Conference (WCNC) ◽

10.1109/wcnc49053.2021.9417246 ◽

2021 ◽

Author(s):

Amr Hilal ◽

Ismail Arai ◽

Samy El-Tawab

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Indoor Localization

A Generative Adversarial Network (GAN) Technique for Internet of Medical Things Data

Sensors ◽

10.3390/s21113726 ◽

2021 ◽

Vol 21 (11) ◽

pp. 3726

Author(s):

Ivan Vaccari ◽

Vanessa Orani ◽

Alessia Paglialonga ◽

Enrico Cambiaso ◽

Maurizio Mongelli

Keyword(s):

Machine Learning ◽

Data Augmentation ◽

Monitoring Program ◽

Clinical Decision Support Systems ◽

Direct Access ◽

Generative Adversarial Networks ◽

Chronic Obstructive ◽

Generative Adversarial Network ◽

Internet Of Medical Things ◽

Synthetic Datasets

The application of machine learning and artificial intelligence techniques in the medical world is growing, with a range of purposes: from the identification and prediction of possible diseases to patient monitoring and clinical decision support systems. Furthermore, the widespread use of remote monitoring medical devices, under the umbrella of the “Internet of Medical Things” (IoMT), has simplified the retrieval of patient information as they allow continuous monitoring and direct access to data by healthcare providers. However, due to possible issues in real-world settings, such as loss of connectivity, irregular use, misuse, or poor adherence to a monitoring program, the data collected might not be sufficient to implement accurate algorithms. For this reason, data augmentation techniques can be used to create synthetic datasets sufficiently large to train machine learning models. In this work, we apply the concept of generative adversarial networks (GANs) to perform a data augmentation from patient data obtained through IoMT sensors for Chronic Obstructive Pulmonary Disease (COPD) monitoring. We also apply an explainable AI algorithm to demonstrate the accuracy of the synthetic data by comparing it to the real data recorded by the sensors. The results obtained demonstrate how synthetic datasets created through a well-structured GAN are comparable with a real dataset, as validated by a novel approach based on machine learning.

SVM-Based Bearing Anomaly Identification with Self-Tuning Network-Fuzzy Robust Proportional Multi Integral and Smart Autoregressive Model

Applied Sciences ◽

10.3390/app11062784 ◽

2021 ◽

Vol 11 (6) ◽

pp. 2784

Author(s):

Shahnaz TayebiHaghighi ◽

Insoo Koo

Keyword(s):

Machine Learning ◽

Variable Structure ◽

Rolling Element Bearing ◽

Crack Identification ◽

Original Signal ◽

Signal Modeling ◽

Residual Signal ◽

Average Accuracy ◽

Self Tuning ◽

The Difference

In this paper, the combination of an indirect self-tuning observer, smart signal modeling, and machine learning-based classification is proposed for rolling element bearing (REB) anomaly identification. The proposed scheme has three main stages. In the first stage, the original signal is resampled, and the root mean square (RMS) signal is extracted from it. In the second stage, the normal resampled RMS signal is approximated using the AutoRegressive with eXternal Uncertainty (ARXU) technique. Moreover, the nonlinearity of the bearing signal is addressed using the combination of the ARXU and the machine learning-based regression, which is called AMRXU. After signal modeling by AMRXU, the RMS resampled signal is estimated using a combination of the proportional multi-integral (PMI) technique, the variable structure (VS) Lyapunov technique, and a self-tuning network-fuzzy system (SNFS). Finally, in the third stage, the difference between the original signal and the estimated one is calculated to generate the residual signal. A machine learning-based classification technique is utilized to classify the residual signal. The Case Western Reserve University (CWRU) dataset is used to evaluate anomaly identification performance of the proposed scheme. In the experiments, the average accuracy for REB crack identification is 98.65%, 97.7%, 97.35%, and 97.67% for motor torque loads of 0-hp, 1-hp, 2-hp, and 3-hp, respectively.
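
The first and third stages of the scheme, RMS extraction and residual generation, reduce to simple arithmetic that can be sketched in Python. The second stage (the ARXU/AMRXU model and the PMI, VS, and SNFS observer) is not reproduced here; the `estimated` values below are a hypothetical stand-in for its output. Non-overlapping windows and all function names are assumptions made for this illustration.

```python
import math

def windowed_rms(signal, window):
    """Root mean square of each non-overlapping window of the signal."""
    return [
        math.sqrt(sum(x * x for x in signal[i:i + window]) / window)
        for i in range(0, len(signal) - window + 1, window)
    ]

def residual(original, estimated):
    """Element-wise difference between the original and estimated signals."""
    return [o - e for o, e in zip(original, estimated)]

# Toy vibration signal (hypothetical values): a large-amplitude segment
# followed by a small-amplitude segment.
raw = [3.0, -3.0, 3.0, -3.0, 1.0, -1.0, 1.0, -1.0]
rms = windowed_rms(raw, window=4)

# Placeholder for the observer's estimate of the normal RMS signal.
estimated = [1.0, 1.0]
res = residual(rms, estimated)
```

A residual near zero suggests the segment matches the modeled normal behavior, while a large residual is the kind of feature a downstream classifier could use to flag an anomaly.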

Machine Learning-Based Malicious X.509 Certificates’ Detection

Applied Sciences ◽

10.3390/app11052164 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2164

Author(s):

Jiaxin Li ◽

Zhaoxin Zhang ◽

Changyong Guo

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Ensemble Learning ◽

Traffic Analysis ◽

Learning Models ◽

Detection Model ◽

Analysis Tools ◽

Average Accuracy ◽

Machine Learning Models

X.509 certificates play an important role in encrypting data transmitted between the two parties of an HTTPS connection. With the popularization of X.509 certificates, more and more criminals leverage certificates to prevent their communications from being exposed by tools that analyze malicious traffic. Phishing sites and malware are good examples. Those X.509 certificates found in phishing sites or malware are called malicious X.509 certificates. This paper applies different machine learning models, including classical machine learning models, ensemble learning models, and deep learning models, to distinguish between malicious certificates and benign certificates with Verification for Extraction (VFE). The VFE is a system we design and implement for obtaining plentiful characteristics of certificates. The results show that ensemble learning models are the most stable and efficient models, with an average accuracy of 95.9%, which outperforms many previous works. In addition, we obtain an SVM-based detection model with an accuracy of 98.2%, which is the highest accuracy. The outcome indicates that the VFE is capable of capturing essential and crucial characteristics of malicious X.509 certificates.
