A Memory-Efficient Encoding Method for Processing Mixed-Type Data on Machine Learning

The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.

Download Full-text

Differentially private synthetic mixed-type data generation for unsupervised learning

Intelligent Decision Technologies ◽

10.3233/idt-210195 ◽

2021 ◽

pp. 1-29

Author(s):

Uthaipon Tao Tantipongpipat ◽

Chris Waites ◽

Digvijay Boob ◽

Amaresh Ankit Siva ◽

Rachel Cummings

Keyword(s):

Binary Data ◽

Mixed Type ◽

Differential Privacy ◽

Synthetic Data ◽

Original Data ◽

Generative Adversarial Networks ◽

Data Generation ◽

Sensitive Data ◽

Type Data ◽

Low Dimensional

We introduce the DP-auto-GAN framework for synthetic data generation, which combines the low dimensional representation of autoencoders with the flexibility of Generative Adversarial Networks (GANs). This framework can be used to take in raw sensitive data and privately train a model for generating synthetic data that will satisfy similar statistical properties as the original data. This learned model can generate an arbitrary amount of synthetic data, which can then be freely shared due to the post-processing guarantee of differential privacy. Our framework is applicable to unlabeled mixed-type data, that may include binary, categorical, and real-valued data. We implement this framework on both binary data (MIMIC-III) and mixed-type data (ADULT), and compare its performance with existing private algorithms on metrics in unsupervised settings. We also introduce a new quantitative metric able to detect diversity, or lack thereof, of synthetic data.

Download Full-text

Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning

F1000Research ◽

10.12688/f1000research.25877.2 ◽

2021 ◽

Vol 9 ◽

pp. 1186

Author(s):

Caitlin E. Coombes ◽

Zachary B. Abrams ◽

Samantha Nakayiza ◽

Guy Brock ◽

Kevin R. Coombes

Keyword(s):

Machine Learning ◽

Clinical Data ◽

Mixed Type ◽

R Package ◽

Fine Tuning ◽

Supervised Machine Learning ◽

Mixed Data ◽

Operating Characteristics ◽

Data Types ◽

Type Data

The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.

Download Full-text

Machine Learning Based Predictive Action on Categorical Non-Sequential Data

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190417150421 ◽

2020 ◽

Vol 13 (5) ◽

pp. 1020-1030

Author(s):

Pradeep S. ◽

Jagadish S. Kallimani

Keyword(s):

Machine Learning ◽

Data Analysis ◽

Categorical Data ◽

Numerical Data ◽

Processing Technique ◽

Machine Learning Algorithms ◽

Sequential Data ◽

Industry Standard ◽

Robust Model ◽

Future Work

Background: With the advent of data analysis and machine learning, there is a growing impetus of analyzing and generating models on historic data. The data comes in numerous forms and shapes with an abundance of challenges. The most sorted form of data for analysis is the numerical data. With the plethora of algorithms and tools it is quite manageable to deal with such data. Another form of data is of categorical nature, which is subdivided into, ordinal (order wise) and nominal (number wise). This data can be broadly classified as Sequential and Non-Sequential. Sequential data analysis is easier to preprocess using algorithms. Objective: The challenge of applying machine learning algorithms on categorical data of nonsequential nature is dealt in this paper. Methods: Upon implementing several data analysis algorithms on such data, we end up getting a biased result, which makes it impossible to generate a reliable predictive model. In this paper, we will address this problem by walking through a handful of techniques which during our research helped us in dealing with a large categorical data of non-sequential nature. In subsequent sections, we will discuss the possible implementable solutions and shortfalls of these techniques. Results: The methods are applied to sample datasets available in public domain and the results with respect to accuracy of classification are satisfactory. Conclusion: The best pre-processing technique we observed in our research is one hot encoding, which facilitates breaking down the categorical features into binary and feeding it into an Algorithm to predict the outcome. The example that we took is not abstract but it is a real – time production services dataset, which had many complex variations of categorical features. Our Future work includes creating a robust model on such data and deploying it into industry standard applications.

Download Full-text

Machine Learning Approach for Confirmation of COVID-19 Cases: Positive, Negative, Death and Release (Preprint)

10.2196/preprints.19526 ◽

2020 ◽

Cited By ~ 1

Author(s):

Samir Bandyopadhyay Sr ◽

SHAWNI DUTTA ◽

SHAWNI DUTTA

Keyword(s):

Machine Learning ◽

Short Term Memory ◽

Health Workers ◽

Original Data ◽

High Rate ◽

Learning Approach ◽

Good Decision ◽

Number Of Patients ◽

Machine Learning Approach ◽

Death Cases

BACKGROUND In recent days, Covid-19 coronavirus has been an immense impact on social, economic fields in the world. The objective of this study determines if it is feasible to use machine learning method to evaluate how much prediction results are close to original data related to Confirmed-Negative-Released-Death cases of Covid-19. For this purpose, a verification method is proposed in this paper that uses the concept of Deep-learning Neural Network. In this framework, Long short-term memory (LSTM) and Gated Recurrent Unit (GRU) are also assimilated finally for training the dataset and the prediction results are tally with the results predicted by clinical doctors. The prediction results are validated against the original data based on some predefined metric. The experimental results showcase that the proposed approach is useful in generating suitable results based on the critical disease outbreak. It also helps doctors to recheck further verification of virus by the proposed method. The outbreak of Coronavirus has the nature of exponential growth and so it is difficult to control with limited clinical persons for handling a huge number of patients with in a reasonable time. So it is necessary to build an automated model, based on machine learning approach, for corrective measure after the decision of clinical doctors. It could be a promising supplementary confirmation method for frontline clinical doctors. The proposed method has a high prediction rate and works fast for probable accurate identification of the disease. The performance analysis shows that a high rate of accuracy is obtained by the proposed method. OBJECTIVE Validation of COVID-19 disease METHODS Machine Learning RESULTS 90% CONCLUSIONS The combined LSTM-GRU based RNN model provides a comparatively better results in terms of prediction of confirmed, released, negative, death cases on the data. This paper presented a novel method that could recheck occurred cases of COVID-19 automatically. The data driven RNN based model is capable of providing automated tool for confirming, estimating the current position of this pandemic, assessing the severity, and assisting government and health workers to act for good decision making policy. It could be a promising supplementary rechecking method for frontline clinical doctors. It is now essential for improving the accuracy of detection process. CLINICALTRIAL 2020-04-03 3:22:36 PM

Download Full-text

Application of Bayesian networks to generate synthetic health data

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa303 ◽

2020 ◽

Author(s):

Dhamanpreet Kaur ◽

Matthew Sobiesk ◽

Shubham Patil ◽

Jin Liu ◽

Puran Bhagat ◽

...

Keyword(s):

Machine Learning ◽

Bayesian Networks ◽

Data Privacy ◽

Statistical Tests ◽

Synthetic Data ◽

Original Data ◽

Health Data ◽

Minimal Risk ◽

Data Types ◽

Automated Method

Abstract Objective This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data. Materials and Methods We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data. Results Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules. Discussion Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools. Conclusion We conclude the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.

Download Full-text

A review: preprocessing techniques and data augmentation for sentiment analysis

Computational Social Networks ◽

10.1186/s40649-020-00080-x ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Huu-Thanh Duong ◽

Tram-Anh Nguyen-Thi

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Supervised Learning ◽

Data Augmentation ◽

Original Data ◽

Training Data ◽

Unseen Data ◽

Augmentation Techniques ◽

User Intervention

AbstractIn literature, the machine learning-based studies of sentiment analysis are usually supervised learning which must have pre-labeled datasets to be large enough in certain domains. Obviously, this task is tedious, expensive and time-consuming to build, and hard to handle unseen data. This paper has approached semi-supervised learning for Vietnamese sentiment analysis which has limited datasets. We have summarized many preprocessing techniques which were performed to clean and normalize data, negation handling, intensification handling to improve the performances. Moreover, data augmentation techniques, which generate new data from the original data to enrich training data without user intervention, have also been presented. In experiments, we have performed various aspects and obtained competitive results which may motivate the next propositions.

Download Full-text

A machine learning-based predictor for the identification of the recurrence of patients with gastric cancer after operation

Scientific Reports ◽

10.1038/s41598-021-81188-6 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Chengmao Zhou ◽

Junhong Hu ◽

Ying Wang ◽

Mu-Huo Ji ◽

Jianhua Tong ◽

...

Keyword(s):

Machine Learning ◽

Gastric Cancer ◽

Learning Algorithms ◽

Test Group ◽

Operation Time ◽

Predictive Performance ◽

Original Data ◽

Postoperative Recurrence ◽

Machine Learning Algorithms ◽

Gastric Cancer Patients

AbstractTo explore the predictive performance of machine learning on the recurrence of patients with gastric cancer after the operation. The available data is divided into two parts. In particular, the first part is used as a training set (such as 80% of the original data), and the second part is used as a test set (the remaining 20% of the data). And we use fivefold cross-validation. The weight of recurrence factors shows the top four factors are BMI, Operation time, WGT and age in order. In training group:among the 5 machine learning models, the accuracy of gbm was 0.891, followed by gbm algorithm was 0.876; The AUC values of the five machine learning algorithms are from high to low as forest (0.962), gbm (0.922), GradientBoosting (0.898), DecisionTree (0.790) and Logistic (0.748). And the precision of the forest is the highest 0.957, followed by the GradientBoosting algorithm (0.878). At the same time, in the test group is as follows: the highest accuracy of Logistic was 0.801, followed by forest algorithm and gbm; the AUC values of the five algorithms are forest (0.795), GradientBoosting (0.774), DecisionTree (0.773), Logistic (0.771) and gbm (0.771), from high to low. Among the five machine learning algorithms, the highest precision rate of Logistic is 1.000, followed by the gbm (0.487). Machine learning can predict the recurrence of gastric cancer patients after an operation. Besides, the first four factors affecting postoperative recurrence of gastric cancer were BMI, Operation time, WGT and age.

Download Full-text

Integrated dimensionality reduction technique for mixed-type data involving categorical values

Applied Soft Computing ◽

10.1016/j.asoc.2016.02.015 ◽

2016 ◽

Vol 43 ◽

pp. 199-209 ◽

Cited By ~ 5

Author(s):

Chung-Chian Hsu ◽

Wei-Hao Huang

Keyword(s):

Dimensionality Reduction ◽

Mixed Type ◽

Reduction Technique ◽

Dimensionality Reduction Technique ◽

Type Data

Download Full-text

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers are interested in the problem of clustering categorical data and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR) which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Download Full-text