Semi-automated Conversion of Clinical Trial Legacy Data into CDISC SDTM Standards Format Using Supervised Machine Learning

Author(s):  
Takuma Oda ◽  
Shih-Wei Chiu ◽  
Takuhiro Yamaguchi

Abstract Objective This study aimed to develop a semi-automated process to convert legacy data into Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning. Materials and Methods Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to distributed representations to make them usable as machine learning features. For this purpose, we used the following methods of distributed representation: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. We examined five algorithms (decision tree, random forest, gradient boosting, neural network, and an ensemble combining the four) to identify the one that generated the best prediction model. Results The accuracy rate was highest for the neural network, and the distribution of prediction probabilities also showed a clear split between correct and incorrect predictions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into CDISC SDTM format. Conclusion By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into CDISC SDTM format; this process is more efficient than the conventional fully manual process.
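
As a concrete illustration of the distributed-representation step, the hedged Python sketch below computes Gestalt pattern matching scores with difflib and Doc2vec vectors and cosine similarities with gensim. The toy corpus, vector size, and function names are our own assumptions for illustration, not the paper's code.

```python
# Sketch of the three feature-extraction methods, assuming gensim is available.
from difflib import SequenceMatcher          # Gestalt pattern matching (Ratcliff/Obershelp)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

def gestalt_similarity(a: str, b: str) -> float:
    """Gestalt pattern matching score between two labels/names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Train a small Doc2vec model on variable labels (toy corpus for illustration).
corpus = [TaggedDocument(words=label.lower().split(), tags=[i])
          for i, label in enumerate(["systolic blood pressure", "adverse event term",
                                     "visit date", "lab test result"])]
d2v = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=50)

def doc2vec_cosine(a: str, b: str) -> float:
    """Cosine similarity of two labels after Doc2vec vectorization."""
    va = d2v.infer_vector(a.lower().split())
    vb = d2v.infer_vector(b.lower().split())
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def features(legacy_label: str, sdtm_label: str) -> np.ndarray:
    """Feature vector for one (legacy variable, SDTM target) candidate pair:
    the raw Doc2vec vector plus the two scalar similarities."""
    return np.concatenate([d2v.infer_vector(legacy_label.lower().split()),
                           [gestalt_similarity(legacy_label, sdtm_label),
                            doc2vec_cosine(legacy_label, sdtm_label)]])
```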

Author(s):  
Dmitriy D. Matyushin ◽  
Anastasia Yu. Sholokhova ◽  
Aleksey K. Buryak

The estimation of gas chromatographic retention indices from compound structures is an important problem. Predicted retention indices can be used in a mass spectral library search for the identification of unknowns. Various machine learning methods are used for this task, but methods based on decision trees, in particular gradient boosting, are not widely used. The aim of this work is to examine the usability of this method for retention index prediction. 177 molecular descriptors computed with the Chemistry Development Kit are used as the input representation of a molecule. Random subsets of the whole NIST 17 database are used as training, test, and validation sets. 8000 trees with 6 leaves each are used. A neural network with one hidden layer (90 hidden nodes) is used for comparison. The same data sets and the same set of descriptors are used for the neural network and for gradient boosting. The model based on gradient boosting outperforms the neural network with one hidden layer for subsets of NIST 17 and for a set of essential oils. The performance of this model is comparable to or better than that of other modern retention prediction models. The average relative deviation is ~3.0% and the median relative deviation is ~1.7% for subsets of NIST 17. The median absolute deviation is ~34 retention index units. Only non-polar liquid stationary phases (such as polydimethylsiloxane, 5% phenyl 95% polydimethylsiloxane, and squalane) are considered. Errors obtained with different machine learning algorithms and the same representation of the molecule strongly correlate with each other.
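
The hedged sketch below reproduces the reported configuration (8000 trees, 6 leaves each) with scikit-learn's GradientBoostingRegressor. The descriptor and retention-index files and the learning rate are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch, assuming the 177 CDK descriptors have already been computed
# and stored in X with retention indices in y; file names are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X = np.load("cdk_descriptors.npy")     # shape (n_molecules, 177), hypothetical file
y = np.load("retention_indices.npy")   # hypothetical file

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

model = GradientBoostingRegressor(n_estimators=8000,   # 8000 trees, as reported
                                  max_leaf_nodes=6,    # 6 leaves each, as reported
                                  learning_rate=0.05)  # learning rate assumed
model.fit(X_train, y_train)

pred = model.predict(X_test)
rel_dev = np.abs(pred - y_test) / y_test
print(f"mean relative deviation:   {100 * rel_dev.mean():.1f}%")
print(f"median relative deviation: {100 * np.median(rel_dev):.1f}%")
print(f"median absolute deviation: {np.median(np.abs(pred - y_test)):.0f} RI units")
```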


Author(s):  
Michael Fortunato ◽  
Connor W. Coley ◽  
Brian Barnes ◽  
Klavs F. Jensen

This work presents efforts to augment the performance of data-driven machine learning algorithms for reaction template recommendation used in computer-aided synthesis planning software. Often, machine learning models designed to prioritize reaction templates or molecular transformations focus on reporting high accuracy metrics for the one-to-one mapping of product molecules in reaction databases to the template extracted from the recorded reaction. The templates selected for inclusion in these machine learning models have previously been limited to those that appear frequently in the reaction databases, excluding potentially useful transformations. By augmenting open-access datasets of organic reactions with artificially calculated template applicability and pretraining a template relevance neural network on this augmented applicability dataset, we report an increase in template applicability recall and an increase in the diversity of predicted precursors. The augmentation and pretraining effectively teach the neural network an increased set of templates that could theoretically lead to successful reactions for a given target. Even on a small dataset of well-curated reactions, the data augmentation and pretraining methods resulted in an increase in top-1 accuracy, especially for rare templates, indicating these strategies can be very useful for small datasets.
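
A minimal PyTorch sketch of the two-stage idea, assuming a fingerprint input and toy tensors in place of the real reaction data; the network size, template count, and training details are our assumptions, not the authors' implementation.

```python
# Hedged sketch: pretrain on multi-label template applicability, then
# fine-tune the same weights on single-template relevance.
import torch
import torch.nn as nn

N_FP, N_TEMPLATES = 2048, 1000   # fingerprint length / template count, assumed

model = nn.Sequential(nn.Linear(N_FP, 512), nn.ReLU(), nn.Linear(512, N_TEMPLATES))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy stand-ins for real product fingerprints and labels.
fps = torch.rand(64, N_FP)
applicable = (torch.rand(64, N_TEMPLATES) < 0.01).float()  # many templates apply
recorded = torch.randint(0, N_TEMPLATES, (64,))            # one recorded template

# Stage 1: applicability pretraining -- independent binary label per template.
bce = nn.BCEWithLogitsLoss()
for _ in range(10):
    loss = bce(model(fps), applicable)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: relevance fine-tuning -- softmax over the same output layer,
# with only the recorded template counted as correct.
ce = nn.CrossEntropyLoss()
for _ in range(10):
    loss = ce(model(fps), recorded)
    opt.zero_grad(); loss.backward(); opt.step()
```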


2021 ◽  
Vol 2072 (1) ◽  
pp. 012005
Author(s):  
M Sumanto ◽  
M A Martoprawiro ◽  
A L Ivansyah

Abstract Machine learning is an artificial intelligence approach in which a system learns automatically from experience without being explicitly programmed. The learning process starts from observing the data and then looking for patterns in it. The main purpose of this process is to make computers learn automatically. In this study, we use machine learning to predict molecular atomization energies. Among the various machine learning methods, we use two: neural networks and extreme gradient boosting. Both methods have several parameters that must be tuned so that the predicted atomization energy has the lowest possible error, and we seek the right parameter values for both. For the neural network, this is difficult because training a model takes a long time before its quality can be judged, whereas for extreme gradient boosting the training time is shorter, so suitable parameter values are easier to find. This study also examined the effects of modifying the dataset: transforming the output by normalization and standardization, removing molecules containing Br atoms, and setting Coulomb matrix entries to 0 when the distance between atoms in the molecule exceeds 2 angstroms.
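
For concreteness, a short sketch of the Coulomb matrix input with the described modification (entries zeroed when the interatomic distance exceeds 2 angstroms); the toy water molecule is illustrative.

```python
# Sketch of the Coulomb matrix representation with the studied distance cutoff.
import numpy as np

def coulomb_matrix(Z, R, cutoff=2.0):
    """Z: (n,) nuclear charges; R: (n, 3) coordinates in angstroms."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4          # standard diagonal term
            else:
                d = np.linalg.norm(R[i] - R[j])
                # Modification described above: zero the entry beyond the cutoff.
                M[i, j] = 0.0 if d > cutoff else Z[i] * Z[j] / d
    return M

# Water as a toy example: O at the origin, two H atoms.
Z = np.array([8, 1, 1])
R = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
print(coulomb_matrix(Z, R))
```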


Author(s):  
B.Meena Preethi ◽  
◽  
R. Gowtham ◽  
S. Aishvarya ◽  
S. Karthick ◽  
...  

The project entitled "Rainfall Prediction Using Machine Learning & Deep Learning Algorithms" is a research project developed in Python, with the dataset stored in Microsoft Excel. The prediction uses various machine learning and deep learning algorithms to find which one predicts most accurately. Rainfall prediction can be framed as binary classification under data mining. Predicting rainfall is important in several respects for a country and can help prevent serious natural disasters. For this prediction, an artificial neural network using forward and backward propagation, AdaBoost, gradient boosting, and XGBoost are used in this model. There are five modules in this project. The data analysis module analyses the dataset and finds its missing values. The data pre-processing module includes data cleaning, which fills in the missing values. The feature transformation module modifies the features of the dataset. The data mining module trains the models on the dataset so that they learn its patterns. The model evaluation module measures the performance of the models and determines the best overall accuracy for the prediction. The dataset used in this prediction is for Australia. The main aim of the project is to compare the various boosting algorithms with the neural network and find the best algorithm among them. This prediction can be a major advantage to farmers, helping them plant crop types according to their water requirements. Overall, we analyse which algorithm is feasible for qualitatively predicting rainfall.
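
A hedged sketch of the comparison step, assuming a cleaned feature matrix and binary "rain tomorrow" labels already produced by the earlier modules; make_classification stands in for the Australian rainfall data, and xgboost is assumed installed.

```python
# Minimal model-comparison sketch; stand-in data, assumed hyperparameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(),
    "GradientBoosting": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "NeuralNetwork": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(f"{name:18s} accuracy = {accuracy_score(y_te, clf.predict(X_te)):.3f}")
```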


2020 ◽  
Vol 10 (17) ◽  
pp. 6048 ◽  
Author(s):  
Nedeljko Dučić ◽  
Aleksandar Jovičić ◽  
Srećko Manasijević ◽  
Radomir Radiša ◽  
Žarko Ćojbašić ◽  
...  

This paper presents the application of machine learning to the control of the metal melting process. Metal melting is a dynamic production process characterized by nonlinear relations between process parameters. In this particular case, the subject of research is the production of white cast iron. Two supervised machine learning algorithms were applied: a neural network and support vector regression. The goal of their application is to predict the amount of alloying additives needed to obtain the desired chemical composition of white cast iron. The neural network model provided better results than the support vector regression model in both the training and testing phases, which qualifies it for use in the control of white cast iron production.
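
A minimal sketch of the two compared regressors in scikit-learn, with random stand-in data in place of the real melt-chemistry measurements; the architecture and kernel choices are assumptions, not the authors' settings.

```python
# Hedged comparison sketch: neural network vs. support vector regression.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((500, 6))                                   # e.g. measured melt chemistry, assumed
y = X @ rng.random(6) + 0.1 * rng.standard_normal(500)     # alloying-additive amount, assumed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, reg in {"neural network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000),
                  "SVR": SVR(kernel="rbf")}.items():
    pipe = make_pipeline(StandardScaler(), reg).fit(X_tr, y_tr)
    print(name, "test MSE:", mean_squared_error(y_te, pipe.predict(X_te)))
```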


Research ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Hang Guo ◽  
Ji Wan ◽  
Haobin Wang ◽  
Hanxiang Wu ◽  
Chen Xu ◽  
...  

Handwritten signatures are ubiquitous in our daily lives. The main challenge in recognizing handwriting signals lies in developing approaches that obtain the information effectively. External mechanical signals can be easily detected by triboelectric nanogenerators, which provide immediate opportunities for building new types of active sensors capable of recording handwritten signals. In this work, we report an intelligent human-machine interaction interface based on a triboelectric nanogenerator. Using a horizontal-vertical symmetrical electrode array, the handwritten triboelectric signal can be recorded without an external energy supply. Combined with supervised machine learning methods, the interface can successfully recognize handwritten English letters, Chinese characters, and Arabic numerals. A principal component analysis algorithm preprocesses the triboelectric signal data to reduce the complexity of the neural network in the machine learning process. Further, the interface can recognize writing habits for anticounterfeiting by controlling the samples input to the neural network. The results show that this intelligent human-machine interaction interface has broad application prospects in signature security and human-computer interaction.
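
A hedged sketch of the PCA-then-neural-network pipeline described above, using scikit-learn with random stand-in signals; the component count and network size are assumptions.

```python
# Sketch: PCA reduces each raw triboelectric trace before classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 1024))   # 600 handwriting traces, 1024 samples each (stand-in)
y = rng.integers(0, 26, 600)           # e.g. 26 English letters (stand-in labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe = make_pipeline(PCA(n_components=50),   # component count assumed
                     MLPClassifier(hidden_layer_sizes=(128,), max_iter=500))
pipe.fit(X_tr, y_tr)
print("test accuracy:", pipe.score(X_te, y_te))
```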


2021 ◽  
pp. 1-13
Author(s):  
Nikolaos Napoleon Vlassis ◽  
Waiching Sun

Abstract Conventionally, neural network constitutive laws for path-dependent elasto-plastic solids are trained via supervised learning performed on recurrent neural networks, with the time history of strain as input and the stress as output. However, training a neural network to replicate path-dependent constitutive responses requires significantly more data due to the path dependence. This demand for diverse and abundant accurate data, as well as the lack of interpretability to guide the data generation process, could become a major roadblock for engineering applications. In this work, we attempt to simplify these training processes and improve the interpretability of the trained models by breaking the training of material models into multiple supervised machine learning programs for elasticity, initial yielding, and hardening laws that can be conducted sequentially. To predict pressure sensitivity and rate dependence of the plastic responses, we reformulate the Hamilton-Jacobi equation such that the yield function is parametrized in the product space spanned by the principal stress, the accumulated plastic strain, and time. To test the versatility of the neural network meta-modeling framework, we conduct multiple numerical experiments where neural networks are trained and validated against (1) data generated from known benchmark models, (2) data obtained from physical experiments, and (3) data inferred from homogenizing sub-scale direct numerical simulations of microstructures. The neural network model is also incorporated into an offline FFT-FEM model to improve the efficiency of the multiscale calculations.
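
A loose, hedged sketch of the sequential decomposition, with scikit-learn regressors and synthetic toy data standing in for the elasticity, initial-yield, and hardening training sets; this is our reading of the workflow, not the authors' code.

```python
# Three small supervised regressors fit one after another, instead of one
# recurrent model over full strain histories. All data below are stand-ins.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# 1) Elasticity: strain -> stress in the elastic range (toy linear data).
eps_e = rng.uniform(-0.01, 0.01, (500, 1))
sig_e = 200e3 * eps_e.ravel() + rng.normal(0, 10, 500)
elastic_net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000).fit(eps_e, sig_e)

# 2) Initial yield: level-set value of the yield function over principal stress.
sig_p = rng.uniform(-500, 500, (500, 3))
f0 = np.linalg.norm(sig_p, axis=1) - 300.0          # toy yield surface
yield_net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000).fit(sig_p, f0)

# 3) Hardening: yield function parametrized by principal stress, accumulated
#    plastic strain, and time, echoing the reformulated Hamilton-Jacobi setting.
eps_p = rng.uniform(0, 0.1, (500, 1))
t = rng.uniform(0, 1, (500, 1))
f_hard = f0 - 1000.0 * eps_p.ravel() - 50.0 * t.ravel()   # toy hardening/rate law
harden_net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=3000).fit(
    np.hstack([sig_p, eps_p, t]), f_hard)
```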


2020 ◽  
Vol 635 ◽  
pp. A124
Author(s):  
A. A. Elyiv ◽  
O. V. Melnyk ◽  
I. B. Vavilova ◽  
D. V. Dobrycheva ◽  
V. E. Karachentseva

Context. Quickly growing computing facilities and an increasing number of extragalactic observations encourage the application of data-driven approaches to uncover hidden relations from astronomical data. In this work we raise the problem of distance reconstruction for a large number of galaxies from available extensive observations. Aims. We propose a new data-driven approach for computing distance moduli for local galaxies based on machine-learning regression as an alternative to physically oriented methods. We use key observable parameters for a large number of galaxies as input explanatory variables for training: magnitudes in U, B, I, and K bands, corresponding colour indices, surface brightness, angular size, radial velocity, and coordinates. Methods. We performed detailed tests of five machine-learning regression techniques for inference of m−M: linear, polynomial, k-nearest neighbours, gradient boosting, and artificial neural network regression. As a test set we selected 91 760 galaxies at z <  0.2 from the NASA/IPAC extragalactic database with distance moduli measured by different independent redshift methods. Results. We find that the most effective and precise is the neural network regression model with two hidden layers. The obtained root-mean-square error of 0.35 mag, which corresponds to a relative error of 16%, does not depend on the distance to the galaxy and is comparable with methods based on the Tully–Fisher and Fundamental Plane relations. The proposed model shows a 0.44 mag (20%) error when spectroscopic redshifts are unavailable and is complementary to existing photometric redshift methodologies. Our approach has great potential for obtaining distance moduli for around 250 000 galaxies at z <  0.2 for which the above-mentioned parameters are already observed.
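
A minimal sketch of the reported best configuration, a two-hidden-layer neural-network regression for m−M, in scikit-learn; the random features stand in for the real magnitudes, colours, surface brightness, and kinematics, and the layer sizes are assumptions.

```python
# Hedged sketch: two-hidden-layer MLP regression for distance moduli.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 12))   # U, B, I, K, colours, brightness, size, v_r, coords (stand-in)
y = 35.0 + (X @ rng.random(12)) * 0.1 + rng.normal(0, 0.3, 5000)   # toy distance moduli

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000))
model.fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"RMSE = {rmse:.2f} mag")
```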


2021 ◽  
Vol 10 (3) ◽  
Author(s):  
Shreya Nag ◽  
Nimitha Jammula

The diagnosis of a disease to determine a specific condition is crucial in caring for patients and furthering medical research. A timely and accurate diagnosis can have important implications for both patients and healthcare providers. An earlier diagnosis allows doctors to consider more methods of treatment, giving them greater flexibility in tailoring their decisions and ultimately improving the patient's health. Additionally, timely detection gives patients greater control over their health and their decisions, allowing them to plan ahead. As advancements in computer science and technology continue, these two factors can play a major role in aiding healthcare providers with medical issues. The emergence of artificial intelligence and machine learning can help address the challenge of making timely and accurate diagnoses. The goal of this research work is to design a system that utilizes machine learning and neural network techniques to diagnose chronic kidney disease with more than 90% accuracy based on a clinical data set, and to do a comparative study of the performance of the neural network versus supervised machine learning approaches. Based on the results, all the algorithms performed well in prediction of chronic kidney disease (CKD) with more than 90% accuracy. The neural network system provided the best performance (accuracy = 100%) in prediction of chronic kidney disease in comparison with the supervised Random Forest algorithm (accuracy = 99%) and the supervised Decision Tree algorithm (accuracy = 97%).
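
A hedged sketch of the three-way comparison with scikit-learn; the CSV path, column name, and preprocessing are hypothetical stand-ins for the clinical data set used in the study.

```python
# Minimal comparison sketch: decision tree vs. random forest vs. neural network.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("ckd.csv")                   # hypothetical cleaned clinical data set
X, y = df.drop(columns=["class"]), df["class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in {"Decision Tree": DecisionTreeClassifier(),
                  "Random Forest": RandomForestClassifier(),
                  "Neural Network": MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000)}.items():
    clf.fit(X_tr, y_tr)
    print(f"{name}: accuracy = {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```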

