Clinical trial registries as Scientometric data: A novel solution for linking and deduplicating clinical trials from multiple registries

2021 ◽  
Author(s):  
Christian Thiele ◽  
Gerrit Hirschfeld ◽  
Ruth von Brachel

Abstract. Registries of clinical trials are a potential source for scientometric analysis of medical research and serve important functions for the research community and the public at large. Clinical trials that recruit patients in Germany are usually registered in the German Clinical Trials Register (DRKS) or in international registries such as ClinicalTrials.gov. Furthermore, the International Clinical Trials Registry Platform (ICTRP) aggregates trials from multiple primary registries. We queried the DRKS, ClinicalTrials.gov, and the ICTRP for trials with a recruiting location in Germany. Trials that were registered in multiple registries were linked using the primary and secondary identifiers and a Random Forest model based on various similarity metrics. We identified 35,912 trials that were conducted in Germany. The majority of the trials were registered in multiple databases. A total of 32,106 trials were linked using primary IDs, 26 were linked using a Random Forest model, and 10,537 internal duplicates on ICTRP were identified using the Random Forest model after finding pairs with matching primary or secondary IDs. In cross-validation on a manually labelled data set, the Random Forest increased the F1-score from 96.4% to 97.1% compared to a linkage based solely on secondary IDs. Of all trials, 28% were registered in the German DRKS. Overall, 54% of the trials on ClinicalTrials.gov, 43% of the trials on the DRKS and 56% of the trials on the ICTRP were pre-registered. The ratio of pre-registered studies and the ratio of studies that are registered in the DRKS increased over time.
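The linkage idea in this abstract (match on shared identifiers first, then fall back to similarity-based matching, and score the result with F1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record fields `ids` and `title`, the function names, and the 0.9 cutoff are hypothetical, and a plain title-similarity cutoff stands in for the Random Forest classifier.

```python
from difflib import SequenceMatcher


def share_id(a, b):
    """True if two registry records have any primary or secondary ID in common."""
    return bool(set(a["ids"]) & set(b["ids"]))


def title_similarity(a, b):
    """Similarity in [0, 1] between two trial titles (one possible metric)."""
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()


def is_same_trial(a, b, threshold=0.9):
    """Link on shared IDs first; otherwise fall back to the similarity cutoff.

    The cutoff is an assumed stand-in for a trained classifier.
    """
    return share_id(a, b) or title_similarity(a, b) >= threshold


def f1_score(true_links, predicted_links):
    """F1 = 2 * precision * recall / (precision + recall) over sets of linked pairs."""
    tp = len(true_links & predicted_links)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_links)
    recall = tp / len(true_links)
    return 2 * precision * recall / (precision + recall)


# Made-up records, for illustration only.
drks = {"ids": ["DRKS-A", "NCT-B"], "title": "Exercise therapy for chronic back pain"}
ctgov = {"ids": ["NCT-B"], "title": "Exercise Therapy for Chronic Back Pain"}
other = {"ids": ["NCT-C"], "title": "Vitamin D supplementation in older adults"}
```

Calling `is_same_trial(drks, ctgov)` links the two records through the shared secondary ID, while `is_same_trial(drks, other)` does not link them because neither the IDs nor the titles match.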

Electronics ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 99 ◽  
Author(s):  
Krzysztof Gajowniczek ◽  
Iga Grzegorczyk ◽  
Tomasz Ząbkowski ◽  
Chandrajit Bajaj

Construction of an ensemble model is a process of combining many diverse base predictive learners. This raises the questions of how to weight each model and how to tune the parameters of the weighting process. The most straightforward approach is simply to average the base models. However, numerous studies have shown that a weighted ensemble can provide superior prediction results to a simple average of models. The main goals of this article are to propose a new weighting algorithm applicable to each tree in the Random Forest model and to comprehensively examine the optimal parameter tuning. Importantly, the approach is motivated by its flexibility, good performance, stability, and resistance to overfitting. The proposed scheme is examined and evaluated on the Physionet/Computing in Cardiology Challenge 2015 data set. It consists of signals (electrocardiograms and pulsatory waveforms) from intensive care patients which triggered an alarm for five cardiac arrhythmia types (Asystole, Bradycardia, Tachycardia, Ventricular Tachycardia, and Ventricular Flutter/Fibrillation). The classification problem is whether the alarm should or should not have been generated. The results showed that the proposed weighting approach improved classification accuracy for the three most challenging of the five investigated arrhythmias compared to the standard Random Forest model.
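The difference between simple averaging and a weighted ensemble can be illustrated with a small sketch. It does not reproduce the weighting algorithm proposed in the article: weighting each tree by its validation accuracy raised to a tunable `power` is an assumption made here for illustration, and all names and data are hypothetical.

```python
def majority_vote(tree_votes):
    """Unweighted ensemble: each tree's 0/1 vote counts equally."""
    return int(sum(tree_votes) * 2 > len(tree_votes))


def weighted_vote(tree_votes, weights):
    """Weighted ensemble: predict 1 if the weighted share of 1-votes exceeds one half."""
    total = sum(weights)
    score = sum(w * v for w, v in zip(weights, tree_votes))
    return int(score * 2 > total)


def accuracy_weights(per_tree_predictions, labels, power=1.0):
    """Assumed weighting rule: each tree's weight is its validation accuracy ** power."""
    weights = []
    for preds in per_tree_predictions:
        correct = sum(p == y for p, y in zip(preds, labels))
        weights.append((correct / len(labels)) ** power)
    return weights


# Three toy trees evaluated on four validation samples (made-up data).
labels = [1, 0, 1, 1]
per_tree = [
    [1, 0, 1, 1],  # accuracy 1.00
    [0, 1, 1, 0],  # accuracy 0.25
    [0, 0, 0, 0],  # accuracy 0.25
]
weights = accuracy_weights(per_tree, labels)

# On a new sample the accurate tree votes 1 and the two weak trees vote 0.
new_votes = [1, 0, 0]
```

With these toy values the unweighted vote follows the two weak trees, whereas the weighted vote follows the single accurate tree. Raising `power` would concentrate even more weight on the most accurate trees.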


2021 ◽  
Author(s):  
Runmei Ma ◽  
Jie Ban ◽  
Qing Wang ◽  
Yayi Zhang ◽  
Yang Yang ◽  
...  

Abstract. The health risks of fine particulate matter (PM2.5) and ambient ozone (O3) have been widely recognized in recent years. Accurate estimates of PM2.5 and O3 exposure are important for supporting health risk analysis and environmental policy-making. The aim of our study was to construct high-performance random forest models and to estimate the daily average PM2.5 concentration and the daily maximum 8 h average O3 concentration (O3-8hmax) in China for 2005–2017 at a spatial resolution of 1 km × 1 km. The model variables included meteorological variables, satellite data, chemical transport model output, geographic variables and socioeconomic variables. A random forest model based on ten-fold cross-validation was established, and spatial and temporal validations were performed to evaluate model performance. According to our sample-based division method, the daily, monthly and yearly simulations of PM2.5 gave average model fitting R² values of 0.85, 0.88 and 0.90, respectively; the corresponding R² values for O3-8hmax were 0.77, 0.77 and 0.69. The meteorological variables and their lagged values can significantly affect both the PM2.5 and O3-8hmax simulations. During 2005–2017, PM2.5 exhibited an overall downward trend, while ambient O3 experienced an upward trend. Whilst the spatial patterns of PM2.5 and O3-8hmax barely changed between 2005 and 2017, the temporal trends had spatial characteristics. The dataset is accessible to the public at https://doi.org/10.5281/zenodo.4009308 and through the shared data set of Chinese Environmental Public Health Tracking: CEPHT (https://cepht.niehs.cn:8282/developSDS3.html).
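For readers unfamiliar with the evaluation scheme named in this abstract, the following sketch shows how ten-fold cross-validation and the R² statistic fit together. It is a generic illustration under stated assumptions, not the study's pipeline: the fold assignment (every k-th sample), the function names, and the trivial mean-predicting stand-in model are all hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is held out exactly once."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]  # every k-th sample, starting at `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validated_r2(xs, ys, fit, k=10):
    """Pool the held-out predictions from all folds and score them with R-squared."""
    predictions = [0.0] * len(ys)
    for train, test in k_fold_indices(len(ys), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            predictions[i] = model(xs[i])
    return r_squared(ys, predictions)


def fit_mean(train_x, train_y):
    """Trivial stand-in model that always predicts the training mean."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y
```

Any function that takes training inputs and targets and returns a callable predictor can be passed as `fit`. The mean predictor is only a baseline: it has no explanatory power, so its cross-validated R² falls at or below zero.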


2018 ◽  
Author(s):  
JL Cabrera-Alarcon ◽  
J Garcia-Martinez

ABSTRACT Currently, several tools are available to predict the effect of variants, with the aim of classifying variants as neutral or pathogenic. In this study, we propose a new model trained over ensemble scores with two particularities: first, we consider the minor allele frequency from gnomAD, and second, we split variants based on their involvement in splicing to train each specific model. The Variants Stacked Random Forest Model (VSRFM) was constructed for variants not involved in splicing, and the Variants Stacked Random Forest Model for splicing (VSRFM-s) was trained for variants affected by splicing. Compared with the constituent scores used as features, our models showed the best outcomes. These results were confirmed using an independent data set from the ClinVar database.


2020 ◽  
Author(s):  
Jun-Feng Peng ◽  
Xing-Ji Chen ◽  
Xiao-Xin Li ◽  
Mi Zhou ◽  
Jun Xu ◽  
...  

Abstract Background: Random forest (RF) is a powerful ensemble algorithm for medical decision-making support (MDS). However, the requirements of higher accuracy and smaller ensemble size remain significant burdens for the current RF, particularly for the risk identification of disease deterioration. To achieve higher accuracy and a smaller ensemble size for the risk identification of disease deterioration, a diversity enhancement random forest (DERF) model is proposed. Methods: We explored the idea of integrating trees that are accurate and diverse to build the DERF model. First, we calculated the accuracy on the out-of-bag data to select the best K trees. Then, we assessed the diversity of these trees using logarithmic loss functions on the validation data set. Further, we utilized a greedy stepwise backward search to increase the diversity of the random forest. Finally, public benchmark data sets on disease deterioration from KEEL and real data sets from tertiary hospitals in the last three years were used to assess the performance of the proposed DERF model and to compare it with the existing models. Results: Experiments show that the proposed model can improve the prediction performance and reduce the ensemble size of the random forest model. Compared with the existing models (random forest, the extreme random tree, and the ensemble of optimal trees), our proposed DERF model obtains a higher predictive accuracy and a smaller ensemble size. Conclusion: The proposed DERF can reduce the size of the ensemble and achieve good classification results in the risk identification of disease deterioration.
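The two-stage selection described in the Methods (keep the K trees with the best out-of-bag accuracy, then prune by greedy backward search) might look roughly like the sketch below. It is not the DERF algorithm itself: using the validation log loss of the averaged ensemble as the pruning criterion is an assumption made here, and the tree records (`name`, `oob_accuracy`, `val_probs`) and all values are made up.

```python
import math


def log_loss(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def ensemble_probs(trees):
    """Average the per-sample probabilities of all trees in the ensemble."""
    n = len(trees[0]["val_probs"])
    return [sum(t["val_probs"][i] for t in trees) / len(trees) for i in range(n)]


def select_top_k(trees, k):
    """Step 1: keep the K trees with the highest out-of-bag accuracy."""
    return sorted(trees, key=lambda t: t["oob_accuracy"], reverse=True)[:k]


def backward_prune(trees, labels):
    """Step 2: greedily drop one tree at a time while validation log loss improves."""
    current = list(trees)
    best = log_loss(labels, ensemble_probs(current))
    while len(current) > 1:
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        losses = [log_loss(labels, ensemble_probs(c)) for c in candidates]
        i_best = min(range(len(losses)), key=losses.__getitem__)
        if losses[i_best] >= best:
            break  # no single removal helps any more
        current, best = candidates[i_best], losses[i_best]
    return current


# Made-up trees: out-of-bag accuracy plus P(class 1) on three validation samples.
val_labels = [1, 0, 1]
forest = [
    {"name": "t1", "oob_accuracy": 0.90, "val_probs": [0.95, 0.1, 0.55]},
    {"name": "t2", "oob_accuracy": 0.85, "val_probs": [0.55, 0.1, 0.95]},
    {"name": "t3", "oob_accuracy": 0.80, "val_probs": [0.4, 0.6, 0.5]},
    {"name": "t4", "oob_accuracy": 0.60, "val_probs": [0.5, 0.5, 0.5]},
]
pruned = backward_prune(select_top_k(forest, k=3), val_labels)
```

In this toy forest, `t1` and `t2` are each confident on a different validation sample, so their average has a lower log loss than either tree alone. The search therefore removes `t3` and then stops, leaving a smaller ensemble of two complementary trees.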


2021 ◽  
Vol 10 (8) ◽  
pp. 503
Author(s):  
Hang Liu ◽  
Riken Homma ◽  
Qiang Liu ◽  
Congying Fang

The simulation of future land use can provide decision support for urban planners and decision makers, which is important for sustainable urban development. Using a cellular automata-random forest model, we considered two scenarios to predict intra-land use changes in Kumamoto City from 2018 to 2030: an unconstrained development scenario, and a planning-constrained development scenario that considers disaster-related factors. The random forest was used to calculate the transition probabilities and the importance of driving factors, and cellular automata were used for future land use prediction. The results show that disaster-related factors greatly influence land vacancy, while urban planning factors are more important for medium high-rise residential, commercial, and public facilities. Under the unconstrained development scenario, urban land use tends towards spatially disordered growth while the total area grows steadily, with the largest increase in low-rise residential areas. Under the planning-constrained development scenario that considers disaster-related factors, the urban land area will continue to grow, albeit slowly and with a compact growth trend. This study provides planners with information on the relevant trends in different scenarios of land use change in Kumamoto City. Furthermore, it provides a reference for Kumamoto City’s future post-disaster recovery and reconstruction planning.
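The division of labour described in this abstract, where a classifier supplies transition probabilities and a cellular automaton applies them on a grid, can be sketched in a few lines. The sketch is illustrative only and is not the study's model: the `suitability` grid stands in for probabilities that a trained random forest would provide, and the product rule, the 0.25 threshold, and the land-use classes are assumptions.

```python
def neighbour_share(grid, row, col, target):
    """Fraction of the (up to 8) surrounding cells that already have class `target`."""
    hits = cells = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                cells += 1
                hits += grid[r][c] == target
    return hits / cells


def step(grid, suitability, target, threshold=0.25):
    """One synchronous CA update.

    A cell converts to `target` when suitability * neighbour share exceeds
    `threshold`. Both the product rule and the threshold are assumptions.
    """
    new_grid = [row[:] for row in grid]
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            if grid[r][c] != target:
                score = suitability[r][c] * neighbour_share(grid, r, c, target)
                if score > threshold:
                    new_grid[r][c] = target
    return new_grid


# Toy 3x3 map: "R" = residential, "V" = vacant (made-up classes and values).
land = [
    ["R", "R", "V"],
    ["R", "V", "V"],
    ["V", "V", "V"],
]
suit = [
    [0.9, 0.9, 0.2],
    [0.9, 0.8, 0.2],
    [0.2, 0.2, 0.2],
]
next_land = step(land, suit, target="R")
```

Only the centre cell converts in this example, because it combines high suitability with three residential neighbours. A constrained scenario could be approximated by setting the suitability of restricted cells to zero before calling `step`.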


2021 ◽  
pp. 100017
Author(s):  
Xinyu Dou ◽  
Cuijuan Liao ◽  
Hengqi Wang ◽  
Ying Huang ◽  
Ying Tu ◽  
...  
