ClinEpiDB: an open-access clinical epidemiology database resource encouraging online exploration of complex studies

2019 ◽  
Vol 3 ◽  
pp. 1661 ◽  
Author(s):  
Emmanuel Ruhamyankaka ◽  
Brian P. Brunk ◽  
Grant Dorsey ◽  
Omar S. Harb ◽  
Danica A. Helb ◽  
...  

The concept of open data has been gaining traction as a mechanism to increase data use, ensure that data are preserved over time, and accelerate discovery. While epidemiology data sets are increasingly deposited in databases and repositories, barriers to access still remain. ClinEpiDB was constructed as an open-access online resource for clinical and epidemiologic studies by leveraging the extensive web toolkit and infrastructure of the Eukaryotic Pathogen Database Resources (EuPathDB; a collection of databases covering 170+ eukaryotic pathogens, relevant related species, and select hosts) combined with a unified semantic web framework. Here we present an intuitive point-and-click website that allows users to visualize and subset data directly in the ClinEpiDB browser and immediately explore potential associations. Supporting study documentation aids contextualization, and data can be downloaded for advanced analyses. By facilitating access and interrogation of high-quality, large-scale data sets, ClinEpiDB aims to spur collaboration and discovery that improves global health.
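As a rough illustration of the "download for advanced analyses" workflow mentioned above, the sketch below loads a tab-delimited export with pandas and tests a simple association. The file name and column names are hypothetical placeholders, not actual ClinEpiDB fields, and the analysis is only a minimal example.

```python
# Minimal sketch of exploring a data set downloaded from ClinEpiDB offline.
# File name and column names are hypothetical placeholders; real exports are
# tab-delimited text whose columns depend on the study.
import pandas as pd
from scipy.stats import chi2_contingency

observations = pd.read_csv("clinepidb_participant_observations.txt", sep="\t")

# Cross-tabulate two categorical variables and test for an association.
table = pd.crosstab(observations["bednet_use"], observations["malaria_diagnosis"])
chi2, p_value, dof, _ = chi2_contingency(table)
print(table)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```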


2021 ◽  
Author(s):  
Florian Betz ◽  
Magdalena Lauermann ◽  
Bernd Cyffka

In fluvial geomorphology as well as in freshwater ecology, rivers are commonly seen as nested hierarchical systems functioning over a range of spatial and temporal scales. A comprehensive assessment therefore requires information at various scales. Over the past decade, remote sensing-based approaches have become increasingly popular in river science for extending the spatial scale of analysis. However, data-scarce areas have been largely ignored so far, despite the fact that most remaining free-flowing, and thus ecologically valuable, rivers worldwide are located in regions lacking data sources such as LiDAR or even aerial imagery. High-resolution satellite data could fill this gap but tend to be too costly for large-scale applications, which limits comprehensive studies of river systems in such remote areas. This, in turn, constrains the management and conservation of these rivers.

In this contribution, we suggest an approach for river corridor mapping based solely on open-access data in order to foster large-scale geomorphological mapping of river corridors in data-scarce areas. To this end, we combine advanced terrain analysis with multispectral remote sensing, using the SRTM-1 DEM along with Landsat OLI imagery. We take the Naryn River in Kyrgyzstan as an example to demonstrate the potential of these open-access data sets for deriving a comprehensive set of parameters characterizing this river corridor. The methods are adapted to the specific characteristics of medium-resolution open-access data sets and include an innovative fuzzy-logic-based approach for riparian zone delineation, longitudinal profile smoothing based on constrained quantile regression, and delineation of the active channel width as needed for specific stream power computation. In addition, an indicator of river dynamics based on Landsat time series is developed. A rigorous validation is performed for each derived river corridor parameter. The results demonstrate that our open-access approach to geomorphological mapping of river corridors can provide results sufficiently accurate to derive reach-averaged information. It is thus well suited for large-scale river characterization in data-scarce regions where the river corridors would otherwise remain largely unexplored from an up-to-date riverscape perspective. Such a characterization can serve as an entry point for further, more detailed research in selected study reaches and can deliver the comprehensive background information required for a range of topics in river science.
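To make the fuzzy-logic idea concrete, the following minimal sketch combines two open-access indicators into a riparian-zone membership map. The thresholds, the toy NDVI and height-above-river inputs, and the membership function are illustrative assumptions, not the calibrated parameters or the exact procedure of the study.

```python
# Minimal sketch of a fuzzy-logic combination of two raster indicators into a
# riparian-zone membership map. Thresholds and inputs are illustrative only.
import numpy as np

def linear_membership(x, low, high):
    """Map values to [0, 1]: 0 at/below `low`, 1 at/above `high`, linear between."""
    return np.clip((x - low) / (high - low), 0.0, 1.0)

# Hypothetical per-pixel inputs: Landsat-derived NDVI and SRTM-derived
# height above the river channel (metres).
ndvi = np.array([[0.15, 0.45], [0.60, 0.05]])
height_above_river = np.array([[2.0, 8.0], [1.0, 25.0]])

vegetation_score = linear_membership(ndvi, low=0.2, high=0.5)
lowland_score = 1.0 - linear_membership(height_above_river, low=3.0, high=15.0)

# Combine with the fuzzy AND (minimum) operator; higher values indicate
# pixels more likely to belong to the riparian corridor.
riparian_membership = np.minimum(vegetation_score, lowland_score)
print(riparian_membership)
```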


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in data mining and has long been studied by researchers in numerous fields. However, the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only produces efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It consists of two phases: initialization by the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a "blind" feature, in that k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm thus combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that C-K-means outperforms existing algorithms in both accuracy and efficiency under both sequential and parallel conditions.
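A simplified sketch of the two-phase structure is given below: a covering-style pass infers k and the initial centers, and scikit-learn's KMeans then runs the Lloyd iterations from those centers. The fixed-radius covering rule is an illustrative stand-in for the paper's covering algorithm, not a reimplementation of it.

```python
# Sketch of the two-phase idea: a covering-style pass picks the number of
# clusters k and the initial centres, then standard Lloyd iterations refine
# them. The fixed-radius rule below is only a stand-in for the paper's CA.
import numpy as np
from sklearn.cluster import KMeans

def covering_init(X, radius, rng):
    """Greedily place centres so every point lies within `radius` of a centre."""
    uncovered = np.ones(len(X), dtype=bool)
    centres = []
    while uncovered.any():
        centre = X[rng.choice(np.flatnonzero(uncovered))]
        centres.append(centre)
        uncovered &= np.linalg.norm(X - centre, axis=1) > radius
    return np.array(centres)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2))
               for loc in ([0, 0], [4, 4], [0, 4])])

init_centres = covering_init(X, radius=1.5, rng=rng)   # phase 1: k is inferred
kmeans = KMeans(n_clusters=len(init_centres),          # phase 2: Lloyd iterations
                init=init_centres, n_init=1).fit(X)
print("inferred k:", len(init_centres), "inertia:", round(kmeans.inertia_, 2))
```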


2017 ◽  
Author(s):  
Christopher R Madan

Until recently, neuroimaging data for a research study needed to be collected within one's own lab. However, studying inter-individual differences in brain structure requires a large sample of participants. Given the financial costs involved in collecting neuroimaging data from hundreds or thousands of participants, large-scale studies of brain morphology could previously only be conducted by well-funded laboratories with access to MRI facilities and to large samples of participants. With the advent of broad open-access data-sharing initiatives, this has recently changed: the primary goal of such studies is to collect large datasets to be shared, rather than sharing the data as an afterthought. This paradigm shift is evident in the increased pace of discovery, leading to rapid advances in our characterization of brain structure. Open-access brain morphology data have numerous uses, ranging from revealing novel patterns of age-related differences in subcortical structures to the development of more robust cortical parcellation atlases, with these advances being translatable to improved methods for characterizing clinical disorders (see Figure 1 for an illustration). Moreover, structural MRI is generally more robust than functional MRI with respect to potential artifacts and is not task-dependent, resulting in large potential yields. While the benefits of open-access data have been discussed more broadly within the field of cognitive neuroscience elsewhere (Gilmore et al., 2017; Poldrack and Gorgolewski, 2014; Van Horn and Gazzaniga, 2013; Voytek, 2016), as well as in other fields (Ascoli et al., 2017; Choudhury et al., 2014; Davies et al., 2017), the current paper focuses specifically on the implications of open data for brain morphology research.


Author(s):  
Jun Huang ◽  
Linchuan Xu ◽  
Jing Wang ◽  
Lei Feng ◽  
Kenji Yamanishi

Existing multi-label learning (MLL) approaches mainly assume that all labels are observed, and they construct classification models with a fixed set of target labels (known labels). However, in some real applications, multiple latent labels may exist outside this set and hide in the data, especially in large-scale data sets. Discovering and exploring the latent labels hidden in the data may not only reveal interesting knowledge but also help build a more robust learning model. In this paper, a novel approach named DLCL (Discovering Latent Class Labels for MLL) is proposed that can not only discover latent labels in the training data but also label new instances with both the latent and the known labels simultaneously. Extensive experiments show a competitive performance of DLCL against other state-of-the-art MLL approaches.
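For readers unfamiliar with the setting, the toy sketch below shows a conventional fixed-label MLL baseline in scikit-learn and how a label hidden from the training targets simply cannot be predicted. It illustrates the problem DLCL addresses; it is not the DLCL algorithm itself.

```python
# Toy illustration of the conventional MLL setting that DLCL generalises:
# a classifier is trained against a fixed set of known labels, so any latent
# label absent from Y cannot be predicted. Plain scikit-learn baseline only.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=5, random_state=0)

# Pretend the last label is "latent": present in the data but hidden from
# the training targets, as in the motivation above.
Y_known = Y[:, :4]

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y_known, random_state=0)
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
print("predictions cover only the 4 known labels:", model.predict(X_te).shape)
```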


2021 ◽  
Vol 27 (7) ◽  
pp. 667-692
Author(s):  
Lamia Berkani ◽  
Lylia Betit ◽  
Louiza Belarif

Clustering-based approaches have been demonstrated to be efficient and scalable to large-scale data sets. However, clustering-based recommender systems suffer from relatively low accuracy and coverage. To address these issues, we propose in this article an optimized multiview clustering approach for the recommendation of items in social networks. First, the selection of the initial medoids is optimized using the Bees Swarm Optimization algorithm (BSO) in order to generate better partitions (i.e., refining the quality of the medoids according to the objective function). Then, multiview clustering (MV) is applied, where users are iteratively clustered from the views of both rating patterns and social information (i.e., friendships and trust). Finally, a framework is proposed for testing the different alternatives, namely: (1) the standard recommendation algorithms; (2) the clustering-based and the optimized clustering-based recommendation algorithms using BSO; and (3) the MV and the optimized MV (BSO-MV) algorithms. Experiments conducted on two real-world datasets demonstrate the effectiveness of the proposed BSO-MV algorithm in terms of improving accuracy, as it outperforms the existing related approaches and baselines.
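The sketch below illustrates only the underlying clustering-based recommendation idea: users are clustered by their rating patterns, and items are recommended from within a user's cluster. It omits the BSO optimization and the social view, which are the article's actual contributions, and the small ratings matrix is a toy assumption.

```python
# Minimal clustering-based recommender baseline: cluster users by ratings,
# then recommend items rated highly within the user's cluster.
import numpy as np
from sklearn.cluster import KMeans

# Rows = users, columns = items; 0 means "not yet rated".
ratings = np.array([[5, 4, 0, 1, 0],
                    [4, 5, 1, 0, 0],
                    [0, 1, 5, 4, 0],
                    [1, 0, 4, 5, 1],
                    [0, 0, 5, 4, 5]], dtype=float)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(ratings)

def recommend(user, top_n=2):
    peers = ratings[clusters == clusters[user]]
    item_scores = peers.mean(axis=0)
    item_scores[ratings[user] > 0] = -np.inf      # skip items already rated
    return np.argsort(item_scores)[::-1][:top_n]

print("items recommended to user 0:", recommend(0))
```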


2017 ◽  
Vol 44 (2) ◽  
pp. 203-229 ◽  
Author(s):  
Javier D Fernández ◽  
Miguel A Martínez-Prieto ◽  
Pablo de la Fuente Redondo ◽  
Claudio Gutiérrez

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
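As a small illustration of metric-driven characterization of RDF data, the sketch below uses rdflib to compute generic structural statistics (predicate usage and subject out-degree) over a toy Turtle snippet. These statistics are stand-ins for, not the specific metrics proposed in, the paper.

```python
# Minimal sketch of computing simple structural statistics over an RDF data
# set with rdflib; the statistics shown are generic illustrations only.
from collections import Counter
from rdflib import Graph

turtle_snippet = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob ; ex:name "Alice" .
ex:bob   ex:knows ex:carol ; ex:name "Bob" .
"""

g = Graph()
g.parse(data=turtle_snippet, format="turtle")

predicate_counts = Counter(p for _, p, _ in g)   # how often each predicate is used
out_degree = Counter(s for s, _, _ in g)         # triples per distinct subject

print("triples:", len(g))
print("distinct predicates:", len(predicate_counts))
print("mean subject out-degree:", sum(out_degree.values()) / len(out_degree))
```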


2014 ◽  
Vol 571-572 ◽  
pp. 497-501 ◽  
Author(s):  
Qi Lv ◽  
Wei Xie

Real-time log analysis on large-scale data is important for many applications; here, real-time refers to a UI latency within 100 ms. Techniques that efficiently support real-time analysis over large log data sets are therefore desired. MongoDB provides good query performance, an aggregation framework, and a distributed architecture, which makes it suitable for real-time queries and massive log analysis. In this paper, a novel implementation approach for an event-driven file log analyzer is presented, and the performance of query, scan, and aggregation operations over MongoDB, HBase, and MySQL is compared. Our experimental results show that HBase delivers the most balanced performance across all operations, while MongoDB answers some queries in under 10 ms, making it the most suitable choice for real-time applications.
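As an illustration of the kind of workload compared here, the sketch below runs a time-windowed aggregation over a log collection using MongoDB's aggregation framework via pymongo. The connection string, database, collection, and field names are hypothetical placeholders rather than the paper's setup.

```python
# Minimal sketch of a real-time-style log query with MongoDB's aggregation
# framework. All names below are hypothetical placeholders.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
logs = client["logdb"]["events"]

# An index on the timestamp keeps range queries fast enough for interactive UIs.
logs.create_index([("timestamp", ASCENDING)])

# Aggregation: count events per severity level within a time window.
pipeline = [
    {"$match": {"timestamp": {"$gte": "2014-01-01T00:00:00Z"}}},
    {"$group": {"_id": "$level", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in logs.aggregate(pipeline):
    print(row["_id"], row["count"])
```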

