ClinEpiDB: an open-access clinical epidemiology database resource encouraging online exploration of complex studies

2019 ◽  
Vol 3 ◽  
pp. 1661 ◽  
Author(s):  
Emmanuel Ruhamyankaka ◽  
Brian P. Brunk ◽  
Grant Dorsey ◽  
Omar S. Harb ◽  
Danica A. Helb ◽  
...  

The concept of open data has been gaining traction as a mechanism to increase data use, ensure that data are preserved over time, and accelerate discovery. While epidemiology data sets are increasingly deposited in databases and repositories, barriers to access still remain. ClinEpiDB was constructed as an open-access online resource for clinical and epidemiologic studies by leveraging the extensive web toolkit and infrastructure of the Eukaryotic Pathogen Database Resources (EuPathDB; a collection of databases covering 170+ eukaryotic pathogens, relevant related species, and select hosts) combined with a unified semantic web framework. Here we present an intuitive point-and-click website that allows users to visualize and subset data directly in the ClinEpiDB browser and immediately explore potential associations. Supporting study documentation aids contextualization, and data can be downloaded for advanced analyses. By facilitating access and interrogation of high-quality, large-scale data sets, ClinEpiDB aims to spur collaboration and discovery that improves global health.
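As a rough illustration of the "download for advanced analyses" workflow mentioned above, the sketch below loads a tab-delimited export with pandas and tests a simple association. The file name and column names are hypothetical placeholders, not actual ClinEpiDB fields, and the analysis is only a minimal example.

```python
# Minimal sketch of exploring a data set downloaded from ClinEpiDB offline.
# File name and column names are hypothetical placeholders; real exports are
# tab-delimited text whose columns depend on the study.
import pandas as pd
from scipy.stats import chi2_contingency

observations = pd.read_csv("clinepidb_participant_observations.txt", sep="\t")

# Cross-tabulate two categorical variables and test for an association.
table = pd.crosstab(observations["bednet_use"], observations["malaria_diagnosis"])
chi2, p_value, dof, _ = chi2_contingency(table)
print(table)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```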


2021 ◽  
Author(s):  
Florian Betz ◽  
Magdalena Lauermann ◽  
Bernd Cyffka

In fluvial geomorphology as well as in freshwater ecology, rivers are commonly seen as nested hierarchical systems functioning over a range of spatial and temporal scales. A comprehensive assessment therefore requires information at various scales. Over the past decade, remote sensing-based approaches have become increasingly popular in river science for extending the spatial scale of analysis. However, data-scarce areas have been largely ignored so far, despite the fact that most remaining free-flowing, and thus ecologically valuable, rivers worldwide are located in regions lacking data sources such as LiDAR or even aerial imagery. High-resolution satellite data could fill this gap but tend to be too costly for large-scale applications, which limits comprehensive studies of river systems in such remote areas. This, in turn, constrains the management and conservation of these rivers.

In this contribution, we suggest an approach for river corridor mapping based solely on open-access data in order to foster large-scale geomorphological mapping of river corridors in data-scarce areas. To this end, we combine advanced terrain analysis with multispectral remote sensing, using the SRTM-1 DEM along with Landsat OLI imagery. We take the Naryn River in Kyrgyzstan as an example to demonstrate the potential of these open-access data sets for deriving a comprehensive set of parameters characterizing this river corridor. The methods are adapted to the specific characteristics of medium-resolution open-access data sets and include an innovative fuzzy-logic-based approach for riparian zone delineation, longitudinal profile smoothing based on constrained quantile regression, and delineation of the active channel width as needed for specific stream power computation. In addition, an indicator of river dynamics based on Landsat time series is developed. A rigorous validation is performed for each derived river corridor parameter. The results demonstrate that our open-access approach to geomorphological mapping of river corridors can provide results sufficiently accurate to derive reach-averaged information. It is thus well suited for large-scale river characterization in data-scarce regions where the river corridors would otherwise remain largely unexplored from an up-to-date riverscape perspective. Such a characterization can serve as an entry point for further, more detailed research in selected study reaches and can deliver the comprehensive background information required for a range of topics in river science.
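To make the fuzzy-logic idea concrete, the following minimal sketch combines two open-access indicators into a riparian-zone membership map. The thresholds, the toy NDVI and height-above-river inputs, and the membership function are illustrative assumptions, not the calibrated parameters or the exact procedure of the study.

```python
# Minimal sketch of a fuzzy-logic combination of two raster indicators into a
# riparian-zone membership map. Thresholds and inputs are illustrative only.
import numpy as np

def linear_membership(x, low, high):
    """Map values to [0, 1]: 0 at/below `low`, 1 at/above `high`, linear between."""
    return np.clip((x - low) / (high - low), 0.0, 1.0)

# Hypothetical per-pixel inputs: Landsat-derived NDVI and SRTM-derived
# height above the river channel (metres).
ndvi = np.array([[0.15, 0.45], [0.60, 0.05]])
height_above_river = np.array([[2.0, 8.0], [1.0, 25.0]])

vegetation_score = linear_membership(ndvi, low=0.2, high=0.5)
lowland_score = 1.0 - linear_membership(height_above_river, low=3.0, high=15.0)

# Combine with the fuzzy AND (minimum) operator; higher values indicate
# pixels more likely to belong to the riparian corridor.
riparian_membership = np.minimum(vegetation_score, lowland_score)
print(riparian_membership)
```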


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in data mining and has long been studied by researchers in numerous fields. However, the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only produces efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It consists of two phases: initialization by the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a "blind" feature, in that k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm thus combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that C-K-means outperforms existing algorithms in both accuracy and efficiency under both sequential and parallel conditions.
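A simplified sketch of the two-phase structure is given below: a covering-style pass infers k and the initial centers, and scikit-learn's KMeans then runs the Lloyd iterations from those centers. The fixed-radius covering rule is an illustrative stand-in for the paper's covering algorithm, not a reimplementation of it.

```python
# Sketch of the two-phase idea: a covering-style pass picks the number of
# clusters k and the initial centres, then standard Lloyd iterations refine
# them. The fixed-radius rule below is only a stand-in for the paper's CA.
import numpy as np
from sklearn.cluster import KMeans

def covering_init(X, radius, rng):
    """Greedily place centres so every point lies within `radius` of a centre."""
    uncovered = np.ones(len(X), dtype=bool)
    centres = []
    while uncovered.any():
        centre = X[rng.choice(np.flatnonzero(uncovered))]
        centres.append(centre)
        uncovered &= np.linalg.norm(X - centre, axis=1) > radius
    return np.array(centres)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2))
               for loc in ([0, 0], [4, 4], [0, 4])])

init_centres = covering_init(X, radius=1.5, rng=rng)   # phase 1: k is inferred
kmeans = KMeans(n_clusters=len(init_centres),          # phase 2: Lloyd iterations
                init=init_centres, n_init=1).fit(X)
print("inferred k:", len(init_centres), "inertia:", round(kmeans.inertia_, 2))
```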


2017 ◽  
Author(s):  
Christopher R Madan

Until recently, neuroimaging data for a research study needed to be collected within one's own lab. However, studying inter-individual differences in brain structure requires a large sample of participants. Given the financial costs involved in collecting neuroimaging data from hundreds or thousands of participants, large-scale studies of brain morphology could previously only be conducted by well-funded laboratories with access to MRI facilities and to large samples of participants. With the advent of broad open-access data-sharing initiatives, this has recently changed: the primary goal of such studies is to collect large datasets to be shared, rather than sharing the data as an afterthought. This paradigm shift is evident in the increased pace of discovery, leading to rapid advances in our characterization of brain structure. Open-access brain morphology data have numerous uses, ranging from revealing novel patterns of age-related differences in subcortical structures to the development of more robust cortical parcellation atlases, with these advances being translatable to improved methods for characterizing clinical disorders (see Figure 1 for an illustration). Moreover, structural MRI is generally more robust than functional MRI with respect to potential artifacts and is not task-dependent, resulting in large potential yields. While the benefits of open-access data have been discussed more broadly within the field of cognitive neuroscience elsewhere (Gilmore et al., 2017; Poldrack and Gorgolewski, 2014; Van Horn and Gazzaniga, 2013; Voytek, 2016), as well as in other fields (Ascoli et al., 2017; Choudhury et al., 2014; Davies et al., 2017), the current paper focuses specifically on the implications of open data for brain morphology research.


Author(s):  
Jun Huang ◽  
Linchuan Xu ◽  
Jing Wang ◽  
Lei Feng ◽  
Kenji Yamanishi

Existing multi-label learning (MLL) approaches mainly assume that all labels are observed, and they construct classification models with a fixed set of target labels (known labels). However, in some real applications, multiple latent labels may exist outside this set and hide in the data, especially in large-scale data sets. Discovering and exploring the latent labels hidden in the data may not only reveal interesting knowledge but also help build a more robust learning model. In this paper, a novel approach named DLCL (Discovering Latent Class Labels for MLL) is proposed that can not only discover latent labels in the training data but also label new instances with both the latent and the known labels simultaneously. Extensive experiments show a competitive performance of DLCL against other state-of-the-art MLL approaches.
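For readers unfamiliar with the setting, the toy sketch below shows a conventional fixed-label MLL baseline in scikit-learn and how a label hidden from the training targets simply cannot be predicted. It illustrates the problem DLCL addresses; it is not the DLCL algorithm itself.

```python
# Toy illustration of the conventional MLL setting that DLCL generalises:
# a classifier is trained against a fixed set of known labels, so any latent
# label absent from Y cannot be predicted. Plain scikit-learn baseline only.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=5, random_state=0)

# Pretend the last label is "latent": present in the data but hidden from
# the training targets, as in the motivation above.
Y_known = Y[:, :4]

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y_known, random_state=0)
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
print("predictions cover only the 4 known labels:", model.predict(X_te).shape)
```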


2021 ◽  
Vol 27 (7) ◽  
pp. 667-692
Author(s):  
Lamia Berkani ◽  
Lylia Betit ◽  
Louiza Belarif

Clustering-based approaches have been demonstrated to be efficient and scalable to large-scale data sets. However, clustering-based recommender systems suffer from relatively low accuracy and coverage. To address these issues, we propose in this article an optimized multiview clustering approach for the recommendation of items in social networks. First, the selection of the initial medoids is optimized using the Bees Swarm Optimization algorithm (BSO) in order to generate better partitions (i.e., refining the quality of the medoids according to the objective function). Then, multiview clustering (MV) is applied, where users are iteratively clustered from the views of both rating patterns and social information (i.e., friendships and trust). Finally, a framework is proposed for testing the different alternatives, namely: (1) the standard recommendation algorithms; (2) the clustering-based and the optimized clustering-based recommendation algorithms using BSO; and (3) the MV and the optimized MV (BSO-MV) algorithms. Experiments conducted on two real-world datasets demonstrate the effectiveness of the proposed BSO-MV algorithm in terms of improving accuracy, as it outperforms the existing related approaches and baselines.
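The sketch below illustrates only the underlying clustering-based recommendation idea: users are clustered by their rating patterns, and items are recommended from within a user's cluster. It omits the BSO optimization and the social view, which are the article's actual contributions, and the small ratings matrix is a toy assumption.

```python
# Minimal clustering-based recommender baseline: cluster users by ratings,
# then recommend items rated highly within the user's cluster.
import numpy as np
from sklearn.cluster import KMeans

# Rows = users, columns = items; 0 means "not yet rated".
ratings = np.array([[5, 4, 0, 1, 0],
                    [4, 5, 1, 0, 0],
                    [0, 1, 5, 4, 0],
                    [1, 0, 4, 5, 1],
                    [0, 0, 5, 4, 5]], dtype=float)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(ratings)

def recommend(user, top_n=2):
    peers = ratings[clusters == clusters[user]]
    item_scores = peers.mean(axis=0)
    item_scores[ratings[user] > 0] = -np.inf      # skip items already rated
    return np.argsort(item_scores)[::-1][:top_n]

print("items recommended to user 0:", recommend(0))
```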


2017 ◽  
Vol 44 (2) ◽  
pp. 203-229 ◽  
Author(s):  
Javier D Fernández ◽  
Miguel A Martínez-Prieto ◽  
Pablo de la Fuente Redondo ◽  
Claudio Gutiérrez

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
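As a small illustration of metric-driven characterization of RDF data, the sketch below uses rdflib to compute generic structural statistics (predicate usage and subject out-degree) over a toy Turtle snippet. These statistics are stand-ins for, not the specific metrics proposed in, the paper.

```python
# Minimal sketch of computing simple structural statistics over an RDF data
# set with rdflib; the statistics shown are generic illustrations only.
from collections import Counter
from rdflib import Graph

turtle_snippet = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob ; ex:name "Alice" .
ex:bob   ex:knows ex:carol ; ex:name "Bob" .
"""

g = Graph()
g.parse(data=turtle_snippet, format="turtle")

predicate_counts = Counter(p for _, p, _ in g)   # how often each predicate is used
out_degree = Counter(s for s, _, _ in g)         # triples per distinct subject

print("triples:", len(g))
print("distinct predicates:", len(predicate_counts))
print("mean subject out-degree:", sum(out_degree.values()) / len(out_degree))
```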


2014 ◽  
Vol 571-572 ◽  
pp. 497-501 ◽  
Author(s):  
Qi Lv ◽  
Wei Xie

Real-time log analysis on large-scale data is important for many applications; here, real-time refers to a UI latency within 100 ms. Techniques that efficiently support real-time analysis over large log data sets are therefore desired. MongoDB provides good query performance, an aggregation framework, and a distributed architecture, which makes it suitable for real-time queries and massive log analysis. In this paper, a novel implementation approach for an event-driven file log analyzer is presented, and the performance of query, scan, and aggregation operations over MongoDB, HBase, and MySQL is compared. Our experimental results show that HBase delivers the most balanced performance across all operations, while MongoDB answers some queries in under 10 ms, making it the most suitable choice for real-time applications.
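As an illustration of the kind of workload compared here, the sketch below runs a time-windowed aggregation over a log collection using MongoDB's aggregation framework via pymongo. The connection string, database, collection, and field names are hypothetical placeholders rather than the paper's setup.

```python
# Minimal sketch of a real-time-style log query with MongoDB's aggregation
# framework. All names below are hypothetical placeholders.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
logs = client["logdb"]["events"]

# An index on the timestamp keeps range queries fast enough for interactive UIs.
logs.create_index([("timestamp", ASCENDING)])

# Aggregation: count events per severity level within a time window.
pipeline = [
    {"$match": {"timestamp": {"$gte": "2014-01-01T00:00:00Z"}}},
    {"$group": {"_id": "$level", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in logs.aggregate(pipeline):
    print(row["_id"], row["count"])
```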

