Data discovery in large scale heterogeneous and autonomous databases

Author(s):  
Athman Bouguettaya ◽  
Stephen Milliner

1995 ◽  
Vol 5 (2) ◽  
pp. 145-173 ◽  
Author(s):  
Athman Bouguettaya ◽  
Stephen Milliner ◽  
Roger King




A major challenge in the analysis of clinical data and knowledge discovery is to provide integrated, advanced, and efficient tools, methods, and technologies for accessing and processing progressively increasing amounts of data in multiple formats. The paper presents a platform for multidimensional, large-scale biomedical data management and analytics that covers all phases, from data discovery, integration, preprocessing, and storage to analytics and visualization. The goal is to offer an intelligent solution in the form of an integrated, scalable workflow development environment consisting of a suite of software tools that automate the computational process of conducting scientific experiments.



2019 ◽  
Vol 9 (6) ◽  
pp. 1114 ◽  
Author(s):  
Yun Li ◽  
Yongyao Jiang ◽  
Juan Gu ◽  
Mingyue Lu ◽  
Manzhu Yu ◽  
...  

The volume, variety, and velocity of different data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.
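
The session-reconstruction step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Spark or Elasticsearch. The record layout (user id, ISO timestamp, requested resource), the function name `reconstruct_sessions`, and the 30-minute inactivity threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def reconstruct_sessions(records, gap=timedelta(minutes=30)):
    """Group each user's requests into sessions split by inactivity gaps.

    `records` is an iterable of (user_id, iso_timestamp, resource) tuples.
    This layout and the 30-minute default gap are assumptions, not details
    taken from the paper.
    """
    # Collect every request under the user who issued it.
    by_user = defaultdict(list)
    for user, ts, resource in records:
        by_user[user].append((datetime.fromisoformat(ts), resource))

    sessions = []
    for user, events in by_user.items():
        # Order each user's requests by time before splitting.
        events.sort(key=lambda e: e[0])
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            # A pause longer than `gap` closes the current session.
            if event[0] - prev[0] > gap:
                sessions.append((user, [r for _, r in current]))
                current = []
            current.append(event)
        sessions.append((user, [r for _, r in current]))
    return sessions


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    logs = [
        ("u1", "2019-01-01T10:00:00", "/search?q=sst"),
        ("u1", "2019-01-01T10:10:00", "/dataset/42"),
        ("u1", "2019-01-01T11:00:00", "/search?q=wind"),
        ("u2", "2019-01-01T10:05:00", "/dataset/42"),
    ]
    for user, resources in reconstruct_sessions(logs):
        print(user, resources)
```

Splitting on an inactivity gap is one common sessionization heuristic. In a parallel setting, the per-user grouping is where skew can arise, because a few very active users produce much larger groups than the rest.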



2019 ◽  
Author(s):  
Adrienne Hoarfrost ◽  
Nick Brown ◽  
C. Titus Brown ◽  
Carol Arnosti

Sequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user’s particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database, and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share, and download matching sequencing metadata. The MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via API to programmatically search, filter, and download all metadata. MetaSeek source code, metadata scrapers, and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/. Additional guides, tutorials, and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek, and on the MetaSeek website, https://www.metaseek.cloud/. MetaSeek is distributed under an MIT license.





2019 ◽  
Vol 35 (22) ◽  
pp. 4857-4859 ◽  
Author(s):  
Adrienne Hoarfrost ◽  
Nick Brown ◽  
C Titus Brown ◽  
Carol Arnosti

Summary: Sequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user’s particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share and download matching sequencing metadata. Availability and implementation: The MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via API to programmatically search, filter and download all metadata. MetaSeek source code, metadata scrapers and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/.



Author(s):  
Raul Castro Fernandez ◽  
Ziawasch Abedjan ◽  
Samuel Madden ◽  
Michael Stonebraker

