Technical Challenges for Big Data in Biomedicine and Health: Data Sources, Infrastructure, and Analytics

2014 ◽  
Vol 23 (01) ◽  
pp. 42-47 ◽  
Author(s):  
J. H. Holmes ◽  
J. Sun ◽  
N. Peek

Summary. Objectives: To review technical and methodological challenges for big data research in biomedicine and health. Methods: We discuss sources of big datasets, survey infrastructures for big data storage and big data processing, and describe the main challenges that arise when analyzing big data. Results: The life and biomedical sciences are massively contributing to the big data revolution through secondary use of data that were collected during routine care and through new data sources such as social media. Efficient processing of big datasets is typically achieved by distributing computation over a cluster of computers. Data analysts should be aware of pitfalls related to big data, such as bias in routine care data and the risk of false-positive findings in high-dimensional datasets. Conclusions: The major challenge for the near future is to adapt the analytical methods used in the biomedical and health domain to the distributed storage and processing model required to handle big data, while ensuring confidentiality of the data being analyzed.

2020 ◽  
Vol 30 (Supplement_5) ◽  

Abstract Countries have a wide range of lifestyles, environmental exposures and different health(care) systems, providing a large natural experiment to be investigated. Through pan-European comparative studies, the underlying determinants of population health can be explored, yielding rich new insights into the dynamics of population health and care, such as the safety, quality, effectiveness and costs of interventions. Additionally, in the big data era, secondary use of data has become one of the major cornerstones of digital transformation for health systems improvement. Several countries are reviewing governance models and regulatory frameworks for data reuse. Precision medicine and public health intelligence share the same population-based approach; as such, aligning secondary-use-of-data initiatives will increase the cost-efficiency of the data conversion value chain by ensuring that different stakeholders' needs are accounted for from the beginning. At EU level, the European Commission has been raising awareness of the need to create adequate data ecosystems for innovative use of big data for health, especially ensuring responsible development and deployment of data science and artificial intelligence technologies in the medical and public health sectors. To this end, the Joint Action on Health Information (InfAct) is setting up the Distributed Infrastructure on Population Health (DIPoH). DIPoH provides a framework for international and multi-sectoral collaborations in health information. More specifically, DIPoH facilitates the sharing of research methods, data and results through the participation of countries and already existing research networks. DIPoH's efforts include harmonization and interoperability, strengthening of research capacity in Member States (MSs) and providing European and worldwide perspectives to national data. In order to be embedded in the health information landscape, DIPoH aims to interact with existing (inter)national initiatives to identify common interfaces, avoid duplication of work and establish a sustainable long-term health information research infrastructure. In this workshop, InfAct lays down DIPoH's core elements in coherence with national and European initiatives and actors, i.e. To-Reach, eHAction, the French Health Data Hub and ECHO. Pitch presentations on DIPoH and its national nodes will set the scene. In the format of a round table, possible collaborations with existing initiatives at (inter)national level will be debated with the audience. Synergies will be sought, reflections on community needs will be made and expectations on services will be discussed. The workshop will increase delegates' knowledge of the latest health information infrastructures and initiatives that strive for better public health and health systems in countries. The workshop also serves as a capacity-building activity to promote cooperation between initiatives and actors in the field. Key messages: DIPoH is an infrastructure aiming to interact with existing (inter)national initiatives to identify common interfaces, avoid duplication and enable a long-term health information research infrastructure. National nodes can improve coordination, communication and cooperation between health information stakeholders in a country, potentially reducing overlap and duplication of research and fieldwork.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Xiangli Chang ◽  
Hailang Cui

With the increasing popularity of Internet-based services and of services hosted on cloud platforms, more powerful back-end storage systems are needed to support them. At present, it is very difficult or impossible for a single distributed storage design to satisfy every requirement at once. Therefore, research focuses on designing different distributed storage solutions, each constrained to different characteristics, to meet different usage scenarios. Economic big data has the basic requirements of high storage efficiency and fast retrieval speed. The large number of small files and the diversity of file types make the storage and retrieval of economic big data particularly challenging. This paper is oriented to the application requirements of cross-modal analysis of economic big data. According to the sources and characteristics of economic big data, the data types are analyzed and the database storage architecture and data storage structure for economic big data are designed. Taking into account the spatial, temporal, and semantic characteristics of economic big data, this paper proposes a unified coding method based on a multilevel spatiotemporal division strategy that combines Geohash and Hilbert curves with spatiotemporal semantic constraints. A prototype system was built on MongoDB, and the performance of the proposed multilevel partition algorithm was verified with this prototype on top of its data storage management functions. A Wiener distributed storage scheme, based on the principle of the Wiener filter, is used to store the workload of each distributed storage window in a distributed manner. The article considers a specific type of workload and, according to its periodicity, divides it into distributed storage windows of a specific duration; at the beginning of each window, storage is allocated for the next window. Experiments and tests verified the distributed storage strategy proposed in this article, showing that the Wiener distributed storage solution can save platform resources and configuration costs while meeting the Service Level Agreement (SLA).
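The composite spatiotemporal key at the heart of such a scheme can be pictured with a minimal sketch: a coarse time bucket concatenated with a Geohash cell, so that records close in space and time share a key prefix and can be retrieved with a range scan. This is an illustration under stated assumptions only; the paper's actual multilevel division strategy (Geohash plus Hilbert curves plus semantic constraints) and its MongoDB schema are not reproduced here.

```python
# Minimal sketch of a composite spatiotemporal key: a daily time bucket
# concatenated with a Geohash cell. Illustration only; not the paper's
# multilevel Geohash + Hilbert + semantic-constraint coding method.
from datetime import datetime

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Standard Geohash encoding of a latitude/longitude pair."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, even, char, cells = 0, True, 0, []
    while len(cells) < precision:
        if even:  # alternate between longitude and latitude bits
            mid = (lon_range[0] + lon_range[1]) / 2
            if lon > mid:
                char = (char << 1) | 1
                lon_range[0] = mid
            else:
                char = char << 1
                lon_range[1] = mid
        else:
            mid = (lat_range[0] + lat_range[1]) / 2
            if lat > mid:
                char = (char << 1) | 1
                lat_range[0] = mid
            else:
                char = char << 1
                lat_range[1] = mid
        even = not even
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            cells.append(_BASE32[char])
            bits, char = 0, 0
    return "".join(cells)

def spatiotemporal_key(ts: datetime, lat: float, lon: float, precision: int = 7) -> str:
    """Compose a sortable key: daily time bucket + Geohash cell."""
    return f"{ts:%Y%m%d}-{geohash_encode(lat, lon, precision)}"

# Nearby records from the same day share a key prefix, so a prefix (range)
# query on the key retrieves them together.
print(spatiotemporal_key(datetime(2021, 5, 1, 12), 39.9042, 116.4074))
```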


Author(s):  
A. Joshi ◽  
E. Pebesma ◽  
R. Henriques ◽  
M. Appel

Abstract. Earth observation data covering a large part of the world are available at different temporal, spectral and spatial resolutions. These data can be termed big data as they fulfil the criteria of the 3 Vs of big data: Volume, Velocity and Variety. The images in archives amount to multiple petabytes, the volume is growing continuously, and the data have varied resolutions and usages. These big data have a variety of applications, including climate change studies, forestry, agriculture and urban planning. However, they also pose challenges for data storage, data management and the high computational requirements of processing. The solution to these computational and data management requirements is a database system with distributed storage and parallel computation. In this study SciDB, an array-based database, is used to store, manage and process multitemporal satellite imagery. The major aim of this study is to develop a SciDB-based scalable solution to store and perform time series analysis on multi-temporal satellite imagery. A total of 148 Landsat scenes covering the 10-year period between 2006 and 2016 were stored as SciDB arrays. The data were then retrieved, processed and visualized. This study provides a solution for the storage of big remote sensing data and a workflow for time series analysis of remote sensing data, regardless of its size.
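The array data model that makes this workflow convenient can be pictured as a three-dimensional cube indexed by (time, y, x). The sketch below uses NumPy purely as a stand-in to show the "reduce over the time dimension" pattern; SciDB would chunk the same array across nodes and run the reduction in parallel, and its actual query syntax is not reproduced here.

```python
# Conceptual sketch of the array model used in the study: a 3-D cube indexed
# by (time, y, x), held in NumPy purely for illustration. Array sizes and the
# random values are stand-ins, not the study's Landsat data.
import numpy as np

t, height, width = 148, 256, 256          # 148 scenes, 2006-2016
cube = np.random.rand(t, height, width)   # stand-in for a per-pixel time series

# Per-pixel temporal statistics: reducing over the time axis yields 2-D maps,
# the typical time series analysis pattern on multitemporal imagery.
mean_map = cube.mean(axis=0)
slope, _ = np.polyfit(np.arange(t), cube.reshape(t, -1), deg=1)
trend_map = slope.reshape(height, width)  # per-pixel linear trend over time

print(mean_map.shape, trend_map.shape)    # (256, 256) (256, 256)
```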


Transformation is the second step in the ETL process, which is responsible for extracting, transforming and loading data into a data warehouse. The role of transformation is to set up several operations to clean, format and unify the types and data coming from multiple, heterogeneous data sources. The goal is to make the data conform to the schema of the data warehouse in order to avoid ambiguity problems during data storage and analytical operations. Transforming data coming from structured, semi-structured and unstructured data sources needs two levels of treatment: the first is schema-to-schema transformation, to obtain a unified schema for all selected data sources; the second is data-to-data transformation, to unify all the types and data gathered. To set up these steps, we propose in this paper a process for switching from one database schema to another as part of the schema-to-schema transformation, and a meta-model based on the MDA approach to describe the main operations of the data-to-data transformation. The result of our transformations is data loaded into one of the four NoSQL schema types, chosen to best meet the constraints and requirements of Big Data.
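A minimal sketch of the two levels of treatment, assuming a document-oriented target (one of the four NoSQL schema types) and entirely hypothetical mapping rules; it is not the authors' MDA-based meta-model.

```python
# Level 1 (schema to schema): map source (table, column) pairs onto a unified
# document layout. Level 2 (data to data): unify the value types gathered.
# All names and rules here are hypothetical, for illustration only.
from datetime import date

SCHEMA_MAPPING = {
    ("customers", "cust_name"): "customer.name",
    ("customers", "dob"):       "customer.birth_date",
    ("orders",    "amount"):    "order.total",
}

def unify_value(value):
    """Data-to-data treatment: normalize heterogeneous value types."""
    if isinstance(value, str) and value.isdigit():
        return int(value)                 # numeric strings become integers
    if isinstance(value, date):
        return value.isoformat()          # dates become ISO-8601 strings
    return value

def transform_row(table, row):
    """Turn one source row into a nested document ready for loading."""
    document = {}
    for column, value in row.items():
        target = SCHEMA_MAPPING.get((table, column))
        if target is None:
            continue                      # column not kept in the warehouse
        top, leaf = target.split(".")
        document.setdefault(top, {})[leaf] = unify_value(value)
    return document

print(transform_row("customers", {"cust_name": "Alice", "dob": date(1990, 5, 1)}))
# {'customer': {'name': 'Alice', 'birth_date': '1990-05-01'}}
```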


2021 ◽  
pp. 1-13
Author(s):  
Setia Pramana ◽  
Siti Mariyah ◽  
Takdir

The rapid development of Big Data, as a result of increasing interaction with online systems between humans (e.g., online shopping, marketplaces) and machines (Internet of Things, mobile phones, etc.), has led to a measurement revolution. These massive data, if mined and analyzed correctly, can provide valuable alternative data sources for official statistics, especially price statistics. Several studies on using diverse Big Data sources as new sources of price statistics in Indonesia have been initiated. This article provides a comprehensive review of experiences in exploiting various Big Data sources for price statistics, followed by current developments and near-future plans. The development of the system and IT infrastructure is also discussed. Based on this experience, limitations, challenges, and advances for each approach are presented.


2021 ◽  
Author(s):  
Heinrich Peters ◽  
Zachariah Marrero ◽  
Samuel D. Gosling

As human interactions have shifted to virtual spaces and as sensing systems have become more affordable, an increasing share of people's everyday lives can be captured in real time. The availability of such fine-grained behavioral data from billions of people has the potential to enable great leaps in our understanding of human behavior. However, such data also pose challenges to engineers and behavioral scientists alike, requiring a specialized set of tools and methodologies to generate psychologically relevant insights. In particular, researchers may need to utilize machine learning techniques to extract information from unstructured or semi-structured data, reduce high-dimensional data to a smaller number of variables, and efficiently deal with extremely large sample sizes. Such procedures can be computationally expensive, requiring researchers to balance computation time with processing power and memory capacity. Whereas modeling procedures on small datasets will usually take mere moments to execute, applying them to big data can take much longer, with typical execution times spanning hours, days, or even weeks depending on the complexity of the problem and the resources available. Seemingly subtle decisions regarding preprocessing and analytic strategy can end up having a huge impact on the viability of executing analyses within a reasonable timeframe. Consequently, researchers must anticipate potential pitfalls regarding the interplay of their analytic strategy with memory and computational constraints. Many researchers who are interested in using "big data" report having problems learning about new analytic methods or software, finding collaborators with the right skills and knowledge, and getting access to commercial or proprietary data for their research (Metzler et al. 2016). This chapter aims to serve as a practical introduction for psychologists who want to use large datasets and datasets from non-traditional data sources in their research (i.e., data not generated in the lab or through conventional surveys). First, we discuss the concept of big data and review some of the theoretical challenges and opportunities that arise with the availability of ever larger amounts of data. Second, we discuss practical implications and best practices with respect to data collection, data storage, data processing, and data modeling for psychological research in the age of big data.
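One common pattern for the constraints described in this chapter is out-of-core dimensionality reduction: fitting a model chunk by chunk so that memory use stays bounded regardless of sample size. The sketch below uses scikit-learn's IncrementalPCA; the file name, chunk size, and number of components are illustrative assumptions, not recommendations from the chapter.

```python
# Reduce high-dimensional behavioral data to a few components without loading
# the full dataset into memory: fit incrementally, then transform chunk by
# chunk. The CSV file and its dimensions are hypothetical.
import numpy as np
import pandas as pd
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=10)

# First pass: update the model on each chunk so memory use stays bounded.
for chunk in pd.read_csv("behavioral_features.csv", chunksize=100_000):
    ipca.partial_fit(chunk.to_numpy())

# Second pass: keep only the reduced representation of each chunk.
reduced = [
    ipca.transform(chunk.to_numpy())
    for chunk in pd.read_csv("behavioral_features.csv", chunksize=100_000)
]
scores = np.vstack(reduced)   # n_samples x 10 component scores
print(scores.shape)
```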


Big data applications play an important role in real-time data processing. Apache Spark is a data processing framework with an in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. Spark's in-memory processing cannot share data between applications, and its RAM will be insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data in different storage systems. Alluxio helps to speed up data-intensive Spark applications across various storage systems. In this work, the performance of applications on Spark alone, as well as Spark running over Alluxio, has been studied with respect to several storage formats, namely Parquet, ORC, CSV, and JSON, and four types of queries from the Star Schema Benchmark (SSB). A benchmark is developed to suggest the suitability of the Spark-Alluxio combination for big data applications. It is found that Alluxio is suitable for applications that use databases larger than 2.6 GB storing data in JSON and CSV formats, whereas Spark alone is suitable for applications that use storage formats such as Parquet and ORC with database sizes smaller than 2.6 GB.
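The kind of comparison described above might look like the following PySpark sketch, in which the same simplified SSB-style query is run against a table read directly from HDFS and against the same table served through Alluxio. The paths, the Alluxio master address, and the query shape are illustrative assumptions, and the Alluxio client jar must be on Spark's classpath for the alluxio:// scheme to resolve.

```python
# Sketch: the same SSB-style aggregation against Parquet data read directly
# from HDFS vs. read through Alluxio's memory tier. Paths and addresses are
# hypothetical; this is not the paper's benchmark harness.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ssb-alluxio-sketch").getOrCreate()

# Spark alone: Parquet read from HDFS.
lineorder = spark.read.parquet("hdfs:///ssb/lineorder.parquet")

# Spark over Alluxio: the same table served from Alluxio (default port 19998).
lineorder_alluxio = spark.read.parquet(
    "alluxio://alluxio-master:19998/ssb/lineorder.parquet"
)

lineorder_alluxio.createOrReplaceTempView("lineorder")
# Simplified SSB query-flight-1 shape on the lineorder fact table.
result = spark.sql("""
    SELECT lo_orderdate, SUM(lo_extendedprice * lo_discount) AS revenue
    FROM lineorder
    WHERE lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25
    GROUP BY lo_orderdate
""")
result.show()
```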


2018 ◽  
Vol 17 ◽  
pp. 03023
Author(s):  
Lei Wang ◽  
Weichun Ge ◽  
Zhao Li ◽  
Zhenjiang Lei ◽  
Shuo Chen

It is reported that the electricity cost of operating a cluster may well exceed its acquisition cost, and the processing of big data requires large-scale clusters and long running periods. Therefore, energy-efficient processing of big data is essential for data owners and users. In this paper, we propose a novel algorithm, MinBalance, for processing I/O-intensive big data tasks energy-efficiently in heterogeneous clusters. In the first step, four greedy policies are used to select the proper nodes, considering the heterogeneity of the cluster. In the second step, the workloads of the selected nodes are balanced to avoid the energy waste caused by waiting. MinBalance is a universal algorithm and is not affected by the underlying data storage strategies. Experimental results indicate that MinBalance can achieve over 60% energy reduction for large data sets compared with traditional methods of powering down partial nodes.
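A simplified illustration of the two-step idea, assuming hypothetical node metrics: nodes are first selected greedily by a single efficiency criterion, then work is split in proportion to throughput so that no selected node idles while the others finish. The paper's four greedy policies and its energy model are not reproduced here.

```python
# Toy two-step sketch in the spirit of greedy selection + workload balancing.
# Not MinBalance itself: the efficiency criterion and node metrics are made up.
def select_and_balance(nodes, total_work, k):
    """nodes: list of dicts with 'name', 'throughput' (MB/s), 'power' (W)."""
    # Step 1 - greedy selection: prefer nodes with the best work-per-joule.
    ranked = sorted(nodes, key=lambda n: n["throughput"] / n["power"], reverse=True)
    selected = ranked[:k]

    # Step 2 - balancing: split work proportionally to throughput, so all
    # selected nodes finish at roughly the same time and none waits idle.
    total_throughput = sum(n["throughput"] for n in selected)
    return {
        n["name"]: total_work * n["throughput"] / total_throughput
        for n in selected
    }

cluster = [
    {"name": "node1", "throughput": 200, "power": 150},
    {"name": "node2", "throughput": 120, "power": 90},
    {"name": "node3", "throughput": 300, "power": 260},
]
print(select_and_balance(cluster, total_work=10_000, k=2))
```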


Author(s):  
Wajid Ali ◽  
Muhammad Usman Shafique ◽  
Muhammad Arslan Majeed ◽  
Ali Raza

A key ingredient in the world of cloud computing is a database that can be used by a great number of users. Distributed storage mechanisms have become the de facto method of data storage for companies building the new generation of web applications. In the world of data storage, NoSQL (usually interpreted as "not only SQL" by developers) databases are a growing trend. NoSQL is said to be an alternative to the most widely used relational databases for data storage but, as the name implies, it does not fully replace SQL. In this paper we discuss SQL and NoSQL databases, compare traditional SQL with NoSQL databases for Big Data analytics, and review NoSQL data models, the types of NoSQL data stores, the characteristics and features of each data store, and the advantages and disadvantages of NoSQL and RDBMS.
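The data-model difference under discussion can be made concrete with a small sketch: the same order stored as normalized relational rows (SQL) and as one self-contained document of the kind a document store such as MongoDB would persist as a single unit. The schema and values are made up for illustration.

```python
# Relational vs. document representation of the same order. Illustration only.
import json
import sqlite3

# Relational model: two tables joined by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (order_id INTEGER, product TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1, 'Alice');
    INSERT INTO order_items VALUES (1, 'keyboard', 2), (1, 'mouse', 1);
""")
rows = conn.execute("""
    SELECT o.id, o.customer, i.product, i.qty
    FROM orders o JOIN order_items i ON i.order_id = o.id
""").fetchall()

# Document model: the whole aggregate is one JSON-like document, stored and
# retrieved as a single unit by a document-oriented NoSQL store.
order_document = {
    "_id": 1,
    "customer": "Alice",
    "items": [{"product": "keyboard", "qty": 2}, {"product": "mouse", "qty": 1}],
}

print(rows)
print(json.dumps(order_document))
```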

