Open Source Software for Statistical Analysis of Big Data

2020 ◽  
2021 ◽  
Vol 12 (1) ◽  
pp. 1-20
Author(s):  
Gao Niu ◽  
Richard S. Segall ◽  
Zichen Zhao ◽  
Zhijian Wu

This paper discusses the definitions of open source software, free software, and freeware, and the concept of big data. The authors then introduce R and Python as the two most popular open source statistical software (OSSS) packages. Additional OSSS, such as JASP, PSPP, GRETL, SOFA Statistics, Octave, KNIME, and Scilab, are also introduced with function descriptions and modeling examples. The authors further discuss the capabilities of OSSS in artificial intelligence applications and modeling, as well as popular OSSS-based machine learning libraries and systems. The paper is intended as a reference to help readers select appropriate open source software for statistical analysis tasks. In addition, the working platform and selected numerical, descriptive, and analysis examples are provided for each software package, giving readers a direct and in-depth understanding of each package and its functional highlights.


2021 ◽  
Author(s):  
Fabian Kovacs ◽  
Max Thonagel ◽  
Marion Ludwig ◽  
Alexander Albrecht ◽  
Manuel Hegner ◽  
...  

BACKGROUND Big data in healthcare must be exploited to achieve a substantial increase in efficiency and competitiveness. The analysis of patient-related data in particular has great potential to improve decision-making processes. However, most analytical approaches used today are highly time- and resource-consuming. OBJECTIVE The presented software solution, Conquery, is an open-source tool that provides advanced but intuitive data analysis without the need for specialized statistical training. Conquery aims to simplify big data analysis for novice database users in the medical sector. METHODS Conquery is a document-oriented distributed time-series database and analysis platform. Its main application is the analysis of per-person medical records by non-technical medical professionals. Complex analyses are built in the Conquery frontend by dragging tree nodes into the query editor. Queries are evaluated in a column-oriented fashion by a bespoke distributed query engine for medical records. To keep response times low, we present a custom compression scheme that uses metadata and data statistics, both precomputed and calculated online. RESULTS Conquery allows for easy navigation through the hierarchy and enables complex study cohort construction while reducing the demand on time and resources. The UI of Conquery and a query output are exemplified by the construction of a relevant clinical cohort. CONCLUSIONS Conquery is an efficient and intuitive open-source software tool for performant and secure data analysis and aims to support decision-making processes in the healthcare sector.


2015 ◽  
Author(s):  
Abhinav Nellore ◽  
Leonardo Collado-Torres ◽  
Andrew E Jaffe ◽  
José Alquicira-Hernández ◽  
Jacob Pritt ◽  
...  

RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it is difficult to reproduce the exact analysis without access to original computing resources. We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 hours for US$0.91 per sample. Rail-RNA produces alignments and base-resolution bigWig coverage files, ready for use with downstream packages for reproducible statistical analysis. We identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounders. Rail-RNA is open-source software available at http://rail.bio.


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Lingbin Zeng ◽  
Xin Guo ◽  
Cheng Yang ◽  
Yao Lu ◽  
Xiao Li

With the vigorous development of open-source software, a huge number of open-source projects and a large volume of open-source code have accumulated into open-source big data, which contains a wealth of code resources. However, effectively and efficiently retrieving relevant code snippets from such a large amount of data is an extremely difficult problem. There are usually large gaps between the user's natural language description and the open-source code snippets. In this paper, we propose a novel code tag generation and code retrieval approach named TagNN, which combines empirical software engineering knowledge and a deep learning algorithm. The experimental results show that our method performs well on code tag generation and code snippet retrieval.


Author(s):  
Andrew McCullum

In 2015, Central Asia made some important improvements in the environment for cross-border e-commerce: Kazakhstan's accession to the World Trade Organization (WTO) will help trade transparency, while the Kyrgyz Republic's membership in the Eurasian Customs Union expands its consumer base. Why e-commerce? There are two reasons. First, e-commerce reduces the cost of distance. Central Asia is the highest trade cost region in the world: vast distances from major markets make finding buyers challenging, shipping goods slow, and export costs high. Second, e-commerce can draw in populations that are traditionally under-represented in export markets, such as women, small businesses, and rural entrepreneurs.


Author(s):  
Richard S. Segall

This chapter discusses what open source software is, its relationship to Big Data, how it differs from other types of software, and its software development cycle. Open source software (OSS) is a type of computer software whose source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Big Data are data sets so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big Data can be discrete or a continuous data stream and can be accessed from many types of computing devices, ranging from supercomputers and personal workstations to mobile devices and tablets. The chapter discusses how fog computing can be combined with cloud computing for visualization of Big Data. It also presents a summary of additional web-based Big Data visualization software.


2017 ◽  
Vol 25 (2) ◽  
pp. 241-259 ◽  
Author(s):  
Gregory Eady

What explains why some survey respondents answer truthfully to a sensitive survey question, while others do not? This question is central to our understanding of a wide variety of attitudes, beliefs, and behaviors, but has remained difficult to investigate empirically due to the inherent problem of distinguishing those who are telling the truth from those who are misreporting. This article proposes a solution to this problem. It develops a method to model, within a multivariate regression context, whether survey respondents provide one response to a sensitive item in a list experiment, but answer otherwise when asked to reveal that belief openly in response to a direct question. As an empirical application, the method is applied to an original large-scale list experiment to investigate whether those on the ideological left are more likely to misreport their responses to questions about prejudice than those on the right. The method is implemented for researchers as open-source software.
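For readers unfamiliar with list experiments, the basic design can be sketched as follows. This is not the multivariate method the article develops; it is only the standard difference-in-means estimator, in which a control group counts how many of J baseline items apply to them, a treatment group counts among the same J items plus the sensitive item, and the difference in mean counts estimates the prevalence of the sensitive item. The function name and the data below are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def list_experiment_estimate(control_counts, treatment_counts):
    """Difference-in-means estimate of a sensitive item's prevalence.

    control_counts:   item counts from respondents shown only the baseline items
    treatment_counts: item counts from respondents shown the baseline items
                      plus the sensitive item
    """
    return mean(treatment_counts) - mean(control_counts)


# Hypothetical item counts, invented for illustration only.
control = [1, 2, 2, 3, 1, 2]
treatment = [2, 2, 3, 3, 2, 3]

estimate = list_experiment_estimate(control, treatment)
print(f"Estimated prevalence of the sensitive item: {estimate:.3f}")
```

Comparing an estimate of this kind with the share of respondents who affirm the same item when asked directly gives an aggregate indication of misreporting; the article's contribution is to model that discrepancy at the respondent level within a regression framework.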

