Open Source Software for Statistical Analysis of Big Data

This paper discusses the definitions of open source software, free software, and freeware, as well as the concept of big data. The authors then introduce R and Python as the two most popular open source statistical software (OSSS) tools. Additional OSSS, such as JASP, PSPP, GRETL, SOFA Statistics, Octave, KNIME, and Scilab, are also introduced with function descriptions and modeling examples. The authors further discuss the capabilities of OSSS in artificial intelligence applications and modeling, along with popular OSSS-based machine learning libraries and systems. The paper is intended as a reference to help readers select appropriate open source software for statistical analysis tasks. In addition, the working platform and selected numerical, descriptive, and analytical examples are provided for each software package, giving readers a direct and in-depth understanding of each package and its functional highlights.

Industrial Big Data Platform Based on Open Source Software

Proceedings of the International Conference on Computer Networks and Communication Technology (CNCT 2016) ◽

10.2991/cnct-16.2017.90 ◽

2017 ◽

Cited By ~ 1

Author(s):

Wen YANG ◽

Syed Naeem Haider ◽

Jian-hong ZOU ◽

Qian-chuan ZHAO

Keyword(s):

Big Data ◽

Open Source ◽

Open Source Software ◽

Industrial Big Data ◽

Data Platform

Computing remote sensing big data using local hardware and open-source software packages

Kart og plan ◽

10.18261/issn.2535-6003-2021-03-04-09 ◽

2021 ◽

Vol 114 (3-04) ◽

pp. 254-273

Author(s):

Misganu Debella-Gilo ◽

Jonathan Rizzi

Keyword(s):

Remote Sensing ◽

Big Data ◽

Open Source ◽

Open Source Software ◽

Software Packages

Conquery: an Open Source Application to analyze High Content Healthcare Data (Preprint)

10.2196/preprints.32745 ◽

2021 ◽

Author(s):

Fabian Kovacs ◽

Max Thonagel ◽

Marion Ludwig ◽

Alexander Albrecht ◽

Manuel Hegner ◽

...

Keyword(s):

Decision Making ◽

Big Data ◽

Data Analysis ◽

Open Source ◽

Open Source Software ◽

Medical Records ◽

Healthcare Sector ◽

Study Cohort ◽

Decision Making Processes ◽

Analytical Approaches

BACKGROUND Big data in healthcare must be exploited to achieve a substantial increase in efficiency and competitiveness. In particular, the analysis of patient-related data has great potential to improve decision-making processes. However, most analytical approaches used today are highly time- and resource-consuming.

OBJECTIVE Conquery, the software solution presented here, is an open-source tool that provides advanced but intuitive data analysis without the need for specialized statistical training. Conquery aims to simplify big data analysis for novice database users in the medical sector.

METHODS Conquery is a document-oriented distributed time-series database and analysis platform. Its main application is the analysis of per-person medical records by non-technical medical professionals. Complex analyses are built in the Conquery frontend by dragging tree nodes into the query editor. Queries are evaluated in a column-oriented fashion by a bespoke distributed query engine for medical records. We present a custom compression scheme, which uses both online-calculated and precomputed metadata and data statistics, to facilitate low response times.

RESULTS Conquery allows for easy navigation through the hierarchy and enables complex study cohort construction while reducing the demand on time and resources. The UI of Conquery and a query output are illustrated by the construction of a relevant clinical cohort.

CONCLUSIONS Conquery is efficient and intuitive open-source software for performant and secure data analysis that aims to support decision-making processes in the healthcare sector.

Statistical analysis of popular open source software projects and their communities

2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE) ◽

10.1109/iciteed.2014.7007913 ◽

2014 ◽

Author(s):

Andi Wahju Rahardjo Emanuel

Keyword(s):

Statistical Analysis ◽

Open Source ◽

Open Source Software ◽

Software Projects

Rail-RNA: Scalable analysis of RNA-seq splicing and coverage

10.1101/019067 ◽

2015 ◽

Cited By ~ 5

Author(s):

Abhinav Nellore ◽

Leonardo Collado-Torres ◽

Andrew E Jaffe ◽

José Alquicira-Hernández ◽

Jacob Pritt ◽

...

Keyword(s):

Statistical Analysis ◽

Web Services ◽

Open Source ◽

Rna Sequencing ◽

Open Source Software ◽

Rna Seq ◽

Spliced Alignment ◽

Amazon Web Services ◽

Scalable Analysis ◽

Multiple Samples

RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it is difficult to reproduce the exact analysis without access to original computing resources. We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 hours for US$0.91 per sample. Rail-RNA produces alignments and base-resolution bigWig coverage files, ready for use with downstream packages for reproducible statistical analysis. We identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounders. Rail-RNA is open-source software available at http://rail.bio.

TagNN: A Code Tag Generation Technology for Resource Retrieval from Open-Source Big Data

Wireless Communications and Mobile Computing ◽

10.1155/2021/9956207 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Lingbin Zeng ◽

Xin Guo ◽

Cheng Yang ◽

Yao Lu ◽

Xiao Li

Keyword(s):

Big Data ◽

Open Source ◽

Open Source Software ◽

Learning Algorithm ◽

Difficult Problem ◽

Empirical Knowledge ◽

Huge Number ◽

Source Codes ◽

Deep Learning Algorithm ◽

Generation Technology

With the vigorous development of open-source software, a huge number of open-source projects and code bases have accumulated as open-source big data, which contains a wealth of code resources. However, retrieving relevant code snippets effectively and efficiently from such a large volume of data is an extremely difficult problem. There are usually large gaps between a user’s natural language description and the open-source code snippets. In this paper, we propose a novel code tag generation and code retrieval approach named TagNN, which combines empirical software engineering knowledge with a deep learning algorithm. The experimental results show that our method performs well on both code tag generation and code snippet retrieval.

Big data processing using Open Source Software- A Questionnaire on the data science

Scholedge International Journal of Multidisciplinary & Allied Studies ISSN 2394-336X ◽

10.19085/journal.sijmas030101 ◽

2016 ◽

Vol 3 (1) ◽

pp. 1

Author(s):

Andrew McCullum

Keyword(s):

Big Data ◽

Data Processing ◽

World Trade Organization ◽

Central Asia ◽

Open Source ◽

Open Source Software ◽

World Trade ◽

Data Science ◽

Customs Union ◽

The World

In 2015, Central Asia saw some important improvements in the environment for cross-border e-commerce: Kazakhstan's accession to the World Trade Organization (WTO) will help trade transparency, while the Kyrgyz Republic's membership in the Eurasian Customs Union expands its consumer base. Why e-commerce? There are two reasons. First, e-commerce reduces the cost of distance. Central Asia is the highest trade-cost region in the world: vast distances from major markets make finding buyers challenging, shipping goods slow, and export costs high. Second, e-commerce can draw in populations that are traditionally under-represented in export markets, such as women, small businesses, and rural entrepreneurs.

What Is Open Source Software (OSS) and What Is Big Data?

Research Anthology on Usage and Development of Open Source Software ◽

10.4018/978-1-7998-9158-1.ch041 ◽

2021 ◽

pp. 817-857

Author(s):

Richard S. Segall

Keyword(s):

Big Data ◽

Open Source ◽

Open Source Software ◽

Fog Computing ◽

Computer Software ◽

Data Sets ◽

Copyright Holder ◽

Stream Data ◽

Big Data Visualization ◽

Continuous Stream

This chapter discusses what Open Source Software is, its relationship to Big Data, how it differs from other types of software, and its software development cycle. Open source software (OSS) is a type of computer software whose source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Big Data are data sets so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big Data can be discrete or a continuous data stream and are accessible using many types of computing devices, ranging from supercomputers and personal workstations to mobile devices and tablets. The chapter discusses how fog computing can be performed with cloud computing for visualization of Big Data, and also presents a summary of additional web-based Big Data visualization software.

The Statistical Analysis of Misreporting on Sensitive Survey Questions

Political Analysis ◽

10.1017/pan.2017.8 ◽

2017 ◽

Vol 25 (2) ◽

pp. 241-259 ◽

Cited By ~ 10

Author(s):

Gregory Eady

Keyword(s):

Statistical Analysis ◽

Open Source ◽

Open Source Software ◽

Multivariate Regression ◽

Large Scale ◽

Survey Question ◽

Direct Question ◽

Inherent Problem ◽

List Experiment ◽

The Right

What explains why some survey respondents answer truthfully to a sensitive survey question, while others do not? This question is central to our understanding of a wide variety of attitudes, beliefs, and behaviors, but has remained difficult to investigate empirically due to the inherent problem of distinguishing those who are telling the truth from those who are misreporting. This article proposes a solution to this problem. It develops a method to model, within a multivariate regression context, whether survey respondents provide one response to a sensitive item in a list experiment, but answer otherwise when asked to reveal that belief openly in response to a direct question. As an empirical application, the method is applied to an original large-scale list experiment to investigate whether those on the ideological left are more likely to misreport their responses to questions about prejudice than those on the right. The method is implemented for researchers as open-source software.
