Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering

This chapter compares the performance of several Big Data techniques for time series forecasting against traditional time series models on three Big Data sets. The traditional models, Autoregressive Integrated Moving Average (ARIMA) and exponential smoothing, serve as baselines for the machine learning methods. The machine learning techniques include regression trees, Support Vector Machines (SVM), Multilayer Perceptrons (MLP), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) neural networks. Across the three time series data sets used (unemployment rate, bike rentals, and transportation), the study finds that LSTM neural networks performed best. It concludes that Big Data machine learning algorithms applied to time series can outperform traditional time series models. The computations in this work are done in Python, one of the most popular open source platforms for data science and Big Data analysis.
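
As a rough illustration of the kind of baseline comparison the abstract describes, the sketch below implements simple exponential smoothing and a naive last-value forecast, then scores both with mean absolute error. It is not the chapter's code: the function names, the smoothing parameter, and the demo series are all hypothetical, and only the Python standard library is used.

```python
def simple_exponential_smoothing(series, alpha):
    """One-step-ahead forecasts; forecasts[t] is the prediction for series[t]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    forecasts = [level]  # the first forecast is seeded with the first observation
    for value in series[:-1]:
        # Blend the newest observation with the previous smoothed level.
        level = alpha * value + (1.0 - alpha) * level
        forecasts.append(level)
    return forecasts


def naive_forecast(series):
    """Predict each value with the previous observation."""
    return [series[0]] + list(series[:-1])


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observations and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # Hypothetical monthly values, used only to exercise the functions.
    demo = [5.0, 5.2, 5.1, 5.4, 5.6, 5.5, 5.9, 6.0]
    for name, forecasts in [
        ("naive", naive_forecast(demo)),
        ("exponential smoothing", simple_exponential_smoothing(demo, alpha=0.5)),
    ]:
        print(name, round(mean_absolute_error(demo, forecasts), 3))
```

A machine learning model would be evaluated the same way: produce forecasts for the same time steps and compare the error metric against these baselines.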

Introduction to Python and Its Statistical Applications

Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering ◽ 10.4018/978-1-7998-2768-9.ch006 ◽ 2020 ◽ pp. 162-196

Author(s): Siyu Shi

Keyword(s): Statistical Analysis ◽ The Netherlands ◽ Programming Languages ◽ Programming Language ◽ Integrated Development Environment ◽ Development Environment ◽ Integrated Development ◽ Advantages And Disadvantages ◽ The World ◽ History Of

This chapter introduces the history of Python and the IDEs (integrated development environments) and code editors that serve as its development environments. The history traces how Python grew from the ABC programming language in the Netherlands into a community of developers from different areas, and later became one of the most popular programming languages in the world. Popular IDEs and code editors for professional developers and beginners are also introduced, along with their advantages and disadvantages. Later, the chapter introduces Python libraries that can be used in statistical analysis and gives a simple case showing how these methods can be applied.
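
To give a flavor of what a simple statistical case in Python can look like, the sketch below summarizes a small sample with the standard library's `statistics` module. This is an assumption for illustration only: the abstract does not say which libraries the chapter covers, and the `describe` helper and the sample values are hypothetical.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


if __name__ == "__main__":
    # Hypothetical sample, used only to exercise the helper.
    sample = [2, 4, 4, 4, 5, 5, 7, 9]
    for key, value in describe(sample).items():
        print(key, value)
```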

Generalized Linear Model for Automobile Fatality Rate Prediction in R

Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering ◽ 10.4018/978-1-7998-2768-9.ch005 ◽ 2020 ◽ pp. 137-161

Author(s): Gao Niu ◽ Alan Olinsky

Keyword(s): Linear Model ◽ Generalized Linear Model ◽ Descriptive Analysis ◽ Real Data ◽ The United States ◽ Contributing Factors ◽ Statistical Software ◽ Validation Metrics ◽ Fatal Crash ◽ Accident Data

This chapter demonstrates the descriptive and statistical modeling functions in R. Automobile fatal accident data for the United States is extracted from the Fatality Analysis Reporting System (FARS). The model is used to understand the significant contributing factors of death in an automobile accident when a fatal crash happens. First, descriptive analysis is performed with basic R functions and packages. Then, a generalized linear model (GLM) with a logit link function is explored and constructed. Finally, multiple validation metrics are introduced and calculated to check the reasonableness and accuracy of the predictions. The focus of this chapter is to demonstrate the power and flexibility of R, one of the most popular Open Source Statistical Software (OSSS) packages, through a real data analysis.
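
The chapter itself works in R, where a model of this kind is typically fitted with `glm(..., family = binomial(link = "logit"))`. For readers who want to see the mechanics, here is a minimal Python sketch of a binomial GLM with a logit link, fitted by plain gradient ascent on the log-likelihood, plus one simple validation metric (accuracy). It is not the chapter's method or data: the function names, learning rate, epoch count, and toy rows are hypothetical.

```python
import math


def sigmoid(z):
    """Inverse of the logit link, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, x):
    """weights[0] is the intercept; the rest pair up with the features in x."""
    return sigmoid(weights[0] + sum(w * xj for w, xj in zip(weights[1:], x)))


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit by batch gradient ascent on the average log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = y - predict_probability(weights, x)
            gradient[0] += error
            for j, xj in enumerate(x, start=1):
                gradient[j] += error * xj
        weights = [
            w + learning_rate * g / len(rows) for w, g in zip(weights, gradient)
        ]
    return weights


def accuracy(actual, predicted):
    """Share of predictions that match the observed labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # Hypothetical toy data: one predictor, binary outcome.
    rows = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
    labels = [0, 0, 1, 0, 1, 1]
    weights = fit_logistic(rows, labels)
    predicted = [int(predict_probability(weights, x) >= 0.5) for x in rows]
    print("weights:", weights)
    print("accuracy:", accuracy(labels, predicted))
```

Statistical packages usually fit GLMs with iteratively reweighted least squares rather than gradient ascent; gradient ascent is used here only because it is short and easy to follow.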

What Is Open Source Software (OSS) and What Is Big Data?

Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering ◽ 10.4018/978-1-7998-2768-9.ch001 ◽ 2020 ◽ pp. 1-49

Author(s): Richard S. Segall

Keyword(s): Big Data ◽ Open Source ◽ Open Source Software ◽ Fog Computing ◽ Computer Software ◽ Data Sets ◽ Copyright Holder ◽ Stream Data ◽ Big Data Visualization ◽ Continuous Stream

This chapter discusses what Open Source Software is, its relationship to Big Data, how it differs from other types of software, and its software development cycle. Open source software (OSS) is a type of computer software whose source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Big Data are data sets so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big Data can be discrete or a continuous data stream and is accessible from many types of computing devices, ranging from supercomputers and personal workstations to mobile devices and tablets. The chapter discusses how fog computing can be performed with cloud computing for visualization of Big Data. It also presents a summary of additional web-based Big Data visualization software.

Introduction to the Popular Open Source Statistical Software (OSSS)

Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering ◽ 10.4018/978-1-7998-2768-9.ch003 ◽ 2020 ◽ pp. 73-110

Author(s): Zhijian Wu ◽ Zichen Zhao ◽ Gao Niu

Keyword(s): Statistical Analysis ◽ User Interface ◽ Open Source ◽ Graphical User Interface ◽ Proper Selection ◽ Development Environment ◽ Integrated Development ◽ Statistical Software ◽ Analysis Task ◽ Selection Of

This chapter first introduces the two most popular Open Source Statistical Software (OSSS) packages, R and Python, along with their Integrated Development Environments (IDEs) and Graphical User Interfaces (GUIs). It then introduces additional OSSS, such as JASP, PSPP, GRETL, SOFA Statistics, Octave, KNIME, and Scilab, with function descriptions and modeling examples. The chapter is intended as a reference to help readers select the appropriate Open Source Software when a statistical analysis task arises. Each software package is described explicitly in words, and its working platform is presented together with selected numerical, descriptive, and analysis examples. Readers can thus gain a direct and in-depth understanding of each package and its functional highlights.

Open Source Software (OSS) for Big Data

Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering ◽ 10.4018/978-1-7998-2768-9.ch002 ◽ 2020 ◽ pp. 50-72

Author(s): Richard S. Segall

Keyword(s): Big Data ◽ Data Processing ◽ Open Source ◽ Open Source Software ◽ Statistical Software ◽ Big Data Processing ◽ Open Source Data ◽ Source Data ◽ And Storage ◽ Aggregation Analysis

This chapter discusses Open Source Software and associated technologies for the processing of Big Data. It covers Hadoop-related projects and the current top open source data tools and frameworks, such as SMACK, an acronym for the open source technologies Spark, Mesos, Akka, Cassandra, and Kafka, which together compose the ingestion, aggregation, analysis, and storage layers for Big Data processing. Tabular summaries and categories are provided for 38 Open Source Statistical Software (OSSS) packages, each listing features and URLs for free downloads. The current challenges of Big Data and Open Source Software are also discussed.

Cluster Analysis in R With Big Data Applications

Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering ◽ 10.4018/978-1-7998-2768-9.ch004 ◽ 2020 ◽ pp. 111-136

Author(s): Alicia Taylor Lamere

Keyword(s): Distance Measure ◽ Spatial Clustering ◽ The Cancer Genome Atlas ◽ Clustering Methods ◽ Model Based Clustering ◽ Big Data Applications ◽ Density Based Clustering ◽ Reduction Methods ◽ Cancer Genome Atlas ◽ Value Decomposition

This chapter discusses several popular clustering functions and open source software packages in R and the feasibility of using them on larger datasets. These include the kmeans() function, the pvclust package, and the DBSCAN (density-based spatial clustering of applications with noise) package, which implement K-means, hierarchical, and density-based clustering, respectively. Dimension reduction methods such as PCA (principal component analysis) and SVD (singular value decomposition), as well as the choice of distance measure, are explored as ways to improve the performance of hierarchical and model-based clustering methods on larger datasets. These methods are illustrated through an application to RNA-sequencing expression data for cancer patients, obtained from the Cancer Genome Atlas Kidney Clear Cell Carcinoma (TCGA-KIRC) data collection from The Cancer Imaging Archive (TCIA).
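
To make the K-means idea concrete, here is a minimal Python sketch of a Lloyd-style iteration with Euclidean distance: assign each point to its nearest centroid, then move each centroid to the mean of its members. The chapter uses R, and this sketch does not claim to reproduce what R's kmeans() does internally; the function names, the explicit initial centroids, and the toy points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


if __name__ == "__main__":
    # Hypothetical toy points forming two well-separated groups.
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, assignments = kmeans(points, [(0, 0), (10, 10)])
    print(centroids)
    print(assignments)
```

Every iteration computes one distance per point per centroid, which is part of why dimension reduction and the choice of distance measure matter as datasets grow.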

Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering
Latest Publications

Published By IGI Global

A Comparison of Machine Learning Algorithms of Big Data for Time Series Forecasting Using Python

Introduction to Python and Its Statistical Applications

Generalized Linear Model for Automobile Fatality Rate Prediction in R

What Is Open Source Software (OSS) and What Is Big Data?

Introduction to the Popular Open Source Statistical Software (OSSS)

Open Source Software (OSS) for Big Data

Cluster Analysis in R With Big Data Applications
