Open Source Software for Statistical Analysis of Big Data - Advances in Computer and Electrical Engineering

Published By IGI Global

9781799827689, 9781799827702

Author(s): Son Nguyen, Anthony Park

This chapter compares the performance of several Big Data machine learning techniques for time series forecasting against traditional time series models on three Big Data sets. The traditional models, Autoregressive Integrated Moving Average (ARIMA) and exponential smoothing, serve as baselines for the machine learning methods: regression trees, Support Vector Machines (SVM), Multilayer Perceptrons (MLP), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) neural networks. Across the three time series data sets (unemployment rate, bike rentals, and transportation), the study finds that LSTM neural networks performed best. The study concludes that Big Data machine learning algorithms applied to time series can outperform traditional time series models. The computations are done in Python, one of the most popular open source platforms for data science and Big Data analysis.
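As a rough illustration of what an exponential smoothing baseline involves, the sketch below computes one-step-ahead forecasts with simple exponential smoothing and compares their mean absolute error with a naive last-value forecast. It is a minimal sketch, not the chapter's code: the function names, the smoothing parameter alpha = 0.5, and the six-point toy series are all assumptions made for illustration and are unrelated to the chapter's data sets.

```python
def simple_exponential_smoothing(series, alpha):
    """Return one-step-ahead forecasts, where forecasts[t] predicts series[t].

    The first forecast is initialised to the first observation (an assumed,
    commonly used initialisation).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    forecasts = [series[0]]
    for observed in series[:-1]:
        # New forecast = weighted mix of the latest observation and the
        # previous forecast.
        forecasts.append(alpha * observed + (1 - alpha) * forecasts[-1])
    return forecasts


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy series, for illustration only.
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]

ses_forecasts = simple_exponential_smoothing(series, alpha=0.5)
naive_forecasts = [series[0]] + series[:-1]  # forecast = previous observation

# Score from the second point onward, where both methods make a real forecast.
ses_mae = mean_absolute_error(series[1:], ses_forecasts[1:])
naive_mae = mean_absolute_error(series[1:], naive_forecasts[1:])

print("SES MAE:", ses_mae)
print("Naive MAE:", naive_mae)
```

The same scoring function could be applied to forecasts from any of the other model families listed above, which is how a baseline comparison of this kind is usually set up.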


Author(s): Siyu Shi

This chapter introduces the history of Python and the IDEs (integrated development environments) and code editors used as development environments for it. The history traces how Python grew from the ABC programming language in the Netherlands into a community of developers from many fields, and later became one of the most popular programming languages in the world. Popular IDEs and code editors for professional developers and beginners are introduced along with their advantages and disadvantages. Later in the chapter, the authors introduce Python libraries that can be used in statistical analysis and present a simple case showing how these methods can be applied.


Author(s): Gao Niu, Alan Olinsky

This chapter demonstrates the descriptive and statistical modeling functions in R. United States automobile fatal accident data are extracted from the Fatality Analysis Reporting System (FARS). The model is used to understand the significant contributing factors to death in an automobile accident when a fatal crash happens. First, descriptive analysis is performed with basic R functions and packages. Then, a generalized linear model (GLM) with a logit link function is explored and constructed. Finally, multiple validation metrics are introduced and calculated to assess the reasonableness and accuracy of the predictions. The focus of this chapter is to demonstrate the power and flexibility of the most popular Open Source Statistical Software (OSSS) through a real data analysis.
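To make the logit link and the validation step concrete, the following sketch shows how a fitted logit-link model turns a linear predictor into a probability and how simple confusion-matrix metrics are derived from those probabilities. It is written in Python rather than R and is only an assumed illustration: the intercept, coefficient, feature values, labels, and the 0.5 threshold are hypothetical and do not come from the FARS data or the chapter's model.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor eta to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_probability(intercept, coefs, features):
    """Probability of the positive class under a logit-link GLM."""
    eta = intercept + sum(b * x for b, x in zip(coefs, features))
    return inverse_logit(eta)


def confusion_counts(y_true, probs, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical coefficients and observations, for illustration only.
intercept, coefs = -1.0, [2.0]
rows = [[0.0], [1.0], [0.2], [0.9]]
y_true = [0, 1, 1, 1]

probs = [predict_probability(intercept, coefs, row) for row in rows]
tp, fp, tn, fn = confusion_counts(y_true, probs)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

In R, the fitting step itself would typically be done with `glm()` and a binomial family; the sketch above covers only the prediction and validation arithmetic.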


Author(s): Richard S. Segall

This chapter discusses what Open Source Software is, its relationship to Big Data, how it differs from other types of software, and its software development cycle. Open source software (OSS) is computer software whose source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Big Data are data sets so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big Data can be discrete or a continuous data stream and are accessible from many types of computing devices, ranging from supercomputers and personal workstations to mobile devices and tablets. The chapter discusses how fog computing can be performed with cloud computing for visualization of Big Data. It also presents a summary of additional web-based Big Data visualization software.


Author(s): Zhijian Wu, Zichen Zhao, Gao Niu

This chapter first introduces the two most popular Open Source Statistical Software (OSSS) packages, R and Python, along with their Integrated Development Environments (IDEs) and Graphical User Interfaces (GUIs). It then introduces additional OSSS, such as JASP, PSPP, GRETL, SOFA Statistics, Octave, KNIME, and Scilab, with function descriptions and modeling examples. The chapter is intended as a reference to help readers select appropriate Open Source Software when a statistical analysis task arises. Each software package is described in words, and its working platform and selected numerical, descriptive, and analysis examples are provided, so that readers can gain a direct and in-depth understanding of each package and its functional highlights.


Author(s): Richard S. Segall

This chapter discusses Open Source Software and associated technologies for the processing of Big Data. It covers Hadoop-related projects and the current top open source data tools and frameworks, such as SMACK, an acronym for the open source technologies Spark, Mesos, Akka, Cassandra, and Kafka, which together compose the ingestion, aggregation, analysis, and storage layers for Big Data processing. Tabular summaries and categories are provided for 38 Open Source Statistical Software (OSSS) packages, listing for each its features and URLs for free downloads. The current challenges of Big Data and Open Source Software are also discussed.


Author(s): Alicia Taylor Lamere

This chapter discusses several popular clustering functions and open source software packages in R and the feasibility of using them on larger datasets. These include the kmeans() function, the pvclust package, and the DBSCAN (density-based spatial clustering of applications with noise) package, which implement K-means, hierarchical, and density-based clustering, respectively. Dimension reduction methods such as PCA (principal component analysis) and SVD (singular value decomposition), as well as the choice of distance measure, are explored as ways to improve the performance of hierarchical and model-based clustering methods on larger datasets. These methods are illustrated through an application to a dataset of RNA-sequencing expression data for cancer patients obtained from the Cancer Genome Atlas Kidney Clear Cell Carcinoma (TCGA-KIRC) data collection from The Cancer Imaging Archive (TCIA).
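For readers unfamiliar with K-means, the sketch below shows the basic assign-and-update loop (Lloyd's algorithm) that K-means implementations are generally built around. It is a plain Python illustration, not R's kmeans() and not the chapter's code: the function names, the fixed starting centroids, and the six two-dimensional points are assumptions chosen so the example is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps.

    Returns (labels, centroids). Starting centroids are supplied by the
    caller, so results are reproducible.
    """
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(labels)
print(centroids)
```

Each iteration computes a distance from every point to every centroid, so the cost grows with the number of points, clusters, and dimensions; this is one reason dimension reduction before clustering can help on larger datasets.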

