A Novel Clustering-Based Sampling Approach for Minimum Sample Set in Big Data Environment

Author(s):  
Jia Zhao ◽  
Jia Sun ◽  
Yunan Zhai ◽  
Yan Ding ◽  
Chunyi Wu ◽  
...  

Data volumes are expanding rapidly, which makes it difficult to extract valuable information from big data. Most existing data mining algorithms handle big data at a large cost in time and space. This paper focuses on the sampling problem of big data and puts forward an efficient heuristic Cluster Sampling Algorithm, called CSA. Much prior work adopts random sampling to extract an initial sample set from the original data and then applies various processing steps to the sample in order to obtain the corresponding minimum sample set, which is treated as a representation of the original big data set. However, the final processing results are then heavily dependent on the initial random sampling, which lowers the comprehensiveness and quality of the final results and lengthens processing time. Motivated by this observation, CSA introduces clustering to obtain the minimum sample set of big data, in contrast to the random sampling methods in the current literature. CSA performs cluster analysis on the original data set and selects the center of each cluster as a member of the minimum sample set. The aim is to ensure that the sample distribution accords with the characteristics of the original data, to guarantee data integrity, and to reduce processing time. The max–min distance method from pattern recognition is integrated into the clustering process to obtain the cluster centers and to prevent the algorithm from falling into a local optimum. Experimental results show that, compared with existing work, the CSA algorithm efficiently reflects the characteristics of the original data and reduces data processing time. The obtained minimum sample set also performs well in classification algorithms.
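
A minimal sketch of the clustering-based sampling idea described above (not the authors' implementation, and the function names are ours): max–min distance seeding picks well-spread initial centers, k-means refines them, and the real point nearest each center becomes a member of the minimum sample set.

```python
import numpy as np
from sklearn.cluster import KMeans

def max_min_distance_seeds(X, k, rng):
    """Pick k seeds so each new seed maximizes its distance to chosen seeds."""
    seeds = [rng.integers(len(X))]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(X[:, None] - X[seeds], axis=2), axis=1)
        seeds.append(int(np.argmax(d)))  # farthest point from current seeds
    return X[seeds]

def minimum_sample_set(X, k, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, init=max_min_distance_seeds(X, k, rng), n_init=1)
    labels = km.fit_predict(X)
    # Representative = the real data point closest to each cluster center.
    reps = [X[labels == c][np.argmin(np.linalg.norm(
        X[labels == c] - km.cluster_centers_[c], axis=1))] for c in range(k)]
    return np.array(reps)

X = np.random.default_rng(1).normal(size=(10_000, 4))
print(minimum_sample_set(X, k=50).shape)  # (50, 4)
```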

2019 ◽  
Vol 47 (6) ◽  
pp. 981-996
Author(s):  
Wangshu Mu ◽  
Daoqin Tong

Incorporating big data in urban planning has great potential for better modeling of urban dynamics and more efficient allocation of limited resources. However, big data may present new challenges for problem solution. This research focuses on the p-median problem, one of the most widely used location models in urban and regional planning. Like many other location models, the p-median problem is non-deterministic polynomial-time hard (NP-hard), and solving large-sized p-median problems is difficult. This research proposes a high-performance-computing-based algorithm, random sampling and spatial voting, to solve large-sized p-median problems. Instead of solving a large p-median problem directly, a random sampling scheme is introduced to create smaller sub-p-median problems that can be solved in parallel efficiently. A spatial voting strategy is designed to evaluate the candidate facility sites for inclusion in the final problem solution. Tests with the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) data set show that random sampling and spatial voting provides high-quality solutions and reduces computing time significantly. Tests also demonstrate the dynamic scalability of the algorithm: it can start with a small amount of computing resources and scale up and down flexibly depending on the availability of those resources.
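
A hedged sketch of the sampling-and-voting idea (not the authors' exact algorithm; the greedy solver below is a simple stand-in for a proper p-median solver): many small sampled sub-problems are solved independently, each solution "votes" for the sites it selected, and a final problem is solved over only the most-voted candidates.

```python
import numpy as np

def greedy_p_median(D, p):
    """Greedily pick p facility rows of distance matrix D minimizing total cost."""
    chosen = []
    for _ in range(p):
        best = min((i for i in range(D.shape[0]) if i not in chosen),
                   key=lambda i: D[chosen + [i]].min(axis=0).sum())
        chosen.append(best)
    return chosen

def sample_and_vote(points, p, n_sub=20, sub_size=200, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(points))
    for _ in range(n_sub):  # each sub-problem could run on a separate worker
        idx = rng.choice(len(points), sub_size, replace=False)
        D = np.linalg.norm(points[idx][:, None] - points[idx], axis=2)
        for j in greedy_p_median(D, p):
            votes[idx[j]] += 1
    # Final problem: only the 5*p most-voted sites compete as facilities.
    cand = np.argsort(votes)[-5 * p:]
    D = np.linalg.norm(points[cand][:, None] - points, axis=2)
    return cand[greedy_p_median(D, p)]

pts = np.random.default_rng(2).uniform(size=(5_000, 2))
print(sample_and_vote(pts, p=10))
```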


2018 ◽  
Vol 46 (3) ◽  
pp. 147-160 ◽  
Author(s):  
Laouni Djafri ◽  
Djamel Amar Bensaber ◽  
Reda Adjoudj

Purpose: This paper aims to solve the problems of big data analytics for prediction, including volume, veracity and velocity, by improving the prediction result to an acceptable level in the shortest possible time.

Design/methodology/approach: The work is divided into two parts. The first improves the prediction result through two ideas: a double-pruning enhanced random forest algorithm, and the extraction of a shared learning base via stratified random sampling to obtain a learning base representative of all the original data. The second part designs a distributed architecture supported by new technology solutions, which works coherently and efficiently with the sampling strategy under the supervision of the MapReduce algorithm.

Findings: The representative learning base, obtained by integrating two learning bases (the partial base and the shared base), represents the original data set very well and gives very good big data predictive analytics results. These results are further supported by the improved random forest supervised learning method, which plays a key role in this context.

Originality/value: This work concerns all companies, especially those that hold large amounts of information and want to mine it to improve their knowledge of the customer and optimize their campaigns.
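
A minimal sketch of the stratified-sampling idea described above, using scikit-learn (the double-pruning random forest and the MapReduce architecture are not reproduced here): a stratified random sample preserves class proportions, so a small shared learning base stays representative of the original data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, weights=[0.9, 0.1], random_state=0)

# Stratified 5% sample: class ratios in `Xs` match those of the full data.
Xs, _, ys, _ = train_test_split(X, y, train_size=0.05, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xs, ys)
print(clf.score(X, y))  # accuracy of the sample-trained forest on the full data
```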


Computers ◽  
2019 ◽  
Vol 8 (1) ◽  
pp. 26 ◽  
Author(s):  
Marta Wlodarczyk-Sielicka ◽  
Jacek Lubczonek

At present, spatial data are often acquired using varied remote sensing sensors and systems, which produce big data sets. One significant product of these data is a digital model of geographical surfaces, including the surface of the sea floor. To improve data processing, presentation, and management, it is often indispensable to reduce the number of data points. This paper presents research on the application of artificial neural networks to bathymetric data reduction, considering results from radial basis function (RBF) networks and self-organizing Kohonen networks. During reconstruction of the seabed model, the results show that neural networks with fewer hidden neurons than the number of data points can replicate the original data set, while the Kohonen network can be used for clustering during big geodata reduction. Practical implementations of neural networks capable of creating surface models and reducing bathymetric data are presented.
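
A compact, self-contained sketch of Kohonen-style reduction (an assumed setup, not the authors' network): each neuron's weight vector converges toward a cluster of soundings, so the trained weight vectors act as the reduced data set.

```python
import numpy as np

def train_som(points, n_neurons=100, epochs=10, lr0=0.5, sigma0=None, seed=0):
    rng = np.random.default_rng(seed)
    side = int(np.sqrt(n_neurons))
    grid = np.array([(i, j) for i in range(side) for j in range(side)])
    W = points[rng.choice(len(points), side * side, replace=False)].astype(float)
    sigma0 = sigma0 or side / 2
    T = epochs * len(points)
    t = 0
    for _ in range(epochs):
        for x in points[rng.permutation(len(points))]:
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # best-matching unit
            lr = lr0 * np.exp(-t / T)
            sigma = sigma0 * np.exp(-t / T)
            h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * sigma**2))
            W += lr * h[:, None] * (x - W)                   # pull neighborhood
            t += 1
    return W  # reduced point set: one representative per neuron

xyz = np.random.default_rng(3).uniform(size=(20_000, 3))  # stand-in soundings
print(train_som(xyz, n_neurons=64).shape)  # (64, 3): ~300x reduction
```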


1994 ◽  
Vol 144 ◽  
pp. 139-141 ◽  
Author(s):  
J. Rybák ◽  
V. Rušin ◽  
M. Rybanský

Abstract. Fe XIV 530.3 nm coronal emission line observations have been used to estimate the rotation of the green solar corona. A homogeneous data set, created from measurements of the world-wide coronagraphic network, has been examined with the help of correlation analysis to reveal the averaged synodic rotation period as a function of latitude and time over the epoch from 1947 to 1991.

The values of the synodic rotation period obtained for this epoch are 27.52±0.12 days for the whole range of latitudes and 26.95±0.21 days for the latitude band ±30°, respectively. A differential rotation of the green solar corona, with local period maxima around ±60° and a minimum of the rotation period at the equator, was confirmed. No clear cyclic variation of the rotation has been found for the examined epoch, but monotonic trends over some time intervals are presented.

A detailed investigation of the original data and their correlation functions has shown that sufficiently reliable tracers are not evident in the whole examined data set. This should be taken into account in future, more precise estimations of the green corona rotation period.
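
For illustration of the correlation-analysis step only (the study's data and method are far richer, and this synthetic series is our assumption): a rotation period can be estimated from a daily intensity series by locating the first peak of its autocorrelation function after the zero crossing.

```python
import numpy as np

def rotation_period(series):
    x = series - series.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf /= acf[0]
    first_neg = np.argmax(acf < 0)                 # first zero crossing
    return first_neg + np.argmax(acf[first_neg:first_neg + 60])  # peak lag, days

days = np.arange(2000)
signal = np.sin(2 * np.pi * days / 27.5) + np.random.default_rng(4).normal(0, .5, 2000)
print(rotation_period(signal))  # close to 27.5 for this synthetic series
```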


Author(s):  
Wendy J. Schiller ◽  
Charles Stewart III

From 1789 to 1913, U.S. senators were not directly elected by the people—instead the Constitution mandated that they be chosen by state legislators. This radically changed in 1913, when the Seventeenth Amendment to the Constitution was ratified, giving the public a direct vote. This book investigates the electoral connections among constituents, state legislators, political parties, and U.S. senators during the age of indirect elections. The book finds that even though parties controlled the partisan affiliation of the winning candidate for Senate, they had much less control over the universe of candidates who competed for votes in Senate elections and the parties did not always succeed in resolving internal conflict among their rank and file. Party politics, money, and personal ambition dominated the election process, in a system originally designed to insulate the Senate from public pressure. The book uses an original data set of all the roll call votes cast by state legislators for U.S. senators from 1871 to 1913 and all state legislators who served during this time. Newspaper and biographical accounts uncover vivid stories of the political maneuvering, corruption, and partisanship—played out by elite political actors, from elected officials, to party machine bosses, to wealthy business owners—that dominated the indirect Senate elections process. The book raises important questions about the effectiveness of Constitutional reforms, such as the Seventeenth Amendment, that promised to produce a more responsive and accountable government.


Author(s):  
Ying Wang ◽  
Yiding Liu ◽  
Minna Xia

Big data is characterized by multiple sources and heterogeneity. Based on the Hadoop and Spark big data platforms, a hybrid forest fire analysis is built in this study. The platform combines big data analysis and processing technology and draws on research results from different technical fields, such as forest fire monitoring. In this system, the HDFS component of Hadoop stores all kinds of data, the Spark module provides various big data analysis methods, and visualization tools such as ECharts, ArcGIS and Unity3D render the analysis results. Finally, an experiment on forest fire point detection is designed to corroborate the feasibility and effectiveness of the platform and to provide guidance for follow-up research and for the establishment of a forest fire monitoring and visualized early-warning big data platform. However, the experiment has two shortcomings: more data types should be selected, and compatibility would be better if the original data could be converted to XML format. It is expected that these problems will be solved in follow-up research.
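
A hedged sketch of such a platform's data path (the HDFS path, column names, and the toy threshold rule below are our assumptions, not the paper's code): Spark reads sensor records from HDFS, filters candidate fire points, and writes results for a visualization layer.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("forest-fire-analysis").getOrCreate()

# Hypothetical HDFS path and columns (station, ts, temperature, humidity).
df = spark.read.csv("hdfs:///forest/sensors/*.csv", header=True, inferSchema=True)

# Toy fire-risk rule standing in for the paper's detection models.
hotspots = (df.filter((F.col("temperature") > 40) & (F.col("humidity") < 20))
              .groupBy("station")
              .agg(F.count("*").alias("alerts"), F.max("temperature").alias("t_max")))

hotspots.write.mode("overwrite").json("hdfs:///forest/output/hotspots")  # for ECharts/ArcGIS
spark.stop()
```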


2020 ◽  
Author(s):  
Eva Østergaard-Nielsen ◽  
Stefano Camatarri

Abstract. The role orientation of political representatives and candidates is a longstanding concern in studies of democratic representation. The growing trend of countries allowing citizens abroad to stand as candidates in homeland elections provides an interesting opportunity for understanding how international mobility and context influence ideas of representation among these emigrant candidates. In public debates, emigrant candidates are often portrayed as delegates of the emigrant constituencies. However, drawing on the paradigmatic case of Italy and an original data set comprising emigrant candidates, we show that perceptions of styles of representation abroad are more complex. Systemic differences between electoral districts at home and abroad are relevant for explaining why and how candidates develop a trustee or delegate orientation.


2021 ◽  
pp. 245513332110316
Author(s):  
Tiken Das ◽  
Pradyut Guha ◽  
Diganta Das

This study attempts to answer the question: do the heterogeneous determinants of repayment affect the borrowers of diverse credit sources differently? The study is based on data collected from 240 households in three districts of the lower Brahmaputra valley of Assam through a carefully designed primary survey. It uses the double hurdle approach and the instrumental variable probit model to reduce possible selection bias. It observes better repayment performance among formal borrowers, followed by semiformal borrowers; occupation-wise, repayment performance is most prominent among organised employees. In general, household characteristics, loan characteristics and location-specific characteristics significantly affect the repayment performance of borrowers, but the nature of the impact of these factors differs remarkably across credit sources. The study ignores the role of traditional community-based organisations in rural Assam while analysing the determinants of repayment performance. It also recommends ensuring productive opportunities and efficient market linkages in rural areas of Assam. The study is based on an original data set collected specifically to examine this question, which has not been studied before in the lower Brahmaputra valley of Assam.
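
An illustrative two-stage (Cragg-style) double hurdle sketch on synthetic data; the variable names are hypothetical and the study's instrumental-variable probit stage for selection bias is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 240
income = rng.normal(10, 2, n)
loan_size = rng.normal(5, 1, n)
X = sm.add_constant(np.column_stack([income, loan_size]))

borrows = (X @ [0.1, 0.15, -0.1] + rng.normal(size=n)) > 0       # hurdle 1
repay = np.where(borrows, X @ [0.5, 0.3, -0.2] + rng.normal(size=n), np.nan)

# Stage 1: probit for whether the household borrows at all.
hurdle1 = sm.Probit(borrows.astype(int), X).fit(disp=0)
# Stage 2: repayment performance, modeled only for actual borrowers.
hurdle2 = sm.OLS(repay[borrows], X[borrows]).fit()
print(hurdle1.params, hurdle2.params, sep="\n")
```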


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Hossein Ahmadvand ◽  
Fouzhan Foroutan ◽  
Mahmood Fathy

Abstract. Data variety is one of the most important features of big data, resulting from the aggregation of data from multiple sources and the uneven distribution of data. This feature causes high variation in the consumption of processing resources such as CPU, an issue that has been overlooked in previous works. To overcome this problem, the present work uses Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation, considering two types of deadlines as constraints. Before applying the DVFS technique to compute nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we use a set of data sets and applications. The experimental results show that our proposed approach surpasses the other scenarios in processing real data sets: DV-DVFS achieves up to 15% improvement in energy consumption.
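
A minimal sketch of the deadline-driven frequency selection described above (the numbers and the level table are our assumptions): estimate the cycles a task needs, derive the minimum frequency that meets the deadline, and round up to the nearest supported DVFS level. Dynamic power scales roughly with the cube of frequency, so running at the lowest sufficient level saves energy.

```python
FREQ_LEVELS_GHZ = [0.8, 1.2, 1.6, 2.0, 2.4]  # hypothetical P-states

def pick_frequency(estimated_cycles, deadline_s):
    f_needed = estimated_cycles / deadline_s / 1e9  # GHz to finish on time
    for f in FREQ_LEVELS_GHZ:                       # lowest level that suffices
        if f >= f_needed:
            return f
    raise ValueError("deadline infeasible at max frequency")

cycles = 3.0e9          # e.g., estimated from profiling a sample of the input
deadline = 2.0          # seconds
f = pick_frequency(cycles, deadline)
print(f, (f / FREQ_LEVELS_GHZ[-1]) ** 3)  # chosen GHz and relative power (~f^3)
```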


2021 ◽  
Vol 11 (5) ◽  
pp. 2166
Author(s):  
Van Bui ◽  
Tung Lam Pham ◽  
Huy Nguyen ◽  
Yeong Min Jang

In the last decade, predictive maintenance has attracted much attention in industrial factories because of the wide use of the Internet of Things and artificial intelligence algorithms for data management. However, in the early phases, when abnormal and faulty machines rarely appear, factories have only limited sets of machine fault samples. With limited fault samples, it is difficult to train a fault classification model because of the imbalance of the input data. Data augmentation is therefore required to increase the accuracy of the learning model, yet methods to generate and evaluate such data for analysis have been limited. In this paper, we introduce a method that uses a generative adversarial network (GAN) to augment fault signals and enrich the data set. The enhanced data set can increase the accuracy of the machine fault detection model during training. We also perform fault detection using a variety of preprocessing approaches and classification models to evaluate the similarity between the generated data and authentic data. The generated fault data show high similarity with the original data and significantly improve the accuracy of the model: fault machine detection accuracy reaches 99.41% with 20% of the original fault machine data set and 93.1% with 0% of the original fault machine data set (using generated data only). Based on this, we conclude that generated data can be mixed with original data to improve model performance.
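
A bare-bones GAN sketch for 1-D fault-signal augmentation (the architecture, sizes, and stand-in signal are illustrative assumptions, not the paper's network): the generator maps noise to synthetic signals while the discriminator learns to tell them from real ones.

```python
import torch
import torch.nn as nn

SIG_LEN, NOISE = 128, 32
G = nn.Sequential(nn.Linear(NOISE, 256), nn.ReLU(), nn.Linear(256, SIG_LEN), nn.Tanh())
D = nn.Sequential(nn.Linear(SIG_LEN, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g, opt_d = (torch.optim.Adam(m.parameters(), lr=2e-4) for m in (G, D))
bce = nn.BCELoss()

real = torch.sin(torch.linspace(0, 20, SIG_LEN)).repeat(64, 1)  # stand-in fault signals

for step in range(200):
    z = torch.randn(64, NOISE)
    fake = G(z)
    # Discriminator: real -> 1, fake -> 0.
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: fool D into scoring fakes as real.
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

augmented = G(torch.randn(1000, NOISE)).detach()  # synthetic signals to mix in
print(augmented.shape)  # torch.Size([1000, 128])
```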

