A Novel Clustering-Based Sampling Approach for Minimum Sample Set in Big Data Environment

Author(s):  
Jia Zhao ◽  
Jia Sun ◽  
Yunan Zhai ◽  
Yan Ding ◽  
Chunyi Wu ◽  
...  

Data volumes are expanding rapidly, which makes it difficult to extract valuable information from big data. Most existing data mining algorithms handle big data at a large cost in time and space. This paper focuses on the sampling problem of big data and puts forward an efficient heuristic Cluster Sampling Algorithm, called CSA. Much prior work adopts random sampling to extract an initial sample set from the original data and then applies various processing steps to the sample in order to obtain the corresponding minimum sample set, which is treated as a representation of the original big data set. However, the final processing results are then heavily dependent on the initial random sampling, which lowers the comprehensiveness and quality of the final results and lengthens processing time. Motivated by this observation, CSA introduces clustering to obtain the minimum sample set of big data, in contrast to the random sampling methods in the current literature. CSA performs cluster analysis on the original data set and selects the center of each cluster as a member of the minimum sample set. The aim is to ensure that the sample distribution accords with the characteristics of the original data, to guarantee data integrity, and to reduce processing time. The max–min distance method from pattern recognition is integrated into the clustering process to obtain the cluster centers and to prevent the algorithm from falling into a local optimum. Experimental results show that, compared with existing work, the CSA algorithm efficiently reflects the characteristics of the original data and reduces data processing time. The obtained minimum sample set also performs well in classification algorithms.
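
A minimal sketch of the clustering-based sampling idea described above (not the authors' implementation, and the function names are ours): max–min distance seeding picks well-spread initial centers, k-means refines them, and the real point nearest each center becomes a member of the minimum sample set.

```python
import numpy as np
from sklearn.cluster import KMeans

def max_min_distance_seeds(X, k, rng):
    """Pick k seeds so each new seed maximizes its distance to chosen seeds."""
    seeds = [rng.integers(len(X))]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(X[:, None] - X[seeds], axis=2), axis=1)
        seeds.append(int(np.argmax(d)))  # farthest point from current seeds
    return X[seeds]

def minimum_sample_set(X, k, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, init=max_min_distance_seeds(X, k, rng), n_init=1)
    labels = km.fit_predict(X)
    # Representative = the real data point closest to each cluster center.
    reps = [X[labels == c][np.argmin(np.linalg.norm(
        X[labels == c] - km.cluster_centers_[c], axis=1))] for c in range(k)]
    return np.array(reps)

X = np.random.default_rng(1).normal(size=(10_000, 4))
print(minimum_sample_set(X, k=50).shape)  # (50, 4)
```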

2019 ◽  
Vol 47 (6) ◽  
pp. 981-996
Author(s):  
Wangshu Mu ◽  
Daoqin Tong

Incorporating big data in urban planning has great potential for better modeling of urban dynamics and more efficient allocation of limited resources. However, big data may present new challenges for problem solution. This research focuses on the p-median problem, one of the most widely used location models in urban and regional planning. Like many other location models, the p-median problem is non-deterministic polynomial-time hard (NP-hard), and solving large-sized p-median problems is difficult. This research proposes a high-performance-computing-based algorithm, random sampling and spatial voting, to solve large-sized p-median problems. Instead of solving a large p-median problem directly, a random sampling scheme is introduced to create smaller sub-p-median problems that can be solved in parallel efficiently. A spatial voting strategy is designed to evaluate the candidate facility sites for inclusion in the final problem solution. Tests with the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) data set show that random sampling and spatial voting provides high-quality solutions and reduces computing time significantly. Tests also demonstrate the dynamic scalability of the algorithm: it can start with a small amount of computing resources and scale up and down flexibly depending on the availability of those resources.
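
A hedged sketch of the sampling-and-voting idea (not the authors' exact algorithm; the greedy solver below is a simple stand-in for a proper p-median solver): many small sampled sub-problems are solved independently, each solution "votes" for the sites it selected, and a final problem is solved over only the most-voted candidates.

```python
import numpy as np

def greedy_p_median(D, p):
    """Greedily pick p facility rows of distance matrix D minimizing total cost."""
    chosen = []
    for _ in range(p):
        best = min((i for i in range(D.shape[0]) if i not in chosen),
                   key=lambda i: D[chosen + [i]].min(axis=0).sum())
        chosen.append(best)
    return chosen

def sample_and_vote(points, p, n_sub=20, sub_size=200, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(points))
    for _ in range(n_sub):  # each sub-problem could run on a separate worker
        idx = rng.choice(len(points), sub_size, replace=False)
        D = np.linalg.norm(points[idx][:, None] - points[idx], axis=2)
        for j in greedy_p_median(D, p):
            votes[idx[j]] += 1
    # Final problem: only the 5*p most-voted sites compete as facilities.
    cand = np.argsort(votes)[-5 * p:]
    D = np.linalg.norm(points[cand][:, None] - points, axis=2)
    return cand[greedy_p_median(D, p)]

pts = np.random.default_rng(2).uniform(size=(5_000, 2))
print(sample_and_vote(pts, p=10))
```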


2018 ◽  
Vol 46 (3) ◽  
pp. 147-160 ◽  
Author(s):  
Laouni Djafri ◽  
Djamel Amar Bensaber ◽  
Reda Adjoudj

Purpose: This paper aims to solve the problems of big data analytics for prediction, including volume, veracity and velocity, by improving the prediction result to an acceptable level in the shortest possible time.

Design/methodology/approach: The work is divided into two parts. The first improves the prediction result through two ideas: a double-pruning enhanced random forest algorithm, and the extraction of a shared learning base via stratified random sampling to obtain a learning base representative of all the original data. The second part designs a distributed architecture supported by new technology solutions, which works coherently and efficiently with the sampling strategy under the supervision of the MapReduce algorithm.

Findings: The representative learning base, obtained by integrating two learning bases (the partial base and the shared base), represents the original data set very well and gives very good big data predictive analytics results. These results are further supported by the improved random forest supervised learning method, which plays a key role in this context.

Originality/value: This work concerns all companies, especially those that hold large amounts of information and want to mine it to improve their knowledge of the customer and optimize their campaigns.
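
A minimal sketch of the stratified-sampling idea described above, using scikit-learn (the double-pruning random forest and the MapReduce architecture are not reproduced here): a stratified random sample preserves class proportions, so a small shared learning base stays representative of the original data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, weights=[0.9, 0.1], random_state=0)

# Stratified 5% sample: class ratios in `Xs` match those of the full data.
Xs, _, ys, _ = train_test_split(X, y, train_size=0.05, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xs, ys)
print(clf.score(X, y))  # accuracy of the sample-trained forest on the full data
```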


Computers ◽  
2019 ◽  
Vol 8 (1) ◽  
pp. 26 ◽  
Author(s):  
Marta Wlodarczyk-Sielicka ◽  
Jacek Lubczonek

At present, spatial data are often acquired using varied remote sensing sensors and systems, which produce big data sets. One significant product of these data is a digital model of geographical surfaces, including the surface of the sea floor. To improve data processing, presentation, and management, it is often indispensable to reduce the number of data points. This paper presents research on the application of artificial neural networks to bathymetric data reduction, considering results from radial basis function (RBF) networks and self-organizing Kohonen networks. During reconstruction of the seabed model, the results show that neural networks with fewer hidden neurons than the number of data points can replicate the original data set, while the Kohonen network can be used for clustering during big geodata reduction. Practical implementations of neural networks capable of creating surface models and reducing bathymetric data are presented.
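
A compact, self-contained sketch of Kohonen-style reduction (an assumed setup, not the authors' network): each neuron's weight vector converges toward a cluster of soundings, so the trained weight vectors act as the reduced data set.

```python
import numpy as np

def train_som(points, n_neurons=100, epochs=10, lr0=0.5, sigma0=None, seed=0):
    rng = np.random.default_rng(seed)
    side = int(np.sqrt(n_neurons))
    grid = np.array([(i, j) for i in range(side) for j in range(side)])
    W = points[rng.choice(len(points), side * side, replace=False)].astype(float)
    sigma0 = sigma0 or side / 2
    T = epochs * len(points)
    t = 0
    for _ in range(epochs):
        for x in points[rng.permutation(len(points))]:
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # best-matching unit
            lr = lr0 * np.exp(-t / T)
            sigma = sigma0 * np.exp(-t / T)
            h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * sigma**2))
            W += lr * h[:, None] * (x - W)                   # pull neighborhood
            t += 1
    return W  # reduced point set: one representative per neuron

xyz = np.random.default_rng(3).uniform(size=(20_000, 3))  # stand-in soundings
print(train_som(xyz, n_neurons=64).shape)  # (64, 3): ~300x reduction
```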


1994 ◽  
Vol 144 ◽  
pp. 139-141 ◽  
Author(s):  
J. Rybák ◽  
V. Rušin ◽  
M. Rybanský

Abstract. Fe XIV 530.3 nm coronal emission line observations have been used to estimate the rotation of the green solar corona. A homogeneous data set, created from measurements of the world-wide coronagraphic network, has been examined with the help of correlation analysis to reveal the averaged synodic rotation period as a function of latitude and time over the epoch from 1947 to 1991.

The values of the synodic rotation period obtained for this epoch are 27.52±0.12 days for the whole range of latitudes and 26.95±0.21 days for the latitude band ±30°, respectively. A differential rotation of the green solar corona, with local period maxima around ±60° and a minimum of the rotation period at the equator, was confirmed. No clear cyclic variation of the rotation has been found for the examined epoch, but monotonic trends over some time intervals are presented.

A detailed investigation of the original data and their correlation functions has shown that sufficiently reliable tracers are not evident in the whole examined data set. This should be taken into account in future, more precise estimations of the green corona rotation period.
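
For illustration of the correlation-analysis step only (the study's data and method are far richer, and this synthetic series is our assumption): a rotation period can be estimated from a daily intensity series by locating the first peak of its autocorrelation function after the zero crossing.

```python
import numpy as np

def rotation_period(series):
    x = series - series.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf /= acf[0]
    first_neg = np.argmax(acf < 0)                 # first zero crossing
    return first_neg + np.argmax(acf[first_neg:first_neg + 60])  # peak lag, days

days = np.arange(2000)
signal = np.sin(2 * np.pi * days / 27.5) + np.random.default_rng(4).normal(0, .5, 2000)
print(rotation_period(signal))  # close to 27.5 for this synthetic series
```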


Author(s):  
Wendy J. Schiller ◽  
Charles Stewart III

From 1789 to 1913, U.S. senators were not directly elected by the people—instead the Constitution mandated that they be chosen by state legislators. This radically changed in 1913, when the Seventeenth Amendment to the Constitution was ratified, giving the public a direct vote. This book investigates the electoral connections among constituents, state legislators, political parties, and U.S. senators during the age of indirect elections. The book finds that even though parties controlled the partisan affiliation of the winning candidate for Senate, they had much less control over the universe of candidates who competed for votes in Senate elections and the parties did not always succeed in resolving internal conflict among their rank and file. Party politics, money, and personal ambition dominated the election process, in a system originally designed to insulate the Senate from public pressure. The book uses an original data set of all the roll call votes cast by state legislators for U.S. senators from 1871 to 1913 and all state legislators who served during this time. Newspaper and biographical accounts uncover vivid stories of the political maneuvering, corruption, and partisanship—played out by elite political actors, from elected officials, to party machine bosses, to wealthy business owners—that dominated the indirect Senate elections process. The book raises important questions about the effectiveness of Constitutional reforms, such as the Seventeenth Amendment, that promised to produce a more responsive and accountable government.


Author(s):  
Ying Wang ◽  
Yiding Liu ◽  
Minna Xia

Big data is characterized by multiple sources and heterogeneity. Based on the Hadoop and Spark big data platforms, a hybrid forest fire analysis is built in this study. The platform combines big data analysis and processing technology and draws on research results from different technical fields, such as forest fire monitoring. In this system, the HDFS component of Hadoop stores all kinds of data, the Spark module provides various big data analysis methods, and visualization tools such as ECharts, ArcGIS and Unity3D render the analysis results. Finally, an experiment on forest fire point detection is designed to corroborate the feasibility and effectiveness of the platform and to provide guidance for follow-up research and for the establishment of a forest fire monitoring and visualized early-warning big data platform. However, the experiment has two shortcomings: more data types should be selected, and compatibility would be better if the original data could be converted to XML format. It is expected that these problems will be solved in follow-up research.
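
A hedged sketch of such a platform's data path (the HDFS path, column names, and the toy threshold rule below are our assumptions, not the paper's code): Spark reads sensor records from HDFS, filters candidate fire points, and writes results for a visualization layer.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("forest-fire-analysis").getOrCreate()

# Hypothetical HDFS path and columns (station, ts, temperature, humidity).
df = spark.read.csv("hdfs:///forest/sensors/*.csv", header=True, inferSchema=True)

# Toy fire-risk rule standing in for the paper's detection models.
hotspots = (df.filter((F.col("temperature") > 40) & (F.col("humidity") < 20))
              .groupBy("station")
              .agg(F.count("*").alias("alerts"), F.max("temperature").alias("t_max")))

hotspots.write.mode("overwrite").json("hdfs:///forest/output/hotspots")  # for ECharts/ArcGIS
spark.stop()
```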


2020 ◽  
Author(s):  
Eva Østergaard-Nielsen ◽  
Stefano Camatarri

Abstract. The role orientation of political representatives and candidates is a longstanding concern in studies of democratic representation. The growing trend of countries allowing citizens abroad to stand as candidates in homeland elections provides an interesting opportunity for understanding how international mobility and context influence ideas of representation among these emigrant candidates. In public debates, emigrant candidates are often portrayed as delegates of the emigrant constituencies. However, drawing on the paradigmatic case of Italy and an original data set comprising emigrant candidates, we show that perceptions of styles of representation abroad are more complex. Systemic differences between electoral districts at home and abroad are relevant for explaining why and how candidates develop a trustee or delegate orientation.


2021 ◽  
pp. 245513332110316
Author(s):  
Tiken Das ◽  
Pradyut Guha ◽  
Diganta Das

This study attempts to answer the question: do the heterogeneous determinants of repayment affect the borrowers of diverse credit sources differently? The study is based on data collected from 240 households in three districts of the lower Brahmaputra valley of Assam through a carefully designed primary survey. It uses the double hurdle approach and the instrumental variable probit model to reduce possible selection bias. It observes better repayment performance among formal borrowers, followed by semiformal borrowers; occupation-wise, repayment performance is most prominent among organised employees. In general, household characteristics, loan characteristics and location-specific characteristics significantly affect the repayment performance of borrowers, but the nature of the impact of these factors differs remarkably across credit sources. The study ignores the role of traditional community-based organisations in rural Assam while analysing the determinants of repayment performance. It also recommends ensuring productive opportunities and efficient market linkages in rural areas of Assam. The study is based on an original data set collected specifically to examine this question, which has not been studied before in the lower Brahmaputra valley of Assam.
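
An illustrative two-stage (Cragg-style) double hurdle sketch on synthetic data; the variable names are hypothetical and the study's instrumental-variable probit stage for selection bias is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 240
income = rng.normal(10, 2, n)
loan_size = rng.normal(5, 1, n)
X = sm.add_constant(np.column_stack([income, loan_size]))

borrows = (X @ [0.1, 0.15, -0.1] + rng.normal(size=n)) > 0       # hurdle 1
repay = np.where(borrows, X @ [0.5, 0.3, -0.2] + rng.normal(size=n), np.nan)

# Stage 1: probit for whether the household borrows at all.
hurdle1 = sm.Probit(borrows.astype(int), X).fit(disp=0)
# Stage 2: repayment performance, modeled only for actual borrowers.
hurdle2 = sm.OLS(repay[borrows], X[borrows]).fit()
print(hurdle1.params, hurdle2.params, sep="\n")
```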


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Hossein Ahmadvand ◽  
Fouzhan Foroutan ◽  
Mahmood Fathy

Abstract. Data variety is one of the most important features of big data, resulting from the aggregation of data from multiple sources and the uneven distribution of data. This feature causes high variation in the consumption of processing resources such as CPU, an issue that has been overlooked in previous works. To overcome this problem, the present work uses Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation, considering two types of deadlines as constraints. Before applying the DVFS technique to compute nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we use a set of data sets and applications. The experimental results show that our proposed approach surpasses the other scenarios in processing real data sets: DV-DVFS achieves up to 15% improvement in energy consumption.
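
A minimal sketch of the deadline-driven frequency selection described above (the numbers and the level table are our assumptions): estimate the cycles a task needs, derive the minimum frequency that meets the deadline, and round up to the nearest supported DVFS level. Dynamic power scales roughly with the cube of frequency, so running at the lowest sufficient level saves energy.

```python
FREQ_LEVELS_GHZ = [0.8, 1.2, 1.6, 2.0, 2.4]  # hypothetical P-states

def pick_frequency(estimated_cycles, deadline_s):
    f_needed = estimated_cycles / deadline_s / 1e9  # GHz to finish on time
    for f in FREQ_LEVELS_GHZ:                       # lowest level that suffices
        if f >= f_needed:
            return f
    raise ValueError("deadline infeasible at max frequency")

cycles = 3.0e9          # e.g., estimated from profiling a sample of the input
deadline = 2.0          # seconds
f = pick_frequency(cycles, deadline)
print(f, (f / FREQ_LEVELS_GHZ[-1]) ** 3)  # chosen GHz and relative power (~f^3)
```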


2021 ◽  
Vol 11 (5) ◽  
pp. 2166
Author(s):  
Van Bui ◽  
Tung Lam Pham ◽  
Huy Nguyen ◽  
Yeong Min Jang

In the last decade, predictive maintenance has attracted much attention in industrial factories because of the wide use of the Internet of Things and artificial intelligence algorithms for data management. However, in the early phases, when abnormal and faulty machines rarely appear, factories have only limited sets of machine fault samples. With limited fault samples, it is difficult to train a fault classification model because of the imbalance of the input data. Data augmentation is therefore required to increase the accuracy of the learning model, yet methods to generate and evaluate such data for analysis have been limited. In this paper, we introduce a method that uses a generative adversarial network (GAN) to augment fault signals and enrich the data set. The enhanced data set can increase the accuracy of the machine fault detection model during training. We also perform fault detection using a variety of preprocessing approaches and classification models to evaluate the similarity between the generated data and authentic data. The generated fault data show high similarity with the original data and significantly improve the accuracy of the model: fault machine detection accuracy reaches 99.41% with 20% of the original fault machine data set and 93.1% with 0% of the original fault machine data set (using generated data only). Based on this, we conclude that generated data can be mixed with original data to improve model performance.
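
A bare-bones GAN sketch for 1-D fault-signal augmentation (the architecture, sizes, and stand-in signal are illustrative assumptions, not the paper's network): the generator maps noise to synthetic signals while the discriminator learns to tell them from real ones.

```python
import torch
import torch.nn as nn

SIG_LEN, NOISE = 128, 32
G = nn.Sequential(nn.Linear(NOISE, 256), nn.ReLU(), nn.Linear(256, SIG_LEN), nn.Tanh())
D = nn.Sequential(nn.Linear(SIG_LEN, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g, opt_d = (torch.optim.Adam(m.parameters(), lr=2e-4) for m in (G, D))
bce = nn.BCELoss()

real = torch.sin(torch.linspace(0, 20, SIG_LEN)).repeat(64, 1)  # stand-in fault signals

for step in range(200):
    z = torch.randn(64, NOISE)
    fake = G(z)
    # Discriminator: real -> 1, fake -> 0.
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: fool D into scoring fakes as real.
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

augmented = G(torch.randn(1000, NOISE)).detach()  # synthetic signals to mix in
print(augmented.shape)  # torch.Size([1000, 128])
```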

