Statistical Evaluation of Large-Scale Data Logistics System

MENDEL ◽  
2018 ◽  
Vol 24 (2) ◽  
pp. 9-16
Author(s):  
Radovan Somplak ◽  
Zlata Smidova ◽  
Veronika Smejkalova ◽  
Vlastimir Nevrly

Data recording is prone to errors, which worsen the accuracy of follow-up calculations. Achieving satisfactory results requires data processing that eliminates the influence of these errors. This paper applies a data reconciliation technique to mine data recorded from moving vehicles. The database collects information about the start and end points of each route (GPS coordinates) and its total duration. The presented methodology smooths the available data and yields an estimate of the transportation time through individual parts of the entire recorded route. This provides valuable information which can be used for further transportation planning. First, the proposed mathematical model is tested on a simplified example. The real data application requires preprocessing in which anticipated routes are designed; the database is thus supplemented with information on the probable speed of the vehicle. The mathematical model is based on weighted least squares data reconciliation, organized iteratively. Because the calculation is time-consuming, a linearised model is computed to initialize the values for the complex model. Attention is also paid to the weight setting: the weighting system is designed to reflect the quality of specific data and the dependence on traffic frequency. In this respect, the model is not strict, which leaves the possibility to adapt to the current data. The case study focuses on the GPS data of shipping vehicles in a particular city in the Czech Republic with several types of roads.
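As a rough illustration of the reconciliation step, the sketch below (hypothetical segment and trip data, not the authors' case study) adjusts prior segment travel times so that they exactly reproduce the recorded trip durations, using the standard weighted least squares reconciliation solution.

```python
# A minimal weighted least squares data reconciliation sketch (assumed setup,
# not the authors' exact model): each recorded trip gives a total duration that
# must equal the sum of the (unknown) travel times of the road segments it uses.
import numpy as np

# Hypothetical example: 4 road segments, 3 recorded trips.
# A[i, j] = 1 if trip i traverses segment j.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 1, 1, 1]], dtype=float)
d = np.array([300.0, 250.0, 640.0])          # observed trip durations [s]
t0 = np.array([160.0, 130.0, 110.0, 220.0])  # prior segment times, e.g. length / expected speed
w = np.array([1.0, 2.0, 2.0, 0.5])           # weights reflecting data quality

# Reconciled times minimize sum_j w_j (t_j - t0_j)^2 subject to A t = d.
# Lagrangian solution: t = t0 - W^-1 A^T (A W^-1 A^T)^-1 (A t0 - d).
W_inv = np.diag(1.0 / w)
correction = W_inv @ A.T @ np.linalg.solve(A @ W_inv @ A.T, A @ t0 - d)
t = t0 - correction

print("reconciled segment times:", np.round(t, 1))
print("trip residuals:", np.round(A @ t - d, 6))  # ~0: constraints satisfied
```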

Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only acquires efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a "blind" feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in both accuracy and efficiency, under both sequential and parallel conditions.
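The sketch below illustrates the two-phase structure under simplified assumptions: a greedy radius-based covering stands in for the CA initialization (the radius is a hypothetical tuning knob), followed by standard Lloyd iterations. It is not the paper's exact algorithm or its Spark implementation.

```python
# A minimal sketch of the two-phase idea (not the paper's exact CA): phase 1
# picks centers with a greedy radius-based covering so k emerges from the data;
# phase 2 runs standard Lloyd iterations from those centers.
import numpy as np

def covering_init(X, radius):
    """Greedy covering: a point becomes a new center if no existing center covers it."""
    centers = []
    for x in X:
        if not centers or np.min(np.linalg.norm(np.array(centers) - x, axis=1)) > radius:
            centers.append(x)
    return np.array(centers)

def lloyd(X, centers, n_iter=50):
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in [(0, 0), (3, 3), (0, 4)]])
centers = covering_init(X, radius=1.5)   # k is not prespecified
centers, labels = lloyd(X, centers)
print("discovered k =", len(centers))
```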


2020 ◽  
Vol 2 (2) ◽  
pp. 1-28
Author(s):  
Tao Li ◽  
Cheng Meng

Subsampling methods aim to select a subsample as a surrogate for the observed sample. As a powerful technique for large-scale data analysis, various subsampling methods have been developed for more effective coefficient estimation and model prediction. This review presents some cutting-edge subsampling methods based on large-scale least squares estimation. Two major families of subsampling methods are introduced: the randomized subsampling approach and the optimal subsampling approach. The former aims to develop a more effective data-dependent sampling probability, while the latter aims to select a deterministic subsample in accordance with certain optimality criteria. Real data examples are provided to compare these methods empirically with respect to both estimation accuracy and computing time.
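As a concrete example of a data-dependent sampling probability, the sketch below implements algorithmic leveraging for ordinary least squares: rows are sampled in proportion to their leverage scores and reweighted by inverse probabilities. The sizes and data are hypothetical, not taken from the review's examples.

```python
# A small sketch of randomized subsampling for least squares (algorithmic
# leveraging as one possible data-dependent sampling probability).
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 100_000, 10, 1_000
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(n)

# Sampling probabilities proportional to statistical leverage h_ii = ||U_i||^2.
U, _, _ = np.linalg.svd(X, full_matrices=False)
prob = np.sum(U**2, axis=1)
prob /= prob.sum()

idx = rng.choice(n, size=r, replace=True, p=prob)
w = 1.0 / np.sqrt(r * prob[idx])                 # inverse-probability reweighting of sampled rows
beta_sub, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print("subsample vs full-sample estimation error:",
      np.linalg.norm(beta_sub - beta_true), np.linalg.norm(beta_full - beta_true))
```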


Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 195
Author(s):  
Davide Andrea Guastella ◽  
Guilhem Marcillaud ◽  
Cesare Valenti

Smart cities leverage large amounts of data acquired in the urban environment in the context of decision support tools. These tools enable monitoring the environment to improve the quality of services offered to citizens. The increasing diffusion of personal Internet of Things devices capable of sensing the physical environment allows for low-cost solutions to acquire a large amount of information within the urban environment. On the one hand, the use of mobile and intermittent sensors implies new scenarios of large-scale data analysis; on the other hand, it involves challenges such as sensor intermittency and the integrity of the acquired data. To this effect, edge computing emerges as a methodology to distribute computation among different IoT devices so that data are analyzed locally. We present here a new methodology for imputing environmental information during the acquisition step, when sensors are missing or malfunctioning, by distributing the computation among a variety of fixed and mobile devices. Numerous experiments have been carried out on real data to confirm the validity of the proposed method.
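The paper's own estimation technique is not reproduced here; as a simple stand-in for the imputation step, the sketch below fills a missing reading by inverse-distance weighting of values reported by nearby fixed and mobile devices, a computation cheap enough to run locally on an edge device.

```python
# A hedged stand-in for the imputation step: a missing reading is filled locally
# by inverse-distance weighting of the values reported by nearby sensors.
import math

def idw_impute(target_xy, neighbors, power=2.0):
    """neighbors: list of ((x, y), value) from devices within communication range."""
    num = den = 0.0
    for (x, y), value in neighbors:
        d = math.dist(target_xy, (x, y))
        if d == 0.0:
            return value
        w = 1.0 / d**power
        num += w * value
        den += w
    return num / den if den else None

# Hypothetical usage: estimate temperature at a sensor that failed to report.
readings = [((0.0, 1.0), 21.4), ((2.0, 0.5), 22.1), ((1.5, 3.0), 20.8)]
print(idw_impute((1.0, 1.0), readings))
```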


Algorithms ◽  
2018 ◽  
Vol 11 (12) ◽  
pp. 191 ◽  
Author(s):  
Chen Li ◽  
Annisa Annisa ◽  
Asif Zaman ◽  
Mahboob Qaosar ◽  
Saleh Ahmed ◽  
...  

Location recommendation is essential for various map-based mobile applications. However, it is not easy to generate location-based recommendations with the changing contexts and locations of mobile users. The skyline operation is one of the most well-established techniques for location-based services. Our previous work proposed a new query method, called the "area skyline query", to select areas on a map; however, it is not efficient for large-scale data. In this paper, we propose a parallel algorithm for processing the area skyline using MapReduce. Extensive experiments on both synthetic and real data confirm that our proposed algorithm is sufficiently efficient for large-scale data.
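The sketch below shows only the map/reduce split for a plain point skyline (local skylines per partition, then a merge), not the paper's area skyline algorithm; it illustrates why the computation parallelizes naturally.

```python
# A simplified sketch of the MapReduce split for skyline computation: each
# partition computes a local skyline (map phase), then the local results are
# merged and filtered once more (reduce phase). Minimization in every dimension.
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def local_skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def skyline_mapreduce(partitions):
    candidates = [p for part in partitions for p in local_skyline(part)]  # map phase
    return local_skyline(candidates)                                      # reduce phase

partitions = [[(1, 9), (4, 4), (7, 7)], [(2, 6), (6, 2), (8, 8)], [(3, 5), (5, 3)]]
print(skyline_mapreduce(partitions))
```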


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Said Alkarni ◽  
Ahmed Z. Afify ◽  
I. Elbatal ◽  
M. Elgarhy

This paper proposes the new three-parameter type I half-logistic inverse Weibull (TIHLIW) distribution which generalizes the inverse Weibull model. The density function of the TIHLIW can be expressed as a linear combination of the inverse Weibull densities. Some mathematical quantities of the proposed TIHLIW model are derived. Four estimation methods, namely, the maximum likelihood, least squares, weighted least squares, and Cramér–von Mises methods, are utilized to estimate the TIHLIW parameters. Simulation results are presented to assess the performance of the proposed estimation methods. The importance of the TIHLIW model is studied via a real data application.
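A sketch of maximum likelihood fitting is given below, assuming the usual type I half-logistic-G construction with an inverse Weibull baseline G(x) = exp(-θ x^(-β)); the paper's exact parameterization may differ, and the data used here are synthetic placeholders rather than the real data application.

```python
# A hedged sketch of maximum likelihood estimation for an assumed TIHLIW
# parameterization: F(x) = (1 - (1-G)^lam) / (1 + (1-G)^lam) with inverse
# Weibull baseline G(x) = exp(-theta * x**(-beta)).
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, x):
    lam, theta, beta = params
    if min(lam, theta, beta) <= 0:
        return np.inf
    G = np.exp(-theta * x**(-beta))                    # inverse Weibull CDF
    g = theta * beta * x**(-beta - 1.0) * G            # inverse Weibull pdf
    f = 2.0 * lam * g * (1.0 - G)**(lam - 1.0) / (1.0 + (1.0 - G)**lam)**2
    return -np.sum(np.log(f + 1e-300))

# Hypothetical data; in practice x would be the real data set from the application.
rng = np.random.default_rng(2)
x = 1.0 / rng.weibull(1.5, size=500)                   # crude inverse-Weibull-like sample
fit = minimize(neg_loglik, x0=[1.0, 1.0, 1.0], args=(x,), method="Nelder-Mead")
print("estimated (lambda, theta, beta):", np.round(fit.x, 3))
```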


2016 ◽  
Vol 2016 ◽  
pp. 1-9
Author(s):  
Size Bi ◽  
Xiaoyu Han ◽  
Jing Tian ◽  
Xiao Liang ◽  
Yang Wang ◽  
...  

This paper investigates a homotopy-based method for embedding data sets with hundreds of thousands of items that yields a parallel algorithm suitable for running on a distributed system. Current eigenvalue-based embedding algorithms attempt to use a sparsification of the distance matrix to approximate a low-dimensional representation when handling large-scale data sets, mainly because the embedding process is still hindered by the eigendecomposition bottleneck for high-dimensional matrices. In this study, a homotopy continuation algorithm is applied to improve this embedding model by parallelizing the corresponding eigendecomposition. The eigenvalue problem is converted into ordinary differential equations with initial values, and all isolated positive eigenvalues and their corresponding eigenvectors can be obtained in parallel by tracking the predicted eigenpaths. Experiments on real data sets show that the homotopy-based approach has the potential to scale to millions of data items.
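The toy sketch below illustrates the continuation idea on a small symmetric matrix: deform an easy starting matrix into the target and trace each eigenpair independently (hence in parallel) with an Euler predictor and an inverse-iteration corrector. It is only a schematic of the principle, not the paper's algorithm or its distributed implementation.

```python
# Toy homotopy continuation for a symmetric eigenproblem: H(t) = (1-t) D + t A,
# with D = diag(A). Each eigenpair is traced from t = 0 to t = 1 independently.
import numpy as np

def trace_eigenpair(A, i, steps=200):
    n = A.shape[0]
    D = np.diag(np.diag(A))
    lam = A[i, i]                        # eigenvalue of H(0) = D
    v = np.zeros(n); v[i] = 1.0          # corresponding eigenvector of H(0)
    dt = 1.0 / steps
    for s in range(steps):
        t = (s + 1) * dt
        lam += dt * v @ (A - D) @ v      # Euler predictor: d(lambda)/dt = v^T (A - D) v
        H = (1 - t) * D + t * A
        for _ in range(3):               # corrector: inverse iteration + Rayleigh quotient
            v = np.linalg.solve(H - (lam + 1e-8) * np.eye(n), v)
            v /= np.linalg.norm(v)
            lam = v @ H @ v
    return lam, v

rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2 + np.diag(np.arange(6) * 5.0)   # well-separated spectrum for the demo
approx = sorted(trace_eigenpair(A, i)[0] for i in range(6))
print(np.round(approx, 6))
print(np.round(np.linalg.eigvalsh(A), 6))          # compare with a direct solve
```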


Geophysics ◽  
2010 ◽  
Vol 75 (4) ◽  
pp. U29-U38 ◽  
Author(s):  
Andreas Kjelsrud Evensen ◽  
Martin Landrø

Most seismic studies of changes in traveltimes are of a qualitative nature, and a major challenge in four dimensions is to use the information contained in time shifts to quantify the nature of velocity changes in the subsurface layers. We propose a 4D tomographic inversion method that uses time shifts from prestack seismic data to estimate parameters describing the 2D velocity field after changes have occurred. Prestack data allow for the use of many offsets, thus increasing the information input for the inversion. The velocity changes are parameterized by a chosen number of Gaussian functions in two dimensions, and weighted least-squares inversion is used to estimate the parameters describing these functions. We have found that the parameters describing the position and shape of the Gaussian velocity anomalies can be estimated with this method for simple synthetic cases. For more complex cases with overlapping Gaussian functions, resolution of the parameters can be difficult; in these cases our recommendation is to fit a simple smooth anomaly to the more complex real world. The method is tested on a real data set from a CO₂ injection project above the Sleipner field in the North Sea, where quantification of changes is important for monitoring purposes. We have found that the noise levels in prestack traveltime data are on the high side for large-scale analysis; however, we estimate reasonable CO₂ layer thickness and velocity compared to previous work in a nearby area.
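As a schematic of the inversion (not the authors' forward model), the sketch below treats time shifts as straight-ray line integrals of the slowness perturbation caused by a single 2D Gaussian velocity anomaly and recovers the anomaly parameters by weighted least squares; the geometry, noise level, and weights are hypothetical.

```python
# Toy weighted least-squares inversion for the parameters of one Gaussian
# velocity anomaly, with time shifts modeled as straight-ray line integrals.
import numpy as np
from scipy.optimize import least_squares

v0 = 2000.0                                    # background velocity [m/s]

def time_shift(params, ray):
    """params = (dv, x0, z0, sx, sz); ray = ((xs, zs), (xr, zr)) straight path."""
    dv, x0, z0, sx, sz = params
    (xs, zs), (xr, zr) = ray
    s = np.linspace(0.0, 1.0, 200)
    x, z = xs + s * (xr - xs), zs + s * (zr - zs)
    dvel = dv * np.exp(-((x - x0)**2 / (2 * sx**2) + (z - z0)**2 / (2 * sz**2)))
    dslow = 1.0 / (v0 + dvel) - 1.0 / v0       # slowness perturbation along the ray
    seg = np.hypot(xr - xs, zr - zs) / (len(s) - 1)
    return np.sum(dslow) * seg

rays = [((x, 0.0), (x + off, 800.0)) for x in range(0, 2000, 200) for off in (0.0, 200.0, 400.0)]
true = (-150.0, 1000.0, 400.0, 300.0, 100.0)    # amplitude, center, and widths of the anomaly
rng = np.random.default_rng(4)
obs = np.array([time_shift(true, r) for r in rays]) + rng.normal(0, 2e-5, len(rays))
w = np.full(len(rays), 1.0 / 2e-5)              # weights = inverse data standard deviation

res = least_squares(lambda p: w * (np.array([time_shift(p, r) for r in rays]) - obs),
                    x0=(-100.0, 900.0, 350.0, 250.0, 150.0))
print("recovered (dv, x0, z0, sx, sz):", np.round(res.x, 1))
```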


2018 ◽  
Vol 7 (3.8) ◽  
pp. 151
Author(s):  
Anjali Deore

Big Data consists of large-scale data that are complicated and diverse, so new types of integration of techniques and technologies are required to uncover the hidden values in such big datasets. The Big Data environment is used to organise and examine diverse sorts of information. Data that are so massive in volume, so varied in range, or moving at such high velocity are referred to as Big Data. Acquiring and analysing Big Data is a challenging job because it relies on large distributed file systems which must be flexible, fault tolerant and scalable. Diverse technologies, such as Hadoop and MapReduce, are used by Big Data applications to handle the huge quantity of data. In this paper, a description of big datasets is provided first. The next section describes the different technologies used for managing Big Data. After that, Big Data applications are covered, and the last section discusses the relation between Big Data and IoT as well as IoT for Big Data analytics.
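As a minimal illustration of the MapReduce pattern mentioned above, the sketch below shows the map → shuffle → reduce flow in plain Python rather than on Hadoop.

```python
# Word count in the MapReduce style: map emits (key, 1) pairs, reduce sums per key.
from collections import defaultdict

def map_phase(doc):
    return [(word.lower(), 1) for word in doc.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, value in pairs:          # grouping by key (the "shuffle") happens implicitly here
        counts[key] += value
    return dict(counts)

docs = ["big data needs scalable tools", "Hadoop and MapReduce handle big data"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(reduce_phase(pairs))
```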

