Analysis of large volume data processing using clustering algorithms

2018, Vol 7 (4.5), pp. 685
Author(s): Sarada. B, Vinayaka Murthy. M, Udaya Rani. V

The study of large datasets characterized by velocity, variety, and volume is known as Big data. When a dataset has a limited number of clusters, low dimensionality, and a small number of data points, existing traditional clustering algorithms can be used. In the internet age, however, data is growing very fast, and existing clustering algorithms do not give acceptable results in terms of time and space complexity. There is therefore a need for a new approach to clustering large volumes of data with low time and space complexity, using the MapReduce and Hadoop framework applied to different clustering algorithms: k-means, canopy clustering, and the proposed algorithm. The analysis shows that large-volume data processing achieves low time and space complexity when compared to small-volume data processing.
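To make the MapReduce formulation concrete, the following is a minimal single-machine sketch of one k-means iteration split into a map step (assign each point to its nearest centroid) and a reduce step (recompute centroids). The function names and the random toy data are illustrative assumptions, not the paper's Hadoop implementation.

```python
# Minimal single-machine sketch of one MapReduce-style k-means iteration.
# map_phase/reduce_phase are illustrative stand-ins for Hadoop mapper and
# reducer jobs; the toy data is random, not from the paper.
import numpy as np

def map_phase(points, centroids):
    """Map: emit (nearest-centroid index, point) pairs."""
    return [(int(np.argmin(np.linalg.norm(centroids - p, axis=1))), p)
            for p in points]

def reduce_phase(pairs, old_centroids):
    """Reduce: average the points assigned to each centroid."""
    k, dim = old_centroids.shape
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for idx, p in pairs:
        sums[idx] += p
        counts[idx] += 1
    new = old_centroids.copy()               # keep empty clusters unchanged
    mask = counts > 0
    new[mask] = sums[mask] / counts[mask][:, None]
    return new

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
centroids = data[rng.choice(len(data), size=3, replace=False)]
for _ in range(10):                           # a few iterations for illustration
    centroids = reduce_phase(map_phase(data, centroids), centroids)
print(centroids)
```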

2018, Vol 7 (4.5), pp. 689
Author(s): Sarada. B, Vinayaka Murthy. M, Udaya Rani. V

Nowadays data is increasing exponentially every day in terms of velocity, variety, and volume, which is also known as Big data. When a dataset has a small number of dimensions, a limited number of clusters, and few data points, existing traditional clustering algorithms give the expected results. In the Big Data age, however, traditional clustering algorithms cannot deliver the expected results on large-volume datasets. There is therefore a need for a new approach that gives better accuracy and computational time for large-volume data processing. The proposed system architecture combines canopy clustering, k-means, and the RK sorting algorithm on the MapReduce Hadoop framework. The analysis shows that large-volume data processing takes less computational time with higher accuracy, and that RK sorting requires neither swapping of elements nor stack space.
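The canopy step can be sketched briefly: points within a loose threshold T1 of a chosen center form a canopy, and points within a tighter threshold T2 are removed from further consideration, giving cheap initial groupings (and candidate centers) for k-means. The thresholds and data below are illustrative assumptions, not values from the paper, and this is a single-machine sketch rather than the Hadoop implementation.

```python
# Minimal sketch of canopy pre-clustering, often used to seed k-means.
# T1 > T2 are illustrative thresholds, not values from the paper.
import numpy as np

def canopy(points, t1, t2):
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = points[remaining[0]]
        dists = np.linalg.norm(points[remaining] - center, axis=1)
        members = [remaining[i] for i, d in enumerate(dists) if d < t1]
        canopies.append((center, members))
        # points within the tighter threshold T2 are removed from further use
        remaining = [remaining[i] for i, d in enumerate(dists) if d >= t2]
    return canopies

rng = np.random.default_rng(1)
pts = rng.normal(size=(500, 2))
print(len(canopy(pts, t1=1.5, t2=0.8)), "canopies")
```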


2021, Vol 8
Author(s): Murtaza Saifee, Jian Wu, Yingna Liu, Ping Ma, Jutima Patlidanon, ...

Purpose: To introduce and validate hvf_extraction_script, an open-source software script for the automated extraction and structuring of metadata, value plot data, and percentile plot data from Humphrey visual field (HVF) report images. Methods: Validation was performed on 90 HVF reports over three different report layouts, including a total of 1,530 metadata fields, 15,536 value plot data points, and 10,210 percentile data points, comparing the computer script and four human extractors against DICOM reference data. Computer extraction and human extraction were compared on extraction time as well as accuracy of extraction for metadata, value plot data, and percentile plot data. Results: Computer extraction required 4.9-8.9 s per report, compared to the 6.5-19 min required by human extractors, representing a more than 40-fold difference in extraction speed. The computer metadata extraction error rate varied from an aggregate 1.2-3.5%, compared to 0.2-9.2% for human metadata extraction across all layouts. Computer value data point extraction had an aggregate error rate of 0.9% for version 1, <0.01% in version 2, and 0.15% in version 3, compared to a 0.8-9.2% aggregate error rate for human extraction. Computer percentile data point extraction similarly had very low error rates, with no errors occurring in versions 1 and 2 and a 0.06% error rate in version 3, compared to a 0.06-12.2% error rate for human extraction. Conclusions: This study introduces and validates hvf_extraction_script, an open-source tool for fast, accurate, automated data extraction of HVF reports to facilitate analysis of large-volume HVF datasets, and demonstrates the value of image processing tools in enabling faster and cheaper large-volume data extraction in research settings.
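For context on how the reported error rates can be read, a field-level error rate is simply the fraction of extracted fields that disagree with the reference. The sketch below is hypothetical; the field names and the comparison rule are illustrative, not taken from hvf_extraction_script.

```python
# Illustrative scoring of extracted fields against a reference (e.g.,
# DICOM-derived values). Field names are hypothetical; this is not the
# hvf_extraction_script API.
def error_rate(extracted: dict, reference: dict) -> float:
    """Fraction of reference fields whose extracted value differs."""
    keys = reference.keys()
    mismatches = sum(1 for k in keys if extracted.get(k) != reference[k])
    return mismatches / len(keys)

ref = {"patient_id": "12345", "test_date": "2020-01-15", "md": "-2.31"}
ext = {"patient_id": "12345", "test_date": "2020-01-15", "md": "-2.81"}
print(f"{error_rate(ext, ref):.1%}")   # 33.3% for this toy example
```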


2021, pp. 000276422110216
Author(s): Kazimierz M. Slomczynski, Irina Tomescu-Dubrow, Ilona Wysmulek

This article proposes a new approach to analyzing protest participation measured in surveys of uneven quality. Because single international survey projects cover only a fraction of the world's nations in specific periods, researchers increasingly turn to ex-post harmonization of different survey data sets not a priori designed to be comparable. However, very few scholars systematically examine the impact of survey data quality on substantive results. We argue that variation in source data, especially deviations from the standards of survey documentation, data processing, and computer files proposed by methodologists of Total Survey Error, Survey Quality Monitoring, and Fitness for Intended Use, is important for analyzing protest behavior. In particular, we apply the Survey Data Recycling framework to investigate the extent to which indicators of attending demonstrations and signing petitions in 1,184 national survey projects are associated with measures of data quality, controlling for variability in the questionnaire items. We demonstrate that the null hypothesis of no impact of measures of survey quality on indicators of protest participation must be rejected. Measures of survey documentation, data processing, and computer records, taken together, explain over 5% of the intersurvey variance in the proportions of the populations attending demonstrations or signing petitions.
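As a rough illustration of what "explains over 5% of the intersurvey variance" means operationally, one can regress survey-level protest proportions on quality indicators and read off R-squared. The variables and data below are synthetic and purely illustrative, not the authors' harmonized dataset or model.

```python
# Synthetic illustration of "quality measures explain over 5% of the
# intersurvey variance": regress survey-level protest proportions on
# quality indicators and inspect R^2. All columns are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_surveys = 1184
quality = rng.normal(size=(n_surveys, 3))   # documentation, processing, records
protest = (0.2 + quality @ np.array([0.02, 0.01, 0.005])
           + rng.normal(scale=0.1, size=n_surveys))

model = LinearRegression().fit(quality, protest)
print(f"R^2 = {model.score(quality, protest):.3f}")
```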


2021, Vol 11 (1)
Author(s): Baicheng Lyu, Wenhua Wu, Zhiqiang Hu

With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting judgment indicators for the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the adjustable parameters to a minimum. Based on the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.
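For intuition, density-based methods of this family start from a local-density estimate such as the number of neighbors within a cutoff distance. The sketch below shows that quantity on toy data; the cutoff value is an illustrative assumption and this is not the BCALoD implementation itself.

```python
# Minimal sketch of a local-density estimate with a cutoff distance d_c,
# the kind of quantity density-based methods such as BCALoD build on.
# Illustration only, not the authors' algorithm.
import numpy as np

def local_density(points, d_c):
    """rho_i = number of points within distance d_c of point i (excluding itself)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < d_c).sum(axis=1) - 1   # subtract self-count

rng = np.random.default_rng(7)
pts = np.vstack([rng.normal(0, 0.3, (200, 2)),   # one large cluster
                 rng.normal(3, 0.3, (20, 2))])   # one small cluster
rho = local_density(pts, d_c=0.5)
print("max density:", rho.max(), "min density:", rho.min())
```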


Author(s): И.В. Бычков, Г.М. Ружников, В.В. Парамонов, А.С. Шумилов, Р.К. Фёдоров

An infrastructural approach to spatial data processing for territorial development management is considered, based on the service-oriented paradigm, OGC standards, web technologies, WPS services, and a geoportal. The development of territories is a multi-dimensional and multi-aspect process, characterized by large volumes of financial, natural-resource, social, ecological, and economic data. The data is highly localized and uncoordinated, which limits its combined analysis and use. One method of large-volume data processing is the information-analytical environment. The architecture and implementation of an information-analytical environment for territorial development in the form of a Geoportal is presented. The Geoportal provides its users with software instruments for spatial and thematic data exchange, as well as OGC-based distributed services for data processing. Implementing data processing and storage as services located on distributed servers simplifies their updating and maintenance; in addition, it enables publishing and makes processing a more open and controlled process. The Geoportal consists of the following modules: the content management system Calipso (user interface, user management, data visualization), the RDBMS PostgreSQL with a spatial data processing extension, services for relational data entry and editing, a subsystem for launching and executing WPS services, and spatial data processing services deployed in a local cloud environment. The article argues for the necessity of an infrastructural approach when creating an information-analytical environment for territory management, which is characterized by large volumes of spatial and thematic data to be processed; the data is stored in various formats and is handled through application of the service-oriented paradigm, OGC standards, web technologies, the Geoportal, and distributed WPS services. The developed software system was tested on a number of tasks that arise during territory development.
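As an illustration of how a client interacts with OGC WPS services of the kind the Geoportal exposes, the sketch below issues a standard WPS GetCapabilities request over HTTP. The endpoint URL is a placeholder, not the project's actual service address.

```python
# Minimal sketch of querying an OGC WPS endpoint for its capabilities.
# The URL is a placeholder; the real geoportal endpoint is not given here.
import requests

WPS_URL = "https://example.org/geoportal/wps"   # placeholder endpoint

params = {
    "service": "WPS",
    "request": "GetCapabilities",
    "version": "1.0.0",
}
resp = requests.get(WPS_URL, params=params, timeout=30)
resp.raise_for_status()
# The response is an XML document listing the processes the server offers.
print(resp.text[:500])
```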


2016, Vol 2016, pp. 1-13
Author(s): Tianjin Zhang, Zongrui Yi, Jinta Zheng, Dong C. Liu, Wai-Mai Pang, ...

Two-dimensional transfer functions (TFs) designed from the intensity-gradient magnitude (IGM) histogram are effective tools for the visualization and exploration of 3D volume data. However, traditional design methods usually depend on multiple rounds of trial and error. We propose a novel method for the automatic generation of transfer functions by performing the affinity propagation (AP) clustering algorithm on the IGM histogram. Compared with previous clustering algorithms employed in volume visualization, the AP clustering algorithm converges much faster and achieves more accurate clustering results. In order to obtain meaningful clustering results, we introduce two similarity measurements: IGM similarity and spatial similarity. These two similarity measurements effectively bring voxels of the same tissue together and differentiate voxels of different tissues, so that the generated TFs can assign different optical properties to different tissues. Before performing the clustering algorithm on the IGM histogram, we propose to remove noisy voxels based on the spatial information of voxels. Our method does not require users to input the number of clusters, and the classification and visualization process is automatic and efficient. Experiments on various datasets demonstrate the effectiveness of the proposed method.
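A minimal sketch of the clustering step: affinity propagation run on a precomputed similarity that mixes an IGM-space term with a spatial term. The weighting and toy data are illustrative assumptions and do not reproduce the paper's exact similarity definitions.

```python
# Sketch: affinity propagation over a combined similarity, in the spirit of
# mixing an IGM-space term with a spatial term. Weights and data are toy
# assumptions, not the paper's definitions.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(3)
igm = rng.normal(size=(100, 2))      # (intensity, gradient magnitude) per bin
xyz = rng.normal(size=(100, 3))      # spatial centroid of voxels in each bin

def neg_sq_dist(a):
    """Negative squared Euclidean distance, the usual AP similarity."""
    d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    return -d ** 2

similarity = 0.7 * neg_sq_dist(igm) + 0.3 * neg_sq_dist(xyz)
labels = AffinityPropagation(affinity="precomputed",
                             random_state=0).fit_predict(similarity)
print("clusters found:", len(set(labels)))
```

Note that AP does not take the number of clusters as input, which matches the abstract's claim that users need not specify it.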

