Extended Jaccard Indexive Buffalo Optimized Clustering on Geo-social Networks with Big Data

Webology ◽  
2021 ◽  
Vol 18 (2) ◽  
pp. 166-182
Author(s):  
M. Anoop ◽  
P. Sripriya

Clustering is a core data mining task that partitions a large dataset into dissimilar groups. The enormous growth of Geo-Social Networks (GeoSNs) involves users who create millions of heterogeneous data items carrying a variety of information, and analyzing such a volume of data is challenging. Clustering these large data volumes is used to identify the locations that GeoSN users visit most frequently. To improve the clustering of large data volumes, a novel technique called Extended Jaccard Indexive Buffalo Optimized Data Clustering (EJIBODC) is introduced for grouping data with high accuracy and low time consumption. The main aim of the EJIBODC technique is to partition a big dataset into different groups. In this technique, a number of clusters with centroids are initialized to group the data. Extended Jaccard Indexive Buffalo Optimization is then applied to find the fittest cluster for each data point: the Extended Jaccard Index serves as the fitness measure between a data point and a centroid, and, guided by this similarity value through a gradient ascent function, each data point is assigned to the fittest cluster centroid. The cluster fitness values are then updated, and all data are grouped into suitable clusters with high accuracy and a minimum error rate. Experiments are conducted on a big geo-social dataset, testing different clustering algorithms. Results are discussed for clustering accuracy, error rate, clustering time and space complexity with respect to the number of data items. The experimental outcomes demonstrate that the proposed EJIBODC technique outperforms previous related clustering techniques, achieving higher clustering accuracy along with lower error rate, time consumption and space complexity.
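The abstract publishes no code; as a rough illustration, the sketch below computes the Extended Jaccard (Tanimoto) similarity that the fitness measure is described as being based on, and uses it to assign each point to its fittest centroid. Function names are hypothetical, and the buffalo optimization and gradient ascent update steps are omitted.

```python
import numpy as np

def extended_jaccard(x, c):
    """Extended (Tanimoto) Jaccard similarity between a data vector x and a
    centroid c: dot(x, c) / (|x|^2 + |c|^2 - dot(x, c)); 1.0 means identical."""
    dot = np.dot(x, c)
    return dot / (np.dot(x, x) + np.dot(c, c) - dot)

def assign_to_fittest(points, centroids):
    """Assign each point to the centroid with the highest similarity,
    mirroring the fitness-based assignment step described above."""
    sims = np.array([[extended_jaccard(p, c) for c in centroids] for p in points])
    return sims.argmax(axis=1)
```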

2020 ◽  
Vol 4 ◽  
pp. 97-100
Author(s):  
A.P. Pronichev

The article discusses the architecture of a system for collecting and analyzing heterogeneous data from social networks. The architecture is a distributed system of subsystem modules, each responsible for a separate task. The system can also delegate data analysis to external systems, providing the interface abstraction needed to connect them. This allows more flexible customization of the data analysis process and reduces development, implementation and support costs.
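The article describes the architecture only in prose; the following Python sketch illustrates the kind of interface abstraction it mentions for plugging external analysis systems into the pipeline. All class and method names are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class AnalyzerModule(ABC):
    """Interface each analysis subsystem (internal or external) implements,
    so the collector can dispatch records without knowing the backend."""

    @abstractmethod
    def analyze(self, records: List[Dict[str, Any]]) -> Dict[str, Any]:
        ...

class ExternalServiceAdapter(AnalyzerModule):
    """Wraps a hypothetical external analysis service behind the same interface."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def analyze(self, records: List[Dict[str, Any]]) -> Dict[str, Any]:
        # A real deployment would send `records` to self.endpoint;
        # here we only illustrate the abstraction boundary.
        return {"endpoint": self.endpoint, "count": len(records)}
```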


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 443
Author(s):  
Chyan-long Jan

Because of financial information asymmetry, stakeholders usually do not know a company's real financial condition until financial distress occurs. Financial distress not only affects a company's operational sustainability and damages the rights and interests of its stakeholders; it may also harm the national economy and society. Hence, it is very important to build high-accuracy financial distress prediction models. The purpose of this study is to build high-accuracy and effective financial distress prediction models using two representative deep learning algorithms: deep neural networks (DNN) and convolutional neural networks (CNN). In addition, important variables are selected by the chi-squared automatic interaction detector (CHAID). In this study, the data of Taiwan's listed and OTC sample companies are taken from the Taiwan Economic Journal (TEJ) database for the period 2000 to 2019, covering 86 companies in financial distress and 258 not in financial distress, for a total of 344 companies. According to the empirical results, with the important variables selected by CHAID and modeling by CNN, the CHAID-CNN model has the highest financial distress prediction accuracy rate of 94.23%, and the lowest type I and type II error rates, 0.96% and 4.81%, respectively.
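As a hedged sketch of the two-stage pipeline the study describes, the code below screens variables with a chi-squared test (a stand-in for CHAID, which has no standard open-source implementation) and feeds the selected variables to a small 1-D CNN. Layer sizes, epochs and k are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from tensorflow import keras

def build_chi2_cnn(X, y, k=10):
    # chi2 requires non-negative inputs, so scale features to [0, 1] first.
    X01 = MinMaxScaler().fit_transform(X)
    selector = SelectKBest(chi2, k=k).fit(X01, y)
    X_sel = selector.transform(X01)[..., np.newaxis]   # (samples, k, 1) for Conv1D
    model = keras.Sequential([
        keras.Input(shape=(k, 1)),
        keras.layers.Conv1D(16, 3, activation="relu"),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(1, activation="sigmoid"),   # distress vs. non-distress
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_sel, y, epochs=20, verbose=0)
    return selector, model
```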


2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Rustam Rafikovich Mussabayev ◽  
Maksat N. Kalimoldayev ◽  
Yedilkhan N. Amirgaliyev ◽  
Timur R. Mussabayev

Abstract This work considers one approach to the automatic segmentation of a discrete speech signal. The aim is to construct an algorithm that meets the following requirements: segmentation of a signal into acoustically homogeneous segments; high accuracy and segmentation speed; unambiguity and reproducibility of segmentation results; and no need for preliminary training on a special set of manually segmented signals. The development of an algorithm meeting these requirements was motivated by the need to build large, automatically segmented speech databases. One of the new approaches to this task is presented in this article. For this purpose, we use a new type of informative feature, the TAC-coefficients (Throat-Acoustic Correlation coefficients), which provide sufficient segmentation accuracy and efficiency.
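The TAC-coefficient definition is specific to this work, so the sketch below substitutes ordinary short-time spectral features to illustrate the training-free, boundary-detection style of segmentation the abstract describes; the frame length and threshold are assumptions.

```python
import numpy as np

def segment_boundaries(signal, sr, frame_ms=20, thresh=0.5):
    """Mark a boundary wherever the distance between adjacent frame feature
    vectors exceeds `thresh`, yielding acoustically homogeneous segments."""
    n = int(sr * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    feats = np.abs(np.fft.rfft(frames, axis=1))              # crude spectral envelope
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12
    dist = np.linalg.norm(np.diff(feats, axis=0), axis=1)    # inter-frame change
    return (np.where(dist > thresh)[0] + 1) * n              # boundary sample indices
```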



Web Services ◽  
2019 ◽  
pp. 413-430
Author(s):  
Usman Akhtar ◽  
Mehdi Hassan

The availability of a huge amount of heterogeneous data from different sources on the Internet has been termed the problem of Big Data. Clustering is widely used as a knowledge discovery tool that separates data into manageable parts, and clustering algorithms that scale to big databases are needed. In this chapter we explore various schemes that have been used to tackle big databases. Statistical features are extracted from the given dataset, redundant and irrelevant features are eliminated, and the most important and relevant features are selected by a genetic algorithm (GA). Clustering with reduced feature sets requires less computational time and fewer resources. Experiments performed on standard datasets indicate that clustering based on the proposed scheme offers high accuracy. To check the clustering quality, various quality measures were computed; they show that the proposed methodology improves results significantly and offers high-quality clustering.
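As a minimal sketch of GA-based feature selection for clustering in the spirit of this chapter: bit-mask chromosomes select feature subsets, and clustering quality is the fitness. Silhouette score stands in for the chapter's unspecified quality measures, and the population size, rates and k are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def ga_select(X, k=3, pop=20, gens=30, p_mut=0.1, seed=0):
    """Return the best feature bit-mask found by a simple genetic algorithm."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    population = rng.integers(0, 2, size=(pop, d), dtype=bool)

    def fitness(mask):
        if mask.sum() < 2:                       # need at least two features
            return -1.0
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[:, mask])
        return silhouette_score(X[:, mask], labels)

    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        parents = population[np.argsort(scores)[-pop // 2:]]   # truncation selection
        cuts = rng.integers(1, d, size=pop // 2)
        children = np.array([np.concatenate((parents[i % len(parents)][:c],
                                             parents[(i + 1) % len(parents)][c:]))
                             for i, c in enumerate(cuts)])     # one-point crossover
        children ^= rng.random(children.shape) < p_mut         # bit-flip mutation
        population = np.vstack((parents, children))
    return population[np.argmax([fitness(m) for m in population])]
```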


2019 ◽  
Vol 8 (4) ◽  
pp. 84-100
Author(s):  
Akarsh Goyal ◽  
Rahul Chowdhury

In recent times, a large number of clustering algorithms have been developed whose main function is to form sets of objects with nearly the same features. But the presence of categorical data values poses a challenge for these algorithms. Also, some algorithms that can handle categorical data cannot process uncertainty in the values and therefore have stability issues. Handling categorical data along with uncertainty has thus become necessary. In 2007 the MMR algorithm, based on basic rough set theory, was developed. MMeR, proposed in 2009, surpassed the results of MMR in handling categorical data but cannot be used robustly for hybrid data. In this article, the authors generalize the MMeR algorithm with neighborhood relations, making it a neighborhood rough set model that this article calls MMeNR (Min Mean Neighborhood Roughness) and that handles heterogeneous data. The authors have also extended the MMeNR method to make it suitable for applications such as geospatial data analysis and epidemiology.
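The article's roughness computations are not reproduced here; the sketch below only illustrates a neighborhood relation of the kind MMeNR builds on for hybrid data, where categorical attributes must match exactly and numeric attributes must fall within a delta threshold. The delta value and data layout are assumptions.

```python
import numpy as np

def neighbors(data, i, num_cols, cat_cols, delta=0.1):
    """Indices of objects in the delta-neighborhood of object i, for a hybrid
    dataset stored as an object array with numeric and categorical columns."""
    num_ok = np.all(np.abs(data[:, num_cols].astype(float)
                           - data[i, num_cols].astype(float)) < delta, axis=1)
    cat_ok = np.all(data[:, cat_cols] == data[i, cat_cols], axis=1)
    return np.where(num_ok & cat_ok)[0]
```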


2019 ◽  
Vol 2019 ◽  
pp. 1-15
Author(s):  
Hisham A. Kholidy ◽  
Abdelkarim Erradi

Cyber-Physical Power Systems (CPPS) have become vital targets for intruders because of the large volume of high-speed heterogeneous data provided by Wide Area Measurement Systems (WAMS). The Non-nested Generalized Exemplars (NNGE) algorithm is one of the most accurate classification techniques that can work with such CPPS data. However, the NNGE algorithm tends to produce rules that test a large number of input features, which poses problems for large data volumes and hinders the scalability of any detection system. In this paper, we introduce VHDRA, a Vertical and Horizontal Data Reduction Approach, to improve the classification accuracy and speed of the NNGE algorithm and reduce computational resource consumption. VHDRA provides the following functionalities: (1) it vertically reduces the dataset features by selecting the most significant features and reducing the NNGE's hyperrectangles; (2) it horizontally reduces the size of the data while preserving the original key events and patterns within the datasets, using an approach called STEM, the State Tracking and Extraction Method. The experiments show that the overall performance of VHDRA, using both vertical and horizontal reduction, reduces the NNGE hyperrectangles by 29.06%, 37.34%, and 26.76% and improves the accuracy of the NNGE by 8.57%, 4.19%, and 3.78% on the Multi-class, Binary-class, and Triple-class datasets, respectively.
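As an illustrative, non-authoritative take on the horizontal (STEM-style) reduction, the sketch below keeps only the measurement rows where the quantized system state changes, so repeated steady-state rows are dropped while state-transition events survive; the quantization step is an assumption.

```python
import numpy as np

def horizontal_reduce(X, step=0.5):
    """Return row indices to keep: the first row, plus every row whose
    quantized state differs from the previously kept row's state."""
    states = np.round(X / step).astype(int)   # quantize measurements into states
    keep = [0]
    for i in range(1, len(states)):
        if not np.array_equal(states[i], states[keep[-1]]):
            keep.append(i)                    # a state transition: preserve it
    return np.array(keep)
```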

