CSFC: A New Centroid Based Clustering Method to Improve the Efficiency of Storing and Accessing Small Files in Hadoop

Computers play a major role in day-to-day life, and as technology advances, the volume of data collected across many fields keeps growing. Data is produced every second in quantities that are not easy to process; such data is called Big Data. A large number of small files is also considered Big Data, and small files are not easy to store and process in Hadoop. Existing methods use merging and clustering techniques to combine small files into larger files of up to 128 MB before sending them to HDFS. The proposed system, CSFC (Clustering Small Files based on Centroid), clusters files without fixing the number of clusters in advance, because a predefined count forces all files into that limited number of clusters. Instead, clusters are generated according to the number of related files in the dataset. Relevant files are combined in a cluster up to 128 MB. If a file is not relevant to any existing cluster, or if the cluster size has reached 128 MB, a new cluster is generated and the file is stored in it. Processing related files together is easier than processing unrelated files. Fetching data from the data node with this method gives more efficient results than other clustering techniques.
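The assignment rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how relevance is measured, so the per-file feature vectors, the cosine similarity, the `threshold` value, and the function name `cluster_small_files` are all assumptions made here. Only the 128 MB limit and the "open a new cluster when nothing fits" rule come from the text.

```python
import math

CAPACITY = 128 * 1024 * 1024  # 128 MB cluster limit, taken from the abstract


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed relevance measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_small_files(files, threshold=0.8, capacity=CAPACITY):
    """Group files into clusters without fixing the number of clusters up front.

    files: iterable of (name, size_bytes, feature_vector); the feature
    vectors are a hypothetical stand-in for whatever "relevance" means.
    """
    clusters = []
    for name, size, vec in files:
        best, best_sim = None, threshold
        for c in clusters:
            # Skip clusters that would exceed the size limit with this file.
            if c["size"] + size > capacity:
                continue
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            # No relevant cluster with room left: start a new one.
            best = {"centroid": list(vec), "size": 0, "files": []}
            clusters.append(best)
        else:
            # Update the centroid as the running mean of member vectors.
            n = len(best["files"])
            best["centroid"] = [
                (m * n + x) / (n + 1) for m, x in zip(best["centroid"], vec)
            ]
        best["files"].append(name)
        best["size"] += size
    return clusters
```

Because the number of clusters grows only when a file is unrelated to every existing centroid or no related cluster has room, the cluster count follows the data rather than a preset parameter, which is the property the abstract emphasizes.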

2021 ◽  
Vol 9 (2) ◽  
pp. 835-842
Author(s):  
Mrs. Bhawna Janghel et al.

This paper uses a clustering method to measure the academic performance of school students from the same district. Data clustering makes it possible to predict which school is best, to identify the weak students of a particular school, and to identify the results of the best school. This will show which school in the district is better for observing the techniques. Identifying the best school will help in improving the quality of education.


Author(s):  
İbrahim Yazici ◽  
Ömer Faruk Beyca ◽  
Selim Zaim

With the recent availability of big data in markets, processing data and making predictions from it have become more difficult, and this difficulty affects management decisions. As a result, a company's competitiveness depends on analyzing and utilizing big data to achieve company targets. Transforming big data into business advantage has become a vital management tool across all industries. Many data mining techniques are applied to a wide range of problems, and clustering is one of the most frequently used. Clustering techniques aim to group a set of objects so that more similar objects fall in the same cluster; their main use is segmenting or grouping objects. Clustering techniques and their use within the service sector are presented by clustering aim and methodology. Based on the papers surveyed, the energy, social media, and banking sectors are the heaviest users of clustering techniques within the service sector, mainly for customer segmentation.


2021 ◽  
Vol 30 (2) ◽  
pp. 205-237
Author(s):  
Sukanya Mukherjee ◽  
Kamalika Bhattacharjee ◽  
Sukanta Das ◽  
...  

This paper introduces a cycle-based clustering technique using the cyclic spaces of reversible cellular automata (CAs). Traditionally, a cluster consists of close objects, which in the case of CAs necessarily means that the objects belong to the same cycle; that is, they are reachable from each other. Each of the cyclic spaces of a cellular automaton (CA) forms a unique cluster. This paper identifies CA properties based on “reachability” that make the clustering effective. To do that, we first figure out which CA rules contribute to maintaining the minimum intracluster distance. Our CA is then designed with such rules to ensure that a limited number of cycles exist in the configuration space. An iterative strategy is also introduced that can generate a desired number of clusters by merging objects of closely reachable clusters from a previous level in the present level using a unique auxiliary CA. Finally, the performance of our algorithm is measured using some standard benchmark validation indices and compared with existing well-known clustering techniques. It is found that our algorithm is at least on a par with the best algorithms existing today on the metric of these standard validation indices.
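The core idea, that each cycle of a reversible CA is one cluster, can be illustrated with a small sketch. It assumes a one-dimensional, two-state, three-neighborhood (elementary) CA with periodic boundary and enumerates the whole configuration space, which is only feasible for small lattices. The function names are hypothetical, and the paper's rule selection, object encoding, and iterative merging with an auxiliary CA are not reproduced here.

```python
def step(config, rule, n):
    """One synchronous update of an n-cell elementary CA, periodic boundary.

    config is an int whose bit i holds the state of cell i; the left
    neighbor of cell i is bit i+1 and the right neighbor is bit i-1.
    """
    nxt = 0
    for i in range(n):
        left = (config >> ((i + 1) % n)) & 1
        center = (config >> i) & 1
        right = (config >> ((i - 1) % n)) & 1
        # Look up the rule bit for this neighborhood (Wolfram numbering).
        if (rule >> (4 * left + 2 * center + right)) & 1:
            nxt |= 1 << i
    return nxt


def cycle_clusters(rule, n):
    """Partition all 2**n configurations into the cycles of a reversible CA."""
    size = 1 << n
    # A reversible CA maps the configuration space onto itself one-to-one.
    if len({step(c, rule, n) for c in range(size)}) != size:
        raise ValueError("rule is not reversible for this lattice size")
    seen = set()
    clusters = []
    for start in range(size):
        if start in seen:
            continue
        cycle = []
        c = start
        # Follow the trajectory until it closes back on itself.
        while c not in seen:
            seen.add(c)
            cycle.append(c)
            c = step(c, rule, n)
        clusters.append(cycle)
    return clusters
```

For example, `cycle_clusters(51, 3)` uses rule 51, which complements every cell, so each configuration is paired with its bitwise complement and the eight configurations fall into four two-element clusters.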


Complexity ◽  
2019 ◽  
Vol 2019 ◽  
pp. 1-13
Author(s):  
Cristina Sánchez-Rebollo ◽  
Cristina Puente ◽  
Rafael Palacios ◽  
Claudia Piriz ◽  
Juan P. Fuentes ◽  
...  

Social networks are being used by terrorist organizations to distribute messages with the intention of influencing people and recruiting new members. The research presented in this paper focuses on the analysis of Twitter messages to detect the leaders orchestrating terrorist networks and their followers. A big data architecture is proposed to analyze messages in real time in order to classify users according to different parameters like level of activity, the ability to influence other users, and the contents of their messages. Graphs have been used to analyze how the messages propagate through the network, and this involves a study of the followers based on retweets and general impact on other users. Then, fuzzy clustering techniques were used to classify users into profiles, with the advantage over other classification techniques of providing a probability for each profile instead of a binary categorization. Algorithms were tested using a public database from Kaggle and other Twitter extraction techniques. The resulting profiles detected automatically by the system were manually analyzed, and the parameters that describe each profile correspond to the type of information that any expert may expect. Future applications are not limited to detecting terrorist activism. Human resources departments can apply the power of profile identification to automatically classify candidates, security teams can detect undesirable clients in the financial or insurance sectors, and immigration officers can extract additional insights with these techniques.
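The difference between a graded membership per profile and a binary label can be sketched with the fuzzy c-means membership formula. The abstract does not name the specific fuzzy algorithm, so the use of fuzzy c-means here is an assumption, as are the function name, the Euclidean distance, and the fuzzifier value `m = 2`.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Degree of membership of one point in each cluster (fuzzy c-means style).

    The memberships are non-negative and sum to 1, so they can be read as
    a soft assignment over profiles rather than a single hard label.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]
```

A hard classifier would report only the index of the largest value; keeping the full vector shows how strongly a user matches each profile.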


Author(s):  
Muhamad Alias Md. Jedi ◽  
Robiah Adnan

TCLUST is a statistical clustering method based on a modification of the trimmed k-means clustering algorithm. It is called a “crisp” clustering approach because each observation is either eliminated or assigned to a group. TCLUST strengthens the group assignment by placing constraints on the cluster scatter matrices. The emphasis in this paper is on restricting the eigenvalues, λ, of the scatter matrix. The idea of imposing constraints is to maximize the log-likelihood function of the spurious-outlier model. A review of different robust clustering approaches is presented as a comparison to TCLUST methods. This paper discusses the nature of the TCLUST algorithm, how to determine the number of clusters or groups properly, and how to measure the strength of group assignment. Finally, the R package for TCLUST implements the types of scatter restriction, making the algorithm more flexible for choosing the number of clusters and the trimming proportion.
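Two ingredients mentioned above, crisp trimming and an eigenvalue restriction, can be sketched in a few lines. This is an illustrative fragment rather than the TCLUST algorithm or the R package's code: the function names are hypothetical, the trimming step shown is the plain trimmed k-means one with Euclidean distance, and the restriction is assumed to take the form of bounding the ratio of the largest to the smallest scatter-matrix eigenvalue by a constant `c`, which the abstract does not spell out.

```python
import math


def trimmed_assignment(points, centers, alpha=0.1):
    """Crisp assignment with trimming: each point gets a cluster index,
    except the alpha fraction farthest from every center, labeled -1."""
    nearest = []
    for p in points:
        # Distance to, and index of, the closest center.
        d, j = min((math.dist(p, c), j) for j, c in enumerate(centers))
        nearest.append((d, j))
    n_trim = int(math.floor(alpha * len(points)))
    # Indices ordered from farthest to closest.
    order = sorted(range(len(points)), key=lambda i: nearest[i][0], reverse=True)
    trimmed = set(order[:n_trim])
    return [-1 if i in trimmed else nearest[i][1] for i in range(len(points))]


def eigenvalue_ratio_ok(eigenvalues_per_cluster, c):
    """Check an assumed restriction: max eigenvalue / min eigenvalue <= c,
    taken over the scatter matrices of all clusters."""
    all_eigs = [lam for eigs in eigenvalues_per_cluster for lam in eigs]
    return max(all_eigs) / min(all_eigs) <= c
```

With `c = 1` the check forces all clusters to be spherical with equal scatter, while larger values of `c` allow progressively more varied cluster shapes.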


Author(s):  
Alexander Troussov ◽  
Sergey Maruev ◽  
Sergey Vinogradov ◽  
Mikhail Zhizhin

Techno-social systems generate data that are rather different from the data traditionally studied in social network analysis and other fields. In massive social networks, agents simultaneously participate in several contexts and in different communities. Network models of many real data from techno-social systems reflect various dimensionalities and rationales of actors' actions and interactions. The data are inherently multidimensional, where “everything is deeply intertwingled”. The multidimensional nature of Big Data and the emergence of typical network characteristics in Big Data make it reasonable to address the challenges of structure detection in network models, including a) development of novel methods for local overlapping clustering with outliers, b) with near-linear performance, c) preferably combined with the computation of the structural importance of nodes. In this chapter, the spreading connectivity based clustering method is introduced. The viability of the approach and its advantages are demonstrated on data from the largest European social network, VK.

