Machine learning-driven automatic storage space recommendation for object-based cloud storage system

Author(s):  
Anindita Sarkar Mondal ◽  
Anirban Mukhopadhyay ◽  
Samiran Chattopadhyay

Abstract
An object-based cloud storage system is a storage platform in which big data is managed through the internet and each piece of data is treated as an object. A smart storage system should handle the variety property of big data by automatically recommending a storage space for each data type, and machine learning can help make a storage system automatic. This article proposes a classification engine framework for this purpose that utilizes a machine learning strategy: a feature selection approach wrapped with a classifier is used to automatically predict the proper storage space for incoming big data, which helps build an automatic storage space recommendation system for an object-based cloud storage platform. To find a suitable combination of feature selection algorithm and classifier for the proposed classification engine, a comparative study of supervised feature selection algorithms (i.e., Fisher score, F-score, and ll21) from three categories (similarity-, statistical-, and sparse-learning-based) paired with various classifiers (i.e., SVM, K-NN, and neural network) is performed. We illustrate our study using the RSoS system, which provides a cloud storage platform for healthcare data, as the experimental big data, considering its variety property. The experiments confirm that ll21 feature selection combined with the K-NN classifier outperforms the other combinations.
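A minimal sketch of the kind of feature-selection-plus-classifier pipeline the abstract describes, assuming scikit-learn: SelectKBest with the ANOVA F-score stands in for the paper's F-score selector, and the synthetic features, labels, and three storage classes are hypothetical placeholders rather than the RSoS healthcare data.

```python
# Hypothetical sketch: supervised feature selection wrapped with a K-NN classifier,
# in the spirit of the proposed classification engine (not the authors' exact setup).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))      # incoming objects described by 40 features (placeholder)
y = rng.integers(0, 3, size=500)    # 3 hypothetical storage classes (e.g. hot/warm/cold)

engine = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),   # keep the 10 most relevant features
    ("clf", KNeighborsClassifier(n_neighbors=5)),          # K-NN recommends the storage space
])

print("mean CV accuracy:", cross_val_score(engine, X, y, cv=5).mean())
```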

2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Noura AlNuaimi ◽  
Mohammad Mehedy Masud ◽  
Mohamed Adel Serhani ◽  
Nazar Zaki

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations' decisions in real time. However, storing and processing large and varied datasets (known as big data) in real time is challenging. In machine learning, streaming feature selection has long been considered a superior technique for selecting the relevant subset of features from high-dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to settings in which features arrive consecutively over time: the number of features is not known in advance, whereas the number of instances is fixed. Many scholars in the field have proposed streaming-feature-selection algorithms in attempts to find a proper solution to this problem. This paper presents an exhaustive and methodical introduction to these techniques. This study provides a review of the traditional feature-selection algorithms and then scrutinizes the current algorithms that use streaming feature selection to determine their strengths and weaknesses. The survey also sheds light on the ongoing challenges in big-data research.
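As a toy illustration of this setting (not any specific algorithm from the survey), the sketch below assumes scikit-learn: features arrive one at a time, their total number is unknown in advance, and a candidate is kept only when its estimated relevance to the target exceeds a hypothetical threshold.

```python
# Toy streaming-feature-selection loop: the feature count is open-ended, the number
# of instances is fixed, and relevance is estimated with mutual information.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
n_samples = 300
y = rng.integers(0, 2, size=n_samples)   # fixed set of instances

selected = []        # indices of accepted features
THRESHOLD = 0.05     # hypothetical relevance threshold

def feature_stream(n_features):
    """Simulate features arriving consecutively over time."""
    for j in range(n_features):
        noise = rng.normal(size=n_samples)
        x = y + noise if j % 3 == 0 else noise   # every third feature carries signal
        yield j, x.reshape(-1, 1)

for j, x in feature_stream(30):
    relevance = mutual_info_classif(x, y, random_state=0)[0]
    if relevance > THRESHOLD:
        selected.append(j)

print("accepted feature indices:", selected)
```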


2020 ◽  
Vol 1486 ◽  
pp. 052014
Author(s):  
Jianbao Zhu ◽  
Jing Fu ◽  
Yuwei Sun ◽  
Ye Shi ◽  
Yu Chen ◽  
...  

2013 ◽  
Vol 22 (04) ◽  
pp. 1350027
Author(s):  
Jaganathan Palanichamy ◽  
Kuppuchamy Ramasamy

Feature selection is essential in data mining and pattern recognition, especially for database classification. Over the past years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature with the class variable, and the redundancy is calculated by utilizing the relationship between candidate features, selected features, and the class variable. The effectiveness is tested with ten benchmark datasets available in the UCI Machine Learning Repository. The experimental results show better performance when compared with some existing algorithms.
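A hedged sketch of a greedy maximum-relevance/minimum-redundancy selector along the lines of the criterion described above, assuming scikit-learn and NumPy: relevance is the mutual information between a candidate feature and the class, and redundancy is the average mutual information with the already-selected features, estimated here after simple binning. This is a generic mRMR-style illustration, not the authors' exact formulation.

```python
# Generic greedy mRMR-style selector (illustrative, not the paper's algorithm).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def _discretize(col, bins=10):
    """Bin a continuous feature so mutual_info_score can be applied."""
    return np.digitize(col, np.histogram_bin_edges(col, bins=bins))

def mrmr_select(X, y, k):
    relevance = mutual_info_classif(X, y, random_state=0)   # I(feature; class)
    selected = [int(np.argmax(relevance))]                  # start with the most relevant
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(_discretize(X[:, j]),
                                                    _discretize(X[:, s]))
                                  for s in selected])
            score = relevance[j] - redundancy                # max relevance, min redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print("selected features:", mrmr_select(X, y, k=5))
```

The redundancy term is what distinguishes mRMR-style selectors from purely relevance-ranked ones: two highly relevant but nearly identical features will not both be chosen.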


This chapter describes several methodologies and proposed models used to examine the accuracy and efficiency of high-performance colon-cancer feature selection and classification algorithms to solve the problems identified in Chapter 2. An elaboration of the diverse gene/feature selection algorithms and the related classification algorithms implemented throughout this study is presented. A prototypical methodology blueprint for each experiment is developed to answer the research questions posed in Chapter 1. Each system model is also presented, and the measures used to validate the performance of the model's outcomes are discussed.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Peng-fei Ke ◽  
Dong-sheng Xiong ◽  
Jia-hui Li ◽  
Zhi-lin Pan ◽  
Jing Zhou ◽  
...  

Abstract
Finding effective and objective biomarkers to inform the diagnosis of schizophrenia is of great importance yet remains challenging. Relatively little work has been conducted on multi-biological data for the diagnosis of schizophrenia. In this cross-sectional study, we extracted multiple features from three types of biological data: gut microbiota data, blood data, and electroencephalogram data. An integrated machine learning framework consisting of five classifiers, three feature selection algorithms, and four cross-validation methods was then used to discriminate patients with schizophrenia from healthy controls. Our results show that the support vector machine classifier without feature selection, using the multi-biological input features, achieved the best performance, with an accuracy of 91.7% and an AUC of 96.5% (p < 0.05). These results indicate that multi-biological data showed better discriminative capacity for patients with schizophrenia than any single type of biological data. The top 5% of discriminative features selected from the optimal model include gut microbiota features (Lactobacillus, Haemophilus, and Prevotella), blood features (superoxide dismutase level, monocyte-lymphocyte ratio, and neutrophil count), and electroencephalogram features (nodal local efficiency, nodal efficiency, and nodal shortest path length in the temporal and frontal-parietal brain areas). The proposed integrated framework may be helpful for understanding the pathophysiology of schizophrenia and developing biomarkers for schizophrenia using multi-biological data.
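A minimal sketch of the kind of pipeline the study evaluates, assuming scikit-learn: features from several biological modalities are concatenated and fed to an RBF support vector machine under cross-validation. The modality sizes, labels, and hyperparameters below are hypothetical placeholders, not the study's data.

```python
# Hypothetical multi-modal SVM pipeline with cross-validated accuracy and AUC.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120                              # hypothetical number of participants
gut = rng.normal(size=(n, 30))       # gut microbiota features
blood = rng.normal(size=(n, 10))     # blood features
eeg = rng.normal(size=(n, 60))       # EEG graph-theoretic features
X = np.hstack([gut, blood, eeg])     # concatenated multi-biological feature vector
y = rng.integers(0, 2, size=n)       # 0 = healthy control, 1 = patient

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"CV accuracy: {acc:.3f}, CV AUC: {auc:.3f}")
```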


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 288-301
Author(s):  
G. Sujatha ◽  
Dr. Jeberson Retna Raj

Data storage is one of the significant cloud services available to cloud users. Since the magnitude of outsourced information is growing extremely fast, data deduplication techniques need to be implemented in the cloud storage space for efficient utilization. The cloud storage space supports all kinds of digital data, such as text, audio, video, and images. In a hash-based deduplication system, a cryptographic hash value is calculated for every piece of data, irrespective of its type, and stored in memory for future reference; duplicate copies are identified using these hash values alone. The problem with this existing scenario is the size of the hash table: to find a duplicate copy, in the worst case all the hash values must be checked, irrespective of data type. Moreover, not every kind of digital data suits the same hash-table structure. In this study, we propose an approach that maintains multiple hash tables for the different types of digital data. Having a dedicated hash table for each data type improves the search time for duplicate data.
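An illustrative sketch of the per-type hash-table idea, using only Python's standard library: each digital data type gets its own table, so a duplicate lookup only scans hashes of objects of the same type. SHA-256 stands in for the unspecified cryptographic hash, and detecting the type from the file extension is a simplification.

```python
# One hash table (set of digests) per data type, instead of a single global table.
import hashlib
from collections import defaultdict

TYPE_BY_EXT = {".txt": "text", ".mp3": "audio", ".mp4": "video", ".jpg": "image"}
hash_tables = defaultdict(set)

def data_type(filename):
    ext = filename[filename.rfind("."):].lower()
    return TYPE_BY_EXT.get(ext, "other")

def store(filename, content: bytes):
    """Return True if the object is stored, False if it is a duplicate."""
    digest = hashlib.sha256(content).hexdigest()
    table = hash_tables[data_type(filename)]   # only the matching table is consulted
    if digest in table:
        return False                           # duplicate: do not store again
    table.add(digest)
    return True                                # new object: record its hash and store it

print(store("report.txt", b"hello cloud"))     # True, first copy
print(store("copy.txt", b"hello cloud"))       # False, duplicate within the text table
print(store("clip.mp4", b"hello cloud"))       # True, looked up in a different (video) table
```

Partitioning the index this way bounds the worst-case lookup to the hashes of one data type rather than the whole store, which is the search-time saving the abstract argues for.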


2021 ◽  
Author(s):  
Nastaran Chakani ◽  
Seyed Masoud Mirrezaei ◽  
Ghosheh Abed Hodtani

Abstract
Outsourcing data to cloud storage services has attracted great attention due to the prospect of rapid data growth and storage efficiencies for customers. The coding-based cloud storage approach can offer a more reliable and faster solution with less storage space than replication-based cloud storage. LT codes, a well-known member of the rateless codes family, can improve the performance of storage systems when good degree distributions are used. Since the degree distribution plays a key role in LT-code performance, the recently introduced Poisson Robust Soliton Distribution (PRSD) and Combined Poisson Robust Soliton Distribution (CPRSD) motivate us to investigate an LT-code-based cloud storage system. We therefore exploit LT codes with these new degree distributions to provide a lower average degree and higher decoding efficiency, especially when fewer encoding symbols are received, compared with the popular Robust Soliton Distribution (RSD). In this paper, we show that the proposed cloud storage outperforms traditional schemes in terms of storage space and robustness against the unavailability of encoding symbols, owing to the compatibility of PRSD and CPRSD with the nature of cloud storage. Furthermore, a modified decoding process based on the behavior of the required encoding symbols is presented to reduce data retrieval time. Numerical results confirm the improvement in cloud storage performance.
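A hedged sketch of LT-code degree sampling with the baseline Robust Soliton Distribution (RSD) that the paper compares against, using NumPy; the proposed PRSD and CPRSD distributions are not reproduced here, and the parameters c and delta are the usual RSD tuning knobs with arbitrarily chosen values.

```python
# Robust Soliton Distribution over encoding-symbol degrees 1..k (baseline RSD only).
import numpy as np

def robust_soliton(k, c=0.1, delta=0.5):
    """Return the RSD probability vector over degrees 1..k."""
    R = c * np.log(k / delta) * np.sqrt(k)
    rho = np.zeros(k + 1)
    rho[1] = 1.0 / k
    for d in range(2, k + 1):
        rho[d] = 1.0 / (d * (d - 1))            # ideal soliton component
    tau = np.zeros(k + 1)
    spike = int(k / R)                          # assumes 1 <= k/R <= k for these parameters
    for d in range(1, spike):
        tau[d] = R / (d * k)
    tau[spike] = R * np.log(R / delta) / k      # the RSD "spike"
    dist = rho + tau
    return dist[1:] / dist[1:].sum()            # normalize over degrees 1..k

k = 1000                                        # number of source symbols (placeholder)
p = robust_soliton(k)
degrees = np.random.default_rng(0).choice(np.arange(1, k + 1), size=5, p=p)
print("sampled degrees:", degrees, "average degree:", (np.arange(1, k + 1) * p).sum())
```

A lower average degree with enough high-degree symbols to keep decoding reliable is exactly the trade-off the degree distribution controls, which is why the choice of distribution matters for storage overhead and retrieval time.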

