Machine learning-driven automatic storage space recommendation for object-based cloud storage system

Author(s):  
Anindita Sarkar Mondal ◽  
Anirban Mukhopadhyay ◽  
Samiran Chattopadhyay

Abstract
An object-based cloud storage system is a storage platform in which big data is managed through the internet and each piece of data is treated as an object. A smart storage system should handle the variety property of big data by automatically recommending a storage space for each data type, and machine learning can help make a storage system automatic. This article proposes a classification engine framework for this purpose that utilizes a machine learning strategy: a feature selection approach wrapped with a classifier is used to automatically predict the proper storage space for incoming big data, which helps build an automatic storage space recommendation system for an object-based cloud storage platform. To find a suitable combination of feature selection algorithm and classifier for the proposed classification engine, a comparative study of supervised feature selection algorithms (i.e., Fisher score, F-score, and ll21) from three categories (similarity-, statistical-, and sparse-learning-based) paired with various classifiers (i.e., SVM, K-NN, and neural network) is performed. We illustrate our study using the RSoS system, which provides a cloud storage platform for healthcare data, as the experimental big data, considering its variety property. The experiments confirm that ll21 feature selection combined with the K-NN classifier outperforms the other combinations.
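A minimal sketch of the kind of feature-selection-plus-classifier pipeline the abstract describes, assuming scikit-learn: SelectKBest with the ANOVA F-score stands in for the paper's F-score selector, and the synthetic features, labels, and three storage classes are hypothetical placeholders rather than the RSoS healthcare data.

```python
# Hypothetical sketch: supervised feature selection wrapped with a K-NN classifier,
# in the spirit of the proposed classification engine (not the authors' exact setup).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))      # incoming objects described by 40 features (placeholder)
y = rng.integers(0, 3, size=500)    # 3 hypothetical storage classes (e.g. hot/warm/cold)

engine = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),   # keep the 10 most relevant features
    ("clf", KNeighborsClassifier(n_neighbors=5)),          # K-NN recommends the storage space
])

print("mean CV accuracy:", cross_val_score(engine, X, y, cv=5).mean())
```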

2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Noura AlNuaimi ◽  
Mohammad Mehedy Masud ◽  
Mohamed Adel Serhani ◽  
Nazar Zaki

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations' decisions in real time. However, storing and processing large and varied datasets (known as big data) in real time is challenging. In machine learning, streaming feature selection has long been considered a superior technique for selecting the relevant subset of features from high-dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to settings in which features arrive consecutively over time: the number of features is not known in advance, whereas the number of instances is fixed. Many scholars in the field have proposed streaming-feature-selection algorithms in attempts to find a proper solution to this problem. This paper presents an exhaustive and methodical introduction to these techniques. This study provides a review of the traditional feature-selection algorithms and then scrutinizes the current algorithms that use streaming feature selection to determine their strengths and weaknesses. The survey also sheds light on the ongoing challenges in big-data research.
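As a toy illustration of this setting (not any specific algorithm from the survey), the sketch below assumes scikit-learn: features arrive one at a time, their total number is unknown in advance, and a candidate is kept only when its estimated relevance to the target exceeds a hypothetical threshold.

```python
# Toy streaming-feature-selection loop: the feature count is open-ended, the number
# of instances is fixed, and relevance is estimated with mutual information.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
n_samples = 300
y = rng.integers(0, 2, size=n_samples)   # fixed set of instances

selected = []        # indices of accepted features
THRESHOLD = 0.05     # hypothetical relevance threshold

def feature_stream(n_features):
    """Simulate features arriving consecutively over time."""
    for j in range(n_features):
        noise = rng.normal(size=n_samples)
        x = y + noise if j % 3 == 0 else noise   # every third feature carries signal
        yield j, x.reshape(-1, 1)

for j, x in feature_stream(30):
    relevance = mutual_info_classif(x, y, random_state=0)[0]
    if relevance > THRESHOLD:
        selected.append(j)

print("accepted feature indices:", selected)
```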


2020 ◽  
Vol 1486 ◽  
pp. 052014
Author(s):  
Jianbao Zhu ◽  
Jing Fu ◽  
Yuwei Sun ◽  
Ye Shi ◽  
Yu Chen ◽  
...  

2013 ◽  
Vol 22 (04) ◽  
pp. 1350027
Author(s):  
Jaganathan Palanichamy ◽  
Kuppuchamy Ramasamy

Feature selection is essential in data mining and pattern recognition, especially for database classification. Over the past years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature with the class variable, and the redundancy is calculated by utilizing the relationship between candidate features, selected features, and the class variable. The effectiveness is tested with ten benchmark datasets available in the UCI Machine Learning Repository. The experimental results show better performance when compared with some existing algorithms.
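A hedged sketch of a greedy maximum-relevance/minimum-redundancy selector along the lines of the criterion described above, assuming scikit-learn and NumPy: relevance is the mutual information between a candidate feature and the class, and redundancy is the average mutual information with the already-selected features, estimated here after simple binning. This is a generic mRMR-style illustration, not the authors' exact formulation.

```python
# Generic greedy mRMR-style selector (illustrative, not the paper's algorithm).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def _discretize(col, bins=10):
    """Bin a continuous feature so mutual_info_score can be applied."""
    return np.digitize(col, np.histogram_bin_edges(col, bins=bins))

def mrmr_select(X, y, k):
    relevance = mutual_info_classif(X, y, random_state=0)   # I(feature; class)
    selected = [int(np.argmax(relevance))]                  # start with the most relevant
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(_discretize(X[:, j]),
                                                    _discretize(X[:, s]))
                                  for s in selected])
            score = relevance[j] - redundancy                # max relevance, min redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print("selected features:", mrmr_select(X, y, k=5))
```

The redundancy term is what distinguishes mRMR-style selectors from purely relevance-ranked ones: two highly relevant but nearly identical features will not both be chosen.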


This chapter describes several methodologies and proposed models used to examine the accuracy and efficiency of high-performance colon-cancer feature selection and classification algorithms to solve the problems identified in Chapter 2. An elaboration of the diverse gene/feature selection algorithms and the related classification algorithms implemented throughout this study is presented. A prototypical methodology blueprint for each experiment is developed to answer the research questions posed in Chapter 1. Each system model is also presented, and the measures used to validate the performance of the model's outcomes are discussed.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Peng-fei Ke ◽  
Dong-sheng Xiong ◽  
Jia-hui Li ◽  
Zhi-lin Pan ◽  
Jing Zhou ◽  
...  

Abstract
Finding effective and objective biomarkers to inform the diagnosis of schizophrenia is of great importance yet remains challenging. Relatively little work has been conducted on multi-biological data for the diagnosis of schizophrenia. In this cross-sectional study, we extracted multiple features from three types of biological data: gut microbiota data, blood data, and electroencephalogram data. An integrated machine learning framework consisting of five classifiers, three feature selection algorithms, and four cross-validation methods was then used to discriminate patients with schizophrenia from healthy controls. Our results show that the support vector machine classifier without feature selection, using the multi-biological input features, achieved the best performance, with an accuracy of 91.7% and an AUC of 96.5% (p < 0.05). These results indicate that multi-biological data showed better discriminative capacity for patients with schizophrenia than any single type of biological data. The top 5% of discriminative features selected from the optimal model include gut microbiota features (Lactobacillus, Haemophilus, and Prevotella), blood features (superoxide dismutase level, monocyte-lymphocyte ratio, and neutrophil count), and electroencephalogram features (nodal local efficiency, nodal efficiency, and nodal shortest path length in the temporal and frontal-parietal brain areas). The proposed integrated framework may be helpful for understanding the pathophysiology of schizophrenia and developing biomarkers for schizophrenia using multi-biological data.
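A minimal sketch of the kind of pipeline the study evaluates, assuming scikit-learn: features from several biological modalities are concatenated and fed to an RBF support vector machine under cross-validation. The modality sizes, labels, and hyperparameters below are hypothetical placeholders, not the study's data.

```python
# Hypothetical multi-modal SVM pipeline with cross-validated accuracy and AUC.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120                              # hypothetical number of participants
gut = rng.normal(size=(n, 30))       # gut microbiota features
blood = rng.normal(size=(n, 10))     # blood features
eeg = rng.normal(size=(n, 60))       # EEG graph-theoretic features
X = np.hstack([gut, blood, eeg])     # concatenated multi-biological feature vector
y = rng.integers(0, 2, size=n)       # 0 = healthy control, 1 = patient

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"CV accuracy: {acc:.3f}, CV AUC: {auc:.3f}")
```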


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 288-301
Author(s):  
G. Sujatha ◽  
Dr. Jeberson Retna Raj

Data storage is one of the significant cloud services available to cloud users. Since the magnitude of outsourced information is growing extremely fast, data deduplication techniques need to be implemented in the cloud storage space for efficient utilization. The cloud storage space supports all kinds of digital data, such as text, audio, video, and images. In a hash-based deduplication system, a cryptographic hash value is calculated for every piece of data, irrespective of its type, and stored in memory for future reference; duplicate copies are identified using these hash values alone. The problem with this existing scenario is the size of the hash table: to find a duplicate copy, in the worst case all the hash values must be checked, irrespective of data type. Moreover, not every kind of digital data suits the same hash-table structure. In this study, we propose an approach that maintains multiple hash tables for the different types of digital data. Having a dedicated hash table for each data type improves the search time for duplicate data.
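An illustrative sketch of the per-type hash-table idea, using only Python's standard library: each digital data type gets its own table, so a duplicate lookup only scans hashes of objects of the same type. SHA-256 stands in for the unspecified cryptographic hash, and detecting the type from the file extension is a simplification.

```python
# One hash table (set of digests) per data type, instead of a single global table.
import hashlib
from collections import defaultdict

TYPE_BY_EXT = {".txt": "text", ".mp3": "audio", ".mp4": "video", ".jpg": "image"}
hash_tables = defaultdict(set)

def data_type(filename):
    ext = filename[filename.rfind("."):].lower()
    return TYPE_BY_EXT.get(ext, "other")

def store(filename, content: bytes):
    """Return True if the object is stored, False if it is a duplicate."""
    digest = hashlib.sha256(content).hexdigest()
    table = hash_tables[data_type(filename)]   # only the matching table is consulted
    if digest in table:
        return False                           # duplicate: do not store again
    table.add(digest)
    return True                                # new object: record its hash and store it

print(store("report.txt", b"hello cloud"))     # True, first copy
print(store("copy.txt", b"hello cloud"))       # False, duplicate within the text table
print(store("clip.mp4", b"hello cloud"))       # True, looked up in a different (video) table
```

Partitioning the index this way bounds the worst-case lookup to the hashes of one data type rather than the whole store, which is the search-time saving the abstract argues for.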


2021 ◽  
Author(s):  
Nastaran Chakani ◽  
Seyed Masoud Mirrezaei ◽  
Ghosheh Abed Hodtani

Abstract
Outsourcing data to cloud storage services has attracted great attention due to the prospect of rapid data growth and storage efficiencies for customers. The coding-based cloud storage approach can offer a more reliable and faster solution with less storage space than replication-based cloud storage. LT codes, a well-known member of the rateless codes family, can improve the performance of storage systems when good degree distributions are used. Since the degree distribution plays a key role in LT-code performance, the recently introduced Poisson Robust Soliton Distribution (PRSD) and Combined Poisson Robust Soliton Distribution (CPRSD) motivate us to investigate an LT-code-based cloud storage system. We therefore exploit LT codes with these new degree distributions to provide a lower average degree and higher decoding efficiency, especially when fewer encoding symbols are received, compared with the popular Robust Soliton Distribution (RSD). In this paper, we show that the proposed cloud storage outperforms traditional schemes in terms of storage space and robustness against the unavailability of encoding symbols, owing to the compatibility of PRSD and CPRSD with the nature of cloud storage. Furthermore, a modified decoding process based on the behavior of the required encoding symbols is presented to reduce data retrieval time. Numerical results confirm the improvement in cloud storage performance.
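A hedged sketch of LT-code degree sampling with the baseline Robust Soliton Distribution (RSD) that the paper compares against, using NumPy; the proposed PRSD and CPRSD distributions are not reproduced here, and the parameters c and delta are the usual RSD tuning knobs with arbitrarily chosen values.

```python
# Robust Soliton Distribution over encoding-symbol degrees 1..k (baseline RSD only).
import numpy as np

def robust_soliton(k, c=0.1, delta=0.5):
    """Return the RSD probability vector over degrees 1..k."""
    R = c * np.log(k / delta) * np.sqrt(k)
    rho = np.zeros(k + 1)
    rho[1] = 1.0 / k
    for d in range(2, k + 1):
        rho[d] = 1.0 / (d * (d - 1))            # ideal soliton component
    tau = np.zeros(k + 1)
    spike = int(k / R)                          # assumes 1 <= k/R <= k for these parameters
    for d in range(1, spike):
        tau[d] = R / (d * k)
    tau[spike] = R * np.log(R / delta) / k      # the RSD "spike"
    dist = rho + tau
    return dist[1:] / dist[1:].sum()            # normalize over degrees 1..k

k = 1000                                        # number of source symbols (placeholder)
p = robust_soliton(k)
degrees = np.random.default_rng(0).choice(np.arange(1, k + 1), size=5, p=p)
print("sampled degrees:", degrees, "average degree:", (np.arange(1, k + 1) * p).sum())
```

A lower average degree with enough high-degree symbols to keep decoding reliable is exactly the trade-off the degree distribution controls, which is why the choice of distribution matters for storage overhead and retrieval time.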

