Knowledge management overview of feature selection problem in high-dimensional financial data: cooperative co-evolution and MapReduce perspectives

2019
Vol 17 (4)
pp. 340-359
Author(s):  
A N M Bazlur Rashid ◽  
Tonmoy Choudhury

The term “big data” characterizes the massive amounts of data generated by advanced technologies in different domains, typically described via the 4Vs (volume, velocity, variety, and veracity): the amount of data, which can only be processed via computationally intensive analysis; the speed of data creation; the different types of data; and their accuracy. High-dimensional financial data, such as time-series and space-time data, contain a large number of features (variables) while having a small number of samples; such data are used to measure various real-time business situations for financial organizations. These datasets are normally noisy, complex correlations may exist between their features, and many domains, including finance, lack the analytical tools to mine them for knowledge discovery because of their high dimensionality. Feature selection is an optimization problem: finding a minimal subset of relevant features that maximizes classification accuracy and reduces computational cost. Traditional statistics-based feature selection approaches are not adequate to deal with the curse of dimensionality associated with big data. Cooperative co-evolution, a meta-heuristic algorithm following a divide-and-conquer approach, decomposes high-dimensional problems into smaller sub-problems. Further, MapReduce, a programming model, offers a ready-to-use distributed, scalable, and fault-tolerant infrastructure for parallelizing the developed algorithm. This article presents a knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions.
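To make the decomposition idea concrete, the following is a minimal sketch (not taken from the article) of cooperative co-evolutionary feature selection: the feature indices are split into subcomponents, each subcomponent evolves its own binary subpopulation, and individuals are evaluated together with the best collaborators from the other subcomponents. The fitness function, group count, and all parameters are illustrative placeholders.

```python
# A minimal sketch of cooperative co-evolution (CC) for feature selection,
# assuming a static decomposition and a toy fitness; all names illustrative.
import numpy as np

rng = np.random.default_rng(42)

def fitness(mask, X, y):
    """Toy fitness: reward class separation on the selected features and
    penalize subset size. A real system would use classifier accuracy."""
    if mask.sum() == 0:
        return -np.inf
    Xs = X[:, mask.astype(bool)]
    centroids = np.array([Xs[y == c].mean(axis=0) for c in np.unique(y)])
    return np.linalg.norm(centroids[0] - centroids[1]) - 0.01 * mask.sum()

def cc_feature_selection(X, y, n_subcomponents=4, pop_size=20, generations=30):
    n_features = X.shape[1]
    # Static decomposition: split shuffled feature indices into subcomponents.
    groups = np.array_split(rng.permutation(n_features), n_subcomponents)
    # One binary subpopulation per subcomponent.
    pops = [rng.integers(0, 2, (pop_size, len(g))) for g in groups]
    # Context vector: current best collaborator from each subpopulation.
    context = rng.integers(0, 2, n_features)
    for _ in range(generations):
        for gi, g in enumerate(groups):
            scores = []
            for ind in pops[gi]:
                trial = context.copy()
                trial[g] = ind           # evaluate individual with collaborators
                scores.append(fitness(trial, X, y))
            best = int(np.argmax(scores))
            context[g] = pops[gi][best]  # update context with subcomponent best
            # Simple evolution step: bit-flip mutation of the best individual.
            children = np.tile(pops[gi][best], (pop_size, 1))
            flips = rng.random(children.shape) < 0.1
            pops[gi] = np.where(flips, 1 - children, children)
    return context

# Usage on synthetic two-class data:
X = rng.normal(size=(100, 40)); y = (rng.random(100) > 0.5).astype(int)
X[y == 1, :5] += 2.0                     # make the first 5 features informative
mask = cc_feature_selection(X, y)
print("selected features:", np.flatnonzero(mask))
```

A MapReduce deployment would parallelize the inner evaluation loop across data or subcomponent splits; the serial loop above only illustrates the decomposition-and-collaboration logic.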

2020
Vol 7 (1)
Author(s):  
A. N. M. Bazlur Rashid ◽  
Mohiuddin Ahmed ◽  
Leslie F. Sikos ◽  
Paul Haskell-Dowland

Abstract: A massive amount of data is generated with the evolution of modern technologies. This high-throughput data generation results in Big Data, which consist of many features (attributes). However, irrelevant features may degrade the classification performance of machine learning (ML) algorithms. Feature selection (FS) is a technique used to select a subset of relevant features that represents the dataset. Evolutionary algorithms (EAs) are widely used search strategies in this domain. A variant of EAs, called cooperative co-evolution (CC), which uses a divide-and-conquer approach, is a good choice for optimization problems. Existing solutions perform poorly because of limitations such as not considering feature interactions, dealing only with an even number of features, and decomposing the dataset statically. In this paper, a novel random feature grouping (RFG) is introduced, with three variants, to dynamically decompose Big Data datasets and to ensure a nonzero probability of grouping interacting features into the same subcomponent. RFG can be used in CC-based FS processes, hence the name Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). Experimental analysis was performed using six widely used ML classifiers on seven datasets from the UCI ML repository and the Princeton University Genomics repository, with and without FS. The experimental results indicate that, in most cases [i.e., with naïve Bayes (NB), support vector machine (SVM), k-nearest neighbor (k-NN), J48, and random forest (RF)], the proposed CCFSRFG-1 outperforms an existing CC-based FS solution (CCEAFS), CCFSRFG-2, and the use of all features, in terms of accuracy, sensitivity, and specificity.
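The core of RFG, as described in the abstract, is re-partitioning the feature indices at random so that any pair of interacting features has a nonzero probability of landing in the same subcomponent, and so that uneven group sizes (odd feature counts) are handled naturally. A minimal sketch, with an illustrative group count and cycle loop:

```python
import numpy as np

def random_feature_grouping(n_features, n_groups, rng):
    """Return a fresh random partition of feature indices into n_groups;
    np.array_split tolerates unequal group sizes, so an odd number of
    features is not a problem."""
    return np.array_split(rng.permutation(n_features), n_groups)

rng = np.random.default_rng(0)
for cycle in range(3):                      # re-group once per CC cycle
    groups = random_feature_grouping(11, 3, rng)
    print(f"cycle {cycle}:", [g.tolist() for g in groups])
```

Because the partition is redrawn every cycle, two interacting features separated in one cycle may be co-located in the next, which is the dynamic-decomposition property the paper exploits.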


2021
Vol 26 (1)
pp. 67-77
Author(s):  
Siva Sankari Subbiah ◽  
Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The rapid development of the Internet enables organizations to capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. Dataset dimensionality grows day by day at an extraordinary rate, resulting in large-scale, high-dimensional datasets. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing dimensionality and the overfitting of the learning process. Researchers have proposed many feature selection methods for obtaining the most relevant features, especially from big datasets, helping to provide accurate learning results without performance degradation. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and summarizes the related research work done by various researchers. Overall, combining big data analysis with feature selection improves learning accuracy.
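As a concrete instance of the filter-style selection such reviews cover, here is a minimal centralized sketch using scikit-learn; the dataset and the choice of k are illustrative, and the distributed Hadoop/Spark pipelines the paper discusses follow the same score-rank-select pattern at larger scale.

```python
# A filter method: score each feature against the label, keep the top k.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)
selector = SelectKBest(mutual_info_classif, k=10)  # keep the 10 best features
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                             # (500, 10)
```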


2017
Vol 21 (1)
pp. 57-70
Author(s):  
Lorna Uden ◽  
Wu He

Purpose: Current knowledge management (KM) systems cannot be used effectively for decision-making because of the lack of real-time data. This study aims to discuss how KM can benefit from embedding the Internet of Things (IoT).
Design/methodology/approach: The paper uses a case study to discuss how IoT can help KM capture data and convert it into knowledge to improve parking services in transportation.
Findings: The case study of an intelligent parking service supported by in-vehicle IoT devices shows that KM can play a role in turning the incoming big data collected from IoT devices into useful knowledge more quickly and effectively.
Originality/value: The literature review shows that few papers discuss how KM can benefit from embedding IoT and processing the incoming big data collected from IoT devices. The case study developed here provides evidence of how IoT can help KM capture big data and convert it into knowledge to improve parking services in transportation.


2011
Vol 7 (S285)
pp. 340-341
Author(s):  
Dayton L. Jones ◽  
Kiri Wagstaff ◽  
David Thompson ◽  
Larry D'Addario ◽  
Robert Navarro ◽  
...  

Abstract: The detection of fast (< 1 second) transient signals requires a challenging balance between the need to examine vast quantities of high time-resolution data and the impracticality of storing all the data for later analysis. This is the epitome of a “big data” issue: far more data will be produced by next-generation astronomy facilities than can be analyzed, distributed, or archived using traditional methods. JPL is developing technologies to deal with “big data” problems from initial data generation through real-time data triage algorithms to large-scale data archiving and mining. Although most current work is focused on the needs of large radio arrays, the technologies involved are widely applicable in other areas.
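As an illustration of the triage idea (a toy model, not JPL's actual pipeline), the sketch below scans a high-rate stream in windows and retains only windows containing a statistically significant peak, discarding everything else instead of archiving it; the window size and threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def triage(stream, window=256, n_sigma=5.0):
    """Yield (start_index, window_data) for windows with a > n_sigma peak."""
    for start in range(0, len(stream) - window, window):
        chunk = stream[start:start + window]
        mu, sigma = chunk.mean(), chunk.std()
        if sigma > 0 and (chunk.max() - mu) / sigma > n_sigma:
            yield start, chunk          # keep the candidate, drop the rest

noise = rng.normal(size=100_000)
noise[54_321] += 12.0                   # inject one fast transient
print([start for start, _ in triage(noise)])
```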


Mathematics
2021
Vol 9 (20)
pp. 2627
Author(s):  
Felwa Abukhodair ◽  
Wafaa Alsaggaf ◽  
Amani Tariq Jamal ◽  
Sayed Abdel-Khalek ◽  
Romany F. Mansour

Big Data techniques are highly effective for systematically extracting and analyzing massive datasets, and they can manage data more proficiently than conventional data-handling approaches. Recently, several schemes have been developed for handling big datasets with many features. At the same time, feature selection (FS) methodologies aim to eliminate repetitive, noisy, and unwanted features that degrade classifier results. Since conventional methods fail to attain scalability on massive data, the design of new Big Data classification models is essential. In this regard, this study focuses on the design of a metaheuristic-optimization-based big data classification technique in a MapReduce environment (MOBDC-MR). The MOBDC-MR technique aims to choose optimal features and effectively classify big data. It involves a binary pigeon optimization algorithm (BPOA)-based FS technique to reduce complexity and increase accuracy, and a beetle antennae search (BAS) algorithm with a long short-term memory (LSTM) model is employed for big data classification. The presented MOBDC-MR technique has been realized on Hadoop with the MapReduce programming model. Its performance was validated using a benchmark dataset, and the results were investigated under several measures; the MOBDC-MR technique demonstrated promising performance over existing techniques across different dimensions.
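While the paper's FS component is the BPOA (not reproduced here), the MapReduce pattern itself is easy to illustrate: mappers emit partial statistics from their data splits, and a reducer aggregates them into global per-feature scores. The sketch below applies this pattern to a simple correlation-based relevance score; all names and the scoring choice are illustrative.

```python
import numpy as np
from functools import reduce

def mapper(split):
    """Partial sufficient statistics for a Pearson-style relevance score,
    computed on one data split (one mapper's share of the data)."""
    X, y = split
    return {"n": len(y), "sx": X.sum(0), "sy": y.sum(),
            "sxy": X.T @ y, "sxx": (X ** 2).sum(0), "syy": (y ** 2).sum()}

def reducer(a, b):
    # Aggregate partial statistics elementwise across splits.
    return {k: a[k] + b[k] for k in a}

def feature_scores(splits):
    s = reduce(reducer, map(mapper, splits))
    n = s["n"]
    cov = s["sxy"] - s["sx"] * s["sy"] / n
    vx = s["sxx"] - s["sx"] ** 2 / n
    vy = s["syy"] - s["sy"] ** 2 / n
    return np.abs(cov / np.sqrt(vx * vy + 1e-12))   # |correlation| per feature

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 20)); y = X[:, 3] + 0.1 * rng.normal(size=1000)
splits = [(X[i::4], y[i::4]) for i in range(4)]     # four simulated data splits
print(np.argsort(feature_scores(splits))[-3:])      # feature 3 ranks highest
```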


2021
Vol 12 (2)
pp. 53-72
Author(s):  
Rojalina Priyadarshini ◽  
Rabindra Kumar Barik ◽  
Harish Chandra Dubey ◽  
Brojo Kishore Mishra

The growing use of wearables within the Internet of Things (IoT) creates ever-increasing multi-modal data from various smart health applications. This enormous volume of data generation creates new challenges in transmission, storage, and processing, and processing medical big data in a cloud backend is associated with challenges such as communication latency and data security. Fog computing (FC) is an emerging distributed computing paradigm that addresses these problems by leveraging local data processing, storage, filtering, and machine intelligence within an intermediate fog layer residing between the cloud and wearable devices. This paper surveys two major aspects of deploying fog computing for smart and connected health. First, the role of machine-learning-based edge intelligence in the fog layer for data processing is investigated. A comprehensive analysis is provided throughout the survey, highlighting the strengths of and improvements in the existing literature. The paper ends with open challenges and future research areas in the domain of fog-based healthcare.
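As a toy illustration of the fog-layer filtering described above (a sketch under invented assumptions, not one of the surveyed systems), a fog node can reduce a raw wearable sensor window to a compact summary plus out-of-range alerts before anything is sent to the cloud; the thresholds and payload format are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def fog_process(readings, lo=50, hi=120):
    """Reduce a raw heart-rate window to a summary and any alert events;
    only this small payload is forwarded to the cloud."""
    summary = {"mean": float(readings.mean()),
               "min": float(readings.min()),
               "max": float(readings.max())}
    alerts = [float(r) for r in readings if r < lo or r > hi]
    return {"summary": summary, "alerts": alerts}

window = rng.normal(75, 8, size=600)    # ~10 min of 1 Hz heart-rate samples
window[100] = 140.0                     # one anomalous spike
payload = fog_process(window)
print(payload["summary"], payload["alerts"])
```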


2017
Vol 21 (1)
pp. 156-179
Author(s):  
Christina O’Connor ◽  
Stephen Kelly

Purpose: This paper aims to critique a facilitated knowledge management (KM) process that utilises filtered big data and, specifically, the process's effectiveness in overcoming barriers to small and medium-sized enterprises' (SMEs') use of big data, its enablement of SME engagement with and use of big data, and its effect on SME competitiveness within an agri-food sector.
Design/methodology/approach: From 300 participant firms, SME owner-managers representing seven longitudinal case studies were contacted by the facilitator at least once monthly over six months.
Findings: Results indicate that explicit and tacit knowledge can be enhanced when SMEs have access to a facilitated programme that analyses, packages and explains big data consumer analytics captured by a large pillar firm in a food network. Additionally, big data and knowledge remain mutually exclusive unless effective KM processes are implemented. Several barriers to knowledge acquisition and application stem from SME resource limitations, strategic orientation and asymmetrical power relationships within a network.
Research limitations/implications: By using Dunnhumby data, this study captured the impact of only one form of big data, consumer analytics. However, this is a significant dataset for SME agri-food businesses. Additionally, although the SMEs were based in only one UK region, Northern Ireland, there is wide scope for future research across multiple UK regions with the same Dunnhumby dataset.
Originality/value: The study demonstrates the potential relevance of big data to SMEs' activities and development, explicitly identifying that realising this potential requires the data to be filtered and presented as market-relevant information that engages SMEs, recognises relationship dynamics and supports learning through feedback and two-way dialogue. This is the first study to empirically analyse filtered big data and SME competitiveness. The examination of relationship dynamics also overcomes limitations in the existing literature, where SMEs' constraints are seen as the prime factor restricting knowledge transfer.


2015
Vol 2015
pp. 1-5
Author(s):  
Manli Zhou ◽  
Youxi Luo ◽  
Guoquan Sun ◽  
Guoqin Mai ◽  
Fengfeng Zhou

Efficient and intuitive characterization of biological big data is becoming a major challenge for modern bio-OMIC-based scientists. Interactive visualization and exploration of big data has proven to be one of the successful solutions. However, most existing feature selection algorithms do not allow interactive user input during the feature selection optimization process. This study addresses the problem by fixing a few user-input features in the final selected feature subset, formulating these user-input features as constraints in a programming model. The proposed algorithm, fsCoP (feature selection based on constrained programming), performs similarly to or much better than existing feature selection algorithms, even with constraints drawn from both the literature and existing algorithms. An fsCoP biomarker may be intriguing for further wet-lab validation, since it satisfies both the classification optimization function and biomedical knowledge. fsCoP may also be used for interactive exploration of bio-OMIC big data by interactively adding user-defined constraints to the model.
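The constraint mechanism can be illustrated with a simplified stand-in (a greedy ranking rather than the paper's constrained-programming model): user-specified features are pinned into the subset, and the remaining slots are filled by the best-scoring free features. The feature indices, subset size, and scoring function below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=8, random_state=0)
user_constraints = [7, 19]              # features the user insists on keeping
k = 10                                  # total subset size

scores = mutual_info_classif(X, y, random_state=0)
scores[user_constraints] = np.inf       # pinned features always rank first
selected = np.argsort(-scores)[:k]      # constrained + best free features
print(sorted(selected.tolist()))
```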

