Feature Engineering for Various Data Types in Data Science

The precision of any machine learning algorithm depends on the data set, its suitability, and its volume. Data and its characteristics have therefore become the predominant components of any predictive or precision-based domain such as machine learning. Feature engineering refers to the process of transforming and preparing this input data so that it is ready for training machine learning models. Several feature types, such as categorical, numerical, mixed, date, and time, must be considered for feature extraction in feature engineering. Dataset characteristics such as cardinality, missing data, and rare labels for categorical features, along with distribution, outliers, and magnitude, must also be taken into account. This chapter discusses various data types and the feature engineering techniques that apply to them. It also focuses on the implementation of these techniques for feature extraction.
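As a rough illustration of the categorical-feature issues named above (missing data, rare labels, cardinality), the following is a minimal sketch using only the Python standard library. It is not taken from the chapter: the function names, the `min_share` threshold, and the toy `city` column are all assumptions made for this example.

```python
from collections import Counter


def impute_missing(values, placeholder="Missing"):
    """Replace None entries in a categorical column with an explicit label."""
    return [placeholder if v is None else v for v in values]


def group_rare_labels(values, min_share=0.1, rare_label="Rare"):
    """Collapse labels whose relative frequency is below min_share into one bucket.

    This reduces cardinality before encoding.
    """
    counts = Counter(values)
    total = len(values)
    frequent = {label for label, n in counts.items() if n / total >= min_share}
    return [v if v in frequent else rare_label for v in values]


def one_hot(values):
    """Encode a categorical column as one-hot rows; columns follow sorted label order."""
    labels = sorted(set(values))
    index = {label: i for i, label in enumerate(labels)}
    rows = []
    for v in values:
        row = [0] * len(labels)
        row[index[v]] = 1
        rows.append(row)
    return labels, rows


# Hypothetical toy column, invented for illustration only.
city = ["Paris", "Paris", "Lima", None, "Paris", "Lima", "Oslo", "Paris", "Lima", "Paris"]
cleaned = group_rare_labels(impute_missing(city), min_share=0.2)
labels, encoded = one_hot(cleaned)
```

Here the single missing value and the single "Oslo" entry each fall below the 20% threshold, so both end up in the "Rare" bucket and the encoded column has three dimensions instead of four. The threshold is a judgment call that depends on the data set.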

A REVIEW OF FEATURE EXTRACTION METHODS ON MACHINE LEARNING

Journal of Information System and Technology Management ◽

10.35631/jistm.622005 ◽

2021 ◽

Vol 6 (22) ◽

pp. 51-59

Author(s):

Mustazzihim Suhaidi ◽

Rabiah Abdul Kadir ◽

Sabrina Tiun

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Feature Selection ◽

Input Data ◽

Feature Vector ◽

Learning Algorithm ◽

Extraction Methods ◽

Machine Learning Algorithm ◽

Learning Tasks ◽

Low Dimensional

Extracting features from input data is vital for successful classification and machine learning tasks. Classification is the process of assigning an object to one of a set of predefined categories. Many different feature selection and feature extraction methods exist and are widely used. Feature extraction is a transformation of large input data into a low-dimensional feature vector, which serves as input to a classification or machine learning algorithm. The task of feature extraction poses major challenges, which are discussed in this paper. The main challenge is to learn and extract knowledge from text datasets in order to make correct decisions. The objective of this paper is to give an overview of methods used in feature extraction for various applications, using a dataset containing a collection of texts taken from social media.
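One common way to turn text into a fixed-size, low-dimensional feature vector is the hashing trick. The sketch below is an illustration only and is not claimed to be a method surveyed in the paper: the tokenizer, the choice of MD5 as a stable hash, the `n_features` value, and the sample sentence are all assumptions.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def hashed_features(text, n_features=16):
    """Map a text to a fixed-length count vector using the hashing trick.

    Each token is hashed to one of n_features buckets and the bucket count is
    incremented. Distinct tokens may collide in the same bucket; that is the
    price paid for a fixed, low dimension.
    """
    vec = [0] * n_features
    for tok in tokenize(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_features
        vec[bucket] += 1
    return vec


# Hypothetical social-media-style sentence, invented for illustration.
vector = hashed_features("Great phone, great battery!")
```

The resulting vector always has `n_features` entries regardless of vocabulary size, so it can be passed directly to a classifier. A stable hash is used instead of Python's built-in `hash`, which is randomized per process for strings.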

Feature extraction and prediction of Dengue Outbreaks

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit206544 ◽

2020 ◽

pp. 216-222

Author(s):

Kunal Parikh ◽

Tanvi Makadia ◽

Harshil Patel

Keyword(s):

Public Health ◽

Machine Learning ◽

Developing Countries ◽

Feature Extraction ◽

Predictive Analytics ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Health Concerns ◽

The World ◽

Dengue Outbreaks

Dengue is one of the biggest health concerns in India and in many other developing countries, and many people have lost their lives to it. Every year, approximately 390 million dengue infections occur around the world, of which about 500,000 are severe and about 25,000 result in death. Many factors can contribute to dengue, including temperature, humidity, precipitation, and inadequate public health. In this paper, we propose a method for performing predictive analytics on a dengue dataset using k-nearest neighbors (KNN), a machine learning algorithm. This analysis would help predict future cases and could save many lives.
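For readers unfamiliar with KNN, the following is a minimal sketch of k-nearest-neighbor classification in plain Python. It does not reproduce the authors' method or data: the two features (temperature and humidity), the risk labels, and every number in the toy training set are hypothetical values invented for this example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for query by majority vote among its k nearest neighbors.

    Distances are Euclidean. In practice features should be scaled first so
    that no single feature dominates the distance.
    """
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical (temperature in C, relative humidity in %) readings with
# made-up risk labels; not real epidemiological data.
train_X = [(30, 85), (31, 90), (29, 80), (20, 40), (18, 35), (22, 45)]
train_y = ["high", "high", "high", "low", "low", "low"]

warm_humid = knn_predict(train_X, train_y, (28, 78))
cool_dry = knn_predict(train_X, train_y, (19, 38))
```

With `k=3`, each query is assigned the label held by the majority of its three closest training points. An odd `k` avoids tied votes when there are two classes.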

Big Data for Health Care Analytics using Extreme Machine Learning Based on Map Reduce

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.c5808.029320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 2758-2762

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Storage ◽

Clinical Data ◽

Disease Risk ◽

Learning Algorithm ◽

Information Storage ◽

Support Vector ◽

Machine Learning Algorithm ◽

Data Set

Large volumes of data are collected and stored across many fields; this is referred to as big data. In healthcare, big data comprises huge amounts of clinical data from patient records, maintained in Electronic Health Records (EHR). More than 80% of clinical data is in unstructured format and is stored in hundreds of forms. The challenge in data storage and analysis is handling large datasets efficiently and scalably. The Hadoop MapReduce framework can store and process any kind of data quickly. It is not solely a storage system but also a platform for both information storage and processing, and it is scalable and fault-tolerant. Prediction on the data sets is handled by a machine learning algorithm. This work focuses on the Extreme Machine Learning algorithm (ELM), which is used for disease risk prediction by combining ELM with a Cuckoo Search optimization-based Support Vector Machine (CS-SVM). The proposed work also considers the scalability and accuracy of big data models; the proposed algorithm handles the computing workload well and achieves good results in both veracity and efficiency.

Review on Application of Machine learning Algorithm for Data Science

2018 3rd International Conference on Inventive Computation Technologies (ICICT) ◽

10.1109/icict43934.2018.9034256 ◽

2018 ◽

Author(s):

Kushal Rashmikant Dalal

Keyword(s):

Machine Learning ◽

Data Science ◽

Learning Algorithm ◽

Machine Learning Algorithm

A SYNTHETIC DATA SET OF 3D OOCYTE IMAGES AND MACHINE LEARNING ALGORITHM AS A MODEL TO ASSESS THE REPRODUCTIVE POTENTIAL OF OOCYTES

Fertility and Sterility ◽

10.1016/j.fertnstert.2020.08.424 ◽

2020 ◽

Vol 114 (3) ◽

pp. e145

Author(s):

Gerard Letterie ◽

Nathan Kundtz

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Reproductive Potential ◽

Synthetic Data ◽

Machine Learning Algorithm ◽

Data Set

Plural marking patterns of nouns and their associates in the world’s languages

Studies in Language ◽

10.1075/sl.16001.che ◽

2020 ◽

Vol 44 (1) ◽

pp. 231-269

Author(s):

Rong Chen

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Data Set ◽

Component Structure ◽

Universal Distribution ◽

Two Component ◽

The World ◽

Plural Marking

Plural marking reaches most corners of languages. When a noun occurs with another linguistic element, which is called an associate in this paper, plural marking on the two-component structure has four logically possible patterns: doubly unmarked, noun-marked, associate-marked, and doubly marked. These four patterns are not distributed homogeneously across the world's languages, because they are shaped by two competing motivations: iconicity and economy. Some patterns are preferred over others, and this preference is consistently found in languages across the world. In other words, there exists a universal distribution of the four plural marking patterns. Furthermore, holding the view that plural marking on associates expresses plurality of nouns, I propose a hypothetical universal which uses the number of pluralized associates to predict plural marking on nouns. A data set collected from a sample of 100 languages is used to test the hypothetical universal, employing the machine learning algorithm logistic regression.

Interpretation of Neural Networks Is Fragile

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013681 ◽

2019 ◽

Vol 33 ◽

pp. 3681-3688 ◽

Cited By ~ 20

Author(s):

Amirata Ghorbani ◽

Abubakar Abid ◽

James Zou

Keyword(s):

Neural Network ◽

Machine Learning ◽

Neural Networks ◽

Input Data ◽

Learning Algorithm ◽

Hessian Matrix ◽

Machine Learning Algorithm ◽

Feature Importance ◽

Adversarial Attack ◽

Measurement Biases

In order for machine learning to be trusted in many applications, it is critical to be able to reliably explain why the machine learning algorithm makes certain predictions. For this reason, a variety of methods have been developed recently to interpret neural network predictions by providing, for example, feature importance maps. For both scientific robustness and security reasons, it is important to know to what extent the interpretations can be altered by small systematic perturbations to the input data, which might be generated by adversaries or by measurement biases. In this paper, we demonstrate how to generate adversarial perturbations that produce perceptively indistinguishable inputs that are assigned the same predicted label, yet have very different interpretations. We systematically characterize the robustness of interpretations generated by several widely used feature importance interpretation methods (feature importance maps, integrated gradients, and DeepLIFT) on ImageNet and CIFAR-10. In all cases, our experiments show that systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly susceptible to adversarial attack. Our analysis of the geometry of the Hessian matrix gives insight into why robustness is a general challenge to current interpretation approaches.

A Surveillance on Machine Learning Algorithms and Its Applications

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9064 ◽

2020 ◽

Vol 17 (9) ◽

pp. 4294-4298

Author(s):

B. R. Sunil Kumar ◽

B. S. Siddhartha ◽

S. N. Shwetha ◽

K. Arpitha

Keyword(s):

Machine Learning ◽

Health Care ◽

Sentiment Analysis ◽

Learning Algorithm ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Machine Learning Algorithm ◽

Data Set ◽

Pros And Cons ◽

Primary Advantage

This paper examines distinct machine learning algorithms and explores their features. The primary advantage of machine learning is that an algorithm can make predictions automatically by learning what to do with information. The paper presents the concept of machine learning and its algorithms, which can be used for different applications such as health care, sentiment analysis, and many more. Programmers are sometimes unsure which algorithm to apply to their applications. This paper offers guidance on choosing an algorithm based on how accurately it fits. Given the collected data, an algorithm can be selected according to its pros and cons. Using the data set, a base model is developed, trained, and tested. The trained model is then ready for prediction and can be deployed where feasible.

Image Classification using CNN and Machine Learning

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195298 ◽

2019 ◽

pp. 575-580

Author(s):

G. Keerthi Devipriya ◽

E. Chandana ◽

B. Prathyusha ◽

T. Seshu Chakravarthy

Keyword(s):

Machine Learning ◽

Image Classification ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Data Set ◽

Training Models ◽

Classification Of Images ◽

Respective Category

In this paper we are interested in image classification and recognition. We evaluate the performance of training models using a classifier algorithm and an API that contains a set of images, comparing an uploaded image with the images available in the chosen data set. Once its category has been identified, the image is placed in that category. To classify images, we use a machine learning algorithm that compares and places the images.

Feature Extraction for Emotion Recognition in Speech with Machine Learning Algorithm

International Journal of Advanced Trends in Computer Science and Engineering ◽

10.30534/ijatcse/2020/116942020 ◽

2020 ◽

Vol 9 (4) ◽

pp. 4998-5002

Author(s):

Aishwarya R.

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Emotion Recognition ◽

Learning Algorithm ◽

Machine Learning Algorithm
