Survey on Preprocessing Techniques for Big Data Projects

2021 ◽  
Vol 7 (1) ◽  
pp. 14
Author(s):  
Ignacio D. Lopez-Miguel

In the era of big data, vast amounts of data are being produced. This raises two main issues when trying to discover knowledge from these data: much of the information is not relevant to the problem we want to solve, and the data contain many imperfections and errors. Therefore, preprocessing these data is a key step before applying any kind of learning algorithm. Reducing the number of features to a relevant subset (feature selection) and reducing the possible values of continuous variables (discretisation) are two of the main preprocessing techniques. This paper reviews different methods for completing these two steps, focusing on the big data context and giving examples of projects where they have been applied.
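As a rough illustration of the two preprocessing steps named in this abstract, the sketch below applies a variance-threshold filter (one simple form of feature selection) and equal-width binning (one simple form of discretisation) to a toy table. Both techniques, the function names `select_by_variance` and `equal_width_bins`, and the sample data are assumptions chosen for illustration; they are not taken from the surveyed paper.

```python
from statistics import pvariance


def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no ordering information: use one bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def select_by_variance(columns, threshold):
    """Return the names of features whose population variance exceeds threshold."""
    return [name for name, col in columns.items() if pvariance(col) > threshold]


# Hypothetical toy data: one informative column and one constant column.
data = {
    "age": [23.0, 35.0, 47.0, 59.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
}

kept = select_by_variance(data, threshold=0.0)      # drops the constant column
age_bins = equal_width_bins(data["age"], n_bins=3)  # three equal-width intervals
```

A variance filter ignores the target variable entirely, so in practice it is usually only a first pass before supervised selection methods.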

2020 ◽  
Vol 39 (6) ◽  
pp. 8867-8875
Author(s):  
Yile Wang ◽  
Dashuai Zeng

Based on big data, this paper studies the influence of novel coronavirus pneumonia on the development of the sports industry. When selecting typical economic indicators that reflect the development trend of the sports industry, the volume of industrial big data is found to be huge, but the information it conveys is sparse and complex. These large economic data sets must therefore be processed in order to determine the impact of novel coronavirus pneumonia on the development of the sports industry. This paper studies feature selection algorithms for big data samples, so as to select, from the many economic indicators of the sports industry, the typical ones that reflect its development trend. A deep learning algorithm based on feature selection for big data is proposed. First, a feature selection framework for big data is constructed; data fusion and deep learning are then carried out. Experiments show that the algorithm can resolve the contradiction between large data volume and poor information. The method is forward-looking to a certain degree and has some reference value for assessing information on the development trend of the sports industry.


We are in the information age, collecting huge volumes of data from diverse sources in structured, unstructured and semi-structured form, ranging from petabytes to exabytes. Data is an asset, as valuable knowledge and information are hidden in such massive volumes. Data analytics is required to gain deeper insights and identify fine-grained patterns, so as to make accurate predictions that improve decision making. Extracting knowledge from data is the task of data analytics, and machine learning forms its core. The increase in the dimensionality of data, both in the number of tuples and in the number of features, poses several challenges to machine learning algorithms. Preprocessing is carried out as a prior step to machine learning; feature selection is one such preprocessing step, reducing the dimensionality of the data by removing irrelevant features and thereby improving the efficiency and accuracy of a machine learning algorithm. In this paper we study various feature selection mechanisms and analyze whether they can be adopted for sentiment analysis of big data.
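To make the idea of feature selection for sentiment analysis concrete, the following sketch scores each term by the chi-square statistic of its 2x2 contingency table (term present or absent versus positive or negative label) and keeps the highest-scoring terms. The chi-square filter is a commonly used choice assumed here for illustration; the abstract does not say which mechanisms the paper studies, and the function name `chi_square_scores` and the four toy documents are hypothetical.

```python
def chi_square_scores(documents, labels):
    """Score each term by the chi-square statistic of its 2x2 table
    (term present/absent versus positive/negative label)."""
    n = len(documents)
    doc_terms = [set(doc.split()) for doc in documents]
    vocabulary = set().union(*doc_terms)
    n_pos = sum(1 for label in labels if label == 1)
    scores = {}
    for term in vocabulary:
        a = sum(1 for t, y in zip(doc_terms, labels) if term in t and y == 1)
        b = sum(1 for t, y in zip(doc_terms, labels) if term in t and y == 0)
        c = n_pos - a              # positive documents without the term
        d = (n - n_pos) - b        # negative documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A zero margin means the term (or a class) never varies: score 0.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Hypothetical toy corpus: 1 = positive sentiment, 0 = negative sentiment.
docs = ["great movie", "great acting", "boring movie", "boring plot"]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)
# Keep the two highest-scoring terms, breaking ties alphabetically.
top_terms = sorted(scores, key=lambda term: (-scores[term], term))[:2]
```

In this toy corpus "movie" appears equally often in both classes and scores zero, which is the behaviour a filter method relies on to discard uninformative terms.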


Author(s):  
Zuohong Xu ◽  
Zhou Zhang ◽  
Shilian Wang ◽  
Alireza Jolfaei ◽  
Ali Kashif Bashir ◽  
...  

2021 ◽  
Vol 558 ◽  
pp. 124-139
Author(s):  
D. López ◽  
S. Ramírez-Gallego ◽  
S. García ◽  
N. Xiong ◽  
F. Herrera

2020 ◽  
Vol 11 (1) ◽  
pp. 96
Author(s):  
Wen-Lan Wu ◽  
Meng-Hua Lee ◽  
Hsiu-Tao Hsu ◽  
Wen-Hsien Ho ◽  
Jing-Min Liang

Background: In this study, an automatic scoring system for the functional movement screen (FMS) was developed. Methods: Thirty healthy adults fitted with full-body inertial measurement unit sensors completed six FMS exercises. The system recorded kinematics data, and a professional athletic trainer graded each participant. To reduce the number of input variables for the predictive model, ordinal logistic regression was used for subset feature selection. The ensemble learning algorithm AdaBoost.M1 was used to construct classifiers. Accuracy and F score were used for classification model evaluation. The consistency between automatic and manual scoring was assessed using a weighted kappa statistic. Results: When all the features were used, the predictive model showed moderate to high accuracy, with kappa values ranging from fair to very good agreement. After feature selection, model accuracy decreased by about 10%, with kappa values ranging from poor to moderate agreement. Conclusions: The results indicate that higher prediction accuracy was achieved using the full feature set than using the reduced feature set.
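The agreement measure mentioned in this abstract can be sketched as follows. This is a minimal implementation of Cohen's weighted kappa with linear disagreement weights; the abstract does not state which weighting scheme the authors used, so the linear weights, the function name `weighted_kappa`, the 0 to 3 score scale, and the example scores are all assumptions made for illustration.

```python
def weighted_kappa(rater_a, rater_b, categories):
    """Cohen's weighted kappa with linear disagreement weights |i - j| / (k - 1)."""
    k = len(categories)
    n = len(rater_a)
    index = {category: i for i, category in enumerate(categories)}

    # Observed joint proportions of (rater_a score, rater_b score).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on an assumed 0-3 ordinal scale.
manual = [3, 2, 2, 1]     # trainer's grades
automatic = [3, 2, 1, 1]  # system's grades
kappa = weighted_kappa(manual, automatic, categories=[0, 1, 2, 3])
```

Weighting penalises a disagreement of one score level less than a disagreement of two or three levels, which suits ordinal grades better than unweighted kappa.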


Methods ◽  
2016 ◽  
Vol 111 ◽  
pp. 21-31 ◽  
Author(s):  
Lipo Wang ◽  
Yaoli Wang ◽  
Qing Chang
