The Implication of Statistical Analysis and Feature Engineering for Model Building Using Machine Learning Algorithms

2019 ◽  
Vol 10 (03) ◽  
pp. 01-11
Author(s):  
Swayanshu Shanti Pragnya ◽  
Shashwat Priyadarshi
2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Muhammad Waqar ◽  
Hassan Dawood ◽  
Hussain Dawood ◽  
Nadeem Majeed ◽  
Ameen Banjar ◽  
...  

Cardiac disease treatment increasingly involves the acquisition and analysis of vast quantities of digital cardiac data, which can be put to many beneficial uses. Utilizing these data well becomes all the more important for critical conditions such as heart attack, where the patient's life is often at stake. Machine learning and deep learning are two prominent techniques for turning raw data into useful insight. Some of the biggest problems arising from their use are massive resource consumption, extensive data preprocessing, the need for feature engineering, and ensuring reliable classification results. The proposed research work presents a cost-effective solution for predicting heart attacks with high accuracy and reliability. It uses a UCI dataset to predict heart attacks via various machine learning algorithms without any feature engineering. Moreover, the given dataset has an unequal distribution of positive and negative classes, which can degrade performance, so the proposed work uses the synthetic minority oversampling technique (SMOTE) to handle this class imbalance. The proposed system discards the need for feature engineering in classifying the given dataset, yielding an efficient solution, since feature engineering often proves to be a costly process. The results show that, among all machine learning algorithms, a properly tuned SMOTE-based artificial neural network outperformed all other models and many existing systems. The high reliability of the proposed system ensures that it can be used effectively to predict heart attacks.
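A minimal sketch of the pipeline this abstract describes, assuming the common scikit-learn/imbalanced-learn stack; the dataset here is a random placeholder standing in for the UCI heart data, and the layer sizes are illustrative, not the authors' tuned configuration:

```python
# Sketch: SMOTE oversampling of the training split, then a small neural
# network on the raw (unengineered) attributes. Placeholder data below.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Placeholder standing in for the UCI heart data: 303 samples, 13 attributes.
X = np.random.rand(303, 13)
y = np.random.randint(0, 2, 303)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Balance only the training split so the test set keeps its natural distribution.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Scale the raw attributes directly; no hand-crafted features are derived.
scaler = StandardScaler().fit(X_train_bal)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
clf.fit(scaler.transform(X_train_bal), y_train_bal)

print(classification_report(y_test, clf.predict(scaler.transform(X_test))))
```

Applying SMOTE after the train/test split, as above, avoids leaking synthetic points into the evaluation set.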


In a large distributed virtualized environment, predicting the alerting source from its text is a daunting task. This paper explores the option of using machine learning algorithms to solve this problem. Unfortunately, our training dataset is highly imbalanced: 96% of the alerting data is reported by 24% of the alerting sources. This is the expected dataset in any live distributed virtualized environment, where newer device versions produce relatively few alerts compared to older devices. Classification with such an imbalanced dataset presents a different set of challenges than binary classification. This type of skewed data distribution makes conventional machine learning less effective, especially when predicting alerts for the minority device types. Our challenge is to build a robust model that can cope with this imbalanced dataset and achieve a relatively high level of prediction accuracy. This research work started with traditional regression and classification algorithms using a bag-of-words model. Then word2vec and doc2vec models were used to represent the words in vector form, which preserves the semantic meaning of a sentence, so alerting texts with similar messages receive similar vector representations. This vectorized alerting text was used with logistic regression for model building. This yielded better accuracy, but the model is relatively complex and demands more computational resources. Finally, a simple neural network was applied to this multi-class text classification problem using the Keras and TensorFlow libraries. A simple two-layer neural network yielded 99% accuracy, even though our training dataset was not balanced. This paper presents a qualitative evaluation of the different machine learning algorithms and their respective results. Finally, the two-layer neural network is selected as the final solution, since it requires relatively few resources and little time while achieving better accuracy.
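A sketch of the final two-layer Keras model described above, assuming each alert text has already been converted to a fixed-length vector (e.g., via doc2vec); the class count, vector dimension, and layer width are illustrative assumptions, not the paper's exact configuration:

```python
# Two-layer neural network for multi-class alert-source classification
# over doc2vec-style embeddings. Dimensions below are assumptions.
import numpy as np
import tensorflow as tf

NUM_CLASSES = 24   # assumed number of alerting source types
VEC_DIM = 300      # assumed doc2vec vector size

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(VEC_DIM,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder data: X_vec holds per-alert embeddings, y integer class labels.
X_vec = np.random.rand(1000, VEC_DIM).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=1000)

model.fit(X_vec, y, epochs=10, batch_size=32, validation_split=0.2)
```

With only two dense layers, the model trains quickly on CPU, which matches the paper's point about lower resource and time demands relative to the logistic regression pipeline.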


Electronics ◽  
2019 ◽  
Vol 8 (12) ◽  
pp. 1461 ◽  
Author(s):  
Taeheum Cho ◽  
Unang Sunarya ◽  
Minsoo Yeo ◽  
Bosun Hwang ◽  
Yong Seo Koo ◽  
...  

Sleep scoring is the first step in diagnosing sleep disorders. A variety of chronic diseases related to sleep disorders could be identified using sleep-state estimation. This paper presents an end-to-end deep learning architecture using wrist actigraphy, called Deep-ACTINet, for automatic sleep-wake detection that uses only noise-canceled raw activity signals recorded during sleep, without any feature engineering. As a benchmark test, the proposed Deep-ACTINet is compared with two conventional fixed-model-based sleep-wake scoring algorithms and four feature-engineering-based machine learning algorithms. The datasets were recorded from 10 subjects using three-axis accelerometer wristband sensors for eight hours in bed. The sleep recordings were analyzed using Deep-ACTINet and the conventional approaches, and the suggested end-to-end deep learning model attained the highest accuracy of 89.65%, recall of 92.99%, and precision of 92.09% on average. These values were approximately 4.74% and 4.05% higher than those of the traditional model-based and feature-based machine learning algorithms, respectively. In addition, the neuron outputs of Deep-ACTINet contained the most significant information for separating the asleep and awake states, as demonstrated by their high correlations with conventionally significant features. Deep-ACTINet was designed as a general model and thus has the potential to replace the actigraphy algorithms currently equipped in wristband wearable devices.
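To make "end-to-end, no feature engineering" concrete, here is a generic 1D-CNN sleep/wake classifier over raw tri-axial actigraphy windows. This is not the published Deep-ACTINet architecture; the sampling rate, window length, and layer sizes are assumptions for the sketch:

```python
# Generic end-to-end 1D-CNN over raw tri-axial accelerometer windows.
# The network consumes raw samples directly; no hand-crafted features.
import tensorflow as tf

WINDOW = 30 * 60   # assumed: 30 s windows at 60 Hz
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(16, 8, activation="relu", input_shape=(WINDOW, 3)),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(32, 8, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 1 = awake, 0 = asleep
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
model.summary()
```

The convolutional filters play the role that hand-crafted activity-count features play in the traditional pipelines the paper benchmarks against.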


2021 ◽  
Author(s):  
Yew Kee Wong

In the information era, enormous amounts of data have become available to decision-makers. Big data refers to datasets that are not only large, but also high in variety and velocity, which makes them difficult to handle using traditional tools and techniques. Due to the rapid growth of such data, solutions need to be studied and provided in order to handle these datasets and extract value and knowledge from them. Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. Such minimal human intervention can be provided using big data analytics, which is the application of advanced analytics techniques to big data. This paper aims to analyse some of the machine learning algorithms and methods that can be applied to big data analysis, as well as the opportunities offered by the application of big data analytics in various decision-making domains.


2021 ◽  
Author(s):  
Sunil Sazawal ◽  
Kelli Ryckman ◽  
Sayan Das ◽  
Rasheda Khanam ◽  
Imran Nisar ◽  
...  

Abstract
Background: Babies born early and/or small for gestational age in low- and middle-income countries (LMIC) contribute substantially to global neonatal and infant mortality. Tracking this metric is critical at a population level for informed policy, advocacy, resource allocation, and program evaluation, and at an individual level for targeted care. Early prenatal ultrasound is not available in these settings, so gestational age (GA) is estimated using newborn assessment, LMP recall, and birth weight, which are unreliable. Algorithms developed in high-resource settings using metabolic screening data have provided GA estimates within 1-2 weeks of ultrasound-based GA. We sought to leverage machine learning algorithms to improve the accuracy and applicability of this approach in LMIC settings.
Methods: This study uses data from the AMANHI-ACT prospective pregnancy cohorts in Asia and Africa, where early-pregnancy ultrasound-estimated GA and birth weight are available, along with metabolite screening data for a subset of 1318 newborns. We utilized this opportunity to develop machine learning (ML) algorithms. A Random Forest Regressor was used, with the data randomly split into model-building and model-testing datasets. Mean absolute error (MAE) and root mean square error (RMSE) were used to evaluate performance, with bootstrap procedures used to estimate their confidence intervals (CI). For preterm birth identification, ROC analysis with bootstrap and exact estimation of the CI for the area under the curve (AUC) was performed.
Results: Overall, the model estimated GA with an MAE of 5.8 days (95% CI 5.6-6.3), similar to its performance among SGA infants (MAE 6.3 days, 95% CI 5.6-7.0). GA was correctly estimated to within 1 week for 70.9% of newborns (95% CI 67.9-73.7). For preterm birth classification, the AUC in ROC analysis was 92.6% (95% CI 87.5-96.1; p<0.001). This model performed better than the Iowa regression, with an AUC difference of 2.8% (95% CI 0.9-11.8%; p=0.021).
Conclusions: Machine learning algorithms and models applied to metabolomic gestational age dating offer a ladder of opportunity for providing accurate population-level gestational age estimates in LMIC settings. These findings also point to opportunities to investigate region-specific models, more focused and feasible analyte models, and the broad untargeted metabolome.
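A hedged sketch of the evaluation loop this abstract describes: a random forest regressor predicting GA (in days) from metabolite levels, scored by MAE and RMSE with a simple bootstrap CI over the test set. The data below are random placeholders; the feature count and tree count are assumptions:

```python
# Random forest GA estimation with MAE/RMSE and a bootstrap 95% CI for MAE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

X = np.random.rand(1318, 40)            # placeholder metabolite screen data
y = np.random.uniform(230, 290, 1318)   # placeholder ultrasound GA in days

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Bootstrap the test set to get a confidence interval for the MAE.
rng = np.random.default_rng(0)
maes = []
for _ in range(1000):
    idx = rng.integers(0, len(y_te), len(y_te))
    maes.append(mean_absolute_error(y_te[idx], pred[idx]))

print("MAE %.1f days (95%% CI %.1f-%.1f)" %
      (mean_absolute_error(y_te, pred),
       np.percentile(maes, 2.5), np.percentile(maes, 97.5)))
print("RMSE %.1f days" % np.sqrt(mean_squared_error(y_te, pred)))
```

The same bootstrap resampling pattern extends to the AUC for preterm classification by resampling predicted probabilities and labels together.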


Author(s):  
Supun Nakandala ◽  
Marta M. Jankowska ◽  
Fatima Tuz-Zahra ◽  
John Bellettiere ◽  
Jordan A. Carlson ◽  
...  

Background: Machine learning has been used for classification of physical behavior bouts from hip-worn accelerometers; however, this research has been limited due to the challenges of directly observing and coding human behavior “in the wild.” Deep learning algorithms, such as convolutional neural networks (CNNs), may offer better representation of data than other machine learning algorithms without the need for engineered features and may be better suited to dealing with free-living data. The purpose of this study was to develop a modeling pipeline for evaluation of a CNN model on a free-living data set and compare CNN inputs and results with the commonly used machine learning random forest and logistic regression algorithms. Method: Twenty-eight free-living women wore an ActiGraph GT3X+ accelerometer on their right hip for 7 days. A concurrently worn thigh-mounted activPAL device captured ground truth activity labels. The authors evaluated logistic regression, random forest, and CNN models for classifying sitting, standing, and stepping bouts. The authors also assessed the benefit of performing feature engineering for this task. Results: The CNN classifier performed best (average balanced accuracy for bout classification of sitting, standing, and stepping was 84%) compared with the other methods (56% for logistic regression and 76% for random forest), even without performing any feature engineering. Conclusion: Using the recent advancements in deep neural networks, the authors showed that a CNN model can outperform other methods even without feature engineering. This has important implications for both the model’s ability to deal with the complexity of free-living data and its potential transferability to new populations.
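The baseline comparison the authors describe can be framed as below; the feature matrix and class labels are placeholders, and the CNN baseline would instead consume raw accelerometer windows as in the earlier actigraphy sketch:

```python
# Balanced accuracy comparison of the two feature-based baselines
# (logistic regression vs. random forest) on engineered window features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(5000, 20)        # placeholder engineered features per window
y = np.random.randint(0, 3, 5000)   # 0 = sitting, 1 = standing, 2 = stepping

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("random forest", RandomForestClassifier(n_estimators=300))]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: balanced accuracy {scores.mean():.2f}")
```

For free-living data like this study's, splits should be grouped by participant (e.g., scikit-learn's GroupKFold) so that windows from the same person never appear in both the training and test folds.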

