real dataset
Recently Published Documents

TOTAL DOCUMENTS: 97 (FIVE YEARS: 58)
H-INDEX: 6 (FIVE YEARS: 2)

2022 · Vol 22 (1)
Author(s): Usha Govindarajulu, Sandeep Bedi

Abstract
Background: The purpose of this research was to see how the k-means algorithm can be applied to survival analysis with single events per subject to define groups, which can then be modeled in a shared frailty model to capture unmeasured confounding not already explained by the covariates in the model.
Methods: For this purpose we developed our own k-means survival grouping algorithm. We compared a shared frailty model with a regular grouping variable against a shared frailty model with a k-means grouping variable, in simulations as well as in an analysis of a real dataset.
Results: Both the simulations and the real data showed that our k-means clustering is no different from the typical frailty clustering, even under varied case rates and censoring. Our k-means algorithm thus appears to be a trustworthy mechanism for creating groups when no grouping variable is available to include as a frailty term in a survival model, or for comparison against an existing grouping variable in the data.
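
A minimal sketch of the grouping idea, assuming a tabular survival dataset with hypothetical column names (time, event, and a few covariates). The paper fits a shared frailty model; since frailty terms are more commonly fit in R (e.g. coxph with a frailty() term), a stratified Cox fit from lifelines stands in here just to show where the k-means-derived group enters.

```python
# Sketch only: derive a grouping variable with k-means, then account for it in a
# survival model. Column names and cluster count are hypothetical; the stratified
# Cox fit is a stand-in for the paper's shared frailty model.
import pandas as pd
from sklearn.cluster import KMeans
from lifelines import CoxPHFitter

df = pd.read_csv("survival_data.csv")           # hypothetical input file
covariates = ["age", "treatment", "biomarker"]  # hypothetical covariates

# Step 1: derive a grouping variable from the covariates with k-means.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
df["kmeans_group"] = km.fit_predict(df[covariates])

# Step 2: fit a Cox model that accounts for the derived groups.
cph = CoxPHFitter()
cph.fit(df[covariates + ["time", "event", "kmeans_group"]],
        duration_col="time", event_col="event", strata=["kmeans_group"])
cph.print_summary()
```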


2021
Author(s): Thien Pham, Loi Truong, Mao Nguyen, Akhil Garg, Liang Gao, ...

State-of-Health (SOH) prediction of a lithium-ion battery is essential for preventing malfunction and maintaining efficient working behavior of the battery. In practice, this task is difficult due to the high level of noise and complexity. Many machine learning methods, especially deep learning approaches, have recently been proposed to address this problem. However, there is much room for improvement because battery data are highly non-linear and depend strongly on multiple parameters such as resistance, voltage, and the external conditions the battery is subjected to. In this paper, we propose an approach known as bidirectional sequence-in-sequence, which exploits the dependency of nested cycle-wise and channel-wise battery data. In experiments on a real dataset acquired from NASA, our method reduces the prediction error by up to approximately 32.5%.
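
A rough sketch of a bidirectional recurrent SOH regressor on (cycle, channel) sequences, written in PyTorch. It illustrates only the bidirectional-sequence idea; the paper's nested sequence-in-sequence architecture is not reproduced, and all layer sizes and tensor shapes are illustrative.

```python
# Hedged sketch: a bidirectional LSTM that maps a battery's per-cycle channel
# measurements to a single SOH estimate. Sizes are placeholders.
import torch
import torch.nn as nn

class BiLSTMSOH(nn.Module):
    def __init__(self, n_channels: int = 4, hidden: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(input_size=n_channels, hidden_size=hidden,
                           num_layers=2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)    # 2*hidden because of bidirectionality

    def forward(self, x):                       # x: (batch, cycles, channels)
        out, _ = self.rnn(x)
        return self.head(out[:, -1, :])         # SOH estimate from the last cycle step

model = BiLSTMSOH()
dummy = torch.randn(8, 100, 4)                  # 8 batteries, 100 cycles, 4 channels
print(model(dummy).shape)                       # torch.Size([8, 1])
```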


2021 · Vol 13 (24) · pp. 13834
Author(s): Guk-Jin Son, Dong-Hoon Kwak, Mi-Kyung Park, Young-Duk Kim, Hee-Chul Jung

Supervised deep learning-based foreign object detection algorithms are tedious, costly, and time-consuming to develop because they usually require a large amount of training data and annotations. These disadvantages often make them unsuitable for food quality evaluation and food manufacturing processes. Nevertheless, deep learning-based foreign object detection is an effective way to overcome the limitations of the conventional detection methods used in food inspection: color sorter machines, for example, cannot detect foreign objects whose color is similar to the food, and their performance degrades easily under changes in illuminance. We therefore use a deep learning-based foreign object detection algorithm (model). In this paper, we present a synthetic method to efficiently acquire a deep learning training dataset that can be used for food quality evaluation and food manufacturing processes. Moreover, we perform data augmentation using color jitter on the synthetic dataset and show that this approach significantly improves the illumination invariance of models trained on synthetic data. The F1-score of the model trained on the synthetic almond dataset at 360 lux illumination intensity reached 0.82, similar to the F1-score of the model trained on the real dataset. Moreover, under changing illumination, the model trained on the real dataset combined with the synthetic dataset achieved a better F1-score than the model trained on the real dataset alone. In addition, compared with the traditional approach of using color sorter machines to detect foreign objects, the model trained on the synthetic dataset has clear advantages in accuracy and efficiency. These results indicate that the synthetic dataset not only competes with the real dataset but also complements it.
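
A minimal sketch of the color-jitter augmentation step with torchvision, assuming the synthetic images are stored in an ImageFolder-style directory (the path and jitter ranges below are placeholders, not the paper's settings).

```python
# Hedged sketch: apply color jitter to a synthetic training set so the model sees
# varied brightness/contrast/saturation during training. Values are illustrative.
import torch
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.05),
    transforms.ToTensor(),
])

# Assumes an ImageFolder-style layout of synthetic images (hypothetical path).
train_set = datasets.ImageFolder("synthetic_almonds/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```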


2021 · Vol 134 · pp. 105366
Author(s): Benyamin Khadem, Mohammad Reza Saberi, Per Avseth

2021 · Vol 2021 · pp. 1-13
Author(s): Ahmed Ali, Ahmed Fathalla, Ahmad Salah, Mahmoud Bekhit, Esraa Eldesouky

Nowadays, ocean observation technology continues to progress, resulting in a huge increase in marine data volume and dimensionality. This volume of data provides a golden opportunity to train predictive models, since the more data are available, the better the predictive model can be. Predicting marine variables such as sea surface temperature (SST) and significant wave height (SWH) is a vital task in a variety of disciplines, including marine activities, deep-sea and marine biodiversity monitoring. Existing efforts to forecast such marine data can be classified into three classes: machine learning, deep learning, and statistical predictive models. To the best of the authors' knowledge, no study has compared the performance of these three approaches on a real dataset. This paper focuses on the prediction of two critical marine features: the SST and SWH. In this work, we implement statistical, deep learning, and machine learning models for predicting the SST and SWH on a real dataset obtained from the Korea Hydrographic and Oceanographic Agency, and we compare these three predictive approaches on four different evaluation metrics. Experimental results reveal that the deep learning model slightly outperformed the machine learning models in overall performance, and both of these approaches greatly outperformed the statistical predictive model.
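
A rough sketch of such a three-way comparison on a univariate SST series, with ARIMA as the statistical model, a random forest as the machine learning model, and a small neural network as a stand-in for the deep learning model. The file name, window length, and model settings are placeholders, not the paper's configuration.

```python
# Hedged sketch: compare a statistical, a machine learning, and a (small) neural
# forecaster on a held-out tail of an SST series. Data source is hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tsa.arima.model import ARIMA

sst = np.loadtxt("sst_daily.txt")                 # hypothetical 1-D SST series
train, test = sst[:-100], sst[-100:]

def windows(series, lag=7):
    # Turn a 1-D series into (lagged inputs, next value) supervised pairs.
    X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
    return X, series[lag:]

X_tr, y_tr = windows(train)
X_te, y_te = windows(np.concatenate([train[-7:], test]))   # y_te equals the test tail

preds = {
    "ARIMA": ARIMA(train, order=(2, 0, 1)).fit().forecast(steps=len(test)),
    "RandomForest": RandomForestRegressor(n_estimators=200).fit(X_tr, y_tr).predict(X_te),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X_tr, y_tr).predict(X_te),
}
for name, p in preds.items():
    print(name, mean_absolute_error(y_te, p), mean_squared_error(y_te, p), r2_score(y_te, p))
```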


Algorithms · 2021 · Vol 14 (11) · pp. 301
Author(s): Umberto Michelucci, Michela Sperti, Dario Piga, Francesca Venturini, Marco A. Deriu

This paper presents the intrinsic limit determination algorithm (ILD Algorithm), a novel technique to determine the best possible performance, measured in terms of the AUC (area under the ROC curve) and accuracy, that can be obtained from a specific dataset in a binary classification problem with categorical features regardless of the model used. This limit, namely, the Bayes error, is completely independent of any model used and describes an intrinsic property of the dataset. The ILD algorithm thus provides important information regarding the prediction limits of any binary classification algorithm when applied to the considered dataset. In this paper, the algorithm is described in detail, its entire mathematical framework is presented and the pseudocode is given to facilitate its implementation. Finally, an example with a real dataset is given.
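
The abstract does not spell out the computation, but for purely categorical features one natural reading of such an intrinsic limit is the empirical Bayes error: rows sharing the same feature combination cannot be distinguished by any model, so the best achievable accuracy is the majority-label fraction within each combination. The sketch below implements that interpretation, not the published ILD pseudocode.

```python
# Hedged sketch of an intrinsic accuracy limit for a binary label with categorical
# features: within each unique feature combination, no classifier can do better
# than predicting the majority label. Interpretation of the idea, not the ILD code.
import pandas as pd

def intrinsic_accuracy_limit(df: pd.DataFrame, features: list[str], label: str) -> float:
    counts = df.groupby(features)[label].value_counts().unstack(fill_value=0)
    return counts.max(axis=1).sum() / len(df)

# Tiny illustrative dataset (hypothetical).
toy = pd.DataFrame({"f1": ["a", "a", "a", "b"], "f2": [0, 0, 0, 1],
                    "y":  [1, 1, 0, 1]})
print(intrinsic_accuracy_limit(toy, ["f1", "f2"], "y"))   # 0.75: one row is unresolvable
```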


2021 · Vol 2021 · pp. 1-10
Author(s): Runzi Chen, Shuliang Zhao, Zhenzhen Tian

Multiscale analysis brings great benefits, allowing people to observe objects or problems from different perspectives, and multiscale clustering has been widely studied in various disciplines. However, most research considers only numerical datasets; there is a lack of research on clustering nominal datasets, especially when the data are non-independent and identically distributed (Non-IID). Given this situation, this paper proposes a multiscale clustering framework for Non-IID nominal data. Firstly, the benchmark-scale dataset is clustered based on a coupled metric similarity measure. Secondly, the clustering results are transformed from the benchmark scale to the target scale by two proposed algorithms, named upscaling based on single chain and downscaling based on the Lanczos kernel, respectively. Finally, experiments are performed on five public datasets and one real dataset from Hebei province, China. The results show that the method provides competitive performance while reducing computational cost.
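
A simplified stand-in for the benchmark-scale clustering step: nominal records are clustered with a plain matching (overlap) distance and average-linkage agglomerative clustering. The paper's coupled metric similarity, which also models attribute interactions, is not reproduced here, and the toy records are invented.

```python
# Hedged sketch: cluster nominal records using the fraction of mismatching attributes
# as a distance. This is a simplified overlap measure, not the coupled metric
# similarity used in the paper.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

records = np.array([["red", "small", "round"],
                    ["red", "small", "oval"],
                    ["blue", "large", "round"],
                    ["blue", "large", "oval"]])

# Encode nominal attributes as integer codes, then compute pairwise matching distances.
codes = OrdinalEncoder().fit_transform(records)
dist = pdist(codes, metric=lambda a, b: float(np.mean(a != b)))

labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print(labels)   # two clusters, split mainly by the first two attributes
```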


PeerJ · 2021 · Vol 9 · pp. e11884
Author(s): Kévin Da Silva, Nicolas Pons, Magali Berland, Florian Plaza Oñate, Mathieu Almeida, ...

Current studies are shifting from the use of single linear references to the representation of multiple genomes organised in pangenome or variation graphs. Meanwhile, in metagenomic samples, resolving strain-level abundances is a major step in microbiome studies, as associations between strain variants and phenotype are of great interest for diagnostic and therapeutic purposes. We developed StrainFLAIR with the aim of showing the feasibility of using variation graphs for indexing highly similar genomic sequences down to the strain level, and for characterizing a set of unknown sequenced genomes by querying this graph. On simulated data composed of mixtures of strains of the same bacterial species, Escherichia coli, results show that StrainFLAIR was able to distinguish and estimate the abundances of close strains, as well as to highlight the presence of a new strain close to a referenced one and to estimate its abundance. On a real dataset composed of a mix of several bacterial species and several strains of the same species, results show that, in this more complex configuration, StrainFLAIR correctly estimates the abundance of each strain. Hence, the results demonstrate how a graph representation of multiple close genomes can be used as a reference to characterize a sample at the strain level.


2021 · Vol 11 (15) · pp. 7140
Author(s): Radko Mesiar, Ayyub Sheikhi

In this work, we use a copula-based approach to select the most important features for random forest classification. The feature selection is carried out based on the associated copulas between the features. We then feed the selected features into a random forest algorithm to classify a label-valued outcome. Our algorithm enables us to select the most relevant features even when the features are not connected by a linear function; moreover, we can stop the classification once we reach the desired level of accuracy. We apply this method in a simulation study as well as to a real COVID-19 dataset and a diabetes dataset.
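
A rough sketch of that pipeline, using Kendall's tau (a rank-based dependence measure determined by the underlying copula) as a simplified proxy for the paper's copula-based association, then fitting a random forest on the top-ranked features. The dataset and column names are placeholders.

```python
# Hedged sketch: rank features by a rank-based dependence measure (Kendall's tau, a
# proxy for copula-based association, not the paper's exact criterion), keep the
# top k, and classify with a random forest. Data loading is illustrative.
import pandas as pd
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("diabetes.csv")                 # hypothetical dataset
X, y = df.drop(columns=["outcome"]), df["outcome"]

# Association of each feature with the label via Kendall's tau.
assoc = {c: abs(kendalltau(X[c], y)[0]) for c in X.columns}
top = sorted(assoc, key=assoc.get, reverse=True)[:5]

rf = RandomForestClassifier(n_estimators=300, random_state=0)
print(top, cross_val_score(rf, X[top], y, cv=5).mean())
```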


Author(s): Riccardo Ievoli, Aldo Gardini, Lucio Palazzo

Abstract
Passes are undoubtedly the most frequent events in football and other team sports. Passing networks and their structural features can be used to evaluate a team's style of play in terms of passing behavior, analyzing and quantifying interactions among players. The present paper aims to show how information retrieved from passing networks can have a relevant impact on predicting the match outcome. In particular, we focus on modeling both the goals scored by the two competing teams and the goal difference between them. With this purpose, we fit these outcomes using Bayesian hierarchical models, including both in-match and network-based covariates to cover many aspects of the offensive actions on the pitch. Furthermore, we review and compare different approaches to including covariates in models of football outcomes. The presented methodology is applied to a real dataset containing information on 125 matches of the 2016–2017 UEFA Champions League, involving 32 of the best European teams. Our results show that shots on target, corners, and certain passing-network indicators are the main determinants of the considered football outcomes.
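
As a rough illustration of the modeling strategy rather than the paper's exact specification, the sketch below fits a hierarchical Poisson model for goals with team-level attack and defence effects and one match-level covariate (shots on target), using PyMC and simulated placeholder data.

```python
# Hedged sketch: hierarchical Poisson model for goals scored, with team-level
# attack/defence effects and a match-level covariate. All data below are simulated
# placeholders, not the UEFA Champions League dataset used in the paper.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_teams, n_matches = 8, 60
home = rng.integers(0, n_teams, n_matches)
away = rng.integers(0, n_teams, n_matches)
shots_on_target = rng.poisson(5, n_matches)
goals = rng.poisson(1.4, n_matches)             # observed home goals (placeholder)

with pm.Model() as model:
    attack = pm.Normal("attack", 0.0, 1.0, shape=n_teams)
    defence = pm.Normal("defence", 0.0, 1.0, shape=n_teams)
    beta_shots = pm.Normal("beta_shots", 0.0, 1.0)
    intercept = pm.Normal("intercept", 0.0, 1.0)

    # Log goal rate combines team effects with the in-match covariate.
    log_rate = intercept + attack[home] - defence[away] + beta_shots * shots_on_target
    pm.Poisson("goals", mu=pm.math.exp(log_rate), observed=goals)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)
```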

