under sampling
Recently Published Documents

2022 ◽ Vol 22 (1) ◽ pp. 1-18
Alessio Pagani ◽ Zhuangkun Wei ◽ Ricardo Silva ◽ Weisi Guo

Infrastructure monitoring is critical for safe operations and sustainability. Like many networked systems, water distribution networks (WDNs) exhibit both graph topological structure and complex embedded flow dynamics. The resulting networked cascade dynamics are difficult to predict without extensive sensor data. However, ubiquitous sensor monitoring in underground situations is expensive, and a key challenge is to infer the contaminant dynamics from partial sparse monitoring data. Existing approaches use multi-objective optimization to find the minimum set of essential monitoring points but lack performance guarantees and a theoretical framework. Here, we first develop a novel Graph Fourier Transform (GFT) operator that compresses networked contamination dynamics and identifies the essential principal data collection points with inference performance guarantees. As such, the GFT approach provides the theoretical sampling bound. We then achieve under-sampling performance by building auto-encoder (AE) neural networks (NN) to generalize the GFT sampling process and under-sample further from the initial sampling set, allowing a very small set of data points to largely reconstruct the contamination dynamics over real and artificial WDNs. Various contamination sources are tested, and we obtain high-accuracy reconstruction using around 5%–10% of the network nodes for known contaminant sources, and 50%–75% for unknown source cases. Although this is larger than the sampling sets required by schemes for contaminant detection and source identification, it is smaller than those of current sampling schemes for contaminant data recovery. This general approach of compression and under-sampled recovery via NN can be applied to a wide range of networked infrastructures to enable efficient data sampling for digital twins.
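
The idea behind graph-spectral sampling can be illustrated with a toy sketch. It is not the GFT operator or the auto-encoder from the paper: it assumes a simple path graph (whose Laplacian eigenvectors are the DCT-II cosine modes) and a signal spanned by only the two lowest modes. The function names are hypothetical. Under those assumptions, readings at two nodes are enough to recover the signal at every node.

```python
import math


def path_graph_gft_basis(n_nodes):
    """Normalized Laplacian eigenvectors of a path graph (the DCT-II modes)."""
    basis = []
    for k in range(n_nodes):
        vec = [math.cos(math.pi * k * (i + 0.5) / n_nodes) for i in range(n_nodes)]
        norm = math.sqrt(sum(v * v for v in vec))
        basis.append([v / norm for v in vec])
    return basis  # basis[k][i] is mode k evaluated at node i


def reconstruct_from_two_samples(basis, sample_nodes, sample_values):
    """Recover a signal spanned by modes 0 and 1 from readings at two nodes."""
    (i, j), (xi, xj) = sample_nodes, sample_values
    a, b = basis[0][i], basis[1][i]
    c, d = basis[0][j], basis[1][j]
    det = a * d - b * c  # non-zero when mode 1 differs at the two nodes
    coeff0 = (xi * d - b * xj) / det  # Cramer's rule for the 2x2 system
    coeff1 = (a * xj - c * xi) / det
    n_nodes = len(basis[0])
    return [coeff0 * basis[0][m] + coeff1 * basis[1][m] for m in range(n_nodes)]


# Toy band-limited signal on a 6-node path graph (coefficients are arbitrary).
basis = path_graph_gft_basis(6)
signal = [2.0 * basis[0][m] - 1.5 * basis[1][m] for m in range(6)]

# Sample only the two end nodes, then rebuild all six values.
recovered = reconstruct_from_two_samples(basis, (0, 5), (signal[0], signal[5]))
```

For a real network the basis would come from a numerical eigendecomposition of the graph Laplacian, and the choice of sampling nodes matters; both are outside this sketch.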

2022 ◽ Vol 12 (1)
Belal Alsinglawi ◽ Osama Alshari ◽ Mohammed Alorjani ◽ Omar Mubin ◽ Fady Alnajjar ◽ et al.

This work introduces a predictive Length of Stay (LOS) framework for lung cancer patients using machine learning (ML) models. The framework is proposed to deal with imbalanced datasets for classification-based approaches using electronic healthcare records (EHR). We utilized supervised ML methods to predict lung cancer inpatients' LOS during ICU hospitalization using the MIMIC-III dataset. The Random Forest (RF) model outperformed the other models and achieved predicted results during the three framework phases. With clinical significance feature selection, the over-sampling methods (SMOTE and ADASYN) achieved the highest AUC results (98% with 95% CI: 95.3–100%, and 100%, respectively). The combination of over-sampling and under-sampling achieved the second-highest AUC results (98% with 95% CI: 95.3–100% for SMOTE-Tomek, and 97% with 95% CI: 93.7–100% for SMOTE-ENN). The under-sampling methods (ENN and Tomek Links) both reported the lowest AUC results (50% with 95% CI: 40.2–59.8%). Using an explainable ML technique called SHAP, we explained the outcome of the predictive model (RF) with the SMOTE class balancing technique to understand the most significant clinical features that contributed to predicting lung cancer LOS with the RF model. Our promising framework allows us to employ ML techniques in hospital clinical information systems to predict lung cancer admissions into ICU.
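
As a rough illustration of the Tomek Links under-sampling mentioned above, the sketch below removes the majority-class member of every Tomek link, that is, of every pair of opposite-class points that are each other's nearest neighbour. The function name, the brute-force neighbour search, and the toy data are assumptions for illustration; this is not the study's pipeline.

```python
import math


def tomek_link_undersample(points, labels, majority_label):
    """Drop majority-class points that form a Tomek link with a minority point."""

    def nearest(idx):
        # Brute-force nearest neighbour by Euclidean distance.
        others = (j for j in range(len(points)) if j != idx)
        return min(others, key=lambda j: math.dist(points[idx], points[j]))

    nearest_of = [nearest(i) for i in range(len(points))]
    to_drop = set()
    for i, j in enumerate(nearest_of):
        # A Tomek link: mutual nearest neighbours with different labels.
        if nearest_of[j] == i and labels[i] != labels[j]:
            to_drop.add(i if labels[i] == majority_label else j)

    keep = [k for k in range(len(points)) if k not in to_drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy data: label 0 is the majority class, label 1 the minority class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (5.0, 5.0), (5.2, 5.0)]
labels = [0, 1, 0, 0, 0]
kept_points, kept_labels = tomek_link_undersample(points, labels, majority_label=0)
```

Only the majority point at the class boundary is removed here, which shows why this kind of cleaning usually changes the class ratio very little.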

2022
Willson B Gaul ◽ Dinara Sadykova ◽ Hannah J White ◽ Lupe León-Sánchez ◽ Paul Caplat ◽ et al.

Aim: Soil arthropods are important decomposers and nutrient cyclers, but are poorly represented on national and international conservation Red Lists. Opportunistic biological records for soil invertebrates are often sparse, and contain few observations of rare species but a relatively large number of non-detection observations (a problem known as class imbalance). Robinson et al. (2018) proposed a method for sub-sampling non-detection data using a spatial grid to improve class balance and spatial bias in bird data. For taxa that are less intensively sampled, datasets are smaller, which poses a challenge because under-sampling data removes information. We tested whether spatial under-sampling improved prediction performance of species distribution models for millipedes, for which large datasets are not available. We also tested whether using environmental predictor variables provided additional information beyond what is captured by spatial position for predicting species distributions. Location: Island of Ireland. Methods: We tested the spatial under-sampling method of Robinson et al. (2018) by using biological records to train species distribution models of rare millipedes. Results: Using spatially under-sampled training data improved species distribution model sensitivity (true positive rate) but decreased model specificity (true negative rate). The decrease in specificity was minimal for rarer species and was accompanied by substantial increases in sensitivity. For common species, specificity decreased more, and sensitivity increased less, making spatial under-sampling most useful for rare species. Geographic coordinates were as good as or better than environmental variables for predicting distributions of two out of six species. Main Conclusions: Spatial under-sampling improved prediction performance of species distribution models for rare soil arthropod species. Spatial under-sampling was most effective for rarer species. 
The good prediction performance of models using geographic coordinates is promising for modeling distributions of poorly studied species for which little is known about ecological or physiological determinants of occurrence.
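
A minimal sketch of grid-based spatial under-sampling, assuming one plausible reading of the approach: every detection is kept, and non-detections are thinned to one randomly chosen record per grid cell. The record layout, cell size, and function name are hypothetical and are not taken from Robinson et al. (2018) or from this study.

```python
import random


def spatial_under_sample(records, cell_size, rng):
    """Keep all detections; keep one random non-detection per grid cell.

    Each record is a tuple (x, y, detected).
    """
    detections = [r for r in records if r[2]]
    cells = {}
    for r in records:
        if not r[2]:
            cell = (int(r[0] // cell_size), int(r[1] // cell_size))
            cells.setdefault(cell, []).append(r)
    thinned_non_detections = [rng.choice(group) for group in cells.values()]
    return detections + thinned_non_detections


# Toy records: two detections and six non-detections clustered in two cells.
records = [
    (0.5, 0.5, True), (3.2, 1.1, True),
    (0.1, 0.2, False), (0.7, 0.9, False), (0.4, 0.6, False),
    (3.5, 1.5, False), (3.9, 1.2, False), (3.1, 1.8, False),
]
thinned = spatial_under_sample(records, cell_size=1.0, rng=random.Random(42))
```

The thinned set goes from a 2:6 to a 2:2 detection to non-detection ratio, at the cost of discarding four records, which is the information loss the abstract warns about for small datasets.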

2022 ◽ Vol 2161 (1) ◽ pp. 012072
Konduri Praveen Mahesh ◽ Shaik Ashar Afrouz ◽ Anu Shaju Areeckal

Every year, an increasing amount of money is lost to fraudulent credit card transactions. Recently, there has been a focus on using machine learning algorithms to identify fraudulent transactions. The ratio of fraud cases to non-fraud transactions is very low. This creates skewed or unbalanced data, which poses a challenge to training machine learning models. Public datasets for this research problem are scarce. The dataset used for this work is obtained from Kaggle. In this paper, we explore different sampling techniques, such as under-sampling, Synthetic Minority Oversampling Technique (SMOTE) and SMOTE-Tomek, to work on the unbalanced data. Classification models, such as k-Nearest Neighbour (KNN), logistic regression, random forest and Support Vector Machine (SVM), are trained on the sampled data to detect fraudulent credit card transactions. The performance of the various machine learning approaches is evaluated in terms of precision, recall and F1-score. The classification results obtained are promising and can be used for credit card fraud detection.
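
For reference, the three evaluation metrics named above can be computed from true and predicted labels as in the following sketch. The function name and the toy labels are illustrative assumptions and do not reproduce any result from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 3 fraud cases (label 1) among 10 transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here two of the three predicted frauds are correct and two of the three real frauds are found, so all three metrics equal 2/3.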

2022 ◽ Vol 10 (1) ◽ pp. 0-0

Heterogeneous Cross-Project Defect Prediction (HCPDP) attempts to forecast defects in a software application that has insufficient previous defect data. From a Class Imbalance Problem (CIP) perspective, however, one should have a clear view of the data distribution in the training dataset; otherwise, the trained model will lead to biased classification results. Class Imbalance Learning (CIL) is the method of achieving an equilibrium ratio between two classes in imbalanced datasets. There is a range of effective solutions to manage CIP, such as resampling techniques like Over-Sampling (OS) and Under-Sampling (US) methods. The proposed research work employs the Synthetic Minority Oversampling TEchnique (SMOTE) and the Random Under Sampling (RUS) technique to handle CIP. In addition, the paper proposes a novel four-phase HCPDP model and contrasts the efficiency of the basic HCPDP model with CIP against its efficiency after handling CIP using SMOTE and RUS, with three prediction pairs. Results show that training performance with SMOTE is substantially improved, but RUS displays variations in relation to HCPDP for all three prediction pairs.
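
The two resampling ideas can be sketched in a few lines. RUS keeps a random subset of the majority class, while a SMOTE-style step creates synthetic minority points by interpolating between a minority point and one of its nearest minority neighbours. The function names, the neighbour count, and the toy data are assumptions; this is a simplified illustration, not the paper's four-phase model.

```python
import math
import random


def random_under_sample(majority, n_keep, rng):
    """RUS: keep a random subset of the majority class."""
    return rng.sample(majority, n_keep)


def smote_like_over_sample(minority, n_new, k, rng):
    """SMOTE-style synthesis: interpolate towards a near minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
majority = [(float(i), float(i % 3)) for i in range(3, 13)]  # 10 points

majority_kept = random_under_sample(majority, n_keep=6, rng=rng)
synthetic = smote_like_over_sample(minority, n_new=3, k=2, rng=rng)
minority_balanced = minority + synthetic  # 6 minority vs 6 majority points
```

RUS discards real majority samples while the SMOTE-style step adds points that were never observed, so the two techniques can affect a model quite differently.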

2021
Alejandro Fernandez-Vega ◽ Federica Farabegoli ◽ Maria Mercedes Alonso-Martinez ◽ Ignacio Ortea

Data-independent acquisition (DIA) methods have gained great popularity in bottom-up quantitative proteomics, as they overcome the irreproducibility and under-sampling limitations of data-dependent acquisition (DDA). diaPASEF, recently developed for the timsTOF Pro mass spectrometers, has brought improvements to DIA, providing additional ion separation (in the ion mobility dimension) and increasing sensitivity. Several studies have benchmarked different workflows for DIA quantitative proteomics, but mostly using instruments from Sciex and Thermo; therefore, their results cannot be extrapolated to diaPASEF data. In this work, using a real-life sample set like the one that can be found in any proteomics experiment, we compared the results of analyzing PASEF data with different combinations of library-based and library-free analysis, combining the tools of the FragPipe suite and DIA-NN, including MS1-level LFQ with DDA-PASEF data, and also comparing with the workflows possible in Spectronaut. We verified that library-independent workflows, which were not very efficient until recently, have greatly improved in recent versions of the software tools and now perform as well as or even better than library-based ones. We report this information so that users planning a relative quantitative proteomics study on a timsTOF Pro mass spectrometer can make an informed decision on how to acquire the samples (diaPASEF for DIA analysis, or DDA-PASEF for MS1-level LFQ) and know what to expect from each data analysis tool among the alternatives offered by the recently optimized tools for TIMS-PASEF data analysis.

2021 ◽ Vol 29 (4)
Jean-Michel Bichain ◽ Julien Ryelandt

We report here the first record of Mediterranea depressa (Sterki, 1880) in the north-eastern quarter of France, in the Vosges and Jura massifs. After the fortuitous discovery of some shells attributed to M. depressa in the southern Vosges Mts., an extensive sampling campaign was carried out in both the Vosges and the Jura Mts. In total, about 20 shells and seven live specimens were found at eight localities, which, according to the present state of our knowledge, represent its north-western range limit. The species was found exclusively under stones of rocky slope screes on siliceous and calcareous substrates. Some of these habitats could be described as Mesovoid Shallow Substratum. It is not clear whether the rarity of the species is an effect of under-sampling, of its small size and unusual habitat, and/or of intrinsic rarity due to isolated populations at the distribution limits of the species. The extreme north-eastern quarter of France constitutes an oceanic-continental transition zone where about thirty gastropod species from Central and Eastern Europe are currently documented at the western limit of their ranges.

2021 ◽ Vol 14 (1)
Mahyar Sharifi ◽ Toktam Khatibi ◽ Mohammad Hassan Emamian ◽ Somayeh Sadat ◽ Hassan Hashemi ◽ et al.

Objectives: To develop and propose a machine learning model for predicting glaucoma and identifying its risk factors. Method: A data analysis pipeline is designed for this study based on the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. The main steps of the pipeline are data sampling, preprocessing, classification, and evaluation and validation. Data sampling for providing the training dataset was performed with balanced sampling based on over-sampling and under-sampling methods. The data preprocessing steps were missing value imputation and normalization. For the classification step, several machine learning models were designed for predicting glaucoma, including Decision Trees (DTs), K-Nearest Neighbors (K-NN), Support Vector Machines (SVM), Random Forests (RFs), Extra Trees (ETs) and Bagging Ensemble methods. Moreover, in the classification step, a novel stacking ensemble model is designed and proposed using the superior classifiers. Results: The data were from the Shahroud Eye Cohort Study, including demographic and ophthalmology data for 5190 participants aged 40-64 living in Shahroud, northeast Iran. The main variables considered in this dataset were 67 demographic, ophthalmologic, optometric, perimetry, and biometry features for 4561 people, including 4474 non-glaucoma participants and 87 glaucoma patients. Experimental results show that DTs and RFs trained on an under-sampled training dataset predict glaucoma better than the compared single classifiers and bagging ensemble methods, with average accuracy of 87.61 and 88.87, sensitivity of 73.80 and 72.35, specificity of 87.88 and 89.10, and area under the curve (AUC) of 91.04 and 94.53, respectively. The proposed stacking ensemble has an average accuracy of 83.56, a sensitivity of 82.21, a specificity of 81.32, and an AUC of 88.54.
Conclusions: In this study, a machine learning model is proposed and developed to predict glaucoma among persons aged 40-64. The top predictors for discriminating non-glaucoma persons from glaucoma patients in this study include the number of the visual field detect on perimetry, vertical cup to disk ratio, white to white diameter, systolic blood pressure, pupil barycenter on Y coordinate, age, and axial length.
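
A short worked example shows why sensitivity and specificity are reported alongside accuracy for data this imbalanced. Using only the class counts given above (87 glaucoma patients and 4474 non-glaucoma participants), a hypothetical classifier that labels everyone as non-glaucoma would be highly accurate yet detect no patients. The function name is an illustrative assumption.

```python
def sensitivity_specificity_accuracy(tp, fn, tn, fp):
    """Standard confusion-matrix metrics, returned as fractions in [0, 1]."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Class counts from the abstract; the "always negative" classifier is hypothetical.
n_glaucoma, n_non_glaucoma = 87, 4474
sensitivity, specificity, accuracy = sensitivity_specificity_accuracy(
    tp=0, fn=n_glaucoma, tn=n_non_glaucoma, fp=0
)
# accuracy is 4474 / 4561, roughly 0.98, while sensitivity is 0.0
```

Balanced sampling of the training set is one way to push a model away from this degenerate solution.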

2021 ◽ Vol 3 (4) ◽ pp. 249-259
Joy Iong-Zong Chen ◽ Lu-Tsou Yeh

In power systems, electrical losses can be categorized into two types, namely Technical Losses (TLs) and Non-Technical Losses (NTLs). NTLs have been identified as more hazardous than TLs, primarily due to factors such as billing errors, faulty meters and electricity theft. This is a critical problem in power systems and results in heavy financial losses for utility companies. To identify theft, both academia and industry use a mechanism known as Electricity Theft Detection (ETD). However, ETD is not used efficiently because of the difficulty of handling high-dimensional data, overfitting issues and imbalanced data. Hence, this paper proposes a means of addressing these issues using the Random Under-Sampling Boosting (RUSBoost) technique and the Long Short-Term Memory (LSTM) technique. Here, parameter optimization is performed using RUSBoost and abnormal electricity patterns are detected by the LSTM technique. In the proposed methodology, electricity data are pre-processed using interpolation and normalization methods. The data thus obtained are then sent to the LSTM module, where feature extraction takes place. These features are then classified using the RUSBoost algorithm. The simulated output indicates that this methodology addresses several issues, such as handling massive time series data, overfitting and data imbalance. Moreover, the technique proves to be more efficient than several other methodologies, such as Logistic Regression (LR), Convolutional Neural Network (CNN) and Support Vector Machine (SVM). A comparison is also drawn in terms of receiver operating characteristics, recall, precision and F1-score.
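
The pre-processing step described above can be sketched as follows, assuming linear interpolation for missing meter readings and min-max scaling to the range [0, 1]. The abstract does not specify which interpolation or normalization variants were used, so both choices, the function names, and the toy readings are assumptions.

```python
def interpolate_missing(series):
    """Fill None gaps linearly; gaps at the edges copy the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy consumption readings with missing entries (None).
readings = [2.0, None, 4.0, None, None, 10.0]
filled = interpolate_missing(readings)
normalized = min_max_normalize(filled)
```

The cleaned, scaled series is what would then be passed on to a sequence model such as an LSTM for feature extraction.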
