data sampling
Recently Published Documents





Jesmeen Mohd Zebaral Hoque ◽  
Jakir Hossen ◽  
Shohel Sayeed ◽  
Chy. Mohammed Tawsif K. ◽  
Jaya Ganesan ◽  

The healthcare industry has recently begun generating large volumes of data. If hospitals can put these data to use, they could predict outcomes and provide better treatment at an early stage and at low cost. Here, data analytics (DA) was used to support correct decisions through proper analysis and prediction. However, inappropriate data may lead to flawed analysis and thus to unacceptable conclusions. Hence, it is essential to transform the improper records in a data set into useful data. A machine learning (ML) technique was used to overcome the issues caused by incomplete data. A new architecture, automatic missing value imputation (AMVI), was developed to predict missing values in the dataset; it includes data sampling and feature selection. Four prediction models (logistic regression, support vector machine (SVM), AdaBoost, and random forest) were selected from among well-known classification algorithms. The performance of the complete AMVI architecture was evaluated using a structured data set obtained from the UCI repository. An accuracy of around 90% was achieved. Cross-validation also confirmed that the trained ML model is suitable and not over-fitted. The trained model is built from the dataset alone and does not depend on a specific environment. The architecture trains and selects the best-performing model for whatever data are available.
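As a rough illustration of what missing value imputation means in practice, the sketch below fills gaps with the column mean. This is a deliberately simple stand-in, not the AMVI architecture, which the abstract describes as predicting missing values with ML models. The function name `impute_column_means` and the sample records are hypothetical.

```python
from statistics import mean

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    Mean imputation is a simple baseline used here for illustration only;
    it is not the learned predictor described in the abstract.
    A column with no observed values raises statistics.StatisticsError.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    # Keep observed values, substitute the column mean where a value is missing.
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical two-feature records with gaps marked as None.
records = [
    [63.0, 145.0],
    [None, 130.0],
    [57.0, None],
]
completed = impute_column_means(records)
```

A model-based imputer would replace the column mean with a prediction from the other features of the same record.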

2022 ◽  
Vol 40 (4) ◽  
pp. 1-28
Peng Zhang ◽  
Baoxi Liu ◽  
Tun Lu ◽  
Xianghua Ding ◽  
Hansu Gu ◽  

User-generated content (UGC) in social media is the direct expression of users’ interests, preferences, and opinions. User behavior prediction based on UGC has increasingly been investigated in recent years. Compared to learning a person’s behavioral patterns in each social media site separately, jointly predicting user behavior across multiple social media sites so that the predictions complement each other (cross-site user behavior prediction) can be more accurate. However, cross-site user behavior prediction based on UGC is a challenging task due to the difficulty of cross-site data sampling, the complexity of UGC modeling, and the uncertainty of knowledge sharing among different sites. To address these problems, we propose a Cross-Site Multi-Task (CSMT) learning method to jointly predict user behavior in multiple social media sites. CSMT mainly derives from the hierarchical attention network and multi-task learning. Using this method, the UGC in each social media site can obtain fine-grained representations in terms of words, topics, posts, hashtags, and time slices, as well as the relevance among them, and prediction tasks in different social media sites can be jointly implemented and complement each other. Using two cross-site datasets sampled from Weibo, Douban, Facebook, and Twitter, we validate our method’s superiority on several classification metrics compared with existing related methods.

2022 ◽  
Vol 22 (1) ◽  
pp. 1-18
Alessio Pagani ◽  
Zhuangkun Wei ◽  
Ricardo Silva ◽  
Weisi Guo

Infrastructure monitoring is critical for safe operations and sustainability. Like many networked systems, water distribution networks (WDNs) exhibit both graph topological structure and complex embedded flow dynamics. The resulting networked cascade dynamics are difficult to predict without extensive sensor data. However, ubiquitous sensor monitoring in underground situations is expensive, and a key challenge is to infer the contaminant dynamics from partial sparse monitoring data. Existing approaches use multi-objective optimization to find the minimum set of essential monitoring points but lack performance guarantees and a theoretical framework. Here, we first develop a novel Graph Fourier Transform (GFT) operator to compress networked contamination dynamics to identify the essential principal data collection points with inference performance guarantees. As such, the GFT approach provides the theoretical sampling bound. We then achieve under-sampling performance by building auto-encoder (AE) neural networks (NN) to generalize the GFT sampling process and under-sample further from the initial sampling set, allowing a very small set of data points to largely reconstruct the contamination dynamics over real and artificial WDNs. Various sources of the contamination are tested, and we obtain high accuracy reconstruction using around 5%–10% of the network nodes for known contaminant sources, and 50%–75% for unknown source cases, which although larger than that of the schemes for contaminant detection and source identifications, is smaller than the current sampling schemes for contaminant data recovery. This general approach of compression and under-sampled recovery via NN can be applied to a wide range of networked infrastructures to enable efficient data sampling for digital twins.
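For orientation, the textbook graph Fourier transform and the sampling condition it leads to can be written as below. This is the standard form for an undirected graph with adjacency matrix A and degree matrix D; the abstract does not define its own GFT operator, so the symbols here (L, U, Λ, S, K) are assumptions and may differ from the paper's construction.

```latex
% Standard graph Fourier transform (textbook form; not necessarily the paper's operator)
L = D - A = U \Lambda U^{\top}, \qquad
\hat{x} = U^{\top} x, \qquad
x = U \hat{x}

% If \hat{x} is nonzero only on a set K of graph frequencies,
% samples taken on a node set S with |S| \ge |K| satisfy
x_{S} = U_{S,K}\, \hat{x}_{K}, \qquad
\hat{x}_{K} = U_{S,K}^{+}\, x_{S}
\quad \text{when } \operatorname{rank}(U_{S,K}) = |K|
```

Under this reading, the number of retained frequencies |K| is the smallest sample count that still allows exact recovery, which is one way to interpret a theoretical sampling bound.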

2022 ◽  
Vol 18 (1) ◽  
pp. 1-13
David Thompson ◽  
Haibo Wang

This work presents a methodology to monitor the power signature of IoT devices for detecting abnormal operation. It does not require bulky measurement equipment, thanks to the proposed power signature generation circuit, which can be integrated into LDO voltage regulators. The proposed circuit is implemented in a 130 nm CMOS technology and simulated with a power trace measured from a wireless sensor. The simulation shows that the generated power signature accurately reflects the power consumption and can be used to distinguish different operating conditions, such as wireless transmission levels, data sampling rates, and microcontroller UART communications.

2022 ◽  
Vol 31 (1) ◽  
pp. 1-38
Yingzhe Lyu ◽  
Gopi Krishnan Rajbahadur ◽  
Dayi Lin ◽  
Boyuan Chen ◽  
Zhen Ming (Jack) Jiang

Artificial Intelligence for IT Operations (AIOps) has been adopted in organizations in various tasks, including interpreting models to identify indicators of service failures. To avoid misleading practitioners, AIOps model interpretations should be consistent (i.e., different AIOps models on the same task agree with one another on feature importance). However, many AIOps studies violate established practices in the machine learning community when deriving interpretations, such as interpreting models with suboptimal performance, though the impact of such violations on the interpretation consistency has not been studied. In this article, we investigate the consistency of AIOps model interpretation along three dimensions: internal consistency, external consistency, and time consistency. We conduct a case study on two AIOps tasks: predicting Google cluster job failures and Backblaze hard drive failures. We find that the randomness from learners, hyperparameter tuning, and data sampling should be controlled to generate consistent interpretations. AIOps models with AUCs greater than 0.75 yield more consistent interpretation compared to low-performing models. Finally, AIOps models that are constructed with the Sliding Window or Full History approaches have the most consistent interpretation with the trends presented in the entire datasets. Our study provides valuable guidelines for practitioners to derive consistent AIOps model interpretation.
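The two sampling-related points in this abstract can be sketched in a few lines: fixing the seed so that data sampling is repeatable, and choosing which time periods feed each training run. The sketch assumes that "Sliding Window" means training on the most recent `window` periods and "Full History" means training on all earlier periods; the abstract does not spell this out, and the function names are hypothetical.

```python
import random

def sample_rows(rows, k, seed):
    """Draw k rows without replacement using a private, seeded generator.

    A dedicated random.Random instance keeps the draw repeatable and
    independent of any other code that touches the global random state.
    """
    rng = random.Random(seed)
    return rng.sample(rows, k)

def training_periods(n_periods, window, full_history):
    """Yield (train_periods, test_period) pairs for time-ordered data.

    Assumed definitions (not taken from the abstract):
    - full_history=True: train on every period before the test period.
    - full_history=False: train on at most the last `window` periods.
    """
    for test in range(1, n_periods):
        start = 0 if full_history else max(0, test - window)
        yield list(range(start, test)), test

rows = list(range(100))
first_draw = sample_rows(rows, 5, seed=42)
second_draw = sample_rows(rows, 5, seed=42)   # same seed, same sample

sliding = list(training_periods(4, window=2, full_history=False))
full = list(training_periods(4, window=2, full_history=True))
```

Learner seeds and hyperparameter search seeds would need the same treatment, since the abstract names all three as sources of randomness.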

2022 ◽  
Meike E van der Heijden ◽  
Amanda M Brown ◽  
Roy V Sillitoe

In vivo single-unit recordings distinguish the basal spiking properties of neurons in different experimental settings and disease states. Here, we examined over 300 spike trains recorded from Purkinje cells and cerebellar nuclei neurons to test whether data sampling approaches influence the extraction of rich descriptors of firing properties. Our analyses included neurons recorded in awake and anesthetized control mice, as well as disease models of ataxia, dystonia, and tremor. We find that recording duration circumscribes overall representations of firing rate and pattern. Notably, shorter recording durations skew estimates for global firing rate variability towards lower values. We also find that only some populations of neurons in the same mouse are more similar to each other than to neurons recorded in different mice. These data reveal that recording duration and approach are primary considerations when interpreting task-independent single-neuron firing properties. If not accounted for, group differences may be concealed or exaggerated.
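To make the dependence on recording duration concrete, the sketch below computes a firing rate and the coefficient of variation (CV) of inter-spike intervals from the same spike train truncated at two durations. The spike times are synthetic and were chosen so that the irregular firing happens late in the recording; they are not data from the study, and CV of inter-spike intervals is only one possible variability measure.

```python
from statistics import mean, pstdev

def firing_stats(spike_times, duration):
    """Return (rate in Hz, CV of inter-spike intervals) for spikes in [0, duration).

    spike_times must be sorted and expressed in seconds.
    CV is NaN when fewer than two intervals are available.
    """
    kept = [t for t in spike_times if t < duration]
    isis = [later - earlier for earlier, later in zip(kept, kept[1:])]
    rate = len(kept) / duration
    cv = pstdev(isis) / mean(isis) if len(isis) >= 2 else float("nan")
    return rate, cv

# Synthetic spike train: regular firing early, irregular firing later.
spikes = [0.1, 0.2, 0.3, 0.4, 0.5, 1.5, 1.6, 2.9]

short_rate, short_cv = firing_stats(spikes, duration=0.6)
full_rate, full_cv = firing_stats(spikes, duration=3.0)
```

In this constructed case the short window sees only the regular segment, so it reports a higher rate and a much lower CV than the full recording.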

Cruz Y. Li ◽  
Zengshun Chen ◽  
Tim K. T. Tse ◽  
Asiri U. Weerasuriya ◽  
Xuelin Zhang ◽  

Scientific research and engineering practice often require the modeling and decomposition of nonlinear systems. The dynamic mode decomposition (DMD) is a novel Koopman-based technique that effectively dissects high-dimensional nonlinear systems into periodically distinct constituents on reduced-order subspaces. As a young mathematical technique, the DMD holds vast potential but carries an equal degree of unknowns. This effort investigates the nuances of DMD sampling with an engineering-oriented emphasis. It aims to elucidate how sampling range and resolution affect the convergence of DMD modes. We employed the most classical nonlinear system in fluid mechanics, the turbulent free-shear flow over a prism, as the test subject for optimal pertinency. We numerically simulated the flow by the dynamic-stress Large-Eddy Simulation with Near-Wall Resolution. With the large-quantity, high-fidelity data, we parametrized and identified four global convergence states with increasing sampling range: Initialization, Transition, Stabilization, and Divergence. Results showed that Stabilization is the optimal state for modal convergence, in which DMD output becomes independent of the sampling range. The Initialization state also yields sufficient accuracy for most system reconstruction tasks. Moreover, defying popular belief, over-sampling causes algorithmic instability: as the temporal dimension, n, approaches and transcends the spatial dimension, m (i.e., m < n), the output diverges and becomes meaningless. Additionally, the convergence of the sampling resolution depends on the mode-specific dynamics, such that a resolution of 15 frames per cycle for target activities is suggested for most engineering implementations. Finally, a bi-parametric study revealed that the convergence of the sampling range and that of the sampling resolution are mutually independent.
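The two numerical guidelines in this abstract translate into simple bookkeeping before any DMD is run. The sketch below assumes a known target frequency and treats the 15 frames per cycle and the n < m condition as the only inputs; the function names and the example numbers (a 2 Hz target, 20 cycles, 50,000 spatial points) are hypothetical.

```python
def sampling_interval(target_frequency_hz, frames_per_cycle=15):
    """Time step between snapshots that gives `frames_per_cycle` frames per period."""
    return 1.0 / (target_frequency_hz * frames_per_cycle)

def snapshot_count(n_cycles, frames_per_cycle=15):
    """Number of snapshots (temporal dimension n) for a given sampling range in cycles."""
    return n_cycles * frames_per_cycle

def within_spatial_limit(n_snapshots, n_spatial_points):
    """True while the temporal dimension n stays below the spatial dimension m."""
    return n_snapshots < n_spatial_points

dt = sampling_interval(2.0)                 # seconds between snapshots for a 2 Hz target
n = snapshot_count(20)                      # snapshots covering 20 cycles
ok = within_spatial_limit(n, 50_000)        # compare against m spatial points
```

Because the abstract reports that range and resolution converge independently, the two quantities can be chosen separately and then checked together against m.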

2022 ◽  
Vol 1 (3) ◽  
pp. 160-167
Iga Maliga ◽  
Herni Hasifah ◽  
Desy Fadilah Adina Putri

Population growth and the improving economic conditions of Indonesian society have, without our noticing, caused environmental quality to decline. This decline is visible in the many environments that have been degraded by pollution. Kukin Village is a densely populated area in Sumbawa District that does not yet have access to garbage trucks. This study aims to determine the amount of waste generated in Kukin Village, Sumbawa Regency, and the impacts of the accumulated waste. The study combines quantitative and qualitative descriptive methods. The sample consisted of 40 housewives as respondents, selected by purposive sampling. Daily waste sampling data were collected over 7 days. In-depth interviews and observation were used to build a matrix of the impacts felt by residents. The impact matrix covers sociocultural, health, and environmental aspects. The results show an average daily waste generation of 32.6 kg, an average daily waste generation per person of 0.82 kg, and a waste density of 190 kg/L. Negative impacts on the environment were predicted using a matrix. Waste accumulation in the Kukin Village area can cause impacts related to geophysical conditions, biotic conditions, and social, economic, and cultural conditions.

2022 ◽  
Vol 2022 ◽  
pp. 1-17
Zhihui Hu ◽  
Xiaoran Wei ◽  
Xiaoxu Han ◽  
Guang Kou ◽  
Haoyu Zhang ◽  

Density peaks clustering (DPC) is a well-known density-based clustering algorithm that can deal with nonspherical clusters well. However, DPC has high computational complexity and space complexity in calculating local density ρ and distance δ, which makes it suitable only for small-scale data sets. In addition, for clustering high-dimensional data, the performance of DPC still needs to be improved. High-dimensional data not only make the data distribution more complex but also lead to more computational overheads. To address the above issues, we propose an improved density peaks clustering algorithm, which combines feature reduction and data sampling strategy. Specifically, features of the high-dimensional data are automatically extracted by principal component analysis (PCA), auto-encoder (AE), and t-distributed stochastic neighbor embedding (t-SNE). Next, in order to reduce the computational overhead, we propose a novel data sampling method for the low-dimensional feature data. Firstly, the data distribution in the low-dimensional feature space is estimated by the Quasi-Monte Carlo (QMC) sequence with low-discrepancy characteristics. Then, the representative QMC points are selected according to their cell densities. Next, the selected QMC points are used to calculate ρ and δ instead of the original data points. In general, the number of the selected QMC points is much smaller than that of the initial data set. Finally, a two-stage classification strategy based on the QMC points clustering results is proposed to classify the original data set. Compared with current works, our proposed algorithm can reduce the computational complexity from O(n²) to O(Nn), where N denotes the number of selected QMC points and n is the size of original data set, typically N ≪ n. Experimental results demonstrate that the proposed algorithm can effectively reduce the computational overhead and improve the model performance.
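The quadratic cost mentioned in this abstract comes from the two quantities that baseline DPC computes for every point. The sketch below shows that baseline in its usual cutoff-kernel form, where ρ counts neighbors within a cutoff distance and δ is the distance to the nearest point of higher density. It is the original DPC step only, not the QMC-based variant the abstract proposes; the function name, the cutoff `d_c`, and the sample points are hypothetical.

```python
import math

def density_peaks(points, d_c):
    """Compute local density rho and distance delta for each point (baseline DPC).

    rho[i]   = number of other points closer than the cutoff d_c.
    delta[i] = distance to the nearest point with strictly higher rho;
               for points with no denser neighbor, the largest distance
               from that point to any other point.
    Building the full distance matrix is what makes this O(n^2).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Two small groups of 2-D points; the first point is the densest.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = density_peaks(points, d_c=1.2)
```

As a sense of scale under assumed sizes, n = 100,000 points would need about 10^10 pairwise distances here, whereas N = 1,000 representative points would bring an O(Nn) pass down to about 10^8.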
