How to balance the bioinformatics data: pseudo-negative sampling

2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Yongqing Zhang ◽  
Shaojie Qiao ◽  
Rongzhao Lu ◽  
Nan Han ◽  
Dingxiang Liu ◽  
...  

Abstract Background Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, data imbalance leads us to underestimate the performance of the minority class of positive samples. Therefore, how to balance bioinformatics data becomes a very challenging problem. Results In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be effectively applied when negative samples greatly outnumber positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond Pearson correlation coefficient (MMPCC), which is used to choose pseudo-negative samples from the negative samples and treat them as positive samples. In addition, MMPCC uses an incremental search technique to select optimal pseudo-negative samples, which reduces the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to positive samples and low redundancy with negative ones. Conclusions To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. According to the experimental results, MMPCC performs better than other sampling methods in terms of Sensitivity, Specificity, Accuracy and the Matthews Correlation Coefficient. This indicates that pseudo-negative samples are particularly helpful for solving the imbalanced dataset problem. Moreover, the gain of Sensitivity from the minority samples with pseudo-negative samples grows with the improvement of prediction accuracy on all datasets.
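
The selection idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' MMPCC implementation: the function names `pearson` and `select_pseudo_negatives` are hypothetical, and the scoring rule (mean Pearson correlation with the positive samples minus mean correlation with the pseudo-negatives already chosen) is an assumption borrowed from the classic incremental max-relevance min-redundancy scheme. The paper's exact criterion may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_pseudo_negatives(positives, negatives, k):
    """Greedily pick up to k indices into `negatives` (assumed scoring rule).

    Relevance: mean correlation of a negative sample with all positive samples.
    Redundancy: mean correlation with the pseudo-negatives selected so far.
    Each step adds the candidate with the highest relevance minus redundancy.
    """
    relevance = [
        sum(pearson(neg, pos) for pos in positives) / len(positives)
        for neg in negatives
    ]
    selected = []
    remaining = set(range(len(negatives)))
    while remaining and len(selected) < k:

        def score(i):
            if not selected:
                return relevance[i]
            redundancy = sum(
                pearson(negatives[i], negatives[j]) for j in selected
            ) / len(selected)
            return relevance[i] - redundancy

        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The incremental search only scores the remaining candidates against the current selection at each step, which is what keeps the cost below that of evaluating every possible subset.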

Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Liping Chen ◽  
Jiabao Jiang ◽  
Yong Zhang

Classical classifiers are ineffective at classifying imbalanced big datasets. Resampling the datasets and balancing the sample distribution before training the classifier is one of the most popular approaches to this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all the data samples are partitioned into different data regions. Then, the data samples in the noise minority samples region are removed, and the samples in the boundary minority samples region are selected as oversampling seeds to generate the synthetic samples. Finally, a weighted oversampling process is conducted in which synthetic samples are generated in the same cluster as the oversampling seed. The weight of each selected minority class sample is computed as the ratio between the proportion of the majority class among the neighbors of this selected sample and the sum of all these proportions. Generating synthetic samples in the same cluster as the oversampling seed guarantees that new synthetic samples are located inside the minority class area. Experiments conducted on eight datasets show that the proposed method, HSDP, is better than or comparable with the typical sampling methods in terms of F-measure and G-mean.
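
The weighting rule in this abstract can be sketched as follows. This is an illustration under assumptions rather than the HSDP code: the function name `oversampling_weights`, the use of Euclidean k-nearest neighbours, and the uniform fallback when no seed has majority-class neighbours are all hypothetical choices.

```python
from math import dist


def oversampling_weights(points, labels, seed_indices, k, majority_label=0):
    """Weight each oversampling seed by the share of majority-class neighbours.

    For every seed, take its k nearest neighbours (Euclidean distance, seed
    excluded), compute the proportion that belong to the majority class, and
    divide that proportion by the sum of the proportions over all seeds.
    """
    proportions = []
    for s in seed_indices:
        neighbours = sorted(
            (i for i in range(len(points)) if i != s),
            key=lambda i: dist(points[s], points[i]),
        )[:k]
        majority = sum(1 for i in neighbours if labels[i] == majority_label)
        proportions.append(majority / len(neighbours))
    total = sum(proportions)
    if total == 0:
        # Assumed fallback: no seed touches the majority class, so weight evenly.
        return [1 / len(seed_indices)] * len(seed_indices)
    return [p / total for p in proportions]
```

Under this rule, seeds that sit closer to the class boundary (more majority-class neighbours) receive larger weights.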


Classification is a major challenge in machine learning in general, and especially when tackling the class imbalance problem. A dataset is said to be imbalanced if the class of interest falls in the minority class and is scarce compared with the majority class; the minority class is also known as the positive class, and the majority class as the negative class. Class imbalance has been a major bottleneck for machine learning scientists, as it often leads to using the wrong model for a given purpose. This survey aims to help researchers choose the right model and the best strategies for handling imbalanced datasets when tackling machine learning problems. Proper handling of class-imbalanced datasets can lead to accurate and good results. Handling class-imbalanced data in a conventional manner, especially when the level of imbalance is high, may lead to the accuracy paradox (an assumption of realizing 99% accuracy during the evaluation process when the class distribution is highly imbalanced). Hence, imbalanced class distributions require special consideration, and for this purpose we deal extensively with handling and solving the imbalanced class problem in machine learning, covering the data sampling approach, the cost-sensitive learning approach and the ensemble approach.
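
The accuracy paradox mentioned above can be made concrete with a small worked example. The helper `confusion_metrics` is a hypothetical name, and the counts (990 negatives, 10 positives) are illustrative rather than taken from the survey.

```python
from math import sqrt


def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient
    computed from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# A classifier that labels every sample as negative,
# evaluated on 990 negatives and 10 positives.
always_negative = confusion_metrics(tp=0, fp=0, tn=990, fn=10)
print(always_negative)
```

Here the accuracy is 0.99 even though the sensitivity and the MCC are both 0.0: the classifier never finds a single positive sample, which is why accuracy alone is misleading on highly imbalanced data.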


2020 ◽  
Vol 11 (01) ◽  
pp. 144-150
Author(s):  
Narender Kaloria ◽  
Nidhi Bidyut Panda ◽  
Hemant Bhagat ◽  
Neha Kaloria ◽  
Shiv Lal Soni ◽  
...  

Abstract Background The intracranial pressure (ICP) is measured through various noninvasive methods to overcome complications of invasive ICP monitoring. In this study, transcranial Doppler was used to measure the pulsatility index (PI) and resistive index (RI), which were correlated with the opening intraventricular ICP. The opening intraventricular ICP was measured by placing an intraventricular catheter in the lateral ventricle without loss of cerebrospinal fluid. Methods The prospective, observational study was conducted on 40 patients with clinical and radiological features of raised ICP who underwent either endoscopic third ventriculostomy or ventriculoperitoneal shunt surgery. The PI and RI were measured simultaneously with opening ICP measurements under general anesthesia. Both PI and RI were correlated with ICP by using the Pearson correlation coefficient. The receiver operating characteristic (ROC) curve was used to obtain the optimal values of PI and RI for corresponding ICP values. Results The mean PI was 1.01 ± 0.41 and the mean RI was 0.59 ± 0.32. The mean opening ICP value was 21.81 ± 8.68 mm Hg. The correlations of PI and RI with ICP were statistically significant, with correlation coefficients of 0.697 and 0.503, respectively. The ROC curve showed a statistically significant association between PI and ICP from 15 to 40 mm Hg, whereas the association between RI and ICP held from 15 to 25 mm Hg, with various sensitivity and specificity. Conclusion The opening intraventricular ICP correlated better with PI than RI in patients with features of raised ICP.


Author(s):  
Novikova ◽  
SP Romanenko ◽  
MA Lobkis

Introduction: In the Russian Federation, much attention is traditionally paid to military education and training. A special place in its structure is occupied by the system of cadet classes and corps. A distinctive feature of the learning mode in such institutions is a combined effect of standard and specific factors of indoor school environment and intensive physical activity owing to sports, applied military and drill training. No evidence-based methods of establishing nutrient requirements of children in modern conditions of cadet corps have been developed so far, which predetermines the potential of transforming nutrition from a health-saving factor into a health risk factor. Our objective was to provide a scientific substantiation of the model of healthy nutrition for students of cadet-type educational establishments. Methods: The statistical significance of the correlation was evaluated using Student’s t-test. Correlation and regression analyses were used to assess cause-and-effect relationships. The Pearson correlation coefficient (rxy) was used as an indicator of the strength of the relationship between quantitative indicators x and y, both having a normal distribution. Correlation coefficient (rxy) values were interpreted in accordance with the Chaddock scale. For the purpose of statistical modeling, multiple linear regression was used. Conclusions: We substantiated the innovative model of organizing healthy nutrition for students of cadet-type schools based on the correlation and regression analyses with determination of statistical significance of the studied characteristics. Its efficiency indicators include an increase in average functional capabilities of students by more than 10 % and a reduction in the probability of developmental disorders by more than 25 %.
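
The two statistical steps named in the Methods, testing a Pearson coefficient with Student's t-test and reading its strength off the Chaddock scale, can be sketched as below. The function names are hypothetical, and the cut points (0.1, 0.3, 0.5, 0.7, 0.9) and English labels are the commonly quoted version of the Chaddock scale, which is an assumption here because the abstract does not list them.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, given a sample Pearson r and n pairs.

    t = r * sqrt((n - 2) / (1 - r^2)), compared against Student's t
    distribution with n - 2 degrees of freedom.
    """
    if n < 3:
        raise ValueError("at least 3 observations are required")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * sqrt((n - 2) / (1 - r * r))


def chaddock_label(r):
    """Qualitative strength of a correlation (assumed Chaddock cut points)."""
    strength = abs(r)
    if strength < 0.1:
        return "negligible"
    if strength < 0.3:
        return "weak"
    if strength < 0.5:
        return "moderate"
    if strength < 0.7:
        return "noticeable"
    if strength < 0.9:
        return "high"
    return "very high"
```

For instance, r = 0.6 observed on 18 pairs gives t = 3.0 with 16 degrees of freedom, and the coefficient falls in the "noticeable" band of this scale.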


2020 ◽  
Vol 16 (1) ◽  
pp. 47-53
Author(s):  
Vicente Benavides-Córdoba ◽  
Mauricio Palacios Gómez

Introduction: Animal models have been used to understand the pathophysiology of pulmonary hypertension, to describe the mechanisms of action and to evaluate promising active ingredients. The monocrotaline-induced pulmonary hypertension model is the most used animal model. In this model, invasive and non-invasive hemodynamic variables that resemble human measurements have been used. Aim: To define if non-invasive variables can predict hemodynamic measures in the monocrotaline-induced pulmonary hypertension model. Materials and Methods: Twenty 6-week-old male Wistar rats weighing 250-300 g from the bioterium of the Universidad del Valle (Cali - Colombia) were used in order to establish that the relationships between invasive and non-invasive variables are sustained in different conditions (healthy, hypertrophy and treated). The animals were organized into three groups: a control group that was given 0.9% saline solution subcutaneously (sc), a group with pulmonary hypertension induced with a single subcutaneous dose of monocrotaline 30 mg/kg, and a group with pulmonary hypertension induced with 30 mg/kg of monocrotaline and treated with sildenafil. Right ventricle ejection fraction (RVEF), heart rate, right ventricle systolic pressure (RVSP) and the extent of hypertrophy were measured. The functional relation between any two variables was evaluated by the Pearson correlation coefficient. Results: It was found that all correlations were statistically significant (p < 0.01). The strongest correlation was the inverse one between the RVEF and the Fulton index (r = -0.82). The Fulton index also had a strong correlation with the RVSP (r = 0.79). The Pearson correlation coefficient between the RVEF and the RVSP was -0.81, meaning that the higher the systolic pressure in the right ventricle, the lower the ejection fraction value. Heart rate was significantly correlated with the other three variables studied, although the correlations were relatively low. Conclusion: The correlations obtained in this study indicate that the parameters evaluated in the research related to experimental pulmonary hypertension correlate adequately and that the measurements that are currently made are adequate and consistent with each other, that is, they have good predictive capacity.


Author(s):  
Yu Wang ◽  
Jiantao Wang ◽  
Haiping Wang ◽  
Xinyu Yang ◽  
Liming Chang ◽  
...  

Objective: Accurate preoperative assessment of breast tumor size is important for the initial decision-making on the surgical approach. Therefore, we aimed to compare the efficacy of mammography and ultrasonography in ductal carcinoma in situ (DCIS) of breast cancer. Methods: Preoperative mammography and ultrasonography were performed on 104 women with DCIS of breast cancer. We compared the accuracy of each of the imaging modalities with pathological size by Pearson correlation. For each modality, a measurement was considered concordant if the difference between the imaging assessment and the pathological measurement was less than 0.5 cm. Results: At pathological examination, tumor size ranged from 0.4 cm to 7.2 cm in largest diameter. For mammographically determined size versus pathological size, the correlation coefficient r was 0.786, and for ultrasonography it was 0.651. Grouped by breast composition, in almost entirely fatty breasts and breasts with scattered areas of fibroglandular density, the correlation coefficient r was 0.790 for mammography and 0.678 for ultrasonography; in heterogeneously dense and extremely dense breasts, the correlation coefficient r was 0.770 for mammography and 0.548 for ultrasonography. In the microcalcification-positive group, the coefficient r was 0.772 for mammography and 0.570 for ultrasonography. In the microcalcification-negative group, the coefficient r was 0.806 for mammography and 0.783 for ultrasonography. Conclusion: Mammography was more accurate than ultrasonography in measuring the largest cancer diameter in DCIS of breast cancer. The correlation coefficient improved in the group with almost entirely fatty breasts or scattered areas of fibroglandular density and in the microcalcification-negative group.
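
The concordance rule stated in the Methods (a difference of less than 0.5 cm between imaging and pathological size) can be expressed as a short sketch. The function name `concordance_rate` is hypothetical, and the sizes in the usage lines are made-up illustrative values, not data from the study.

```python
def concordance_rate(imaging_cm, pathology_cm, tolerance_cm=0.5):
    """Fraction of cases whose imaging size differs from the pathological
    size by strictly less than `tolerance_cm` centimetres."""
    if len(imaging_cm) != len(pathology_cm):
        raise ValueError("both lists must have one entry per case")
    if not imaging_cm:
        raise ValueError("at least one case is required")
    concordant = sum(
        1
        for img, path in zip(imaging_cm, pathology_cm)
        if abs(img - path) < tolerance_cm
    )
    return concordant / len(imaging_cm)


# Illustrative sizes in cm (not study data): the differences are 0.2, 1.0 and 0.4.
rate = concordance_rate([1.2, 3.0, 5.0], [1.0, 2.0, 5.4])
print(rate)
```

In this toy input two of the three cases are concordant. A concordance rate complements the Pearson coefficient, which measures linear association but not whether the two measurements actually agree in size.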


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yingxi Yang ◽  
Hui Wang ◽  
Wen Li ◽  
Xiaobo Wang ◽  
Shizhao Wei ◽  
...  

Abstract Background Protein post-translational modification (PTM) is key to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of the in-depth study and analysis of PTMs in proteins. Method We proposed a new multi-classification machine learning pipeline, MultiLyGAN, to identify seven types of lysine-modified sites. Using eight different sequential and five structural construction methods, 1497 valid features remained after filtering by the Pearson correlation coefficient. To solve the data imbalance problem, Conditional Generative Adversarial Network (CGAN) and Conditional Wasserstein Generative Adversarial Network (CWGAN), two influential deep generative methods, were leveraged and compared to generate new samples for the types with fewer samples. Finally, the random forest algorithm was utilized to predict the seven categories. Results In the tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. In addition, an accumulated feature importance analysis reported that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. Conclusions The CWGAN greatly improved the predictive performance in all experiments. Features derived from the CKSAAP, PWM and structure schemes are the most informative and made the greatest contribution to the prediction of PTM.


Sensors ◽  
2020 ◽  
Vol 21 (1) ◽  
pp. 156
Author(s):  
Charles Carlson ◽  
Vanessa-Rose Turpin ◽  
Ahmad Suliman ◽  
Carl Ade ◽  
Steve Warren ◽  
...  

Background: The goal of this work was to create a sharable dataset of heart-driven signals, including ballistocardiograms (BCGs) and time-aligned electrocardiograms (ECGs), photoplethysmograms (PPGs), and blood pressure waveforms. Methods: A custom, bed-based ballistocardiographic system is described in detail. Affiliated cardiopulmonary signals are acquired using a GE Datex CardioCap 5 patient monitor (which collects ECG and PPG data) and a Finapres Medical Systems Finometer PRO (which provides continuous reconstructed brachial artery pressure waveforms and derived cardiovascular parameters). Results: Data were collected from 40 participants, 4 of whom had been or were currently diagnosed with a heart condition at the time they enrolled in the study. An investigation revealed that features extracted from a BCG could be used to track changes in systolic blood pressure (Pearson correlation coefficient of 0.54 +/− 0.15), dP/dtmax (Pearson correlation coefficient of 0.51 +/− 0.18), and stroke volume (Pearson correlation coefficient of 0.54 +/− 0.17). Conclusion: A collection of synchronized, heart-driven signals, including BCGs, ECGs, PPGs, and blood pressure waveforms, was acquired and made publicly available. An initial study indicated that bed-based ballistocardiography can be used to track beat-to-beat changes in systolic blood pressure and stroke volume. Significance: To the best of the authors’ knowledge, no other database that includes time-aligned ECG, PPG, BCG, and continuous blood pressure data is available to the public. This dataset could be used by other researchers for algorithm testing and development in this fast-growing field of health assessment, without requiring these individuals to invest considerable time and resources into hardware development and data collection.


Genes ◽  
2021 ◽  
Vol 12 (6) ◽  
pp. 870
Author(s):  
Jiansheng Zhang ◽  
Hongli Fu ◽  
Yan Xu

In recent years, scientists have found a close correlation between DNA methylation and aging in epigenetics. With in-depth research in the field of DNA methylation, researchers have established quantitative statistical relationships to predict individual ages. This work used human blood tissue samples to study the association between age and DNA methylation. We built two predictors based on healthy and disease data, respectively. For the healthy data, we retrieved a total of 1191 samples from four previous reports. By calculating the Pearson correlation coefficient between age and DNA methylation values, 111 age-related CpG sites were selected. Gradient boosting regression was utilized to build the predictive model, which obtained an R² value of 0.86 and a MAD of 3.90 years on the testing dataset; these were better than the results of four other regression methods as well as Horvath’s results. For the disease data, 354 rheumatoid arthritis samples were retrieved from a previous study. Then, 45 CpG sites were selected to build the predictor, and the corresponding MAD and R² were 3.11 years and 0.89 on the testing dataset, respectively, which showed the robustness of our predictor. Our results were better than those from four other regression methods. Finally, we also analyzed the twenty-four CpG sites common to both the healthy and disease datasets, which illustrated the functional relevance of the selected CpG sites.
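
The site-selection step described above (keeping CpG sites whose methylation values correlate with age) might look like the following sketch. It assumes a simple absolute-correlation threshold, which the abstract does not specify. The function name `select_age_related_sites`, the default threshold of 0.9, and the site identifiers in the usage lines are all hypothetical, and the gradient boosting regression stage is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_age_related_sites(ages, methylation, threshold=0.9):
    """Return the names of sites whose |Pearson r| with age meets the threshold.

    `methylation` maps a site name to its per-sample methylation values,
    listed in the same sample order as `ages`.
    """
    return [
        site
        for site, values in methylation.items()
        if abs(pearson(ages, values)) >= threshold
    ]


# Toy data with hypothetical site names; the values are illustrative only.
ages = [20, 30, 40, 50]
methylation = {
    "cg_a": [0.1, 0.2, 0.3, 0.4],  # rises with age
    "cg_b": [0.5, 0.1, 0.4, 0.2],  # no clear trend
    "cg_c": [0.9, 0.7, 0.5, 0.3],  # falls with age
}
print(select_age_related_sites(ages, methylation))
```

Taking the absolute value keeps sites that lose methylation with age as well as those that gain it. The selected columns would then serve as the input features of the regression model.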

