DeepSSPred: A Deep Learning Based Sulfenylation site predictor via a novel n-segmented optimized federated feature encoder

2020 ◽  
Vol 27 ◽  
Author(s):  
Zaheer Ullah Khan ◽  
Dechang Pi

Background: S-sulfenylation (S-sulphenylation; the formation of sulfenic acid on cysteine) is a special kind of post-translational modification that plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. To complement existing wet-lab methods, several computational models have been developed for predicting sulfenylation cysteine (SC) sites. However, the performance of these models has been unsatisfactory owing to inefficient feature schemes, severe class imbalance, and the lack of an intelligent learning engine. Objective: The aim of this study is to establish a strong, novel computational predictor that discriminates sulfenylation from non-sulfenylation sites. Methods: We report an innovative bioinformatics feature encoding tool, named DeepSSPred, in which the encoded features are obtained via an n-segmented hybrid feature scheme, and the synthetic minority oversampling technique (SMOTE) is employed to cope with the severe imbalance between SC sites (minority class) and non-SC sites (majority class). A state-of-the-art 2D convolutional neural network (2D-CNN) was trained and validated with a rigorous 10-fold jackknife cross-validation technique. Results: With a strong discrete representation of the feature space, a capable learning engine, and an unbiased presentation of the underlying training data, the proposed framework yields an excellent model that outperforms all existing studies. Its MCC is 6% higher than that of the first-best method, which did not report sufficient details on an independent dataset. Compared with the second-best method, the model gains 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp, and 13.12% in MCC on the training data, and 12.13% in accuracy, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset. These empirical analyses show the superior performance of the proposed model on both the training and independent datasets relative to existing studies. Conclusion: In this research, we have developed a novel sequence-based automated predictor for SC sites, called DeepSSPred. Empirical results on a training dataset and an independent validation dataset reveal the efficacy of the proposed model. The good performance of DeepSSPred stems from several factors, including the novel discriminative feature-encoding scheme, the SMOTE technique, and the careful construction of the prediction model through a tuned 2D-CNN classifier. We believe this work provides insight into further prediction of S-sulfenylation characteristics and functionality, and we hope the predictor will be significantly helpful for large-scale discrimination of unknown SC sites in particular and for the design of new pharmaceutical drugs in general.
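As a concrete illustration of the pipeline the Methods describe, the following is a minimal sketch of SMOTE resampling followed by a small 2D-CNN classifier. The feature dimensionality, window geometry, and network depth are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a DeepSSPred-style pipeline: SMOTE resampling of an
# imbalanced feature matrix followed by a small 2D CNN. Dimensions and
# hyperparameters are illustrative assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE
from tensorflow.keras import layers, models

X = np.random.rand(1000, 64)             # hypothetical encoded peptide features
y = np.r_[np.ones(100), np.zeros(900)]   # SC sites are the minority class

# Balance SC sites against non-SC sites before training.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Reshape each flat feature vector into an 8x8 "image" for the 2D CNN.
X_res = X_res.reshape(-1, 8, 8, 1)

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(8, 8, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # SC site vs. non-SC site
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_res, y_res, epochs=10, batch_size=32, validation_split=0.1)
```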

Author(s):  
Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, i.e., parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work, we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) a sentence classification task to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly pre-train a neural network. The pre-trained neural network is then fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method can capture task-specific knowledge for disfluency detection and achieves better performance when fine-tuned on a small annotated dataset compared to other supervised methods. However, because the pseudo training data are generated with simple heuristics and cannot fully cover all the disfluency patterns, there is still a performance gap compared to supervised models trained on the full training dataset. We further explore how to bridge this performance gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label and can address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model is able to match state-of-the-art performance with just about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
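The pseudo-data construction described above can be sketched as follows; the insertion and deletion probabilities and the tag set are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of pseudo training data for the self-supervised tagging task:
# corrupt a clean sentence by randomly inserting or deleting words, and emit
# per-token tags marking the injected noise.
import random

def make_pseudo_example(tokens, p_insert=0.15, p_delete=0.1):
    noisy, tags = [], []
    for tok in tokens:
        if random.random() < p_delete:
            continue                             # simulate a dropped word
        if random.random() < p_insert:
            noisy.append(random.choice(tokens))  # inject a noisy word
            tags.append("ADD")                   # target for the tagging task
        noisy.append(tok)
        tags.append("KEEP")
    return noisy, tags

sent = "i want to book a flight to boston".split()
noisy, tags = make_pseudo_example(sent)
print(list(zip(noisy, tags)))  # training pair for the self-supervised tagger
```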


2021 ◽  
Vol 2021 ◽  
pp. 1-11 ◽  
Author(s):  
Yue Wu

In the field of computer science, data mining is a hot topic. It is a mathematical method for identifying patterns in enormous amounts of data. Image mining is an important data mining technique involving a variety of fields. Within image mining, the organization of art images is an interesting research area worthy of attention. The classification of art images into several predetermined sets is referred to as art image categorization. It involves image preprocessing, feature extraction, object identification, object segmentation, object classification, and a variety of other techniques. This paper proposes an improved boosting algorithm that combines traditional, simple, weak classifiers into an accurate, strong classifier for distinguishing between realistic images and other image types. The paper investigates the characteristics of cartoon images, realistic images, painting images, and photo images, constructs color variance histogram features, and uses them for classification. For the classification experiments, an image database of 10471 images is randomly divided into two portions used as training data and test data, respectively. The training dataset contains 6971 images, while the test dataset contains 3478 images. The experimental results show that the proposed algorithm achieves a classification accuracy of approximately 97%. The method proposed in this paper can serve as the basis of automatic large-scale image classification and has strong practicability.
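A hedged sketch of the boosting setup follows, assuming scikit-learn's AdaBoost over decision stumps and a simple color-variance-histogram feature; the paper's exact weak learners and feature definition may differ.

```python
# Illustrative sketch: weak decision stumps combined by AdaBoost over
# color-histogram-plus-variance features. Labels (0=cartoon, 1=realistic)
# and the synthetic data are assumptions for illustration.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def color_variance_histogram(image, bins=16):
    """Per-channel histogram of an HxWx3 image plus per-channel variances."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    variances = image.reshape(-1, 3).var(axis=0)
    return np.concatenate(feats + [variances])

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(200, 32, 32, 3))   # stand-in images
X = np.stack([color_variance_histogram(im) for im in images])
y = rng.integers(0, 2, size=200)                       # hypothetical labels

# Many simple, weak stumps boosted into one strong classifier.
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200)
clf.fit(X, y)
```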


2021 ◽  
Author(s):  
Ying-Shi Sun ◽  
Yu-Hong Qu ◽  
Dong Wang ◽  
Yi Li ◽  
Lin Ye ◽  
...  

Background: Computer-aided diagnosis using deep learning algorithms has been applied to mammography, but there has been no large-scale clinical application. Methods: This study aimed to develop and verify an artificial intelligence model based on mammography. First, mammograms retrospectively collected from six centers were randomized into a training dataset and a validation dataset to establish the model. Second, the model was tested by comparing the performance of 12 radiologists with and without it. Finally, multicenter mammograms were prospectively diagnosed by radiologists using the model. Detection and diagnostic capabilities were evaluated using the free-response receiver operating characteristic (FROC) curve and the ROC curve. Results: The sensitivity of the model for detecting lesions after matching was 0.908 at a false positive rate of 0.25 in unilateral images. The area under the ROC curve (AUC) for distinguishing benign from malignant lesions was 0.855 (95% CI: 0.830, 0.880). The performance of the 12 radiologists with the model was higher than that of the radiologists alone (AUC: 0.852 vs. 0.808, P = 0.005). The mean reading time with the model was shorter than that of reading alone (62.28 s vs. 80.18 s, P = 0.03). In the prospective application, detection sensitivity reached 0.887 at a false positive rate of 0.25; the AUC of radiologists with the model was 0.983 (95% CI: 0.978, 0.988), with sensitivity, specificity, PPV, and NPV of 94.36%, 98.07%, 87.76%, and 99.09%, respectively. Conclusions: The artificial intelligence model exhibits high accuracy in detecting and diagnosing breast lesions, improves diagnostic accuracy, and saves time. Trial registration: NCT, NCT03708978. Registered 17 April 2018, https://register.clinicaltrials.gov/prs/app/NCT03708978
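A minimal sketch of the ROC-based evaluation described above, using scikit-learn on synthetic stand-in scores; it reads off sensitivity and specificity at the operating point nearest a 0.25 false positive rate, as reported in the Results.

```python
# Sketch of ROC evaluation; labels and scores are synthetic stand-ins for
# the model's per-lesion malignancy probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)                                 # 1 = malignant
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, 500), 0, 1)

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
# Sensitivity/specificity at the operating point closest to FPR = 0.25.
i = np.argmin(np.abs(fpr - 0.25))
print(f"AUC={auc:.3f}, sensitivity={tpr[i]:.3f}, specificity={1 - fpr[i]:.3f}")
```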


2019 ◽  
Vol 11 (7) ◽  
pp. 755 ◽  
Author(s):  
Xiaodong Zhang ◽  
Kun Zhu ◽  
Guanzhou Chen ◽  
Xiaoliang Tan ◽  
Lifei Zhang ◽  
...  

Object detection on very-high-resolution (VHR) remote sensing imagery has attracted a lot of attention in the field of automatic image interpretation. Region-based convolutional neural networks (CNNs) have been widely adopted in this domain; they first generate candidate regions and then accurately classify and locate the objects within them. However, very large images, complex image backgrounds, and the uneven size and quantity distribution of training samples make detection more challenging, especially for small and dense objects. To solve these problems, this paper proposes an effective region-based VHR remote sensing imagery object detection framework named Double Multi-scale Feature Pyramid Network (DM-FPN), which utilizes inherent multi-scale pyramidal features and combines strong-semantic, low-resolution features with weak-semantic, high-resolution features. DM-FPN consists of a multi-scale region proposal network and a multi-scale object detection network; these two modules share convolutional layers and can be trained end-to-end. We propose several multi-scale training strategies to increase the diversity of the training data and overcome the size restrictions of the input images. We also propose multi-scale inference and adaptive categorical non-maximum suppression (ACNMS) strategies to improve detection performance, especially for small and dense objects. Extensive experiments and comprehensive evaluations on the large-scale DOTA dataset demonstrate the effectiveness of the proposed framework, which achieves a mean average precision (mAP) of 0.7927 on the validation dataset and a best mAP of 0.793 on the testing dataset.
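A sketch of per-category non-maximum suppression in the spirit of the ACNMS step follows; the class-specific IoU thresholds are assumed for illustration, and the authors' adaptive rule may differ.

```python
# Per-category NMS: suppress overlapping detections within each class using
# a class-specific IoU threshold (a stand-in for the adaptive ACNMS rule).
import numpy as np

def nms(boxes, scores, iou_thr):
    """Greedy NMS over boxes given as (x1, y1, x2, y2)."""
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (area(boxes[i:i + 1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thr]
    return keep

def categorical_nms(boxes, scores, labels, thr_by_class):
    """Run NMS separately per category with a class-specific IoU threshold."""
    kept = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        kept += [idx[k] for k in nms(boxes[idx], scores[idx], thr_by_class[c])]
    return kept
```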


Genes ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 424 ◽  
Author(s):  
Xingyan Li ◽  
Weidong Li ◽  
Yan Xu

All tissues in an organism age as time goes on. In recent years, epigenetic investigations have found a close correlation between DNA methylation and aging. With the development of DNA methylation research, a quantitative statistical relationship between DNA methylation and age has been established based on how methylation changes with age, making it possible to predict the age of individuals. All the data in this work were retrieved from the Illumina HumanMethylation BeadChip platform (27K or 450K). We analyzed 16 sets of healthy samples and 9 sets of diseased samples. The healthy samples comprised a total of 1899 publicly available blood samples (0–103 years old) and the diseased samples comprised 2395 blood samples. Six age-related CpG sites were selected by calculating Pearson correlation coefficients between age and DNA methylation values. We built a gradient boosting regressor model on these age-related CpG sites. In each dataset, 70% of the data was randomly selected as training data and the other 30% as independent data, for 25 runs in total. In the training dataset, the healthy samples showed a correlation of 0.97 between predicted and actual age, with a mean absolute deviation (MAD) of 2.72 years. In the independent dataset, the MAD was 4.06 years. The proposed model was further tested using the diseased samples, giving a MAD of 5.44 years for the training dataset and 7.08 years for the independent dataset. Furthermore, our model worked well when applied to saliva samples. These results illustrate that age prediction based on six DNA methylation markers is very effective using the gradient boosting regressor.
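A minimal sketch of the modeling step, assuming scikit-learn's GradientBoostingRegressor on six CpG beta values; the data below are synthetic placeholders, not the Illumina measurements, and the CpG features are hypothetical.

```python
# Age prediction from six age-related CpG sites with a gradient boosting
# regressor; synthetic betas drift with age to mimic the real signal.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
age = rng.uniform(0, 103, 1899)  # matches the healthy-cohort age range
# Six CpG features that drift with age plus noise (stand-in for real betas).
X = np.column_stack([age / 103 + rng.normal(0, 0.05, age.size) for _ in range(6)])

# 70/30 split as in the study; the paper repeats this for 25 runs.
X_tr, X_te, y_tr, y_te = train_test_split(X, age, test_size=0.3, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print("MAD (years):", mean_absolute_error(y_te, model.predict(X_te)))
```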


2019 ◽  
Vol 11 (19) ◽  
pp. 2190
Author(s):  
Kushiyama ◽  
Matsuoka

After a large-scale disaster, many damaged buildings are demolished and treated as disaster waste. Although the weight of disaster waste was estimated two months after the 2016 earthquake in Kumamoto, Japan, the estimate differed significantly from the result when disaster waste disposal was completed in March 2018. The amount of disaster waste generated can be estimated by multiplying the total number of severely and partially damaged buildings by a coefficient of generated weight per building. We suppose that the amount of disaster waste is affected by the conditions of the demolished buildings, namely their areas and structural typologies, but this has not yet been clarified. Therefore, in this study, we aimed to use geographic information system (GIS) map data to create a time series GIS map dataset with labels of demolished and remaining buildings in Mashiki town for the two-year period prior to the completion of disaster waste disposal. We used OpenStreetMap (OSM) data as the base data and time series SPOT images observed in the two years following the Kumamoto earthquake to label all demolished and remaining buildings in the GIS map dataset. To effectively label the approximately 16,000 buildings in Mashiki town, we calculated an indicator of whether a building was likely demolished or remaining from the change of brightness in the SPOT images. We classified 5701 of 16,106 buildings as demolished, as of March 2018, by visual interpretation of the SPOT and Pleiades images with reference to this indicator, and verified that this number was almost the same as the number reported by the Mashiki municipality. We then assessed the accuracy of the proposed method: the F-measure was higher than 0.9 on the training dataset, which was verified by a field survey and visual interpretation and included the labels of 55 demolished and 55 remaining buildings. We also assessed the accuracy of the proposed method on all the labels in the OSM dataset, where the F-measure was 0.579. On test data with balanced labels of another 100 demolished and 100 remaining buildings, held out from the training data, the F-measure was 0.790 calculated from the SPOT image of 25 March 2018. Our proposed method thus performed better for balanced classification than for imbalanced classification. We also examined examples of the image characteristics behind correct and incorrect estimates obtained by thresholding the indicator.
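A hedged sketch of the indicator-and-threshold idea follows: compare per-building brightness between two co-registered images and flag large changes as likely demolitions. The footprint masks, the threshold, and the synthetic data are illustrative assumptions.

```python
# Brightness-change indicator per building footprint, thresholded into
# demolished/remaining labels, with an F-measure check against truth labels.
import numpy as np

def demolition_indicator(before, after, footprints):
    """Mean brightness change inside each building footprint (boolean masks)."""
    return np.array([after[m].mean() - before[m].mean() for m in footprints])

def f_measure(pred, truth):
    tp = np.sum(pred & truth)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(truth.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# Synthetic stand-ins for the SPOT-derived brightness images and labels.
rng = np.random.default_rng(2)
before, after = rng.random((100, 100)), rng.random((100, 100))
footprints = [rng.random((100, 100)) > 0.99 for _ in range(50)]
scores = demolition_indicator(before, after, footprints)
pred = scores > 0.05                      # assumed threshold on the indicator
truth = rng.integers(0, 2, 50).astype(bool)
print("F-measure:", f_measure(pred, truth))
```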


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Poulamee Chakraborty ◽  
Bhabani S. Das ◽  
Hitesh B. Vasava ◽  
Niranjan Panigrahi ◽  
Priyabrata Santra

The pedotransfer function (PTF) approach is a convenient way of estimating difficult-to-measure soil properties from basic soil data. Typically, PTFs are developed using a large number of samples collected from small (regional) areas for training and testing a predictive model. National soil legacy databases offer an opportunity to provide soil data for developing PTFs, although legacy data are sparsely distributed over large areas. Here, we examined the Indian soil legacy (ISL) database to select a comprehensive training dataset for estimating cation exchange capacity (CEC) as a test case for the PTF approach. Geostatistical and correlation analyses showed that legacy data entail the diverse spatial and correlation structure needed to build robust PTFs. Through non-linear correlation measures and intelligent predictive algorithms, we developed a methodology to extract an efficient training dataset from the ISL data for estimating CEC with high prediction accuracy. The selected training data had comparable spatial variation and nonlinearity in parameters across the training and test datasets. We thus identified specific indicators for constructing robust PTFs from legacy data. Our results open a new avenue for using the large volume of existing soil legacy data to develop region-specific PTFs without the need to collect new soil data.
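As an illustration only, a PTF for CEC might look like the following random-forest sketch; the predictors and synthetic data are assumptions, since the paper's contribution is the selection of the training subset from legacy data rather than a particular learner.

```python
# Illustrative PTF: predict CEC from basic soil properties with a random
# forest. Predictor names and data are hypothetical stand-ins.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
soil = pd.DataFrame({
    "clay": rng.uniform(5, 60, 500),              # %
    "organic_carbon": rng.uniform(0.1, 3, 500),   # %
    "pH": rng.uniform(4.5, 9, 500),
})
# Synthetic CEC loosely driven by clay and organic carbon (cmol(+)/kg).
cec = 0.5 * soil["clay"] + 4 * soil["organic_carbon"] + rng.normal(0, 2, 500)

X_tr, X_te, y_tr, y_te = train_test_split(soil, cec, test_size=0.3, random_state=0)
ptf = RandomForestRegressor(n_estimators=200).fit(X_tr, y_tr)
print("R^2 on held-out data:", ptf.score(X_te, y_te))
```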


2020 ◽  
Vol 9 (7) ◽  
pp. 443 ◽  
Author(s):  
Xinxiang Lei ◽  
Wei Chen ◽  
Binh Thai Pham

The main purpose of this study was to apply the novel bivariate weights-of-evidence-based SysFor (SF) method to landslide susceptibility mapping, with two machine learning techniques, naïve Bayes (NB) and radial basis function networks (RBFNetwork), as benchmark models. First, using aerial photos and geological field surveys, 263 landslide locations in the study area were identified. These landslides were then randomly split in a 70/30 ratio to construct the training and validation datasets, respectively. Second, based on the landslide inventory map, combined with the geological and geomorphological characteristics of the study area, 14 landslide-affecting factors were determined; the predictive ability of the selected factors was evaluated using the LSVM model. Using the WoE model, the relationship between landslides and the affecting factors was analyzed through positive and negative correlation weights. The three hybrid models were then used to map landslide susceptibility. Third, the ROC curve and various statistics (SE, 95% CI, and MAE) were used to verify and compare the predictive power of the models. Compared with the other two models, the SysFor model had a larger area under the curve (AUC), with 0.876 (training dataset) and 0.783 (validation dataset). Finally, by quantitatively comparing the susceptibility value of each pixel, the differences in the spatial morphology of the landslide susceptibility maps were compared, and the limitations and effectiveness of the models were identified. The landslide susceptibility maps obtained by the three models are reasonable, and the map generated by the SysFor model has the highest overall performance. The results obtained in this paper can help local governments with land use planning, disaster reduction, and environmental protection.
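The WoE step can be sketched as follows: for each factor class, the positive weight W+ and negative weight W- contrast landslide occurrence inside and outside that class. The pixel counts below are illustrative, not from the study area.

```python
# Weights-of-evidence for one factor class from pixel counts: W+ compares
# P(in class | landslide) with P(in class | no landslide); W- does the same
# for the complement of the class.
import numpy as np

def woe(landslides_in, cells_in, landslides_total, cells_total):
    """Return (W+, W-) for a factor class."""
    p_in = landslides_in / landslides_total
    q_in = (cells_in - landslides_in) / (cells_total - landslides_total)
    p_out = (landslides_total - landslides_in) / landslides_total
    q_out = ((cells_total - cells_in) - (landslides_total - landslides_in)) \
            / (cells_total - landslides_total)
    return np.log(p_in / q_in), np.log(p_out / q_out)

# e.g. a slope class covering 12,000 cells containing 90 of 263 landslides
w_plus, w_minus = woe(90, 12_000, 263, 200_000)
print(f"W+ = {w_plus:.3f}, W- = {w_minus:.3f}")
```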


2021 ◽  
Vol 9 ◽  
Author(s):  
Ahnjili ZhuParris ◽  
Matthijs D. Kruizinga ◽  
Max van Gent ◽  
Eva Dessing ◽  
Vasileios Exadaktylos ◽  
...  

Introduction: The duration and frequency of an infant's crying can be indicative of its health. Manual tracking and labeling of crying is laborious, subjective, and sometimes inaccurate. The aim of this study was to develop and technically validate a smartphone-based algorithm able to automatically detect crying. Methods: For the development of the algorithm, a training dataset containing 897 5-s clips of crying infants and 1,263 clips of non-crying infants and common domestic sounds was assembled from various online sources. OpenSMILE software was used to extract 1,591 audio features per audio clip. A random forest classifying algorithm was fitted to identify crying from non-crying in each audio clip. For the validation of the algorithm, an independent dataset consisting of real-life recordings of 15 infants was used. A 29-min audio clip was analyzed repeatedly and under differing circumstances to determine the intra- and inter-device repeatability and robustness of the algorithm. Results: The algorithm obtained an accuracy of 94% on the training dataset and 99% on the validation dataset. The sensitivity on the validation dataset was 83%, with a specificity of 99% and positive and negative predictive values of 75% and 100%, respectively. Reliability of the algorithm appeared robust within and across devices, and the performance was robust to the distance from the sound source and to barriers between the sound source and the microphone. Conclusion: The algorithm was accurate in detecting cry duration and was robust to various changes in ambient settings.
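A minimal sketch of the classification stage, assuming scikit-learn's random forest over per-clip feature vectors; synthetic vectors stand in for the 1,591 openSMILE features.

```python
# Random forest cry/non-cry classifier over per-clip acoustic features;
# the feature matrix here is synthetic, mirroring only the dataset shape.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(4)
X = rng.normal(size=(2160, 1591))       # 897 cry + 1,263 non-cry clips
y = np.r_[np.ones(897), np.zeros(1263)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te),
                            target_names=["no cry", "cry"]))
```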


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Bin Li ◽  
Jianmin Yao

The performance of a machine translation system (MTS) depends on the quality and size of its training data. Effective methods for extending the training dataset of an MTS in a specific domain therefore need to be explored. This paper proposes a method for selecting in-domain bilingual sentence pairs based on topic information. Using the topic relevance of bilingual sentence pairs to the target domain, subsets of sentence pairs related to the texts to be translated are selected from a large-scale bilingual corpus and used to train the translation system for that domain, improving translation quality for in-domain texts. In the experiments, bilingual sentence pairs selected with the proposed method were used to train the MTS, greatly enhancing its translation performance.
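A hedged sketch of topic-based selection, assuming an LDA topic model: score each general-corpus sentence by the cosine similarity of its topic distribution to the in-domain centroid and keep the top-scoring pairs. The toy corpora and LDA settings are illustrative assumptions.

```python
# Topic-relevance data selection: rank general-domain sentences by topic
# similarity to an in-domain centroid and keep the top-k for training.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

general_src = ["the cat sat on the mat", "stocks fell sharply today",
               "the patient received a new treatment", "rain is expected tomorrow"]
in_domain = ["the patient was given medication", "the doctor examined the patient"]

vec = CountVectorizer()
X_all = vec.fit_transform(general_src + in_domain)
lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(X_all)

topics = lda.transform(vec.transform(general_src))
domain_centroid = lda.transform(vec.transform(in_domain)).mean(axis=0, keepdims=True)

scores = cosine_similarity(topics, domain_centroid).ravel()
keep = np.argsort(scores)[::-1][:2]     # top-k pairs for domain-adapted training
print([general_src[i] for i in keep])
```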

