training set
Recently Published Documents





Shilpa Pandey ◽  
Gaurav Harit

In this article, we address the problem of localizing text and symbolic annotations on the scanned image of a printed document. Previous approaches have treated annotation extraction as binary classification into printed and handwritten text. In this work, we further subcategorize the annotations as underlines, encirclements, inline text, and marginal text. We have collected a new dataset of 300 documents covering all classes of annotations marked around or in between printed text. Using the dataset as a benchmark, we report the results of two saliency formulations, CRF Saliency and Discriminant Saliency, for predicting salient patches, which can correspond to different types of annotations. We also compare our work with recent semantic segmentation techniques using deep models. Our analysis shows that Discriminant Saliency can be considered the preferred approach for fast localization of patches containing different types of annotations. The saliency models were learned on a small dataset but still give performance comparable to that of deep networks for pixel-level semantic segmentation. We show that saliency-based methods give better outcomes with limited annotated data than more sophisticated segmentation techniques that require a large training set to learn the model.

Jovi D’Silva ◽  
Uzzal Sharma

Automatic text summarization has gained immense popularity in research. Previously, several methods have been explored for obtaining effective text summarization outcomes. However, most of the work pertains to the most popular languages spoken in the world. In this paper, we explore extractive automatic text summarization using a deep learning approach and apply it to the Konkani language, which is a low-resource language because there are limited resources, such as data, tools, speakers and/or experts in Konkani. In the proposed technique, Facebook’s fastText pre-trained word embeddings are used to get a vector representation for sentences. Thereafter, a deep multi-layer perceptron is employed, as a supervised binary classification task, for auto-generating summaries using the feature vectors. Using pre-trained fastText word embeddings eliminated the requirement of a large training set and reduced training time. The system-generated summaries were evaluated against the ‘gold-standard’ human-generated summaries with the recall-oriented understudy for gisting evaluation (ROUGE) toolkit. The results showed that the performance of the proposed system closely matched the performance of the human annotators in generating summaries.
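The abstract does not say how word embeddings are combined into a sentence vector. As a minimal sketch, assuming the common choice of averaging the word vectors, the step could look like the following. The function name `sentence_vector`, the toy two-dimensional embeddings, and the decision to skip unknown tokens are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List


def sentence_vector(
    tokens: List[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of the tokens found in `embeddings`.

    Returns a zero vector when no token is known (an assumption made
    here so that every sentence gets a fixed-length feature vector).
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy embeddings standing in for pre-trained vectors (hypothetical values).
toy_embeddings = {"good": [1.0, 0.0], "book": [0.0, 1.0]}
vec = sentence_vector(["good", "book", "unseen"], toy_embeddings, dim=2)
```

The resulting fixed-length vectors would then serve as input features for a binary classifier that labels each sentence as in or out of the summary.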

Kashif Munir ◽  
Hongxiao Bai ◽  
Hai Zhao ◽  
Junhan Zhao

Implicit discourse relation recognition is a challenging task due to the absence of the necessary informative clues from explicit connectives. An implicit discourse relation recognizer has to carefully tackle the semantic similarity of sentence pairs and the severe data sparsity issue. In this article, we learn token embeddings that encode the structure of a sentence from a dependency point of view and use them to initialize a strong baseline model. Then, we propose a novel memory component to tackle the data sparsity issue by allowing the model to master the entire training set, which helps in achieving further performance improvement. The memory mechanism memorizes information by pairing representations and discourse relations of all training instances, thus addressing the data-hungry nature of current implicit discourse relation recognizers. The proposed memory component, if attached to any suitable baseline, can help improve performance. The experiments show that our full model, which memorizes the entire training data, provides excellent results on the PDTB and CDTB datasets, outperforming the baselines by a fair margin.

2022 ◽  
Vol 14 (2) ◽  
pp. 1-15
Lara Mauri ◽  
Ernesto Damiani

Large-scale adoption of Artificial Intelligence and Machine Learning (AI-ML) models fed by heterogeneous, possibly untrustworthy data sources has spurred interest in estimating degradation of such models due to spurious, adversarial, or low-quality data assets. We propose a quantitative estimate of the severity of classifiers’ training set degradation: an index expressing the deformation of the convex hulls of the classes computed on a held-out dataset generated via an unsupervised technique. We show that our index is computationally light, can be calculated incrementally, and complements existing quality measures for ML data assets well. As an experiment, we present the computation of our index on a benchmark convolutional image classifier.

2022 ◽  
Vol 31 (1) ◽  
pp. 1-26
Davide Falessi ◽  
Aalok Ahluwalia ◽  
Massimiliano DI Penta

Defect prediction models can help prioritize testing, analysis, or code review activities, and have been the subject of substantial effort in academia and of some applications in industrial contexts. A necessary precondition when creating a defect prediction model is the availability of defect data from the history of projects. If this data is noisy, the resulting defect prediction model could be unreliable. One of the causes of noise in defect datasets is the presence of “dormant defects,” i.e., defects discovered several releases after their introduction. This can cause a class to be labeled as defect-free while it is not, and is therefore “snoring.” In this article, we investigate the impact of snoring on classifiers' accuracy and the effectiveness of a possible countermeasure, i.e., dropping too recent data from a training set. We analyze the accuracy of 15 machine learning defect prediction classifiers on data from more than 4,000 defects and 600 releases of 19 open source projects from the Apache ecosystem. Our results show that on average across projects (i) the presence of dormant defects decreases the recall of defect prediction classifiers, and (ii) removing from the training set the classes that in the last release are labeled as not defective significantly improves the accuracy of the classifiers. In summary, this article provides insights on how to create defect datasets by mitigating the negative effect of dormant defects on defect prediction.
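The countermeasure in result (ii) can be illustrated with a small filter. This is a sketch under stated assumptions: the record layout (dictionaries with hypothetical `name`, `release`, and `defective` keys) and the function name are invented for illustration and are not taken from the article.

```python
from typing import Dict, List


def drop_recent_non_defective(rows: List[Dict]) -> List[Dict]:
    """Remove rows from the last release that are labeled not defective.

    Such rows may be "snoring": a dormant defect may simply not have
    been discovered yet, so their negative label is the least reliable.
    """
    if not rows:
        return []
    last_release = max(r["release"] for r in rows)
    return [
        r for r in rows
        if not (r["release"] == last_release and not r["defective"])
    ]


# Hypothetical training rows: one record per class per release.
rows = [
    {"name": "A", "release": 1, "defective": False},
    {"name": "B", "release": 1, "defective": True},
    {"name": "A", "release": 2, "defective": False},  # possibly snoring
    {"name": "B", "release": 2, "defective": True},
]
training_set = drop_recent_non_defective(rows)
```

Older non-defective rows are kept because more releases have passed in which a dormant defect could have surfaced.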

2022 ◽  
Vol 11 ◽  
Huangqi Zhang ◽  
Binhao Zhang ◽  
Wenting Pan ◽  
Xue Dong ◽  
Xin Li ◽  

Purpose: This study aimed to develop a repeatable MRI-based machine learning model to differentiate between low-grade gliomas (LGGs) and glioblastoma (GBM) and provide more clinical information to improve treatment decision-making. Methods: Preoperative MRIs of gliomas from The Cancer Imaging Archive (TCIA)–GBM/LGG database were selected. The tumor on contrast-enhanced MRI was segmented. Quantitative image features were extracted from the segmentations. A random forest classification algorithm was used to establish a model in the training set. In the test phase, a random forest model was tested using an external test set. Three radiologists reviewed the images for the external test set. The area under the receiver operating characteristic curve (AUC) was calculated. The AUCs of the radiomics model and radiologists were compared. Results: The random forest model was fitted using a training set consisting of 142 patients [mean age, 52 years ± 16 (standard deviation); 78 men] comprising 88 cases of GBM. The external test set included 25 patients (14 with GBM). Random forest analysis yielded an AUC of 1.00 [95% confidence interval (CI): 0.86–1.00]. The AUCs for the three readers were 0.92 (95% CI 0.74–0.99), 0.70 (95% CI 0.49–0.87), and 0.59 (95% CI 0.38–0.78). Statistical differences were only found between AUC and Reader 1 (1.00 vs. 0.92, respectively; p = 0.16). Conclusion: An MRI radiomics-based random forest model was proven useful in differentiating GBM from LGG and showed better diagnostic performance than that of two inexperienced radiologists.
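For readers unfamiliar with the AUC used to compare the model and the radiologists, the following is a minimal sketch of one standard way to compute it: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The labels and scores are made-up illustrative values, not data from the study, and the study's own software is not specified here.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. GBM), 0 for the negative class.
    scores: higher means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
imperfect = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

An AUC of 1.00, as reported for the random forest on the 25-patient external test set, means every positive case was scored above every negative case.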

2022 ◽  
Vol 23 (1) ◽  
Dandan Guo ◽  
Huifang Wang ◽  
Jun Liu ◽  
Hang Liu ◽  
Ming Zhang ◽  

Background: We aimed to develop and validate a nomogram model for predicting CKD after orthotopic liver transplantation (OLT). Methods: The retrospective data of 399 patients who underwent transplantation and were followed in our centre were collected. They were randomly assigned to the training set (n = 293) and validation set (n = 106). Multivariable Cox regression analysis was performed in the training set to identify predictors of CKD. According to the Cox regression analysis results, a nomogram model was developed and validated. The renal function of recipients was monitored, and the long-term survival prognosis was assessed. Results: The incidence of CKD at 5 years after OLT was 25.6%. Cox regression analysis identified several predictors of post-OLT CKD, including recipient age at surgery (HR 1.036, 95% CI 1.006-1.068; p = 0.018), female sex (HR 2.867, 95% CI 1.709-4.810; p < 0.001), preoperative hypertension (HR 1.670, 95% CI 0.962-2.898; p = 0.068), preoperative eGFR (HR 0.996, 95% CI 0.991-1.001; p = 0.143), uric acid at 3 months (HR 1.002, 95% CI 1.001-1.004; p = 0.028), haemoglobin at 3 months (HR 0.970, 95% CI 0.956-0.983; p < 0.001), and average concentration of cyclosporine A at 3 months (HR 1.002, 95% CI 1.001-1.003; p < 0.001). According to these parameters, a nomogram model for predicting CKD after OLT was constructed and validated. The C-indices were 0.75 and 0.80 in the training and validation sets. The calibration curve of the nomogram showed that the CKD probabilities predicted by the nomogram agreed with the observed probabilities at 1, 3, and 5 years after OLT (p > 0.05). Renal function declined slowly year by year, and there were significant differences between patients divided by these predictors. Kaplan-Meier survival analysis showed that the survival prognosis of recipients decreased significantly with the progression of renal function. Conclusions: With excellent predictive abilities, the nomogram may be a simple and reliable tool to identify patients at high risk for CKD and poor long-term prognosis after OLT.
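The C-index reported for the training and validation sets measures how well predicted risk orders patients by time to event. A minimal sketch of Harrell's concordance index follows, assuming right-censored data and a risk score where higher means earlier expected event. The times, event flags, and risk scores are invented for illustration, and the study's own software and tie handling are not known from the abstract.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when patient i had the event (events[i] == 1)
    strictly before patient j's last observed time. The pair is concordant
    when the patient with the earlier event has the higher predicted risk.
    Ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times (years), event flags, and predicted risks.
c = concordance_index(
    times=[2, 4, 6, 8], events=[1, 0, 1, 1], risks=[0.9, 0.4, 0.1, 0.7]
)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported 0.75 and 0.80 sit between the two.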

2022 ◽  
Vol 15 (1) ◽  
Yao Peng ◽  
Hui Wang ◽  
Qi Huang ◽  
Jingjing Wu ◽  
Mingjun Zhang

Background: Long noncoding RNAs (lncRNAs) are important regulators of gene expression and can affect a variety of physiological processes. Recent studies have shown that immune-related lncRNAs play an important role in the tumour immune microenvironment and may have potential application value in the treatment and prognosis prediction of tumour patients. Epithelial ovarian cancer (EOC) is characterized by a high incidence and poor prognosis. However, there are few studies on immune-related lncRNAs in EOC. In this study, we focused on immune-related lncRNAs associated with survival in EOC. Methods: We downloaded mRNA data for EOC patients from The Cancer Genome Atlas (TCGA) database and mRNA data for normal ovarian tissue from the Genotype-Tissue Expression (GTEx) database and identified differentially expressed genes through differential expression analysis. Immune-related lncRNAs were obtained through intersection and coexpression analysis of differential genes and immune-related genes from the Immunology Database and Analysis Portal (ImmPort). Samples in the TCGA EOC cohort were randomly divided into a training set, validation set and combination set. In the training set, Cox regression analysis and LASSO regression were performed to construct an immune-related lncRNA signature. Kaplan–Meier survival analysis, time-dependent ROC curve analysis, Cox regression analysis and principal component analysis were performed for verification in the training set, validation set and combination set. Further studies of pathways and immune cell infiltration were conducted through Gene Set Enrichment Analysis (GSEA) and the Timer data portal. Results: An immune-related lncRNA signature was identified in EOC, which was composed of six immune-related lncRNAs (KRT7-AS, USP30-AS1, AC011445.1, AP005205.2, DNM3OS and AC027348.1). The signature was used to divide patients into high-risk and low-risk groups. The overall survival of the high-risk group was lower than that of the low-risk group and was verified to be robust in both the validation set and the combination set. The signature was confirmed to be an independent prognostic biomarker. Principal component analysis showed the different distribution patterns of high-risk and low-risk groups. This signature may be related to immune cell infiltration (mainly macrophages) and differential expression of immune checkpoint-related molecules (PD-1, PDL1, etc.). Conclusions: We identified and established a prognostic signature of immune-related lncRNAs in EOC, which will be of great value in predicting the prognosis of clinical patients and may provide a new perspective for immunological research and individualized treatment in EOC.

2022 ◽  
Vol 53 (2) ◽  
pp. 225-232

Feedforward neural networks are used for daily precipitation forecasts at several test stations all over India. Six years of European Centre for Medium-Range Weather Forecasts (ECMWF) data are used, with the training set consisting of the four years of data from 1985-1988 and the validation set consisting of the data from 1989-1990. Neural networks are used to develop a concurrent relationship between precipitation and other atmospheric variables. No attempt is made to select optimal variables for this study, and the inputs are chosen to be the same as those used earlier at the National Centre for Medium Range Weather Forecasting (NCMRWF) in developing a linear regression model. Neural networks are found to yield results that are at least as good as linear regression and in several cases yield a 10-20% improvement. This is encouraging since the variable selection has so far been optimized for linear regression.
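The split described above is chronological rather than random: earlier years train the model and later years validate it. A minimal sketch of such a split follows. The record layout (dictionaries with hypothetical `year` and `precip_mm` keys) and the placeholder values are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_year(
    records: List[Dict], train_years: Iterable[int], val_years: Iterable[int]
) -> Tuple[List[Dict], List[Dict]]:
    """Split records chronologically so validation data come after training data."""
    train_set = set(train_years)
    val_set = set(val_years)
    train = [r for r in records if r["year"] in train_set]
    val = [r for r in records if r["year"] in val_set]
    return train, val


# One placeholder record per year; real data would hold daily values.
records = [{"year": y, "precip_mm": 0.0} for y in range(1985, 1991)]
train, val = split_by_year(records, range(1985, 1989), range(1989, 1991))
```

Keeping the validation years strictly after the training years avoids evaluating a forecast model on data that precede what it was trained on.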

2022 ◽  
Vol 9 ◽  
JinKui Wang ◽  
XiaoZhu Liu ◽  
Jie Tang ◽  
Qingquan Zhang ◽  
Yuanyang Zhao

Background: Hypopharyngeal squamous cell carcinoma (HPSCC) is one of the causes of death in elderly patients, and an accurate prediction of survival can effectively improve the prognosis of patients. However, there is no accurate assessment of the survival prognosis of elderly patients with HPSCC. The purpose of this study is to establish a nomogram to predict the cancer-specific survival (CSS) of elderly patients with HPSCC. Methods: The clinicopathological data of all patients from 2004 to 2018 were downloaded from the SEER database. These patients were randomly divided into a training set (70%) and a validation set (30%). Univariate and multivariate Cox regression analyses identified independent risk factors for the prognosis of elderly patients with HPSCC. A new nomogram was constructed to predict 1-, 3-, and 5-year CSS in elderly patients with HPSCC. We then used the consistency index (C-index), the calibration curve, and the area under the receiver operating characteristic curve (AUC) to evaluate the accuracy and discrimination of the prediction model. Decision curve analysis (DCA) was used to assess the clinical value of the model. Results: A total of 3,172 patients were included in the study, and they were randomly divided into a training set (N = 2,219) and a validation set (N = 953). Univariate and multivariate analysis suggested that age, T stage, N stage, M stage, tumor size, surgery, radiotherapy, chemotherapy, and marriage were independent risk factors for patient prognosis. These nine variables are included in the nomogram to predict the CSS of patients. The C-index for the training and validation sets was 0.713 (95% CI, 0.697–0.729) and 0.703 (95% CI, 0.678–0.729), respectively. The AUC results of the training and validation sets indicate that this nomogram has good accuracy. The calibration curve indicates that the observed and predicted values are highly consistent. DCA indicated that the nomogram has a better clinical application value than the traditional TNM staging system. Conclusion: This study identified risk factors for survival in elderly patients with HPSCC. We found that age, T stage, N stage, M stage, tumor size, surgery, radiotherapy, chemotherapy, and marriage are independent prognostic factors. A new nomogram for predicting the CSS of elderly HPSCC patients was established. This model has good clinical application value and can help patients and doctors make clinical decisions.
