training dataset
Recently Published Documents

Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language differs fundamentally from written language in that it contains frequent disfluencies, i.e., parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly pre-train a neural network, which is subsequently fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method can capture task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated with simple heuristics and cannot fully cover all disfluency patterns, a performance gap remains compared to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label, and can address the weakness of self-supervised learning with a small annotated dataset.
We show that by combining self-supervised learning with active learning, our model matches state-of-the-art performance with only about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
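The pseudo-data construction described above can be sketched as a simple corruption routine. The function below is a hedged illustration with hypothetical insertion/deletion probabilities; it tags each inserted word as noise (label 1) so a tagger can be pre-trained to detect added words:

```python
import random

def make_pseudo_disfluent(tokens, vocab, p_insert=0.15, p_delete=0.1, rng=None):
    """Corrupt a fluent sentence by randomly inserting vocabulary words
    (tagged 1 = added noise) and deleting original words (tagged 0 = kept),
    mimicking the heuristic pseudo-data construction. Probabilities are
    illustrative assumptions. Returns (tokens, tags)."""
    rng = rng or random.Random()
    out_tokens, tags = [], []
    for tok in tokens:
        if rng.random() < p_delete:      # drop an original word
            continue
        if rng.random() < p_insert:      # insert a random noisy word before it
            out_tokens.append(rng.choice(vocab))
            tags.append(1)
        out_tokens.append(tok)
        tags.append(0)
    return out_tokens, tags

sent = "i want a flight to boston".split()
noisy, labels = make_pseudo_disfluent(sent, vocab=["uh", "the", "well"],
                                      rng=random.Random(0))
print(list(zip(noisy, labels)))
```

The tagging pre-training task then learns to predict the 0/1 labels, while the sentence-classification task distinguishes corrupted from original sentences.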

2022 ◽  
Vol 22 (1) ◽  
pp. 1-30
Rahul Kumar ◽  
Ankur Gupta ◽  
Harkirat Singh Arora ◽  
Balasubramanian Raman

Brain tumors are among the most critical malignant neurological cancers, with the highest numbers of deaths and injuries worldwide. They are categorized into two major classes, high-grade glioma (HGG) and low-grade glioma (LGG); HGG is more aggressive and malignant, whereas LGG tumors are less aggressive but, if left untreated, convert to HGG. Thus, classifying brain tumors into the corresponding grade is a crucial task, especially for making treatment decisions. Motivated by the importance of such critical threats to humans, we propose a novel framework for brain tumor classification using discrete wavelet transform-based fusion of MRI sequences and Radiomics feature extraction. We utilized the Brain Tumor Segmentation 2018 challenge training dataset for the performance evaluation of our approach, and we extracted features from three regions of interest derived using a combination of several tumor regions. We used wrapper-method-based feature selection techniques to select a significant set of features and used various machine learning classifiers (Random Forest, Decision Tree, and Extra Randomized Tree) to train the model. For proper validation of our approach, we adopted the five-fold cross-validation technique. We achieved state-of-the-art performance on several performance metrics, 〈Acc, Sens, Spec, F1-score, MCC, AUC〉 ≡ 〈98.60%, 99.05%, 97.33%, 99.05%, 96.42%, 98.19%〉, where Acc, Sens, Spec, F1-score, MCC, and AUC represent the accuracy, sensitivity, specificity, F1-score, Matthews correlation coefficient, and area under the curve, respectively. We believe our proposed approach will play a crucial role in planning clinical treatment and setting guidelines before surgery.
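The validation protocol (tree-based classifiers compared under five-fold cross-validation) can be sketched with scikit-learn. The breast-cancer dataset below is only a stand-in for the BraTS 2018 radiomics features, which are not reproduced here, and the default hyperparameters are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in tabular features; the paper extracts radiomics features from MRI.
X, y = load_breast_cancer(return_X_y=True)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in [("RandomForest", RandomForestClassifier(random_state=0)),
                  ("DecisionTree", DecisionTreeClassifier(random_state=0)),
                  ("ExtraTrees", ExtraTreesClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Stratified folds keep the HGG/LGG class ratio stable across splits, which matters when one grade is rarer than the other.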

2022 ◽  
Vol 54 (8) ◽  
pp. 1-36
Shubhra Kanti Karmaker (“Santu”) ◽  
Md. Mahadi Hassan ◽  
Micah J. Smith ◽  
Lei Xu ◽  
Chengxiang Zhai ◽  

As big data becomes ubiquitous across domains, and more and more stakeholders aspire to make the most of their data, demand for machine learning tools has spurred researchers to explore the possibilities of automated machine learning (AutoML). AutoML tools aim to make machine learning accessible for non-machine learning experts (domain experts), to improve the efficiency of machine learning, and to accelerate machine learning research. But although automation and efficiency are among AutoML’s main selling points, the process still requires human involvement at a number of vital steps, including understanding the attributes of domain-specific data, defining prediction problems, creating a suitable training dataset, and selecting a promising machine learning technique. These steps often require a prolonged back-and-forth that makes this process inefficient for domain experts and data scientists alike and keeps so-called AutoML systems from being truly automatic. In this review article, we introduce a new classification system for AutoML systems, using a seven-tiered schematic to distinguish these systems based on their level of autonomy. We begin by describing what an end-to-end machine learning pipeline actually looks like, and which subtasks of the machine learning pipeline have been automated so far. We highlight those subtasks that are still done manually—generally by a data scientist—and explain how this limits domain experts’ access to machine learning. Next, we introduce our novel level-based taxonomy for AutoML systems and define each level according to the scope of automation support provided. Finally, we lay out a roadmap for the future, pinpointing the research required to further automate the end-to-end machine learning pipeline and discussing important challenges that stand in the way of this ambitious goal.
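As a minimal illustration of the partial automation the review describes, the sketch below automates only model/hyperparameter selection with scikit-learn, while problem definition and dataset construction remain manual steps for the human; the dataset and search grid are illustrative choices, not taken from the article:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The human still defines the prediction problem and supplies the training
# dataset; only hyperparameter search over the pipeline is automated here.
X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=500))])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

In the article's level-based view, a tool like this sits at a low tier of autonomy: the upstream steps (data understanding, problem formulation, dataset creation) are untouched.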

2022 ◽  
Vol 31 (1) ◽  
pp. 1-27
Amin Nikanjam ◽  
Houssem Ben Braiek ◽  
Mohammad Mehdi Morovati ◽  
Foutse Khomh

Nowadays, we are witnessing an increasing demand in both industry and academia for exploiting Deep Learning (DL) to solve complex real-world problems. A DL program encodes the network structure of a desirable DL model and the process by which the model learns from the training dataset. Like any software, a DL program can be faulty, which poses substantial software quality assurance challenges, especially in safety-critical domains. It is therefore crucial to equip DL development teams with efficient fault detection techniques and tools. In this article, we propose NeuraLint, a model-based fault detection approach for DL programs that uses meta-modeling and graph transformations. First, we design a meta-model for DL programs that includes their base skeleton and fundamental properties. Then, we construct a graph-based verification process that covers 23 rules defined on top of the meta-model and implemented as graph transformations to detect faults and design inefficiencies in the generated models (i.e., instances of the meta-model). The proposed approach is first evaluated by finding faults and design inefficiencies in 28 synthesized examples built from common problems reported in the literature. NeuraLint then successfully finds 64 faults and design inefficiencies in 34 real-world DL programs extracted from Stack Overflow posts and GitHub repositories. The results show that NeuraLint effectively detects faults and design issues in both synthesized and real-world examples, with a recall of 70.5% and a precision of 100%. Although the proposed meta-model is designed for feedforward neural networks, it can be extended to support other architectures such as recurrent neural networks. Researchers can also expand our set of verification rules to cover more types of issues in DL programs.
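The flavor of a graph-based verification rule can be sketched in a few lines. The rule below (a convolution layer should be followed by a non-linear activation) is a hypothetical stand-in, not necessarily one of NeuraLint's actual 23 rules, and the list-of-nodes model is a drastic simplification of its meta-model:

```python
# Toy illustration of model-based fault detection: a model is a list of
# layer nodes, and a rule scans the graph for a suspicious pattern.
def check_missing_activation(layers):
    """Flag any 'conv' node not immediately followed by an activation node."""
    issues = []
    for i, layer in enumerate(layers):
        nxt = layers[i + 1] if i + 1 < len(layers) else None
        if layer["type"] == "conv" and (nxt is None or nxt["type"] not in
                                        ("relu", "sigmoid", "tanh")):
            issues.append(f"layer {i}: conv without following activation")
    return issues

model = [{"type": "conv"}, {"type": "relu"},
         {"type": "conv"}, {"type": "pool"}]   # second conv lacks activation
print(check_missing_activation(model))
```

NeuraLint itself expresses such checks as graph transformations over instances of its meta-model rather than ad hoc list scans, but the detect-a-pattern, report-a-fault structure is the same.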

2022 ◽  
Vol 22 (1) ◽  
Tenghui Han ◽  
Jun Zhu ◽  
Xiaoping Chen ◽  
Rujie Chen ◽  
Yu Jiang ◽  

Abstract Background: The liver is the most common metastatic site of colorectal cancer (CRC), and liver metastasis (LM) determines subsequent treatment as well as the prognosis of patients, especially T1 patients. T1 CRC patients with LM are recommended to undergo surgery and systemic treatment rather than endoscopic therapy alone. Nevertheless, there is still no effective model to predict the risk of LM in T1 CRC patients. Hence, we aimed to construct an accurate predictive model and an easy-to-use clinical tool. Methods: We integrated two independent CRC cohorts from the Surveillance, Epidemiology, and End Results database (SEER, training dataset) and Xijing hospital (testing dataset). Artificial intelligence (AI) and machine learning (ML) methods were adopted to establish the predictive model. Results: A total of 16,785 and 326 T1 CRC patients from the SEER database and Xijing hospital were incorporated into the study, respectively. Every single ML model demonstrated great predictive capability, with an area under the curve (AUC) close to 0.95, and a stacking bagging model displayed the best performance (AUC = 0.9631). As expected, the stacking model exhibited favorable discriminative ability and precisely screened out all eight LM cases from the 326 T1 patients in the outer validation cohort. In the subgroup analysis, the stacking model also demonstrated splendid predictive ability for patients with tumor size ranging from 1 to 50 mm (AUC = 0.956). Conclusion: We successfully established an innovative and convenient AI model for predicting LM in T1 CRC patients, which was further verified in the external dataset. Ultimately, we designed a novel and easy-to-use decision tree, which incorporates only four fundamental parameters and can be successfully applied in clinical practice.
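A stacking ensemble of the kind described can be sketched with scikit-learn. The synthetic, heavily imbalanced dataset and the choice of base learners below are illustrative assumptions, not the paper's SEER features or model composition:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: LM is a rare outcome, so the positive class is ~5%.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Base learners feed their predictions to a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
print(round(roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]), 3))
```

AUC is the appropriate headline metric here because with only eight LM cases in 326 external patients, plain accuracy would look excellent even for a useless model.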

2022 ◽  
Vol 12 (1) ◽  
Akitoshi Shimazaki ◽  
Daiju Ueda ◽  
Antoine Choppin ◽  
Akira Yamamoto ◽  
Takashi Honjo ◽  

We developed and validated a deep learning (DL)-based model using the segmentation method and assessed its ability to detect lung cancer on chest radiographs. Chest radiographs for use as a training dataset and a test dataset were collected separately from January 2006 to June 2018 at our hospital. The training dataset was used to train and validate the DL-based model with five-fold cross-validation. The model's sensitivity and mean false positive indications per image (mFPI) were assessed on the independent test dataset. The training dataset included 629 radiographs with 652 nodules/masses and the test dataset included 151 radiographs with 159 nodules/masses. The DL-based model had a sensitivity of 0.73 with 0.13 mFPI on the test dataset. Sensitivity was lower for lung cancers that overlapped with blind spots such as the pulmonary apices, pulmonary hila, chest wall, heart, and sub-diaphragmatic space (0.50–0.64) than for those in non-overlapping locations (0.87). The Dice coefficient for the 159 malignant lesions averaged 0.52. The DL-based model was able to detect lung cancers on chest radiographs with a low mFPI.
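The Dice coefficient used to score the malignant lesions is a standard overlap measure between binary masks. A minimal implementation, with toy masks rather than the paper's radiograph segmentations:

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice overlap between two binary masks: 2|A∩B| / (|A|+|B|).
    1.0 means perfect agreement; empty-vs-empty is defined as 1.0."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom else 1.0

a = np.zeros((8, 8), int); a[2:6, 2:6] = 1   # predicted lesion mask (16 px)
b = np.zeros((8, 8), int); b[3:7, 3:7] = 1   # ground-truth mask (16 px)
print(round(dice_coefficient(a, b), 4))      # 2*9/(16+16) = 0.5625
```

An average Dice of 0.52 thus means the predicted lesion outlines overlap the ground truth by roughly half their combined area, which is consistent with detection (localization) being easier than precise delineation.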

2022 ◽  
Vol 27 (1) ◽  
pp. 7
Monika Stipsitz ◽  
Hèlios Sanchis-Alepuz

Thermal simulations are an important part of the design process in many engineering disciplines. In simulation-based design approaches, a considerable amount of time is spent on repeated simulations, so a fast alternative simulation tool would be a welcome addition to any automated, simulation-based optimisation workflow. In this work, we present a proof-of-concept study of the application of convolutional neural networks to accelerate thermal simulations, focusing on the thermal aspect of electronic systems. The goal of such a tool is to provide accurate approximations of a full solution, in order to quickly select promising designs for more detailed investigation. Based on a training set of randomly generated circuits with corresponding finite element solutions, the full 3D steady-state temperature field is estimated using a fully convolutional neural network. A custom network architecture is proposed which captures the long-range correlations present in heat conduction problems. We test the network on a separate dataset and find that the mean relative error is around 2% and the typical evaluation time is 35 ms per sample (2 ms for evaluation, 33 ms for data transfer). The benefit of this neural-network-based approach is that, once training is completed, the network can be applied to any system within the design space spanned by the randomized training dataset (which includes different components, material properties, different positioning of components on a PCB, etc.).
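The reported ~2% figure corresponds to a mean relative error between the surrogate's predicted temperature field and the finite element reference. A minimal sketch of that metric, using random stand-in fields rather than actual network output:

```python
import numpy as np

def mean_relative_error(t_pred, t_ref, eps=1e-9):
    """Field-averaged relative error between a surrogate prediction and a
    reference (e.g. finite element) steady-state temperature field."""
    return float(np.mean(np.abs(t_pred - t_ref) / (np.abs(t_ref) + eps)))

rng = np.random.default_rng(0)
t_ref = 300.0 + 50.0 * rng.random((32, 32, 32))  # stand-in reference field, K
t_pred = t_ref * (1.0 + 0.02 * rng.standard_normal(t_ref.shape))  # ~2% noise
print(f"{100 * mean_relative_error(t_pred, t_ref):.2f}%")
```

Note that working in kelvin keeps the denominator well away from zero; the same metric on a Celsius field near 0 °C would blow up, which is one reason to define relative error on absolute temperature.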

Sensors ◽  
2022 ◽  
Vol 22 (2) ◽  
pp. 562
Marcin Kociołek ◽  
Michał Kozłowski ◽  
Antonio Cardone

Perceived texture directionality is an important but not fully explored image characteristic, and in many applications texture directionality detection is of fundamental importance. Several approaches have been proposed, such as the fast Fourier-based method. We recently proposed a method based on the interpolated grey-level co-occurrence matrix (iGLCM), which is robust to image blur and noise but slower than the Fourier-based method. Here we test the applicability of convolutional neural networks (CNNs) to texture directionality detection. To obtain the large amount of training data required, we built a training dataset consisting of synthetic textures with known directionality and varying perturbation levels. Subsequently, we defined and tested shallow and deep CNN architectures. We present the test results focusing on the CNN architectures and their robustness with respect to image perturbations. We identify the best-performing CNN architecture and compare it with the iGLCM, the Fourier and the local gradient orientation methods. We find that the accuracy of the CNN is lower than, yet comparable to, that of the iGLCM, and that it outperforms the other two methods. As expected, the CNN method shows the highest computing speed. Finally, we demonstrate the best-performing CNN on real-life images. Visual analysis suggests that the learned patterns generalize to real-life image data. Hence, CNNs represent a promising approach to texture directionality detection, warranting further investigation.
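The synthetic-texture idea can be sketched as follows: generate a grating with known orientation and recover its direction. Since the paper's CNN is not available, the estimator below uses the structure tensor (the local gradient orientation baseline mentioned in the comparison); grid size and spatial frequency are arbitrary choices:

```python
import numpy as np

def dominant_direction_deg(img):
    """Estimate the dominant gradient orientation of a texture (degrees)
    from the averaged structure tensor."""
    gy, gx = np.gradient(img.astype(float))   # np.gradient: axis 0 (y) first
    jxx, jyy, jxy = (gx * gx).mean(), (gy * gy).mean(), (gx * gy).mean()
    return float(np.degrees(0.5 * np.arctan2(2 * jxy, jxx - jyy)))

# Synthetic grating with known orientation, mimicking the training textures.
theta = np.radians(30.0)
y, x = np.mgrid[0:128, 0:128]
img = np.sin(2 * np.pi * 0.05 * (x * np.cos(theta) + y * np.sin(theta)))
print(round(dominant_direction_deg(img), 1))  # close to 30.0
```

Adding blur and noise to such gratings at varying perturbation levels is what turns this construction into a labelled training set for a CNN.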

Computation ◽  
2022 ◽  
Vol 10 (1) ◽  
pp. 6
Korab Rrmoku ◽  
Besnik Selimi ◽  
Lule Ahmedi

Receiving a recommendation for a certain item or a place to visit is now a common experience. However, the trustworthiness of the recommended items or places remains one of the main concerns. In this paper, we present an implementation of the Naive Bayes classifier, one of the most widely used machine learning algorithms, to improve the accuracy of recommendations and raise the trustworthiness confidence of the users and items within a network. Our approach proved feasible, reaching a prediction accuracy of 89% with a confidence of approximately 0.89 when applied to an online social-network dataset. Naive Bayes algorithms are widely used in recommender systems because they are fast and easy to implement. However, the requirement that predictors be independent remains a challenge, because in real-life scenarios predictors are usually dependent. We therefore used a larger training dataset, so that the response vector is drawn from a larger pool of examples, which improves the resulting prediction accuracy.
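A minimal Naive Bayes sketch with scikit-learn, on synthetic features standing in for the social-network data (the ~0.89 accuracy and confidence figures in the abstract come from the authors' dataset, not from this toy):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical user/item features; independent Gaussians, so the Naive Bayes
# independence assumption roughly holds here (unlike many real datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
nb = GaussianNB().fit(X_tr, y_tr)
acc = nb.score(X_te, y_te)
conf = nb.predict_proba(X_te).max(axis=1)   # per-prediction confidence
print(round(acc, 2), round(float(conf.mean()), 2))
```

The `predict_proba` maximum plays the role of the confidence score: a trust-aware recommender can surface only predictions whose posterior probability clears a threshold.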

Sensors ◽  
2022 ◽  
Vol 22 (2) ◽  
pp. 497
Sébastien Villon ◽  
Corina Iovan ◽  
Morgan Mangeas ◽  
Laurent Vigliola

With the availability of low-cost and efficient digital cameras, ecologists can now survey the world’s biodiversity through image sensors, especially in the previously rather inaccessible marine realm. However, the data rapidly accumulate, and ecologists face a data processing bottleneck. While computer vision has long been used as a tool to speed up image processing, only since the breakthrough of deep learning (DL) algorithms has a revolution in the automatic assessment of biodiversity from video recordings become conceivable. However, current applications of DL models to biodiversity monitoring do not consider some universal rules of biodiversity, especially rules on the distribution of species abundance, species rarity and ecosystem openness. These rules imply three issues for deep learning applications: first, the imbalance of long-tailed datasets biases the training of DL models; second, scarce data greatly lessen the performance of DL models for classes with few examples; finally, the open-world setting means that objects absent from the training dataset are incorrectly classified in the application dataset. Promising solutions to these issues are discussed, including data augmentation, data generation, cross-entropy modification, few-shot learning and open-set recognition. At a time when biodiversity faces the immense challenges of climate change and Anthropocene defaunation, stronger collaboration between computer scientists and ecologists is urgently needed to unlock the automatic monitoring of biodiversity.
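One of the listed remedies, cross-entropy modification, commonly means reweighting the loss by inverse class frequency so rare (long-tail) species contribute more to training. A minimal sketch with illustrative species counts:

```python
import numpy as np

def inverse_freq_weights(counts):
    """Inverse-frequency class weights: rare classes get larger weight,
    normalized so the weighted class counts are equal across classes."""
    counts = np.asarray(counts, float)
    return counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(probs, labels, weights):
    """Mean cross-entropy with each example scaled by its true-class weight."""
    p_true = probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * -np.log(p_true)))

w = inverse_freq_weights([900, 90, 10])   # long-tailed species abundances
probs = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])       # softmax outputs for 3 examples
labels = np.array([0, 1, 2])
print(np.round(w, 2), round(weighted_cross_entropy(probs, labels, w), 3))
```

With these counts the rarest class carries roughly 90 times the weight of the most abundant one, so errors on rare species dominate the gradient instead of being drowned out.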
