CRSSC: Salvage Reusable Samples from Noisy Data for Robust Learning

Author(s): Zeren Sun, Xian-Sheng Hua, Yazhou Yao, Xiu-Shen Wei, Guosheng Hu, et al.

2020, Vol 34 (04), pp. 6853-6860
Author(s): Xuchao Zhang, Xian Wu, Fanglan Chen, Liang Zhao, Chang-Tien Lu

The success of training accurate models strongly depends on the availability of a sufficiently large collection of precisely labeled data. However, real-world datasets contain erroneously labeled samples that substantially hinder the performance of machine learning models, while well-labeled data is expensive to obtain and only available in limited amounts. In this paper, we consider the problem of training a robust model using large-scale noisy data in conjunction with a small set of clean data. To leverage the information contained in the clean labels, we propose a novel self-paced robust learning algorithm (SPRL) that trains the model on a curriculum from more reliable (clean) instances to less reliable (noisy) ones under the supervision of the well-labeled data. The self-paced learning process reduces the risk of admitting corrupted data into the training set. Moreover, we provide theoretical analyses of the convergence of the proposed algorithm under mild assumptions. Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves considerable improvements in effectiveness and robustness over existing methods.
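The self-paced idea in this abstract can be illustrated with a minimal sketch. This is not the authors' SPRL algorithm (which additionally supervises selection with the clean set); it only shows the core curriculum mechanism: samples whose current loss falls below a threshold are treated as reliable and used to refit, and the threshold grows so that harder samples are admitted in later rounds. The helper name `self_paced_fit` and all parameter values are illustrative assumptions.

```python
import numpy as np

def self_paced_fit(X, y, lam0=8.0, growth=2.0, rounds=5):
    """Toy self-paced least squares: alternately (1) keep samples whose
    squared residual under the current model is below the threshold lam
    and (2) refit on the kept subset, growing lam each round so that
    harder (possibly noisy) samples enter the training set later."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # initial fit on all data
    lam = lam0
    for _ in range(rounds):
        resid = (X @ w - y) ** 2
        keep = resid < lam                 # easy (reliable) samples first
        if keep.sum() >= X.shape[1]:       # need enough rows to refit
            w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        lam *= growth                      # admit harder samples next round
    return w
```

On data with a few grossly mislabeled targets, the large-residual outliers stay above the threshold for all rounds, so the refits see only the clean subset.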


Author(s): Alexander Rader, Ionela G. Mocanu, Vaishak Belle, Brendan Juba

Robust learning in expressive languages with real-world data continues to be a challenging task. Numerous conventional methods appeal to heuristics without any assurances of robustness. While probably approximately correct (PAC) semantics offers strong guarantees, learning explicit representations is not tractable, even in propositional logic. However, recent work on so-called "implicit" learning has shown tremendous promise in obtaining polynomial-time results for fragments of first-order logic. In this work, we extend implicit learning in PAC semantics to handle noisy data, in the form of intervals and threshold uncertainty, in the language of linear arithmetic. We prove that our extended framework preserves the existing polynomial-time complexity guarantees. Furthermore, we provide the first empirical investigation of this hitherto purely theoretical framework. Using benchmark problems, we show that our implicit approach to learning optimal linear programming objective constraints significantly outperforms an explicit approach in practice.
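As a toy illustration of deciding queries implicitly against interval observations, a drastic simplification of the paper's PAC-semantics framework: restrict each noisy observation to an axis-aligned box (lower/upper bounds per variable) and ask whether a linear query is forced to hold over that box, counting the fraction of observations that entail it. The function names and the box restriction are assumptions, not the paper's construction.

```python
import numpy as np

def entailed_by_box(c, b, lower, upper):
    """Does every x with lower <= x <= upper satisfy c @ x <= b?
    Over an axis-aligned box the maximum of a linear form has a closed
    form: take upper[i] where c[i] > 0, else lower[i]."""
    worst = np.where(c > 0, upper, lower)
    return float(c @ worst) <= b

def fraction_entailing(c, b, lowers, uppers):
    """PAC-style empirical validity: the fraction of interval
    observations under which the query c @ x <= b must hold."""
    hits = [entailed_by_box(c, b, lo, up) for lo, up in zip(lowers, uppers)]
    return sum(hits) / len(hits)
```

The "implicit" flavor is that no explicit learned theory is built; each query is checked directly against the stored observations.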


2014, Vol 2 (1), pp. 1
Author(s): Richard Schwartz

10.28945/3602, 2016, Vol 15, pp. 593-609
Author(s): Hsun-Ming Lee, Ju Long, Lucian Visinescu

Developing Business Intelligence (BI) has been a top priority for enterprise executives in recent years. To meet these demands, universities need to prepare students to work with BI in enterprise settings. In this study, we considered a business simulator that offers students opportunities to apply BI and make top-management decisions in a system used by real-world professionals. Simulation-based instruction can be effective only if students are not discouraged by the difficulty of using the BI computer system and comprehending the complex BI subject matter. We investigate the constructivist practices embedded in the business simulation to understand their potential for helping students overcome this perceived difficulty, which in turn would enable instructors to use the simulator more effectively by providing insights into its pedagogical practices. Our findings show that constructivist practices such as collaboration and subject integration positively influence active learning and meaningful learning, respectively. In turn, both active learning and meaningful learning positively influence business intelligence motivational behavior. These findings can be used to develop a robust learning environment in BI classes.


Entropy, 2021, Vol 23 (6), pp. 727
Author(s): Eric J. Ma, Arkadij Kummer

We present a case study applying hierarchical Bayesian estimation on high-throughput protein melting-point data measured across the tree of life. We show that the model is able to impute reasonable melting temperatures even in the face of unreasonably noisy data. Additionally, we demonstrate how to use the variance in melting-temperature posterior-distribution estimates to enable principled decision-making in common high-throughput measurement tasks, and contrast the decision-making workflow against simple maximum-likelihood curve-fitting. We conclude with a discussion of the relative merits of each workflow.
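For context, the maximum-likelihood curve-fitting baseline that the abstract contrasts against can be sketched as a least-squares fit of a two-state sigmoid melting curve. The functional form, parameter names, and starting values below are assumptions for illustration, not the paper's model; the point is that this per-curve fit yields one estimate and standard error per protein, with no sharing of strength across related measurements as in the hierarchical Bayesian approach.

```python
import numpy as np
from scipy.optimize import curve_fit

def melt_curve(T, tm, width):
    """Two-state model: fraction folded vs. temperature, with melting
    temperature tm (the midpoint) and transition width."""
    return 1.0 / (1.0 + np.exp((T - tm) / width))

def fit_tm(T, folded):
    """Least-squares (maximum-likelihood under Gaussian noise) estimate
    of tm, plus its standard error from the parameter covariance."""
    popt, pcov = curve_fit(melt_curve, T, folded, p0=[np.median(T), 2.0])
    return popt[0], float(np.sqrt(pcov[0, 0]))
```

With very noisy or truncated curves this per-curve fit can fail or give wild estimates, which is exactly the situation where a hierarchical model's pooling of information helps.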


2021, Vol 15, pp. 174830262110084
Author(s): Bishnu P. Lamichhane, Elizabeth Harris, Quoc Thong Le Gia

We compare a recently proposed multivariate spline based on mixed partial derivatives with two other standard splines for the scattered data smoothing problem. Each spline is defined as the minimiser of a penalised least squares functional. The penalties are based on partial differential operators and are integrated using the finite element method. We apply the three methods to two problems: removing a mixture of Gaussian and impulsive noise from an image, and recovering a continuous function from a set of noisy observations.
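A one-dimensional analogue of such a penalised least squares smoother can be sketched as a discrete Whittaker-type smoother with a second-difference penalty; this stands in for the paper's finite-element spline construction, and the helper name and `lam` value are illustrative assumptions.

```python
import numpy as np

def smooth_penalized(y, lam):
    """Minimise ||x - y||^2 + lam * ||D2 x||^2, where D2 is the discrete
    second-difference operator (a stand-in for the derivative-based
    penalty). The minimiser solves (I + lam * D2^T D2) x = y."""
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second differences
    A = np.eye(n) + lam * D2.T @ D2
    return np.linalg.solve(A, y)
```

Larger `lam` penalises curvature more heavily and produces a smoother fit; constants and straight lines lie in the null space of the penalty and pass through unchanged.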


2021, Vol 11 (11), pp. 5123
Author(s): Maiada M. Mahmoud, Nahla A. Belal, Aliaa Youssif

Transcription factors (TFs) are proteins that control the transcription of a gene from DNA to messenger RNA (mRNA). A TF binds to a specific DNA sequence called a binding site. Transcription factor binding sites have not yet been completely identified, and locating them is a challenge that can be approached computationally as a classification problem in machine learning. In this paper, the prediction of transcription factor binding sites of SP1 on human chromosome 1 is presented using different classification techniques, and a model based on voting is proposed. The highest Area Under the Curve (AUC) achieved is 0.97 using K-Nearest Neighbors (KNN), and 0.95 using the proposed voting technique; however, the voting technique is more robust on noisy data. This study highlights the applicability of voting to the prediction of binding sites and the strong performance of KNN on this type of data.
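A generic version of such a voting ensemble with KNN among its members can be sketched with scikit-learn. The synthetic features below merely stand in for the paper's SP1 binding-site data, and the choice of member classifiers and hyperparameters is an assumption, not the paper's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for binding-site feature vectors and labels.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Soft voting averages the members' predicted probabilities,
# which is what an AUC evaluation needs.
vote = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",
)
vote.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, vote.predict_proba(X_te)[:, 1])
```

Averaging probabilities across heterogeneous members is one reason a voting ensemble can degrade more gracefully than any single model when labels or features are noisy.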

