Weak Supervision and Machine Learning for Online Harassment Detection

Author(s):  
Bert Huang ◽  
Elaheh Raisi
2021 ◽  
Author(s):  
Jason Meil

Data preparation generally consumes up to 80% of a data scientist's time, with 60% of that attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially for real-world unstructured and unlabeled datasets, is to leverage Snorkel, a tool specifically designed to rapidly create, manage, and model training data. Configured properly, Snorkel can temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, subject-matter expertise, or knowledge bases) written in Python to generate "noisy" labels. Each function traverses the entire dataset, and the resulting labels are fed into a generative (conditionally probabilistic) model. This model estimates the distribution of each response variable and predicts conditional probabilities based on a joint probability distribution, by comparing the labeling functions and the degree to which their outputs agree. A labeling function that agrees strongly with the other labeling functions is assigned a high learned accuracy, that is, the fraction of predictions the model got right; conversely, a labeling function that agrees poorly with the others is assigned a low learned accuracy. The individual predictions are then combined, weighted by estimated accuracy, so that the predictions of the more accurate functions count more heavily. The result is a transformation from a hard binary label of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data are added to this generative model, multi-class inference is made over the response variables positive, negative, or abstain, assigning probabilistic labels to potentially millions of data points. We thus generate a ground truth for all further labeling efforts and improve the scalability of our models, and the labeling functions can be applied to new unlabeled data to further machine learning efforts.

Once our datasets are labeled and a ground truth is established, we persist the data into our delta lake, which combines the most performant aspects of a data warehouse with the low-cost storage of a data lake. The lake accepts unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw ingestion, cleaned, and feature-engineered data layers. By sectioning the data sources into these "layers", the data engineering work is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.

The design of the entire ecosystem is to eliminate as much technical debt in machine learning as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.
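To make the labeling-function workflow concrete, here is a minimal sketch assuming Snorkel's v0.9-style labeling API; the DataFrame `df_train`, its `text` column, and the two heuristics are hypothetical illustrations rather than the production pipeline described above.

```python
# Minimal sketch of programmatic labeling with Snorkel (assumed v0.9.x API).
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1  # Snorkel uses -1 for "abstain"

@labeling_function()
def lf_keyword(x):
    # Heuristic: flag records containing a domain keyword (hypothetical rule).
    return POSITIVE if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_text(x):
    # Distant-supervision-style rule: very short records are unlikely positives.
    return NEGATIVE if len(x.text.split()) < 3 else ABSTAIN

lfs = [lf_keyword, lf_short_text]
df_train = pd.DataFrame({"text": ["Please issue a refund", "ok", "No issues here"]})

# Apply every labeling function to every record -> label matrix (n_rows x n_lfs).
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)

# Generative label model: learns each function's accuracy from agreements and
# disagreements alone, then emits fuzzy per-class probabilities in [0, 1].
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=42)
probs = label_model.predict_proba(L_train)  # P(y=0), P(y=1) per record
print(probs)
```

The probabilistic output of the label model is what downstream training consumes in place of hand labels.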


2020 ◽  
Author(s):  
Emad Kasaeyan Naeini ◽  
Ajan Subramanian ◽  
Michael-David Calderon ◽  
Kai Zheng ◽  
Nikil Dutt ◽  
...  

BACKGROUND There is a strong demand for an accurate and objective means of assessing acute pain among hospitalized patients, to help clinicians provide a proper dosage of pain medications in a timely manner. Heart rate variability (HRV) comprises changes in the time intervals between consecutive heartbeats, which can be measured by acquiring and interpreting the electrocardiogram (ECG) captured from bedside monitors or wearable devices. Because increased sympathetic activity affects HRV, an index of autonomic regulation of heart rate, ultra-short-term HRV analysis can provide a reliable source of information for acute pain monitoring. In this study, widely used time- and frequency-domain HRV measurements are used to assess acute pain in postoperative patients. Existing approaches have focused only on stimulated pain in healthy subjects; to the best of our knowledge, no prior work has built models using real pain data from postoperative patients. OBJECTIVE To develop and evaluate an automatic and adaptable pain assessment algorithm based on ECG features for assessing acute pain in postoperative patients likely experiencing mild to moderate pain. METHODS The study used a prospective observational design. The sample consisted of 25 patient participants aged 18 to 65 years. In part 1 of the study, a Transcutaneous Electrical Nerve Stimulation (TENS) unit was employed to obtain a baseline discomfort threshold for each patient. In part 2, a multichannel biosignal acquisition device was used while patients engaged in non-noxious activities. At all times, pain intensity was measured using patient self-reports based on the Numerical Rating Scale (NRS). A weak supervision framework was adopted for rapid training data creation, and the collected labels were transformed from 11 intensity levels to 5 intensity levels. Prediction models were developed using 5 different machine-learning methods, and mean prediction accuracy was calculated using leave-one-subject-out cross-validation. We compared the performance of these models with the results from a previously published research study. RESULTS Five different machine-learning algorithms were applied to perform binary classification of no pain (NP) vs. each of 4 distinct pain levels (PL1 through PL4). Using the 3 time-domain HRV features from the BioVid research paper, the highest validation accuracy for no pain vs. any other pain level was achieved by the SVM, ranging from 62.72% (NP vs. PL4) to 84.14% (NP vs. PL2). Similar results were achieved for the top 8 features selected by the Gini index using the SVM method, with an accuracy ranging from 63.86% (NP vs. PL4) to 84.79% (NP vs. PL2). CONCLUSIONS We propose a novel pain assessment method for postoperative patients using the ECG signal. Weak supervision applied to labeling and feature extraction improves the robustness of the approach. Our results show the viability of using a machine-learning algorithm to accurately and objectively assess acute pain among hospitalized patients. INTERNATIONAL REGISTERED REPORT RR2-10.2196/17783
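As an illustration of the evaluation protocol described above, the following is a minimal sketch of leave-one-subject-out cross-validation with an SVM on time-domain HRV features; the feature matrix, labels, and subject IDs are randomly generated placeholders, not the study's data.

```python
# Minimal sketch: leave-one-subject-out (LOSO) evaluation of an SVM on
# hypothetical time-domain HRV features (e.g., mean RR, SDNN, RMSSD).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_windows, n_subjects = 200, 25
X = rng.normal(size=(n_windows, 3))                     # placeholder HRV feature matrix
y = rng.integers(0, 2, size=n_windows)                  # 0 = no pain (NP), 1 = a pain level (e.g., PL2)
groups = rng.integers(0, n_subjects, size=n_windows)    # subject ID for each window

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
accs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # Each fold holds out every window from one subject.
    model.fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))

print(f"Mean LOSO accuracy: {np.mean(accs):.3f}")
```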


2018 ◽  
Author(s):  
Jason A. Fries ◽  
Paroma Varma ◽  
Vincent S. Chen ◽  
Ke Xiao ◽  
Heliodoro Tejeda ◽  
...  

Abstract Biomedical repositories such as the UK Biobank provide increasing access to prospectively collected cardiac imaging; however, these data are unlabeled, which creates barriers to their use in supervised machine learning. We develop a weakly supervised deep learning model for classification of aortic valve malformations using up to 4,000 unlabeled cardiac MRI sequences. Instead of requiring highly curated training data, weak supervision relies on noisy heuristics defined by domain experts to programmatically generate large-scale, imperfect training labels. For aortic valve classification, models trained with imperfect labels substantially outperform a supervised model trained on hand-labeled MRIs. In an orthogonal validation experiment using health outcomes data, our model identifies individuals with a 1.8-fold increase in risk of a major adverse cardiac event. This work formalizes a learning baseline for aortic valve classification and outlines a general strategy for using weak supervision to train machine learning models using unlabeled medical images at scale.
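A minimal sketch of the general strategy, not the paper's model: once heuristics have produced probabilistic labels, a discriminative classifier can be trained against them with a soft ("noise-aware") cross-entropy. The feature tensor and label probabilities below are synthetic placeholders standing in for MRI-derived embeddings.

```python
# Minimal sketch: training a classifier on probabilistic ("noisy") labels.
import torch
import torch.nn as nn

n_samples, n_features, n_classes = 1000, 64, 2
features = torch.randn(n_samples, n_features)                      # placeholder embeddings
probs = torch.softmax(torch.randn(n_samples, n_classes), dim=1)    # weak labels in [0, 1]

model = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, n_classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    optimizer.zero_grad()
    log_p = torch.log_softmax(model(features), dim=1)
    # Soft cross-entropy: weight each class's log-probability by the weak label mass.
    loss = -(probs * log_p).sum(dim=1).mean()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```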


2019 ◽  
Vol 7 ◽  
pp. 233-248
Author(s):  
Laura Jehl ◽  
Carolin Lawrence ◽  
Stefan Riezler

In many machine learning scenarios, supervision by gold labels is not available and consequently neural models cannot be trained directly by maximum likelihood estimation. In a weak supervision scenario, metric-augmented objectives can be employed to assign feedback to model outputs, which can be used to extract a supervision signal for training. We present several objectives for two separate weakly supervised tasks, machine translation and semantic parsing. We show that objectives should actively discourage negative outputs in addition to promoting a surrogate gold structure. This notion of bipolarity is naturally present in ramp loss objectives, which we adapt to neural models. We show that bipolar ramp loss objectives outperform other non-bipolar ramp loss objectives and minimum risk training on both weakly supervised tasks, as well as on a supervised machine translation task. Additionally, we introduce a novel token-level ramp loss objective, which is able to outperform even the best sequence-level ramp loss on both weakly supervised tasks.
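For reference, one common "hope/fear" formulation of a bipolar ramp loss from the structured-prediction literature is sketched below; the paper's exact sequence- and token-level objectives may differ in how candidates are selected and scored. Here s_theta(x, y) is the model score and delta(y) a metric-based gain (e.g., task feedback or BLEU).

```latex
% Sketch of a bipolar ramp loss (hope/fear formulation); not necessarily the
% paper's exact objective.
\[
\begin{aligned}
y^{+} &= \arg\max_{y}\; \bigl[ s_\theta(x, y) + \delta(y) \bigr]
  && \text{(hope: high score and high gain)} \\
y^{-} &= \arg\max_{y}\; \bigl[ s_\theta(x, y) - \delta(y) \bigr]
  && \text{(fear: high score but low gain)} \\
\mathcal{L}_{\mathrm{ramp}}(\theta) &= -\, s_\theta(x, y^{+}) + s_\theta(x, y^{-})
  && \text{(promote } y^{+}\text{, actively discourage } y^{-}\text{)}
\end{aligned}
\]
```

The bipolarity lies in the second term: the objective not only rewards the surrogate gold structure but also pushes down the score of a high-scoring, low-gain output.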


2020 ◽  
Vol 21 (S23) ◽  
Author(s):  
Arnaud Ferré ◽  
Louise Deléger ◽  
Robert Bossy ◽  
Pierre Zweigenbaum ◽  
Claire Nédellec

Abstract Background Entity normalization is an important information extraction task which has gained renewed attention in the last decade, particularly in the biomedical and life science domains. In these domains, and more generally in all specialized domains, this task is still challenging for the latest machine learning-based approaches, which have difficulty handling highly multi-class and few-shot learning problems. To address this issue, we propose C-Norm, a new neural approach which synergistically combines standard and weak supervision, ontological knowledge integration and distributional semantics. Results Our approach greatly outperforms all methods evaluated on the Bacteria Biotope datasets of BioNLP Open Shared Tasks 2019, without integrating any manually-designed domain-specific rules. Conclusions Our results show that relatively shallow neural network methods can perform well in domains that present highly multi-class and few-shot learning problems.


2021 ◽  
Vol 11 ◽  
Author(s):  
Phillipe Loher ◽  
Nestoras Karathanasis

The development of single-cell sequencing technologies has allowed researchers to gain important new knowledge about the expression profiles of genes in thousands of individual cells of a model organism or tissue. A common disadvantage of this technology is the loss of the three-dimensional (3-D) structure of the cells. Consequently, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized the Single-Cell Transcriptomics Challenge, in which we participated, with the aim of addressing the following two problems: (a) to identify the top 60, 40, and 20 genes of the Drosophila melanogaster embryo that contain the most spatial information and (b) to reconstruct the 3-D arrangement of the embryo using information from those genes. We developed two independent techniques, leveraging machine learning models based on the least absolute shrinkage and selection operator (Lasso) and deep neural networks (NNs), which are applied to high-dimensional single-cell sequencing data in order to accurately identify genes that contain spatial information. Our first technique, Lasso.TopX, utilizes the Lasso and ranking statistics and allows a user to define the specific number of features they are interested in. The NN approach utilizes weak supervision for linear regression to accommodate uncertain or probabilistic training labels. We show, individually for both techniques, that we are able to identify important and stable genes containing the most spatial information, with the number of genes defined by the user. The results from both techniques achieve high performance when reconstructing spatial information in D. melanogaster and also generalize to zebrafish (Danio rerio). Furthermore, we identified novel D. melanogaster genes that carry important positional information and were not previously suspected. We also show how the indirect use of the full datasets' information can lead to data leakage and generate bias by overestimating the model's performance. Lastly, we discuss the applicability of our approaches to other feature selection problems outside the realm of single-cell sequencing and the importance of being able to handle probabilistic training labels. Our source code and detailed documentation are available at https://github.com/TJU-CMC-Org/SingleCell-DREAM/.
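As a rough illustration of a Lasso.TopX-style selection step (not the authors' implementation), the sketch below ranks genes by the magnitude of Lasso coefficients and keeps a user-defined top-X set; the expression matrix and spatial coordinate are synthetic placeholders.

```python
# Minimal sketch: Lasso-based ranking of genes by spatial information content.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n_cells, n_genes = 300, 100
X_expr = rng.normal(size=(n_cells, n_genes))                        # placeholder expression matrix
y_coord = X_expr[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=n_cells)  # one spatial axis

# Fit a cross-validated Lasso regression of the coordinate on gene expression.
lasso = LassoCV(cv=5, random_state=0).fit(X_expr, y_coord)

top_x = 20  # user-defined number of genes to retain
ranked = np.argsort(-np.abs(lasso.coef_))   # rank genes by coefficient magnitude
selected_genes = ranked[:top_x]
print("Selected gene indices:", selected_genes)
```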


2019 ◽  
Author(s):  
Phillipe Loher ◽  
Nestoras Karathanasis

Abstract Motivation: We participated in the DREAM Single Cell Transcriptomics Challenge. The challenge's focus was two-fold: (a) to identify the top 60, 40 and 20 genes that contain the most spatial information, and (b) to reconstruct the 3-D arrangement of the D. melanogaster embryo using information from those genes. Results: We developed two independent approaches, leveraging machine learning models from Lasso and Deep Neural Networks, that we successfully apply to high-dimensional single-cell sequencing data. Our methods allowed us to achieve top performance when compared to the ground truth. Among ~40 participating teams, the resulting solutions placed 10th, 6th, and 4th in the three DREAM sub-challenges #1, #2 and #3, respectively. Notably, for the Lasso approach we introduced a feature selection technique, Lasso-TopX, that allows a user to define a specific number of features they are interested in, and the Neural Network approach utilizes weak supervision for linear regression to accommodate uncertain or probabilistic training labels. Furthermore, we identified novel D. melanogaster genes that carry important positional information and were not previously suspected. Lastly, we show how the indirect use of the full datasets' information can lead to data leakage and generate bias by overestimating the model's performance. Availability: https://github.com/TJU-CMC-Org/SingleCell-DREAM/ Contact: [email protected]


2020 ◽  
Vol 43 ◽  
Author(s):  
Myrthe Faber

Abstract Gilead et al. state that abstraction supports mental travel, and that mental travel critically relies on abstraction. I propose an important addition to this theoretical framework, namely that mental travel might also support abstraction. Specifically, I argue that spontaneous mental travel (mind wandering), much like data augmentation in machine learning, provides variability in mental content and context necessary for abstraction.


2020 ◽  
Author(s):  
Man-Wai Mak ◽  
Jen-Tzung Chien

2020 ◽  
Author(s):  
Mohammed J. Zaki ◽  
Wagner Meira, Jr
