Imprecise Oracles Impose Limits to Predictability in Supervised Learning (Extended Abstract)

Author(s):  
Anjali Sifar ◽  
Nisheeth Srivastava

Supervised learning operates on the premise that labels unambiguously represent ground truth. This premise is reasonable in domains where a high degree of consensus is easily reached for any given data record, e.g. in agreeing on whether an image contains an elephant. However, there are several domains in which people disagree with each other on the appropriate label to assign to a record, e.g. whether a tweet is toxic. We argue that data labeling must be understood as a process with some degree of domain-dependent noise, and that any claims of predictive prowess must be sensitive to the degree of this noise. We present a method for quantifying labeling noise in a particular domain in which people are seen to disagree with their own past selves on the appropriate label to assign to a record: choices under prospect uncertainty. Our results indicate that 'state-of-the-art' choice models of decisions from description, by failing to consider the intrinsic variability of human choice behavior, find themselves in the odd position of predicting humans' choices better than those same humans' own previous choices for the same problems. We conclude with observations on how the predicament we empirically demonstrate could be handled in the practice of supervised learning.
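To make the predictability ceiling concrete, here is a minimal Python sketch (with made-up labels) of the test-retest logic: if an oracle agrees with its own earlier labels only some fraction of the time, that fraction bounds how well any model can meaningfully predict its labels.

```python
import numpy as np

# Hypothetical paired labels: each row is one record labeled twice by the
# same person at two different times (e.g., choices on the same gamble).
first_pass  = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
second_pass = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

# Test-retest consistency: how often the oracle agrees with itself.
consistency = np.mean(first_pass == second_pass)

# A model's accuracy against either pass cannot meaningfully exceed this
# self-agreement rate; treat it as a ceiling on predictability.
print(f"self-agreement (predictability ceiling): {consistency:.2f}")
```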

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Eustasio del Barrio ◽  
Hristo Inouzhe ◽  
Jean-Michel Loubes ◽  
Carlos Matrán ◽  
Agustín Mayo-Íscar

Abstract
Background: Data obtained from flow cytometry present pronounced variability due to biological and technical reasons. Biological variability is a well-known phenomenon produced by measurements on different individuals with different characteristics, such as illness, age, or sex. The use of different settings for measurement, the variation of conditions during experiments, and the different types of flow cytometers are some of the technical causes of variability. This mixture of sources of variability makes the use of supervised machine learning for the identification of cell populations difficult. The present work is conceived as a combination of strategies to facilitate the task of supervised gating.
Results: We propose optimalFlowTemplates, based on a similarity distance and Wasserstein barycenters, which clusters cytometries and produces prototype cytometries for the different groups. We show that supervised learning, restricted to the new groups, performs better than the same techniques applied to the whole collection. We also present optimalFlowClassification, which uses a database of gated cytometries and optimalFlowTemplates to assign cell types to a new cytometry. We show that this procedure can outperform state-of-the-art techniques on the proposed datasets. Our code is freely available as optimalFlow, a Bioconductor R package, at https://bioconductor.org/packages/optimalFlow.
Conclusions: optimalFlowTemplates + optimalFlowClassification addresses the problem of using supervised learning while accounting for biological and technical variability. Our methodology provides a robust automated gating workflow that handles the intrinsic variability of flow cytometry data well. Our main innovation is the methodology itself and the optimal transport techniques that we apply to flow cytometry analysis.
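The following Python sketch illustrates the core idea behind optimalFlowTemplates, not the Bioconductor package's API: cytometries are compared by Wasserstein distance and grouped by hierarchical clustering, after which each group could be summarized by a barycenter template. The data are synthetic 1-D stand-ins for real flow measurements.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Hypothetical 1-D marker intensities from six cytometries: two "groups"
# whose distributions differ by a shift (stand-in for real flow data).
cytometries = [rng.normal(loc, 1.0, 500) for loc in (0, 0.2, 0.1, 3.0, 3.1, 2.9)]

# Pairwise Wasserstein distances between the empirical distributions.
n = len(cytometries)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = wasserstein_distance(cytometries[i], cytometries[j])

# Hierarchical clustering on the distance matrix groups similar cytometries;
# each group could then be summarized by a barycenter "template".
labels = fcluster(linkage(squareform(D), method="average"), t=2, criterion="maxclust")
print(labels)  # e.g., [1 1 1 2 2 2]
```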


Author(s):  
Ding Li ◽  
Scott Dick

Abstract
Graph-based algorithms are known to be effective approaches to semi-supervised learning. However, there has been relatively little work on extending these algorithms to the multi-label classification case. We derive an extension of the Manifold Regularization algorithm to multi-label classification, which is significantly simpler than the general Vector Manifold Regularization approach. We then augment our algorithm with a weighting strategy that allows instances with ground-truth labels and instances with induced labels to exert differential influence on the model. Experiments on four benchmark multi-label data sets show that the resulting algorithm outperforms existing semi-supervised multi-label classification algorithms overall at various levels of label sparsity. Comparisons with state-of-the-art supervised multi-label approaches (which, of course, train on fully labeled data) show that our algorithm outperforms all of them even with a substantial number of unlabeled examples.
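As a rough illustration of weighted graph-based semi-supervised learning (not the authors' exact formulation), the numpy sketch below solves the regularized objective min tr((F-Y)'C(F-Y)) + lambda tr(F'LF), where the diagonal confidence matrix C realizes the differential-influence weighting between ground-truth and induced labels. The graph, labels, and weights are all toy values.

```python
import numpy as np

# Hypothetical toy graph: 6 instances, first 3 labeled, last 3 unlabeled.
A = np.array([[0,1,1,0,0,0],
              [1,0,1,1,0,0],
              [1,1,0,0,0,0],
              [0,1,0,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], float)
L = np.diag(A.sum(1)) - A                   # graph Laplacian

# Multi-label target matrix (2 labels); zero rows for unlabeled instances.
Y = np.zeros((6, 2))
Y[0] = [1, 0]; Y[1] = [1, 1]; Y[2] = [0, 1]

# Per-instance confidence: full weight on ground-truth labels, none on
# unlabeled rows (an induced label could later get an intermediate weight).
c = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
C = np.diag(c)

lam = 0.5                                   # smoothness trade-off
F = np.linalg.solve(C + lam * L, C @ Y)     # stationary point of the objective
print((F > 0.25).astype(int))               # thresholded multi-label predictions
```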


Author(s):  
Weijia Zhang

Multi-instance learning is a type of weakly supervised learning. It deals with tasks where the data is a set of bags, and each bag is a set of instances. Only the bag labels are observed, whereas the labels for the instances are unknown. An important advantage of multi-instance learning is that, by representing objects as bags of instances, it is able to preserve the inherent dependencies among the parts of an object. Unfortunately, most existing algorithms assume the instances to be independently and identically distributed, an assumption rarely satisfied in real-world scenarios, where the instances within a bag are seldom independent. In this work, we propose the Multi-Instance Variational Autoencoder (MIVAE) algorithm, which explicitly models the dependencies among the instances for predicting both bag labels and instance labels. Experimental results on several multi-instance benchmarks and end-to-end medical imaging datasets demonstrate that MIVAE performs better than state-of-the-art algorithms on both instance-label and bag-label prediction tasks.
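For context, here is a minimal PyTorch sketch of the standard multi-instance setup that MIVAE improves upon (this is not the MIVAE architecture, and all names and sizes are illustrative): a shared per-instance embedding with mean pooling, which treats instances as exchangeable and captures none of the inter-instance dependencies MIVAE models.

```python
import torch
import torch.nn as nn

class BagEncoder(nn.Module):
    """Toy permutation-invariant bag encoder (NOT the MIVAE architecture):
    instances interact only through a shared embedding and mean pooling,
    i.e., the i.i.d.-style assumption that MIVAE relaxes."""
    def __init__(self, in_dim=10, hid=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.bag_head = nn.Linear(hid, 1)        # bag-level label
        self.inst_head = nn.Linear(hid, 1)       # instance-level labels

    def forward(self, bag):                      # bag: (n_instances, in_dim)
        h = self.phi(bag)
        bag_logit = self.bag_head(h.mean(0))     # pooled bag representation
        inst_logits = self.inst_head(h).squeeze(-1)
        return bag_logit, inst_logits

bag = torch.randn(7, 10)                         # one bag with 7 instances
bag_logit, inst_logits = BagEncoder()(bag)
print(bag_logit.shape, inst_logits.shape)        # torch.Size([1]) torch.Size([7])
```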


2019 ◽  
Vol 33 (13) ◽  
pp. 1950133 ◽  
Author(s):  
Mei Chen ◽  
Mei Zhang ◽  
Ming Li ◽  
Mingwei Leng ◽  
Zhichong Yang ◽  
...  

Detecting the natural communities in a real-world network can uncover its underlying structure and potential function. In this paper, a novel community detection algorithm, SUM, is introduced. The fundamental idea of SUM is that a node with relatively low degree stays faithful to its community, because it has links only with nodes in one community, whereas a node with relatively high degree has links with nodes both within and outside its community, which may cause confusion when detecting communities. Based on this idea, SUM detects communities by distrusting the links from maximum-degree nodes to their neighbors within a community, while relying mainly on the nodes with relatively low degree. SUM elegantly defines a similarity that takes into account both the commonality and the rejective degree of two adjacent nodes. After putting similar nodes into one community, SUM generates initial communities by reassigning the maximum-degree nodes. Next, SUM assigns unlabeled nodes to the initial communities and adjusts each border node to its most-linked community. To evaluate its effectiveness, SUM is compared with seven baselines, including four classical and three state-of-the-art methods, on a wide range of complex networks. On small networks with ground-truth community structures, results are visually demonstrated as well as quantitatively measured with ARI, NMI, and Modularity. On relatively large networks without ground-truth community structures, the performance of the algorithms is evaluated by Modularity. Experimental results indicate that SUM can effectively determine community structures of high quality on small or relatively large networks, and that it outperforms the compared state-of-the-art methods.
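The sketch below (networkx, with an illustrative Jaccard similarity standing in for SUM's commonality/rejective-degree measure, not the paper's definitions) conveys the low-degree-first idea: nodes are labeled in ascending order of degree, each joining its most similar labeled neighbor, followed by one border-adjustment sweep.

```python
import networkx as nx
from collections import Counter

G = nx.karate_club_graph()
deg = dict(G.degree())

def sim(u, v):
    # Shared-neighbor (Jaccard) overlap between adjacent nodes; a stand-in
    # for SUM's commonality/rejective-degree similarity.
    nu, nv = set(G[u]) | {u}, set(G[v]) | {v}
    return len(nu & nv) / len(nu | nv)

# Process nodes from low to high degree: low-degree nodes are treated as
# the most faithful witnesses of their community; hubs are assigned last.
labels = {}
for u in sorted(G, key=deg.get):
    labeled_nbrs = [v for v in G[u] if v in labels]
    if labeled_nbrs:
        best = max(labeled_nbrs, key=lambda v: sim(u, v))
        labels[u] = labels[best]          # join the most similar neighbor
    else:
        labels[u] = u                     # start a new community

# One refinement sweep: move each node to its most common neighbor label
# (mirrors SUM's border-node adjustment step).
for u in sorted(G, key=deg.get):
    labels[u] = Counter(labels[v] for v in G[u]).most_common(1)[0][0]

print(len(set(labels.values())), "communities")
```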


2019 ◽  
Vol 30 (04) ◽  
pp. 1950021
Author(s):  
Jinfang Sheng ◽  
Kai Wang ◽  
Zejun Sun ◽  
Jie Hu ◽  
Bin Wang ◽  
...  

In recent years, community detection has gradually become a hot topic in the complex network data mining field. Research on community detection is helpful not only for understanding network topology but also for exploring hidden network function. In this paper, we improve FluidC, a novel community detection algorithm based on fluid propagation, by improving the quality of the seed set through positive feedback and by determining the node update order. We first summarize the shortcomings of FluidC and analyze the reasons for these drawbacks. We then take effective measures to overcome them and propose an efficient community detection algorithm, called FluidC+. Finally, experiments on generated and real-world networks show that our method not only greatly improves the performance of the original FluidC algorithm but also outperforms many state-of-the-art algorithms, especially on real-world networks with ground-truth communities.
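FluidC+ itself is not available in standard libraries, but the fluid-communities family it extends ships with networkx; a minimal usage example of that baseline:

```python
import networkx as nx
from networkx.algorithms.community import asyn_fluidc

# Baseline fluid-communities algorithm (the family FluidC belongs to),
# as shipped in networkx; FluidC+ itself is not part of networkx.
G = nx.karate_club_graph()
communities = list(asyn_fluidc(G, k=2, seed=42))
for i, c in enumerate(communities):
    print(f"community {i}: {sorted(c)}")
```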


Author(s):  
P.R. Swann ◽  
A.E. Lloyd

Figure 1 shows the design of a specimen stage used for the in situ observation of phase transformations in the temperature range between ambient and −160°C. The design has the following features: a high degree of specimen stability during tilting; linear tilt actuation about two orthogonal axes for accurate control of tilt; angle read-out; a high-angle tilt range for stereo work and habit plane determination; simple, robust construction; temperature control of better than ±0.5°C; and minimal thermal drift and transmission of vibration from the cooling system.


2019 ◽  
Vol 8 ◽  
pp. 54-56
Author(s):  
Ashmita Dahal Chhetri

Advertisements have been used for many years to influence the buying behavior of consumers. Advertisements help create awareness and shape perceptions of a product among customers. This particular research was conducted on 100 young males and females who use different brands of a product, to check the influence of advertisement on their buying behavior in creating awareness and building perceptions. Correlation, regression, and other statistical tools were used to identify the relationships between these variables. The results revealed that the relationship between media and consumer behavior is positive: advertising has an impact on sales, and there is a strong positive relationship between advertising and consumer behavior. Advertising a product through electronic media has a greater impact than through non-electronic media.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract
Background: Three-way data have started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (the triclustering solution) as output.
Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters.
Conclusions: Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics to compare solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
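A tiny numpy sketch of the planted-tricluster idea (dimensions, pattern, and noise levels are arbitrary choices for illustration, not G-Tric's defaults): background noise plus a coherent pattern on a known subspace, so the ground truth is available by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
# Background: 50 observations x 30 features x 8 contexts of N(0,1) noise.
data = rng.normal(0.0, 1.0, size=(50, 30, 8))

# Plant one tricluster: a coherent (near-constant) pattern on a random
# subspace of observations, features, and contexts, plus a little noise.
rows = rng.choice(50, 10, replace=False)
cols = rng.choice(30, 6, replace=False)
ctxs = rng.choice(8, 3, replace=False)
data[np.ix_(rows, cols, ctxs)] = 5.0 + rng.normal(0.0, 0.1, size=(10, 6, 3))

# Ground truth is known by construction, enabling extrinsic evaluation.
print(sorted(rows), sorted(cols), sorted(ctxs))
```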


2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Aysen Degerli ◽  
Mete Ahishali ◽  
Mehmet Yamac ◽  
Serkan Kiranyaz ◽  
Muhammad E. H. Chowdhury ◽  
...  

Abstract
Computer-aided diagnosis has become a necessity for accurate and immediate coronavirus disease 2019 (COVID-19) detection to aid treatment and prevent the spread of the virus. Numerous studies have proposed to use deep learning techniques for COVID-19 diagnosis. However, they have used very limited chest X-ray (CXR) image repositories for evaluation, with only a small number (a few hundred) of COVID-19 samples. Moreover, these methods can neither localize nor grade the severity of COVID-19 infection. For this purpose, recent studies have proposed to explore the activation maps of deep networks. However, these remain inaccurate for localizing the actual infection, making them unreliable for clinical use. This study proposes a novel method for the joint localization, severity grading, and detection of COVID-19 from CXR images by generating so-called infection maps. To accomplish this, we have compiled the largest dataset with 119,316 CXR images, including 2951 COVID-19 samples, where the annotation of the ground-truth segmentation masks is performed on CXRs by a novel collaborative human–machine approach. Furthermore, we publicly release the first CXR dataset with ground-truth segmentation masks of the COVID-19 infected regions. A detailed set of experiments shows that state-of-the-art segmentation networks can learn to localize COVID-19 infection with an F1-score of 83.20%, which is significantly superior to the activation maps created by previous methods. Finally, the proposed approach achieved a COVID-19 detection performance of 94.96% sensitivity and 99.88% specificity.
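For reference, pixel-wise versions of the reported metrics can be computed from binary masks as in the sketch below (the paper's sensitivity and specificity refer to detection performance per image, not this pixel-level computation; the toy masks are invented).

```python
import numpy as np

def mask_metrics(pred, gt):
    """Pixel-wise F1 (Dice), sensitivity, and specificity for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    f1 = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return f1, sensitivity, specificity

# Toy 4x4 masks standing in for an infection map and its ground truth.
gt   = np.array([[0,0,1,1],[0,0,1,1],[0,0,0,0],[0,0,0,0]])
pred = np.array([[0,0,1,1],[0,1,1,0],[0,0,0,0],[0,0,0,0]])
print(mask_metrics(pred, gt))   # (0.75, 0.75, 0.916...)
```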


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Christian Crouzet ◽  
Gwangjin Jeong ◽  
Rachel H. Chae ◽  
Krystal T. LoPresti ◽  
Cody E. Dunn ◽  
...  

Abstract
Cerebral microhemorrhages (CMHs) are associated with cerebrovascular disease, cognitive impairment, and normal aging. One method to study CMHs is to analyze histological sections (5–40 μm) stained with Prussian blue. Currently, users manually and subjectively identify and quantify Prussian blue-stained regions of interest, a process that is prone to inter-individual variability and can lead to significant delays in data analysis. To improve this labor-intensive process, we developed and compared three digital pathology approaches to identify and quantify CMHs in Prussian blue-stained brain sections: (1) ratiometric analysis of RGB pixel values, (2) phasor analysis of RGB images, and (3) deep learning using a mask region-based convolutional neural network. We applied these approaches to a preclinical mouse model of inflammation-induced CMHs. One hundred CMHs were imaged using a 20× objective and an RGB color camera. To establish the ground truth, four users independently annotated Prussian blue-labeled CMHs. Compared to the ground truth, the deep learning and ratiometric approaches performed better than the phasor analysis approach. The deep learning approach had the highest precision of the three methods. The ratiometric approach was the most versatile and maintained accuracy, albeit with lower precision. Our data suggest that implementing these methods to analyze CMH images can drastically increase processing speed while maintaining precision and accuracy.
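A crude numpy sketch of what a ratiometric RGB analysis can look like (the thresholds and helper name are illustrative, not the paper's calibrated values): pixels are flagged when the blue channel sufficiently dominates red, since Prussian blue stains blue.

```python
import numpy as np

def prussian_blue_mask(rgb, ratio_thresh=1.4, min_blue=60):
    """Crude ratiometric segmentation: flag pixels whose blue channel
    dominates red. Thresholds are illustrative, not calibrated values."""
    rgb = rgb.astype(float)
    r, b = rgb[..., 0], rgb[..., 2]
    ratio = b / (r + 1e-6)                     # avoid division by zero
    return (ratio > ratio_thresh) & (b > min_blue)

# Toy 2x2 RGB image: one blue-ish pixel, three background pixels.
img = np.array([[[200, 190, 180], [ 40,  60, 150]],
                [[210, 200, 190], [190, 180, 170]]], dtype=np.uint8)
mask = prussian_blue_mask(img)
print(mask)                                    # stained area ~ mask.sum() * pixel_area
```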

