Improving Rule Induction Precision for Automated Annotation by Balancing Skewed Data Sets

Author(s):  
Gustavo E. A. P. A. Batista ◽  
Maria C. Monard ◽  
Ana L. C. Bazzan

2016 ◽  
Author(s):  
Maxwell W. Libbrecht ◽  
Oscar Rodriguez ◽  
Zhiping Weng ◽  
Jeffrey A. Bilmes ◽  
Michael M. Hoffman ◽  
...  

Abstract: Semi-automated genome annotation methods such as Segway enable understanding of chromatin activity. Here we present chromatin state annotations of 164 human cell types using 1,615 genomics data sets. To produce these annotations, we developed a fully-automated annotation strategy in which we train separate unsupervised annotation models on each cell type and use a machine learning classifier to automate the state interpretation step. Using these annotations, we developed a measure of the importance of each genomic position called the “conservation-associated activity score,” which we use to aggregate information across cell types into a multi-cell type view. The aggregated conservation-associated activity score provides a measure of importance directly attributable to a specific activity in a specific set of cell types. In contrast to evolutionary conservation, this measure is not biased to detect only elements shared with related species. Using the conservation-associated activity score, we combined all our annotations into a single, cell type-agnostic encyclopedia that catalogs all human transcriptional and regulatory elements, enabling easy and intuitive interpretation of the effect of genome variants on phenotype, such as in disease-associated, evolutionarily conserved or positively selected loci. These resources, including cell type-specific annotations, encyclopedia, and a visualization server, are available at http://noble.gs.washington.edu/proj/encyclopedia.

Author Summary: Genome annotation algorithms are an effective class of tools for understanding the function of the genome. These algorithms take as input a set of genome-wide measurements about the activity at each base pair in a given tissue, such as where a given protein is binding or how accessible the DNA is to being read by a protein. The genome is then partitioned and each segment is assigned a label such that positions with the same label exhibit similar patterns in the input data. Such annotations are widely used for many applications, such as to understand the mechanism of impact of a given genetic variant. Here we present, to our knowledge, the most comprehensive set of genome annotations created so far, encompassing 164 human cell types and including 1,615 genomics data sets. These comprehensive annotations are made possible by a strategy that automates the previous interpretation step. Furthermore, we present several methodological innovations that make these genome annotations more useful.


2017 ◽  
Author(s):  
S. Wu ◽  
Y. Toyoshima ◽  
M.S. Jang ◽  
M. Kanamori ◽  
T. Teramoto ◽  
...  

Abstract: Shifting from individual neuron analysis to whole-brain neural network analysis opens up new research opportunities for Caenorhabditis elegans (C. elegans). An automated data processing pipeline, including neuron detection, segmentation, tracking and annotation, will significantly improve the efficiency of analyzing whole-brain C. elegans imaging. The resulting large data sets may motivate new scientific discovery by exploiting many promising analysis tools for big data. In this study, we focus on the development of an automated annotation procedure. With only around 180 neurons in the central nervous system of C. elegans, the annotation of each individual neuron still remains a major challenge because of the high density in space, similarity in neuron shape, unpredictable distortion of the worm’s head during motion, intrinsic variations during worm development, etc. We use an ensemble learning approach to achieve around 25% error for a test based on real experimental data. We also demonstrate the importance of exploring extra sources of information for annotation other than the neuron positions.


Author(s):  
José Hernández Santiago ◽  
Jair Cervantes ◽  
Asdrúbal López-Chau ◽  
Farid García Lamont

2020 ◽  
Vol 57 (4) ◽  
pp. 444-464
Author(s):  
Gauss M. Cordeiro ◽  
Thiago G. Ramires ◽  
Edwin M. M. Ortega ◽  
Rodrigo R. Pescim

We define the extended beta family of distributions to generalize the beta generator pioneered by Eugene et al. [10]. This paper is cited in at least 970 scientific articles and extends more than fifty well-known distributions. Any continuous distribution can be generalized by means of this family. The proposed family can present greater flexibility to model skewed data. Some of its mathematical properties are investigated and maximum likelihood is adopted to estimate its parameters. Further, for different parameter settings and sample sizes, some simulations are conducted. The superiority of the proposed family is illustrated by means of two real data sets.


2014 ◽  
Vol 132 (3) ◽  
pp. 365-379 ◽  
Author(s):  
Patrick G. Clark ◽  
Jerzy W. Grzymala-Busse ◽  
Zdzislaw S. Hippe

2021 ◽  
Vol 93 (40) ◽  
pp. 13421-13425
Author(s):  
Dušan Veličković ◽  
Tamara Bečejac ◽  
Sergii Mamedov ◽  
Kumar Sharma ◽  
Namasivayam Ambalavanan ◽  
...  
