AKT: Ancestry and Kinship Toolkit

Mapping Intimacies ◽

10.1101/047829 ◽

2016 ◽

Author(s):

Rudy Arthur ◽

Ole Schulz-Trieglaff ◽

Anthony J. Cox ◽

Jared Michael O’Connell

Keyword(s):

Data Clustering ◽

State Of The Art ◽

Source Code ◽

Statistical Genetics ◽

Data Sets ◽

Whole Genome ◽

Link Type ◽

Art Methods ◽

Invaluable Tool

Abstract. Ancestry and Kinship Toolkit (AKT) is a statistical genetics tool for analysing large cohorts of whole-genome sequenced samples. It can rapidly detect related samples, characterise sample ancestry, calculate correlation between variants, check Mendel consistency and perform data clustering. AKT brings together the functionality of many state-of-the-art methods, with a focus on speed and a unified interface. We believe it will be an invaluable tool for the curation of large WGS data-sets. Availability: The source code is available at https://illumina.github.io/[email protected], [email protected]
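One of the checks named in the abstract above, Mendel consistency, can be illustrated with a minimal sketch. This is not AKT's implementation: the genotype coding (alternate-allele counts 0, 1, 2), the restriction to autosomal biallelic sites without missing calls, and the function names are all assumptions made for illustration.

```python
# Genotypes are coded as alternate-allele counts: 0 = hom-ref, 1 = het, 2 = hom-alt.
# Alleles that a parent with a given genotype can transmit (0 = ref, 1 = alt).
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendel_consistent(father, mother, child):
    """Return True if the child's genotype can arise from one allele per parent."""
    possible = {a + b for a in TRANSMITTABLE[father] for b in TRANSMITTABLE[mother]}
    return child in possible


def mendel_error_rate(trio_genotypes):
    """Fraction of sites that violate Mendelian inheritance.

    trio_genotypes is an iterable of (father, mother, child) tuples, one per site.
    """
    sites = list(trio_genotypes)
    errors = sum(not is_mendel_consistent(f, m, c) for f, m, c in sites)
    return errors / len(sites) if sites else 0.0


if __name__ == "__main__":
    # Hypothetical genotypes for four sites in one trio.
    trio = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 1)]
    print(mendel_error_rate(trio))
```

A high error rate for a claimed trio would suggest a pedigree or sample mix-up, which is the kind of curation problem the abstract describes.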

Repeat aware evaluation of scaffolding tools

10.1101/148932 ◽

2017 ◽

Cited By ~ 1

Author(s):

Igor Mandric ◽

Sergey Knyazev ◽

Alex Zelikovsky

Keyword(s):

State Of The Art ◽

Source Code ◽

Evaluation Framework ◽

Whole Genome ◽

Accurate Assessment ◽

Challenging Problem ◽

Scalable Algorithm ◽

Link Type ◽

Representative Subset ◽

Evaluation Problem

Abstract. Summary: Genomic sequences are assembled into a variable but large number of contigs that should be scaffolded (ordered and oriented) to facilitate comparative or functional analysis. Finding a scaffolding is computationally challenging due to misassemblies, inconsistent coverage across the genome, and long repeats. An accurate assessment of scaffolding tools should take into account multiple locations of the same contig on the reference scaffolding rather than matching a repeat to a single best location. This makes mapping inferred scaffoldings onto the reference a computationally challenging problem. This paper formulates the repeat-aware scaffolding evaluation problem, which is to find a mapping of the inferred scaffolding onto the reference that maximizes the number of correct links, and proposes a scalable algorithm capable of handling large whole-genome datasets. Our novel scaffolding validation pipeline has been applied to assess most state-of-the-art scaffolding tools on a representative subset of GAGE datasets. Availability: The source code of this evaluation framework is available at https://github.com/mandricigor/repeat-aware. The documentation is hosted at https://mandricigor.github.io/repeat-aware.
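To make the notion of a "correct link" concrete, here is a deliberately simplified sketch. It is not the paper's algorithm: it checks each inferred link independently against every occurrence of a contig in the reference, and it ignores the global consistency that the paper's maximizing mapping enforces. The contig representation, a (name, strand) tuple, and all function names are assumptions.

```python
def flip(contig):
    """Return the same contig on the opposite strand."""
    name, strand = contig
    return (name, "-" if strand == "+" else "+")


def reference_links(reference_scaffolds):
    """Collect every adjacent oriented contig pair in the reference.

    A repeated contig contributes one link per occurrence, so every one of
    its reference locations is taken into account.
    """
    links = set()
    for scaffold in reference_scaffolds:
        for left, right in zip(scaffold, scaffold[1:]):
            links.add((left, right))
            links.add((flip(right), flip(left)))  # same link read from the other strand
    return links


def count_correct_links(inferred_scaffolds, reference_scaffolds):
    """Return (correct, total) over all adjacent pairs in the inferred scaffolds."""
    ref = reference_links(reference_scaffolds)
    correct = total = 0
    for scaffold in inferred_scaffolds:
        for left, right in zip(scaffold, scaffold[1:]):
            total += 1
            correct += (left, right) in ref
    return correct, total


if __name__ == "__main__":
    # Hypothetical reference in which contig R is a repeat with two locations.
    reference = [[("A", "+"), ("R", "+"), ("B", "+"), ("R", "+"), ("C", "+")]]
    inferred = [
        [("A", "+"), ("R", "+"), ("B", "+")],
        [("C", "-"), ("R", "-")],
        [("B", "+"), ("A", "+")],
    ]
    print(count_correct_links(inferred, reference))
```

Matching R to a single best location would mark one of the R links as wrong; checking all occurrences is what lets both count as correct here.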

GraphAligner: rapid and versatile sequence-to-graph alignment

Genome Biology ◽

10.1186/s13059-020-02157-2 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

The State ◽

Graph Alignment ◽

Link Type ◽

Long Reads

Abstract. Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools. Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner

Plotgardener: Cultivating precise multi-panel figures in R

10.1101/2021.09.08.459338 ◽

2021 ◽

Author(s):

Nicole E Kramer ◽

Eric S Davis ◽

Craig D Wenger ◽

Erika M Deoudes ◽

Sarah M Parker ◽

...

Keyword(s):

Programming Languages ◽

Genomic Data ◽

Data Access ◽

Manuscript Preparation ◽

Data Sets ◽

New Paradigm ◽

Link Type ◽

Bioconductor Project ◽

Invaluable Tool ◽

R Programming

The R programming language is one of the most widely used programming languages for transforming raw genomic data sets into meaningful biological conclusions through analysis and visualization, which has been largely facilitated by infrastructure and tools developed by the Bioconductor project. However, existing plotting packages rely on relative positioning and sizing of plots, which is often sufficient for exploratory analysis but is poorly suited for the creation of publication-quality multi-panel images inherent to scientific manuscript preparation. We present plotgardener, a coordinate-based genomic data visualization package that offers a new paradigm for multi-plot figure generation in R. Plotgardener allows precise, programmatic control over the placement, aesthetics, and arrangements of plots while maximizing user experience through fast and memory-efficient data access, support for a wide variety of data and file types, and tight integration with the Bioconductor environment. Plotgardener also allows precise placement and sizing of ggplot2 plots, making it an invaluable tool for R users and data scientists from virtually any discipline. Availability: Package: https://bioconductor.org/packages/plotgardener Code: https://github.com/PhanstielLab/plotgardener Documentation: https://phanstiellab.github.io/plotgardener/

Search for Compatible Source Code

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194021500169 ◽

2021 ◽

Vol 31 (03) ◽

pp. 477-502

Author(s):

Fuqi Cai ◽

Changjing Wang ◽

Qing Huang ◽

Zhengkang Zuo ◽

Yunyan Liao

Keyword(s):

Programming Language ◽

State Of The Art ◽

Source Code ◽

The State ◽

Third Party ◽

Search Model ◽

Code Search ◽

Art Methods ◽

Local Programming ◽

Cosine Distance

Third-party libraries continually evolve and produce multiple versions. Lucene, for example, released ten new versions (from version 7.7.0 to 8.4.0) in 2019. These versions confuse existing code search methods, which then retrieve source code that is not compatible with the local programming language. To solve this issue, we propose DCSE, a deep code search model based on evolving information (i.e. evolved code tokens and evolution description). DCSE first extracts evolved code tokens and evolution descriptions from the code evolution process; it then takes the evolved code tokens and the evolution description as one feature of the source code and of the code description, respectively. With this fuller representation, DCSE embeds source code and its code description into a high-dimensional shared vector space and reduces the cosine distance between their vectors. For ever-evolving third-party libraries like Lucene, the experimental results show that DCSE can retrieve source code that is compatible with the local programming language; it outperforms the state-of-the-art methods (e.g. CODEnn) by 56.9–60.9[Formula: see text] in RFVersion. For rarely-evolving third-party libraries, DCSE outperforms the state-of-the-art methods (e.g. CODEnn) by 4–11[Formula: see text] in Precision.
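The retrieval step implied by a shared vector space can be sketched as follows. This is only an illustration of ranking by cosine similarity (one minus the cosine distance), not DCSE itself: the embeddings would come from a trained model, and the toy vectors, snippet names, and function names here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_snippets(query_vec, snippet_vecs):
    """Return snippet names ordered from most to least similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in snippet_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings; real embeddings are high-dimensional.
    query = [1.0, 0.0]
    snippets = {
        "snippet_a": [1.0, 0.1],
        "snippet_b": [0.0, 1.0],
        "snippet_c": [-1.0, 0.0],
    }
    print(rank_snippets(query, snippets))
```

In a version-aware setting, the embedding of each snippet would also encode which library version it targets, so that incompatible snippets land farther from the query.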

GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment

10.1101/810812 ◽

2019 ◽

Cited By ~ 9

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

Graph Alignment ◽

Link Type ◽

Long Reads ◽

Reference Genomes ◽

Genome Graph

Abstract. Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pan-genome graph. Yet, so far this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to state-of-the-art tools, GraphAligner is 12x faster and uses 5x less memory, making it as efficient as aligning reads to linear reference genomes. When employing GraphAligner for error correction, we find it to be almost 3x more accurate and over 15x faster than extant tools. Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner

A Document-grounded Matching Network for Response Selection in Retrieval-based Chatbots

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/756 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xueliang Zhao ◽

Chongyang Tao ◽

Wei Wu ◽

Can Xu ◽

Dongyan Zhao ◽

...

Keyword(s):

Response Selection ◽

State Of The Art ◽

Empirical Studies ◽

Data Sets ◽

Matching Network ◽

Public Data ◽

Art Methods ◽

Hierarchical Interaction ◽

Different Parts

We present a document-grounded matching network (DGMN) for response selection that can power a knowledge-aware retrieval-based chatbot system. The challenges of building such a model lie in how to ground conversation contexts with background documents and how to recognize important information in the documents for matching. To overcome these challenges, DGMN fuses information in a document and a context into representations of each other, and dynamically determines whether grounding is necessary and how important different parts of the document and the context are, through hierarchical interaction with a response at the matching step. Empirical studies on two public data sets indicate that DGMN can significantly improve upon state-of-the-art methods while also offering good interpretability.

MolRep: A Deep Representation Learning Library for Molecular Property Prediction

10.1101/2021.01.13.426489 ◽

2021 ◽

Author(s):

Jiahua Rao ◽

Shuangjia Zheng ◽

Ying Song ◽

Jianwen Chen ◽

Chengtao Li ◽

...

Keyword(s):

State Of The Art ◽

Source Code ◽

Representation Learning ◽

Supplementary Information ◽

Data Sets ◽

Supplementary Data ◽

Property Prediction ◽

Average Rank ◽

Benchmark Data ◽

Classification Tasks

Abstract. Summary: Recently, novel representation learning algorithms have shown potential for predicting molecular properties. However, unified frameworks have not yet emerged for fairly measuring algorithmic progress, and experimental procedures of different representation models often lack rigor and are hard to reproduce. Herein, we have developed MolRep by unifying 16 state-of-the-art models across 4 popular molecular representations for application and comparison. Furthermore, we ran more than 12.5 million experiments to optimize hyperparameters for each method on 12 common benchmark data sets. As a result, CMPNN achieves the best results, ranking 1st in 5 out of 12 tasks with an average rank of 1.75. ECC performs well on classification tasks and MAT on regression tasks (each ranked 1st on 3 tasks), with average ranks of 2.71 and 2.6, respectively. Availability: The source code is available at: https://github.com/biomed-AI/MolRep Supplementary information: Supplementary data are available online.
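The "average rank" summary used in the abstract above can be computed as in the sketch below. The scores, task names, and model names are hypothetical placeholders, not MolRep results, and the sketch assumes higher scores are better on every task (an error metric on a regression task would need its sign flipped first). Ties are not handled specially.

```python
def average_ranks(scores_by_task):
    """Return {model: mean rank across tasks}, where rank 1 is the best score.

    scores_by_task maps each task name to a {model: score} dict in which
    higher scores are better.
    """
    ranks = {}
    for scores in scores_by_task.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


if __name__ == "__main__":
    # Hypothetical scores for three placeholder models on two placeholder tasks.
    scores = {
        "task_1": {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7},
        "task_2": {"model_a": 0.5, "model_b": 0.6, "model_c": 0.4},
    }
    print(average_ranks(scores))
```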

HLA-MA: Simple yet powerful matching of samples using HLA typing results

10.1101/066548 ◽

2016 ◽

Author(s):

Clemens Messerschmidt ◽

Manuel Holtgrewe ◽

Dieter Beule

Keyword(s):

Microsatellite Instability ◽

State Of The Art ◽

The State ◽

Hla Typing ◽

Whole Genome ◽

Consistency Checking ◽

Rna Seq ◽

Simple Method ◽

Link Type ◽

Typing Result

Abstract. Summary: We propose the simple method HLA-MA for consistency checking in pipelines operating on human HTS data. The method is based on the HLA typing result of the state-of-the-art method Opti-Type. Provided that there is sufficient coverage of the HLA loci, comparing HLA types allows for simple, fast, and robust matching of samples from whole genome, exome, and RNA-seq data. This approach is reliable for sample re-identification even for samples with high mutational loads, e.g., caused by microsatellite instability or POLE1 defects. Availability and Implementation: The software is implemented in Python 3 and freely available under the MIT license at https://github.com/bihealth/hlama and via [email protected]
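The idea of matching samples by comparing HLA types can be sketched as below. This is not the HLA-MA code: it assumes each sample's typing result has already been reduced to two alleles per locus, and the concordance measure, the threshold, the sample dictionaries, and the function names are all hypothetical.

```python
def hla_concordance(types_a, types_b):
    """Fraction of shared loci at which two samples carry the same allele pair.

    Each argument maps a locus name to a pair of allele names; the order of
    the two alleles within a pair is ignored.
    """
    shared = set(types_a) & set(types_b)
    if not shared:
        return 0.0
    same = sum(sorted(types_a[locus]) == sorted(types_b[locus]) for locus in shared)
    return same / len(shared)


def same_donor(types_a, types_b, threshold=1.0):
    """Call two samples a match if their concordance reaches the threshold (assumed value)."""
    return hla_concordance(types_a, types_b) >= threshold


if __name__ == "__main__":
    # Hypothetical typing results for a tumor/normal pair and an unrelated sample.
    tumor = {"A": ("A*01:01", "A*02:01"), "B": ("B*07:02", "B*08:01"), "C": ("C*07:01", "C*07:02")}
    normal = {"A": ("A*02:01", "A*01:01"), "B": ("B*07:02", "B*08:01"), "C": ("C*07:01", "C*07:02")}
    other = {"A": ("A*01:01", "A*02:01"), "B": ("B*07:02", "B*44:02"), "C": ("C*07:01", "C*07:02")}
    print(same_donor(tumor, normal), same_donor(tumor, other))
```

Because germline HLA types are compared rather than somatic variants, a check of this kind is plausibly insensitive to a high mutational load, which is the property the abstract highlights.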

Automated Gene Data Integration with Databio

10.1101/768077 ◽

2019 ◽

Author(s):

Robert W Reid ◽

Jacob W Ferrier ◽

Jeremy J Jay

Keyword(s):

Data Structures ◽

Computational Analysis ◽

Source Code ◽

Data Provenance ◽

Data Sets ◽

Time Data ◽

Specialized Knowledge ◽

Link Type ◽

Gene Data ◽

The Web

Abstract. Summary: Databio is capable of providing fast and accurate annotation of gene-oriented data sets, coupled with an integrated identifier conversion service to empower downstream data mining and computational analysis. Databio is enabled by fast real-time data structures applied to over 137 million unique identifiers, and uses automated heuristics to permit accurate data provenance without highly specialized knowledge and bioinformatics training. Availability and Implementation: Freely available on the web at https://datab.io/. Source code and binaries are freely available for download at https://github.com/joiningdata/databio/, implemented in Go and supported on Linux, Windows, and macOS.

ROBUST FEATURES FOR LEG DETECTION IN 2D LASER RANGE DATA

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-4-351-2018 ◽

2018 ◽

Vol XLII-4 ◽

pp. 351-358

Author(s):

D. Li ◽

L. Li ◽

M. Zhou ◽

X. Zuo

Keyword(s):

State Of The Art ◽

Smart Cities ◽

Data Sets ◽

Range Data ◽

People Detection ◽

Laser Range ◽

Adaboost Algorithm ◽

Below The Knee ◽

Knee Height ◽

Art Methods

Abstract. People detection in 2D laser range data is widely used in many applications, such as robotics, smart cities or regions, and intelligent driving. Most current methods for people detection based on a single laser range finder are actually leg detectors, as the sensor is always mounted below knee height. Current state-of-the-art methods share similar steps, including segmentation, feature extraction and machine learning-based classification, but use different features that perform well on their own experimental data. For researchers, it is important and desirable to know which features are more robust. In this paper, taking advantage of the fact that effective features can be selected by AdaBoost and assembled into a strong classifier, a set of features presented in state-of-the-art methods is combined with a set of features proposed by us to train a leg detector with the AdaBoost algorithm. This detector is assembled from effective features and can classify segments into leg and non-leg. Three open source data sets, including simple and complex scenarios, are used in the experiments to test the features and extract the important ones. To reduce the effect of segmentation on the final results, three segmentation methods are used simultaneously for experiments and analysis to ensure the reliability and credibility of our conclusions. Finally, 10 robust features for leg detection in 2D laser range data are presented based on the results.
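How AdaBoost can both select features and assemble them into a strong classifier is sketched below with one-feature decision stumps. This is a generic, minimal AdaBoost, not the authors' detector: the two toy features, the data, the number of rounds, and the function names are assumptions, and labels are coded as +1 (leg) and -1 (non-leg).

```python
import math


def best_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] >= thr else -pol for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol, pred)
    return best


def adaboost(X, y, rounds=5):
    """Train a list of weighted stumps: (alpha, feature index, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, pol, pred = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Raise the weight of misclassified samples, lower the rest, then renormalise.
        w = [wi * math.exp(-alpha * t * p) for wi, t, p in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if row[j] >= thr else -pol) for alpha, j, thr, pol in model)
    return 1 if score >= 0 else -1


def feature_importance(model):
    """Total stump weight per selected feature; features never selected are absent."""
    importance = {}
    for alpha, j, _, _ in model:
        importance[j] = importance.get(j, 0.0) + alpha
    return importance


if __name__ == "__main__":
    # Hypothetical segments: feature 0 separates the classes, feature 1 is noise.
    X = [[0.1, 5], [0.2, 1], [0.3, 4], [0.8, 2], [0.9, 5], [0.7, 1]]
    y = [-1, -1, -1, 1, 1, 1]
    model = adaboost(X, y)
    print(feature_importance(model))
```

Features that are repeatedly chosen across rounds and data sets accumulate weight, which is one plausible way to read "robust" in the sense the abstract uses.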