Enabling Genomics Pipelines in Commodity Personal Computers With Flash Storage

2021 ◽  
Vol 12 ◽  
Author(s):  
Nicola Cadenelli ◽  
Sang-Woo Jun ◽  
Jordà Polo ◽  
Andrew Wright ◽  
David Carrera ◽  
...  

Analysis of a patient's genomics data is the first step toward precision medicine. Such analyses are performed on expensive enterprise-class server machines because the input data sets are large and the intermediate data structures are even larger (TB-sized) and require random accesses. We present a general method for solving a specific genomics problem, mutation detection, on a cheap commodity personal computer (PC) with a small amount of DRAM. We construct and access large histograms of k-mers efficiently on external storage (SSDs) and apply our technique to a state-of-the-art reference-free genomics algorithm, SMUFIN, to create SMUFIN-F. We show that on two PCs, SMUFIN-F achieves the same throughput at only about one third (36%) of the hardware cost and half (45%) of the energy of SMUFIN on an enterprise-class server. To the best of our knowledge, SMUFIN-F is the first reference-free system that can detect somatic mutations in whole human genomes on commodity PCs. We believe our technique should apply to other k-mer- or n-gram-based algorithms.
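
To make the external-storage idea concrete, here is a minimal sketch of one standard way to build large k-mer histograms when they do not fit in DRAM: hash k-mers into on-disk partitions, then count one partition at a time. This is an illustration under assumed parameters, not the authors' SMUFIN-F implementation; all names are placeholders.

```python
# Minimal sketch of external-memory k-mer counting: k-mers are hashed
# into on-disk partitions so that each partition's histogram fits in a
# small amount of DRAM. Hypothetical parameters, not the paper's code.
import os
from collections import Counter

K = 30          # k-mer length (assumption for illustration)
N_PARTS = 64    # number of on-disk partitions

def kmers(seq, k=K):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def partition_kmers(reads, workdir):
    """Phase 1: stream reads, append each k-mer to its hash partition on SSD."""
    os.makedirs(workdir, exist_ok=True)
    files = [open(os.path.join(workdir, f"part{p}.txt"), "w") for p in range(N_PARTS)]
    for read in reads:
        for km in kmers(read):
            files[hash(km) % N_PARTS].write(km + "\n")
    for f in files:
        f.close()

def count_partitions(workdir):
    """Phase 2: count one partition at a time, so DRAM use stays bounded."""
    for p in range(N_PARTS):
        with open(os.path.join(workdir, f"part{p}.txt")) as f:
            hist = Counter(line.strip() for line in f)
        yield from hist.items()

# usage: partition_kmers(reads, "/tmp/kmer_parts")
#        for kmer, count in count_partitions("/tmp/kmer_parts"): ...
```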

2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Sayed Mohammad Ebrahim Sahraeian ◽  
Li Tai Fang ◽  
Konstantinos Karagiannis ◽  
Malcolm Moos ◽  
Sean Smith ◽  
...  

Abstract
Background: Accurate detection of somatic mutations is challenging but critical to understanding cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated its performance advantages on in silico data.
Results: In this study, we use the first comprehensive and well-characterized somatic reference data sets from the SEQC2 consortium to investigate best practices for using a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for a cancer cell line by the consortium, we identify the best strategy for building robust models on multiple data sets derived from samples representing real scenarios; for example, a model trained on a combination of real and spike-in mutations had the highest average performance.
Conclusions: The strategy identified in our study achieved high robustness across multiple sequencing technologies for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages, with significant superiority over conventional detection approaches in general, as well as in challenging situations such as low coverage, low variant allele frequency, DNA damage, and difficult genomic regions.
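
As a toy illustration of the training strategy the study found best (pooling real and spike-in examples before training a CNN), here is a hedged sketch. The tensor shapes, network, and data are hypothetical placeholders; this is not the NeuSomatic code.

```python
# Toy sketch: train a small CNN on the union of real and spike-in
# somatic-candidate examples. Shapes and names are placeholders.
import torch
import torch.nn as nn

class TinySomaticCNN(nn.Module):
    def __init__(self, channels=3, height=5, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * height * width, 2),  # somatic vs. not
        )

    def forward(self, x):
        return self.net(x)

# Hypothetical candidate tensors summarizing read alignments around a locus.
real_x = torch.randn(100, 3, 5, 32); real_y = torch.randint(0, 2, (100,))
spike_x = torch.randn(100, 3, 5, 32); spike_y = torch.randint(0, 2, (100,))

# Key idea from the study: train on real and spike-in examples combined.
x = torch.cat([real_x, spike_x]); y = torch.cat([real_y, spike_y])

model = TinySomaticCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):  # a few full-batch epochs for illustration
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```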


Author(s):  
K Sobha Rani

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of a state-of-the-art recommendation algorithm, SVD++, which inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, this work is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach, TrustSVD, achieves better accuracy than ten other counterparts and better handles the aforementioned issues.
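
As a reading aid, the prediction rule of an SVD++-with-trust model of this kind can be sketched as below. The notation is assumed here rather than quoted from the paper: $I_u$ are the items rated by user $u$, $T_u$ the users $u$ trusts, and $y_i$, $w_v$ are implicit feedback vectors for rated items and trusted users.

```latex
% Sketch of an SVD++ prediction rule extended with trust, matching the
% abstract's description; notation is assumed, not quoted.
\hat{r}_{uj} = \mu + b_u + b_j + q_j^{\top}\!\left(
    p_u
    + |I_u|^{-1/2} \sum_{i \in I_u} y_i   % implicit influence of rated items (SVD++)
    + |T_u|^{-1/2} \sum_{v \in T_u} w_v   % implicit influence of trusted users (trust extension)
\right)
```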


2021 ◽  
Vol 29 ◽  
pp. 115-124
Author(s):  
Xinlu Wang ◽  
Ahmed A.F. Saif ◽  
Dayou Liu ◽  
Yungang Zhu ◽  
Jon Atli Benediktsson

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operations for identifying which gene family may contain a given sequence, and pattern matching for DNA sequences has been a fundamental issue in biomedical engineering, biotechnology, and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimized multi-pattern matching algorithm with wildcards for DNA sequences. METHODS: The proposed method packs the patterns and a sliding window of text, and the window slides along the given packed text, matching against the stored packed patterns. RESULTS: Three data sets were used to test the performance of the proposed algorithm, which proved more efficient than its competitors because its operations map closely to machine-level instructions. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms state-of-the-art methods and is especially effective for DNA sequences.
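
To show why bitwise sliding-window matching maps to cheap machine-level instructions, here is a minimal bit-parallel (Shift-And) sketch of single-pattern matching with wildcards over a DNA text. It illustrates the general technique only; the paper's exact packing scheme is not reproduced.

```python
# Bit-parallel Shift-And matching with wildcards: one bitmask per
# symbol; a wildcard position matches every symbol, so its bit is set
# in every mask. Each text character costs a shift, an OR, and an AND.
def shift_and_search(text, pattern, wildcard="?"):
    m = len(pattern)
    masks = {c: 0 for c in "ACGT"}
    for i, p in enumerate(pattern):
        for c in "ACGT":
            if p == wildcard or p == c:
                masks[c] |= 1 << i
    state, hits = 0, []
    for j, c in enumerate(text):
        # Shift the match state and filter it through this symbol's mask.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(j - m + 1)  # match starts here, ends at j
    return hits

# e.g. shift_and_search("ACGTACGT", "A?G") -> [0, 4]
```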


2021 ◽  
Vol 11 (6) ◽  
pp. 2511
Author(s):  
Julian Hatwell ◽  
Mohamed Medhat Gaber ◽  
R. Muhammad Atif Azad

This research presents Gradient Boosted Tree High Importance Path Snippets (gbt-HIPS), a novel, heuristic method for explaining gradient boosted tree (GBT) classification models by extracting a single classification rule (CR) from the ensemble of decision trees that make up the GBT model. This CR contains the most statistically important boundary values of the input space as antecedent terms. The CR represents a hyper-rectangle of the input space inside which the GBT model is, very reliably, classifying all instances with the same class label as the explanandum instance. In a benchmark test using nine data sets and five competing state-of-the-art methods, gbt-HIPS offered the best trade-off between coverage (0.16–0.75) and precision (0.85–0.98). Unlike competing methods, gbt-HIPS is also demonstrably guarded against under- and over-fitting. A further distinguishing feature of our method is that, unlike much prior work, our explanations also provide counterfactual detail in accordance with widely accepted recommendations for what makes a good explanation.
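
To make the coverage/precision trade-off concrete, here is a hedged sketch of how a single hyper-rectangle rule can be scored against a trained GBT model. It illustrates the evaluation notions named in the abstract, not the gbt-HIPS extraction heuristic itself; the rule bounds below are hypothetical.

```python
# Score a hyper-rectangle classification rule against a GBT model:
# coverage = share of instances inside the rule, precision = share of
# covered instances the model assigns the target class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
gbt = GradientBoostingClassifier(random_state=0).fit(X, y)

# A hypothetical rule: lower/upper bounds per feature (+-inf = unbounded).
lower = np.array([-np.inf, 0.0, -np.inf, -np.inf])
upper = np.array([np.inf, np.inf, 1.5, np.inf])
target_class = 1  # class label of the explanandum instance

inside = np.all((X >= lower) & (X <= upper), axis=1)
pred = gbt.predict(X)

coverage = inside.mean()                           # share of data the rule covers
precision = (pred[inside] == target_class).mean()  # model agreement inside the rule
print(f"coverage={coverage:.2f}, precision={precision:.2f}")
```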


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. The method first applies pre-trained word vectors to represent document features using two different linear weighting methods. The resulting document vectors are then input to a neural-network-based classification model and used to train a text sentiment classifier. In this way, the emotional polarity of the text is propagated into the word vectors. Experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
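
A minimal sketch of the pipeline described above: represent each document as a linearly weighted average of pre-trained word vectors, then train a small neural sentiment classifier on the document vectors. The vocabulary, weights, and data here are toy placeholders, not the EWE model itself.

```python
# Document vectors as (weighted) averages of word vectors, fed to a
# small neural classifier. Random vectors stand in for pre-trained
# embeddings such as word2vec/GloVe lookups.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
dim = 50
vocab = {w: rng.normal(size=dim) for w in ["good", "bad", "great", "awful", "movie"]}

def doc_vector(tokens, weights=None):
    """Linear weighting of word vectors; uniform weights by default."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    w = np.ones(len(vecs)) if weights is None else np.asarray(weights)
    return np.average(vecs, axis=0, weights=w)

docs = [["good", "movie"], ["awful", "movie"], ["great"], ["bad", "movie"]]
labels = [1, 0, 1, 0]  # 1 = positive sentiment

X = np.stack([doc_vector(d) for d in docs])
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, labels)
print(clf.predict(X))
```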


2021 ◽  
Vol 7 (2) ◽  
pp. 21
Author(s):  
Roland Perko ◽  
Manfred Klopschitz ◽  
Alexander Almer ◽  
Peter M. Roth

Many scientific studies deal with person counting and density estimation from single images. Recently, convolutional neural networks (CNNs) have been applied to these tasks. Even though better results are often reported, it is frequently unclear where the improvements come from and whether the proposed approaches would generalize. Thus, the main goal of this paper was to identify the critical aspects of these tasks and to show how they limit state-of-the-art approaches. Based on these findings, we show how to mitigate these limitations. To this end, we implemented a CNN-based baseline approach, which we extended to deal with the identified problems: bias in the reference data sets, ambiguity in ground truth generation, and a mismatch between the evaluation metrics and the training loss function. The experimental results show that our modifications allow for significantly outperforming the baseline in terms of the accuracy of person counts and density estimation. In this way, we gain a deeper understanding of CNN-based person density estimation beyond the network architecture. Furthermore, our insights can help advance the field of person density estimation in general by highlighting current limitations in the evaluation protocols.
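
For context, here is a short sketch of the standard density-map formulation behind CNN person counting: a normalized Gaussian is placed at each annotated head position, so integrating the map recovers the count. The ground-truth ambiguity discussed above enters through the choice of the kernel width; the annotations below are hypothetical.

```python
# Build a ground-truth density map from point annotations; the count
# equals the integral of the map, and the Gaussian width sigma is the
# ambiguous ground-truth-generation choice.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, shape, sigma=4.0):
    dmap = np.zeros(shape, dtype=np.float64)
    for (row, col) in points:
        dmap[int(row), int(col)] += 1.0  # one unit of mass per person
    return gaussian_filter(dmap, sigma)  # spread mass, preserve the sum

heads = [(20, 30), (40, 80), (42, 82)]   # hypothetical annotations
dmap = density_map(heads, shape=(64, 128))
print(dmap.sum())                        # ~3.0: count = integral of density
```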


2021 ◽  
Vol 1 ◽  
pp. 2057-2066
Author(s):  
Nicola Viktoria Ganter ◽  
Behrend Bode ◽  
Paul Christoph Gembarski ◽  
Roland Lachmayer

Abstract
One of the arguments against increased use of repair is that, owing to constant technological progress, an often already outdated component would be restored. However, refurbishment also allows a component to be modified in order to upgrade it to the state of the art or to adapt it to changed requirements. Many existing approaches to Design for Upgradeability are based on a modular product architecture. In these approaches, however, the upgradeability of a product is considered only through the exchange of components. Nevertheless, the exchange and improvement of individual component regions within a refurbishment has already been carried out successfully using additive processes. In this paper, a general method is presented to support the reengineering process that is necessary to refurbish and upgrade a damaged component. To identify which areas can be replaced in the closed system of a component, the systematics of the modular product architecture are used. This allows dependencies between functions and component regions to be identified, making it possible to determine which functions can be integrated into the intended component.
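
Purely as an illustration (the paper is methodological and presents no code), the function-to-region dependency idea can be sketched as a small boolean matrix: a region is a candidate for additive replacement or upgrade if the functions it carries are not shared with regions that must remain untouched. All names below are hypothetical.

```python
# Toy function-to-component-region dependency matrix in the spirit of
# modular product architecture analysis; not the authors' tool.
functions = ["guide flow", "seal", "mount sensor"]
regions = ["base body", "flange", "bracket"]

# depends[f][r] = True if function f is realized (in part) by region r.
depends = {
    "guide flow":   {"base body": True,  "flange": False, "bracket": False},
    "seal":         {"base body": True,  "flange": True,  "bracket": False},
    "mount sensor": {"base body": False, "flange": False, "bracket": True},
}

def exclusive_functions(region):
    """Functions carried only by this region (safe to upgrade with it)."""
    return [f for f in functions
            if depends[f][region]
            and not any(depends[f][r] for r in regions if r != region)]

for r in regions:
    print(r, "->", exclusive_functions(r))
# e.g. 'bracket' exclusively carries 'mount sensor', so a sensor upgrade
# could be integrated by replacing only the bracket region.
```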


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Itziar Irigoien ◽  
Basilio Sierra ◽  
Concepción Arenas

In the problem of one-class classification (OCC), one of the classes, the target class, has to be distinguished from all other possible objects, considered nontargets. This situation arises in many biomedical problems, for example in diagnosis, image-based tumor recognition, or the analysis of electrocardiogram data. In this paper, an approach to OCC based on a typicality test is experimentally compared with reference state-of-the-art OCC techniques—Gaussian, mixture of Gaussians, naive Parzen, Parzen, and support vector data description—using biomedical data sets. We evaluate the procedures on twelve experimental data sets with not necessarily continuous data. As there are few benchmark data sets for one-class classification, all data sets considered in the evaluation have multiple classes; each class in turn is considered the target class, and the units in the other classes are treated as new units to be classified. The results of the comparison show the good performance of the typicality approach, which is applicable to high-dimensional data; it is worth mentioning that it can be used for any kind of data (continuous, discrete, or nominal), whereas the application of the state-of-the-art approaches is not straightforward when nominal variables are present.
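
The evaluation protocol described above can be sketched compactly: each class of a multi-class data set is treated in turn as the target class, and all other classes act as nontargets. In this sketch a one-class SVM stands in for the compared methods; the data set is illustrative.

```python
# Each-class-as-target OCC evaluation: fit on the target class only,
# then measure acceptance rates on targets and nontargets.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import OneClassSVM

X, y = load_iris(return_X_y=True)

for target in np.unique(y):
    train = X[y == target]                  # fit on target class only
    occ = OneClassSVM(gamma="scale", nu=0.1).fit(train)
    pred = occ.predict(X)                   # +1 = accepted as target
    accepted = pred == 1
    tpr = accepted[y == target].mean()      # target acceptance rate
    fpr = accepted[y != target].mean()      # nontarget acceptance rate
    print(f"target={target}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```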


Electronics ◽  
2021 ◽  
Vol 10 (11) ◽  
pp. 1260
Author(s):  
Savanna Denega Machado ◽  
João Elison da Rosa Tavares ◽  
Márcio Garcia Martins ◽  
Jorge Luis Victória Barbosa ◽  
Gabriel Villarrubia González ◽  
...  

New Internet of Things (IoT) applications are enabling the development of projects that help monitor people with different diseases in their daily lives. Alzheimer's disease affects neurological functions, and since a cure and the reversal of symptoms have not yet been discovered, patients need support to maintain maximum independence and security during this stage of life. IoT-based monitoring systems support caregivers in monitoring people with Alzheimer's disease (AD). This paper presents an ontology-based computational model that receives physiological data from external IoT applications and allows the identification of potentially dangerous behaviors in patients with AD. The main scientific contribution of this work is the specification of a model focused on Alzheimer's disease that uses the analysis of context histories and context prediction; considering the state of the art, it is the only one that uses the analysis of context histories to perform predictions. In this research, we also propose a simulator that generates activities of patients' daily lives, allowing the creation of data sets. These data sets, generated according to the standardization of the ontology, were used to evaluate the contributions of the model. The simulator generated 1026 scenarios to guide the predictions, which achieved an average accuracy of 97.44%. The experiments also yielded 20 relevant lessons on technological, medical, and methodological aspects, which are recorded in this article.
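
To give a feel for what context-history analysis involves, here is a hedged sketch: store an ordered context history per patient and predict the next context from past transitions. A simple first-order frequency model stands in for the paper's ontology-driven prediction, and the history below is a hypothetical simulated sequence.

```python
# Toy context-history prediction: count observed context transitions
# and predict the most frequent follow-up context.
from collections import Counter, defaultdict

# Hypothetical simulated daily-life contexts for one patient.
history = ["sleeping", "kitchen", "bathroom", "kitchen", "wandering",
           "kitchen", "bathroom", "kitchen", "wandering"]

transitions = defaultdict(Counter)
for cur, nxt in zip(history, history[1:]):
    transitions[cur][nxt] += 1

def predict_next(context):
    """Most frequent follow-up context; None if the context is unseen."""
    nxt = transitions.get(context)
    return nxt.most_common(1)[0][0] if nxt else None

print(predict_next("kitchen"))  # -> 'bathroom' (ties broken by insertion order)
```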


Author(s):  
Therese Rieckh ◽  
Jeremiah P. Sjoberg ◽  
Richard A. Anthes

Abstract
We apply the three-cornered hat (3CH) method to estimate refractivity, bending angle, and specific humidity error variances for a number of data sets widely used in research and/or operations: radiosondes, radio occultation (COSMIC, COSMIC-2), NCEP global forecasts, and nine reanalyses. We use a large number of data sets and combinations thereof to obtain insights into how error correlations among different data sets affect 3CH estimates. Error correlations may be caused by actual correlations of errors, representativeness differences, or imperfect co-location of the data sets. We show that the 3CH method discriminates among the data sets and how the error statistics of observations compare to those of state-of-the-art reanalyses and forecasts, as well as reanalyses that do not assimilate satellite data. We explore results for October and November of 2006 and 2019 over different latitudinal regions and show the error growth of the NCEP forecasts with time. Because of the importance of tropospheric water vapor to weather and climate, we compare error estimates of refractivity for dry and moist atmospheric conditions.
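
The core 3CH estimate is simple to state: with three collocated data sets whose errors are assumed mutually uncorrelated, the error variance of each set follows from the pairwise difference variances. A minimal numpy sketch, with synthetic values standing in for collocated refractivity profiles:

```python
# Three-cornered hat: var_err(A) = 0.5 * (var(A-B) + var(A-C) - var(B-C)),
# valid when the errors of A, B, C are mutually uncorrelated.
import numpy as np

rng = np.random.default_rng(1)
truth = rng.normal(size=10_000)
X = truth + rng.normal(scale=0.5, size=truth.shape)  # e.g. radio occultation
Y = truth + rng.normal(scale=1.0, size=truth.shape)  # e.g. radiosonde
Z = truth + rng.normal(scale=0.7, size=truth.shape)  # e.g. reanalysis

def tch_variance(a, b, c):
    """3CH error variance of a, given partners b and c."""
    return 0.5 * (np.var(a - b) + np.var(a - c) - np.var(b - c))

for name, (a, b, c) in {"X": (X, Y, Z), "Y": (Y, X, Z), "Z": (Z, X, Y)}.items():
    print(name, np.sqrt(tch_variance(a, b, c)))  # ~0.5, ~1.0, ~0.7
```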

