KNNCNV: A K-Nearest Neighbor Based Method for Detection of Copy Number Variations Using NGS Data

Author(s):  
Kun Xie ◽  
Kang Liu ◽  
Haque A K Alvi ◽  
Yuehui Chen ◽  
Shuzhen Wang ◽  
...  

Copy number variation (CNV) is a well-known type of genomic mutation associated with the development of human cancers. Detecting CNVs in the human genome is a crucial step in the pipeline that runs from mutation analysis to cancer diagnosis and treatment. Next-generation sequencing (NGS) data provide an unprecedented opportunity for CNV detection at base-level resolution, and many methods have been developed for CNV detection using NGS data. However, owing to the intrinsic complexity of CNV structures and of NGS data itself, accurate detection of CNVs still faces many challenges. In this paper, we present an alternative method, called KNNCNV (K-Nearest Neighbor based CNV detection), for detecting CNVs using NGS data. Compared with current methods, KNNCNV has several distinctive features: 1) it assigns an outlier score to each genome segment based solely on its first k nearest-neighbor distances, which is not only easy to extend to other data types but also improves the power to discover CNVs, especially local CNVs that are likely to be masked by their surrounding regions; 2) it employs the variational Bayesian Gaussian mixture model (VBGMM) to transform these scores into a series of binary labels without a user-defined threshold. To evaluate the performance of KNNCNV, we conduct experiments on both simulated and real sequencing data and compare it with peer methods. The experimental results show that KNNCNV can achieve a better F1-score than the other methods.
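The idea of scoring a segment by its distances to its k nearest neighbors can be illustrated with a minimal sketch. This is not the KNNCNV implementation: it assumes one-dimensional read-depth values per segment, uses the mean of the k smallest distances as the score, and omits the VBGMM labeling step entirely. The function name `knn_outlier_scores` and the sample depths are hypothetical.

```python
def knn_outlier_scores(values, k=3):
    """Score each value by the mean distance to its k nearest other values."""
    if not 1 <= k < len(values):
        raise ValueError("k must be between 1 and len(values) - 1")
    scores = []
    for i, v in enumerate(values):
        # Distances from this value to every other value, smallest first.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical per-segment read depths; the segment at index 4 is amplified.
depths = [100.0, 102.0, 98.0, 101.0, 160.0, 99.0, 103.0]
scores = knn_outlier_scores(depths, k=3)

# Index of the segment with the highest outlier score.
top = max(range(len(depths)), key=scores.__getitem__)
```

In this toy input the amplified segment sits far from every other value, so it receives the largest score. Turning such scores into binary labels without a fixed cutoff is the part the abstract attributes to the VBGMM.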

Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1274
Author(s):  
Daniel Bonet-Solà ◽  
Rosa Ma Alsina-Pagès

Acoustic event detection and analysis has been widely developed in the last few years for its valuable applications in monitoring elderly or dependent people, in surveillance, in multimedia retrieval, and even in biodiversity metrics in natural environments. For this purpose, sound source identification is key to providing a smart technological answer to all the aforementioned applications. Diverse types of sounds and varied environments, together with a number of application challenges, widen the range of candidate artificial intelligence algorithms. This paper presents a comparative study combining several feature extraction algorithms (Mel Frequency Cepstrum Coefficients (MFCC), Gammatone Cepstrum Coefficients (GTCC), and Narrow Band (NB)) with a group of machine learning algorithms (k-Nearest Neighbor (kNN), Neural Networks (NN), and Gaussian Mixture Model (GMM)), tested over five different acoustic environments. This work aims to detail a best-practice method and to evaluate the reliability of this general-purpose algorithm for all the classes. Preliminary results show that most combinations of feature extraction and machine learning give acceptable results in most of the described corpora. Nevertheless, one combination outperforms the others: GTCC together with kNN, whose results are further analyzed for all the corpora.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xinping Fan ◽  
Guanghao Luo ◽  
Yu S. Huang

Abstract. Background: Copy number alterations (CNAs), due to their large impact on the genome, have been an important contributing factor to oncogenesis and metastasis. Detecting genomic alterations from the shallow-sequencing data of a low-purity tumor sample remains a challenging task. Results: We introduce Accucopy, a method to infer total copy numbers (TCNs) and allele-specific copy numbers (ASCNs) from challenging low-purity and low-coverage tumor samples. Accucopy adopts many robust statistical techniques such as kernel smoothing of coverage differentiation information to discern signals from noise and combines ideas from time-series analysis and the signal-processing field to derive a range of estimates for the period in a histogram of coverage differentiation information. Statistical learning models such as the tiered Gaussian mixture model, the expectation–maximization algorithm, and sparse Bayesian learning were customized and built into the model. Accucopy is implemented in C++/Rust, packaged in a docker image, and supports non-human samples; more at http://www.yfish.org/software/. Conclusions: We describe Accucopy, a method that can predict both TCNs and ASCNs from low-coverage low-purity tumor sequencing data. Through comparative analyses in both simulated and real-sequencing samples, we demonstrate that Accucopy is more accurate than Sclust, ABSOLUTE, and Sequenza.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12564
Author(s):  
Taifu Wang ◽  
Jinghua Sun ◽  
Xiuqing Zhang ◽  
Wen-Jing Wang ◽  
Qing Zhou

Background: Copy-number variants (CNVs) have been recognized as one of the major causes of genetic disorders. Reliable detection of CNVs from genome sequencing data is in strong demand for disease research. However, current software for detecting CNVs has high false-positive rates and needs further improvement. Methods: Here, we propose a novel post-processing approach for CNV prediction (CNV-P), a machine-learning framework that can efficiently remove false-positive fragments from the results of CNV detection tools. A series of CNV signals around the putative CNV fragments, such as read depth (RD), split reads (SR) and read pairs (RP), were defined as features to train a classifier. Results: The prediction results on several real biological datasets showed that our models could accurately classify CNVs at over 90% precision and 85% recall, which greatly improves the performance of state-of-the-art algorithms. Furthermore, our results indicate that CNV-P is robust to different CNV sizes and sequencing platforms. Conclusions: Our framework for classifying high-confidence CNVs could improve both basic research and clinical diagnosis of genetic diseases.
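For readers unfamiliar with the metrics quoted above, precision, recall and their harmonic mean (the F1-score) can be computed from confusion counts as in the sketch below. The function name and the counts are hypothetical and are not taken from the paper; they only show the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for a filtered CNV call set (illustration only).
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=15)
```

Removing false-positive calls lowers `fp` and so raises precision. Recall drops only if true CNVs are discarded along with them, which is the trade-off a post-processing filter has to manage.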


2020 ◽  
Author(s):  
Getiria Onsongo ◽  
Ham Ching Lam ◽  
Matthew Bower ◽  
Bharat Thyagarajan

Abstract. Objective: Detection of small copy number variations (CNVs) in clinically relevant genes is routinely being used to aid diagnosis. We recently developed a tool, CNV-RF, capable of detecting small clinically relevant CNVs. CNV-RF was designed for small gene panels and did not scale well to large gene panels. On large gene panels, CNV-RF routinely failed due to memory limitations. When successful, it took about 2 days to complete a single analysis, making it impractical for routinely analyzing large gene panels. We need a reliable tool capable of detecting CNVs in the clinic that scales well to large gene panels. Results: We have developed Hadoop-CNV-RF, a scalable implementation of CNV-RF. Hadoop-CNV-RF is a freely available tool capable of rapidly analyzing large gene panels. It takes advantage of Hadoop, a big data framework developed to analyze large amounts of data. Preliminary results show it reduces analysis time from about 2 days to less than 4 hours and can seamlessly scale to large gene panels. Hadoop-CNV-RF has been clinically validated for targeted capture data and is currently being used in a CLIA molecular diagnostics laboratory. Its availability and usage instructions are publicly available at: https://github.com/getiria-onsongo/hadoop-cnvrf-public.


2013 ◽  
Vol 13 (03) ◽  
pp. 1350033 ◽  
Author(s):  
OLIVER FAUST ◽  
WENWEI YU ◽  
NAHRIZUL ADIB KADRI

This paper describes a computer-based system for identifying normal and alcoholic electroencephalography (EEG) signals. The identification system was constructed from feature extraction and classification algorithms. The feature extraction was based on wavelet packet decomposition (WPD) and energy measures. Feature fitness was established through the statistical t-test method. The extracted features were used as training and test data for a competitive 10-fold cross-validated analysis of six classification algorithms. This analysis showed that, with an accuracy of 95.8%, the k-nearest neighbor (k-NN) algorithm outperforms naïve Bayes classification (NBC), fuzzy Sugeno classifier (FSC), probabilistic neural network (PNN), Gaussian mixture model (GMM), and decision tree (DT). The 10-fold stratified cross-validation lends reliability to the result; we are therefore confident in stating that EEG signals can be used to automate both diagnosis and treatment monitoring of alcoholic patients. Such automation can reduce costs by relieving medical experts of routine and administrative tasks.
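The evaluation protocol described above, stratified 10-fold cross-validation of a k-NN classifier, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the two-dimensional Gaussian points stand in for the WPD energy features, the data are synthetic, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to query."""
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))
    nearest = sorted(train, key=sq_dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stratified_folds(data, n_folds, seed=0):
    """Deal the samples of each class round-robin into n_folds folds,
    so every fold keeps roughly the overall class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in data:
        by_label[item[1]].append(item)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        for i, item in enumerate(items):
            folds[i % n_folds].append(item)
    return folds


def cross_validate(data, n_folds=10, k=3):
    """Return overall accuracy: each fold is tested once on a model
    built from the remaining folds."""
    folds = stratified_folds(data, n_folds)
    correct = 0
    for i, test_fold in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in test_fold)
    return correct / len(data)


# Synthetic, well-separated two-class data (not EEG features).
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "normal") for _ in range(50)]
data += [((rng.gauss(6, 1), rng.gauss(6, 1)), "alcoholic") for _ in range(50)]
accuracy = cross_validate(data, n_folds=10, k=3)
```

Stratifying the folds keeps the class balance of each test fold close to that of the full dataset, which makes the per-fold accuracies comparable and the averaged figure less sensitive to how the data happened to be split.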


Author(s):  
Mohamed Loey ◽  
Mukdad Rasheed Naman ◽  
Hala Helmy Zayed

Blood disease detection and diagnosis using blood cell images is an interesting and active research area in both the computer and medical fields. Many techniques have been developed to examine blood samples to detect leukemia; they fall into traditional techniques and deep learning (DL) techniques. This article presents a survey of the traditional techniques and DL approaches that have been employed in blood disease diagnosis based on blood cell images, and compares the two approaches in terms of quality of assessment, accuracy, cost and speed. The article covers 19 studies: 11 used traditional techniques based on image processing and machine learning (ML) algorithms such as K-means, K-nearest neighbor (KNN), Naïve Bayes, and Support Vector Machine (SVM), and 8 used advanced techniques based on DL, particularly Convolutional Neural Networks (CNNs), which are the most widely used in the field of blood image disease detection because they are highly accurate, fast, and have the lowest cost. In addition, the article analyzes a number of recent works in the field, including the size of the dataset, the methodologies used, the results obtained, etc. Finally, based on the conducted study, it can be concluded that CNNs have achieved considerable success in the field in terms of feature extraction, classification, time, and accuracy, and at a lower cost, in the detection of leukemia.

