Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

bioRxiv ◽

10.1101/2021.01.24.427982 ◽

2021 ◽

Author(s):

T. M. Porter ◽

M. Hajibabaei

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

DNA Barcoding ◽

Hidden Markov ◽

DNA Barcode ◽

Open Reading Frame ◽

Protein Coding ◽

Reading Frame ◽

Frame Length ◽

Open Reading Frame Length

Abstract Background Pseudogenes are non-functional copies of protein coding genes that typically follow a different molecular evolutionary path as compared to functional genes. The inclusion of pseudogene sequences in DNA barcoding and metabarcoding analysis can lead to misleading results. None of the most widely used bioinformatic pipelines used to process marker gene (metabarcode) high throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this study is to develop a method to screen for obvious pseudogenes in large COI metabarcode datasets. We do this by: 1) describing gene and pseudogene characteristics from a simulated DNA barcode dataset, 2) showing the impact of two different pseudogene removal methods on mock metabarcode datasets with simulated pseudogenes, and 3) incorporating a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. Open reading frame length and sequence bit scores from hidden Markov model (HMM) profile analysis were used to detect pseudogenes. Results Our simulations showed that it was more difficult to identify pseudogenes from shorter amplicon sequences such as those typically used in metabarcoding (∼300 bp) compared with full length DNA barcodes that are used in the construction of barcode libraries (∼650 bp). It was also more difficult to identify pseudogenes in datasets where there is a high percentage of pseudogene sequences. We show that existing bioinformatic pipelines used to process metabarcode sequences already remove some apparent pseudogenes, especially in the rare sequence removal step, but the addition of a pseudogene filtering step can remove more. Conclusions The combination of open reading frame length and hidden Markov model profile analysis can be used to effectively screen out obvious pseudogenes from large datasets.
There is more to learn from COI pseudogenes such as their frequency in DNA barcode and metabarcoding studies, their taxonomic distribution, and evolution. Thus, we encourage the submission of verified COI pseudogenes to public databases to facilitate future studies.


Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

BMC Bioinformatics ◽

10.1186/s12859-021-04180-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

T. M. Porter ◽

M. Hajibabaei

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

DNA Barcoding ◽

Profile Analysis ◽

Hidden Markov ◽

Open Reading Frame ◽

Protein Coding ◽

Reading Frame ◽

Frame Length ◽

Open Reading Frame Length

Abstract Background Pseudogenes are non-functional copies of protein coding genes that typically follow a different molecular evolutionary path as compared to functional genes. The inclusion of pseudogene sequences in DNA barcoding and metabarcoding analysis can lead to misleading results. None of the most widely used bioinformatic pipelines used to process marker gene (metabarcode) high throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this study is to develop a method to screen for nuclear mitochondrial DNA segments (nuMTs) in large COI datasets. We do this by: (1) describing gene and nuMT characteristics from an artificial COI barcode dataset, (2) showing the impact of two different pseudogene removal methods on perturbed community datasets with simulated nuMTs, and (3) incorporating a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. Open reading frame length and sequence bit scores from hidden Markov model (HMM) profile analysis were used to detect pseudogenes. Results Our simulations showed that it was more difficult to identify nuMTs from shorter amplicon sequences such as those typically used in metabarcoding compared with full length DNA barcodes that are used in the construction of barcode libraries. It was also more difficult to identify nuMTs in datasets where there is a high percentage of nuMTs. Existing bioinformatic pipelines used to process metabarcode sequences already remove some nuMTs, especially in the rare sequence removal step, but the addition of a pseudogene filtering step can remove up to 5% of sequences even when other filtering steps are in place. Conclusions Open reading frame length filtering alone or combined with hidden Markov model profile analysis can be used to effectively screen out apparent pseudogenes from large datasets.
There is more to learn from COI nuMTs such as their frequency in DNA barcoding and metabarcoding studies, their taxonomic distribution, and evolution. Thus, we encourage the submission of verified COI nuMTs to public databases to facilitate future studies.
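
The open reading frame length screen described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the interquartile-range cutoff, and the default stop codons (TAA and TAG, as in the invertebrate mitochondrial genetic code) are assumptions made here for illustration, and the sequences are assumed to be already oriented on the coding strand. The HMM bit score step is not shown.

```python
from statistics import quantiles

# Assumption: invertebrate mitochondrial code, where only TAA and TAG are stops.
STOP_CODONS = frozenset({"TAA", "TAG"})


def longest_orf_length(seq, stop_codons=STOP_CODONS):
    """Return the length (nt) of the longest stop-free run over the 3 forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        # Walk complete codons only; a trailing partial codon is ignored.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in stop_codons:
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 3
                best = max(best, run)
    return best


def flag_short_orfs(seqs, k=1.5):
    """Flag sequences whose longest ORF is an unusually short outlier.

    `seqs` maps a sequence name to its nucleotide string (at least two entries).
    The cutoff Q1 - k * IQR is a hypothetical choice, not taken from the paper.
    """
    lengths = {name: longest_orf_length(s) for name, s in seqs.items()}
    q1, _, q3 = quantiles(list(lengths.values()), n=4)
    cutoff = q1 - k * (q3 - q1)
    return {name for name, n in lengths.items() if n < cutoff}
```

A dataset-relative cutoff like this depends on most sequences being intact, which is consistent with the abstract's observation that screening becomes harder when a high percentage of the sequences are nuMTs.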


Detection of Short Protein Coding Regions within the Cyanobacterium Genome: Application of the Hidden Markov Model

DNA Research ◽

10.1093/dnares/3.6.355 ◽

1996 ◽

Vol 3 (6) ◽

pp. 355-361 ◽

Cited By ~ 15

Author(s):

T. Yada

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Hidden Markov ◽

Protein Coding ◽

Coding Regions ◽

Short Protein


Traffic Analysis Of Anonymity Protocol Using Hidden Markov Model (hmm)-based Model Confidence In The Media

10.33811/1847-003-013-010 ◽

2018 ◽

pp. 187

Author(s):

قاسم عبود مهدى

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Hidden Markov ◽

Traffic Analysis ◽

Model Confidence ◽

The Media


Multiple Sequence Alignment and Profile Analysis of Protein Family Using Hidden Markov Model

International Journal of Scientific Research ◽

10.15373/22778179/june2013/66 ◽

2012 ◽

Vol 2 (6) ◽

pp. 208-211

Author(s):

Navjot Kaur ◽


Rajbir Singh Cheema ◽

Harmandeep Singh

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Profile Analysis ◽

Hidden Markov ◽

Protein Family ◽

Multiple Sequence


Auscultating Diagnosis for Hemodialysis Shunt Stenosis using a Self-Organizing Map and Hidden Markov Model

IEEJ Transactions on Electronics Information and Systems ◽

10.1541/ieejeiss.132.1589 ◽

2012 ◽

Vol 132 (10) ◽

pp. 1589-1594 ◽

Cited By ~ 2

Author(s):

Hayato Waki ◽

Yutaka Suzuki ◽

Osamu Sakata ◽

Mizuya Fukasawa ◽

Hatsuhiro Kato

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Hidden Markov ◽

Self Organizing Map ◽

Self Organizing


A Fast Voice Command Recognition Algorithm Based on the Hidden Markov Model Stationary Distribution

Vestnik MEI ◽

10.24160/1993-6982-2018-5-65-72 ◽

2018 ◽

Vol 5 (5) ◽

pp. 65-72

Author(s):

Pavel A. Paramonov ◽

Ivan V. Ognev

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Stationary Distribution ◽

Hidden Markov ◽

Recognition Algorithm ◽

Voice Command


Engaging Voluntary Contributions in Online Communities: A Hidden Markov Model

MIS Quarterly ◽

10.25300/misq/2018/14196 ◽

2018 ◽

Vol 42 (1) ◽

pp. 83-100 ◽

Cited By ~ 21

Author(s):

Wei Chen ◽


Xiahua Wei ◽

Kevin Xiaoguo Zhu ◽


...

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Online Communities ◽

Hidden Markov ◽

Voluntary Contributions


Behavior Monitoring of a Process in a Computational Grid Using Hidden Markov Model

i-manager's Journal on Computer Science ◽

10.26634/jcom.1.3.2546 ◽

2013 ◽

Vol 1 (3) ◽

pp. 22-30

Author(s):

Shaik Naseera

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Hidden Markov ◽

Computational Grid ◽

Behavior Monitoring


Ground mobile target tracking by Hidden Markov model = Theo dõi mục tiêu di động mặt đất bằng mô hình Hidden Markov.

Science and Technology Development Journal ◽

10.3125/jstd.v9i12.365 ◽

2007 ◽

Vol 9 (12) ◽

Author(s):

Hieu Quan Huynh ◽

Thanh Dinh Vu

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Target Tracking ◽

Hidden Markov ◽

Mobile Target


Implementation of Android Based Speech Recognition for Indonesian Geography Dictionary

Jurnal ULTIMA Computing ◽

10.31937/sk.v7i2.296 ◽

2016 ◽

Vol 7 (2) ◽

pp. 76-82

Author(s):

Hugeng Hugeng ◽

Edbert Hansel

Keyword(s):

Speech Recognition ◽

Markov Model ◽

Hidden Markov Model ◽

Hidden Markov ◽

Spoken Word ◽

Silent Condition ◽

Word Detection ◽

Index Terms ◽

To Receive ◽

Noisy Condition

We have built GAIA, an Android speech recognition application for an Indonesian geography dictionary. The application uses a smartphone to receive spoken-word input from the user. Recognition uses the Hidden Markov Model implementation in the Pocketsphinx library, with phonemes following Indonesian phoneme rules. An advantage of the application is that it works without internet access. In testing, word detection was evaluated under four conditions to determine accuracy: near silent, near noisy, far silent, and far noisy. The testing and analysis show that GAIA can be built as an Android speech recognition application for an Indonesian geography dictionary, with average word recognition accuracy of 52.87% in the near silent condition, 14.5% in the near noisy condition, 23.2% in the far silent condition, and 2.8% in the far noisy condition. Index Terms: speech recognition, Indonesian geography dictionary, Hidden Markov Model, Pocketsphinx, Android.
