Audio-to-score singing transcription based on a CRNN-HSMM hybrid model

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2021.4 ◽

2021 ◽

Vol 10 ◽

Author(s):

Ryo Nishikimi ◽

Eita Nakamura ◽

Masataka Goto ◽

Kazuyoshi Yoshii

Keyword(s):

Hybrid Model ◽

Viterbi Algorithm ◽

Language Model ◽

Expressive Power ◽

Acoustic Model ◽

Musical Language ◽

Musical Score ◽

Singing Voice ◽

Generative Process ◽

Music Signal

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.

Download Full-text

Chord-aware automatic music transcription based on hierarchical Bayesian integration of acoustic and language models

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2018.17 ◽

2018 ◽

Vol 7 ◽

Author(s):

Yuta Ojima ◽

Eita Nakamura ◽

Katsutoshi Itoyama ◽

Kazuyoshi Yoshii

Keyword(s):

Latent Variables ◽

Language Model ◽

Language Models ◽

Sequential Dependency ◽

Acoustic Model ◽

Hierarchical Bayesian ◽

Generative Process ◽

Music Transcription ◽

Automatic Music Transcription ◽

Music Audio

This paper describes automatic music transcription with chord estimation for music audio signals. We focus on the fact that concurrent structures of musical notes such as chords form the basis of harmony and are considered for music composition. Since chords and musical notes are deeply linked with each other, we propose joint pitch and chord estimation based on a Bayesian hierarchical model that consists of an acoustic model representing the generative process of a spectrogram and a language model representing the generative process of a piano roll. The acoustic model is formulated as a variant of non-negative matrix factorization that has binary variables indicating a piano roll. The language model is formulated as a hidden Markov model that has chord labels as the latent variables and emits a piano roll. The sequential dependency of a piano roll can be represented in the language model. Both models are integrated through a piano roll in a hierarchical Bayesian manner. All the latent variables and parameters are estimated using Gibbs sampling. The experimental results showed the great potential of the proposed method for unified music transcription and grammar induction.

Download Full-text

Musical score recognition of solo music signal based on WAV file

Journal of Computer Applications ◽

10.3724/sp.j.1087.2009.00768 ◽

2009 ◽

Vol 29 (3) ◽

pp. 768-770

Author(s):

Xi-zheng CAO ◽

Chun-hong LIU ◽

Lin SUN

Keyword(s):

Musical Score ◽

Music Signal ◽

Solo Music

Download Full-text

A language model for Amdo Tibetan speech recognition

MATEC Web of Conferences ◽

10.1051/matecconf/202133606016 ◽

2021 ◽

Vol 336 ◽

pp. 06016

Author(s):

Taiben Suan ◽

Rangzhuoma Cai ◽

Zhijie Cai ◽

Ba Zu ◽

Baojia Gong

Keyword(s):

Speech Recognition ◽

Network Architecture ◽

Language Model ◽

Acoustic Model ◽

End To End

We built a language model which is based on Transformer network architecture, used attention mechanisms to dispensing with recurrence and convalutions entirely. Through the transliteration of Tibetan to International Phonetic Alphabets, the language model was trained using the syllables and phonemes of the Tibetan word as modeling units to predict corresponding Tibetan sentences according to the context semantics of IPA. And it combined with the acoustic model as the Tibetan speech recognition was compared with end-to-end Tibetan speech recognition.

Download Full-text

Improved mixed language speech recognition using asymmetric acoustic model and language model with code-switch inversion constraints

2013 IEEE International Conference on Acoustics, Speech and Signal Processing ◽

10.1109/icassp.2013.6639094 ◽

2013 ◽

Cited By ~ 5

Author(s):

Ying Li ◽

Pascale Fung

Keyword(s):

Speech Recognition ◽

Language Model ◽

Acoustic Model ◽

Mixed Language

Download Full-text

Analytics for Noisy Unstructured Text Data I

Encyclopedia of Artificial Intelligence ◽

10.4018/978-1-59904-849-9.ch015 ◽

2011 ◽

pp. 99-104

Author(s):

Shourya Roy ◽

L. Venkata Subramaniam

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Language Model ◽

Standard Form ◽

Acoustic Model ◽

Text Data ◽

Optical Character ◽

Contact Centers ◽

Noisy Text

Accdrnig to rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer be at the rghit pclae. Tihs is bcuseae the human mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.1 Unfortunately computing systems are not yet as smart as the human mind. Over the last couple of years a significant number of researchers have been focussing on noisy text analytics. Noisy text data is found in informal settings (online chat, SMS, e-mails, message boards, among others) and in text produced through automated speech recognition or optical character recognition systems. Noise can possibly degrade the performance of other information processing algorithms such as classification, clustering, summarization and information extraction. We will identify some of the key research areas for noisy text and give a brief overview of the state of the art. These areas will be, (i) classification of noisy text, (ii) correcting noisy text, (iii) information extraction from noisy text. We will cover the first one in this chapter and the later two in the next chapter. We define noise in text as any kind of difference in the surface form of an electronic text from the intended, correct or original text. We see such noisy text everyday in various forms. Each of them has unique characteristics and hence requires special handling. We introduce some such forms of noisy textual data in this section. Online Noisy Documents: E-mails, chat logs, scrapbook entries, newsgroup postings, threads in discussion fora, blogs, etc., fall under this category. People are typically less careful about the sanity of written content in such informal modes of communication. These are characterized by frequent misspellings, commonly and not so commonly used abbreviations, incomplete sentences, missing punctuations and so on. Almost always noisy documents are human interpretable, if not by everyone, at least by intended readers. SMS: Short Message Services are becoming more and more common. Language usage over SMS text significantly differs from the standard form of the language. An urge towards shorter message length facilitating faster typing and the need for semantic clarity, shape the structure of this non-standard form known as the texting language (Choudhury et. al., 2007). Text Generated by ASR Devices: ASR is the process of converting a speech signal to a sequence of words. An ASR system takes speech signal such as monologs, discussions between people, telephonic conversations, etc. as input and produces a string a words, typically not demarcated by punctuations as transcripts. An ASR system consists of an acoustic model, a language model and a decoding algorithm. The acoustic model is trained on speech data and their corresponding manual transcripts. The language model is trained on a large monolingual corpus. ASR convert audio into text by searching the acoustic model and language model space using the decoding algorithm. Most conversations at contact centers today between agents and customers are recorded. To do any processing of this data to obtain customer intelligence it is necessary to convert the audio into text. Text Generated by OCR Devices: Optical character recognition, or ‘OCR’, is a technology that allows digital images of typed or handwritten text to be transferred into an editable text document. It takes the picture of text and translates the text into Unicode or ASCII. . For handwritten optical character recognition, the rate of recognition is 80% to 90% with clean handwriting. Call Logs in Contact Centers: Today’s contact centers (also known as call centers, BPOs, KPOs) produce huge amounts of unstructured data in the form of call logs apart from emails, call transcriptions, SMS, chattranscripts etc. Agents are expected to summarize an interaction as soon as they are done with it and before picking up the next one. As the agents work under immense time pressure hence the summary logs are very poorly written and sometimes even difficult for human interpretation. Analysis of such call logs are important to identify problem areas, agent performance, evolving problems etc. In this chapter we will be focussing on automatic classification of noisy text. Automatic text classification refers to segregating documents into different topics depending on content. For example, categorizing customer emails according to topics such as billing problem, address change, product enquiry etc. It has important applications in the field of email categorization, building and maintaining web directories e.g. DMoz, spam filter, automatic call and email routing in contact center, pornographic material filter and so on.

Download Full-text

EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder

10.21437/interspeech.2021-771 ◽

2021 ◽

Author(s):

Zhengchen Liu ◽

Chenfeng Miao ◽

Qingying Zhu ◽

Minchuan Chen ◽

Jun Ma ◽

...

Keyword(s):

Acoustic Model ◽

Singing Voice ◽

Synthesis System ◽

Voice Synthesis

Download Full-text

Language model transformation applied to lightly supervised training of acoustic model for congress meetings

2009 IEEE International Conference on Acoustics, Speech and Signal Processing ◽

10.1109/icassp.2009.4960468 ◽

2009 ◽

Cited By ~ 4

Author(s):

Tatsuya Kawahara ◽

Masato Mimura ◽

Yuya Akita

Keyword(s):

Model Transformation ◽

Language Model ◽

Acoustic Model ◽

Supervised Training

Download Full-text

Attention please: modeling global and local context in glycan structure-function relationships

10.1101/2021.10.15.464532 ◽

2021 ◽

Author(s):

Bowen Dai ◽

Daniel E Mattox ◽

Chris Bailey-Kellogg

Keyword(s):

Structure Function ◽

Function Space ◽

Language Model ◽

Predictive Performance ◽

Structural Diversity ◽

Local Context ◽

Glycan Structure ◽

Generative Process ◽

Context Dependent ◽

Contextual Determinants

Glycans are found across the tree of life with remarkable structural diversity enabling critical contributions to diverse biological processes, ranging from facilitating host-pathogen interactions to regulating mitosis & DNA damage repair. While functional motifs within glycan structures are largely responsible for mediating interactions, the contexts in which the motifs are presented can drastically impact these interactions and their downstream effects. Here, we demonstrate the first deep learning method to represent both local and global context in the study of glycan structure-function relationships. Our method, glyBERT, encodes glycans with a branched biochemical language and employs an attention-based deep language model to learn biologically relevant glycan representations focused on the most important components within their global structures. Applying glyBERT to a variety of prediction tasks confirms the value of capturing rich context-dependent patterns in this attention-based model: the same monosaccharides and glycan motifs are represented differently in different contexts and thereby enable improved predictive performance relative to the previous state-of-the-art approaches. Furthermore, glyBERT supports generative exploration of context-dependent glycan structure-function space, moving from one glycan to "nearby" glycans so as to maintain or alter predicted functional properties. In a case study application to altering glycan immunogenicity, this generative process reveals the learned contextual determinants of immunogenicity while yielding both known and novel, realistic glycan structures with altered predicted immunogenicity. In summary, modeling the context dependence of glycan motifs is critical for investigating overall glycan functionality and can enable further exploration of glycan structure-function space to inform new hypotheses and synthetic efforts.

Download Full-text

Building Acoustic and Language Model for Continuous Speech Recognition in Bahasa Indonesia

Jurnal Teknik Informatika dan Sistem Informasi ◽

10.28932/jutisi.v6i2.2684 ◽

2020 ◽

Vol 6 (2) ◽

Author(s):

Vincent Elbert Budiman ◽

Andreas Widjaja

Keyword(s):

Speech Recognition ◽

Error Rate ◽

Language Model ◽

Beam Width ◽

Acoustic Model ◽

Continuous Speech ◽

Continuous Speech Recognition ◽

Word Error Rate ◽

Testing Data ◽

Bahasa Indonesia

Here a development of an Acoustic and Language Model is presented. Low Word Error Rate is an early good sign of a good Language and Acoustic Model. Although there are still parameters other than Words Error Rate, our work focused on building Bahasa Indonesia with approximately 2000 common words and achieved the minimum threshold of 25% Word Error Rate. There were several experiments consist of different cases, training data, and testing data with Word Error Rate and Testing Ratio as the main comparison. The language and acoustic model were built using Sphinx4 from Carnegie Mellon University using Hidden Markov Model for the acoustic model and ARPA Model for the language model. The models configurations, which are Beam Width and Force Alignment, directly correlates with Word Error Rate. The configurations were set to 1e-80 for Beam Width and 1e-60 for Force Alignment to prevent underfitting or overfitting of the acoustic model. The goals of this research are to build continuous speech recognition in Bahasa Indonesia which has low Word Error Rate and to determine the optimum numbers of training and testing data which minimize the Word Error Rate.

Download Full-text