Language modelling for biological sequences – curated datasets and baselines

2020 ◽  
Author(s):  
Jose Juan Almagro Armenteros ◽  
Alexander Rosenberg Johansen ◽  
Ole Winther ◽  
Henrik Nielsen

Abstract Motivation: Language modelling (LM) on biological sequences is an emergent topic in the field of bioinformatics. Current research has shown that language modelling of proteins can create context-dependent representations that can be applied to improve performance on different protein prediction tasks. However, little effort has been directed towards analyzing the properties of the datasets used to train language models. Additionally, only performance on cherry-picked downstream tasks is used to assess the capacity of LMs. Results: We analyze the entire UniProt database and investigate the different properties that can bias or hinder the performance of LMs, such as homology, domain of origin, quality of the data, and completeness of the sequence. We evaluate n-gram and Recurrent Neural Network (RNN) LMs to assess the impact of these properties on performance. To our knowledge, this is the first protein dataset with an emphasis on language modelling. Our inclusion of properties specific to proteins gives a detailed analysis of how well natural language processing methods work on biological sequences. We find that organism domain and quality of data have an impact on performance, while the completeness of the proteins has little influence. The RNN-based LM can learn to model Bacteria, Eukarya, and Archaea, but struggles with Viruses. By using the LM we can also generate novel proteins that are shown to be similar to real proteins. Availability and implementation: https://github.com/alrojo/UniLanguage
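As a rough illustration of what an n-gram baseline on protein sequences involves, the sketch below trains a bigram model over amino-acid residues and reports per-residue perplexity. It is not the authors' implementation: the add-k smoothing, the `<`/`>` boundary markers, the function names, and the toy sequences are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
BOS, EOS = "<", ">"                   # hypothetical start/end markers


def train_bigram(sequences):
    """Count bigrams and their left contexts over padded sequences."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = BOS + seq + EOS
        for prev, cur in zip(padded, padded[1:]):
            bigrams[prev, cur] += 1
            contexts[prev] += 1
    return bigrams, contexts


def perplexity(sequences, bigrams, contexts, k=1.0):
    """Per-residue perplexity under add-k smoothed bigram probabilities."""
    vocab_size = len(AMINO_ACIDS) + 1  # residues plus the end marker
    log_prob, n_tokens = 0.0, 0
    for seq in sequences:
        padded = BOS + seq + EOS
        for prev, cur in zip(padded, padded[1:]):
            p = (bigrams[prev, cur] + k) / (contexts[prev] + k * vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Toy, made-up sequences; not taken from UniProt.
train = ["MKTAYIAKQR", "MKVLAAGIVG", "MKTLLLTLVV"]
bigrams, contexts = train_bigram(train)
ppl_seen = perplexity(["MKTAYIAKQR"], bigrams, contexts)
ppl_unseen = perplexity(["WWWWWWWWWW"], bigrams, contexts)
```

A model with no training data assigns every next symbol probability 1/21, so its perplexity equals the vocabulary size; a trained model should score below that on sequences resembling its training data.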

Author(s):  
M. D. Riazur Rahman ◽  
M. D. Tarek Habib ◽  
M. D. Sadekur Rahman ◽  
Gazi Zahirul Islam ◽  
M. D. Abbas Ali Khan

N-gram based language models are very popular and extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem. Kneser-Ney is one of the most prominently used and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking grammatical correctness of Bangla sentences, which showed promising results, outperforming previous methods. In this work, we proposed an improved method using a Kneser-Ney smoothing based n-gram language model for grammar checking and performed a comparative performance analysis between Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provided an improved technique for calculating the optimum threshold, which further enhanced the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking grammatical correctness of Bangla sentences.
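To make the idea concrete, here is a minimal sketch of an interpolated Kneser-Ney bigram model used to score sentences, where a low score would suggest a sentence is ungrammatical. It is an illustration only, not the authors' system: the class name, the fixed discount of 0.75, the geometric-mean score, and the English toy corpus (standing in for Bangla data) are all assumptions, and the paper's threshold calculation is not reproduced.

```python
from collections import Counter, defaultdict


class KneserNeyBigram:
    """Interpolated Kneser-Ney bigram model with a single fixed discount."""

    def __init__(self, sentences, discount=0.75):
        self.d = discount
        self.bigrams = Counter()
        self.context_counts = Counter()
        followers = defaultdict(set)   # distinct words seen after a context
        preceders = defaultdict(set)   # distinct contexts seen before a word
        for words in sentences:
            padded = ["<s>"] + words + ["</s>"]
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[prev, cur] += 1
                self.context_counts[prev] += 1
                followers[prev].add(cur)
                preceders[cur].add(prev)
        self.n_followers = {w: len(s) for w, s in followers.items()}
        self.n_preceders = {w: len(s) for w, s in preceders.items()}
        self.n_bigram_types = len(self.bigrams)

    def continuation(self, word):
        """Share of distinct bigram types that end in `word`."""
        return self.n_preceders.get(word, 0) / self.n_bigram_types

    def prob(self, prev, word):
        c_prev = self.context_counts[prev]
        if c_prev == 0:  # unseen context: rely on the continuation estimate
            return self.continuation(word)
        discounted = max(self.bigrams[prev, word] - self.d, 0.0) / c_prev
        backoff_weight = self.d * self.n_followers[prev] / c_prev
        return discounted + backoff_weight * self.continuation(word)

    def sentence_score(self, words):
        """Geometric mean of bigram probabilities (0.0 if a word is unseen)."""
        padded = ["<s>"] + words + ["</s>"]
        product = 1.0
        for prev, cur in zip(padded, padded[1:]):
            product *= self.prob(prev, cur)
        return product ** (1.0 / (len(padded) - 1))


# Toy English corpus, purely illustrative.
corpus = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["a", "cat", "eats"],
    ["the", "dog", "eats"],
]
model = KneserNeyBigram(corpus)
plausible = model.sentence_score(["a", "dog", "sleeps"])
scrambled = model.sentence_score(["sleeps", "the", "a"])
```

In this sketch a sentence would be flagged when its score falls below a tuned threshold. Words never seen in training receive probability zero, so a practical system would also need an unknown-word strategy.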


Author(s):  
J. Matthew Brennan ◽  
Angela Lowenstern ◽  
Paige Sheridan ◽  
Isabel J. Boero ◽  
Vinod H. Thourani ◽  
...  

Background Patients with symptomatic severe aortic stenosis (ssAS) have a high mortality risk and compromised quality of life. Surgical/transcatheter aortic valve replacement (AVR) is a Class I recommendation, but it is unclear if this recommendation is uniformly applied. We determined the impact of managing cardiologists on the likelihood of ssAS treatment. Methods and Results Using natural language processing of Optum electronic health records, we identified 26 438 patients with newly diagnosed ssAS (2011–2016). Multilevel, multivariable Fine‐Gray competing risk models clustered by cardiologists were used to determine the impact of cardiologists on the likelihood of 1‐year AVR treatment. Within 1 year of diagnosis, 35.6% of patients with ssAS received an AVR; however, rates varied widely among managing cardiologists (0%, lowest quartile; 100%, highest quartile [median, 29.6%; 25th–75th percentiles, 13.3%–47.0%]). The odds of receiving AVR varied >2‐fold depending on the cardiologist (median odds ratio for AVR, 2.25; 95% CI, 2.14–2.36). Compared with patients with ssAS of cardiologists with the highest treatment rates, those treated by cardiologists with the lowest AVR rates experienced significantly higher 1‐year mortality (lowest quartile, adjusted hazard ratio, 1.22, 95% CI, 1.13–1.33). Conclusions Overall AVR rates for ssAS were low, highlighting a potential challenge for ssAS management in the United States. Cardiologist AVR use varied substantially; patients treated by cardiologists with lower AVR rates had higher mortality rates than those treated by cardiologists with higher AVR rates.


2019 ◽  
Vol 9 (18) ◽  
pp. 3648
Author(s):  
Casper S. Shikali ◽  
Zhou Sijie ◽  
Liu Qihe ◽  
Refuoe Mokhosi

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low-resource and widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-the-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose the word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllabic-aware language model and a word analogy dataset for Swahili.


2020 ◽  
Author(s):  
Damianos P. Melidis ◽  
Brandon Malone ◽  
Wolfgang Nejdl

Abstract Background: Word embedding approaches have revolutionized natural language processing (NLP) research. These approaches aim to map words to a low-dimensional vector space, in which words with similar linguistic features cluster together. Embedding-based methods have also been developed for proteins, where words are amino acids and sentences are proteins. The learned embeddings have been evaluated qualitatively, via visual inspection of the embedding space, and extrinsically, via performance comparison on downstream protein prediction tasks. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, so the quality of each unique learned embedding vector cannot be measured. Results: Here, we present dom2vec, an approach for learning protein domain embeddings using word2vec on InterPro annotations. In contrast to sequence embeddings, biological metadata do exist for protein domains, related to each domain separately. Therefore, we present four intrinsic evaluation strategies to quantitatively assess the quality of the learned embedding space. To perform a reliable evaluation in terms of biology knowledge, we selected the metadata related to the most distinctive biological characteristics of domains. These are the structure, enzymatic and molecular function of a given domain. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec on protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.
Conclusions: We report that the application of word2vec on InterPro annotations produces domain embeddings with two significant advantages over sequence embeddings. First, each unique dom2vec vector can be quantitatively evaluated towards its available structure and function metadata. Second, the produced embeddings can outperform the sequence embeddings for a subset of downstream tasks. Overall, dom2vec embeddings are able to capture the most important biological properties of domains and surpass sequence embeddings for a subset of prediction tasks.


2019 ◽  
Author(s):  
Pavankumar Mulgund ◽  
Raj Sharman ◽  
Priya Anand ◽  
Shashank Shekhar ◽  
Priya Karadi

BACKGROUND In recent years, online physician-rating websites have become prominent and exert considerable influence on patients’ decisions. However, the quality of these decisions depends on the quality of data that these systems collect. Thus, there is a need to examine the various data quality issues with physician-rating websites. OBJECTIVE This study’s objective was to identify and categorize the data quality issues afflicting physician-rating websites by reviewing the literature on online patient-reported physician ratings and reviews. METHODS We performed a systematic literature search in ACM Digital Library, EBSCO, Springer, PubMed, and Google Scholar. The search was limited to quantitative, qualitative, and mixed-method papers published in the English language from 2001 to 2020. RESULTS A total of 423 articles were screened. From these, 49 papers describing 18 unique data quality issues afflicting physician-rating websites were included. Using a data quality framework, we classified these issues into the following four categories: intrinsic, contextual, representational, and accessible. Among the papers, 53% (26/49) reported intrinsic data quality errors, 61% (30/49) highlighted contextual data quality issues, 8% (4/49) discussed representational data quality issues, and 27% (13/49) emphasized accessibility data quality. More than half the papers discussed multiple categories of data quality issues. CONCLUSIONS The results from this review demonstrate the presence of a range of data quality issues. While intrinsic and contextual factors have been well-researched, accessibility and representational issues warrant more attention from researchers, as well as practitioners. In particular, representational factors, such as the impact of inline advertisements and the positioning of positive reviews on the first few pages, are usually deliberate and result from the business model of physician-rating websites. 
The impact of these factors on data quality has not been addressed adequately and requires further investigation.


2021 ◽  
Vol 2021 (2) ◽  
pp. 229-241
Author(s):  
Vera L. LUKICHEVA ◽  
Andrey A. PRIVALOV ◽  
Daniil D. TITOV ◽  
...  

Objective: To analyze the impact of computer attacks on the performance quality of data transmission channels and channeling systems. It is also necessary to take into account the capabilities of an intruder to introduce malware into channeling systems when committing a computer attack. Methods: To determine the required design ratios, several options for setting various distribution functions characterizing the parameters used as input data and types of inbound streams have been considered, taking into account the parameters of the intruder’s computer attack model set by the values of the probability of successful attack. Mathematical modeling is carried out using the method of topological transformation of stochastic networks. The exponential, momentum and gamma distributions are considered as distribution functions of random variables. The solutions are presented for inbound streams corresponding to the Poisson, Weibull, and Pareto models. Results: The proposed approach makes it possible to assess the performance quality of data transmission channels in the context of computer attacks. These assessments make it possible to analyze the state and develop guidelines for improving the performance quality of communication channels against the destructive information impact of the intruder. Various variants of the functions of random variables distribution and various types of the inbound stream were used for modeling, making it possible to compare them, as well as to assess the possibility of using them in channels that provide users with different services. Practical importance: The modeling results can be used to build communication management decision support systems, as well as to detect attempts of unauthorized access to the telecommunications resource of transportation management systems. The proposed approach can be applied in the development of threat models to describe the capabilities of the intruder (the ‘Intruder Model’).


2021 ◽  
Author(s):  
Sven Hilbert ◽  
Stefan Coors ◽  
Elisabeth Barbara Kraus ◽  
Bernd Bischl ◽  
Mario Frei ◽  
...  

Classical statistical methods are limited in the analysis of high-dimensional datasets. Machine learning (ML) provides a powerful framework for prediction by using complex relationships, often encountered in modern data with a large number of variables, cases and potentially non-linear effects. ML has turned into one of the most influential analytical approaches of this millennium and has recently become popular in the behavioral and social sciences. The impact of ML methods on research and practical applications in the educational sciences is still limited, but continuously grows as larger and more complex datasets become available through massive open online courses (MOOCs) and large scale investigations. The educational sciences are at a crucial pivot point because of the anticipated impact ML methods hold for the field. Here, we review the opportunities and challenges of ML for the educational sciences, show how a look at related disciplines can help in learning from their experiences, and argue for a philosophical shift in model evaluation. We demonstrate how the overall quality of data analysis in educational research can benefit from these methods and show how ML can play a decisive role in the validation of empirical models. In this review, we (1) provide an overview of the types of data suitable for ML, (2) give practical advice for the application of ML methods, and (3) show how ML-based tools and applications can be used to enhance the quality of education. Additionally, we provide practical R code with exemplary analyses, available at https://osf.io/ntre9/?view_only=d29ae7cf59d34e8293f4c6bbde3e4ab2.


Author(s):  
Eddie A Santos ◽  
Abram Hindle

Developers summarize their changes to code in commit messages. When a message seems “unusual,” however, this casts doubt on the quality of the code contained in the commit. We trained \(n\)-gram language models and used cross-entropy as an indicator of commit message “unusualness” for over 120 000 commits from open source projects. Build statuses collected from Travis-CI were used as a proxy for code quality. We then compared the distributions of failed and successful commits with regard to the “unusualness” of their commit messages. Our analysis yielded significant results when correlating cross-entropy with build status.
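For reference, the per-token cross-entropy of a message under an \(n\)-gram model is usually defined as below. This is the standard textbook form rather than a formula quoted from the paper; the abstract does not state the model order \(n\), the tokenization, or the smoothing used, so those are left open. Here \(w_1, \dots, w_N\) are the tokens of one commit message \(m\) and \(P\) is the trained model.

```latex
H(m) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P\left(w_i \mid w_{i-n+1}, \dots, w_{i-1}\right)
```

A higher \(H(m)\), measured in bits per token, means the model found the message harder to predict, which is the sense in which a message counts as more “unusual.”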


2019 ◽  
Vol 18 ◽  
pp. 160940691987646 ◽  
Author(s):  
Saltanat Janenova

This article provides a reflective analysis of a local scholar on methodological challenges of conducting research in Kazakhstan — a post-Soviet, authoritarian, Central Asian country. It specifically addresses the problems of getting access to government officials and the quality of data, describes the strategies applied by the researcher to mitigate these obstacles, and discusses the impact of the political environment on decisions relating to the research design, ethical integrity, safety of participants and researchers, and publication dilemma. This article will be of interest both for researchers who are doing or planning to conduct research in Kazakhstan and Central Asia and those who are researching in nondemocratic contexts as methodological challenges of an authoritarian regime stretch beyond the geographical boundaries.

