Language modelling for biological sequences – curated datasets and baselines

Abstract Motivation: Language modelling (LM) on biological sequences is an emergent topic in the field of bioinformatics. Current research has shown that language modelling of proteins can create context-dependent representations that can be applied to improve performance on different protein prediction tasks. However, little effort has been directed towards analyzing the properties of the datasets used to train language models. Additionally, only the performance on cherry-picked downstream tasks is used to assess the capacity of LMs. Results: We analyze the entire UniProt database and investigate the different properties that can bias or hinder the performance of LMs, such as homology, domain of origin, quality of the data, and completeness of the sequence. We evaluate n-gram and Recurrent Neural Network (RNN) LMs to assess the impact of these properties on performance. To our knowledge, this is the first protein dataset with an emphasis on language modelling. Our inclusion of properties specific to proteins gives a detailed analysis of how well natural language processing methods work on biological sequences. We find that organism domain and quality of data have an impact on the performance, while the completeness of the proteins has little influence. The RNN-based LM can learn to model Bacteria, Eukarya, and Archaea, but struggles with Viruses. By using the LM we can also generate novel proteins that are shown to be similar to real proteins. Availability and implementation: https://github.com/alrojo/UniLanguage


An exploratory research on grammar checking of Bangla sentences using statistical language models

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i3.pp3244-3252 ◽

2020 ◽

Vol 10 (3) ◽

pp. 3244-3252

Author(s):

M. D. Riazur Rahman ◽

M. D. Tarek Habib ◽

M. D. Sadekur Rahman ◽

Gazi Zahirul Islam ◽

M. D. Abbas Ali Khan

Keyword(s):

Language Processing ◽

Language Model ◽

Language Models ◽

Exploratory Research ◽

Smoothing Technique ◽

Comparative Performance ◽

Statistical Language Models ◽

Language Modelling ◽

N Gram ◽

Improved Technique

N-gram based language models are popular, extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem. Kneser-Ney is one of the most prominent and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking the grammatical correctness of Bangla sentences, which showed promising results and outperformed previous methods. In this work, we propose an improved method using a Kneser-Ney smoothing based n-gram language model for grammar checking and perform a comparative performance analysis of the Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provide an improved technique for calculating the optimum threshold, which further enhances the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking the grammatical correctness of Bangla sentences.
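The approach described in this abstract can be illustrated with a minimal sketch of an interpolated Kneser-Ney bigram model that scores a sentence and compares the score with a threshold. This is not the authors' implementation: the bigram order, the discount of 0.75, the toy English corpus, the threshold value, and the names `KneserNeyBigram`, `avg_log_prob`, and `is_grammatical` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class KneserNeyBigram:
    """Interpolated Kneser-Ney bigram model (illustrative sketch)."""

    def __init__(self, sentences, discount=0.75):
        self.d = discount  # assumed absolute discount
        self.bigrams = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.bigrams.update(zip(padded, padded[1:]))
        self.context_count = Counter()     # c(v): how often v occurs as a context
        self.followers = defaultdict(set)  # distinct words seen after v
        self.preceders = defaultdict(set)  # distinct contexts seen before w
        for (v, w), count in self.bigrams.items():
            self.context_count[v] += count
            self.followers[v].add(w)
            self.preceders[w].add(v)
        self.bigram_types = len(self.bigrams)

    def prob(self, v, w):
        # Continuation probability: share of bigram types that end in w.
        p_cont = len(self.preceders.get(w, ())) / self.bigram_types
        c_v = self.context_count.get(v, 0)
        if c_v == 0:
            return p_cont  # unseen context: fall back to the continuation probability
        discounted = max(self.bigrams.get((v, w), 0) - self.d, 0) / c_v
        lam = self.d * len(self.followers[v]) / c_v  # mass freed by discounting
        return discounted + lam * p_cont

    def avg_log_prob(self, tokens, floor=1e-12):
        # The floor keeps unseen words from producing log(0) in this sketch.
        padded = ["<s>"] + tokens + ["</s>"]
        logs = [math.log(max(self.prob(v, w), floor))
                for v, w in zip(padded, padded[1:])]
        return sum(logs) / len(logs)


def is_grammatical(model, tokens, threshold):
    """Accept a sentence when its average log-probability reaches the threshold."""
    return model.avg_log_prob(tokens) >= threshold


# Toy corpus (hypothetical; the paper works with Bangla sentences).
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["a", "cat", "runs"]]
model = KneserNeyBigram(corpus)
THRESHOLD = -2.0  # hypothetical value; the paper tunes its threshold on data
print(is_grammatical(model, ["the", "cat", "sleeps"], THRESHOLD))
print(is_grammatical(model, ["sleeps", "the", "cat"], THRESHOLD))
```

In practice the threshold would be chosen on held-out labelled sentences, which is the step the abstract says was improved.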


Association Between Patient Survival and Clinician Variability in Treatment Rates for Aortic Valve Stenosis

Journal of the American Heart Association ◽

10.1161/jaha.120.020490 ◽

2021 ◽

Author(s):

J. Matthew Brennan ◽

Angela Lowenstern ◽

Paige Sheridan ◽

Isabel J. Boero ◽

Vinod H. Thourani ◽

...

Keyword(s):

Aortic Valve ◽

Language Processing ◽

Severe Aortic Stenosis ◽

The United States ◽

Risk Models ◽

Transcatheter Aortic Valve ◽

Competing Risk Models ◽

The Impact ◽

Potential Challenge

Background Patients with symptomatic severe aortic stenosis (ssAS) have a high mortality risk and compromised quality of life. Surgical/transcatheter aortic valve replacement (AVR) is a Class I recommendation, but it is unclear if this recommendation is uniformly applied. We determined the impact of managing cardiologists on the likelihood of ssAS treatment. Methods and Results Using natural language processing of Optum electronic health records, we identified 26 438 patients with newly diagnosed ssAS (2011–2016). Multilevel, multivariable Fine‐Gray competing risk models clustered by cardiologists were used to determine the impact of cardiologists on the likelihood of 1‐year AVR treatment. Within 1 year of diagnosis, 35.6% of patients with ssAS received an AVR; however, rates varied widely among managing cardiologists (0%, lowest quartile; 100%, highest quartile [median, 29.6%; 25th–75th percentiles, 13.3%–47.0%]). The odds of receiving AVR varied >2‐fold depending on the cardiologist (median odds ratio for AVR, 2.25; 95% CI, 2.14–2.36). Compared with patients with ssAS of cardiologists with the highest treatment rates, those treated by cardiologists with the lowest AVR rates experienced significantly higher 1‐year mortality (lowest quartile, adjusted hazard ratio, 1.22, 95% CI, 1.13–1.33). Conclusions Overall AVR rates for ssAS were low, highlighting a potential challenge for ssAS management in the United States. Cardiologist AVR use varied substantially; patients treated by cardiologists with lower AVR rates had higher mortality rates than those treated by cardiologists with higher AVR rates.
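To give a feel for the reported median odds ratio of 2.25, the sketch below uses the standard definition of the median odds ratio for a random-intercept model on the log-odds scale, MOR = exp(sqrt(2σ²) · Φ⁻¹(0.75)), where σ² is the between-cluster variance. Whether the authors computed their value in exactly this way is an assumption; the function names are hypothetical.

```python
import math
from statistics import NormalDist

# 75th percentile of the standard normal distribution (about 0.6745).
Z75 = NormalDist().inv_cdf(0.75)


def median_odds_ratio(cluster_variance):
    """MOR implied by a between-cluster variance on the log-odds scale."""
    return math.exp(math.sqrt(2.0 * cluster_variance) * Z75)


def implied_cluster_variance(mor):
    """Invert the MOR formula to recover the between-cluster variance."""
    return (math.log(mor) / (math.sqrt(2.0) * Z75)) ** 2


# Between-cardiologist variance implied by the reported MOR, under the
# assumption stated above.
variance = implied_cluster_variance(2.25)
print(round(variance, 3))
```

Read this way, an MOR of 2.25 means that for two otherwise comparable patients managed by two randomly chosen cardiologists, the median ratio between the higher and the lower odds of treatment is 2.25.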


The Impact of Data Quantity and Source on the Quality of Data-Driven Hints for Programming

Lecture Notes in Computer Science - Artificial Intelligence in Education ◽

10.1007/978-3-319-93843-1_35 ◽

2018 ◽

pp. 476-490 ◽

Cited By ~ 2

Author(s):

Thomas W. Price ◽

Rui Zhi ◽

Yihuan Dong ◽

Nicholas Lytle ◽

Tiffany Barnes

Keyword(s):

Data Driven ◽

Quality Of Data ◽

The Impact


Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

Applied Sciences ◽

10.3390/app9183648 ◽

2019 ◽

Vol 9 (18) ◽

pp. 3648

Author(s):

Casper S. Shikali ◽

Zhou Sijie ◽

Liu Qihe ◽

Refuoe Mokhosi

Keyword(s):

Language Processing ◽

Critical Role ◽

Language Model ◽

Central Africa ◽

Spoken Language ◽

Language Models ◽

Word Embeddings ◽

Word Representation

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low-resource but widely spoken language in East and Central Africa. This study proposes novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded the respective syllables instead of characters, character n-grams or morphemes of words, and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by state-of-the-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not previously been used to compose word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllable-aware language model and a word analogy dataset for Swahili.


dom2vec: Capturing domain structure and function using self-supervision on protein domain architectures

10.21203/rs.3.rs-58816/v1 ◽

2020 ◽

Author(s):

Damianos P. Melidis ◽

Brandon Malone ◽

Wolfgang Nejdl

Keyword(s):

Domain Structure ◽

Language Processing ◽

Performance Comparison ◽

Structure And Function ◽

Protein Domain ◽

Linguistic Features ◽

Enzymatic Function ◽

Protein Prediction ◽

And Function

Abstract Background: Word embedding approaches have revolutionized natural language processing (NLP) research. These approaches aim to map words to a low-dimensional vector space, in which words with similar linguistic features cluster together. Embedding-based methods have also been developed for proteins, where words are amino acids and sentences are proteins. The learned embeddings have been evaluated qualitatively, via visual inspection of the embedding space, and extrinsically, via performance comparison on downstream protein prediction tasks. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, so the quality of each unique learned embedding vector cannot be measured. Results: Here, we present dom2vec, an approach for learning protein domain embeddings using word2vec on InterPro annotations. In contrast to sequence embeddings, biological metadata do exist for protein domains, related to each domain separately. Therefore, we present four intrinsic evaluation strategies to quantitatively assess the quality of the learned embedding space. To perform a reliable evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of domains. These are the structure, enzymatic and molecular function of a given domain. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec to protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.
Conclusions: We report that the application of word2vec on InterPro annotations produces domain embeddings with two significant advantages over sequence embeddings. First, each unique dom2vec vector can be quantitatively evaluated towards its available structure and function metadata. Second, the produced embeddings can outperform the sequence embeddings for a subset of downstream tasks. Overall, dom2vec embeddings are able to capture the most important biological properties of domains and surpass sequence embeddings for a subset of prediction tasks.


Data Quality Issues With Physician-Rating Websites: Systematic Review (Preprint)

10.2196/preprints.15916 ◽

2019 ◽

Author(s):

Pavankumar Mulgund ◽

Raj Sharman ◽

Priya Anand ◽

Shashank Shekhar ◽

Priya Karadi

Keyword(s):

Data Quality ◽

English Language ◽

Quality Of Data ◽

Contextual Data ◽

Patient Reported ◽

Physician Rating ◽

Quality Framework ◽

Quality Issues ◽

The Impact

BACKGROUND In recent years, online physician-rating websites have become prominent and exert considerable influence on patients’ decisions. However, the quality of these decisions depends on the quality of data that these systems collect. Thus, there is a need to examine the various data quality issues with physician-rating websites. OBJECTIVE This study’s objective was to identify and categorize the data quality issues afflicting physician-rating websites by reviewing the literature on online patient-reported physician ratings and reviews. METHODS We performed a systematic literature search in ACM Digital Library, EBSCO, Springer, PubMed, and Google Scholar. The search was limited to quantitative, qualitative, and mixed-method papers published in the English language from 2001 to 2020. RESULTS A total of 423 articles were screened. From these, 49 papers describing 18 unique data quality issues afflicting physician-rating websites were included. Using a data quality framework, we classified these issues into the following four categories: intrinsic, contextual, representational, and accessible. Among the papers, 53% (26/49) reported intrinsic data quality errors, 61% (30/49) highlighted contextual data quality issues, 8% (4/49) discussed representational data quality issues, and 27% (13/49) emphasized accessibility data quality. More than half the papers discussed multiple categories of data quality issues. CONCLUSIONS The results from this review demonstrate the presence of a range of data quality issues. While intrinsic and contextual factors have been well-researched, accessibility and representational issues warrant more attention from researchers, as well as practitioners. In particular, representational factors, such as the impact of inline advertisements and the positioning of positive reviews on the first few pages, are usually deliberate and result from the business model of physician-rating websites. The impact of these factors on data quality has not been addressed adequately and requires further investigation.


A model of the process of package delivering over a data transmission channel in the context of computer attacks by an intruder

Proceedings of Petersburg Transport University ◽

10.20295/1815-588x-2021-2-229-241 ◽

2021 ◽

Vol 2021 (2) ◽

pp. 229-241

Author(s):

Vera L. LUKICHEVA ◽

Andrey A. PRIVALOV ◽

Daniil D. TITOV ◽

...

Keyword(s):

Data Transmission ◽

Random Variables ◽

Distribution Functions ◽

Quality Of Data ◽

Attack Model ◽

Transmission Channels ◽

Performance Quality ◽

Computer Attacks ◽

The Impact

Objective: To analyze the impact of computer attacks on the performance quality of data transmission channels and channeling systems. It is also necessary to take into account the capabilities of an intruder to introduce malware into channeling systems when committing a computer attack. Methods: To determine the required design ratios, several options for setting the distribution functions that characterize the input parameters and the types of inbound streams have been considered, taking into account the parameters of the intruder’s computer attack model, which are set by the values of the probability of a successful attack. Mathematical modeling is carried out using the method of topological transformation of stochastic networks. The exponential, momentum and gamma distributions are considered as distribution functions of random variables. The solutions are presented for inbound streams corresponding to the Poisson, Weibull, and Pareto models. Results: The proposed approach makes it possible to assess the performance quality of data transmission channels in the context of computer attacks. These assessments make it possible to analyze the state and develop guidelines for improving the performance quality of communication channels against the destructive information impact of the intruder. Several distribution functions of random variables and several types of inbound stream were used for modeling, making it possible to compare them and to assess their suitability for channels that provide users with different services. Practical importance: The modeling results can be used to build communication management decision support systems, as well as to detect attempts at unauthorized access to the telecommunications resources of transportation management systems. The proposed approach can be applied in the development of threat models to describe the capabilities of the intruder (the ‘Intruder Model’).


Machine Learning for the Educational Sciences

10.31234/osf.io/3hnr6 ◽

2021 ◽

Author(s):

Sven Hilbert ◽

Stefan Coors ◽

Elisabeth Barbara Kraus ◽

Bernd Bischl ◽

Mario Frei ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Decisive Role ◽

Quality Of Data ◽

Practical Applications ◽

Educational Sciences ◽

Complex Relationships ◽

The Impact ◽

Analytical Approaches

Classical statistical methods are limited in the analysis of high-dimensional datasets. Machine learning (ML) provides a powerful framework for prediction by using complex relationships, often encountered in modern data with a large number of variables, cases and potentially non-linear effects. ML has turned into one of the most influential analytical approaches of this millennium and has recently become popular in the behavioral and social sciences. The impact of ML methods on research and practical applications in the educational sciences is still limited, but grows continuously as larger and more complex datasets become available through massive open online courses (MOOCs) and large-scale investigations. The educational sciences are at a crucial pivot point because of the anticipated impact ML methods hold for the field. Here, we review the opportunities and challenges of ML for the educational sciences, show how a look at related disciplines can help us learn from their experiences, and argue for a philosophical shift in model evaluation. We demonstrate how the overall quality of data analysis in educational research can benefit from these methods and show how ML can play a decisive role in the validation of empirical models. In this review, we (1) provide an overview of the types of data suitable for ML, (2) give practical advice for the application of ML methods, and (3) show how ML-based tools and applications can be used to enhance the quality of education. Additionally, we provide practical R code with exemplary analyses, available at https://osf.io/ntre9/?view_only=d29ae7cf59d34e8293f4c6bbde3e4ab2.


Judging a commit by its cover; or can a commit message predict build failure?

10.7287/peerj.preprints.1771v1 ◽

2016 ◽

Cited By ~ 1

Author(s):

Eddie A Santos ◽

Abram Hindle

Keyword(s):

Open Source ◽

Language Models ◽

Cross Entropy ◽

N Gram ◽

Code Quality

Developers summarize their changes to code in commit messages. When a message seems “unusual,” however, this casts doubt on the quality of the code contained in the commit. We trained n-gram language models and used cross-entropy as an indicator of commit message “unusualness” for over 120 000 commits from open source projects. Build statuses collected from Travis-CI were used as a proxy for code quality. We then compared the distributions of failed and successful commits with regard to the “unusualness” of their commit messages. Our analysis yielded significant results when correlating cross-entropy with build status.
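The idea of using cross-entropy as an "unusualness" score can be sketched as follows. The abstract does not state the n-gram order, smoothing method, or tokenization, so this example assumes a bigram model with add-one (Laplace) smoothing and whitespace tokenization; the class name `AddOneBigramModel` and the commit messages are hypothetical.

```python
import math
from collections import Counter


def tokenize(message):
    # Assumed tokenization: lowercase, split on whitespace, add boundary markers.
    return ["<s>"] + message.lower().split() + ["</s>"]


class AddOneBigramModel:
    """Bigram model with add-one smoothing (illustrative sketch)."""

    def __init__(self, messages):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"<unk>"}
        for message in messages:
            tokens = tokenize(message)
            self.vocab.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
            self.contexts.update(tokens[:-1])

    def _known(self, token):
        # Words never seen in training are mapped to a shared <unk> token.
        return token if token in self.vocab else "<unk>"

    def prob(self, prev, word):
        prev, word = self._known(prev), self._known(word)
        return (self.bigrams[(prev, word)] + 1) / (self.contexts[prev] + len(self.vocab))

    def cross_entropy(self, message):
        """Average negative log2-probability per token, in bits."""
        tokens = tokenize(message)
        pairs = list(zip(tokens, tokens[1:]))
        return -sum(math.log2(self.prob(p, w)) for p, w in pairs) / len(pairs)


# Hypothetical commit history used to train the model.
history = [
    "fix typo in readme",
    "fix failing test",
    "add unit test for parser",
    "update readme",
]
commit_model = AddOneBigramModel(history)
usual = commit_model.cross_entropy("fix typo in readme")
unusual = commit_model.cross_entropy("asdf wip lol")
print(usual < unusual)  # a higher cross-entropy marks a more "unusual" message
```

The study then compares the distribution of such scores between commits whose builds failed and commits whose builds passed; that comparison is not reproduced here.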


The Boundaries of Research in an Authoritarian State

International Journal of Qualitative Methods ◽

10.1177/1609406919876469 ◽

2019 ◽

Vol 18 ◽

pp. 160940691987646 ◽

Cited By ~ 3

Author(s):

Saltanat Janenova

Keyword(s):

Asian Country ◽

Political Environment ◽

Quality Of Data ◽

Government Officials ◽

Authoritarian State ◽

Methodological Challenges ◽

Central Asian ◽

Conducting Research ◽

The Impact

This article provides a reflective analysis by a local scholar of the methodological challenges of conducting research in Kazakhstan, a post-Soviet, authoritarian, Central Asian country. It specifically addresses the problems of gaining access to government officials and of data quality, describes the strategies applied by the researcher to mitigate these obstacles, and discusses the impact of the political environment on decisions relating to research design, ethical integrity, the safety of participants and researchers, and publication dilemmas. This article will be of interest both to researchers who are conducting or planning research in Kazakhstan and Central Asia and to those researching in nondemocratic contexts, as the methodological challenges of an authoritarian regime stretch beyond geographical boundaries.
