Spoken language is fundamentally different from written language in that it contains frequent disfluencies, i.e., parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle this training data bottleneck, we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) a sentence classification task to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly pre-train a neural network, which is subsequently fine-tuned on human-annotated disfluency detection data. The self-supervised method captures task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated from simple heuristics and cannot cover all disfluency patterns, a performance gap remains relative to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning reduces annotation costs by selecting the most informative examples to label, and can thus address the weakness of self-supervised learning when only a small annotated dataset is available.
We show that by combining self-supervised learning with active learning, our model matches state-of-the-art performance with only about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
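The noising procedure described above (randomly adding or deleting words from unlabeled sentences to create pseudo training data for the tagging and sentence-classification tasks) can be sketched as follows. The abstract does not specify the noising probabilities, noise vocabulary, or label names, so `p_add`, `p_del`, and the `ADD`/`KEEP` tags below are illustrative assumptions:

```python
import random

def make_pseudo_example(sentence, vocab, p_add=0.15, p_del=0.1, seed=None):
    """Generate a noisy sentence plus per-token labels for the tagging task.

    'ADD' marks inserted noise words (the targets of the tagging task);
    'KEEP' marks original words. Deletions simply drop words, yielding a
    grammatically broken sentence for the sentence-classification task.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for word in sentence.split():
        # Randomly insert a noise word drawn from the vocabulary.
        if rng.random() < p_add:
            tokens.append(rng.choice(vocab))
            labels.append("ADD")
        # Randomly delete the original word.
        if rng.random() < p_del:
            continue
        tokens.append(word)
        labels.append("KEEP")
    # The corrupted/original flag is the sentence-classification target.
    is_corrupted = tokens != sentence.split()
    return tokens, labels, is_corrupted
```

Pairs produced this way supply essentially unlimited pre-training signal; the fine-tuning stage then replaces the synthetic `ADD` labels with genuine disfluency annotations.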
FPGA-based Physical Unclonable Functions (PUFs) have emerged as a viable alternative to permanent key storage by turning the effects of manufacturing inaccuracies of a chip into a unique, FPGA-intrinsic secret. However, many fixed PUF designs suffer from unsatisfactory statistical properties in terms of uniqueness, uniformity, and robustness. Moreover, a PUF signature may alter over time due to aging or changing operating conditions, rendering a PUF insecure in the worst case. As a remedy, we propose CHOICE, a novel class of FPGA-based PUF designs with tunable uniqueness and reliability characteristics. Using addressable shift registers available on an FPGA, we show that a wide configuration space for adjusting a device-specific PUF response is obtained without any sacrifice of randomness. In particular, we demonstrate the concept of address-tunable propagation delays, whereby we are able to increase or decrease the probability of obtaining “1”s in the PUF response. Experimental evaluations on a group of six 28 nm Xilinx Artix-7 FPGAs show that CHOICE PUFs provide a large range of configurations that allow fine-tuning to an average uniqueness between 49% and 51%, while simultaneously achieving bit error rates below 1.5%, thus outperforming state-of-the-art PUF designs. Moreover, with only a single FPGA slice per PUF bit, CHOICE is one of the smallest PUF designs currently available for FPGAs. It is well known that signal propagation delays are affected by temperature, as the operating temperature impacts the internal currents of the transistors that ultimately make up the circuit. We therefore comprehensively investigate how temperature variations affect the PUF response and demonstrate how the tunability of CHOICE enables us to determine configurations that show high robustness to such variations. As a case study, we present a cryptographic key generation scheme based on CHOICE PUF responses as the device-intrinsic secret, and investigate the design objectives of resource costs, performance, and temperature robustness to show the practicability of our approach.
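The uniqueness and bit-error-rate figures quoted above are conventionally computed as fractional Hamming distances over PUF responses: uniqueness as the average pairwise inter-device distance (ideally 50%), and bit error rate as the average intra-device distance between repeated measurements and a reference response. A minimal sketch under those standard definitions (function names and the bit-string encoding are illustrative, not from the paper):

```python
from itertools import combinations

def hamming_frac(a, b):
    """Fractional Hamming distance between two equal-length bit strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def uniqueness(responses):
    """Average pairwise inter-device distance in percent; the ideal is 50%."""
    pairs = list(combinations(responses, 2))
    return 100.0 * sum(hamming_frac(a, b) for a, b in pairs) / len(pairs)

def bit_error_rate(reference, remeasurements):
    """Mean intra-device distance of repeated measurements vs. a reference, in percent."""
    return 100.0 * sum(hamming_frac(reference, r) for r in remeasurements) / len(remeasurements)
```

Under these metrics, the reported 49–51% uniqueness range is nearly ideal, and a bit error rate below 1.5% leaves ample margin for the error-correcting codes typically used in PUF-based key generation.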
Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general-domain corpora, such as newswire and the Web. A prevailing assumption is that even domain-specific pretraining can benefit from starting with general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, through a thorough evaluation of modeling choices for both pretraining and task-specific fine-tuning, we find that some common practices, such as complex tagging schemes in named entity recognition, are unnecessary with BERT models. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at
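To make the tagging-scheme finding concrete: the simple BIO scheme that the article finds sufficient for named entity recognition labels the first token of an entity `B-<type>`, subsequent tokens `I-<type>`, and everything else `O`; more complex schemes (e.g., BIOES) additionally mark end and single-token entities. A sketch of plain BIO encoding, where the function and span format are illustrative assumptions rather than the article's code:

```python
def to_bio(tokens, spans):
    """Convert entity spans [(start, end, type)] (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"        # continuation tokens
    return tags
```

For example, `to_bio(["EGFR", "mutations", "in", "lung", "cancer"], [(0, 1, "GENE"), (3, 5, "DISEASE")])` yields `["B-GENE", "O", "O", "B-DISEASE", "I-DISEASE"]`, with no extra end/single markers needed.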
MADS-box gene family members play multifarious roles in regulating the growth and development of crop plants and hold enormous promise for bolstering grain yield potential under changing global environments. Bread wheat (Triticum aestivum L.) is a key staple food crop around the globe. Until now, the available information concerning MADS-box genes in the wheat genome has been insufficient. Here, a comprehensive genome-wide analysis identified 300 high-confidence MADS-box genes in the publicly available reference genome of wheat. Comparative phylogenetic analyses with Arabidopsis and rice MADS-box genes classified the wheat genes into 16 distinct subfamilies. Gene duplications were mainly identified in subfamilies containing unbalanced homeologs, pointing towards a potential mechanism of gene family expansion. Moreover, a more rapid evolution was inferred for M-type genes than for MIKC-type genes, indicating their significance for understanding the evolutionary history of the wheat genome. We speculate that subfamily-specific distal telomeric duplications in unbalanced homeologs facilitate the rapid adaptation of wheat to changing environments. Furthermore, our in-silico expression data strongly suggest that MADS-box genes act as guardians of plants against pathogen insurgency and harsh environmental conditions. In conclusion, we provide the entire complement of MADS-box genes identified in the wheat genome, which could accelerate functional genomics efforts and possibly help bridge genotype-to-phenotype gaps through fine-tuning of agronomically important traits.
X-ray diffraction is one of the most common methods for ascertaining protein structures, yet only 2–10% of proteins can produce diffraction-quality crystals. Several computational methods have been proposed to predict protein crystallization. Nevertheless, the current state-of-the-art computational methods are limited by the scarcity of experimental data, so the prediction accuracy of existing models has not yet reached a satisfactory level. To address these problems, we propose a novel transfer-learning-based framework for protein crystallization prediction, named TLCrys. The framework proceeds in two steps: pre-training and fine-tuning. The pre-training step adopts an attention mechanism to extract both global and local information from protein sequences. The representation learned during pre-training is regarded as knowledge to be transferred and fine-tuned to enhance the performance of crystallization prediction. During pre-training, TLCrys adopts a multi-task learning method, which not only improves the learning ability of protein encoding but also enhances the robustness and generalization of the protein representation. The multi-head self-attention layer ensures that different levels of the protein representation can be exploited in the fine-tuning step. During transfer learning, the fine-tuning strategy used by TLCrys improves the task-specific learning ability of the network. Our method significantly outperforms all previous predictors at five stages of crystallization prediction. Furthermore, the proposed methodology generalizes well to other protein sequence classification tasks.
Much of the Earth's surface is covered by water. As pointed out in the 2020 edition of the World Water Development Report, climate change challenges the sustainability of global water resources, so it is important to monitor water quality in order to preserve sustainable water resources. The quality of water can be related to the structure of water crystals, the solid state of water, so methods to understand water crystals can help to improve water quality. As a first step, a water crystal exploratory analysis has been initiated in cooperation with the Emoto Peace Project (EPP). The 5K EPP dataset has been created as the first worldwide small dataset of water crystals. Our research focused on reducing the inherent limitations of fitting machine learning models to the small 5K EPP dataset. One major result is the classification of water crystals and how to split our small dataset into several related groups. Using the 5K EPP dataset of human observations and past research on snow crystal classification, we created a simple set of visual labels to identify water crystal shapes in 13 categories. A deep learning-based method was then used to perform the classification task automatically on a subset of the labeled dataset. The classification achieved high accuracy when a fine-tuning technique was used.
Gene overprinting occurs when point mutations within a genomic region with an existing coding sequence create a new one in another reading frame. This process is quite frequent in viral genomes, either to maximize the amount of information that they encode or in response to strong selective pressure. The most frequent scenario involves two different reading frames in the same DNA strand (sense overlap). Much less frequent are cases of overlapping genes that are encoded on opposite DNA strands (antisense overlap). One such example is the antisense ORF asp, located in the minus strand of the HIV-1 genome and overlapping the env gene. The asp gene is highly conserved in pandemic HIV-1 strains of group M, and it is absent in non-pandemic HIV-1 groups, HIV-2, and lentiviruses infecting non-human primates, suggesting that the ~190-amino acid protein expressed from this gene (ASP) may play a role in virus spread. While the function of ASP in the virus life cycle remains to be elucidated, mounting evidence from several research groups indicates that ASP is expressed in vivo. Two alternative hypotheses could be envisioned to explain the origin of the asp ORF. On one hand, asp may have originally been present in the ancestor of contemporary lentiviruses and subsequently lost in all descendants except most HIV-1 strains of group M, where it conferred a selective advantage. Alternatively, the asp ORF may have originated very recently with the emergence of group M HIV-1 strains from SIVcpz. Here, we used a combination of computational and statistical approaches to study the genomic region of env in primate lentiviruses to shed light on the origin, structure, and sequence evolution of the asp ORF.
The results emerging from our studies support the hypothesis of a recent de novo addition of the antisense ORF to the HIV-1 genome, through a process that entailed the progressive removal of existing internal stop codons from SIV strains to HIV-1 strains of group M, and fine-tuning of the codon sequence in env that reduced the chances of new stop codons occurring in asp. Altogether, the study supports the notion that the HIV-1 asp gene encodes an accessory protein that provides a selective advantage to the virus.
Exosomes are extracellular vesicles formed by various donor cells that regulate gene expression and cellular function in recipient cells. Exosomes derived from mesenchymal stem cells (MSC-Exos) perform the regulatory functions of stem cells by transporting proteins, nucleic acids, and lipids. Intervertebral disc degeneration (IDD) is one of the main causes of low back pain, and it is characterized by a decreased number of nucleus pulposus cells, extracellular matrix decomposition, aging of the annulus fibrosus, and cartilage endplate calcification. In addition, nutrient transport and structural repair of intervertebral discs depend on bone and cartilage and are closely related to the state of the bone. Trauma, disease, and aging can all cause bone injury, yet there is a lack of effective drugs against IDD and bone injury. Recent fine-tuning of MSC-Exos has led to significant progress in IDD treatment and in bone repair and regeneration. In this review, we examine the uniqueness of MSC-Exos and their potential treatment mechanisms with respect to IDD, bone defects, and bone injuries.
Monitoring the extent of plateau forests has drawn much attention from governments, given that plateau forests play a key role in global carbon circulation. Despite recent advances in remote-sensing applications of satellite imagery over large regions, accurate mapping of plateau forests remains challenging due to limited ground truth information and high uncertainties in their spatial distribution. In this paper, we aim to generate a better segmentation map for plateau forests using high-resolution satellite imagery with limited ground-truth data. We present the first 2 m spatial resolution large-scale plateau forest dataset of the Sanjiangyuan National Nature Reserve, including 38,708 plateau forest imagery samples and 1187 hand-made, accurate plateau forest ground truth masks. We then propose a few-shot learning method for mapping plateau forests. The proposed method proceeds in two stages: unsupervised feature extraction that leverages domain knowledge, and model fine-tuning using the limited ground truth data. The proposed few-shot learning method reached an F1-score of 84.23% and outperformed state-of-the-art object segmentation methods. These results indicate that the proposed few-shot learning model can support large-scale plateau forest monitoring. The dataset proposed in this paper will soon be available online to the public.
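The reported F1-score is presumably the standard pixel-wise F1 (equivalently, the Dice score) between predicted and ground-truth binary forest masks; the abstract does not spell out the metric, so this is an assumption. A minimal sketch, with the flat 0/1-list mask encoding chosen purely for illustration:

```python
def segmentation_f1(pred, truth):
    """Pixel-wise F1 score for binary segmentation masks given as flat 0/1 lists."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))  # forest predicted and present
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))  # forest predicted, absent
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))  # forest missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because forest pixels are typically a minority class in such imagery, F1 over the forest class is a more informative headline number than overall pixel accuracy, which a model could inflate by predicting background everywhere.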