A domain-specific controlled English language for automated regulatory compliance (industrial paper)

Author(s):  
Suman Roychoudhury ◽  
Sagar Sunkle ◽  
Deepali Kholkar ◽  
Vinay Kulkarni
Author(s):  
Maja Radović ◽  
Nenad Petrović ◽  
Milorad Tošić

The requirements of state-of-the-art curricula and teaching processes in medical education have both introduced new assessment methods and improved existing ones. Recently, several promising methods have emerged, among them the Comprehensive Integrative Puzzle (CIP), which shows great potential. However, constructing such questions requires considerable effort from a team of experts and is time-consuming. Furthermore, although English is accepted as an international language, educational purposes also call for representing data and knowledge in the native language. In this paper, we present an approach for automatic generation of CIP assessment questions based on using ontologies for knowledge representation. In this way, it is possible to provide multilingual support in the teaching and learning process because the same ontological concept can be applied to corresponding language expressions in different languages. The proposed approach shows promising results, indicated by a dramatic speed-up in the construction of CIP questions compared to manual methods. The presented results are a strong indication that adopting ontologies for knowledge representation may enable scalability in multilingual domain-specific education regardless of the language used. The high level of automation in the assessment process, demonstrated on the CIP method in medical education, one of the most challenging domains, promises high potential for innovative teaching methodologies in other educational domains as well.


Author(s):  
Arkadipta De ◽  
Dibyanayan Bandyopadhyay ◽  
Baban Gain ◽  
Asif Ekbal

Fake news classification is one of the most interesting problems that has attracted huge attention from researchers in artificial intelligence, natural language processing, and machine learning (ML). Most current work on fake news detection targets the English language, which limits its usability, especially outside the English-literate population. Although there has been a growth in multilingual web content, fake news classification in low-resource languages is still a challenge due to the non-availability of an annotated corpus and tools. This article proposes an effective neural model based on the multilingual Bidirectional Encoder Representations from Transformers (BERT) for domain-agnostic multilingual fake news classification. A wide variety of experiments, including language-specific and domain-specific settings, are conducted. The proposed model achieves high accuracy in domain-specific and domain-agnostic experiments, and it also outperforms the current state-of-the-art models. We perform experiments on zero-shot settings to assess the effectiveness of language-agnostic feature transfer across different languages, showing encouraging results. Cross-domain transfer experiments are also performed to assess language-independent feature transfer of the model. We also offer a multilingual multidomain fake news detection dataset of five languages and seven different domains that could be useful for research and development in resource-scarce scenarios.


Author(s):  
Chien-Yu Lin ◽  
Mohammad Javad Koohsari ◽  
Yung Liao ◽  
Kaori Ishii ◽  
Ai Shibata ◽  
...  

Background: Many desk-based workers can spend more than half of their working hours sitting, with low levels of physical activity. Workplace neighbourhood built environment may influence workers’ physical activities and sedentary behaviours on workdays. We reviewed and synthesised evidence from observational studies on associations of workplace neighbourhood attributes with domain-specific physical activity and sedentary behaviour and suggested research priorities for improving the quality of future relevant studies. Methods: Published studies were obtained from nine databases (PubMed, Web of Science, PsycINFO, Scopus, Transport Research International Documentation, MEDLINE, Cochrane, Embase, and CINAHL) and crosschecked by Google Scholar. Observational studies with quantitative analyses estimating associations between workplace neighbourhood built environment attributes and workers’ physical activity or sedentary behaviour were included. Studies were restricted to those published in English language peer-reviewed journals from 2000 to 2019. Results: A total of 55 studies and 455 instances of estimated associations were included. Most instances of potential associations of workplace neighbourhood built environment attributes with total or domain-specific (occupational, transport, and recreational) physical activity were non-significant. However, destination-related attributes (i.e., longer distances from workplace to home and access to car parking) were positively associated with transport-related sedentary behaviour (i.e., car driving). Conclusions: The findings reinforce the case for urban design policies on designing mixed-use neighbourhoods where there are opportunities to live closer to workplaces and have access to a higher density of shops, services, and recreational facilities. Studies strengthening correspondence between the neighbourhood built environment attributes and behaviours are needed to identify and clarify potential relationships.
Protocol registration: The protocol of this systematic review was registered on the International Prospective Register of Systematic Reviews (PROSPERO) on 2 December 2019 (registration number: CRD42019137341).


2019 ◽  
Vol 46 (5) ◽  
pp. 683-695 ◽  
Author(s):  
Hayri Volkan Agun ◽  
Ozgur Yilmazel

Domain, genre and topic influences on author style adversely affect the performance of authorship attribution (AA) in multi-genre and multi-domain data sets. Although recent approaches to AA tasks focus on suggesting new feature sets and sampling techniques to improve the robustness of a classification system, they do not incorporate domain-specific properties to reduce the negative impact of irrelevant features on AA. This study presents a novel scaling approach, namely, bucketed common vector scaling, to efficiently reduce negative domain influence without reducing the dimensionality of existing features; therefore, this approach is easily transferable and applicable in a classification system. Classification performances on English-language competition data sets consisting of emails and articles and Turkish-language web documents consisting of blogs, articles and tweets indicate that our approach is very competitive with top-performing approaches on the English competition data sets and significantly improves on the top classification performance in mixed-domain experiments on blogs, articles and tweets.


2020 ◽  
Vol 10 (7) ◽  
pp. 2221 ◽  
Author(s):  
Jurgita Kapočiūtė-Dzikienė

Accurate generative chatbots are usually trained on large datasets of question–answer pairs. Although such datasets do not exist for some languages, companies still need chatbot technology on their websites. However, companies usually own small domain-specific datasets (at least in the form of an FAQ) about their products, services, or used technologies. In this research, we seek effective solutions to create generative seq2seq-based chatbots from very small data. Since experiments are carried out in English and in the morphologically complex Lithuanian language, we have an opportunity to compare results for languages with very different characteristics. We experimentally explore three encoder–decoder LSTM-based approaches (simple LSTM, stacked LSTM, and BiLSTM), three word embedding types (one-hot encoding, fastText, and BERT embeddings), and five encoder–decoder architectures based on different encoder and decoder vectorization units. Furthermore, all offered approaches are applied to the pre-processed datasets with removed and separated punctuation. The experimental investigation revealed the advantages of the stacked LSTM and BiLSTM encoder architectures and BERT embedding vectorization (especially for the encoder). The best achieved BLEU on English/Lithuanian datasets with removed and separated punctuation was ~0.513/~0.505 and ~0.488/~0.439, respectively. Better results were achieved with the English language, because generating different inflection forms for the morphologically complex Lithuanian is a harder task. The BLEU scores fell into the range defining the quality of the generated answers as good or very good for both languages. This research was performed with very small datasets having little variety in covered topics, which makes this research not only more difficult, but also more interesting. Moreover, to our knowledge, it is the first attempt to train generative chatbots for a morphologically complex language.


Author(s):  
Rosa Rabadán ◽  
Isabel Pizarro ◽  
Hugo Sanjurjo-González

Authoring support consists of (semi)automated aids to be used at different stages during the writing process. Language information, however, tends to be restricted to areas such as spelling and grammar checking or term banks, and text construction difficulties that writers face concerning the structure of particular genres, associated sentence formulations or genre-specific vocabulary have not received proper attention. An additional gap in the research is that this support is generally addressed to English language users. This paper addresses these concerns focusing on a particular genre: the company’s directors’ report, and on Spanish language writers writing in English. A custom-made monolingual corpus has been analyzed using Bhatia's (1993, 2004) and Swales's (1990, 2004) definitions of genre and move combined with theme characterization. Recurrent strings for each move/step, which are conventionally associated with each rhetorical unit, were identified and formulated as “meta-strings.” The bilingual glossary includes domain-specific items as well as move/step or genre-specific lexical and phraseological options, i.e., elements used irrespective of the business, places or people involved. The results are valuable by themselves, as an analysis of the genre, but also as the empirical basis for the authoring support tool that we present here, and as language training materials.


Terminology ◽  
2010 ◽  
Vol 16 (2) ◽  
pp. 141-158 ◽  
Author(s):  
Spela Vintar

The paper describes LUIZ, a bilingual term recognition system that has been developed for the Slovene-English language pair. The system is a hybrid term extractor using morphosyntactic patterns and statistical ranking to propose domain-specific expressions for each of the two languages, whereupon translation equivalents between the languages are identified using the innovative bag-of-equivalents approach. This simple but effective method is based on the Twente word aligner to obtain a lexicon of single word translation pairs and their probability scores, which is then used to identify correspondences between multi-word terms. The bilingual term recognition system has been tested and evaluated on three parallel subcorpora from the tourism, accounting and military domains. Average precision of the term alignment component is 0.83, whereby only fully equivalent and domain-relevant terms were counted as positives. Another advantage of the described approach is the fact that we successfully detect term variants and multiple translations of a candidate multi-word term. Since our term alignment method does not require sentence-aligned corpora, it can be used with comparable corpora, provided we already have a domain-specific lexicon or dictionary of single-word correspondences. The paper concludes with some thoughts on the users of term recognition systems and their needs based on our observations from the online version of the system.


2020 ◽  
Vol 95 (4) ◽  
pp. 485-523
Author(s):  
Joshua Bousquette

The present work examines nominal case marking in Wisconsin Heritage German, based on audio recordings of six speakers made in the late 1940s. Linguistic data provide positive evidence for a four-case nominal system characteristic of Standard German. At the same time, biographical and demographic information show that the heritage varieties acquired and spoken in the home often employed a different nominal system, one that often exhibited dative-accusative case syncretism and lacked genitive case—features that surfaced even when Standard German was spoken. These data strongly suggest that speakers were proficient in both their heritage variety of German, acquired through naturalistic means, as well as in Standard German, acquired through institutional support in educational and religious domains. Over time, these formal German-language domains shifted to externally oriented, English-language institutions. Standard German was no longer supported, while the heritage variety was retained in domestic and social domains. Subsequent case syncretism in Wisconsin Heritage German therefore reflects the retention of preimmigration, nonstandard varieties, rather than a morphological change in a unified heritage grammar. This work concludes by proposing a multistage model of domain-specific language shift, informed by both synchronic variation within the community as well as by social factors affecting language shift over time.


2020 ◽  
Vol 23 (2) ◽  
pp. 493-507
Author(s):  
Tanya Gibbs

Purpose: The transformation of the United Arab Emirates (UAE) into an important global economic player has been accompanied by digitalization that has also left it at risk of cybercrime. Concurrent with the rise in technology use, the UAE fast became one of the most targeted countries in the world. The purpose of this paper is to discuss how the UAE has tried to cope with accelerating levels of cyber threat using legislative and regulatory efforts as well as public- and private-sector initiatives meant to raise cybersecurity awareness. Design/methodology/approach: The paper surveys the UAE’s cybersecurity legislative, regulatory and educational initiatives from 2003 to 2019. Findings: Because the human factor remains the number one reason for security breaches, robust cyber laws alone are not enough to protect against cyber threats. Building public awareness and educating internet users about cyber risks and safety have become essential components of the UAE's efforts in building a more secure cyber environment for the country. Research limitations/implications: The paper relies on English-language translations of primary sources (laws) originally in Arabic, as well as English-language studies from local media. This should not be considered a problem, as English is established as the language of business and commerce in the UAE. Practical implications: The paper provides a detailed overview of the country’s cybersecurity environment to guide and aid practitioners with risk assessment and legal and regulatory compliance. Originality/value: The paper presents a comprehensive overview of the UAE’s cybersecurity legislative, regulatory and educational environment. It also surveys government and private sector initiatives directed at protecting the country’s cyberspace.


1990 ◽  
Vol 6 (1) ◽  
pp. 39-59 ◽  
Author(s):  
Helmut Zobl

Much current work on L2 acquisition is defined by the hypothesis that adult learners embark on the acquisition task with a language faculty whose structure is significantly less modular than that of the L1 learner. The domain-specific system, which has available to it the principles and conditions of Universal Grammar, has been replaced by content-neutral, central processes and the learner's L1 as the principal means by which an L2 can be internalized. An important corollary of this hypothesis is that acquisition will be piecemeal and will not evidence the effects associated with parameter setting. In this paper we attempt to demonstrate that adult L2 acquisition is module- and parameter-sensitive. The focus of the inquiry falls on the acquisition of the principle of structural government and the English language value of the agreement parameter by Japanese-speaking learners. Although the data supporting the claim come primarily from production, their analyses furnish compelling evidence that central processing, as it is currently understood, cannot account for the way attributes of these parametric choices cohere together.

