large corpus
Recently Published Documents


TOTAL DOCUMENTS

376
(FIVE YEARS 145)

H-INDEX

21
(FIVE YEARS 4)

2021 ◽  
Vol 13 (4) ◽  
pp. 1-35
Author(s):  
Gabriel Amaral ◽  
Alessandro Piscopo ◽  
Lucie-aimée Kaffee ◽  
Odinaldo Rodrigues ◽  
Elena Simperl

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important, as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on previous work of ours, we run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices, which could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.


2021 ◽  
Vol 11 (3-4) ◽  
pp. 1-23
Author(s):  
Linhao Meng ◽  
Yating Wei ◽  
Rusheng Pan ◽  
Shuyue Zhou ◽  
Jianwei Zhang ◽  
...  

Federated Learning (FL) provides a powerful solution to distributed machine learning on a large corpus of decentralized data. It ensures privacy and security by performing computation on devices (which we refer to as clients) based on local data to improve the shared global model. However, the inaccessibility of the data and the invisibility of the computation make it challenging to interpret and analyze the training process, especially to distinguish potential client anomalies. Identifying these anomalies can help experts diagnose and improve FL models. For this reason, we propose a visual analytics system, VADAF, to depict the training dynamics and facilitate analyzing potential client anomalies. Specifically, we design a visualization scheme that supports massive training dynamics in the FL environment. Moreover, we introduce an anomaly detection method to detect potential client anomalies, which are further analyzed based on both the client model’s visual and objective estimation. Three case studies have demonstrated the effectiveness of our system in understanding the FL training process and supporting abnormal client detection and analysis.


2021 ◽  
Vol 15 (4) ◽  
pp. 576-589
Author(s):  
Lyudmila Jevgenyevna Kirillova

The article considers geographical terms for designating streets and lanes in the Udmurt language based on the study of a large corpus of godonyms. In the study of microtoponyms collected by the author and other toponymists on the territory of Udmurtia and beyond it, in the places where Udmurts live, the author managed to identify a significant number of words expressing these concepts. The words ulcha and uram are used to express the concept of ‘street’ in the Udmurt language, which is confirmed by the data of Udmurt toponymy. Special attention is paid to the description of common nouns with the meaning ‘lane’ recorded in microtoponyms, since this layer of vocabulary has not yet been considered in detail. The review indicates that a number of lexemes can act as common words used with this meaning. Taking into account different variants, the author identified 13 units in total. The etymological analysis of these geographical terms suggests that they are heterogeneous in origin. Common nouns borrowed from the Russian language are widespread. Geographical terms of Udmurt and pre-Permian or Finno-Permian origin are slightly less frequent. A small number of lexemes are derived from the Turkic languages. Mixed Udmurt-Russian and Udmurt-Tatar formations occur only in isolated instances.


2021 ◽  
pp. 5-20
Author(s):  
Vasiliki Petsa ◽  
Sofia Zisimopoulou ◽  
Anastasia Natsina ◽  
Ioannis Dimitrakakis

Surveying a large corpus of Modern Greek fiction from the interwar years to the decade of the financial crisis (2010-2020), we set out to delineate the national inflection of ‘working-class fiction’ along the axes of theme and style as well as answerability, i.e. the engagement with working-class interests in distinct periods (interwar years, WWII and postwar, Metapolitefsi and beyond). Characterized by quantitative and aesthetic variability, the Greek version of the genre is shown to engage actively with topical contextual issues as well as with changing imperatives of authorial commitment and the shifting composition of the working class.


2021 ◽  
Author(s):  
Tina Nane ◽  
Nicolas Robinson-Garcia ◽  
Francois van Schalkwyk ◽  
Daniel Torres-Salinas

We model the growth of scientific literature related to COVID-19 and forecast the expected growth from 1 June 2021. Considering the significant scientific and financial efforts made by the research community to find solutions to end the COVID-19 pandemic, an unprecedented volume of scientific outputs is being produced. This calls into question the capacity of scientists, politicians and citizens to maintain infrastructure, digest content and take scientifically informed decisions. A crucial aspect is to make predictions to prepare for such a large corpus of scientific literature. Here we base our predictions on the ARIMA and exponential smoothing models and use two different data sources: the Dimensions and World Health Organization COVID-19 databases. These two sources have the particularity of including in their metadata the date on which papers were indexed. We present global predictions, plus predictions in three specific settings: by type of access (Open Access), by NLM source (PubMed and PMC), and by domain-specific repository (SSRN and MedRxiv). We conclude by discussing our findings.


Author(s):  
Aixiu An ◽  
Anne Abeillé

Contrary to most French grammars claiming that French only allows masculine agreement when mixed-gender nouns are conjoined, we show that closest conjunct agreement (CCA) does exist in contemporary French, as in other Romance languages, and is the preferred strategy for prenominal adjectives. Using data from a large corpus (FrWac) and an acceptability rating experiment, we show that (feminine) CCA is well accepted in contemporary French, and should be distinguished from attraction errors, despite the norm prescribing masculine agreement. We also show the role of the adjective position, i.e. prenominal or post-nominal, and humanness. CCA is the preferred strategy for prenominal adjectives, and non-human nouns favour CCA for post-nominal adjectives. Assuming a hierarchical structure for coordination, the closest noun is the highest in A-N order, whereas it is the lowest in N-A order. Thus CCA in prenominal position may be favoured by a shorter structural distance. One can also see CCA with a prenominal adjective as ‘early’ agreement. Regarding humanness, grammatical gender is interpreted as social gender with human nouns, and a masculine plural can refer to a mixed group. This ‘gender neutral’ plural may favour masculine agreement for human nouns; alternatively, the prescriptive norm may be more influential for human nouns.


2021 ◽  
Author(s):  
Irina A. Ponomareva

The book offers a detailed study of a large corpus of rock art which is little known to an international audience. The book covers not only a huge region of East Siberia but also a period spanning from the Late Paleolithic to the Iron Age, providing detailed accounts of the regional archaeology and rock art through the perspective of ethnicity, identity, and symbolism.


Author(s):  
Ayana Niwa ◽  
Naoaki Okazaki ◽  
Kohei Wakimoto ◽  
Keisuke Nishiguchi ◽  
Masataka Mouri

An advertising slogan is a sentence that expresses a product or a work of art in a straightforward manner and is used for advertising and publicity. Moving the consumer's mind and attracting their interest can significantly influence sales. Although rhetorical techniques in a slogan are known to improve the effectiveness of advertising, not much attention has been devoted to analyzing or automatically generating sentences with these techniques. Therefore, we constructed a large corpus of slogans and revealed its linguistic characteristics in terms of basic statistics and rhetorical devices. Another point of focus was antitheses, of which the usage rates are relatively high and which have a specific sentence structure and lexical constraints. The generation of a slogan that contains an antithesis necessitates the structure of sentences, known as templates, to be extracted and also requires knowledge of word pairs with semantic contrast. Thus, the next step involved analysis of the structure to extract the sentence structure and lexical knowledge about the antithesis. Despite its simple architecture, the proposed method exceeds the prediction accuracy and efficiency of a comparable method. Lexical knowledge that is not available in existing dictionaries was also extracted.


2021 ◽  
Author(s):  
Gavin Brookes ◽  
Paul Baker

Obesity is a pressing social issue and a persistently newsworthy topic for the media. This book examines the linguistic representation of obesity in the British press. It combines techniques from corpus linguistics with critical discourse studies to analyse a large corpus of newspaper articles (36 million words) representing ten years of obesity coverage. These articles are studied from a range of methodological perspectives, and analytical themes include variation between newspapers, change over time, diet and exercise, gender and social class. The volume also investigates the language that readers use when responding to obesity representations in the context of online comments. The authors reveal the power of linguistic choices to shame and stigmatise people with obesity, presenting them as irresponsible and morally deviant. Yet the analysis also demonstrates the potential for alternative representations which place greater focus on the role that social and political forces play in this topical health issue.

