semantic markup
Recently Published Documents


TOTAL DOCUMENTS

55
(FIVE YEARS 18)

H-INDEX

5
(FIVE YEARS 1)

Author(s):  
А.Н. Гордей

On the basis of the third edition of the second version of the Theory for Automatic Generation of Knowledge Architecture (TAPAZ-2), a new approach to the semantic markup of events and to the formalization of the syntax of Chinese and Russian sentences is proposed, and ways are outlined for solving the problem of automatically identifying the semantic equivalence of text documents and the borrowing of scientific ideas.


2021 ◽  
pp. 338-356
Author(s):  
Tarfah Alrashed ◽  
Dimitris Paparas ◽  
Omar Benjelloun ◽  
Ying Sheng ◽  
Natasha Noy

Abstract: Semantic markup, such as schema.org, allows providers on the Web to describe content using a shared controlled vocabulary. This markup is invaluable in enabling a broad range of applications, from vertical search engines, to rich snippets in search results, to actions on emails, to many others. In this paper, we focus on semantic markup for datasets, specifically in the context of developing a vertical search engine for datasets on the Web, Google's Dataset Search. Dataset Search relies on schema.org to identify pages that describe datasets. While schema.org was the core enabling technology for this vertical search, we also discovered that we need to address the following problem: pages from 61% of internet hosts that provide dataset markup do not actually describe datasets. We analyze the veracity of dataset markup for Dataset Search's Web-scale corpus and categorize pages where this markup is not reliable. We then propose a way to drastically increase the quality of the dataset metadata corpus by developing a deep neural-network classifier that identifies whether or not a page with dataset markup is a dataset page. Our classifier achieves 96.7% recall at the 95% precision point. This level of precision enables Dataset Search to circumvent the noise in semantic markup and to use the metadata to provide high quality results to users.
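The classification task the abstract describes can be framed as a binary decision over page text. A minimal sketch follows; the paper trains a deep neural network over page features, so the keyword score below is only a stand-in to make the input/output framing concrete, and the cue lists are invented for illustration.

```python
import re

# Toy framing of the task: given the text of a page that carries dataset
# markup, decide whether the page actually describes a dataset.
# These cue sets are illustrative assumptions, not features from the paper.
DATASET_CUES = {"dataset", "csv", "download", "records", "data"}
NON_DATASET_CUES = {"login", "cart", "checkout", "forum"}

def is_dataset_page(text: str, threshold: int = 2) -> bool:
    """True when dataset cues outnumber non-dataset cues by at least `threshold`."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    score = len(words & DATASET_CUES) - len(words & NON_DATASET_CUES)
    return score >= threshold

print(is_dataset_page("Download the full dataset as CSV, 10000 records"))  # True
print(is_dataset_page("Add to cart and proceed to checkout"))              # False
```

A real model would replace the hand-written score with a learned function, which is what lets the paper trade recall against precision (96.7% recall at 95% precision).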


2020 ◽  
Vol 4 (6) ◽  
pp. 85-91
Author(s):  
Dildora Bahodirovna Akhmedova ◽  

Background. Semantic markup is an issue that has been thoroughly studied by experts. While the first generation of language corpora was simply a collection of electronic texts, later corpora were equipped with query interfaces and with linguistic and extralinguistic annotation. Linguistically annotated corpora were at first only morphological, then morpho-syntactic, and in recent years the most complete form of linguistic annotation, the corpus with morphological, syntactic, and semantic markup, has undergone a stage of development. The introduction of semantic markup into corpora was initially based on theory, while the problems of semantic annotation were still being explored. Such works include the research of Yu.D. Apresyan, I.M. Boguslavskiy, B.L. Iomdin, E.V. Biryaltsev, A.M. Elizarov, N.G. Jiltsov, V.V. Ivanov, O.A. Nevzorova, V.D. Solovev, I.S. Kononenko, E.A. Sidorova, E.I. Yakovchuk, E.V. Rakhilina, G.I. Kustova, O.N. Lyashevskaya, T.I. Reznikova, O.Yu. Shemanaeva, and A.A. Kretov.


Author(s):  
Arzhaana Hertek ◽  
B.C. Oorzhak ◽  
Aelita Salchak ◽  
V.S. Ondar ◽  
S.M. Dallaa

This article reports the first experience of developing an electronic database of Tuvan lexemes within the framework of the project "Creation of a database of the Tuvan lexical fund" (RGNF/RFBR No. 16-04-12020, 2016-2017). The databases contain the main body of full-meaning lexemes (nouns, adjectives, adverbs, verbs, and pronouns), distributed by semantic classes, subclasses, groups, subgroups, and microgroups. The systematized database of the lexical fund will be used in further work on the semantic markup of texts in the Electronic Corpus of the Tuvan Language and in compiling various types of dictionaries of Tuvan. The databases are created using the Access 2010 database management system. Texts in the Tuvan language will be processed using the C++ object-oriented programming system. These systems support Unicode, in which all texts in Tuvan are digitized. Computer programs will be created both for computers running Windows and for mobile devices running Android. Currently, search is performed in Excel.
Information on the creation of the databases is available on the page of the Electronic Corpus of Tuvan Language Texts: http://tuvancorpus.ru/?q=content/bazy-dannyh.
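The structure described above (lexemes attached to a hierarchy of semantic classes, subclasses, groups, and so on) maps naturally onto two relational tables with a self-referencing hierarchy. The sketch below uses Python's built-in sqlite3 rather than Access; all table and column names are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical sketch of the lexeme database: each semantic class may have a
# parent, so class > subclass > group > subgroup > microgroup is one table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE semantic_class (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_id INTEGER REFERENCES semantic_class(id)
);
CREATE TABLE lexeme (
    id INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL,   -- stored in Unicode, as the abstract notes
    pos TEXT NOT NULL,     -- noun, adjective, adverb, verb, pronoun
    class_id INTEGER REFERENCES semantic_class(id)
);
""")
conn.execute("INSERT INTO semantic_class VALUES (1, 'living being', NULL)")
conn.execute("INSERT INTO semantic_class VALUES (2, 'animal', 1)")
conn.execute("INSERT INTO lexeme VALUES (1, 'аът', 'noun', 2)")  # 'horse' in Tuvan

row = conn.execute("""
    SELECT l.lemma, c.name
    FROM lexeme l JOIN semantic_class c ON l.class_id = c.id
""").fetchone()
print(row)  # ('аът', 'animal')
```

The self-referencing `parent_id` keeps the depth of the classification open-ended, so adding a new level of the hierarchy needs no schema change.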


2020 ◽  
Vol 10 (4) ◽  
pp. 679-691
Author(s):  
I. P. Novak ◽  
N. B. Krizhanovskaya ◽  
T. P. Boyko ◽  
N. A. Pellinen ◽  
...  

Introduction: linking tokens in texts with the meanings of lemmas in the dictionary of the VepKar corpus significantly facilitates further work on the semantic markup of texts. In 2019, inflectional rules were developed for the Vepsian subcorpus of VepKar, and on the basis of these rules a function for generating a complete paradigm from basic word forms was added to the corpus. VepKar editors need to enter a large number of word forms when creating dictionary entries in the three Karelian subcorpora (about 30 for nominals and 150 for verbs), so the development of an algorithm and a computer program for generating word forms of the Karelian language proved timely. Objective: to illustrate how the list of stems of the nominal parts of speech of two new-written dialects of the Karelian language can be used to create rules for the automatic generation of word forms. Research materials: lemmas and word forms from the Open Corpus of the Vepsian and Karelian Languages, the Corpus of Border Karelia, and the electronic version of the Dictionary of the Karelian Language. Results and novelty of the research: grammatical patterns were studied over many years from theoretical sources and were also discovered experimentally. On this basis, a list of stems and pseudo-stems of word forms was compiled for the nominal parts of speech, a system of rules for generating word forms was developed, and the corresponding computer program was written and tested. The scientific novelty of the study lies in the first attempt to develop uniform rules for the automatic generation of word forms for two dialects of the Karelian language.
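The core idea of the abstract, generating a full paradigm from a stem plus a system of inflectional rules, can be sketched in a few lines. The case endings below are purely illustrative and do not reproduce the actual Karelian rules or their vowel-harmony and stem-alternation handling.

```python
# Toy rule table: case name -> ending appended to the nominal stem.
# These endings are invented for illustration, not taken from VepKar.
CASE_ENDINGS = {
    "nominative": "",
    "genitive": "n",
    "partitive": "a",
    "inessive": "ssa",
}

def generate_paradigm(stem: str) -> dict:
    """Build the full (toy) case paradigm for one nominal stem."""
    return {case: stem + ending for case, ending in CASE_ENDINGS.items()}

print(generate_paradigm("kala"))
# {'nominative': 'kala', 'genitive': 'kalan', 'partitive': 'kalaa', 'inessive': 'kalassa'}
```

This shows why a rule system pays off for the editors: instead of typing about 30 nominal forms per entry, they enter one stem and the program expands it.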


2020 ◽  
Vol 88 ◽  
pp. 01001
Author(s):  
Irina Karabulatova

The relevance of the stated problem lies in the study of the "friend-foe" dichotomy, which is clearly represented in modern news discourse, since that discourse reflects the problems most significant for society: migration, the COVID-19 pandemic, crime, various confrontations, the problems of socially vulnerable citizens, etc. The subject of the research is to refine the parameters for evaluating potentially dangerous texts, for the subsequent creation of a library of software modules for topic detection and classification of news messages, including with AI technologies. Hypothesis: the proposed parameters for interpreting a potentially dangerous text increase the chances of determining a prognostic level of propensity to illegal actions, so the creation of a digital library will help to quickly analyze the levels of potential danger for the recipient. The use of digital technologies for the psycholinguistic assessment of potentially dangerous texts optimizes the search for and tracking of such texts, contributing to the development of measures to protect the human psyche under conditions of massive impact on recipients aimed at changing their personal attitudes. The author raises the problem of creating a single digital platform for evaluating such texts, noting the need for linguistic priority when creating semantic markup, which will allow potentially dangerous texts to be ranked qualitatively. Such work requires the interdisciplinary efforts of specialists in linguistics, psychology, mythology, history, sociology, political science, cultural studies, mathematics, computer science, and the Digital Humanities. The practical value is unquestionable, since the psycholinguistic diagnostics of a person does not yet correlate with the potential danger of the texts that person produces in society.


2020 ◽  
Vol 7 (1) ◽  
pp. 3-9
Author(s):  
Yu.N. Bartashevskaya ◽  

The article considers the problem of using Big Data in the modern economy and public life. The volume and complexity of information are growing rapidly, but current technologies cannot ensure its effective use; technologies, methods, and practices for working with Big Data lag behind. This imbalance can be addressed by semantic technologies, which take a different, knowledge-based approach to the processing and use of data. It is shown that, despite the rather long existence of semantic technologies and semantic networks, many obstacles to their effective application remain: the accessibility of semantic content, the accessibility of ontologies, their evolution, scalability, and multilingualism. Since far from all the data presented on the network is created with semantic markup, and is unlikely to be retrofitted with it in the future, the accessibility of semantic content is one of the main problems. The article shows the difference between a semantic network and the Semantic Web, and outlines the development technologies of the latter. As the subject of study, the course module of Alfred Nobel University was selected. The composition of a separate module or course is examined in detail: data on the university, the lecturer, the provision of the course and the language of its teaching, the acquired skills, abilities, results, and the like. A graph of the course module has been built for Alfred Nobel University in terms of an ontology, and its most significant classes and components are considered. The main classes, subclasses, and their contents are described, and data types (date, text, URL) are indicated. The ontological scheme has been converted to RDF format, as is necessary for modelling data in the semantic network and for further research.
The prospects for further research are determined: applying the selected knowledge-representation model, using a query language, and obtaining and interpreting data from other universities. Keywords: semantic technologies, semantic networks, ontologies, CmapTools, course module graph.
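Converting an ontological scheme of a course module into RDF, as the abstract describes, amounts to emitting subject-predicate-object triples with typed objects. The sketch below builds N-Triples with plain strings to stay dependency-free; the namespace, property names, and values are assumptions for illustration, not the article's actual schema.

```python
# Hypothetical course-module triples: resources vs. literals, as in the
# abstract's data types (date, text, URL). Names are illustrative only.
NS = "http://example.org/course#"

triples = [
    (NS + "Module1", NS + "taughtAt", '"Alfred Nobel University"'),
    (NS + "Module1", NS + "language", '"English"'),
    (NS + "Module1", NS + "startDate", '"2020-09-01"'),
    (NS + "Module1", NS + "lecturer", NS + "Lecturer1"),
]

def to_ntriples(ts):
    """Serialize (s, p, o) tuples as N-Triples lines."""
    lines = []
    for s, p, o in ts:
        obj = o if o.startswith('"') else f"<{o}>"  # literal vs. resource
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

print(to_ntriples(triples))
```

Once the graph is in this form it can be loaded into any triple store and queried with SPARQL, which is the "query language" step the abstract lists among the prospects for further research.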


PMLA ◽  
2020 ◽  
Vol 135 (1) ◽  
pp. 165-174
Author(s):  
Susan Brown

Feminist literary history balances a commitment to a different future, one better than the present with respect to gender, with an orientation toward the past, whose ways of knowing it seeks to supersede even as it engages with them. The revision of our cultural past through the lens of gender has, by drawing on past categorizations of authors as female, necessarily invoked problematic paradigms in the service of critique and epistemological change. The relation of the digital humanities (DH) to category work is similarly fraught. I offer here my take on the power and peril of classification based on category making in the pursuit of digital feminist literary history through the Orlando Project, an ongoing experiment in using semantic markup for online scholarship. Orlando is known for its online textbase, published with Cambridge University Press, but the team has produced a number of exploratory interfaces and translations of the material into other forms. Over the course of a quarter century of grappling with "the digital as difference" (Wernimont and Flanders 430) alongside other feminist projects, I have changed my understanding of classification as my collaborators and I have tried to represent the difference that gender analysis makes when undertaken in a computational environment. I here argue that category work, always vexed, always provisional, is crucial to realizing the potential of DH for representing, analyzing, and fostering difference.

