annotation scheme

2022 ◽  
Vol 9 ◽  
Author(s):  
Ryzen Benson ◽  
Mengke Hu ◽  
Annie T. Chen ◽  
Shu-Hong Zhu ◽  
Mike Conway

Background: Perceptions of tobacco, cannabis, and electronic nicotine delivery systems (ENDS) are continually evolving in the United States. Exploring these perceptions through user-generated text sources may provide novel insights into product use behavior that are challenging to identify using survey-based methods. The objective of this study was to compare the topics frequently discussed among Reddit members in cannabis-, tobacco-, and ENDS-specific subreddits.
Methods: We collected 643,070 posts on the social media site Reddit between January 2013 and December 2018. We developed and validated an annotation scheme, achieving a high level of agreement among annotators. We then manually coded a subset of 2,630 posts for content relating to experiences with and use of the three products of interest, and built word cloud representations of the words contained in these posts. Finally, we applied Latent Dirichlet Allocation (LDA) topic modeling to the 643,070 posts to identify emerging themes related to cannabis, tobacco, and ENDS products being discussed on Reddit.
Results: Our manual annotation process yielded 2,148 (81.6%) posts that mentioned cannabis, tobacco, or ENDS, with 1,537 (71.5%) of these posts mentioning cannabis, 421 (19.5%) mentioning ENDS, and 264 (12.2%) mentioning tobacco. In cannabis-specific subreddits, commonly discussed topics were personal experiences with cannabis, cannabis legislation, health effects of cannabis use, methods and forms of cannabis, and the cultivation of cannabis. Tobacco-specific subreddits often focused on brands and types of combustible tobacco, as well as smoking cessation experiences and advice. In ENDS-specific subreddits, topics often included ENDS accessories and parts, flavors and nicotine solutions, procurement of ENDS, and the use of ENDS for smoking cessation.
Conclusion: Our findings highlight the posting and participation patterns of Reddit members in cannabis-, tobacco-, and ENDS-specific subreddits and provide novel insights into aspects of personal use of these products. These findings complement epidemiologic study designs and highlight the potential of using specific subreddits to explore personal experiences with cannabis, ENDS, and tobacco products.
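
The abstract above reports a high level of agreement among annotators but does not name the agreement statistic. As an illustration only, the following minimal sketch computes Cohen's kappa for two annotators, which is one common choice and an assumption here. The label lists are hypothetical, and each post carries a single label, which simplifies the study's setting where a post may mention several products.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations, not data from the study.
annotator_1 = ["cannabis", "ends", "tobacco", "cannabis", "none", "cannabis"]
annotator_2 = ["cannabis", "ends", "cannabis", "cannabis", "none", "tobacco"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw percent agreement for the agreement expected by chance, which matters when one label (here, cannabis) dominates the sample.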


2021 ◽  
Vol 10 (2) ◽  
pp. 411
Author(s):  
Prihantoro Prihantoro

SANTI-Morf (Prihantoro, 2021) is a new morphological analyser for Indonesian. In the SANTI-Morf annotation scheme (Prihantoro, 2019), morpheme tokens are linked to their annotations. The tokens are presented in their orthographic and citation forms to allow (allo)morph-based or morpheme-based searches. Users can also perform retrievals on the basis of formal and functional morphological criteria, as the SANTI-Morf tagset encodes the analyses of morphemes' forms (e.g. roots, clitics, affix types) and functions (e.g. passive voice, active voice, adjective degrees). Currently, the scheme is implemented in NooJ (Silberztein, 2003), a linguistic development environment. It enables users to index and annotate Indonesian texts on their local PC, and later perform searches based on morphological criteria and/or tokens defined by the SANTI-Morf scheme.


2021 ◽  
Vol 9 ◽  
Author(s):  
Xin Wang ◽  
Fan Chao ◽  
Guang Yu

Background: The spread of rumors related to COVID-19 on social media has posed substantial challenges to public health governance, and exposing rumors and curbing their spread quickly and effectively has therefore become an urgent task. This study aimed to assist in formulating effective strategies to debunk rumors and curb their spread on social media.
Methods: A total of 2,053 original postings and 100,348 comments replying to postings about five false rumors related to COVID-19 (dated from January 20, 2020, to June 28, 2020) were randomly selected from Sina Weibo in China. The rumors belonged to three categories: authoritative, social, and political. To study the effectiveness of different debunking methods, a new annotation scheme was proposed that divides debunking methods into six categories: denial, further fact-checking, refutation, person response, organization response, and combination methods. Text classifiers using deep learning methods were built to automatically identify four user stances in comments replying to debunking postings: supporting, denying, querying, and commenting. Then, based on stance responses, a debunking effectiveness index (DEI) was developed to measure the effectiveness of different debunking methods.
Results: The refutation method with cited evidence has the best debunking effect, whether used alone or in combination with other debunking methods. For the Car rumor (social category) and the Russia rumor (political category), using the refutation method alone achieves the optimal debunking effect. For authoritative rumors, a combination method has the optimal debunking effect, but the most effective combinations avoid pairing the refutation method with a personal response from the person or organization defamed by the rumor.
Conclusion: The findings provide relevant insights into ways to debunk rumors effectively, support crisis management of false information, and take necessary actions in response to rumors amid public health emergencies.


Author(s):  
Michal Ptaszynski ◽  
Monika Zasko-Zielinska ◽  
Michal Marcinczuk ◽  
Gniewosz Leliwa ◽  
Marcin Fortuna ◽  
...  

In this paper, we study the language used by suicidal users on the Reddit social media platform. To do that, we first collect a large-scale dataset of Reddit posts and have it annotated by highly trained expert annotators under a rigorous annotation scheme. Next, we perform a multifaceted analysis of the dataset, including: (1) an analysis of user activity before and after posting a suicidal message, and (2) a pragmalinguistic study of the vocabulary used by suicidal users. In the second part of the analysis, we apply LIWC, a dictionary-based toolset widely used in psychology and linguistic research, which provides a wide range of linguistic category annotations on text. However, since raw LIWC scores are not sufficiently reliable or informative, we propose a procedure that reduces the risk of drawing misleading conclusions from them by analyzing each category not separately but in pairs with other categories. The results supported the validity of the proposed approach by revealing valuable information about the vocabulary used by suicidal users and helped to pinpoint false predictors. For example, we were able to specify that death-related words, typically associated with suicidal posts in the majority of the literature, become false predictors when they co-occur with apostrophes, even in high-risk subreddits. On the other hand, the category-pair-based disambiguation helped to specify that death becomes a predictor only when co-occurring with future-focused language, informal language, discrepancy, or 1st person pronouns. We also analyzed the limitations of the approach and found that, although LIWC is a useful and easily applicable tool, the lack of any contextual processing makes it unsuitable for application in psychological and linguistic studies. We conclude that the disadvantages of LIWC can be easily overcome by creating a number of high-performance AI-based classifiers trained to annotate categories similar to those of LIWC, which we plan to pursue in future work.
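
The category-pair idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: each post is assumed to be already reduced to the set of dictionary categories it triggers, the category names are illustrative labels rather than LIWC output, and the toy records and their high-risk flags are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def pair_rates(posts):
    """Share of high-risk posts among posts containing each category pair.

    posts: iterable of (set_of_categories, is_high_risk) records.
    Returns a dict mapping an alphabetically ordered pair to a rate in [0, 1].
    """
    totals = defaultdict(int)
    risky = defaultdict(int)
    for cats, is_high_risk in posts:
        # Sort so that every pair has one canonical (alphabetical) order.
        for pair in combinations(sorted(cats), 2):
            totals[pair] += 1
            risky[pair] += int(is_high_risk)
    return {pair: risky[pair] / totals[pair] for pair in totals}


# Hypothetical records: (categories present in the post, high-risk flag).
toy_posts = [
    ({"death", "focusfuture", "i"}, True),
    ({"death", "focusfuture"}, True),
    ({"death", "apostrophe"}, False),
    ({"death", "apostrophe", "i"}, False),
    ({"death", "i"}, True),
]
rates = pair_rates(toy_posts)
for pair, rate in sorted(rates.items()):
    print(pair, round(rate, 2))
```

Looking at pairs rather than single categories lets the same category (here, death) point in different directions depending on what it co-occurs with, which is the disambiguation effect the abstract describes.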


10.2196/24471 ◽  
2021 ◽  
Vol 8 (11) ◽  
pp. e24471
Author(s):  
Stevie Chancellor ◽  
Steven A Sumner ◽  
Corinne David-Ferdon ◽  
Tahirah Ahmad ◽  
Munmun De Choudhury

Background: Online communities provide support for individuals looking for help with suicidal ideation and crisis. As community data are increasingly used to devise machine learning models to infer who might be at risk, there have been limited efforts to identify both risk and protective factors in web-based posts. These annotations can enrich and augment computational assessment approaches to identify appropriate intervention points, which are useful to public health professionals and suicide prevention researchers.
Objective: This qualitative study aims to develop a valid and reliable annotation scheme for evaluating risk and protective factors for suicidal ideation in posts in suicide crisis forums.
Methods: We designed a valid, reliable, and clinically grounded process for identifying risk and protective markers in social media data. This scheme draws on prior work on construct validity and the social sciences of measurement. We then applied the scheme to annotate 200 posts from r/SuicideWatch, a Reddit community focused on suicide crisis.
Results: We documented our results on producing an annotation scheme that is consistent with leading public health information coding schemes for suicide and advances attention to protective factors. Our study showed high internal validity, and we have presented results that indicate that our approach is consistent with findings from prior work.
Conclusions: Our work formalizes a framework that incorporates construct validity into the development of annotation schemes for suicide risk on social media. This study furthers the understanding of risk and protective factors expressed in social media data. This may help public health programming to prevent suicide and computational social science research and investigations that rely on the quality of labels for downstream machine learning tasks.


Author(s):  
Gilles Jacobs ◽  
Véronique Hoste

We present SENTiVENT, a corpus of fine-grained company-specific events in English economic news articles. The domain of event processing is highly productive, and various general-domain, fine-grained event extraction corpora are freely available, but economically focused resources are lacking. This work fills a large need for a manually annotated dataset for economic and financial text mining applications. A representative corpus of business news is crawled and an annotation scheme developed with an iteratively refined economic event typology. The annotations are compatible with benchmark datasets (ACE/ERE) so state-of-the-art event extraction systems can be readily applied. This results in a gold-standard dataset annotated with event triggers, participant arguments, event co-reference, and event attributes such as type, subtype, negation, and modality. An adjudicated reference test set is created for use in annotator and system evaluation. Agreement scores are substantial and annotator performance adequate, indicating that the annotation scheme produces consistent event annotations of high quality. In an event detection pilot study, satisfactory results were obtained with a macro-averaged $F_1$-score of 59%, validating the dataset for machine learning purposes. This dataset thus provides a rich resource on events as training data for supervised machine learning for economic and financial applications. The dataset and related source code are made available at https://osf.io/8jec2/.
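
For readers unfamiliar with the metric reported above, the following minimal sketch shows how a macro-averaged F1 score is computed: F1 is calculated per class and the per-class scores are averaged with equal weight. The event type names and the gold and predicted labels are hypothetical and are not taken from SENTiVENT.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two equal-length, non-empty label lists")
    pairs = list(zip(gold, predicted))
    scores = []
    for cls in sorted(set(gold) | set(predicted)):
        tp = sum(g == cls and p == cls for g, p in pairs)
        fp = sum(g != cls and p == cls for g, p in pairs)
        fn = sum(g == cls and p != cls for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical event-type labels for six trigger candidates.
gold = ["Profit", "Profit", "Merger", "Merger", "NoEvent", "NoEvent"]
pred = ["Profit", "Merger", "Merger", "Merger", "NoEvent", "Profit"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

Because every class counts equally, macro-averaging penalizes poor performance on rare event types more than micro-averaging would.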


2021 ◽  
Vol 11 (1) ◽  
pp. 209-223
Author(s):  
Rosa Rabadán ◽  
Noelia Ramón ◽  
Hugo Sanjurjo-González

This paper explores the multi-layer annotation of a written domain-restricted English-Spanish comparable corpus (CLANES – Controlled LANguage English Spanish), focusing on pragmatic annotation. The annotation scheme draws on part-of-speech tagging and a semantic annotation scheme, i.e. the UCREL Semantic Analysis System, with some added categories to fit the food-and-drink domain represented in CLANES. These are used to build significant (pragmatic) metapatterns. Seven different pragmatic functions have been identified in our corpus, namely <STATE>, <DIRECT>, <SUGGEST>, <RECOMMEND>, <PRAISE>, <EVIDENCE> and <RELATE TO READER>. Computer scripts translate this linguistic information into regular expressions to be used in unsupervised annotation. Partial results indicate that applying lexical restrictors boosts the success rate considerably. However, metadata is preferred because of increased replicability and generality. Replicability issues and limitations encountered during testing are also addressed.
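
As a rough illustration of regex-driven pragmatic annotation, the sketch below tags sentences with three of the functions named above. It is an assumption-laden toy: the patterns are invented lexical restrictors rather than the CLANES metapatterns, and it matches surface words only, without the part-of-speech and semantic tag layers the paper builds on.

```python
import re

# Hypothetical lexical patterns; the real metapatterns are not given here.
PATTERNS = [
    ("DIRECT", re.compile(r"^(?:serve|store|keep|add|mix|shake)\b", re.IGNORECASE)),
    ("SUGGEST", re.compile(r"\b(?:try|why not|perfect (?:for|with))\b", re.IGNORECASE)),
    ("RECOMMEND", re.compile(
        r"\b(?:we recommend|best (?:served|enjoyed)|ideal (?:for|with))\b",
        re.IGNORECASE,
    )),
]


def annotate(sentence):
    """Return the pragmatic function tags whose pattern matches the sentence."""
    return [tag for tag, pattern in PATTERNS if pattern.search(sentence)]


for text in [
    "Serve chilled.",
    "We recommend pairing it with fish.",
    "Try it with red wine, ideal for winter evenings.",
    "This cheese is made in Asturias.",
]:
    tags = annotate(text)
    print(" ".join(f"<{t}>" for t in tags) or "(no tag)", "|", text)
```

A sentence can receive several tags or none, so downstream counts should treat the output as multi-label.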


2021 ◽  
Author(s):  
Connor L. Brown ◽  
James Mullet ◽  
Fadi Hindi ◽  
James E. Stoll ◽  
Suraj Gupta ◽  
...  

Currently available databases of bacterial mobile genetic elements (MGEs) contain both “core” and accessory MGE functional modules, the latter of which are often only transiently associated with the element. The presence of these accessory genes, which are often close homologs to primarily immobile genes, limits the usability of these databases for MGE annotation. To overcome this limitation, we analysed 10,776,212 protein sequences derived from seven MGE databases to compile a comprehensive database of 6,140 manually curated protein families that are linked to the “life cycle” (integration, excision, replication/recombination/repair, transfer, and stability/defense) of all major classes of bacterial MGEs. We overlay experimental information where available to create a tiered annotation scheme of high-quality annotations and annotations inferred exclusively through bioinformatic evidence. We additionally provide an MGE-class label for each entry (e.g., plasmid, integrative element) derived from the source database, and assign a list of keywords to each entry to delineate different MGE functional modules and to facilitate annotation. The resulting database, mobileOG-db (for mobile orthologous groups), provides a simple and readily interpretable foundation for an array of MGE-centred analyses. mobileOG-db can be accessed at mobileogdb.flsi.cloud.vt.edu/, where users can browse and design, refine, and analyse custom subsets of the dynamic mobilome.

