The FGLOCTweet Corpus: An English tweet-based corpus for fine-grained location-detection tasks

2022 ◽  
Vol 10 (1) ◽  
pp. 117-133
Author(s):  
Nicolás José Fernández-Martínez

Location detection in social-media microtexts is an important natural language processing task in emergency contexts, where locative references are identified in text data. Spatial information obtained from texts is essential for understanding where an incident happened, where people are in need of help and/or which areas have been affected. This information contributes to raising emergency situation awareness, which is then passed on to emergency responders and competent authorities so they can act as quickly as possible. Annotated text data are necessary for building and evaluating location-detection systems. The problem is that available corpora of tweets for location-detection tasks are either lacking or, at best, annotated with coarse-grained location types (e.g. cities, towns, countries, some buildings). To bridge this gap, we present our semi-automatically annotated corpus, the Fine-Grained LOCation Tweet Corpus (FGLOCTweet Corpus), an English tweet-based corpus for fine-grained location-detection tasks, including fine-grained locative references (i.e. geopolitical entities, natural landforms, points of interest and traffic ways) together with their surrounding locative markers (i.e. direction, distance, movement or time). It includes annotated tweet data for training and evaluation purposes, which can be used to advance research in location detection, as well as in the study of the linguistic representation of place or of the microtext genre of social media.
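The abstract does not include a sample of the annotation scheme, so the following minimal Python sketch illustrates, purely as an assumption, how fine-grained locative references and their surrounding locative markers could be represented as span annotations over a tweet. The class names, character offsets, and example tweet are hypothetical, not taken from the corpus.

```python
# A minimal sketch (not the authors' schema) of fine-grained locative
# references plus locative markers as character-span annotations.
from dataclasses import dataclass, field
from typing import List

# Categories follow the abstract; the field layout is an illustrative assumption.
LOCATION_TYPES = {"geopolitical_entity", "natural_landform", "point_of_interest", "traffic_way"}
MARKER_TYPES = {"direction", "distance", "movement", "time"}

@dataclass
class Span:
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends (exclusive)
    label: str   # one of LOCATION_TYPES or MARKER_TYPES

@dataclass
class AnnotatedTweet:
    text: str
    locations: List[Span] = field(default_factory=list)
    markers: List[Span] = field(default_factory=list)

# Hypothetical emergency tweet with a traffic-way reference and two markers.
tweet = AnnotatedTweet(
    text="Flooding reported 2 miles north of Highway 101, avoid the area",
    locations=[Span(35, 46, "traffic_way")],
    markers=[Span(18, 25, "distance"), Span(26, 31, "direction")],
)
loc = tweet.locations[0]
print(loc.label, "->", tweet.text[loc.start:loc.end])   # traffic_way -> Highway 101
```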

2020 ◽  
Vol 6 (1) ◽  
pp. 205630511989732
Author(s):  
Alireza Karduni ◽  
Eric Sauda

Black Lives Matter, like many modern movements in the age of information, makes significant use of social media as well as public space to demand justice. In this article, we study the protests in response to the shooting of Keith Lamont Scott by police in Charlotte, North Carolina, in September 2016. Our goal is to measure the significance of urban space within the virtual and physical network of protesters. Using a mixed-methods approach, we identify and study the urban spaces and social media activity generated by these protests. We conducted interviews with protesters who were among the first to join the Keith Lamont Scott shooting demonstrations and, from the interviews, identified places that were significant in our interviewees' narratives. Using a combination of natural language processing and social network analysis, we analyzed social media data related to the Charlotte protests retrieved from Twitter. We found that social media, local community, and public space work together to organize and motivate protests, and that public events such as protests cause a discernible increase in social media activity. Finally, we found two distinct communities that engage with social media in different ways: one group involved with social media, the local community, and urban space, and a second group connected almost exclusively through social media.
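As a rough illustration of the mixed-methods idea of pairing tweet text with a network of protester interactions, the sketch below builds a mention graph and runs modularity-based community detection with networkx. The tweets, user names, and field names are invented for the example and are not the study's data or pipeline.

```python
# A hedged sketch: combine tweet records with social network analysis to
# surface distinct communities, assuming a simple list of tweet dicts.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

tweets = [
    {"user": "alice", "mentions": ["bob"], "text": "At Marshall Park #KeithLamontScott"},
    {"user": "bob", "mentions": ["alice", "carol"], "text": "March heading uptown"},
    {"user": "dana", "mentions": ["erin"], "text": "Watching the livestream #CharlotteProtest"},
]

# Build a mention graph: an edge links a tweet's author to each mentioned user.
G = nx.Graph()
for t in tweets:
    for m in t["mentions"]:
        G.add_edge(t["user"], m)

# Modularity-based community detection separates densely connected groups,
# loosely analogous to distinguishing on-the-ground vs. online-only clusters.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")
```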


Electronics ◽  
2020 ◽  
Vol 9 (6) ◽  
pp. 1001 ◽  
Author(s):  
Jingang Liu ◽  
Chunhe Xia ◽  
Haihua Yan ◽  
Wenjing Xu

Named entity recognition (NER) is a basic but crucial task in natural language processing (NLP) and big data analysis. Recognizing named entities in Chinese is more complicated than in English, which makes Chinese NER especially challenging. In particular, fine-grained NER is harder than traditional NER, mainly because fine-grained tasks place higher demands on a deep neural model's ability to extract features automatically and represent information. In this paper, we propose an innovative neural network model, En2BiLSTM-CRF, to improve fine-grained Chinese entity recognition. The proposed model, consisting of an initial encoding layer, an enhanced encoding layer, and a decoding layer, combines the advantages of pre-trained model encoding, dual bidirectional long short-term memory (BiLSTM) networks, and a residual connection mechanism. It can therefore encode information multiple times and extract contextual features hierarchically. We conducted extensive experiments on two representative datasets using multiple metrics and compared our model with other strong baselines. The results show that En2BiLSTM-CRF achieves better performance and better generalization in both fine-grained and coarse-grained Chinese entity recognition tasks.
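The paper's exact implementation is not reproduced here; the following PyTorch sketch only illustrates the architecture as described in the abstract (a pre-trained encoder, two stacked BiLSTMs with a residual connection, and a CRF decoder). The model name, hyperparameters, and the use of the transformers and pytorch-crf packages are assumptions for the sake of a runnable example.

```python
# A rough sketch of a pretrained-encoder + dual BiLSTM + residual + CRF tagger.
# Not the authors' En2BiLSTM-CRF code; hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

class DualBiLSTMCRF(nn.Module):
    def __init__(self, pretrained="bert-base-chinese", hidden=256, num_tags=13):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained)      # initial encoding layer
        d = self.encoder.config.hidden_size
        self.bilstm1 = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
        self.bilstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(d, 2 * hidden)                      # residual projection
        self.emission = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)                # decoding layer

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        h1, _ = self.bilstm1(x)                                   # first BiLSTM pass
        h2, _ = self.bilstm2(h1)                                  # second (enhanced) pass
        h = h2 + self.proj(x)                                     # residual connection
        emissions = self.emission(h)
        mask = attention_mask.bool()
        if tags is not None:                                      # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)              # inference: best tag sequence
```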


Author(s):  
Mitta Roja

Abstract: Cyberbullying is a major problem encountered on the internet that affects teenagers as well as adults. It has led to outcomes such as suicide and depression, and regulating content on social media platforms has become a growing need. The following study uses data from two different forms of cyberbullying, hate-speech tweets from Twitter and comments based on personal attacks from Wikipedia forums, to build a model for detecting cyberbullying in text data using natural language processing and machine learning. Three methods for feature extraction and four classifiers are studied to outline the best approach. For the Twitter data the model achieves accuracies above 90%, and for the Wikipedia data it achieves accuracies above 80%. Keywords: Cyberbullying, Hate speech, Personal attacks, Machine learning, Feature extraction, Twitter, Wikipedia
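As a generic illustration of the feature-extraction-plus-classifier setup described above (not the study's actual code, features, or data), the scikit-learn sketch below pairs TF-IDF features with a logistic regression classifier on a toy labeled set.

```python
# A minimal sketch of cyberbullying detection as text classification:
# TF-IDF feature extraction + a standard classifier. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "you are a wonderful person",
    "nobody likes you, just leave",
    "great game last night",
    "go hurt yourself, loser",
]
labels = [0, 1, 0, 1]   # 1 = bullying, 0 = benign (illustrative labels)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # one of several possible feature extractors
    LogisticRegression(max_iter=1000),     # one of several possible classifiers
)
model.fit(texts, labels)
print(model.predict(["everyone hates you"]))   # toy prediction on unseen text
```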


Author(s):  
Joel Sachs ◽  
Jocelyn Pender ◽  
Beatriz Lujan-Toro ◽  
James Macklin ◽  
Peter Haase ◽  
...  

We are using Wikidata and Metaphactory to build an Integrated Flora of Canada (IFC). IFC will be integrated in two senses: first, it will draw on multiple existing floras (e.g. Flora of North America, Flora of Manitoba) for content; second, it will be a portal to related resources such as annotations, specimens, literature, and sequence data.

Background
We had success using Semantic MediaWiki (SMW) as the platform for an on-line representation of the Flora of North America (FNA). We used Charaparser (Cui 2012) to extract plant structures (e.g. "stem"), characters (e.g. "external texture"), and character values (e.g. "glabrous") from the semi-structured FNA treatments. We then loaded these data into SMW, which allows us to query for taxa based on their character traits and enables a broad range of exploratory analysis, both for hypothesis generation and to provide support for or against specific scientific hypotheses.

Migrating to Wikidata/Wikibase
We decided to explore a migration from SMW to Wikibase for three main reasons: simplified workflow, triple-level provenance, and sustainability.
Simplified workflow: our workflow for the FNA-based portal includes natural language processing (NLP) of coarse-grained XML to obtain fine-grained XML, transformation of this XML for input into SMW, and a custom SMW skin for displaying the data. We consider the coarse-grained XML to be canonical. When it changes (because we find an error or improve our NLP), we have to re-run the transformation and re-load the data, which is time-consuming. Ideally, our presentation would be based on API calls to the data itself, eliminating the need to transform and re-load after every change.
Provenance: Wikidata's provenance model supports multiple, conflicting assertions for the same character trait, which inevitably happens when floristic data are integrated.
Sustainability: Wikidata has strong support from the Wikimedia Foundation, while SMW is increasingly seen as a legacy system.

Wikibase vs. Wikidata
Wikidata itself, however, is not a suitable home for the Integrated Flora of Canada. It is built upon a relatively small number of community-curated properties, while we have ~4500 properties for the Asteraceae family alone. The model we want to pursue is to use Wikidata for a small group of core properties (e.g. accepted name, parent taxon) and to use our own instance of Wikibase for the much larger number of specialized morphological properties (e.g. adaxial leaf colour, leaf external texture). Essentially, we will be running our own Wikidata, over which we would exercise full control. Miller (2018) describes deploying this curation model in another domain.

Metaphactory
Metaphactory is a suite of middleware and front-end interfaces for authoring, managing, and querying knowledge graphs, including mechanisms for faceted search and geospatial visualizations. It is also the software (together with Blazegraph) behind the Wikidata Query Service. Metaphactory provides us with a SPARQL endpoint; a templating mechanism that allows each taxonomic treatment to be rendered via a collection of SPARQL queries; reasoning capabilities (via an underlying graph database) that permit the organization of over 42,000 morphological properties; and a variety of search and discovery tools.
There are a number of ways in which Wikidata and Metaphactory can work together, and we are still exploring questions such as: Will provenance be managed via named graphs or via the Wikidata snak model? How will data flow between the two platforms? We will report on our findings to date, and we invite collaboration with related Wikimedia-based projects.
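As one concrete illustration of the query-driven access described above, the sketch below sends a SPARQL query to the public Wikidata Query Service for the core taxonomic properties (P225, taxon name; P171, parent taxon). The fine-grained morphological properties would live in the project's own Wikibase instance, whose property IDs are not shown here; the endpoint and client usage are standard, but the query itself is only illustrative.

```python
# A hedged sketch of querying the core taxonomic layer over SPARQL.
# Assumes the SPARQLWrapper package; the query is an illustration, not IFC code.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"   # Wikidata Query Service (Blazegraph)

query = """
SELECT ?child ?childLabel WHERE {
  ?family wdt:P225 "Asteraceae" .          # look up the family by its taxon name
  ?child  wdt:P171 ?family .               # direct children via 'parent taxon'
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT, agent="IFC-demo/0.1")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["childLabel"]["value"])
```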


2021 ◽  
Author(s):  
Benjamin Joseph Ricard ◽  
Saeed Hassanpour

BACKGROUND Many social media studies have explored the ability of thematic structures, such as hashtags and subreddits, to identify information related to a wide variety of mental health disorders. However, studies and models trained on specific themed communities are often difficult to apply to different social media platforms and related outcomes. A deep learning framework using thematic structures from Reddit and Twitter can have distinct advantages for studying alcohol abuse, particularly among youth, in the United States. OBJECTIVE This study proposes a new deep learning pipeline that uses thematic structures to identify alcohol-related content across different platforms. We applied our method on Twitter to determine the association between the prevalence of alcohol-related tweets and alcohol-related outcomes reported by the National Institute on Alcohol Abuse and Alcoholism (NIAAA), the Centers for Disease Control Behavioral Risk Factor Surveillance System (CDC BRFSS), County Health Rankings, and the North American Industry Classification System (NAICS). METHODS A Bidirectional Encoder Representations from Transformers (BERT) neural network learned to classify 1,302,524 Reddit posts as originating from either alcohol-related or control subreddits. The trained model identified 24 alcohol-related hashtags in an unlabeled dataset of 843,769 random tweets. Querying these hashtags retrieved 25,558,846 alcohol-related tweets, including 790,544 location-specific (geotagged) tweets. We calculated the correlation between the prevalence of alcohol-related tweets and alcohol-related outcomes, controlling for confounding effects of age, sex, income, education, and self-reported race, as recorded by the 2013-2018 American Community Survey (ACS). RESULTS We present a novel natural language processing pipeline, developed using Reddit alcohol-related subreddits, that identifies highly specific alcohol-related Twitter hashtags. The prevalence of the identified hashtags contains interpretable information about alcohol consumption at both coarse (e.g., U.S. state) and fine-grained (e.g., MMSA, county) geographical designations. CONCLUSIONS This approach can expand research and interventions on alcohol abuse and other behavioral health outcomes.
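The sketch below compresses the two core steps into a toy example, with assumed model names and invented region-level numbers rather than the study's data: score short texts with a BERT-style sequence classifier (which, in the study, would first be fine-tuned on subreddit-labeled posts), then correlate per-region prevalence of flagged posts with an outcome measure.

```python
# A hedged sketch of (1) scoring texts with a BERT-style classifier and
# (2) correlating regional prevalence with an outcome. Toy data only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.stats import pearsonr

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
clf.eval()  # in the study, this head would be fine-tuned on alcohol vs. control subreddit posts

def alcohol_probability(text: str) -> float:
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()   # P(alcohol-related)

print(alcohol_probability("cracking open a cold one tonight"))

# Toy region-level aggregation: prevalence of flagged posts vs. an outcome rate.
prevalence = [0.12, 0.07, 0.21, 0.15]   # fraction of geotagged posts flagged per region (invented)
outcome    = [9.1, 6.4, 13.8, 10.2]     # alcohol-related outcome rate per region (invented)
r, p = pearsonr(prevalence, outcome)
print(f"correlation r={r:.2f}, p={p:.3f}")
```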


2012 ◽  
Vol 27 (6) ◽  
pp. 52-59 ◽  
Author(s):  
Jie Yin ◽  
Andrew Lampert ◽  
Mark Cameron ◽  
Bella Robinson ◽  
Robert Power

Author(s):  
Wang Zheng-fang ◽  
Z.F. Wang

The main purpose of this study is to evaluate the chloride SCC resistance of a duplex stainless steel, 00Cr18Ni5Mo3Si2 (18-5Mo), and its welded coarse-grained zone (CGZ). 18-5Mo is a dual-phase (A+F) stainless steel with a yield strength of 512 N/mm². The secondary phase (A phase) accounts for 30-35% of the total, with fine-grained and homogeneously distributed A and F phases (Fig. 1). After the material is welded with a specific welding thermal cycle, i.e. Tmax = 1350°C and t8/5 = 20 s, the microstructure may change from a fine-grained to a coarse-grained morphology and from a homogeneous distribution of the A phase to a concentration of the A phase (Fig. 2). Meanwhile, the proportion of the A phase is reduced from 35% to 5-10%. For this reason the region is known as the welded coarse-grained zone (CGZ). Because the microstructure of the base metal differs from that of the welded CGZ, their chloride SCC resistance also differs. Test procedure: constant load tensile tests (CLTT) were performed to record the E_SCE-t curve, by which corrosion crack growth can be described; the time to fracture, tf, was also recorded and is taken as a combined electrochemical and mechanical measure for evaluating SCC resistance. Test environment: a boiling 42% MgCl2 solution at 143°C was used. In addition, microanalyses were conducted with light microscopy (LM), SEM, TEM, and Auger electron spectroscopy (AES) to reveal the correlation between the CLTT results and the microstructural observations.

