Joint Chinese word segmentation and punctuation prediction using deep recurrent neural network for social media data

Author(s):  
Kui Wu ◽  
Xuancong Wang ◽  
Nina Zhou ◽  
AiTi Aw ◽  
Haizhou Li
2019 ◽  
Vol 173 ◽  
pp. 117-127 ◽  
Author(s):  
Cristina Zuheros ◽  
Siham Tabik ◽  
Ana Valdivia ◽  
Eugenio Martínez-Cámara ◽  
Francisco Herrera

10.2196/13076 ◽  
2019 ◽  
Vol 6 (12) ◽  
pp. e13076
Author(s):  
Sara Melvin ◽  
Amanda Jamal ◽  
Kaitlyn Hill ◽  
Wei Wang ◽  
Sean D Young

Background: Social media data can be explored as a tool to detect sleep deprivation. First-year undergraduate students in their first quarter were invited to wear sleep-tracking devices (Basis; Intel), allow us to follow them on Twitter, and complete weekly surveys regarding their sleep.
Objective: This study aimed to determine whether social media data can be used to monitor sleep deprivation.
Methods: The sleep data obtained from the device were used to create a tiredness model that aided in labeling the tweets as sleep deprived or not at the time of posting. The labeled data were then used to train and test a gated recurrent unit (GRU) neural network that predicts whether study participants were sleep deprived at the time of posting.
Results: Results from the GRU neural network suggest that it is possible to classify the sleep-deprivation status of a tweet’s author, with an average area under the curve of 0.68.
Conclusions: It is feasible to use social media to identify students’ sleep deprivation. The results add to the body of research suggesting that social media data should be further explored as a potential source for monitoring health.


2016 ◽  
Vol 2016 ◽  
pp. 1-13 ◽  
Author(s):  
Helena Gómez-Adorno ◽  
Ilia Markov ◽  
Grigori Sidorov ◽  
Juan-Pablo Posadas-Durán ◽  
Miguel A. Sanchez-Perez ◽  
...  

We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each of the dictionaries was built for the English, Spanish, Dutch, and Italian languages. The resource is freely available.
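
The kind of dictionary-based preprocessing described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the dictionary entries, the `normalize` function, and the whitespace tokenization are all assumptions made for the example, since the abstract does not describe the resource's format or how it is applied.

```python
# Hypothetical miniature dictionaries. The real resource's entries and file
# format are not given in the abstract; these are illustrative only.
SLANG = {"gr8": "great", "u": "you"}
CONTRACTIONS = {"can't": "cannot", "i'm": "i am"}
ABBREVIATIONS = {"idk": "i do not know", "btw": "by the way"}
EMOTICONS = {":)": "happy", ":(": "sad"}

# Merge the four dictionaries into a single lookup table.
LOOKUP = {**SLANG, **CONTRACTIONS, **ABBREVIATIONS, **EMOTICONS}


def normalize(text: str) -> str:
    """Replace each whitespace-delimited token that has a dictionary entry.

    Matching is case-insensitive; tokens without an entry are kept unchanged.
    """
    normalized = []
    for token in text.split():
        normalized.append(LOOKUP.get(token.lower(), token))
    return " ".join(normalized)


print(normalize("idk if u can't come :)"))
```

A normalized text of this kind would then be passed to whatever feature extraction follows; splitting on whitespace is a simplification and would miss tokens attached to punctuation.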


2021 ◽  
Author(s):  
Fei Shen ◽  
Wenting Yu ◽  
Chen Min ◽  
Qianying Ye ◽  
Chuanli Xia ◽  
...  

Text mining has been a dominant approach to extracting useful information from massive unstructured data online. However, existing tools for Chinese word segmentation are not ideal for processing social media text data in Cantonese. This project developed CyberCan (https://github.com/shenfei1010/CyberCan), a lexicon of contemporary Cantonese based on more than 100 million pieces of internet texts. We compared the word segmentation performance of CyberCan with that of existing Mandarin and Cantonese lexicons. Findings suggest that CyberCan outperforms all existing lexicons by a considerable margin.
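
To illustrate why the lexicon matters for segmentation, here is a minimal sketch of forward maximum matching, a common lexicon-based baseline. The abstract does not say which segmentation algorithm was used in the comparison, so this technique is an assumption chosen for illustration. The `segment` function, the `max_len` parameter, and the toy lexicon are hypothetical and are not taken from CyberCan.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word.

    Falls back to a single character when no lexicon entry matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy Cantonese lexicon, invented for this example.
toy_lexicon = {"我哋", "鍾意", "飲茶"}

print(segment("我哋鍾意飲茶", toy_lexicon))  # words found in the lexicon stay whole
print(segment("我哋鍾意飲茶", set()))        # empty lexicon: falls back to characters
```

With an empty or mismatched lexicon the same sentence degrades to single characters, which is the kind of failure a lexicon covering contemporary Cantonese vocabulary is meant to reduce.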

