news corpus
Recently Published Documents

TOTAL DOCUMENTS: 82 (FIVE YEARS: 33)
H-INDEX: 10 (FIVE YEARS: 1)
2021 · Vol 15 (3) · pp. 205-215
Author(s): Gurjot Singh Mahi, Amandeep Verma

  Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely utilized to build corpora in different domains and languages. This study developed a focused set of web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented using the Python programming language and were utilized to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
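The crawlers themselves are published separately, so the sketch below is only a minimal illustration of a focused news crawler of the kind described: fetch a section index, follow article links, and store the title and body text. The site URL, CSS selectors, and tag names are placeholder assumptions, not the authors' actual code or the layout of the three Punjabi news sites.

```python
# Minimal sketch of a focused news crawler; the URL and selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

def crawl_article(url):
    """Fetch one article page and return its title and body text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("h1").get_text(strip=True)           # assumed: headline in <h1>
    paragraphs = soup.select("div.article-body p")          # assumed body container class
    body = "\n".join(p.get_text(strip=True) for p in paragraphs)
    return {"url": url, "title": title, "text": body}

def crawl_section(index_url, limit=50):
    """Collect article links from a section index page, then fetch each article."""
    soup = BeautifulSoup(requests.get(index_url, timeout=10).text, "html.parser")
    links = [a["href"] for a in soup.select("a.headline-link")][:limit]  # assumed link class
    return [crawl_article(link) for link in links]

if __name__ == "__main__":
    corpus = crawl_section("https://example-news-site.test/national")  # placeholder URL
    print(len(corpus), "articles collected")
```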


2021 · Vol 4 (0) · pp. 108
Author(s): Luke Curtis Collins, Lucy Jones

2021 · Vol 12 (6) · pp. 1070-1081
Author(s): Peiwen Xue, Guobing Liu

Based on the corpora CECIC (Chinese English conference interpretation corpus) and OENC (original English news corpus), this paper studies explicitation in Chinese-English press conference interpreting in order to further explore explicitation in interpretation. It compares the numbers of connectives in the two corpora so as to propose a new typology of explicitation. It also discusses the motivations for explicitation in terms of the characteristics of interpretation itself, the habits of individual interpreters, and the differing linguistic norms of Chinese and English.
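A comparison of connective counts like the one described could, in outline, look like the sketch below. The connective list and the two tiny corpora are illustrative placeholders, not the CECIC or OENC data or the paper's actual inventory of connectives.

```python
# Sketch of a connective-frequency comparison between two corpora (illustrative only).
from collections import Counter
import re

CONNECTIVES = {"and", "but", "so", "because", "therefore", "however", "moreover"}

def connective_counts(texts):
    """Count connective tokens and total tokens across a list of documents."""
    counts, total = Counter(), 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        counts.update(t for t in tokens if t in CONNECTIVES)
    return counts, total

def per_10k(counts, total):
    """Normalise raw counts to frequency per 10,000 tokens so corpora of different sizes compare."""
    return {c: 10000 * n / total for c, n in counts.items()}

interpreted = ["So the minister said that the plan will continue.", "However, we believe it is necessary."]
original = ["The minister said that the plan will continue.", "We believe it is necessary."]
print(per_10k(*connective_counts(interpreted)))
print(per_10k(*connective_counts(original)))
```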


2021 · Vol 8 (5) · pp. 995
Author(s): Dewi Yanti Liliana, Nadia Nurul Hikmah, Maykada Harjono

<p class="Abstrak">Kementerian Komunikasi dan Informatika (Kemkominfo) memiliki tugas salah satunya untuk mengawasi konten berita yang beredar di media digital. Dengan terus bertambahnya berita online di internet, Kemkominfo dihadapkan pada permasalahan pengklasifikasian sentimen berita yang masih dilakukan secara manual dengan membaca konten berita satu persatu lalu menangkap sentimen dari berita, yaitu sentimen positif, negatif, atau netral. Hal ini sangat melelahkan dan memakan waktu mengingat volume dan kecepatan pertumbuhan berita setiap harinya semakin masif. Untuk itu diperlukan pengembangan sistem pengklasifikasi sentimen berita daring secara otomatis untuk pemantauan berita berbahasa Indonesia. Sistem pengklasifikasi secara otomatis berbasis <em>machine learning</em> dilakukan dengan membangun model pembelajaran dari korpus berita yang berasal dari situs berita daring. Korpus data tersebut kemudian diproses menggunakan algoritma <em>Long Short-Term Memory (</em>LSTM). LSTM biasa digunakan untuk menangani kasus klasifikasi dalam berbagai bidang khususnya dengan input berupa teks sekuensial. Model LSTM diimplementasikan ke dalam aplikasi berbasis web untuk menentukan jenis dari sentimen berita. Berdasarkan hasil pengujian yang dilakukan, model LSTM yang dibuat memiliki tingkat akurasi sebesar 86%. Dengan demikian implementasi LSTM mampu menjadi suatu solusi untuk mengatasi masalah pengklasifikasian sentimen berita daring secara otomatis untuk pemantauan sentimen berita di Kemkominfo.</p><p class="Abstrak"> </p><p class="Abstrak"><em><strong>Asbtract</strong></em></p><p class="Judul2"><em>The Ministry of Communication and Informatics (Kemkominfo) has one duty to monitor news content circulating in digital media. With the increasing number of online news in the internet, Kemkominfo is facing the problem of classifying news </em><em>sentiment </em><em>which is still done manually by reading the contents of the news one by one</em><em>, and then capturing the sentiment of the news; either positive, negative, or neutral</em><em>. This is very exhausting and time consuming considering the volume and speed of growth of news every day is getting massive. This requires the development of an automatic </em><em>online</em><em> news </em><em>sentiment </em><em>classification system for monitoring Indonesian news. Machine learning-based automatic classification systems are carried out by building a learning model from a news corpus originating from news sites. The data is then processed using the Long Short Term Memory (LSTM) algorithm. LSTM is commonly used to handle classification in various fields</em><em> especially in a sequential input</em><em>. The LSTM model is implemented into a web-based application to determine the </em><em>types of news sentiment</em><em>. Based on the results of the tests carried out, the LSTM model created has an accuracy rate of 86%. Thus, the implementation of LSTM is potentially become a solution to overcome the problem of automatic online news</em><em> sentiment</em><em> classification for the news content monitoring system at the Ministry of Communication and Information.</em></p><p class="Abstrak"><em><strong><br /></strong></em></p>


2021 · Vol 3 (4) · pp. 802-818
Author(s): M.V.P.T. Lakshika, H.A. Caldera

E-newspaper readers are overloaded with the massive amount of text in e-news articles, which can easily mislead them as they read and try to absorb the information. Thus, there is an urgent need for technology that can automatically represent the gist of these e-news articles more quickly. Popular machine learning approaches have greatly improved presentation accuracy compared to traditional methods, but they cannot accommodate contextual information to acquire higher-level abstraction. Recent research efforts in knowledge representation using graph approaches are neither user-driven nor flexible to deviations in the data. Thus, there is a striking concentration on constructing knowledge graphs by combining background information related to the subjects in text documents. We propose an enhanced representation of a scalable knowledge graph, built by automatically extracting information from a corpus of e-news articles, and determine whether a knowledge graph can serve as an efficient application for analyzing and generating knowledge representations from the extracted e-news corpus. The knowledge graph consists of a knowledge base built from triples that automatically produce knowledge representations from e-news articles. Overall, we observe that the proposed knowledge graph generates a comprehensive and precise knowledge representation for the corpus of e-news articles.
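The triple-based knowledge base described could be built in many ways; the sketch below shows one rough, generic pipeline: extract naive subject-verb-object triples with spaCy dependency parses and load them into a NetworkX graph. It assumes the en_core_web_sm model is installed and is not the authors' extraction system.

```python
# Rough sketch: sentences -> (subject, relation, object) triples -> graph (illustrative only).
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_triples(text):
    """Extract naive subject-verb-object triples from each sentence."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
    return triples

graph = nx.DiGraph()
for subj, rel, obj in extract_triples("The government announced new funding for rural schools."):
    graph.add_edge(subj, obj, relation=rel)  # edges labelled with the relating verb
print(list(graph.edges(data=True)))
```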


2021 · Vol 23 (3) · pp. 27-42
Author(s): Surjeet Dalal, Osamah Ibrahim Khalaf

Healthcare professionals experience high levels of occupational stress because of their working conditions. Accordingly, the aim of this study is to build a model focused on healthcare professionals in order to analyze the effect that job demands, control, social support, and recognition have on the probability that a worker will experience stress. The authors have previously presented a technique for pitch accent detection using a convolutional neural network (CNN) that achieves good performance from low-level acoustic descriptors alone, with no explicit duration information. This paper applies the model to several pitch accent and lexical stress detection tasks at the word and syllable level on the DIRNDL German radio news corpus. The research demonstrates that information about word or syllable duration is encoded in the high-level CNN feature representation, by training a linear regression model on these features to predict duration.
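The probing step at the end of the abstract (fit a linear regression on CNN features to predict duration) could look roughly like the sketch below. The feature matrix and duration values here are random placeholders standing in for the CNN representations and DIRNDL annotations, which are not available in this listing.

```python
# Sketch of a linear-regression probe on (placeholder) CNN features to predict duration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
cnn_features = rng.normal(size=(500, 256))     # placeholder high-level CNN feature vectors
durations = rng.uniform(0.05, 0.6, size=500)   # placeholder word durations in seconds

X_train, X_test, y_train, y_test = train_test_split(cnn_features, durations, test_size=0.2)
probe = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out items:", probe.score(X_test, y_test))  # high R^2 would indicate duration is encoded
```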


2021 · Vol 4
Author(s): Prashanth Rao, Maite Taboada

We present a topic modelling and data visualization methodology to examine gender-based disparities in news articles by topic. Existing research in topic modelling is largely focused on the text mining of closed corpora, i.e., those that include a fixed collection of composite texts. We showcase a methodology to discover topics via Latent Dirichlet Allocation, which can reliably produce human-interpretable topics over an open news corpus that continually grows with time. Our system generates topics, or distributions of keywords, for news articles on a monthly basis, to consistently detect key events and trends aligned with events in the real world. Findings from two years' worth of news articles in mainstream English-language Canadian media indicate that certain topics feature either women or men more prominently and exhibit different types of language. Perhaps unsurprisingly, topics such as lifestyle, entertainment, and healthcare tend to be prominent in articles that quote more women than men. Topics such as sports, politics, and business are characteristic of articles that quote more men than women. The data show a self-reinforcing gendered division of duties and representation in society. Quoting female sources more frequently in a caregiving role and quoting male sources more frequently in political and business roles enshrines women's status as caregivers and men's status as leaders and breadwinners. Our results can help journalists and policy makers better understand the unequal gender representation of those quoted in the news and facilitate news organizations' efforts to achieve gender parity in their sources. The proposed methodology is robust, reproducible, and scalable to very large corpora, and can be used for similar studies involving unsupervised topic modelling and language analyses.
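Fitting LDA topics on a monthly batch of articles, as described, might look like the gensim sketch below. The three toy documents, the topic count, and the pass count are illustrative assumptions, not the authors' corpus or configuration.

```python
# Minimal gensim LDA sketch for one month's batch of pre-tokenised articles (illustrative).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

monthly_articles = [
    ["hospital", "nurses", "care", "patients"],
    ["election", "minister", "party", "vote"],
    ["hockey", "season", "team", "coach"],
]  # assumed: already tokenised with stop words removed

dictionary = Dictionary(monthly_articles)
bow_corpus = [dictionary.doc2bow(doc) for doc in monthly_articles]
lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=3, passes=10, random_state=42)

for topic_id in range(lda.num_topics):
    # Each topic is a distribution over keywords, as described in the abstract.
    print(topic_id, lda.print_topic(topic_id, topn=4))
```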


2021 · pp. 1-28
Author(s): Ali Hürriyetoğlu, Erdem Yörük, Osman Mutlu, Fırat Duruşan, Çağrı Yoltar, ...

We describe a gold standard corpus of protest events drawn from various local and international English-language sources from multiple countries. The corpus contains document-, sentence-, and token-level annotations. It facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, and constructing knowledge bases that enable comparative social and political science studies. For each news source, annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish a common foundation for automated protest event collection studies, which is currently lacking in the literature.
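The active learning step mentioned above, drawing the next annotation batch where the current model is least certain, is commonly implemented as uncertainty sampling; a toy sketch of that loop follows. The classifier, the example sentences, and the batch size are placeholders, not the project's pipeline or data.

```python
# Illustrative uncertainty-sampling step for choosing the next annotation batch.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled = ["protesters marched downtown", "the company reported quarterly earnings"]
labels = [1, 0]  # assumed labels: 1 = protest-related, 0 = not
unlabelled = ["workers went on strike", "stock prices fell sharply", "a rally blocked the road"]

vectorizer = TfidfVectorizer()
X_lab = vectorizer.fit_transform(labelled)
X_unlab = vectorizer.transform(unlabelled)

clf = LogisticRegression().fit(X_lab, labels)
proba = clf.predict_proba(X_unlab)
uncertainty = 1 - proba.max(axis=1)          # least-confident predictions score highest
next_batch = np.argsort(-uncertainty)[:2]    # indices of the articles to annotate next
print([unlabelled[i] for i in next_batch])
```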


2021 · Vol 3 (1) · pp. 1-27
Author(s): Chung-hong Chan, Joseph Bajjalieh, Loretta Auvil, Hartmut Wessler, Scott Althaus, ...

We examined the validity of 37 sentiment scores based on dictionary-based methods using a large news corpus and demonstrated the risk of generating a spectrum of results with different levels of statistical significance by presenting an analysis of relationships between news sentiment and U.S. presidential approval. We summarize our findings into four best practices: 1) use a suitable sentiment dictionary; 2) do not assume that the validity and reliability of the dictionary are 'built-in'; 3) check for the influence of content length; and 4) do not use multiple dictionaries to test the same statistical hypothesis.
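The content-length caveat (best practice 3) is easy to see in a toy dictionary scorer: a raw net count of positive minus negative words grows with document length, whereas a length-normalised score does not. The word lists below are placeholders, not any of the 37 dictionaries examined in the paper.

```python
# Toy dictionary-based sentiment scorer illustrating raw vs. length-normalised scores.
import re

POSITIVE = {"growth", "success", "approval", "win"}
NEGATIVE = {"crisis", "scandal", "decline", "loss"}

def sentiment_score(text):
    """Return (raw net score, score normalised by document length in tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return raw, raw / max(len(tokens), 1)

print(sentiment_score("Approval ratings show growth despite the budget crisis."))
```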

