news corpus
Recently Published Documents

TOTAL DOCUMENTS: 82 (FIVE YEARS: 33)
H-INDEX: 10 (FIVE YEARS: 1)
2021 · Vol 15 (3) · pp. 205-215
Author(s): Gurjot Singh Mahi, Amandeep Verma

  Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely utilized to build corpora in different domains and languages. This study developed a focused set of web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented using the Python programming language and were utilized to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
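The crawlers themselves are published separately, so the sketch below is only a minimal illustration of a focused news crawler of the kind described: fetch a section index, follow article links, and store the title and body text. The site URL, CSS selectors, and tag names are placeholder assumptions, not the authors' actual code or the layout of the three Punjabi news sites.

```python
# Minimal sketch of a focused news crawler; the URL and selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

def crawl_article(url):
    """Fetch one article page and return its title and body text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("h1").get_text(strip=True)           # assumed: headline in <h1>
    paragraphs = soup.select("div.article-body p")          # assumed body container class
    body = "\n".join(p.get_text(strip=True) for p in paragraphs)
    return {"url": url, "title": title, "text": body}

def crawl_section(index_url, limit=50):
    """Collect article links from a section index page, then fetch each article."""
    soup = BeautifulSoup(requests.get(index_url, timeout=10).text, "html.parser")
    links = [a["href"] for a in soup.select("a.headline-link")][:limit]  # assumed link class
    return [crawl_article(link) for link in links]

if __name__ == "__main__":
    corpus = crawl_section("https://example-news-site.test/national")  # placeholder URL
    print(len(corpus), "articles collected")
```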


2021 · Vol 4 (0) · pp. 108
Author(s): Luke Curtis Collins, Lucy Jones

2021 · Vol 12 (6) · pp. 1070-1081
Author(s): Peiwen Xue, Guobing Liu

Based on the corpora CECIC (Chinese English conference interpretation corpus) and OENC (original English news corpus), this paper studies explicitation in Chinese-English press conference interpreting in order to further explore explicitation in interpretation. It compares the numbers of connectives in the two corpora so as to propose a new typology of explicitation. It also discusses the motivations for explicitation in terms of the characteristics of interpretation itself, the habits of individual interpreters, and the differing linguistic norms of Chinese and English.
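A comparison of connective counts like the one described could, in outline, look like the sketch below. The connective list and the two tiny corpora are illustrative placeholders, not the CECIC or OENC data or the paper's actual inventory of connectives.

```python
# Sketch of a connective-frequency comparison between two corpora (illustrative only).
from collections import Counter
import re

CONNECTIVES = {"and", "but", "so", "because", "therefore", "however", "moreover"}

def connective_counts(texts):
    """Count connective tokens and total tokens across a list of documents."""
    counts, total = Counter(), 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        counts.update(t for t in tokens if t in CONNECTIVES)
    return counts, total

def per_10k(counts, total):
    """Normalise raw counts to frequency per 10,000 tokens so corpora of different sizes compare."""
    return {c: 10000 * n / total for c, n in counts.items()}

interpreted = ["So the minister said that the plan will continue.", "However, we believe it is necessary."]
original = ["The minister said that the plan will continue.", "We believe it is necessary."]
print(per_10k(*connective_counts(interpreted)))
print(per_10k(*connective_counts(original)))
```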


2021 · Vol 8 (5) · pp. 995
Author(s): Dewi Yanti Liliana, Nadia Nurul Hikmah, Maykada Harjono

<p class="Abstrak">Kementerian Komunikasi dan Informatika (Kemkominfo) memiliki tugas salah satunya untuk mengawasi konten berita yang beredar di media digital. Dengan terus bertambahnya berita online di internet, Kemkominfo dihadapkan pada permasalahan pengklasifikasian sentimen berita yang masih dilakukan secara manual dengan membaca konten berita satu persatu lalu menangkap sentimen dari berita, yaitu sentimen positif, negatif, atau netral. Hal ini sangat melelahkan dan memakan waktu mengingat volume dan kecepatan pertumbuhan berita setiap harinya semakin masif. Untuk itu diperlukan pengembangan sistem pengklasifikasi sentimen berita daring secara otomatis untuk pemantauan berita berbahasa Indonesia. Sistem pengklasifikasi secara otomatis berbasis <em>machine learning</em> dilakukan dengan membangun model pembelajaran dari korpus berita yang berasal dari situs berita daring. Korpus data tersebut kemudian diproses menggunakan algoritma <em>Long Short-Term Memory (</em>LSTM). LSTM biasa digunakan untuk menangani kasus klasifikasi dalam berbagai bidang khususnya dengan input berupa teks sekuensial. Model LSTM diimplementasikan ke dalam aplikasi berbasis web untuk menentukan jenis dari sentimen berita. Berdasarkan hasil pengujian yang dilakukan, model LSTM yang dibuat memiliki tingkat akurasi sebesar 86%. Dengan demikian implementasi LSTM mampu menjadi suatu solusi untuk mengatasi masalah pengklasifikasian sentimen berita daring secara otomatis untuk pemantauan sentimen berita di Kemkominfo.</p><p class="Abstrak"> </p><p class="Abstrak"><em><strong>Asbtract</strong></em></p><p class="Judul2"><em>The Ministry of Communication and Informatics (Kemkominfo) has one duty to monitor news content circulating in digital media. With the increasing number of online news in the internet, Kemkominfo is facing the problem of classifying news </em><em>sentiment </em><em>which is still done manually by reading the contents of the news one by one</em><em>, and then capturing the sentiment of the news; either positive, negative, or neutral</em><em>. This is very exhausting and time consuming considering the volume and speed of growth of news every day is getting massive. This requires the development of an automatic </em><em>online</em><em> news </em><em>sentiment </em><em>classification system for monitoring Indonesian news. Machine learning-based automatic classification systems are carried out by building a learning model from a news corpus originating from news sites. The data is then processed using the Long Short Term Memory (LSTM) algorithm. LSTM is commonly used to handle classification in various fields</em><em> especially in a sequential input</em><em>. The LSTM model is implemented into a web-based application to determine the </em><em>types of news sentiment</em><em>. Based on the results of the tests carried out, the LSTM model created has an accuracy rate of 86%. Thus, the implementation of LSTM is potentially become a solution to overcome the problem of automatic online news</em><em> sentiment</em><em> classification for the news content monitoring system at the Ministry of Communication and Information.</em></p><p class="Abstrak"><em><strong><br /></strong></em></p>


2021 · Vol 3 (4) · pp. 802-818
Author(s): M.V.P.T. Lakshika, H.A. Caldera

E-newspaper readers are overloaded with the massive amount of text in e-news articles, which can easily mislead them as they read and try to absorb the information. Thus, there is an urgent need for technology that can automatically represent the gist of these e-news articles more quickly. Popular machine learning approaches have greatly improved presentation accuracy compared to traditional methods, but they cannot accommodate contextual information to acquire higher-level abstraction. Recent research efforts in knowledge representation using graph approaches are neither user-driven nor flexible to deviations in the data. Thus, there is a striking concentration on constructing knowledge graphs by combining background information related to the subjects in text documents. We propose an enhanced representation of a scalable knowledge graph, built by automatically extracting information from a corpus of e-news articles, and determine whether a knowledge graph can serve as an efficient application for analyzing and generating knowledge representations from the extracted e-news corpus. The knowledge graph consists of a knowledge base built from triples that automatically produce knowledge representations from e-news articles. Overall, we observe that the proposed knowledge graph generates a comprehensive and precise knowledge representation for the corpus of e-news articles.
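The triple-based knowledge base described could be built in many ways; the sketch below shows one rough, generic pipeline: extract naive subject-verb-object triples with spaCy dependency parses and load them into a NetworkX graph. It assumes the en_core_web_sm model is installed and is not the authors' extraction system.

```python
# Rough sketch: sentences -> (subject, relation, object) triples -> graph (illustrative only).
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_triples(text):
    """Extract naive subject-verb-object triples from each sentence."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
    return triples

graph = nx.DiGraph()
for subj, rel, obj in extract_triples("The government announced new funding for rural schools."):
    graph.add_edge(subj, obj, relation=rel)  # edges labelled with the relating verb
print(list(graph.edges(data=True)))
```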


2021 · Vol 23 (3) · pp. 27-42
Author(s): Surjeet Dalal, Osamah Ibrahim Khalaf

Healthcare professionals experience high levels of occupational stress because of their working conditions. Accordingly, the aim of this study is to build a model focused on healthcare professionals in order to analyze the effect that job demands, control, social support, and recognition have on the probability that a worker will experience stress. The authors have previously presented a technique for pitch accent detection using a convolutional neural network (CNN) that achieves good performance from low-level acoustic descriptors alone, with no explicit duration information. This paper applies the model to several pitch accent and lexical stress detection tasks at the word and syllable level on the DIRNDL German radio news corpus. The research demonstrates that information about word or syllable duration is encoded in the high-level CNN feature representation, by training a linear regression model on these features to predict duration.
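The probing step at the end of the abstract (fit a linear regression on CNN features to predict duration) could look roughly like the sketch below. The feature matrix and duration values here are random placeholders standing in for the CNN representations and DIRNDL annotations, which are not available in this listing.

```python
# Sketch of a linear-regression probe on (placeholder) CNN features to predict duration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
cnn_features = rng.normal(size=(500, 256))     # placeholder high-level CNN feature vectors
durations = rng.uniform(0.05, 0.6, size=500)   # placeholder word durations in seconds

X_train, X_test, y_train, y_test = train_test_split(cnn_features, durations, test_size=0.2)
probe = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out items:", probe.score(X_test, y_test))  # high R^2 would indicate duration is encoded
```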


2021 · Vol 4
Author(s): Prashanth Rao, Maite Taboada

We present a topic modelling and data visualization methodology to examine gender-based disparities in news articles by topic. Existing research in topic modelling is largely focused on the text mining of closed corpora, i.e., those that include a fixed collection of composite texts. We showcase a methodology to discover topics via Latent Dirichlet Allocation, which can reliably produce human-interpretable topics over an open news corpus that continually grows with time. Our system generates topics, or distributions of keywords, for news articles on a monthly basis, to consistently detect key events and trends aligned with events in the real world. Findings from two years' worth of news articles in mainstream English-language Canadian media indicate that certain topics feature either women or men more prominently and exhibit different types of language. Perhaps unsurprisingly, topics such as lifestyle, entertainment, and healthcare tend to be prominent in articles that quote more women than men. Topics such as sports, politics, and business are characteristic of articles that quote more men than women. The data show a self-reinforcing gendered division of duties and representation in society. Quoting female sources more frequently in a caregiving role and quoting male sources more frequently in political and business roles enshrines women's status as caregivers and men's status as leaders and breadwinners. Our results can help journalists and policy makers better understand the unequal gender representation of those quoted in the news and facilitate news organizations' efforts to achieve gender parity in their sources. The proposed methodology is robust, reproducible, and scalable to very large corpora, and can be used for similar studies involving unsupervised topic modelling and language analyses.
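Fitting LDA topics on a monthly batch of articles, as described, might look like the gensim sketch below. The three toy documents, the topic count, and the pass count are illustrative assumptions, not the authors' corpus or configuration.

```python
# Minimal gensim LDA sketch for one month's batch of pre-tokenised articles (illustrative).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

monthly_articles = [
    ["hospital", "nurses", "care", "patients"],
    ["election", "minister", "party", "vote"],
    ["hockey", "season", "team", "coach"],
]  # assumed: already tokenised with stop words removed

dictionary = Dictionary(monthly_articles)
bow_corpus = [dictionary.doc2bow(doc) for doc in monthly_articles]
lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=3, passes=10, random_state=42)

for topic_id in range(lda.num_topics):
    # Each topic is a distribution over keywords, as described in the abstract.
    print(topic_id, lda.print_topic(topic_id, topn=4))
```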


2021 · pp. 1-28
Author(s): Ali Hürriyetoğlu, Erdem Yörük, Osman Mutlu, Fırat Duruşan, Çağrı Yoltar, ...

We describe a gold standard corpus of protest events drawn from various local and international English-language sources from multiple countries. The corpus contains document-, sentence-, and token-level annotations. It facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, and constructing knowledge bases that enable comparative social and political science studies. For each news source, annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish a common foundation for automated protest event collection studies, which is currently lacking in the literature.
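The active learning step mentioned above, drawing the next annotation batch where the current model is least certain, is commonly implemented as uncertainty sampling; a toy sketch of that loop follows. The classifier, the example sentences, and the batch size are placeholders, not the project's pipeline or data.

```python
# Illustrative uncertainty-sampling step for choosing the next annotation batch.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled = ["protesters marched downtown", "the company reported quarterly earnings"]
labels = [1, 0]  # assumed labels: 1 = protest-related, 0 = not
unlabelled = ["workers went on strike", "stock prices fell sharply", "a rally blocked the road"]

vectorizer = TfidfVectorizer()
X_lab = vectorizer.fit_transform(labelled)
X_unlab = vectorizer.transform(unlabelled)

clf = LogisticRegression().fit(X_lab, labels)
proba = clf.predict_proba(X_unlab)
uncertainty = 1 - proba.max(axis=1)          # least-confident predictions score highest
next_batch = np.argsort(-uncertainty)[:2]    # indices of the articles to annotate next
print([unlabelled[i] for i in next_batch])
```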


2021 · Vol 3 (1) · pp. 1-27
Author(s): Chung-hong Chan, Joseph Bajjalieh, Loretta Auvil, Hartmut Wessler, Scott Althaus, ...

We examined the validity of 37 sentiment scores based on dictionary-based methods using a large news corpus and demonstrated the risk of generating a spectrum of results with different levels of statistical significance by presenting an analysis of relationships between news sentiment and U.S. presidential approval. We summarize our findings into four best practices: 1) use a suitable sentiment dictionary; 2) do not assume that the validity and reliability of the dictionary are 'built-in'; 3) check for the influence of content length; and 4) do not use multiple dictionaries to test the same statistical hypothesis.
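The content-length caveat (best practice 3) is easy to see in a toy dictionary scorer: a raw net count of positive minus negative words grows with document length, whereas a length-normalised score does not. The word lists below are placeholders, not any of the 37 dictionaries examined in the paper.

```python
# Toy dictionary-based sentiment scorer illustrating raw vs. length-normalised scores.
import re

POSITIVE = {"growth", "success", "approval", "win"}
NEGATIVE = {"crisis", "scandal", "decline", "loss"}

def sentiment_score(text):
    """Return (raw net score, score normalised by document length in tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return raw, raw / max(len(tokens), 1)

print(sentiment_score("Approval ratings show growth despite the budget crisis."))
```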

