Corpus Construction: Recently Published Documents

Total documents: 112 (last five years: 53)
H-index: 7 (last five years: 2)

2022 ◽  
Author(s):  
Sebastião Pais ◽  
João Cordeiro ◽  
Muhammad Jamil

Abstract Nowadays, the use of language corpora for many purposes has increased significantly. General corpora exist for numerous languages, but research often needs more specialized corpora. The Web's rapid growth has significantly improved access to thousands of online documents in electronic form: highly specialized texts, as well as comparable texts on the same subject covering several languages. However, research has continued to concentrate on corpus annotation rather than corpus creation tools. Consequently, many researchers create their own corpora, solve problems independently, and build project-specific systems. Corpus construction underpins many NLP applications, including machine translation, information retrieval, and question answering. This paper presents a new NLP Corpus and Services in the Cloud, called HULTIG-C. HULTIG-C covers multiple languages and includes unique annotation sets, such as keyword, sentence, named-entity, and multiword sets. Moreover, a framework incorporating the main components of license detection, language identification, boilerplate removal, and document deduplication is used to process HULTIG-C. Furthermore, the paper discusses some potential issues in constructing multilingual corpora from the Web.
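The processing chain this abstract describes (boilerplate removal and document deduplication, among other stages) can be sketched in outline. The snippet below is an illustrative assumption of how such a Web-corpus pipeline might look, not HULTIG-C's actual implementation; the function names and heuristics are hypothetical:

```python
import hashlib
import re

def strip_boilerplate(html):
    """Crude boilerplate removal: drop script/style blocks and all tags,
    keeping only the remaining text."""
    html = re.sub(r"<(script|style)\b.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(docs):
    """Drop exact duplicates by content hash (a simplification of
    near-duplicate detection such as shingling or MinHash)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def build_corpus(pages):
    """Minimal two-stage pipeline: clean each page, then deduplicate."""
    return deduplicate(strip_boilerplate(p) for p in pages)
```

A production pipeline would add the language identification and license detection stages the paper names, and would typically use near-duplicate detection rather than exact hashing.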


2021 ◽  
Vol 14 ◽  
pp. 286-293
Author(s):  
Lin Shi

From the perspective of linguistic complexity, this paper explores the correlation between linguistic complexity and audience recognition in college English speech contests. Using corpus construction and computer-visualized data analysis, the study analyzes the speeches of contestants at different levels in the FLTRP Cup National English Speaking Contest 2019-2020, the most authoritative college English speech contest in China. The study shows that: 1) in college English speech contests, the lexical complexity of a speech is negatively correlated with audience recognition (i.e., the final ranking of the competition or the success of the speech); 2) the syntactic complexity of a speech has a reasonable interval within which good audience recognition is ensured; 3) the correlation between the lexical and syntactic complexity of speeches and audience recognition resembles the correlation between rhetoric and audience recognition reported by previous researchers in the field of political speeches. We therefore believe the study has practical value: it provides linguistic-complexity evidence for predicting the winners of college English speech contests and helps contestants better prepare for them.
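The abstract does not specify which lexical complexity measure was used. As an illustration only, one common measure is the moving-average type-token ratio (MATTR), which can be computed as follows:

```python
import re

def lexical_complexity(text, window=50):
    """Moving-average type-token ratio (MATTR), one common lexical
    complexity measure: average the type-token ratio over a sliding
    window of fixed size. Falls back to plain TTR for short texts."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if len(tokens) <= window:
        return len(set(tokens)) / max(len(tokens), 1)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Higher values indicate a more varied vocabulary; the windowing makes the measure comparable across speeches of different lengths, unlike raw TTR.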


2021 ◽  
Vol 5 (11) ◽  
pp. 95-103
Author(s):  
Xinyu Lun

This study analyzes the tendencies and changes in research on construction grammar from 2010 to 2020, descriptively covering publication year, research topic, research direction, research content, and research methods. Twenty-four CSSCI journals were selected as research samples using the keyword "Construction Grammar." The research topics mainly include Chinese construction research, foreign language construction research, and comparative studies of Chinese and other-language constructions. The results showed that Chinese construction research is abundant, while the other two topics still require development. Ontology research was the main focus; acquisition research and teaching research are worthy of further exploration. Case studies and theoretical studies received the most attention, whereas studies on language acquisition, pedagogy, and corpus construction remained weak. Qualitative description and theoretical review were the most popular methods, while empirical, quantitative, and diachronic analyses were used less frequently. Based on these trends, the study predicts that research on construction grammar will continue to heat up, with more research directions and contents and more diversified research methods.


2021 ◽  
Author(s):  
Paul Radford

<p>Event log messages are currently the only genuine interface through which computer systems administrators can effectively monitor their systems and assemble a mental perception of system state. The popularisation of the Internet and the accompanying meteoric growth of business-critical systems have resulted in an overwhelming volume of event log messages, channeled through mechanisms whose designers could not have envisaged the scale of the problem. Messages regarding intrusion detection, hardware status, operating system status changes, database tablespaces, and so on, are produced at the rate of many gigabytes per day in a significant computing environment. Filtering technologies have not been able to keep up. Most messages go unnoticed; no filtering whatsoever is performed on them, at least in part due to the difficulty of implementing and maintaining an effective filtering solution. The most commonly deployed filtering alternatives rely on regular expressions to match pre-defined strings with 100% accuracy, an approach that becomes ineffective as the code base for the software producing the messages 'drifts' away from those strings. The exactness requirement means that, to make full use of this technique, all possible failure scenarios must be accurately anticipated and their events catered for with regular expressions. Alternatives to regular expressions remain largely academic. Data mining, automated corpus construction, and neural networks, to name the highest-profile ones, only produce probabilistic results and are either difficult or impossible to alter in any deterministic way. Policies are therefore not supported under these alternatives. This thesis explores a new architecture which utilises rich metadata in order to avoid the burden of message interpretation. The metadata itself is based on an intention to improve end-to-end communication and reduce ambiguity.
A simple yet effective filtering scheme is also presented which filters log messages through a short and easily customisable set of rules. With such an architecture, it is envisaged that systems administrators could significantly improve their awareness of their systems while avoiding many of the false positives and negatives which plague today's filtering solutions.</p>
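The brittleness of exact regex matching that the thesis criticises is easy to demonstrate. The rule set below is a hypothetical example in the spirit of conventional filters, not the scheme proposed in the thesis; once the source software rewords a message, a rule silently stops firing:

```python
import re

# Hypothetical first-match rule set: (pattern, action) pairs in priority order.
RULES = [
    (re.compile(r"disk .* failure", re.I), "page-oncall"),
    (re.compile(r"login failed", re.I), "security-log"),
    (re.compile(r"debug", re.I), "drop"),
]

def route(message, default="archive"):
    """Return the action for the first matching rule. If the producing
    software's wording drifts (e.g. 'disk fault' instead of 'disk
    failure'), the message quietly falls through to the default."""
    for pattern, action in RULES:
        if pattern.search(message):
            return action
    return default
```

For example, `route("Disk sda failure imminent")` pages the on-call engineer, but the semantically identical `route("disk sda fault imminent")` falls through to the archive, which is exactly the failure mode that motivates a metadata-based design.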



Human Arenas ◽  
2021 ◽  
Author(s):  
Sára Bigazzi ◽  
Fanni Csernus ◽  
Anna Siegler ◽  
Ildikó Bokrétás ◽  
Sára Serdült ◽  
...  

Abstract The representations of heroes and heroic acts point to the social values, norms, and morality of the present, creating a bridge between the past and a potential future. In this paper, a cross-cultural explorative study of heroes is presented, aiming to explore general tendencies and possible patterns related to different social contexts. Participants from seven countries were reached via social media (N = 974) for corpus construction. We asked about their choice of hero, national hero, and desired heroic action in their respective countries, and conducted a thematic analysis. Results show a high rate of no choice; among those who chose, the prototypical hero is a lone moral man acting in the private (family) or public sphere (political actors). Both spheres offer the naturalization of the hero, and there is a dialogical frame between the exceptional and the ordinary. Chosen heroes are predominantly contemporary males, either family members or political figures. While the purpose attributed to the personal hero is to maintain stability, the purpose attributed to heroic actions in the public sphere is to obtain change. Similarities and differences between the seven subcorpora are also described.


2021 ◽  
Vol 2021 ◽  
pp. 1-18
Author(s):  
Yun Zhang ◽  
Yongguo Liu ◽  
Jiajing Zhu ◽  
Zhi Chen ◽  
Dongxiao Li ◽  
...  

Cold pathogenic disease is a widespread disease category in traditional Chinese medicine, which includes influenza and respiratory infections associated with high incidence and mortality. Discovering effective core drugs in Chinese medicine prescriptions for treating the disease and reducing patients' symptoms has attracted great interest. In this paper, we explore the core drugs for curing various syndromes of cold pathogenic disease from large-scale literature. We propose a core drug discovery framework incorporating word embedding and community detection algorithms, which contains three parts: disease corpus construction, drug network generation, and core drug discovery. First, the disease corpus is established by collecting and preprocessing large-scale literature on the Chinese medicine treatment of cold pathogenic disease from the China National Knowledge Infrastructure. Second, we adopt the Chinese word embedding model SSP2VEC to mine the drug semantics implied in the literature; a drug network is then established from the semantic similarity among drugs. Third, the community detection method COPRA, based on label propagation, is adopted to reveal drug communities and identify core drugs in the drug network. We compute the community size, closeness centrality, and degree distributions of the drug network to analyse the patterns of core drugs. We acquired 4681 articles from the China National Knowledge Infrastructure. Twelve significant drug communities were discovered, in which the top 10 drugs in every community were recognized as core drugs with high accuracy, and four classical prescriptions for treating different syndromes of cold pathogenic disease were discovered. The proposed framework can identify effective core drugs for curing cold pathogenic disease, and the research can help doctors verify the compatibility laws of Chinese medicine prescriptions.
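The community detection step rests on label propagation. COPRA extends plain label propagation to overlapping communities; the minimal, non-overlapping variant below illustrates the core idea on a generic adjacency structure (the actual drug network and SSP2VEC embeddings are not reproduced here):

```python
import random
from collections import Counter

def label_propagation(adjacency, iterations=20, seed=0):
    """Plain label propagation: every node starts with its own label,
    then repeatedly adopts the most frequent label among its
    neighbours (asynchronous updates, random visiting order).
    Densely connected groups converge to a shared label."""
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}
    nodes = list(adjacency)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                continue
            counts = Counter(labels[n] for n in neighbours)
            labels[node] = counts.most_common(1)[0][0]
    return labels
```

In the paper's setting the nodes would be drugs and edges would link drugs whose embedding similarity exceeds a threshold; the resulting label groups correspond to candidate drug communities, from which the highest-degree members are read off as core drugs.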


