Sentence Segmentation: Recently Published Documents

TOTAL DOCUMENTS: 78 (five years: 18)
H-INDEX: 7 (five years: 1)

2021, Vol 9 (2), pp. 196-206
Author(s): Parmonangan R. Togatorop, Rezky Prayitno Simanjuntak, Siti Berliana Manurung, Mega Christy Silalahi

Modeling an Entity Relationship Diagram (ERD) can be done manually, but producing an ERD by hand generally takes a long time. An ERD generator driven by requirements specifications is therefore needed to simplify ERD modeling. This study aims to develop a system that generates an ERD from requirements specifications written in Indonesian by applying several stages of Natural Language Processing (NLP) as required by the research. The requirements specifications were collected by the research team using a document-analysis technique. The NLP stages used are: case folding, sentence segmentation, tokenization, POS tagging, chunking, and parsing. The words in the processed text are then identified with a rule-based method to find the words that qualify as ERD components: entities, attributes, primary keys, and relationships. The ERD is then drawn with Graphviz from the extracted components. The generated ERDs were evaluated using the expert-judgement method. Across several case studies, the average precision, recall, and F1 score per expert were: expert 1: 91%, 90%, 90%; expert 2: 90%, 90%, 90%; expert 3: 98%, 94%, 96%; expert 4: 93%, 93%, 93%; and expert 5: 98%, 83%, 90%.
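The pipeline described in the abstract (case folding, sentence segmentation, tokenization, then rule-based extraction of ERD components) can be sketched in miniature. This is a toy illustration, not the authors' system: the single `has`-pattern rule and the example specification are invented for the sketch, and a real pipeline would use POS tagging and chunking rather than one regular expression.

```python
import re

def case_fold(text):
    """Stage 1: lowercase the whole specification."""
    return text.lower()

def segment_sentences(text):
    """Stage 2: naive rule-based sentence segmentation on ., ! and ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Stage 3: simple word tokenization."""
    return re.findall(r"\w+", sentence)

# Stage 4: one toy rule standing in for POS tagging + chunking +
# rule-based matching: "<entity> has <attr>, <attr> and <attr>"
HAS_RULE = re.compile(r"^(\w+) has (.+)$")

def extract_components(sentence):
    """Map one sentence to an entity and its attributes, or None."""
    m = HAS_RULE.match(sentence.rstrip("."))
    if not m:
        return None
    attrs = re.split(r",\s*|\s+and\s+", m.group(2))
    return {"entity": m.group(1), "attributes": attrs}

spec = "Student has name, id and address. Course has title."
components = [extract_components(s) for s in segment_sentences(case_fold(spec))]
```

The extracted component list could then be serialized to Graphviz DOT nodes and edges, as the abstract describes.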


2021, Vol 25 (6), pp. 15-33
Author(s): Chanatip Saetia, Ekapol Chuangsuwanich, Tawunrat Chalothorn, Peerapon Vateekul

2021, Vol 1 (13), pp. 92-101
Author(s): Serhii Krivenko, Natalya Rotaniova, Yulianna Lazarevska

A scenario (narrative schema) is a socially established sequence of steps for achieving a set goal, and carries the most complete information about every possible way the described situation can unfold (including choice points and branches). The creation of the XML platform opened a new, technologically more advanced stage in the development of the Web. As a result, the XML platform has become a significant component of information-system development technology, and the trend toward integrating such systems at the level of corporations, agencies, and ministries only strengthens the position of XML in information technology in general. A system for automatically detecting non-standard scenarios in text messages has been developed. The system comprises stages of ontology formation, sentence parsing, and scenario comparison. Sentence parsing uses the classic natural language processing (NLP) pipeline, which supports the most common tasks: tokenization, sentence segmentation, part-of-speech tagging, named-entity extraction, chunking, parsing, and coreference resolution. Maximum-entropy and perceptron-based machine learning are also possible. Ontologies are stored using OWL technology. During analysis, the object-target parses of sentences are compared with the described OWL model. A SPARQL query against a source object returns query models to a table object. The table class is the base class for all table objects and provides an interface for accessing values in the rows and columns of the result table. If a table object has exactly three columns, it can be used to build a new data-source object; this provides a convenient mechanism for retrieving a subset of data from one data source and adding it to another. In the context of the RDF API, a node is defined as the set of all statements about the subject of a URI.
The content of the table is compared with the semantics of the sentence. If the sentence's scenario does not match the OWL ontology model, the object may be acting atypically; in that case the message is flagged as suspicious. To make the fullest use of text analysis, it is necessary to build a corpus of ontologies or to reuse existing ones (Akutan, Amazon, etc.) with their features taken into account. To enlarge the object ontologies, additional neural-network training methods can be applied.
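The scenario-comparison step can be illustrated with a toy model. Here the OWL ontology is reduced to a plain set of permitted (subject, action) pairs instead of a real triple store with SPARQL, and the sentence "parser" is a naive subject-verb split; the ontology contents and example sentences are invented for the sketch.

```python
# Toy stand-in for an OWL ontology: the (subject, action) pairs the
# model considers typical for each object class.
ONTOLOGY = {
    ("customer", "order"),
    ("customer", "pay"),
    ("courier", "deliver"),
}

def parse_scenario(sentence):
    """Naive subject-verb extraction; a real system would use full NLP parsing."""
    words = sentence.lower().rstrip(".").split()
    return (words[0], words[1]) if len(words) >= 2 else None

def is_suspicious(sentence):
    """Flag the message when its scenario is absent from the ontology model."""
    return parse_scenario(sentence) not in ONTOLOGY
```

A production system would replace the set lookup with a SPARQL ASK or SELECT query against the OWL store, as the abstract describes.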


PLoS ONE, 2020, Vol 15 (11), pp. e0241979
Author(s): Friederike Tegge, Katharina Parry

The text-evaluation application Coh-Metrix and natural language processing tools rely on the sentence as the unit of text segmentation and analysis, and frequently detect sentence boundaries by means of punctuation. Problems arise when target texts such as pop song lyrics do not follow formal standards of written text composition and lack punctuation in the original. In such cases it is common for human transcribers to prepare texts for analysis, often following unspecified or at least unreported rules of text normalization and potentially relying on an assumed shared understanding of the sentence as a text-structural unit. This study investigated whether using different transcribers to insert typographical symbols into song lyrics during the pre-processing of textual data can result in significant differences in sentence delineation. Results indicate that different transcribers (following commonly agreed-upon rules of punctuation based on their extensive experience with language and writing as language professionals) can produce differences in sentence segmentation. This affects the analysis results for at least some Coh-Metrix measures and highlights the problem of transcription, with potential consequences for quantification at and above the sentence level. It is argued that when analyzing non-traditional written texts or transcripts of spoken language, uniform text interpretation and segmentation during pre-processing cannot be assumed. It is advisable to provide clear rules for text normalization at the pre-processing stage, and to make these explicit in documentation and publication.
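The effect described above can be made concrete: the same originally punctuation-free line, normalized by two hypothetical transcribers, yields different sentence counts under a simple punctuation-driven segmenter. The lyric and both normalizations are invented for illustration.

```python
import re

def count_sentences(text):
    """Punctuation-driven segmentation, as used by many text-analysis tools."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])

# One transcriber hears three full stops, the other a single clause chain.
transcriber_a = "I know. You know. We all know."
transcriber_b = "I know, you know, we all know."
```

Any sentence-level measure computed downstream (mean sentence length, syntactic complexity, and so on) inherits this threefold disagreement.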


Author(s): Kairit Sirts, Kairit Peekman

Texts obtained from the web are noisy and do not necessarily follow orthographic sentence and word boundary rules. Thus, sentence segmentation and word tokenization systems developed on well-formed texts may not perform as well on unedited web texts. In this paper, we first describe the manual annotation of sentence boundaries in an Estonian web dataset and then present the evaluation results of three existing sentence segmentation and word tokenization systems on this corpus: EstNLTK, Stanza and UDPipe. While EstNLTK obtains the highest sentence segmentation performance on this dataset, the sentence segmentation performance of Stanza and UDPipe remains well below the results obtained on the more well-formed Estonian UD test set.
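Sentence segmentation systems such as those above are typically scored by comparing predicted boundary positions against manually annotated gold boundaries. A minimal sketch of boundary-level precision, recall and F1 (the character offsets below are invented):

```python
def boundary_prf(gold, predicted):
    """Precision, recall and F1 over sentence-boundary character offsets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # boundaries both agree on
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# System found boundaries at offsets 10, 25, 55; gold has 10, 25, 40.
p, r, f1 = boundary_prf({10, 25, 40}, {10, 25, 55})
```

Comparing such scores on a web corpus against the Estonian UD test set quantifies the performance drop the abstract reports.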


Author(s): Chanatip Saetia, Tawunrat Chalothorn, Ekapol Chuangsuwanich, Peerapon Vateekul
