A two-stage balancing strategy based on data augmentation for imbalanced text sentiment classification

2021 ◽  
Vol 40 (5) ◽  
pp. 10073-10086
Author(s):  
Zhicheng Pang ◽  
Hong Li ◽  
Chiyu Wang ◽  
Jiawen Shi ◽  
Jiale Zhou

In practice, class imbalance is prevalent in sentiment classification tasks and is harmful to classifiers. Recently, over-sampling strategies based on data augmentation techniques have attracted researchers' attention. They generate new samples by rewriting the original samples. However, the samples to be rewritten are usually selected randomly, so uninformative samples may be chosen and then multiplied in the training set. Based on this observation, we propose a novel balancing strategy for text sentiment classification. Our approach is built on word replacement and has two stages, which together balance the class distribution of the training set and correct noisy data. In the first stage, we perform word replacement on specifically selected samples rather than random ones to obtain new samples. In the second stage, noise detection identifies noisy samples, whose sentiment is then revised. To this end, we propose an improved term weighting scheme for imbalanced text datasets, TF-IGM-CW, which helps extract both the target samples to rewrite and the feature words. We conduct experiments on four public sentiment datasets. Results suggest that our method outperforms several other resampling methods and can be easily integrated with various classification algorithms.
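
The general idea of over-sampling by word replacement can be sketched as follows. This is only an illustration under stated assumptions: the `SYNONYMS` table, the `rewrite` and `oversample` helpers, and the toy data are hypothetical, and source samples are picked in order rather than by the TF-IGM-CW weighting that the abstract describes but does not define.

```python
import random
from collections import Counter

# Hypothetical synonym table; a real system would use a lexicon or embeddings.
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
    "movie": ["film"],
}

def rewrite(text, rng):
    """Replace every word that has a synonym entry with a randomly chosen synonym."""
    words = text.split()
    return " ".join(rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words)

def oversample(samples, rng=None):
    """Append rewritten minority samples until each class matches the largest class.

    samples: list of (text, label) pairs. Source samples are taken in order
    (an assumption); the paper's targeted selection is not reproduced here.
    """
    rng = rng or random.Random(0)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    result = list(samples)
    for label, count in counts.items():
        pool = [text for text, lab in samples if lab == label]
        for i in range(target - count):
            result.append((rewrite(pool[i % len(pool)], rng), label))
    return result

data = [
    ("good movie", "pos"),
    ("great plot", "pos"),
    ("fine cast", "pos"),
    ("bad movie", "neg"),
]
balanced = oversample(data)
```

After the call, `balanced` holds the four original samples plus two rewritten negative samples, so both classes have three members.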

Author(s):  
Dale E. Bockman ◽  
L. Y. Frank Wu ◽  
Alexander R. Lawton ◽  
Max D. Cooper

B-lymphocytes normally synthesize small amounts of immunoglobulin, some of which is incorporated into the cell membrane where it serves as receptor of antigen. These cells, on contact with specific antigen, proliferate and differentiate to plasma cells which synthesize and secrete large quantities of immunoglobulin. The two stages of differentiation of this cell line (generation of B-lymphocytes and antigen-driven maturation to plasma cells) are clearly separable during ontogeny and in some immune deficiency diseases. The present report describes morphologic aberrations of B-lymphocytes in two diseases in which second stage differentiation is defective.


2020 ◽  
Vol 39 (6) ◽  
pp. 8139-8147
Author(s):  
Ranganathan Arun ◽  
Rangaswamy Balamurugan

In Wireless Sensor Networks (WSNs), the energy of sensor nodes is limited. To extend the lifetime of a WSN, it is essential to minimize energy consumption. Selecting a group head, or Cluster Head (CH), is a prominent method for extending WSN lifetime, in which the nodes with higher energy aggregate the network. Both intra-cluster and inter-cluster communication depend on the CH, so the energy level of the CH determines the lifetime of its cluster. When designing clustering algorithms, the difficult task is to determine the energy consumption of heterogeneous WSNs (HWSNs). The proposed work, based on Chaotic Firefly Algorithm CH (CFACH) selection, is named the “Novel Distributed Entropy Energy-Efficient Clustering Algorithm”, or DEEEC for short, for HWSNs. The DEEEC algorithm, a CH selection scheme, has two main stages. In the first stage, temporary CHs and their entropy values are identified using a relative measure of residual and original energy. In addition, the rotating epoch and its entropy value must be predicted automatically by the sensor nodes. In the second stage, any cluster member with larger residual energy may replace the temporary CH in the deciding set. Together, the two stages give nodes with more energy a higher probability of becoming CHs. The DEEEC algorithm is simulated in MATLAB. The simulation results show that DEEEC performs well in terms of energy and network lifetime when compared with the traditional clustering protocols currently used in heterogeneous WSNs.


Author(s):  
Fitriah Khoirunnisa ◽  
Friska Septiani Silitonga ◽  
Veri Firmansyah

This research aims to analyze the need for practical (laboratory) instructions based on Science Process Skills (Keterampilan Proses Sains, KPS) for developing the ability to design experiments on the topic of heat of reaction and calorimetry. The research was conducted with class XI students of SMA Negeri 2 Tanjungpinang City. The research variables comprise an analysis of learning-material needs and an analysis of the suitability of the Core Competencies (Kompetensi Inti, KI) and Basic Competencies (Kompetensi Dasar, KD). The research is descriptive and qualitative. The first stage analyzed learning-material needs by comparing two sets of practical instructions already in use at the school, in terms of the structure of the writing format, creativity, and the science process skills embedded in the instructions. It was concluded that the practical instructions used so far do not give students the opportunity to design the prescribed experiments. The second stage analyzed the suitability of the core and basic competencies in order to determine the competency achievement indicators (indikator pencapaian kompetensi, IPK) that will serve as a reference for developing practical instructions based on science process skills. From the two stages, it can be concluded that students need practical instructions that help them construct their thinking and activate their performance, so the Science Process Skills approach is a suitable choice for developing practical instructions that fit the characteristics of the 2013 curriculum.


2020 ◽  
Vol 64 (4) ◽  
pp. 40412-1-40412-11
Author(s):  
Kexin Bai ◽  
Qiang Li ◽  
Ching-Hsin Wang

To address the issues of the relatively small size of brain tumor image datasets, severe class imbalance, and low precision in existing segmentation algorithms for brain tumor images, this study proposes a two-stage segmentation algorithm integrating convolutional neural networks (CNNs) and conventional methods. Four modalities of the original magnetic resonance images were first preprocessed separately. Next, preliminary segmentation was performed using an improved U-Net CNN containing deep monitoring, residual structures, dense connection structures, and dense skip connections. The authors adopted a multiclass Dice loss function to deal with class imbalance and successfully prevented overfitting using data augmentation. The preliminary segmentation results subsequently served as the a priori knowledge for a continuous maximum flow algorithm for fine segmentation of target edges. Experiments revealed that the mean Dice similarity coefficients of the proposed algorithm in whole tumor, tumor core, and enhancing tumor segmentation were 0.9072, 0.8578, and 0.7837, respectively. The proposed algorithm presents higher accuracy and better stability in comparison with some of the more advanced segmentation algorithms for brain tumor images.
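
For readers unfamiliar with the Dice loss mentioned here, one common soft formulation averages the per-class Dice coefficient and subtracts it from 1. The sketch below assumes that formulation; the abstract does not state the exact variant the authors used, and the function name, the flat-list input layout, and the `eps` smoothing term are illustrative choices.

```python
def multiclass_dice_loss(probs, targets, eps=1e-7):
    """Soft multiclass Dice loss: 1 - mean per-class Dice coefficient.

    probs:   predicted class probabilities, shape [n_pixels][n_classes]
    targets: one-hot ground truth with the same shape
    eps:     small constant that avoids division by zero for empty classes
    """
    n_classes = len(probs[0])
    dice_sum = 0.0
    for c in range(n_classes):
        # Overlap between prediction and ground truth for class c.
        inter = sum(p[c] * t[c] for p, t in zip(probs, targets))
        # Total predicted mass plus total ground-truth mass for class c.
        denom = sum(p[c] for p in probs) + sum(t[c] for t in targets)
        dice_sum += (2.0 * inter + eps) / (denom + eps)
    return 1.0 - dice_sum / n_classes

truth = [[1, 0], [0, 1]]
perfect_loss = multiclass_dice_loss([[1, 0], [0, 1]], truth)
wrong_loss = multiclass_dice_loss([[0, 1], [1, 0]], truth)
```

Because each class contributes equally to the mean regardless of how many pixels it covers, small classes are not swamped by large ones, which is why this family of losses is often used under class imbalance.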


2014 ◽  
Vol 59 (1) ◽  
pp. 41-52
Author(s):  
Norbert Skoczylas

The author consulted some of the Polish experts who deal with assessing and preventing outburst hazards in order to draw on their knowledge and experience. On the basis of this knowledge, an expert system based on fuzzy logic was created. The system allows automatic assessment of outburst hazard. The work was completed in two stages. The first stage involved researching relevant sources and rules concerning outburst hazard and, subsequently, determining a number of parameters measured or observed in the mining industry that are potentially connected with the outburst phenomenon and can be useful when estimating outburst hazard. The author then contacted selected experts who are actively involved in preventing outburst hazard, in both industry and science. The experts were surveyed anonymously, which made it possible to select the parameters that are most essential in assessing outburst hazard. The second stage involved gathering knowledge from the experts by means of a questionnaire-interview. Subjective opinions on estimating outburst hazard on the basis of the parameters selected during the first stage were then systematized using the structures typical of a fuzzy-logic expert system.
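
A fuzzy-logic expert system of this kind typically maps each measured parameter to a membership degree and combines those degrees through rules. The following is a minimal sketch of that mechanism only; the parameter names (`gas_content`, `desorption`), the breakpoints, and the rule itself are illustrative assumptions and are not taken from the paper.

```python
def triangular(x, a, b, c):
    """Triangular membership function: 0 at or outside a and c, peaking at 1 when x == b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def hazard_degree(gas_content, desorption):
    """Firing strength of the hypothetical rule
    'IF gas content is high AND desorption is high THEN hazard is high'.
    AND is modelled as the minimum of the two membership degrees.
    """
    high_gas = triangular(gas_content, 4.0, 8.0, 12.0)   # assumed breakpoints
    high_des = triangular(desorption, 0.5, 1.5, 2.5)     # assumed breakpoints
    return min(high_gas, high_des)

full = hazard_degree(8.0, 1.5)
partial = hazard_degree(6.0, 1.5)
none = hazard_degree(2.0, 1.5)
```

In a full system, several such rules would be evaluated and their outputs aggregated and defuzzified into a single hazard estimate.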


2017 ◽  
Vol 924 (6) ◽  
pp. 6-16
Author(s):  
V.S. Tikunov ◽  
O.Yu. Chereshnia

The article presents a methodology for a comprehensive assessment of the environmental situation in the regions of the Russian Federation based on a pollution index and an index of ecological tension. The evaluation was carried out in two stages. In the first stage, the degree of pollution of the atmosphere, hydrosphere and lithosphere of the regions was estimated on the basis of the emission of pollutants into the atmosphere from stationary sources, the generation of solid domestic waste (SDW) and the discharge of contaminated wastewater. Based on these three indicators, a pollution index was constructed that estimates the aggregate pollution level. In the second stage, the authors estimated the loads generated by atmospheric emissions, solid waste and wastewater discharged into the territory of each region, per capita and in relation to the environmental capacity of the economy. This makes it possible to take into account the area of pollution, anthropogenic pressure and environmental responsibility of the population, as well as the environmental friendliness of production. On the basis of these relative indicators, the index of ecological tension was created.


2020 ◽  
Vol 10 (3) ◽  
pp. 62
Author(s):  
Tittaya Mairittha ◽  
Nattaya Mairittha ◽  
Sozo Inoue

The integration of digital voice assistants in nursing residences is becoming increasingly important to facilitate nursing productivity with documentation. A key idea behind this system is training natural language understanding (NLU) modules that enable the machine to classify the purpose of the user utterance (intent) and extract pieces of valuable information present in the utterance (entity). One of the main obstacles when creating robust NLU is the lack of sufficient labeled data, which generally relies on human labeling. This process is cost-intensive and time-consuming, particularly in the high-level nursing care domain, which requires abstract knowledge. In this paper, we propose an automatic dialogue labeling framework of NLU tasks, specifically for nursing record systems. First, we apply data augmentation techniques to create a collection of variant sample utterances. The individual evaluation result strongly shows a stratification rate, with regard to both fluency and accuracy in utterances. We also investigate the possibility of applying deep generative models for our augmented dataset. The preliminary character-based model based on long short-term memory (LSTM) obtains an accuracy of 90% and generates various reasonable texts with BLEU scores of 0.76. Secondly, we introduce an idea for intent and entity labeling by using feature embeddings and semantic similarity-based clustering. We also empirically evaluate different embedding methods for learning good representations that are most suitable to use with our data and clustering tasks. Experimental results show that fastText embeddings produce strong performances both for intent labeling and on entity labeling, which achieves an accuracy level of 0.79 and 0.78 f1-scores and 0.67 and 0.61 silhouette scores, respectively.


2021 ◽  
Vol 11 (14) ◽  
pp. 6368
Author(s):  
Fátima A. Saiz ◽  
Garazi Alfaro ◽  
Iñigo Barandiaran ◽  
Manuel Graña

This paper describes the application of Semantic Networks for the detection of defects in images of metallic manufactured components in a situation where the number of available samples of defects is small, which is rather common in real practical environments. In order to overcome this shortage of data, the common approach is to use conventional data augmentation techniques. We resort to Generative Adversarial Networks (GANs) that have shown the capability to generate highly convincing samples of a specific class as a result of a game between a discriminator and a generator module. Here, we apply the GANs to generate samples of images of metallic manufactured components with specific defects, in order to improve training of Semantic Networks (specifically DeepLabV3+ and Pyramid Attention Network (PAN) networks) carrying out the defect detection and segmentation. Our process carries out the generation of defect images using the StyleGAN2 with the DiffAugment method, followed by a conventional data augmentation over the entire enriched dataset, achieving a large balanced dataset that allows robust training of the Semantic Network. We demonstrate the approach on a private dataset generated for an industrial client, where images are captured by an ad-hoc photometric-stereo image acquisition system, and a public dataset, the Northeastern University surface defect database (NEU). The proposed approach achieves an improvement of 7% and 6% in an intersection over union (IoU) measure of detection performance on each dataset over the conventional data augmentation.
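
The intersection over union (IoU) measure reported above is the ratio of the overlap between predicted and ground-truth regions to their combined area. A minimal sketch for binary masks follows; the flat-list mask representation and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as flat lists of 0/1."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Two empty masks agree completely; treat that case as a perfect score.
    return inter / union if union else 1.0

score = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

Here one pixel is shared and three pixels are covered by at least one mask, so `score` is 1/3.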


2021 ◽  
Vol 189 ◽  
pp. 292-299
Author(s):  
Caroline Sabty ◽  
Islam Omar ◽  
Fady Wasfalla ◽  
Mohamed Islam ◽  
Slim Abdennadher
