Implementasi Sentimen Analysis Pengolahan Kata Berbasis Algoritma Map Reduce Menggunakan Hadoop (Implementation of Word-Processing Sentiment Analysis Based on the MapReduce Algorithm Using Hadoop)

2018 ◽  
Vol 4 (1) ◽  
pp. 11-16
Author(s):  
Fawaid Badri

Sentiment analysis is a field of text- and information-based research. The text documents in this study come from the web and concern socialization issues. The method used in this study applies the MapReduce algorithm to count word occurrences, which are then used to infer meaning in the context of public opinion. MapReduce takes an input data set and converts it into another data set in which individual items are broken into (key, value) tuples. In the first stage, the map phase reads input text stored in HDFS (the Hadoop Distributed File System) and transforms it into key-value tuples. The next step is the shuffle and reduce phase, which groups the tuples by key and aggregates them into the final processed data set. Applying sentiment analysis with MapReduce in this way, the study obtains very good results on large amounts of data.
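As a rough illustration of the stages described above, the word-counting core of the approach can be sketched in plain Python (a local stand-in for Hadoop's distributed map, shuffle, and reduce phases; the function names are illustrative, not from the paper):

```python
from itertools import groupby

def map_phase(document):
    """Map step: emit a (word, 1) tuple for every token in the input text."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(tuples):
    """Shuffle step: sort the tuples and group them by key (the word)."""
    ordered = sorted(tuples, key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(ordered, key=lambda kv: kv[0])}

def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["good service good price", "bad service"]
tuples = [kv for d in docs for kv in map_phase(d)]
counts = reduce_phase(shuffle_phase(tuples))
# counts["good"] == 2, counts["service"] == 2, counts["bad"] == 1
```

In Hadoop, the map and reduce functions run on separate nodes and the framework performs the shuffle over the network; the data flow, however, is the same.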

2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a neural-network-based text sentiment classifier. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
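The first step of the method, representing a document as a linearly weighted average of pre-trained word vectors, can be sketched as follows (the tiny vector table and the uniform-weight default are illustrative assumptions; real systems load word2vec- or GloVe-style vectors):

```python
import numpy as np

# Hypothetical 2-dimensional pre-trained word vectors for demonstration.
word_vectors = {
    "good": np.array([0.9, 0.1]),
    "bad":  np.array([0.1, 0.9]),
    "film": np.array([0.5, 0.5]),
}

def document_vector(tokens, weights=None):
    """Represent a document as a weighted average of its word vectors.

    Tokens without a vector are skipped; with no weights given, a plain
    (uniform) average is used.
    """
    vecs, w = [], []
    for i, tok in enumerate(tokens):
        if tok in word_vectors:
            vecs.append(word_vectors[tok])
            w.append(weights[i] if weights else 1.0)
    if not vecs:
        return np.zeros(2)
    return np.average(vecs, axis=0, weights=w)

doc = document_vector(["good", "film"])
# doc is the mean of the "good" and "film" vectors: [0.7, 0.3]
```

The resulting document vectors would then be fed to a neural-network classifier, whose training signal propagates sentiment polarity back into the representation.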


Author(s):  
Liah Shonhe

The main focus of the study was to explore the practices of open data sharing in the agricultural sector, including establishing the research outputs concerning open data in agriculture. The study adopted a desktop research methodology based on a literature review and bibliographic data from the Web of Science (WoS) database. Bibliometric indicators discussed include yearly productivity, the most prolific authors, and the leading contributing countries. Study findings revealed that research activity in the field of agriculture and open access is very low. There were 36 OA articles, and only 6 publications had an open data badge. Most researchers do not yet embrace the need to openly publish their data sets despite the availability of numerous open data repositories. Unfortunately, most African countries are still lagging behind in the management of agricultural open data. The study therefore recommends that researchers publish their research data sets as OA. African countries need to put more effort into establishing open data repositories and implementing the necessary policies to facilitate OA.


2016 ◽  
pp. 1220-1243
Author(s):  
Ilias K. Savvas ◽  
Georgia N. Sofianidou ◽  
M-Tahar Kechadi

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets: HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique prove its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
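One K-means iteration maps naturally onto MapReduce: the map step assigns each point to its nearest centroid, and the reduce step averages the points per centroid. A minimal local sketch of that decomposition (not the authors' implementation; names and data are illustrative):

```python
import numpy as np

def kmeans_map(points, centroids):
    """Map step: emit (centroid_index, (point, 1)) for each point's
    nearest centroid."""
    out = []
    for p in points:
        idx = int(np.argmin([np.linalg.norm(p - c) for c in centroids]))
        out.append((idx, (p, 1)))
    return out

def kmeans_reduce(mapped, k):
    """Reduce step: average the points assigned to each centroid to get
    the updated centroids."""
    dim = len(mapped[0][1][0])
    sums, counts = [np.zeros(dim) for _ in range(k)], [0] * k
    for idx, (p, n) in mapped:
        sums[idx] = sums[idx] + p
        counts[idx] += n
    return [s / c if c else s for s, c in zip(sums, counts)]

points = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
centroids = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
new_centroids = kmeans_reduce(kmeans_map(points, centroids), k=2)
# new_centroids[0] ≈ [0.05, 0.0], new_centroids[1] ≈ [5.0, 5.0]
```

In the distributed setting, each mapper holds a partition of the data in HDFS, and the driver repeats the map/reduce round until the centroids converge.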


2020 ◽  
Vol 16 (2) ◽  
pp. 8-22
Author(s):  
Tirath Prasad Sahu ◽  
Sarang Khandekar

Sentiment analysis can be a very useful tool for extracting information from text documents. Its central question is what people think about a particular online review subject, i.e. product reviews, movie reviews, etc. Sentiment analysis is the process by which these reviews are classified as positive or negative. The web is enriched with a huge number of reviews that can be analyzed to make them meaningful. This article presents the use of lexicon resources for sentiment analysis of different publicly available reviews. First, the polarity shift caused by negations is handled; intensifiers, punctuation and acronyms are also taken into consideration during the processing phase. Second, opinion-bearing words are extracted and used to compute a score. Third, machine learning algorithms are applied, and the experimental results show that the proposed model is effective in identifying the sentiments of reviews and opinions.
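The lexicon-with-negation idea can be sketched in a few lines of Python (the mini-lexicon, negation list, and intensifier weights below are made-up placeholders; real systems draw on resources such as SentiWordNet):

```python
# Hypothetical mini-lexicon mapping opinion words to polarity scores.
LEXICON = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.5}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}

def score_review(text):
    """Sum lexicon scores over the review, flipping polarity after a
    negation word and scaling after an intensifier."""
    score, flip, scale = 0.0, 1.0, 1.0
    for word in text.lower().split():
        if word in NEGATIONS:
            flip = -1.0
        elif word in INTENSIFIERS:
            scale = INTENSIFIERS[word]
        elif word in LEXICON:
            score += flip * scale * LEXICON[word]
            flip, scale = 1.0, 1.0  # modifiers apply to the next opinion word only
    return score

# score_review("not good") == -1.0; score_review("very good") == 1.5
```

A review with a positive total score would be labeled positive; these scores (or the extracted features) can then feed the machine learning classifiers mentioned above.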


2015 ◽  
Vol 2 (3) ◽  
pp. 170
Author(s):  
Ade Jamal ◽  
Denny Hermawan ◽  
Muhammad Nugraha

<p><em>Abstract</em> <strong>- A study of distributed processing of GenBank data using the Hadoop Distributed File System (HDFS), aimed at assessing the effectiveness of data processing, particularly sequence search over large input data. The research was conducted at the Network Laboratory of Universitas Al Azhar Indonesia using 6 computers and one server, on which Hadoop was configured as 7 nodes (1 namenode, 7 datanodes, and 1 secondary namenode). HDFS experiments using 1, 2, 4, 6, and 7 nodes were compared with the local filesystem (LFS). In the first experiment scenario, GenBank sequence search on 1-7 nodes produced output displaying 3 fields (Locus, Definition, and Authors); the second scenario displayed 3 fields (Locus, Authors, and Origin); and the third scenario, using both HDFS and LFS, displayed all fields contained in a GenBank record (Locus, Definition, Accession, Version, Keywords, Source, Organism, Reference, Authors, Title, Journal, Pubmed, Comment, Features, and Origin). Evaluation shows that GenBank sequence search using HDFS with 7 nodes is 4 times faster than with 1 node, while HDFS with 1 node is 1.02 times faster than the local filesystem on a 4-core processor.</strong></p><p><strong><em>Keywords</em></strong><em> – GenBank, sequence, distributed computing, Hadoop, HDFS</em></p>
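To illustrate the kind of field extraction these experiments perform, here is a minimal Python sketch that pulls top-level fields out of a simplified flat GenBank-style record (an illustrative assumption: the real format nests AUTHORS under REFERENCE and has stricter column rules, and on Hadoop this function would run inside a mapper over record splits):

```python
def extract_fields(record_text, wanted=("LOCUS", "DEFINITION", "AUTHORS")):
    """Extract selected top-level fields from a flat GenBank-style record.

    A line whose first 12 columns contain a keyword starts a new field;
    indented continuation lines are appended to the current field.
    """
    fields, current = {}, None
    for line in record_text.splitlines():
        key = line[:12].strip()
        if key:
            current = key
            fields.setdefault(current, "")
            fields[current] += line[12:].strip()
        elif current:
            fields[current] += " " + line.strip()
    return {k: v.strip() for k, v in fields.items() if k in wanted}

# A toy record in the simplified layout described above.
record = """LOCUS       AB000001     1200 bp
DEFINITION  Example sequence.
AUTHORS     Doe,J."""

fields = extract_fields(record)
# fields["DEFINITION"] == "Example sequence."
```

Restricting `wanted` to three fields versus all fifteen mirrors the difference between the first two experiment scenarios and the third.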


2020 ◽  
Author(s):  
Banafsheh Abdollahi ◽  
Rolf Hut ◽  
Nick van de Giesen

Irrigation is crucial for sustaining food security for the growing population around the world. Irrigation affects the hydrological cycle both directly, during the process of water abstraction and irrigation, and indirectly, because of infrastructure that has been built in support of irrigation, such as canals, dams, reservoirs, and drainage systems. For evaluating the availability of freshwater resources in the light of growing food demand, modeling the global hydrological cycle is vital. The GlobWat model is one of the models designed for large-scale hydrological modeling, with a specific focus on water use by irrigated agriculture. Both the model's underlying assumptions and the global input data sets used to feed it can be sources of uncertainty in its output. One of the most challenging inputs to global hydrological models is the climate data set. Several climate forcings are available at the global scale, such as ERA5 and ERA-Interim. In this study, we assess the sensitivity of the GlobWat model to these climate forcings. Pre-processing climate data at a large scale used to be difficult; recently, this has become much easier thanks to the data and scripts provided by the eWaterCycle team at the eScience Center in Amsterdam, The Netherlands. We will use eWaterCycle's freely available data sources for our assessment and then compare the model results with observed data at a local scale.


2017 ◽  
Vol 10 (13) ◽  
pp. 355 ◽  
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analyzing high-dimensional data sets. A raw input data set may have many dimensions, and analysis may waste time and produce wrong predictions if unnecessary data attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data for more accurate prediction at less cost. In this paper, the different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis and artificial neural networks, have been studied.
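As a concrete instance of the first two techniques listed, PCA can be computed from the SVD of the mean-centered data matrix; a minimal NumPy sketch (not from the paper):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components.

    The principal directions are the right singular vectors of the
    mean-centered data matrix.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy data whose second attribute is redundant (an exact multiple of the
# first), so one component captures all the variance.
X = np.array([[2.0, 0.1], [4.0, 0.2], [6.0, 0.3], [8.0, 0.4]])
Z = pca_reduce(X, 1)  # 4 samples reduced from 2 dimensions to 1
```

Dropping the redundant attribute here loses nothing, which is exactly the case the abstract describes: unnecessary attributes add cost without adding information.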


2021 ◽  
Vol 30 (1) ◽  
pp. 479-486
Author(s):  
Lingrui Bu ◽  
Hui Zhang ◽  
Haiyan Xing ◽  
Lijun Wu

Abstract The efficient processing of large-scale data has very important practical value. In this study, a data mining platform based on the Hadoop Distributed File System was designed, and the K-means algorithm was improved with the max-min distance idea. On the Hadoop platform, parallelization was realized with MapReduce. Finally, the data processing effect of the algorithm was analyzed on the Iris data set. The results showed that the parallel algorithm assigned more samples correctly than the traditional algorithm; in a single-machine environment, the parallel algorithm ran longer; when faced with large data sets, the traditional algorithm ran out of memory, while the parallel algorithm completed the calculation task; and the speedup of the parallel algorithm rose as cluster size and data set size grew, showing a good parallel effect. The experimental results verify the reliability of the parallel algorithm for big data processing, contributing to further improving the efficiency of data mining.
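The max-min distance idea replaces K-means' random initialization: after an arbitrary first center, each new center is the point farthest from its nearest already-chosen center, spreading the initial centers out. A small sketch of that initialization (illustrative, not the authors' code):

```python
import numpy as np

def max_min_init(X, k):
    """Max-min distance initialization for K-means.

    Start from the first point, then repeatedly pick the point whose
    distance to its nearest chosen center is largest.
    """
    centers = [X[0]]
    while len(centers) < k:
        dists = np.array([min(np.linalg.norm(x - c) for c in centers)
                          for x in X])
        centers.append(X[int(np.argmax(dists))])
    return np.array(centers)

# Two well-separated clumps: the second center lands in the far clump.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = max_min_init(X, 2)
```

With centers seeded this way, the MapReduce iterations described above start closer to the true cluster structure, which is what improves the assignment accuracy reported in the results.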


2018 ◽  
Author(s):  
Ionut Iosifescu Enescu ◽  
Marielle Fraefel ◽  
Gian-Kasper Plattner ◽  
Lucia Espona-Pernas ◽  
Dominik Haas-Artho ◽  
...  

EnviDat is the institutional research data portal of the Swiss Federal Institute for Forest, Snow and Landscape Research WSL. The portal is designed to provide solutions for efficient, unified and managed access to WSL's comprehensive reservoir of monitoring and research data, in accordance with the WSL data policy. Through EnviDat, WSL is fostering open science by making curated, quality-controlled, publication-ready research data accessible. Data producers can document author contributions for a particular data set through the EnviDat-DataCRediT taxonomy. The publication of research data sets can be complemented with additional digital resources, such as supplementary documentation, processing software or detailed descriptions of code (e.g. as Jupyter Notebooks). The EnviDat team is working towards generic solutions for enhancing open science, in line with WSL's commitment to accessible research data.


2019 ◽  
Vol 1 (3) ◽  
pp. 42-48
Author(s):  
Mohammed Z. Al-Faiz ◽  
Ali A. Ibrahim ◽  
Sarmad M. Hadi

The speed of learning in a neural network environment is considered the most important parameter, especially for large data sets. This paper tries to minimize the time required for the neural network to fully learn the data by standardizing the input data. The paper shows that Z-score standardization of the input data significantly decreased the number of epochs required for the network to learn. This paper also shows that a binary data set is a serious limitation for the convergence of a neural network, so standardization is a must in such cases, where zero-valued inputs effectively disable their connections in the neural network. The data set used in this paper consists of features extracted from gel electrophoresis images, which opens the door for using artificial intelligence in such areas.
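Z-score standardization itself is a one-liner per feature; a small sketch, including a guard for the constant (e.g. all-zero binary) columns the paper warns about (the epsilon handling is our assumption, not from the paper):

```python
import numpy as np

def z_score(X):
    """Standardize each feature (column) to zero mean and unit variance.

    Zero-variance columns (e.g. an all-zero binary feature) would cause a
    division by zero, so their scale is left at 1.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / np.where(sigma == 0, 1.0, sigma)

# First column has spread; second is constant zero, as in a sparse
# binary data set.
X = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
Z = z_score(X)
```

After this transform, every informative feature contributes on the same scale, which is what shortens training: gradient updates no longer have to compensate for features of wildly different magnitudes.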

