Implementasi Sentimen Analysis Pengolahan Kata Berbasis Algoritma Map Reduce Menggunakan Hadoop (Implementation of Word-Processing Sentiment Analysis Based on the MapReduce Algorithm Using Hadoop)

2018 ◽  
Vol 4 (1) ◽  
pp. 11-16
Author(s):  
Fawaid Badri

Sentiment analysis is a field of text- and information-based research. The text documents in this study come from the web and concern socialization issues. The method used in this study applies the MapReduce algorithm to count word occurrences, which are then used to infer meaning in the context of public opinion. MapReduce takes an input data set and converts it into another data set in which individual items are broken into (key, value) tuples. In the first stage, the map phase reads input text stored in HDFS (the Hadoop Distributed File System) and transforms it into key-value tuples. The next step is the shuffle and reduce phase, which groups the tuples by key and aggregates them into the final processed data set. Applying sentiment analysis with MapReduce in this way, the study obtains very good results on large amounts of data.
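As a rough illustration of the stages described above, the word-counting core of the approach can be sketched in plain Python (a local stand-in for Hadoop's distributed map, shuffle, and reduce phases; the function names are illustrative, not from the paper):

```python
from itertools import groupby

def map_phase(document):
    """Map step: emit a (word, 1) tuple for every token in the input text."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(tuples):
    """Shuffle step: sort the tuples and group them by key (the word)."""
    ordered = sorted(tuples, key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(ordered, key=lambda kv: kv[0])}

def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["good service good price", "bad service"]
tuples = [kv for d in docs for kv in map_phase(d)]
counts = reduce_phase(shuffle_phase(tuples))
# counts["good"] == 2, counts["service"] == 2, counts["bad"] == 1
```

In Hadoop, the map and reduce functions run on separate nodes and the framework performs the shuffle over the network; the data flow, however, is the same.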

2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a neural-network-based text sentiment classifier. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
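The first step of the method, representing a document as a linearly weighted average of pre-trained word vectors, can be sketched as follows (the tiny vector table and the uniform-weight default are illustrative assumptions; real systems load word2vec- or GloVe-style vectors):

```python
import numpy as np

# Hypothetical 2-dimensional pre-trained word vectors for demonstration.
word_vectors = {
    "good": np.array([0.9, 0.1]),
    "bad":  np.array([0.1, 0.9]),
    "film": np.array([0.5, 0.5]),
}

def document_vector(tokens, weights=None):
    """Represent a document as a weighted average of its word vectors.

    Tokens without a vector are skipped; with no weights given, a plain
    (uniform) average is used.
    """
    vecs, w = [], []
    for i, tok in enumerate(tokens):
        if tok in word_vectors:
            vecs.append(word_vectors[tok])
            w.append(weights[i] if weights else 1.0)
    if not vecs:
        return np.zeros(2)
    return np.average(vecs, axis=0, weights=w)

doc = document_vector(["good", "film"])
# doc is the mean of the "good" and "film" vectors: [0.7, 0.3]
```

The resulting document vectors would then be fed to a neural-network classifier, whose training signal propagates sentiment polarity back into the representation.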


Author(s):  
Liah Shonhe

The main focus of the study was to explore the practices of open data sharing in the agricultural sector, including establishing the research outputs concerning open data in agriculture. The study adopted a desktop research methodology based on a literature review and bibliographic data from the Web of Science (WoS) database. Bibliometric indicators discussed include yearly productivity, the most prolific authors, and the leading contributing countries. Study findings revealed that research activity in the field of agriculture and open access is very low. There were 36 OA articles, and only 6 publications had an open data badge. Most researchers do not yet embrace the need to openly publish their data sets despite the availability of numerous open data repositories. Unfortunately, most African countries are still lagging behind in the management of agricultural open data. The study therefore recommends that researchers publish their research data sets as OA. African countries need to put more effort into establishing open data repositories and implementing the necessary policies to facilitate OA.


2016 ◽  
pp. 1220-1243
Author(s):  
Ilias K. Savvas ◽  
Georgia N. Sofianidou ◽  
M-Tahar Kechadi

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets: HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique prove its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
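One K-means iteration maps naturally onto MapReduce: the map step assigns each point to its nearest centroid, and the reduce step averages the points per centroid. A minimal local sketch of that decomposition (not the authors' implementation; names and data are illustrative):

```python
import numpy as np

def kmeans_map(points, centroids):
    """Map step: emit (centroid_index, (point, 1)) for each point's
    nearest centroid."""
    out = []
    for p in points:
        idx = int(np.argmin([np.linalg.norm(p - c) for c in centroids]))
        out.append((idx, (p, 1)))
    return out

def kmeans_reduce(mapped, k):
    """Reduce step: average the points assigned to each centroid to get
    the updated centroids."""
    dim = len(mapped[0][1][0])
    sums, counts = [np.zeros(dim) for _ in range(k)], [0] * k
    for idx, (p, n) in mapped:
        sums[idx] = sums[idx] + p
        counts[idx] += n
    return [s / c if c else s for s, c in zip(sums, counts)]

points = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
centroids = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
new_centroids = kmeans_reduce(kmeans_map(points, centroids), k=2)
# new_centroids[0] ≈ [0.05, 0.0], new_centroids[1] ≈ [5.0, 5.0]
```

In the distributed setting, each mapper holds a partition of the data in HDFS, and the driver repeats the map/reduce round until the centroids converge.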


2020 ◽  
Vol 16 (2) ◽  
pp. 8-22
Author(s):  
Tirath Prasad Sahu ◽  
Sarang Khandekar

Sentiment analysis can be a very useful tool for extracting information from text documents. Its central question is what people think about a particular online review subject, i.e. product reviews, movie reviews, etc. Sentiment analysis is the process by which these reviews are classified as positive or negative. The web is enriched with a huge number of reviews that can be analyzed to make them meaningful. This article presents the use of lexicon resources for sentiment analysis of different publicly available reviews. First, the polarity shift caused by negations is handled; intensifiers, punctuation and acronyms are also taken into consideration during the processing phase. Second, opinion-bearing words are extracted and used to compute a score. Third, machine learning algorithms are applied, and the experimental results show that the proposed model is effective in identifying the sentiments of reviews and opinions.
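The lexicon-with-negation idea can be sketched in a few lines of Python (the mini-lexicon, negation list, and intensifier weights below are made-up placeholders; real systems draw on resources such as SentiWordNet):

```python
# Hypothetical mini-lexicon mapping opinion words to polarity scores.
LEXICON = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.5}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}

def score_review(text):
    """Sum lexicon scores over the review, flipping polarity after a
    negation word and scaling after an intensifier."""
    score, flip, scale = 0.0, 1.0, 1.0
    for word in text.lower().split():
        if word in NEGATIONS:
            flip = -1.0
        elif word in INTENSIFIERS:
            scale = INTENSIFIERS[word]
        elif word in LEXICON:
            score += flip * scale * LEXICON[word]
            flip, scale = 1.0, 1.0  # modifiers apply to the next opinion word only
    return score

# score_review("not good") == -1.0; score_review("very good") == 1.5
```

A review with a positive total score would be labeled positive; these scores (or the extracted features) can then feed the machine learning classifiers mentioned above.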


2015 ◽  
Vol 2 (3) ◽  
pp. 170
Author(s):  
Ade Jamal ◽  
Denny Hermawan ◽  
Muhammad Nugraha

<p><em>Abstract</em> <strong>- A study of distributed processing of GenBank data using the Hadoop Distributed File System (HDFS), aimed at assessing the effectiveness of data processing, particularly sequence search over large input data. The research was conducted at the Network Laboratory of Universitas Al Azhar Indonesia using 6 computers and one server, on which Hadoop was configured as 7 nodes (1 namenode, 7 datanodes, and 1 secondary namenode). HDFS experiments using 1, 2, 4, 6, and 7 nodes were compared with the local filesystem (LFS). In the first experiment scenario, GenBank sequence search on 1-7 nodes produced output displaying 3 fields (Locus, Definition, and Authors); the second scenario displayed 3 fields (Locus, Authors, and Origin); and the third scenario, using both HDFS and LFS, displayed all fields contained in a GenBank record (Locus, Definition, Accession, Version, Keywords, Source, Organism, Reference, Authors, Title, Journal, Pubmed, Comment, Features, and Origin). Evaluation shows that GenBank sequence search using HDFS with 7 nodes is 4 times faster than with 1 node, while HDFS with 1 node is 1.02 times faster than the local filesystem on a 4-core processor.</strong></p><p><strong><em>Keywords</em></strong><em> – GenBank, sequence, distributed computing, Hadoop, HDFS</em></p>
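To illustrate the kind of field extraction these experiments perform, here is a minimal Python sketch that pulls top-level fields out of a simplified flat GenBank-style record (an illustrative assumption: the real format nests AUTHORS under REFERENCE and has stricter column rules, and on Hadoop this function would run inside a mapper over record splits):

```python
def extract_fields(record_text, wanted=("LOCUS", "DEFINITION", "AUTHORS")):
    """Extract selected top-level fields from a flat GenBank-style record.

    A line whose first 12 columns contain a keyword starts a new field;
    indented continuation lines are appended to the current field.
    """
    fields, current = {}, None
    for line in record_text.splitlines():
        key = line[:12].strip()
        if key:
            current = key
            fields.setdefault(current, "")
            fields[current] += line[12:].strip()
        elif current:
            fields[current] += " " + line.strip()
    return {k: v.strip() for k, v in fields.items() if k in wanted}

# A toy record in the simplified layout described above.
record = """LOCUS       AB000001     1200 bp
DEFINITION  Example sequence.
AUTHORS     Doe,J."""

fields = extract_fields(record)
# fields["DEFINITION"] == "Example sequence."
```

Restricting `wanted` to three fields versus all fifteen mirrors the difference between the first two experiment scenarios and the third.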


2020 ◽  
Author(s):  
Banafsheh Abdollahi ◽  
Rolf Hut ◽  
Nick van de Giesen

Irrigation is crucial for sustaining food security for the growing population around the world. Irrigation affects the hydrological cycle both directly, during the process of water abstraction and irrigation, and indirectly, because of infrastructure that has been built in support of irrigation, such as canals, dams, reservoirs, and drainage systems. For evaluating the availability of freshwater resources in the light of growing food demand, modeling the global hydrological cycle is vital. The GlobWat model is one of the models designed for large-scale hydrological modeling, with a specific focus on water use by irrigated agriculture. Both the model's underlying assumptions and the global input data sets used to feed it can be sources of uncertainty in its output. One of the most challenging inputs to global hydrological models is the climate data set. Several climate forcings are available at the global scale, such as ERA5 and ERA-Interim. In this study, we assess the sensitivity of the GlobWat model to these climate forcings. Pre-processing climate data at a large scale used to be difficult; recently, this has become much easier thanks to the data and scripts provided by the eWaterCycle team at the eScience Center in Amsterdam, The Netherlands. We will use eWaterCycle's freely available data sources for our assessment and then compare the model results with observed data at a local scale.


2017 ◽  
Vol 10 (13) ◽  
pp. 355 ◽  
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analyzing high-dimensional data sets. A raw input data set may have many dimensions, and analysis may waste time and produce wrong predictions if unnecessary data attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data for more accurate prediction at less cost. In this paper, the different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis and artificial neural networks, have been studied.
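As a concrete instance of the first two techniques listed, PCA can be computed from the SVD of the mean-centered data matrix; a minimal NumPy sketch (not from the paper):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components.

    The principal directions are the right singular vectors of the
    mean-centered data matrix.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy data whose second attribute is redundant (an exact multiple of the
# first), so one component captures all the variance.
X = np.array([[2.0, 0.1], [4.0, 0.2], [6.0, 0.3], [8.0, 0.4]])
Z = pca_reduce(X, 1)  # 4 samples reduced from 2 dimensions to 1
```

Dropping the redundant attribute here loses nothing, which is exactly the case the abstract describes: unnecessary attributes add cost without adding information.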


2021 ◽  
Vol 30 (1) ◽  
pp. 479-486
Author(s):  
Lingrui Bu ◽  
Hui Zhang ◽  
Haiyan Xing ◽  
Lijun Wu

Abstract The efficient processing of large-scale data has very important practical value. In this study, a data mining platform based on the Hadoop Distributed File System was designed, and the K-means algorithm was improved with the max-min distance idea. On the Hadoop platform, parallelization was realized with MapReduce. Finally, the data processing effect of the algorithm was analyzed on the Iris data set. The results showed that the parallel algorithm assigned more samples correctly than the traditional algorithm; in a single-machine environment, the parallel algorithm ran longer; when faced with large data sets, the traditional algorithm ran out of memory, while the parallel algorithm completed the calculation task; and the speedup of the parallel algorithm rose as cluster size and data set size grew, showing a good parallel effect. The experimental results verify the reliability of the parallel algorithm for big data processing, contributing to further improving the efficiency of data mining.
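The max-min distance idea replaces K-means' random initialization: after an arbitrary first center, each new center is the point farthest from its nearest already-chosen center, spreading the initial centers out. A small sketch of that initialization (illustrative, not the authors' code):

```python
import numpy as np

def max_min_init(X, k):
    """Max-min distance initialization for K-means.

    Start from the first point, then repeatedly pick the point whose
    distance to its nearest chosen center is largest.
    """
    centers = [X[0]]
    while len(centers) < k:
        dists = np.array([min(np.linalg.norm(x - c) for c in centers)
                          for x in X])
        centers.append(X[int(np.argmax(dists))])
    return np.array(centers)

# Two well-separated clumps: the second center lands in the far clump.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = max_min_init(X, 2)
```

With centers seeded this way, the MapReduce iterations described above start closer to the true cluster structure, which is what improves the assignment accuracy reported in the results.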


2018 ◽  
Author(s):  
Ionut Iosifescu Enescu ◽  
Marielle Fraefel ◽  
Gian-Kasper Plattner ◽  
Lucia Espona-Pernas ◽  
Dominik Haas-Artho ◽  
...  

EnviDat is the institutional research data portal of the Swiss Federal Institute for Forest, Snow and Landscape Research WSL. The portal is designed to provide solutions for efficient, unified and managed access to WSL's comprehensive reservoir of monitoring and research data, in accordance with the WSL data policy. Through EnviDat, WSL is fostering open science by making curated, quality-controlled, publication-ready research data accessible. Data producers can document author contributions for a particular data set through the EnviDat-DataCRediT taxonomy. The publication of research data sets can be complemented with additional digital resources, such as supplementary documentation, processing software or detailed descriptions of code (e.g. as Jupyter Notebooks). The EnviDat team is working towards generic solutions for enhancing open science, in line with WSL's commitment to accessible research data.


2019 ◽  
Vol 1 (3) ◽  
pp. 42-48
Author(s):  
Mohammed Z. Al-Faiz ◽  
Ali A. Ibrahim ◽  
Sarmad M. Hadi

The speed of learning in a neural network environment is considered the most important parameter, especially for large data sets. This paper tries to minimize the time required for the neural network to fully learn the data by standardizing the input data. The paper shows that Z-score standardization of the input data significantly decreased the number of epochs required for the network to learn. This paper also shows that a binary data set is a serious limitation for the convergence of a neural network, so standardization is a must in such cases, where zero-valued inputs effectively disable their connections in the neural network. The data set used in this paper consists of features extracted from gel electrophoresis images, which opens the door for using artificial intelligence in such areas.
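Z-score standardization itself is a one-liner per feature; a small sketch, including a guard for the constant (e.g. all-zero binary) columns the paper warns about (the epsilon handling is our assumption, not from the paper):

```python
import numpy as np

def z_score(X):
    """Standardize each feature (column) to zero mean and unit variance.

    Zero-variance columns (e.g. an all-zero binary feature) would cause a
    division by zero, so their scale is left at 1.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / np.where(sigma == 0, 1.0, sigma)

# First column has spread; second is constant zero, as in a sparse
# binary data set.
X = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
Z = z_score(X)
```

After this transform, every informative feature contributes on the same scale, which is what shortens training: gradient updates no longer have to compensate for features of wildly different magnitudes.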

