Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
David Ruano-Ordas ◽  
Jose R. Méndez

In recent years, big data analysis has become a popular means of taking advantage of multiple (initially valueless) sources to find relevant knowledge about real domains. However, a large number of big data sources provide textual unstructured data. A proper analysis requires tools able to adequately combine big data and text-analysing techniques. Keeping this in mind, we combined a pipelining framework (BDP4J (Big Data Pipelining For Java)) with the implementation of a set of text preprocessing techniques in order to create NLPA (Natural Language Preprocessing Architecture), an extendable open-source plugin implementing preprocessing steps that can be easily combined to create a pipeline. Additionally, NLPA incorporates the possibility of generating datasets using either a classical token-based representation of data or newer synset-based datasets that can be further processed using semantic information (i.e., using ontologies). This work presents a case study of NLPA operation covering the transformation of raw heterogeneous big data into different dataset representations (synsets and tokens) and using the Weka application programming interface (API) to launch two well-known classifiers.
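
NLPA itself is a Java plugin built on BDP4J, so its own classes are not shown here; the following Python fragment is only a minimal, language-agnostic sketch of the difference between the two dataset representations the abstract mentions, using NLTK's WordNet interface as an assumed stand-in for NLPA's ontology-based semantic processing (requires nltk.download('punkt') and nltk.download('wordnet')):

```python
# Minimal sketch contrasting token-based and synset-based text
# representations, in the spirit of NLPA's two dataset modes.
# NLTK's WordNet interface is an assumption standing in for NLPA's
# semantic backend, not NLPA's actual API.
from nltk import word_tokenize
from nltk.corpus import wordnet as wn

def token_features(text):
    """Classical token-based representation: a bag of lowercase tokens."""
    return [tok.lower() for tok in word_tokenize(text) if tok.isalpha()]

def synset_features(text):
    """Synset-based representation: map each token to its most frequent
    WordNet synset, keeping the raw token when no synset exists."""
    features = []
    for tok in token_features(text):
        synsets = wn.synsets(tok)
        features.append(synsets[0].name() if synsets else tok)
    return features

text = "The bank approved the loan"
print(token_features(text))   # ['the', 'bank', 'approved', 'the', 'loan']
print(synset_features(text))  # e.g. ['the', 'bank.n.01', 'approve.v.01', ...]
```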

Author(s):  
Ashwini T ◽  
Sahana LM ◽  
Mahalakshmi E ◽  
Shweta S Padti

Analysis of consistent and structured data has seen huge success in past decades, whereas the analysis of unstructured data in multimedia formats remains a challenging task. YouTube is one of the most used and popular social media tools. The main aim of this paper is to analyze the data generated from YouTube, which can be mined and utilized. Data is retrieved through YouTube's API (Application Programming Interface) and stored in the Hadoop Distributed File System (HDFS). The dataset can then be analyzed using MapReduce to identify the video categories in which the largest number of videos are uploaded. The paper also demonstrates the Hadoop framework and the many components it provides to process and handle big data. In the existing method, big data is analyzed and processed in multiple stages using MapReduce; because each job consumes a large amount of space, implementing iterative MapReduce jobs is expensive. To overcome these drawbacks, a state-of-the-art Hive method is used to analyze the big data. Hive works by extracting YouTube information through a generated API key and analyzing it with SQL-like queries.
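
As a hedged illustration of the Hive step described above, the following sketch assumes a reachable HiveServer2 instance, the PyHive client library, and a hypothetical youtube_videos table with a category column; all of these names are assumptions, not details from the paper:

```python
# Minimal sketch of the Hive analysis step: find the categories with
# the most uploaded videos. Assumes a HiveServer2 instance reachable
# via PyHive and a hypothetical table `youtube_videos` with a
# `category` column; host, table, and column names are illustrative.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()
cursor.execute("""
    SELECT category, COUNT(*) AS uploads
    FROM youtube_videos
    GROUP BY category
    ORDER BY uploads DESC
    LIMIT 10
""")
for category, uploads in cursor.fetchall():
    print(category, uploads)
```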


Author(s):  
Di Wu ◽  
Xiao-Yuan Jing ◽  
Haowen Chen ◽  
Xiaohui Kong ◽  
Jifeng Xuan

Application Programming Interface (API) tutorials are an important API learning resource. To help developers learn APIs, an API tutorial is often split into a number of consecutive units that describe the same topic (i.e., tutorial fragments). We regard a tutorial fragment explaining an API as a relevant fragment of that API. Automatically recommending relevant tutorial fragments can help developers learn how to use an API. However, existing approaches often employ supervised or unsupervised techniques to recommend relevant fragments, which suffer from heavy manual annotation effort or inaccurate recommendation results. Furthermore, these approaches only allow developers to input exact API names. In practice, developers often do not know which APIs to use, so they are more likely to describe API-related questions in natural language. In this paper, we propose a novel approach, called Tutorial Fragment Recommendation (TuFraRec), to effectively recommend relevant tutorial fragments for API-related natural language questions without much manual annotation effort. For an API tutorial, we split it into fragments and extract APIs from each fragment to build API-fragment pairs. Given a question, TuFraRec first generates several clarification APIs that are related to the question. We use the clarification APIs and API-fragment pairs to construct candidate API-fragment pairs. Then, we design a semi-supervised metric learning (SML)-based model to find relevant API-fragment pairs from the candidate list, which works well with a few labeled API-fragment pairs and a large number of unlabeled ones. In this way, the manual effort for labeling the relevance of API-fragment pairs is reduced. Finally, we sort and recommend relevant API-fragment pairs based on the recommendation strategy. We evaluate TuFraRec on 200 API-related natural language questions and two public tutorial datasets (Java and Android). The results demonstrate that, on average, TuFraRec improves NDCG@5 by 0.06 and 0.09 and Mean Reciprocal Rank (MRR) by 0.07 and 0.09 on the two tutorial datasets compared with the state-of-the-art approach.
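
The reported metrics are standard; as a reference point, here is a small Python implementation of NDCG@k and MRR computed from binary relevance judgments, a textbook sketch rather than code from the paper:

```python
# NDCG@k and MRR over ranked lists of 0/1 relevance judgments,
# the two metrics TuFraRec is evaluated with. Standard formulas,
# not code from the paper.
import math

def ndcg_at_k(relevance, k=5):
    """relevance: 0/1 judgments in ranked order for one question."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(rankings):
    """rankings: one relevance list per question; mean of 1/first-hit-rank."""
    total = 0.0
    for relevance in rankings:
        for i, rel in enumerate(relevance):
            if rel:
                total += 1.0 / (i + 1)
                break
    return total / len(rankings)

print(ndcg_at_k([0, 1, 1, 0, 0]))   # relevant fragments ranked 2nd and 3rd
print(mrr([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
```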


2020 ◽  
pp. 004912412092621
Author(s):  
C. Ben Gibson ◽  
Jeannette Sutton ◽  
Sarah K. Vos ◽  
Carter T. Butts

Microblogging sites have become important data sources for studying network dynamics and information transmission. Both areas of study, however, require accurate counts of indegree, or follower counts; unfortunately, collection of complete time series on follower counts can be limited by application programming interface constraints, system failures, or temporal constraints. In addition, there is almost always a time difference between the point at which follower counts are queried and the time a user posts a tweet. Here, we consider the use of three classes of simple, easily implemented methods for follower imputation: polynomial functions, splines, and generalized linear models. We evaluate the performance of each method via a case study of accounts from 236 health organizations during the 2014 Ebola outbreak. For accurate interpolation and extrapolation, we find that negative binomial regression, modeled separately for each account, using time as an interval variable, accurately recovers missing values while retaining narrow prediction intervals.
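
As a minimal sketch of the best-performing method, the following Python fragment fits a per-account negative binomial regression of follower counts on time with statsmodels and predicts values at unobserved times; the data points are synthetic placeholders, and the default dispersion parameter is an assumption:

```python
# Per-account negative binomial regression of follower counts on time,
# used for both interpolation and extrapolation. Fit with statsmodels;
# the observation times and counts below are synthetic placeholders.
import numpy as np
import statsmodels.api as sm

days = np.array([0, 7, 14, 28, 42, 56])                # observation times
followers = np.array([1200, 1260, 1310, 1400, 1485, 1590])

X = sm.add_constant(days)                              # intercept + time
model = sm.GLM(followers, X,
               family=sm.families.NegativeBinomial()).fit()

# Impute follower counts at missing query times (21, 35) and
# extrapolate beyond the observed window (70).
new_days = sm.add_constant(np.array([21, 35, 70]))
print(model.predict(new_days))
```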


Author(s):  
Manraj Singh Bains ◽  
Shriniwas S. Arkatkar ◽  
K. S. Anbumani ◽  
Siva Subramaniam

This study aimed to develop a microsimulation model for optimizing toll plaza operations in relation to operational cost and level of service for users. A well-calibrated and validated simulation model was developed in PTV Vissim, and several scenarios were simulated to test their efficacy at improving toll plaza operations. Data collected included classified entry traffic volume at the toll plaza, service time for different payment categories, percentage of lane utilization, and travel time while crossing the toll plaza. For modeling lane selection for vehicles, the PTV Vissim component object model application programming interface, which enables dynamic route choice, was used. The results showed that the simulation model accurately represented current operations at the toll plaza. Scenarios such as implementing number plate recognition technology and segregating lanes for different vehicle types to improve the level of service were evaluated with the simulation model.
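
A minimal sketch of driving PTV Vissim through its COM application programming interface from Python is shown below; it assumes a Windows machine with Vissim and pywin32 installed, and the file path, attribute names, and measurement collection follow PTV's published COM examples but should be verified against the installed Vissim version:

```python
# Sketch of steering PTV Vissim via its COM application programming
# interface. Requires Windows, a licensed Vissim installation, and
# pywin32; the network file path is a placeholder, and attribute and
# collection names should be checked against the installed version.
import win32com.client

vissim = win32com.client.Dispatch("Vissim.Vissim")
vissim.LoadNet(r"C:\models\toll_plaza.inpx")       # hypothetical network

# Run one simulation of the toll plaza scenario (one hour, in seconds).
vissim.Simulation.SetAttValue("SimPeriod", 3600)
vissim.Simulation.RunContinuous()

# Read back the travel-time measurements defined across the plaza.
for tt in vissim.Net.VehicleTravelTimeMeasurements:
    print(tt.AttValue("No"))
```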


Author(s):  
Ichiro Kobayashi ◽  
◽  
Toru Sugimoto ◽  
Shino Iwashita ◽  
Michiaki Iwazume ◽  
...  

We propose a computer communication protocol based on natural language, called a "language protocol"; communication using that protocol; and an interface enabling connection to any communication standard, called a "language application programming interface". We use simulation to confirm that the proposed methods provide a flexible communication environment for any communication object.


2015 ◽  
Vol 6 (2) ◽  
Author(s):  
Stan Ruecker ◽  
Peter Hodges ◽  
Nayaab Lokhadwala ◽  
Szu-Ying Ching ◽  
Jennifer Windsor ◽  
...  

An Application Programming Interface (API) can serve as a mechanism for separating interface concerns on the one hand from data and processing on the other, allowing for easier implementation of alternative human-computer interfaces. The API can also be used as a sounding board for ideas about what an interface should and should not accomplish. Our discussion will take as its case study our recent work in designing experimental interfaces for the visual construction of Boolean queries, for a project we have previously called the Mandala Browser.
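
As a purely hypothetical sketch of the separation the API affords, the fragment below has the interface build a Boolean query as a data structure and hand it to processing code that knows nothing about how the query was assembled; all names are invented for illustration and are not the Mandala Browser's API:

```python
# Hypothetical sketch of separating a visual Boolean-query interface
# from data and processing via an API: the interface assembles a query
# tree, and the backend evaluates it against document metadata without
# knowing how it was built. All names are invented for illustration.
from dataclasses import dataclass

@dataclass
class Term:
    field: str
    value: str
    def matches(self, doc):
        return doc.get(self.field) == self.value

@dataclass
class And:
    left: object
    right: object
    def matches(self, doc):
        return self.left.matches(doc) and self.right.matches(doc)

@dataclass
class Or:
    left: object
    right: object
    def matches(self, doc):
        return self.left.matches(doc) or self.right.matches(doc)

# A query a visual interface might assemble: speaker is Hamlet OR Ophelia.
query = Or(Term("speaker", "Hamlet"), Term("speaker", "Ophelia"))
docs = [{"speaker": "Hamlet"}, {"speaker": "Gertrude"}]
print([d for d in docs if query.matches(d)])  # [{'speaker': 'Hamlet'}]
```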


Author(s):  
John Anderson Gómez Múnera ◽  
Alejandro Giraldo Quintero

The considerable computational demands of optimal control problems often exceed the computing capacity available to handle complex systems in real time. For this reason, this article studies alternatives such as parallel computing, in which the problem is solved by distributing tasks among several processors to accelerate computation, and investigates how the total calculation time decreases as the number of processors is gradually increased. We explore these methods with a case study of a rolling mill process, using the strategy of updating the final-phase values to construct the final penalty matrix for the solution of the differential Riccati equation. In addition, the order of the problem is increased gradually to compare the improvements achieved in models of larger dimension. Parallel computing alternatives are studied through multiple processing elements within a single machine or in a cluster via OpenMP, an application programming interface (API) that allows the creation of shared-memory programs.
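
OpenMP itself targets C, C++, and Fortran; as a rough Python analogue of the article's strategy of distributing independent solves across processors, the sketch below parallelizes Riccati solves of gradually increasing order with multiprocessing, with SciPy's algebraic Riccati solver standing in (as an assumption) for the differential Riccati equation integration:

```python
# Rough analogue of the parallel strategy: distribute Riccati solves
# of increasing order across processors and compare wall-clock time
# against a serial run. multiprocessing stands in for OpenMP, and
# SciPy's algebraic Riccati solver for the differential equation.
import time
import numpy as np
from multiprocessing import Pool
from scipy.linalg import solve_continuous_are

def solve_riccati(n):
    """Solve one continuous-time Riccati equation of order n."""
    rng = np.random.default_rng(n)
    # Shift the random system to be stable so the equation is solvable.
    A = rng.standard_normal((n, n)) / np.sqrt(n) - 2.0 * np.eye(n)
    B = rng.standard_normal((n, 1))
    Q = np.eye(n)                      # state penalty
    R = np.eye(1)                      # control penalty
    return solve_continuous_are(A, B, Q, R)

if __name__ == "__main__":
    orders = [50, 100, 150, 200]       # gradually increasing problem order

    start = time.perf_counter()
    [solve_riccati(n) for n in orders]
    print("serial  :", time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(4) as pool:
        pool.map(solve_riccati, orders)
    print("parallel:", time.perf_counter() - start)
```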


DYNA ◽  
2018 ◽  
Vol 85 (205) ◽  
pp. 363-370
Author(s):  
Nelson Ivan Herrera-Herrera ◽  
Sergio Luján-Mora ◽  
Estevan Ricardo Gómez-Torres

This study presents an analysis of the use and integration of technological tools that support decision-making in situations of traffic congestion. The city of Quito, Ecuador, is taken as the case study for the work. The research is presented through the development of an application using Big Data tools (Apache Flume, Apache Hadoop, Apache Pig) that allow the processing of the large amount of information that must be collected, stored, and processed. One of the innovative aspects of the application is the use of the social network Twitter as a data source. For this purpose, its Application Programming Interface (API) was used, which allows data to be collected from this social network in real time and probable congestion points to be identified. This study presents the results of tests performed with the application over a period of 9 months.
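
As an illustrative sketch of the real-time collection step, the fragment below assumes Tweepy v4's StreamingClient for Twitter's filtered-stream API; the bearer token and the congestion-related keywords are placeholders, not the study's actual configuration:

```python
# Illustrative sketch of real-time tweet collection for congestion
# detection, assuming Tweepy v4's StreamingClient. The bearer token
# and keyword rule are placeholders; the study stored matching
# tweets for later processing with Flume, Hadoop, and Pig.
import tweepy

class CongestionListener(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # Each matching tweet is a candidate signal of congestion.
        print(tweet.text)

stream = CongestionListener("YOUR_BEARER_TOKEN")
stream.add_rules(tweepy.StreamRule("trafico Quito OR congestion Quito"))
stream.filter()
```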


2021 ◽  
Vol 5 (2) ◽  
pp. 304-313
Author(s):  
Tigor Nirman Simanjuntak ◽  
Setia Pramana

This study analyzes the trend of sentiment in tweets about Covid-19 in Indonesia posted by overseas Twitter accounts, from a big data perspective. The data was obtained from Twitter in April 2020 with the word query "Indonesian Corona Virus", restricted to foreign user accounts posting in English. Tweets were retrieved by crawling the text through Twitter's API (Application Programming Interface) using the Python programming language. Twitter was chosen because information spreads quickly and easily through status updates from and among user accounts. The number of tweets obtained was 8,740 in text format, with a total engagement of 217,316. The data was sorted from largest to smallest engagement, then cleaned of unnecessary fonts and symbols as well as typos and abbreviations. Sentiment classification into positive, negative, and neutral polarity was carried out with analytical tools, extracting information through text mining. To sharpen the analysis, only the cleaned tweets with at least 100 engagements were selected and then grouped into 30 sub-topics for analysis. Interestingly, most tweets and sub-topics were dominated by negative sentiment, and some unexpected sub-topics were discussed by many users.
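
The paper does not name its analytical tool, so as an illustrative stand-in the sketch below scores polarity with TextBlob and maps it to the three classes; the thresholds are a common convention rather than the study's:

```python
# Illustrative sentiment-classification step: TextBlob's polarity
# score mapped to positive/negative/neutral. TextBlob is a stand-in
# for the paper's unnamed tool, and the thresholds are a common
# convention, not the study's.
from textblob import TextBlob

def classify(text):
    polarity = TextBlob(text).sentiment.polarity  # in [-1, 1]
    if polarity > 0.05:
        return "positive"
    if polarity < -0.05:
        return "negative"
    return "neutral"

tweets = [
    "Indonesia handled the corona virus outbreak well",
    "Worried about the corona virus situation in Indonesia",
]
for t in tweets:
    print(classify(t), "|", t)
```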

