pseudo data
Recently Published Documents

TOTAL DOCUMENTS: 81 (FIVE YEARS: 25)
H-INDEX: 17 (FIVE YEARS: 2)

Author(s):  
Mieradilijiang Maimaiti ◽  
Yang Liu ◽  
Huanbo Luan ◽  
Zegao Pan ◽  
Maosong Sun

Data augmentation is a widely used approach for many text generation tasks, and numerous augmentation methods have been proposed for machine translation, particularly in low-resource language scenarios. The most common approaches to generating pseudo data rely on word omission, random sampling, or replacing some words in the text; however, these methods barely guarantee the quality of the augmented data. In this work, we build augmented data using paraphrase embeddings and POS tagging. Specifically, we generate a pseudo monolingual corpus by replacing words tagged with the four main POS labels (noun, adjective, adverb, and verb), based on both a paraphrase table and embedding similarity. We select the larger word-level paraphrase table, obtain the word embedding of each word in the table, and calculate the cosine similarity between these words and the tagged words in the original sequence. In addition, we use a ranking algorithm to choose highly similar words, which reduces semantic errors, and restrict replacements to words with matching POS tags, which mitigates syntactic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous SOTA methods on seven low-resource language pairs from four corpora, by 1.16 to 2.39 BLEU points.
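
The replacement step can be illustrated with a minimal sketch. This is not the authors' code: `embed` (a word-to-vector lookup) and `paraphrase_table` (a word-level paraphrase dictionary) are hypothetical stand-ins for the pre-trained embeddings and paraphrase table described above, and the NLTK tagger is an assumed substitute for whatever tagger the paper used.

```python
# Minimal sketch of POS-constrained paraphrase replacement (not the authors' code).
# Requires: nltk.download("averaged_perceptron_tagger").
import numpy as np
import nltk

# Only content-word tags (nouns, adjectives, adverbs, verbs) are replaced.
REPLACEABLE = {"NN", "NNS", "JJ", "RB", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def augment_sentence(tokens, paraphrase_table, embed):
    """Replace content words with their most similar word-level paraphrase."""
    out = []
    for word, tag in nltk.pos_tag(tokens):
        candidates = paraphrase_table.get(word, []) if tag in REPLACEABLE else []
        # Rank paraphrase candidates by cosine similarity to the original word.
        ranked = sorted(
            (c for c in candidates if c in embed and word in embed),
            key=lambda c: cosine(embed[word], embed[c]),
            reverse=True,
        )
        out.append(ranked[0] if ranked else word)  # keep word if no good paraphrase
    return out
```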


2021 ◽  
Vol 16 (11) ◽  
pp. T11008
Author(s):  
M.J. Lee ◽  
B.R. Ko ◽  
S. Ahn

Abstract A real-time data acquisition (DAQ) system for the CULTASK axion haloscope experiment was constructed and tested. CULTASK is an experiment that searches for cosmic axions using resonant cavities, detecting photons produced by axion conversion through the inverse Primakoff effect in the few-GHz frequency range, in a very high magnetic field and at ultra-low temperature. The DAQ system uses a field-programmable gate array (FPGA) for data processing and fast Fourier transforms. This design, together with a custom Ethernet packet designed for real-time data transfer, enables 100% DAQ efficiency, the key advantage over a commercial spectrum analyzer. The system is optimized for RF signal detection in the axion experiment, with 100 Hz frequency resolution and a 500 kHz analysis window. The noise level of the DAQ system, averaged over 100,000 measurements, is around -111.7 dBm. A pseudo-data analysis confirmed the improvement in signal-to-noise ratio obtained by repeating and averaging measurements with this real-time DAQ system.
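
The claimed SNR gain from averaging follows the standard radiometer scaling: averaging N independent spectra leaves a coherent signal untouched while reducing noise fluctuations by roughly √N. Below is a toy numpy illustration of this scaling, not the CULTASK analysis code; the bin counts are chosen only to mirror the quoted 100 Hz resolution over a 500 kHz window.

```python
# Toy illustration of SNR improvement by spectrum averaging (not CULTASK code).
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_spectra = 5000, 10000      # 100 Hz bins spanning a 500 kHz window
signal = np.zeros(n_bins)
signal[2500] = 0.05                  # weak axion-like line in a single bin

# Average many noisy spectra: the line persists, noise shrinks as 1/sqrt(N).
avg = np.zeros(n_bins)
for _ in range(n_spectra):
    avg += signal + rng.normal(0.0, 1.0, n_bins)
avg /= n_spectra

print(f"noise std after averaging: {avg.std():.4f} "
      f"(expected ~ {1 / np.sqrt(n_spectra):.4f}); line bin: {avg[2500]:.4f}")
```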


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Juan Zhao

To effectively optimize a machine online-translation system and improve its translation efficiency and quality, this study uses a depthwise separable convolutional neural network to construct an online machine translation model and evaluates translation quality on the basis of pseudo-data learning. To verify the performance of the model, four experiments were designed: a regression performance experiment, a performance experiment for the method of generating pseudo data for specific tasks, a ranking (sorting) task experiment, and a machine translation quality comparison. RMSE and MAE were used to evaluate the regression performance of the model, and the Spearman rank correlation coefficient and DeltaAvg were used to evaluate its ranking performance. The experimental results show that, under the same conditions, the model's MAE and RMSE are 2.28% and 1.39% lower than the baseline system's, while its Spearman and DeltaAvg values are 132% and 100.7% higher. The method of generating pseudo data for specific tasks needs less data and brings the translation system to a good level faster. When the number of instances exceeds 10, the quality score of the model's output is higher than that of Google Translate, with a similarity above 0.8.
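
RMSE, MAE, and the Spearman rank correlation coefficient are standard metrics, so their computation can be sketched directly; the scores below are hypothetical, and DeltaAvg is omitted for brevity.

```python
# Standard regression and ranking metrics used above (hypothetical scores).
import numpy as np
from scipy.stats import spearmanr

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

gold = np.array([0.82, 0.40, 0.95, 0.61, 0.33])  # gold quality scores
pred = np.array([0.78, 0.45, 0.90, 0.55, 0.41])  # model predictions

print("RMSE:", rmse(gold, pred))
print("MAE:", mae(gold, pred))
print("Spearman:", spearmanr(gold, pred)[0])     # rank correlation (sorting task)
```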


2021 ◽  
Vol 18 (5) ◽  
pp. 700-711
Author(s):  
Jun Wang ◽  
Junxing Cao ◽  
Jiachun You ◽  
Ming Cheng ◽  
Peng Zhou

Abstract Well logging helps geologists find hidden oil, natural gas and other resources. However, well log data are systematically insufficient because they can only be obtained by drilling, which involves costly and time-consuming field trials. Additionally, missing or distorted well log data are common in old oilfields owing to shutdowns, poor borehole conditions, damaged instruments and so on. As a workaround, pseudo-data can be generated from actual field data. In this study, we propose a spatio-temporal neural network (STNN) algorithm, which is built by leveraging the combined strengths of a convolutional neural network (CNN) and a long short-term memory network (LSTM). The STNN exploits the ability of the CNN to effectively extract features related to pseudo-well log data and the ability of the LSTM to extract the key features from well log data along the depth direction. The STNN method allows full consideration of the well log data trend with depth, the correlation across different log series and the actual depth accumulation effect. The method proved successful in predicting acoustic sonic log data from gamma-ray, density, compensated neutron, formation resistivity and borehole diameter logs. Results show that the proposed method achieves higher prediction accuracy because it takes into account the spatio-temporal information of well logs.
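
The paper's architecture details are not reproduced here; the following is a minimal PyTorch sketch of the general CNN-then-LSTM pattern the abstract describes, with all layer sizes being illustrative assumptions.

```python
# Illustrative CNN+LSTM ("STNN-style") network: Conv1d extracts local features
# across the input log curves, and an LSTM models their trend along depth.
# All layer sizes are assumptions, not the published architecture.
import torch
import torch.nn as nn

class STNNSketch(nn.Module):
    def __init__(self, n_logs=5, hidden=64):
        super().__init__()
        # Input: (batch, n_logs, depth) -- e.g. gamma-ray, density,
        # compensated neutron, resistivity, borehole diameter.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_logs, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)    # predicted sonic value per depth step

    def forward(self, x):
        feats = self.cnn(x)                 # (batch, 32, depth)
        feats = feats.transpose(1, 2)       # (batch, depth, 32) for the LSTM
        out, _ = self.lstm(feats)
        return self.head(out).squeeze(-1)   # (batch, depth)

model = STNNSketch()
dummy = torch.randn(8, 5, 256)              # 8 wells, 5 input logs, 256 depth samples
print(model(dummy).shape)                   # torch.Size([8, 256])
```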


2021 ◽  
Vol 11 (11) ◽  
pp. 5241
Author(s):  
Samuel Ruiz-Arrebola ◽  
Damián Guirado ◽  
Mercedes Villalobos ◽  
Antonio M. Lallena

Purpose: To analyze the capabilities of different classical mathematical models to describe the growth of multicellular spheroids simulated with an on-lattice agent-based Monte Carlo model that has already been validated. Methods: The exponential, Gompertz, logistic, potential, and Bertalanffy models were fitted, in different situations, to volume data generated with a Monte Carlo agent-based model that simulates spheroid growth. Two samples of pseudo-data, obtained by assuming different variability in the simulation parameters, were considered. The mathematical models were fitted to the whole growth curves and also to parts of them, permitting analysis of the predictive power (both prospective and retrospective) of the models. Results: Using the data obtained with larger variability of the simulation parameters widens the χ2 distributions obtained in the fits. The Gompertz model provided the best fits to the whole growth curves, yielding an average χ2 per degree of freedom of 3.2, an order of magnitude smaller than the values found for the other models. The Gompertz and Bertalanffy models gave similar retrospective prediction capability. Regarding prospective prediction power, the Gompertz model showed by far the best performance. Conclusions: The classical mathematical models analyzed show poor capability to predict multicellular tumor spheroid (MTS) growth data not used to fit them. Within these poor results, the Gompertz model proves to be the one that best describes the simulated growth data. Simulating the growth of tumors or multicellular spheroids allows follow-up periods longer than in the usual experimental studies, with a much larger number of samples, which made the type of analysis presented here possible.
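
As an illustration of this kind of fit (not the authors' code), the Gompertz law can be fitted to volume pseudo-data with scipy; the parametrization and the data values below are assumptions, and the unweighted χ2 shown stands in for the uncertainty-weighted version a real analysis would use.

```python
# Fitting a Gompertz growth curve V(t) = V0 * exp((a/b) * (1 - exp(-b*t)))
# to hypothetical spheroid volume data; this parametrization is one common form.
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, v0, a, b):
    return v0 * np.exp((a / b) * (1.0 - np.exp(-b * t)))

t = np.array([0, 2, 4, 6, 8, 10, 12, 14], dtype=float)   # days (illustrative)
v = np.array([0.05, 0.11, 0.24, 0.45, 0.72, 0.98, 1.18, 1.30])  # mm^3

popt, pcov = curve_fit(gompertz, t, v, p0=[0.05, 0.8, 0.2])
residuals = v - gompertz(t, *popt)
chi2_per_dof = np.sum(residuals**2) / (len(t) - len(popt))  # unweighted variant
print("fitted (V0, a, b):", popt, " chi2/dof:", chi2_per_dof)
```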


2021 ◽  
Author(s):  
George Karabatsos

Abstract Approximate Bayesian computation (ABC) can provide inferences from the (approximate) posterior distribution for models with intractable likelihoods. The quality of ABC inferences relies on the choice of tolerance for the distance between the observed data summary statistics and the pseudo-data summary statistics simulated from the likelihood, used within an algorithm that samples from the approximate posterior. However, the ABC literature does not provide an automatic method for selecting the best tolerance level for the dataset at hand, and in ABC practice finding the best tolerance level can be time-consuming. This note introduces a fast automatic estimator of the tolerance based on the parametric bootstrap. Once the tolerance estimate is calculated, it can be input into any suitable importance sampling or MCMC algorithm to sample from the target approximate posterior distribution. The tolerance estimator is illustrated through ABC analyses of simulated and real datasets involving several intractable likelihood models, including the analysis of a real 23,000-node network dataset involving stochastic search model selection.
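
To see where the tolerance enters, here is a minimal rejection-ABC sketch on a toy Gaussian model; the note's parametric-bootstrap estimator of the tolerance is not reproduced, and eps is simply taken as given.

```python
# Minimal rejection ABC: the tolerance eps decides which simulated summary
# statistics are "close enough" to the observed ones. Toy Gaussian example;
# the paper's bootstrap estimate of eps would be computed upstream.
import numpy as np

rng = np.random.default_rng(1)
observed = rng.normal(3.0, 1.0, size=200)        # stand-in for real data
s_obs = observed.mean()                          # summary statistic

def abc_rejection(s_obs, eps, n_draws=50_000):
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-10, 10)                # prior draw
        pseudo = rng.normal(mu, 1.0, size=200)   # simulate pseudo-data
        if abs(pseudo.mean() - s_obs) < eps:     # tolerance check
            accepted.append(mu)
    return np.array(accepted)

post = abc_rejection(s_obs, eps=0.05)
print(f"accepted {post.size} draws, posterior mean ~ {post.mean():.3f}")
```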


2021 ◽  
Author(s):  
Petro Uruci ◽  
Anthoula D. Drosatou ◽  
Dontavious Sippial ◽  
Spyros N. Pandis

Secondary organic aerosol (SOA) constitutes a major fraction of the total organic aerosol (OA) in the atmosphere. SOA is formed when low-vapor-pressure products of the oxidation of volatile, intermediate-volatility, and semivolatile organic compounds partition onto pre-existing particles. Oxidation of the precursor molecules results in a myriad of organic products, making the detailed analysis of smog chamber experiments difficult and the incorporation of the corresponding results into chemical transport models (CTMs) challenging. The volatility basis set (VBS) is a framework designed to help bridge the gap between laboratory measurements and CTMs; it describes the volatility distribution of the OA and the SOA. The parametrization of SOA formation in the VBS has traditionally been based on fitting yield measurements from smog chamber experiments. To reduce the uncertainty of this approach, we developed an algorithm that estimates parameters such as the product volatility distribution, effective vaporization enthalpy, and accommodation coefficient by combining SOA yield measurements with thermograms (from thermodenuders) and areograms (from isothermal dilution chambers) from different experiments and laboratories. The algorithm was first evaluated with "pseudo-data" produced by simulating the corresponding processes for SOA with known properties. The results showed excellent agreement and low uncertainties when the volatility range and the mass loading range of the yield measurements coincide. A major feature of our approach is that it estimates the uncertainty of the resulting parameterization for different atmospheric conditions (temperature, concentration levels, etc.). In the last step of the work, the use of the algorithm with realistic smog chamber data is demonstrated.
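
For context, the core VBS calculation is equilibrium absorptive partitioning: each volatility bin i has particle-phase fraction ξ_i = (1 + C*_i/C_OA)^(-1), with C_OA = Σ_i ξ_i C_i solved self-consistently. A minimal sketch with illustrative bin values follows; it is not the authors' fitting algorithm.

```python
# Equilibrium VBS partitioning: solve C_OA = sum_i C_i / (1 + Cstar_i / C_OA)
# by fixed-point iteration. Bin concentrations below are illustrative only.
import numpy as np

Cstar = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])  # saturation conc. (ug m^-3)
Ctot  = np.array([0.5, 1.5, 3.0, 6.0, 12.0])       # total (gas+particle) per bin

def vbs_partition(Cstar, Ctot, tol=1e-9, max_iter=1000):
    c_oa = Ctot.sum() / 2.0                         # initial guess
    for _ in range(max_iter):
        xi = 1.0 / (1.0 + Cstar / c_oa)             # particle-phase fraction per bin
        new_c_oa = np.sum(xi * Ctot)
        if abs(new_c_oa - c_oa) < tol:
            break
        c_oa = new_c_oa
    return c_oa, xi

c_oa, xi = vbs_partition(Cstar, Ctot)
print(f"C_OA = {c_oa:.3f} ug/m3, particle fractions: {np.round(xi, 3)}")
```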


Author(s):  
Haipeng Sun ◽  
Rui Wang ◽  
Masao Utiyama ◽  
Benjamin Marie ◽  
Kehai Chen ◽  
...  

Unsupervised neural machine translation (UNMT) has achieved remarkable results for several language pairs, such as French–English and German–English. Most previous studies have focused on modeling UNMT systems; few studies have investigated the effect of UNMT on specific languages. In this article, we first empirically investigate UNMT for four diverse language pairs (French/German/Chinese/Japanese–English). We confirm that the performance of UNMT in translation tasks for similar language pairs (French/German–English) is dramatically better than for distant language pairs (Chinese/Japanese–English). We empirically show that the lack of shared words and different word orderings are the main reasons that lead UNMT to underperform in Chinese/Japanese–English. Based on these findings, we propose several methods, including artificial shared words and pre-ordering, to improve the performance of UNMT for distant language pairs. Moreover, we propose a simple general method to improve translation performance for all these four language pairs. The existing UNMT model can generate a translation of a reasonable quality after a few training epochs owing to a denoising mechanism and shared latent representations. However, learning shared latent representations restricts the performance of translation in both directions, particularly for distant language pairs, while denoising dramatically delays convergence by continuously modifying the training data. To avoid these problems, we propose a simple, yet effective and efficient, approach that (like UNMT) relies solely on monolingual corpora: pseudo-data-based unsupervised neural machine translation. Experimental results for these four language pairs show that our proposed methods significantly outperform UNMT baselines.
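
The abstract does not spell out the pseudo-data procedure; a common realization of the idea, relying only on monolingual corpora, is iterative back-translation, sketched below with purely hypothetical `translate`/`train` placeholders rather than any API from the paper.

```python
# Sketch of pseudo-parallel data generation in the spirit of back-translation.
# `model_s2t`, `model_t2s`, and `train` are hypothetical placeholders; only
# monolingual source and target corpora are assumed available.
def make_pseudo_parallel(model_t2s, mono_target):
    """Translate monolingual target sentences back into the source language,
    producing (pseudo-source, real-target) training pairs."""
    return [(model_t2s.translate(t), t) for t in mono_target]

def pseudo_data_round(model_s2t, model_t2s, mono_src, mono_tgt, train):
    # The source->target model learns from (pseudo-source, real-target) pairs...
    train(model_s2t, make_pseudo_parallel(model_t2s, mono_tgt))
    # ...and the reverse model from (pseudo-target, real-source) pairs.
    train(model_t2s, make_pseudo_parallel(model_s2t, mono_src))
```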


Author(s):  
Dang Kien Cuong ◽  
Duong Ton Dam ◽  
Duong Ton Thai Duong

The bootstrap, which this article employs, is a major tool for studying and estimating the values of parameters in probability distributions. We give an overview of the theory of infinite distribution functions. The tools used to address the problems raised in the paper are the mathematical methods of stochastic analysis, the theory of random processes, and multivariate statistics. Observations (realisations of a stationary process) are not independent, and dependence in time series is a relatively simple example of dependent data. Through a simulation study we found that the pseudo data generated by the bootstrap always showed weaker dependence among the observations than the time series they were sampled from; hence we conclude that even by resampling blocks instead of single observations we lose some of the structure of the original sample. A potential difficulty in using likelihood methods for the generalized extreme value (GEV) distribution concerns the regularity conditions required for the usual asymptotic properties of the maximum likelihood estimator to be valid. To estimate the value of a parameter in the GEV, we can use classical methods of mathematical statistics, such as maximum likelihood or least squares, but these require a certain number of samples for verification. The bootstrap method obviously does not: here we use the limit theorems of probability theory and multivariate statistics to solve the problem even when only one sample of data is available. That is the important practical point our paper wants to convey. In predictive analysis problems where the actual data are incomplete or not long enough, we can use the bootstrap to augment the data.
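
The block resampling mentioned above is commonly implemented as the moving-block bootstrap; a minimal sketch follows, in which the AR(1) series and the block length are illustrative choices, not the paper's data.

```python
# Moving-block bootstrap for a time series: resample overlapping blocks of
# consecutive observations to partially preserve serial dependence.
import numpy as np

def moving_block_bootstrap(x, block_len, rng=None):
    rng = rng or np.random.default_rng()
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    sample = np.concatenate([x[s:s + block_len] for s in starts])
    return sample[:n]                        # trim to the original length

# Example: AR(1)-like dependent data, resampled with blocks of length 20.
rng = np.random.default_rng(42)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.7 * x[t - 1] + rng.normal()
boot = moving_block_bootstrap(x, block_len=20, rng=rng)
print(boot[:5])
```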

