Spatial structure, parameter nonlinearity, and intelligent algorithms in constructing pedotransfer functions from large-scale soil legacy data

2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Poulamee Chakraborty ◽  
Bhabani S. Das ◽  
Hitesh B. Vasava ◽  
Niranjan Panigrahi ◽  
Priyabrata Santra

Abstract The pedotransfer function (PTF) approach is a convenient way to estimate difficult-to-measure soil properties from basic soil data. Typically, PTFs are developed using a large number of samples collected from small (regional) areas for training and testing a predictive model. National soil legacy databases offer an opportunity to provide soil data for developing PTFs, although legacy data are sparsely distributed over large areas. Here, we examined the Indian soil legacy (ISL) database to select a comprehensive training dataset for estimating cation exchange capacity (CEC) as a test case of the PTF approach. Geostatistical and correlation analyses showed that legacy data contain the diverse spatial and correlation structure needed to build robust PTFs. Using non-linear correlation measures and intelligent predictive algorithms, we developed a methodology to extract an efficient training dataset from the ISL data for estimating CEC with high prediction accuracy. The selected training data had comparable spatial variation and parameter nonlinearity between the training and test datasets. We thus identified specific indicators for constructing robust PTFs from legacy data. Our results open a new avenue for using the large volume of existing soil legacy data to develop region-specific PTFs without collecting new soil data.
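As a hedged illustration of the PTF idea (not the authors' exact workflow), the sketch below fits a nonlinear regressor that predicts CEC from basic soil properties; the file name and column names are assumptions about how a legacy table might look.

```python
# Minimal PTF sketch: predict CEC from basic soil properties with a
# nonlinear model. File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

features = ["sand", "silt", "clay", "organic_carbon", "pH"]  # assumed fields
data = pd.read_csv("isl_legacy_subset.csv")                  # hypothetical file

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["CEC"], test_size=0.3, random_state=0)

ptf = RandomForestRegressor(n_estimators=500, random_state=0)
ptf.fit(X_train, y_train)
print("R2 on held-out legacy samples:", r2_score(y_test, ptf.predict(X_test)))
```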

2020 ◽  
Vol 27 ◽  
Author(s):  
Zaheer Ullah Khan ◽  
Dechang Pi

Background: S-sulfenylation (S-sulphenylation, or sulfenic acid formation) of proteins is a special kind of post-translational modification that plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. Given this significance, and to complement existing wet-lab methods, several computational models have been developed for predicting sulfenylation cysteine (SC) sites. However, their performance has not been satisfactory owing to inefficient feature schemes, severe class imbalance, and the lack of an intelligent learning engine. Objective: In this study, our motivation is to establish a strong and novel computational predictor for discriminating sulfenylation from non-sulfenylation sites. Methods: We report an innovative bioinformatics predictor, named DeepSSPred, in which the encoded features are obtained via an n-segmented hybrid feature scheme, and the synthetic minority oversampling technique (SMOTE) is employed to cope with the severe imbalance between SC-sites (minority class) and non-SC sites (majority class). A state-of-the-art 2D convolutional neural network was trained and validated with a rigorous 10-fold jackknife cross-validation procedure. Results: With a strong discrete representation of the feature space, a capable machine learning engine, and an unbiased presentation of the underlying training data, the proposed framework yields an excellent model that outperforms all existing studies. The proposed approach is 6% higher in MCC than the first-best method; on an independent dataset, that study did not provide sufficient details for comparison. Compared with the second-best method, our model obtained increases of 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp, and 13.12% in MCC on the training data, and 12.13% in ACC, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset. These empirical analyses show the superior performance of the proposed model on both the training and independent datasets in comparison with existing studies. Conclusion: We have developed a novel sequence-based automated predictor for SC-sites, called DeepSSPred. Simulation outcomes on the training dataset and an independent validation dataset reveal the efficacy of the proposed model. The good performance of DeepSSPred stems from several factors, including the novel discriminative feature encoding schemes, the SMOTE technique, and the careful construction of the prediction model through a tuned 2D-CNN classifier. We believe this work provides insight for further prediction of S-sulfenylation characteristics and functionalities, and we hope the developed predictor will be helpful for large-scale discrimination of unknown SC-sites in particular and for designing new pharmaceutical drugs in general.
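The overall recipe described in the Methods, rebalancing the minority SC-site class with SMOTE and training a 2D CNN, can be sketched roughly as follows; the feature shapes, file names, and layer sizes are illustrative assumptions rather than the published DeepSSPred configuration.

```python
# Rough sketch: SMOTE rebalancing of SC-site features followed by a small
# 2D CNN. Shapes and hyperparameters are assumed, not taken from the paper.
import numpy as np
from imblearn.over_sampling import SMOTE
from tensorflow.keras import layers, models

X = np.load("sc_site_features.npy")   # hypothetical (n_samples, 21, 20) encodings
y = np.load("sc_site_labels.npy")     # 1 = SC-site, 0 = non-SC site

# SMOTE works on flat vectors, so flatten, resample, then restore the 2D shape.
n, h, w = X.shape
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X.reshape(n, -1), y)
X_bal = X_bal.reshape(-1, h, w, 1)

cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(h, w, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn.fit(X_bal, y_bal, epochs=10, batch_size=64, validation_split=0.1)
```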


Author(s):  
Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, i.e., parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle this training data bottleneck, we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect sentences. We then combine these two tasks to jointly pre-train a neural network. The pre-trained network is fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method can capture task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated from simple heuristics and cannot fully cover all disfluency patterns, there is still a performance gap compared to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning during the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most informative examples to label and can address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model matches state-of-the-art performance with only about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
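A minimal sketch of the pseudo-data construction described above, assuming the corruption is applied token by token with fixed insertion and deletion probabilities; the probabilities, toy vocabulary, and labeling convention are illustrative.

```python
# Corrupt clean sentences by randomly inserting or deleting words, keeping
# per-token labels (1 = inserted noise) for the tagging pre-training task.
import random

def make_pseudo_example(tokens, vocab, p_add=0.1, p_del=0.1):
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < p_del:
            continue                    # simulate a deletion
        if random.random() < p_add:
            corrupted.append(random.choice(vocab))  # insert a noise word
            labels.append(1)
        corrupted.append(tok)
        labels.append(0)
    return corrupted, labels

vocab = ["well", "uh", "the", "really", "you", "know"]   # toy vocabulary
print(make_pseudo_example("i want a flight to boston".split(), vocab))
```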


2021 ◽  
Vol 2021 ◽  
pp. 1-11 ◽  
Author(s):  
Yue Wu

In the field of computer science, data mining is a hot topic. It is a mathematical approach to identifying patterns in enormous amounts of data. Image mining is an important data mining technique that spans a variety of fields. Within image mining, the organization of art images is a research area worthy of attention. Art image categorization refers to classifying art images into several predetermined sets. It involves image preprocessing, feature extraction, object identification, object categorization, object segmentation, object classification, and a variety of other techniques. The purpose of this paper is to propose an improved boosting algorithm that combines traditional, simple, yet weak classifiers into a complex, accurate, and strong classifier for distinguishing art images from realistic images. This paper investigates the characteristics of cartoon images, realistic images, painting images, and photo images, constructs color variance histogram features, and uses them for classification. For the classification experiments, this paper uses an image database of 10,471 images, randomly split into training and test portions: the training dataset contains 6971 images and the test dataset contains 3478 images. The experimental results show that the proposed algorithm achieves a classification accuracy of approximately 97%. The method proposed in this paper can serve as the basis for automatic large-scale image classification and has strong practicability.
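As a hedged sketch of the boosting setup (not the paper's exact algorithm), the code below combines shallow decision stumps with AdaBoost over precomputed color-variance histogram features; the feature files, class labels, and split sizes are hypothetical.

```python
# Boost weak decision stumps over color-variance histogram features.
# Feature layout and labels are assumed for illustration.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X = np.load("color_variance_histograms.npy")   # hypothetical (n_images, n_bins)
y = np.load("image_labels.npy")                 # e.g. 0 = art/cartoon, 1 = photo

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # simple weak learner
    n_estimators=200, random_state=0)               # (scikit-learn >= 1.2)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```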


2019 ◽  
Vol 11 (19) ◽  
pp. 2190
Author(s):  
Kushiyama ◽  
Matsuoka

After a large-scale disaster, many damaged buildings are demolished and treated as disaster waste. Although the weight of disaster waste was estimated two months after the 2016 earthquake in Kumamoto, Japan, the estimate differed significantly from the result obtained when the disaster waste disposal was completed in March 2018. The amount of disaster waste generated can be estimated by multiplying the total number of severely and partially damaged buildings by a coefficient of generated weight per building. We suppose that the amount of disaster waste is affected by the conditions of the demolished buildings, namely their areas and structural typologies, but this has not yet been clarified. Therefore, in this study, we aimed to use geographic information system (GIS) map data to create a time-series GIS map dataset with labels of demolished and remaining buildings in Mashiki town for the two-year period up to the completion of the disaster waste disposal. We used OpenStreetMap (OSM) data as the base data and time-series SPOT images observed in the two years following the Kumamoto earthquake to label all demolished and remaining buildings in the GIS map dataset. To efficiently label the approximately 16,000 buildings in Mashiki town, we calculated an indicator of the likelihood that a building was demolished or remaining from the change in brightness in the SPOT images. We classified 5701 of 16,106 buildings as demolished, as of March 2018, by visual interpretation of the SPOT and Pleiades images with reference to this indicator. We verified that this number of demolished buildings was almost the same as the number reported by the Mashiki municipality. Moreover, we assessed the accuracy of our proposed method: the F-measure was higher than 0.9 on the training dataset, which was verified by a field survey and visual interpretation and included labels for 55 demolished and 55 remaining buildings. We also assessed the accuracy of the method on all labels in the OSM dataset, where the F-measure was 0.579. When we applied the method to test data with balanced labels of another 100 demolished and 100 remaining buildings, disjoint from the training data, the F-measure was 0.790, calculated from the SPOT image of 25 March 2018. Our proposed method therefore performed well for balanced classification but not for imbalanced classification. We also examined examples of image characteristics that led to correct and incorrect estimates when thresholding the indicator.
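A hedged sketch of the thresholding and evaluation step: score each building by a brightness-change indicator and compute the F-measure against reference labels. The indicator file, label encoding, and threshold value below are illustrative assumptions, not the paper's values.

```python
# Classify buildings as demolished/remaining by thresholding a per-building
# brightness-change indicator and evaluate with the F-measure.
import numpy as np
from sklearn.metrics import f1_score

indicator = np.load("building_brightness_change.npy")  # hypothetical, shape (n_buildings,)
labels = np.load("building_labels.npy")                # 1 = demolished, 0 = remaining

predicted = (indicator > 15.0).astype(int)             # assumed threshold
print("F-measure:", f1_score(labels, predicted))
```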


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Bin Li ◽  
Jianmin Yao

The performance of a machine translation system (MTS) depends on the quality and size of the training data. How to extend the training dataset for an MTS in specific domains with effective methods to enhance translation performance needs to be explored. A method for selecting in-domain bilingual sentence pairs based on topic information is proposed. Using the topic relevance of bilingual sentence pairs to the target domain, subsets of sentence pairs related to the texts to be translated are selected from a large-scale bilingual corpus to train the translation system for specific domains and improve translation quality for in-domain texts. In our tests, bilingual sentence pairs were selected with the proposed method and the MTS was then trained on them, which greatly enhanced translation performance.
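A minimal sketch of topic-relevance data selection, assuming the relevance score is the cosine similarity between each source sentence and the in-domain text collection; the paper's actual topic model is not reproduced here, and the function and corpus names are hypothetical.

```python
# Rank bilingual sentence pairs by similarity of the source side to the
# target-domain texts and keep the top-k pairs for training.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_in_domain(bitext, in_domain_texts, top_k=100000):
    sources = [src for src, _ in bitext]
    vec = TfidfVectorizer().fit(sources + in_domain_texts)
    src_mat = vec.transform(sources)
    domain_centroid = vec.transform([" ".join(in_domain_texts)])
    scores = cosine_similarity(src_mat, domain_centroid).ravel()
    ranked = sorted(zip(scores, bitext), key=lambda p: p[0], reverse=True)
    return [pair for _, pair in ranked[:top_k]]

# usage: pairs = select_in_domain(large_bilingual_corpus, target_domain_docs)
```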


2019 ◽  
Author(s):  
Yosuke Toda ◽  
Fumio Okura ◽  
Jun Ito ◽  
Satoshi Okada ◽  
Toshinori Kinoshita ◽  
...  

Incorporating deep learning into the image analysis pipeline has opened the possibility of precision phenotyping in agriculture. However, training a neural network requires a sufficient amount of training data, and the time-consuming manual annotation process often becomes the limiting step. Here, we show that an instance segmentation neural network (Mask R-CNN) aimed at phenotyping barley seed morphology across cultivars can be sufficiently trained purely on a synthetically generated dataset. Our approach is based on the concept of domain randomization, in which a large number of images are generated by randomly placing and orienting seed objects on a virtual canvas. After training on such a dataset, the network achieved 96% recall and 95% average precision on a real-world test dataset. Applying our pipeline enables extraction of morphological parameters at large scale, allowing precise characterization of the natural variation of barley from a multivariate perspective. Importantly, we show that our approach is effective not only for barley seeds but also for various crops including rice, lettuce, oat, and wheat, supporting the generality of the technique's performance benefits. We propose that constructing and utilizing such synthetic data can be a powerful way to alleviate the human labor costs of preparing training datasets for deep learning in the agricultural domain.
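A hedged sketch of the domain-randomization idea: paste segmented seed crops onto a virtual canvas at random positions and rotations, recording the placements as training annotations. File names, canvas size, and annotation format are assumptions, not the authors' pipeline.

```python
# Synthesize a training image by randomly placing rotated seed crops on a
# dark canvas; the alpha channel of each crop serves as its paste mask.
import random
from PIL import Image

def synthesize(seed_paths, canvas_size=(1024, 1024), n_seeds=50):
    canvas = Image.new("RGB", canvas_size, (30, 30, 30))
    annotations = []
    for _ in range(n_seeds):
        seed = Image.open(random.choice(seed_paths)).convert("RGBA")
        seed = seed.rotate(random.uniform(0, 360), expand=True)
        x = random.randint(0, max(0, canvas_size[0] - seed.width))
        y = random.randint(0, max(0, canvas_size[1] - seed.height))
        canvas.paste(seed, (x, y), seed)
        annotations.append((x, y, seed.width, seed.height))  # bounding boxes
    return canvas, annotations

# usage: img, boxes = synthesize(["seed_001.png", "seed_002.png"])
```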


2019 ◽  
Vol 141 (11) ◽  
Author(s):  
Ramin Bostanabad ◽  
Yu-Chin Chan ◽  
Liwei Wang ◽  
Ping Zhu ◽  
Wei Chen

Abstract We introduce a novel method for Gaussian process (GP) modeling of massive datasets called globally approximate Gaussian process (GAGP). Unlike most large-scale supervised learners such as neural networks and trees, GAGP is easy to fit and provides interpretable model behavior, making it particularly useful in engineering design with big data. The key idea of GAGP is to build a collection of independent GPs that use the same hyperparameters but randomly distribute the entire training dataset among themselves. This is based on our observation that the GP hyperparameter estimates change negligibly once the size of the training data exceeds a certain level, which can be estimated systematically. For inference, the predictions from all GPs in the collection are pooled, allowing the entire training dataset to be exploited efficiently for prediction. Through analytical examples, we demonstrate that GAGP achieves very high predictive power, matching (and in some cases exceeding) that of state-of-the-art supervised learning methods. We illustrate the application of GAGP in engineering design with a problem on data-driven metamaterials, using it to link reduced-dimension geometrical descriptors of unit cells to their properties. Searching for new unit cell designs with desired properties is then achieved by employing GAGP in inverse optimization.
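A rough sketch of the GAGP idea follows: estimate the hyperparameters once on a subset, fit many GPs with those fixed hyperparameters on random partitions of the data, and average their predictions. This is an illustration of the concept, not the authors' implementation; the kernel choice and partition count are assumptions.

```python
# GAGP-style ensemble: shared hyperparameters, random data partitions,
# pooled predictions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_gagp(X, y, n_parts=10, seed=0):
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_parts)

    # Estimate hyperparameters on the first partition only.
    base = GaussianProcessRegressor(kernel=ConstantKernel() * RBF())
    base.fit(X[parts[0]], y[parts[0]])

    # Reuse the learned kernel; optimizer=None freezes the hyperparameters.
    gps = []
    for p in parts:
        gp = GaussianProcessRegressor(kernel=base.kernel_, optimizer=None)
        gp.fit(X[p], y[p])
        gps.append(gp)
    return gps

def predict_gagp(gps, X_new):
    # Pool the ensemble by averaging the individual GP predictions.
    return np.mean([gp.predict(X_new) for gp in gps], axis=0)
```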


Author(s):  
Yi-Quan Li ◽  
Hao-Sen Chang ◽  
Daw-Tung Lin

In the field of computer vision, large-scale image classification tasks are both important and highly challenging. With the ongoing advances in deep learning and optical character recognition (OCR) technologies, neural networks designed to perform large-scale classification play an essential role in OCR systems. In this study, we developed an automatic OCR system designed to identify up to 13,070 large-scale printed Chinese characters by using deep learning neural networks and fine-tuning techniques. The proposed framework comprises four components: training dataset synthesis and background simulation, image preprocessing and data augmentation, model training, and transfer learning. The training data synthesis procedure consists of a character font generation step and a background simulation process. Three background models are proposed to simulate background noise and anti-counterfeiting patterns on ID cards. To expand the diversity of the synthesized training dataset, rotation and zooming data augmentation are applied. A massive dataset comprising more than 19.6 million images was thus created to accommodate the variations in the input images and improve the learning capacity of the CNN model. Subsequently, we modified the GoogLeNet architecture by replacing the FC layer with a global average pooling layer to avoid the overfitting caused by a massive amount of training data; consequently, the number of model parameters was reduced. Finally, we employed the transfer learning technique to further refine the CNN model using a small number of real data samples. Experimental results show that the overall recognition performance of the proposed approach is significantly better than that of prior methods, demonstrating the effectiveness of the proposed framework, which achieved a recognition accuracy as high as 99.39% on the constructed real ID card dataset.
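The global-average-pooling head described above can be sketched as follows; Keras InceptionV3 is used here only as a stand-in backbone for the modified GoogLeNet, and the input size and training settings are assumptions rather than the paper's configuration.

```python
# Replace the fully connected classifier head with global average pooling
# followed by a single softmax layer over 13,070 character classes.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

backbone = InceptionV3(include_top=False, weights=None,
                       input_shape=(96, 96, 3))       # assumed input size
x = layers.GlobalAveragePooling2D()(backbone.output)  # stands in for the FC layer
outputs = layers.Dense(13070, activation="softmax")(x)
model = models.Model(backbone.input, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```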


2020 ◽  
Vol 2020 (10) ◽  
pp. 181-1-181-7
Author(s):  
Takahiro Kudo ◽  
Takanori Fujisawa ◽  
Takuro Yamaguchi ◽  
Masaaki Ikehara

Image deconvolution has recently become an important issue. There are two kinds of approaches: non-blind and blind. Non-blind deconvolution is the classic image deblurring problem, which assumes that the PSF is known and spatially uniform. Recently, convolutional neural networks (CNNs) have been used for non-blind deconvolution. Although CNNs can handle complex variations in unknown images, some conventional CNN-based methods can only handle small PSFs and do not consider the large PSFs encountered in the real world. In this paper we propose a CNN-based non-blind deconvolution framework that can remove large-scale ringing from a deblurred image. Our method has three key points. First, our network architecture preserves both large and small features in the image. Second, the training dataset is created so as to preserve details. Third, we extend the images to minimize the effects of large ringing at the image borders. In our experiments, we used three kinds of large PSFs and observed high-precision results from our method both quantitatively and qualitatively.
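The border-extension idea (the third key point) can be illustrated with a minimal sketch in which classic Wiener deconvolution stands in for the proposed CNN; the pad width and regularization balance are assumed values.

```python
# Pad the blurred image before non-blind deconvolution so that border
# ringing falls outside the region of interest, then crop back.
import numpy as np
from skimage import restoration

def deconvolve_with_padding(blurred, psf, pad=64, balance=0.01):
    padded = np.pad(blurred, pad, mode="edge")     # extend the borders
    restored = restoration.wiener(padded, psf, balance)
    return restored[pad:-pad, pad:-pad]            # crop to the original size

# usage (grayscale image and normalized PSF assumed):
# sharp = deconvolve_with_padding(blurred_image, psf_kernel)
```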


2021 ◽  
Vol 15 (3) ◽  
pp. 1-27
Author(s):  
Yan Liu ◽  
Bin Guo ◽  
Daqing Zhang ◽  
Djamal Zeghlache ◽  
Jingmin Chen ◽  
...  

Store site recommendation aims to predict the value of a store at candidate locations and then recommend the optimal location to the company for placing a new brick-and-mortar store. Most existing studies focus on learning machine learning or deep learning models from large-scale training data of existing chain stores in the same city. However, the expansion of chain enterprises into new cities suffers from data scarcity, and these models do not work in a new city where no chain store has yet been placed (i.e., the cold-start problem). In this article, we propose a unified approach for cold-start store site recommendation, Weighted Adversarial Network with Transferability weighting scheme (WANT), to transfer knowledge learned from a data-rich source city to a target city with no labeled data. In particular, to promote positive transfer, we develop a discriminator to diminish the distribution discrepancy between the source city and the target city, which plays a minimax game with the feature extractor to learn transferable representations across cities by adversarial learning. In addition, to further reduce the risk of negative transfer, we design a transferability weighting scheme that quantifies the transferability of examples in the source city and reweights the contribution of relevant source examples so that useful knowledge is transferred. We validate WANT on a real-world dataset, and experimental results demonstrate the effectiveness of our proposed model over several state-of-the-art baselines.
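A rough PyTorch sketch of the two ingredients described above: a domain discriminator trained adversarially through gradient reversal, and per-example weights that down-weight source examples the discriminator finds easy to distinguish. The architectures and the weighting formula are illustrative assumptions, not the published WANT model.

```python
# Adversarial city adaptation with a simple transferability weighting scheme.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -grad                      # reverse gradients for the extractor

feature_extractor = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
value_predictor = nn.Linear(64, 1)        # predicts store value (source only)
discriminator = nn.Linear(64, 1)          # source city vs. target city

def losses(x_src, y_src, x_tgt):
    f_src, f_tgt = feature_extractor(x_src), feature_extractor(x_tgt)

    # Transferability weights: source examples that look "target-like" get
    # higher weight (assumed scheme, not the paper's exact formula).
    with torch.no_grad():
        w = torch.sigmoid(-discriminator(f_src)).squeeze(1)
        w = w / w.sum()

    # Weighted regression loss on labeled source examples.
    reg_loss = (w * (value_predictor(f_src).squeeze(1) - y_src) ** 2).sum()

    # Adversarial domain loss via gradient reversal.
    d_in = torch.cat([GradReverse.apply(f_src), GradReverse.apply(f_tgt)])
    d_lab = torch.cat([torch.ones(len(x_src), 1), torch.zeros(len(x_tgt), 1)])
    adv_loss = nn.functional.binary_cross_entropy_with_logits(
        discriminator(d_in), d_lab)
    return reg_loss + adv_loss
```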

