Bootstrapping Dependency Treebank of Urdu Noisy Text

This paper describes how bootstrapping was used to extend the Urdu Noisy Text dependency treebank. To overcome the bottleneck of manually annotating a corpus for a new domain of user-generated text, MaltParser, an open-source, data-driven dependency parser, was trained on the 500-tweet Urdu Noisy Text Dependency Treebank and then used to bootstrap the treebank semi-automatically. A total of four bootstrapping iterations were performed. At the end of each iteration, 300 Urdu tweets were automatically tagged, and the performance of the parser model was evaluated against the development set. Of these 300 pre-tagged tweets, 75 were randomly selected for manual correction and then added to the training set for parser retraining. Finally, at the end of the last iteration, parser performance was evaluated against the test set. The final supervised bootstrapping model obtains an LA of 72.1%, a UAS of 75.7%, and an LAS of 64.9%, which is a significant improvement over the baseline scores of 69.8% LA, 74% UAS, and 62.9% LAS.
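The three scores reported above can be illustrated with a minimal sketch. It assumes that LA denotes label accuracy, UAS the unlabeled attachment score, and LAS the labeled attachment score, as is common in dependency parsing evaluation; the abstract itself does not define them. The token representation, the function name `attachment_scores`, and the sample parses are hypothetical and are not taken from the treebank.

```python
from typing import Dict, List, Tuple

# A token is (head_index, dependency_label). This representation is an
# assumption made for illustration, not the treebank's actual file format.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Dict[str, float]:
    """Return LA, UAS and LAS as fractions over all tokens."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    # LA: the predicted label matches, regardless of the head.
    la = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    # UAS: the predicted head matches, regardless of the label.
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    # LAS: both head and label match.
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return {"LA": la, "UAS": uas, "LAS": las}


# Hypothetical four-token sentence: one label error and one head error.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "amod")]
scores = attachment_scores(gold, pred)
```

Because LAS requires both the head and the label to be correct, it can never exceed LA or UAS, which matches the ordering of the figures reported in the abstract.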

Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Chuan Xiang ◽  
Zejun Ren ◽  
Pengfei Shi ◽  
Hongge Zhao

The rolling bearing is an extremely important basic mechanical device, and the diagnosis of its faults plays an important role in the safe and stable operation of the mechanical system. This study proposes an approach, based on the Fast Fourier Transform (FFT) with Decimation-In-Time (DIT) and the XGBoost algorithm, to identify the fault type of a bearing quickly and accurately. First, the original vibration signal of the rolling bearing was transformed by DIT-FFT and divided into a training set and a test set. Next, the training set was used to train the fault diagnosis XGBoost model, and the test set was used to validate the trained model. Finally, the proposed approach was compared with some common methods. The results demonstrate that the proposed approach can diagnose and identify the fault type of a bearing quickly with almost 99% accuracy. It is more accurate than Machine Learning (89.88%), Ensemble Learning (93.25%), and Deep Learning (95%). This approach is suitable for the fault diagnosis of rolling bearings.
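The DIT-FFT preprocessing step can be sketched as a recursive radix-2 decimation-in-time FFT. This is a minimal illustration rather than the authors' implementation: the function names `fft_dit` and `magnitude_features` are hypothetical, the input length is assumed to be a power of two, and the choice of spectrum magnitudes as classifier features is an assumption. The XGBoost training step is omitted.

```python
import cmath
from typing import List, Sequence


def fft_dit(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # Decimation in time: split into even- and odd-indexed samples.
    even = fft_dit(x[0::2])
    odd = fft_dit(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd half (butterfly step).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def magnitude_features(signal: Sequence[float]) -> List[float]:
    """Magnitudes of the first half of the spectrum (assumed feature choice)."""
    spectrum = fft_dit(signal)
    return [abs(c) for c in spectrum[: len(signal) // 2]]
```

A feature vector produced this way for each vibration segment could then be passed to any classifier; only the first half of the spectrum is kept because the spectrum of a real signal is symmetric.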


2020 ◽  
Author(s):  
Xin Yi See ◽  
Benjamin Reiner ◽  
Xuelan Wen ◽  
T. Alexander Wheeler ◽  
Channing Klein ◽  
...  

Herein, we describe the use of iterative supervised principal component analysis (ISPCA) in de novo catalyst design. The regioselective synthesis of 2,5-dimethyl-1,3,4-triphenyl-1H-pyrrole (C) via Ti-catalyzed formal [2+2+1] cycloaddition of phenyl propyne and azobenzene was targeted as a proof of principle. The initial reaction conditions led to an unselective mixture of all possible pyrrole regioisomers. ISPCA was conducted on a training set of catalysts, and their performance was regressed against the scores from the top three principal components. Component loadings from this PCA space along with k-means clustering were used to inform the design of new test catalysts. The selectivity of a prospective test set was predicted in silico using the ISPCA model, and only optimal candidates were synthesized and tested experimentally. This data-driven predictive-modeling workflow was iterated, and after only three generations the catalytic selectivity was improved from 0.5 (statistical mixture of products) to over 11 (> 90% C) by incorporating 2,6-dimethyl-4-(pyrrolidin-1-yl)pyridine as a ligand. The successful development of a highly selective catalyst without resorting to long, stochastic screening processes demonstrates the inherent power of ISPCA in de novo catalyst design and should motivate the general use of ISPCA in reaction development.


2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Mohammad Haekal ◽  
Henki Bayu Seta ◽  
Mayanda Mega Santoni

To predict the water quality of the Ciliwung River, online monitoring data were processed using data mining. In this method, the monitoring data were first tabulated in Microsoft Excel and then processed into a decision tree with the Decision Tree algorithm using the WEKA application. The decision tree method was chosen because it is simple, easy to understand, and highly accurate. A total of 5,476 Ciliwung River water-quality monitoring records were processed. Classification with the decision tree indicated that 1,059 of these 5,476 records (19.3242%) showed the river as Not Polluted and 4,417 records (80.6758%) showed it as Polluted. The monitoring data were then evaluated using four WEKA test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options yielded very high accuracy, above 99%. From these results, it can be predicted that the Ciliwung is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia No. 82 of 2001, and the results also show that using WEKA with the Decision Tree algorithm to process the monitoring data on three parameters (pH, DO, and nitrate) is highly accurate and appropriate. Keywords: river water quality, data mining, Decision Tree algorithm, WEKA application.


2009 ◽  
Vol 7 (4) ◽  
pp. 846-856 ◽  
Author(s):  
Andrey Toropov ◽  
Alla Toropova ◽  
Emilio Benfenati

Abstract: Usually, QSPR is not used to model organometallic compounds. We have modeled the octanol/water partition coefficient for organometallic compounds of Na, K, Ca, Cu, Fe, Zn, Ni, As, and Hg by optimal descriptors calculated with simplified molecular input line entry system (SMILES) notations. The best model is characterized by the following statistics: n = 54, r² = 0.9807, s = 0.677, F = 2636 (training set); n = 26, r² = 0.9693, s = 0.969, F = 759 (test set). Empirical criteria for the definition of the applicability domain for these models are discussed.


2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Random sampling is traditionally used to divide datasets. The problem, however, is that the measured performance of the model depends on how the training and test sets are divided. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate train/test splits using the R-value-based sampling method. We then evaluated how similar the distribution of each candidate was to that of the whole dataset, and selected the candidate with the smallest distribution difference as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more suitable training and test sets than previous sampling methods, including random and non-random sampling.
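The selection step described above (generate candidate splits, then keep the one whose distribution is closest to the whole dataset) can be sketched as follows. This is a simplified illustration under several assumptions: candidates are produced by plain random shuffling rather than the R-value-based sampling of the paper, only a single numeric feature is compared, similarity is measured as the L1 distance between normalized histograms, and feature importance is not used. All function names and parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def distance(a: Sequence[float], b: Sequence[float], lo: float, hi: float, bins: int) -> float:
    """L1 distance between the normalized histograms of two samples."""
    ha, hb = histogram(a, lo, hi, bins), histogram(b, lo, hi, bins)
    return sum(abs(p - q) for p, q in zip(ha, hb))


def best_split(
    values: Sequence[float],
    test_fraction: float = 0.3,
    candidates: int = 50,
    bins: int = 10,
    seed: int = 0,
) -> Tuple[float, List[int], List[int]]:
    """Return (score, train_indices, test_indices) of the closest candidate split."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    n_test = max(1, round(len(values) * test_fraction))
    indices = list(range(len(values)))
    best = None
    for _ in range(candidates):
        rng.shuffle(indices)  # random candidates stand in for R-value-based sampling
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        train = [values[i] for i in train_idx]
        test = [values[i] for i in test_idx]
        # Lower score means both parts resemble the whole dataset more closely.
        score = distance(train, values, lo, hi, bins) + distance(test, values, lo, hi, bins)
        if best is None or score < best[0]:
            best = (score, sorted(train_idx), sorted(test_idx))
    return best
```

Fixing the seed makes the chosen split reproducible, which matters here because the point of the method is to reduce the dependence of the evaluation on an arbitrary split.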


Author(s):  
Rui Guo ◽  
Xiaobin Hu ◽  
Haoming Song ◽  
Pengpeng Xu ◽  
Haoping Xu ◽  
...  

Abstract Purpose: To develop a weakly supervised deep learning (WSDL) method that could utilize incomplete/missing survival data to predict the prognosis of extranodal natural killer/T cell lymphoma, nasal type (ENKTL) based on pretreatment 18F-FDG PET/CT results. Methods: One hundred and sixty-seven patients with ENKTL who underwent pretreatment 18F-FDG PET/CT were retrospectively collected. Eighty-four patients were followed up for at least 2 years (training set = 64, test set = 20). A WSDL method was developed to enable the integration of the remaining 83 patients with incomplete/missing follow-up information in the training set. To test generalization, these data were derived from three types of scanners. Prediction similarity index (PSI) was derived from deep learning features of images. Its discriminative ability was calculated and compared with that of a conventional deep learning (CDL) method. Univariate and multivariate analyses helped explore the significance of PSI and clinical features. Results: PSI achieved area under the curve scores of 0.9858 and 0.9946 (training set) and 0.8750 and 0.7344 (test set) in the prediction of progression-free survival (PFS) with the WSDL and CDL methods, respectively. A PSI threshold of 1.0 could significantly differentiate the prognosis. In the test set, WSDL and CDL achieved prediction sensitivity, specificity, and accuracy of 87.50% and 62.50%, 83.33% and 83.33%, and 85.00% and 75.00%, respectively. Multivariate analysis confirmed PSI to be an independent significant predictor of PFS in both methods. Conclusion: The WSDL-based framework was more effective for extracting 18F-FDG PET/CT features and predicting the prognosis of ENKTL than the CDL method.


2018 ◽  
Vol 19 (11) ◽  
pp. 3423 ◽  
Author(s):  
Ting Wang ◽  
Lili Tang ◽  
Feng Luan ◽  
M. Natália D. S. Cordeiro

Organic compounds are often present in the environment, and adversely affect the environment and human health, as mixtures rather than as single chemicals. In this paper, we aim to establish reliable classical quantitative structure–activity relationship (QSAR) models to evaluate the toxicity of 99 binary mixtures. The QSAR models were built by forward stepwise multiple linear regression (MLR) and by nonlinear radial basis function neural networks (RBFNNs), both using the hypothetical descriptors. The statistical parameters of the MLR model were N (number of compounds in training set) = 79, R² (the correlation coefficient between the predicted and observed activities) = 0.869, LOO q² (leave-one-out correlation coefficient) = 0.864, F (Fisher's test) = 165.494, and RMS (root mean square) = 0.599 for the training set, and N_ext (number of compounds in external test set) = 20, R² = 0.853, q²_ext (leave-one-out correlation coefficient for test set) = 0.825, F = 30.861, and RMS = 0.691 for the external test set. The RBFNN model gave N = 79, R² = 0.925, LOO q² = 0.924, F = 950.686, and RMS = 0.447 for the training set, and N_ext = 20, R² = 0.896, q²_ext = 0.890, F = 155.424, and RMS = 0.547 for the external test set. Both the MLR and RBFNN models were evaluated by several statistical parameters and methods. The results confirm that the models are acceptable and can be used to predict the toxicity of the binary mixtures.


2021 ◽  
Author(s):  
Xiaokai Yan ◽  
Chiying Xiao ◽  
Kunyan Yue ◽  
Min Chen ◽  
Hang Zhou

Abstract Background: Changes in the genome play a crucial role in carcinogenesis, and many biomarkers can be used as effective prognostic indicators in diverse tumors. Although many studies have constructed predictive models for hepatocellular carcinoma (HCC) based on molecular signatures, their performance is unsatisfactory. To address this shortcoming, we aimed to construct a novel and accurate prognostic model with multi-omics data to guide prognostic assessment of HCC. Methods: The TCGA training set was used to identify crucial biomarkers and construct single-omic prognostic models through difference analysis, univariate Cox, and LASSO/stepwise Cox analysis. The performance of the single-omic models was then evaluated and validated through survival analysis, Harrell's concordance index (C-index), and receiver operating characteristic (ROC) curves in the TCGA test set and external cohorts. In addition, a comprehensive model based on multi-omics data was constructed via multiple Cox analysis, and its performance was evaluated in the TCGA training set and TCGA test set. Results: We identified 16 key mRNAs, 20 key lncRNAs, 5 key miRNAs, 5 key CNV genes, and 7 key SNPs that were significantly associated with the prognosis of HCC, and constructed 5 single-omic models that showed relatively good prognostic performance, with C-index values ranging from 0.63 to 0.75 in the TCGA training and test sets. We also validated the mRNA model and the SNP model in two independent external datasets, respectively, and observed good discriminating ability through survival analysis (P < 0.05). Moreover, the multi-omics model based on mRNA, lncRNA, miRNA, CNV, and SNP information showed strong predictive ability, with a C-index over 0.80 and all AUC values at 1, 3, and 5 years above 0.84. Conclusion: In this study, we identified many biomarkers that may help in studying the underlying carcinogenesis mechanisms of HCC, and constructed five single-omic models and an integrated multi-omics model that may provide effective and reliable guidance for prognosis assessment and treatment decision-making.
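Harrell's concordance index, used above to compare the models, can be sketched in a few lines. This is a simplified illustration, not the study's code: a pair of patients is treated as comparable only when the patient with the strictly shorter follow-up time experienced the event, tied risk scores count as half-concordant, and pairs with tied times are ignored. The function name and the sample data are hypothetical.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Simplified Harrell's C-index for right-censored survival data.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : model risk score, higher meaning worse predicted prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i had the event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # higher risk failed earlier
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort of four patients; the third is censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.5, 0.7, 0.9])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported range of 0.63 to 0.75 for the single-omic models and the value above 0.80 for the multi-omics model.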


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yong Zhu ◽  
Yingfan Mao ◽  
Jun Chen ◽  
Yudong Qiu ◽  
Yue Guan ◽  
...  

Abstract: This study investigated the ability of a CT-based radiomics signature to predict, pre- and postoperatively, the early recurrence of intrahepatic mass-forming cholangiocarcinoma (IMCC), and developed radiomics-based prediction models. The institutional review board approved this study. Clinicopathological characteristics, contrast-enhanced CT images, and radiomics features of 125 IMCC patients (35 with early recurrence and 90 with non-early recurrence) were retrospectively reviewed. In the training set of 92 patients, a preoperative model, a pathological model, and a combined model were developed by multivariate logistic regression analysis to predict the early recurrence (≤ 6 months) of IMCC, and the prediction performance of the models was compared using the DeLong test. The developed models were validated by assessing their prediction performance in a test set of 33 patients. Multivariate logistic regression analysis identified solitary, differentiation, energy-arterial phase (AP), inertia-AP, and percentile50th-portal venous phase (PV) as the variables of the combined model for predicting early recurrence of IMCC [area under the curve (AUC) = 0.917; 95% CI 0.840–0.965]. The AUCs of the pathological model and the preoperative model were 0.741 (95% CI 0.637–0.828) and 0.844 (95% CI 0.751–0.912), respectively. The AUC of the combined model was significantly higher than that of the preoperative model (p = 0.049) or the pathological model (p = 0.002) in the training set. In the test set, the combined model also showed higher prediction performance. The CT-based radiomics signature is a powerful predictor of early recurrence of IMCC. The preoperative model (constructed with homogeneity-AP and standard deviation-AP) and the combined model (constructed with solitary, differentiation, energy-AP, inertia-AP, and percentile50th-PV) can improve the accuracy of pre- and postoperative prediction of early recurrence of IMCC.


2018 ◽  
Vol 6 (1) ◽  
pp. 1
Author(s):  
Qomariyatul Hasanah ◽  
Anang Andrianto ◽  
Muhammad Arief Hidayat

An information system for the posyandu (integrated health service post) for pregnant women can manage maternal health data related to pregnancy risk factors. Midwives use the pregnancy risk factors defined in the Poedji Rochyati Score Card (KSPR) to determine pregnancy risk by assigning a score to each parameter. A weakness of the KSPR is that it cannot score parameters that are not yet certain, so a parameter that is not known with certainty is treated as not occurring. Reading data patterns, a concept adopted from data mining, with the naïve Bayes classification method can offer an alternative that addresses this weakness by classifying pregnancy risk. The naïve Bayes method computes the probability of a given parameter from data of an earlier period designated as training data; from these calculations, the pregnancy risk can be determined accurately according to the parameters that are known. Naïve Bayes was chosen because it has fairly high accuracy compared with other classification methods. The system is web-based so that it can be accessed easily by posyandu in different locations, and it was built following the Waterfall model. It was designed and built with three access roles (admin, midwife, and cadre), each with features that make the system easier to use. The result of this research is a posyandu information system for pregnant women that classifies pregnancy risk using naïve Bayes, with an accuracy of 53.913% using 17 attributes, 54.348% using 19 attributes, 54.783% using 21 attributes, and 56.957% using 22 attributes. Classification accuracy was measured with ten-fold cross-validation, in which the training set is divided into 10 groups; when group 1 serves as the test set, groups 2 through 10 form the training set.
Keywords: Posyandu, Pregnancy Risk, Waterfall, Data Mining, Classification, Naïve Bayes
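The ten-fold cross-validation scheme described in this abstract (ten groups, each serving once as the test set while the other nine form the training set) can be sketched as an index generator. This is a minimal illustration with a hypothetical function name and sample count; it splits the data in order without shuffling or stratification, which a real evaluation would usually add.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]
        # Every other fold contributes to the training set for this round.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset of 23 samples split into 10 folds.
splits = list(k_fold_splits(23, k=10))
```

The reported accuracy would then be the average of the per-fold accuracies, so every sample is used for testing exactly once.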

