Application of machine learning algorithms in MBR simulation under big data platform

2020 ◽  
Vol 15 (4) ◽  
pp. 1238-1247
Author(s):  
Weiwei Li ◽  
Chunqing Li ◽  
Tao Wang

Abstract A membrane bioreactor (MBR) is a sewage treatment process that combines membrane separation with bioreactor technology, and it offers great advantages in sewage treatment. However, membrane fouling hinders the development of the MBR process. Studies have shown that the degree of membrane fouling can be judged from the membrane flux rate. In this study, principal component analysis was used to extract the main factors affecting membrane fouling, and the random forest algorithm on the Hadoop big data platform was then used to establish and test an MBR membrane flux prediction model. To verify the model's effectiveness, BP neural network and support vector machine (SVM) models were established using the same experimental data. Comparing the experimental results of the different models showed that the random forest algorithm gave the best MBR membrane flux predictions.
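
The averaging idea behind a random forest regressor can be sketched in a few lines. This is a heavily simplified stand-in, not the paper's Hadoop implementation: each "tree" is a one-split stump fit on a bootstrap sample using one randomly chosen feature, and the PCA step is omitted. The two input features and the flux values are hypothetical toy data.

```python
import random
import statistics


def fit_stump(rows, targets, feature):
    """Best single-threshold split on one feature, by squared error."""
    best = None
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= thr]
        right = [t for r, t in zip(rows, targets) if r[feature] > thr]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    if best is None:  # feature is constant in this bootstrap sample
        mean = statistics.fmean(targets)
        return feature, float("inf"), mean, mean
    return feature, best[1], best[2], best[3]


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)         # random feature for this stump
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Average the stump outputs, as a forest averages its trees."""
    return statistics.fmean([lm if row[f] <= thr else rm
                             for f, thr, lm, rm in forest])


# Hypothetical data: flux falls as two fouling-related inputs rise.
rows = [[float(i), 2.0 * i] for i in range(10)]
targets = [100.0 - 5.0 * i for i in range(10)]
forest = fit_forest(rows, targets)
pred_low = predict(forest, rows[0])    # lowest inputs
pred_high = predict(forest, rows[-1])  # highest inputs
```

A full random forest grows deep trees and samples a feature subset at every split; the sketch keeps only the bootstrap-and-average structure.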

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yao Huimin

With the development of cloud computing and distributed cluster technology, the concept of big data has expanded in terms of capacity and value, and machine learning has received unprecedented attention in recent years. Traditional machine learning algorithms cannot be parallelized effectively, so a parallelized support vector machine based on the Spark big data platform is proposed. First, the big data platform is designed with the Lambda architecture, which is divided into three layers: Batch Layer, Serving Layer, and Speed Layer. Second, to improve the training efficiency of support vector machines on large-scale data, the merging of two support vector machines considers “special points” in addition to the support vectors, that is, the non-support vectors in one subset that violate the training result of the other subset, and a cross-validation merging algorithm is proposed on this basis. Then, a parallelized support vector machine based on cross-validation is proposed and implemented on the Spark platform. Finally, experiments on different datasets verify the effectiveness and stability of the proposed method. The experimental results show that the proposed parallelized support vector machine performs well in speed-up ratio, training time, and prediction accuracy.
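
The "special points" used in the merge step can be illustrated with a minimal sketch. It assumes linear decision functions f(x) = w·x + b and treats a point as violating a model when its margin y·f(x) falls below 1. These are assumptions for illustration, as are all function names and the toy data; this is not the paper's Spark implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[Sequence[float], int]  # (feature vector, label in {-1, +1})


def margin(w: Sequence[float], b: float, x: Sequence[float], y: int) -> float:
    """Functional margin y * (w . x + b) of one labeled point."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def violators(w, b, points: List[Point], tol: float = 1e-9) -> List[Point]:
    """Points that the model (w, b) does not classify with margin >= 1."""
    return [(x, y) for x, y in points if margin(w, b, x, y) < 1.0 - tol]


def merge_candidates(sv_a, rest_a, model_a, sv_b, rest_b, model_b):
    """Training set for the merged SVM: both sets of support vectors plus
    the non-support vectors of each subset that violate the other model."""
    special_a = violators(model_b[0], model_b[1], rest_a)
    special_b = violators(model_a[0], model_a[1], rest_b)
    return sv_a + sv_b + special_a + special_b


# Hypothetical sub-models (w, b) and their data subsets.
model_a = ([1.0, 0.0], 0.0)
model_b = ([0.0, 1.0], 0.0)
sv_a = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
rest_a = [([2.0, 2.0], 1), ([3.0, -0.5], 1)]
sv_b = [([0.0, 1.0], 1), ([0.0, -1.0], -1)]
rest_b = [([0.2, 3.0], 1), ([-2.0, -2.0], -1)]

merged = merge_candidates(sv_a, rest_a, model_a, sv_b, rest_b, model_b)
```

Only the violating non-support vectors are carried forward, so the merged training set stays much smaller than the union of both subsets.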


2019 ◽  
Vol 20 (S2) ◽  
Author(s):  
Varun Khanna ◽  
Lei Li ◽  
Johnson Fung ◽  
Shoba Ranganathan ◽  
Nikolai Petrovsky

Abstract Background Toll-like receptor 9 (TLR9) is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODN) containing unmethylated cytosine-guanine (CpG) motifs. Due to the considerable number of rotatable bonds in ODNs, high-throughput in silico screening for potential TLR9 activity via traditional structure-based virtual screening approaches of CpG ODNs is challenging. In the current study, we present a machine learning based method for predicting novel mouse TLR9 (mTLR9) agonists based on features including count and position of motifs, the distance between the motifs and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling. Results Using in-house experimental TLR9 activity data, we found that the random forest algorithm outperformed other algorithms on our dataset for TLR9 activity prediction. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier in test samples were 0.61 and 80.0%, respectively, with the maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed that common sequence motifs including ‘CC’, ‘GG’, ‘AG’, ‘CCCG’ and ‘CGGC’ were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked and the top 100 ODNs were synthesized and experimentally tested for activity in an mTLR9 reporter cell assay, with 91 of the 100 selected ODNs showing high activity, confirming the accuracy of the model in predicting mTLR9 activity.
Conclusion We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed other machine learning algorithms including support vector machines, shrinkage discriminant analysis, gradient boosting machine and neural networks. Due to its predictive performance and simplicity, the random forest technique is a useful method for prediction of mTLR9 ODN agonists.
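
The two reported metrics and the down-sampling step can be sketched as follows. The confusion matrix, the class sizes, and the "active"/"inactive" labels are made up for illustration and are not the paper's data. Only the count of 20 ensemble members comes from the abstract.

```python
import math
import random


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall on positives) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from a 2x2 confusion matrix; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def down_sample(minority, majority, rng):
    """One balanced training set: all minority items plus an equally
    sized random draw (without replacement) from the majority class."""
    return minority + rng.sample(majority, len(minority))


# Hypothetical confusion matrix, not taken from the paper.
tp, tn, fp, fn = 40, 45, 5, 10
bal_acc = balanced_accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)

# 20 balanced subsets, one per ensemble member (class sizes are made up).
rng = random.Random(0)
minority = [("active", i) for i in range(10)]
majority = [("inactive", i) for i in range(90)]
subsets = [down_sample(minority, majority, rng) for _ in range(20)]
```

Both metrics stay informative under class imbalance, which plain accuracy does not, and this is presumably why the authors report them.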


2014 ◽  
Vol 2014 ◽  
pp. 1-8
Author(s):  
Chunqing Li ◽  
Zixiang Yang ◽  
Hongying Yan ◽  
Tao Wang

Predicting MBR membrane flux as an indicator of membrane fouling is an important issue in sewage treatment today. This paper first used principal component analysis to reduce the dimensionality of the input variables and remove the correlation among them, obtaining the three factors that most strongly affect membrane fouling: MLSS, total resistance, and operating pressure. It then used a BP neural network to establish an intelligent simulation model of the MBR system, relating these three parameters to the membrane flux, which characterizes the degree of membrane fouling. However, the BP neural network trains slowly, is sensitive to its initial weights and thresholds, and easily falls into local minima. The paper therefore used a genetic algorithm (GA) to optimize the initial weights and thresholds of the BP neural network and established a membrane fouling prediction model based on the GA-BP network. The results show that, under the same conditions, the GA-optimized BP network model predicts membrane flux better than the unoptimized one, which demonstrates that the GA-BP network model is more suitable than the plain BP network for simulating the MBR membrane fouling process.
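
The genetic-algorithm step can be sketched in isolation. The sketch assumes a real-valued chromosome holding the initial weights and threshold, with truncation selection, uniform crossover, Gaussian mutation, and elitism. None of these operator choices is stated in the abstract. The fitness is a stand-in (squared error of a single linear neuron on toy data) rather than the paper's BP network.

```python
import random

rng = random.Random(42)

# Toy data generated from y = 2*x1 - x2 + 0.5; hypothetical, for illustration only.
DATA = [((x1, x2), 2 * x1 - x2 + 0.5)
        for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 0.5, 1.0)]


def loss(chrom):
    """Mean squared error of a linear neuron; chrom = [w1, w2, bias]."""
    w1, w2, bias = chrom
    return sum((w1 * x1 + w2 * x2 + bias - y) ** 2
               for (x1, x2), y in DATA) / len(DATA)


def evolve(pop_size=30, generations=60, mut_rate=0.2, mut_scale=0.3):
    """Return the best chromosome found and the best loss of the initial population."""
    pop = [[rng.uniform(-3, 3) for _ in range(3)] for _ in range(pop_size)]
    initial_best = min(loss(c) for c in pop)
    for _ in range(generations):
        pop.sort(key=loss)
        parents = pop[: pop_size // 2]   # truncation selection
        children = [pop[0][:]]           # elitism: keep the best unchanged
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            child = [g + rng.gauss(0, mut_scale) if rng.random() < mut_rate else g
                     for g in child]     # Gaussian mutation
            children.append(child)
        pop = children
    return min(pop, key=loss), initial_best


best_weights, initial_best_loss = evolve()
final_loss = loss(best_weights)
```

In a GA-BP setup, the returned chromosome would seed gradient-based BP training rather than replace it.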


2020 ◽  
Author(s):  
Chuanxin Qiu

This paper uses a random forest model to quantify and predict the monetary policy of the People's Bank of China, taking 16 macroeconomic indicators as input. The model is compared with three other machine learning algorithms (CART decision tree, support vector machine, and neural network), a discrete choice model, and a combined prediction model. The results show that the random forest algorithm predicts the direction of the central bank's monetary policy more accurately.


2020 ◽  
Vol 1 (1) ◽  
pp. 42-50
Author(s):  
Hanna Arini Parhusip ◽  
Bambang Susanto ◽  
Lilik Linawati ◽  
Suryasatriya Trihandaru ◽  
Yohanes Sardjono ◽  
...  

This article studies several machine learning algorithms applied to breast cancer data with 33 features from 569 samples. The purpose of the research is to identify the best algorithm for classifying breast cancer. Because the features differ widely in scale and range, the data are transformed before classification. The classification methods used are logistic regression, k-nearest neighbor, naive Bayes, support vector machine, decision tree, and random forest. The original and transformed data are first classified with a test-set fraction of 0.3. With this split, the SVM and naive Bayes algorithms show no improvement in accuracy, and random forest gives the best accuracy of all. The test-set fraction is therefore reduced to 0.25, which improves all algorithms on the transformed data. Random forest still gives the best accuracy.
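
The preprocessing and splitting steps can be sketched as follows. The abstract does not say which transformation was applied, so min-max scaling is an assumption here, and the helper names and toy rows are hypothetical. Only the sample count (569) and the test fractions (0.3 and 0.25) come from the abstract.

```python
import random


def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
            for row in rows]


def hold_out_split(rows, test_size, seed=0):
    """Shuffle row indices and reserve a `test_size` fraction for testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_size)
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test


# Hypothetical rows with features on very different scales.
rows = [[0.1, 1200.0], [0.4, 300.0], [0.9, 2500.0], [0.5, 800.0]]
scaled = min_max_scale(rows)

# Split sizes for 569 samples at the two test fractions in the abstract.
samples = [[float(i)] for i in range(569)]
train_30, test_30 = hold_out_split(samples, 0.3)
train_25, test_25 = hold_out_split(samples, 0.25)
```

In practice the scaling parameters would be computed on the training portion only and then applied to the test portion, to avoid leaking test information.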


2021 ◽  
Vol 6 (2) ◽  
pp. 213
Author(s):  
Nadya Intan Mustika ◽  
Bagus Nenda ◽  
Dona Ramadhan

This study aims to implement a machine learning algorithm for detecting fraud based on a historical data set from a retail consumer financing company. The output of the machine learning model is used as samples for the fraud detection team. Data analysis is performed through data processing, feature selection, the hold-out method, and accuracy testing. Five machine learning methods are applied in this study: Logistic Regression, K-Nearest Neighbor (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM). Historical data are divided into two groups: training data and test data. The results show that the Random Forest algorithm has the highest accuracy, with a training score of 0.994999 and a test score of 0.745437. This means that the Random Forest algorithm is the most accurate method for detecting fraud. Further research is suggested to add more predictor variables to increase the accuracy and to apply this method to different financial institutions and industries.


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Yantao Ma ◽  
Jun Ji ◽  
Yun Huang ◽  
Huimin Gao ◽  
Zhiying Li ◽  
...  

Abstract Bipolar disorder (BPD) is often confused with major depression, and current diagnostic questionnaires are subjective and time intensive. The aim of this study was to develop a new Bipolar Diagnosis Checklist in Chinese (BDCC) by using machine learning to shorten the Affective Disorder Evaluation scale (ADE) based on an analysis of registered Chinese multisite cohort data. In order to evaluate the importance of each item of the ADE, a case-control study of 360 BPD patients, 255 major depressive disorder (MDD) patients and 228 healthy (no psychiatric diagnosis) controls (HCs) was conducted, spanning 9 Chinese health facilities participating in the Comprehensive Assessment and Follow-up Descriptive Study on Bipolar Disorder (CAFÉ-BD). The BDCC was formed by selected items from the ADE according to their importance as calculated by a random forest machine learning algorithm. Five classical machine learning algorithms, namely, a random forest algorithm, support vector regression (SVR), the least absolute shrinkage and selection operator (LASSO), linear discriminant analysis (LDA) and logistic regression, were used to retrospectively analyze the aforementioned cohort data to shorten the ADE. Regarding the area under the receiver operating characteristic (ROC) curve (AUC), the BDCC had high AUCs of 0.948, 0.921, and 0.923 for the diagnosis of MDD, BPD, and HC, respectively, despite containing only 15% (17/113) of the items from the ADE. Traditional scales can be shortened using machine learning analysis. By shortening the ADE using a random forest algorithm, we generated the BDCC, which can be more easily applied in clinical practice to effectively enhance both BPD and MDD diagnosis.


Identification of drug-target interactions (DTIs) is an important challenge for research and development in the pharmaceutical industry. Biomedical researchers have moved from in vitro and in vivo experiments to in silico methods for faster results. Recently, machine learning algorithms have become very popular for DTI prediction. This paper presents an ensemble approach, the random forest algorithm, for DTI prediction. The performance of the proposed approach is evaluated against matrix factorization, genetic algorithm, support vector machines, k-nearest neighbor, decision trees, and logistic regression on 4 benchmark datasets with diverse properties. The algorithms are evaluated on accuracy and average ranking. The results establish that the random forest algorithm is more suitable for DTI prediction than the other algorithms.
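
The average-ranking criterion can be sketched as follows: on each dataset the algorithms are ranked by accuracy (rank 1 is best, ties share the mean rank), and the ranks are then averaged across datasets. The accuracies, dataset names, and the three-algorithm subset are made up for illustration, and the tie-handling rule is an assumption.

```python
def average_ranks(scores):
    """scores: {dataset: {algorithm: accuracy}}. Rank 1 = best on a dataset;
    ties share the mean of the ranks they span. Returns {algorithm: mean rank}."""
    totals = {}
    for per_algo in scores.values():
        ordered = sorted(per_algo.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared = (i + 1 + j + 1) / 2  # mean of ranks i+1 .. j+1
            for name, _ in ordered[i:j + 1]:
                totals[name] = totals.get(name, 0.0) + shared
            i = j + 1
    return {name: total / len(scores) for name, total in totals.items()}


# Made-up accuracies on four hypothetical datasets, for illustration only.
scores = {
    "D1": {"RF": 0.91, "SVM": 0.88, "KNN": 0.85},
    "D2": {"RF": 0.86, "SVM": 0.87, "KNN": 0.80},
    "D3": {"RF": 0.93, "SVM": 0.90, "KNN": 0.90},
    "D4": {"RF": 0.89, "SVM": 0.84, "KNN": 0.86},
}
ranks = average_ranks(scores)
```

A lower average rank means more consistent performance across datasets, even when an algorithm is not the best on every one of them.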

