Assessment of predictive models for estimation of water consumption in public preschool buildings

Author(s):  
Nebojša M. Jurišević ◽  
Dušan R. Gordić ◽  
Vladimir Vukašinović ◽  
Arso M. Vukicevic ◽  
...  

Preschool buildings are among the biggest water consumers in the public buildings sector, and efficient management of their water consumption could yield considerable savings in city budgets. The aim of this study was twofold: 1) to assess the prognostic performance of 21 parameters that influence water consumption and 2) to compare two approaches (statistical and machine-learning-based), comprising six predictive models, for estimating water consumption from the observed parameters. The data set was collected from all public preschool buildings in the city of Kragujevac, Serbia, over a three-year period. The top-performing statistical model was Multiple Linear Regression, while the best machine learning method was Random Forest. Random Forest achieved the best overall performance, while Multiple Linear Regression matched its precision for buildings that consume more than 200 m³/month. Both methods provide satisfactory estimates, leaving potential users to choose between better performance (Random Forest) and usability (Multiple Linear Regression).
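The two front-runners can be contrasted in a small sketch (not the study's code; the features and coefficients below are synthetic stand-ins for the paper's 21 observed parameters):

```python
# Toy comparison of Multiple Linear Regression vs. Random Forest on
# synthetic monthly-consumption data; feature names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(size=(n, 3))  # e.g. occupancy, floor area, outdoor temperature
y = 50 + 120 * X[:, 0] + 80 * X[:, 1] - 30 * X[:, 2] + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
mlr = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("MLR R2:", r2_score(y_te, mlr.predict(X_te)))
print("RF  R2:", r2_score(y_te, rf.predict(X_te)))
```

On data this close to linear the regression equation is also directly interpretable, which is the usability argument the abstract makes.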

2021 ◽  
Vol 931 (1) ◽  
pp. 012013
Author(s):  
Le Thi Nhut Suong ◽  
A V Bondarev ◽  
E V Kozlova

Abstract Geochemical studies of organic matter in source rocks play an important role in predicting the oil and gas accumulation of any territory, especially in oil and gas shale. For a deeper understanding, pyrolytic analyses are often carried out on samples before and after extraction of hydrocarbons with chloroform. However, extraction is a laborious and time-consuming process that doubles the workload of laboratory equipment and the time required. In this work, machine learning regression algorithms are applied to forecast S2ex from the pyrolytic results of non-extracted samples. The study uses more than 300 samples from three different wells in the Bazhenov formation, Western Siberia. For developing a prediction model, five machine learning regression algorithms, Multiple Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree and Random Forest, were tested and compared, with performance measured by the R-squared coefficient. Data from well X2 were used to build the model, divided into two parts: 80% for training and 20% for testing. The model was then used to predict wells X1 and X3, and these predictions were compared with the results obtained from standard experiments. Despite the limited amount of data, the results exceeded expectations. They also show that the relationships between pre- and post-extraction parameters are complex and non-linear: the R² values for Multiple Linear Regression and Polynomial Regression are negative, meaning those models perform worse than simply predicting the mean, whereas Random Forest and Decision Tree perform well. The same approach can be applied to predict other geochemical parameters by depth or be used with well-logging data.
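The contrast the abstract draws, a negative R² for the linear models versus good tree-based performance, can be reproduced on toy data (synthetic, not the Bazhenov samples; the 80/20 split mirrors the paper's setup):

```python
# Linear models fail (R2 near or below zero) on a strongly non-linear
# target, while a random forest captures it; synthetic illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (400, 2))
# Non-linear relationship with almost no linear component.
y = np.sin(3 * X[:, 0]) * np.cos(3 * X[:, 1]) + rng.normal(0, 0.05, 400)

# 80% for training, 20% for testing, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

lin_r2 = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
rf_r2 = r2_score(y_te, RandomForestRegressor(n_estimators=200, random_state=1)
                 .fit(X_tr, y_tr).predict(X_te))
# R2 <= 0 means the model does no better than predicting the mean.
print("linear R2:", lin_r2, " random forest R2:", rf_r2)
```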


2019 ◽  
Vol 8 (9) ◽  
pp. 382 ◽  
Author(s):  
Marcos Ruiz-Álvarez ◽  
Francisco Alonso-Sarria ◽  
Francisco Gomariz-Castillo

Several methods have been tried to estimate air temperature using satellite imagery. In this paper, the results of two machine learning algorithms, Support Vector Machines and Random Forest, are compared with Multiple Linear Regression and Ordinary Kriging. Several geographic, remote sensing and time variables are used as predictors. The validation is carried out using two different approaches, a leave-one-out cross-validation in the spatial domain and a spatio-temporal k-block cross-validation, and four different statistics on a daily basis, allowing the use of ANOVA to compare the results. The main conclusion is that Random Forest produces the best results (R² = 0.888 ± 0.026, root mean square error = 3.01 ± 0.325 using k-block cross-validation). The regression methods (Support Vector Machine, Random Forest and Multiple Linear Regression) are calibrated with MODIS data and several predictors easily calculated from a Digital Elevation Model. The most important variables in the Random Forest model were satellite temperature, potential irradiation and cdayt, a cosine transformation of the Julian day.
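The spatial k-block validation can be sketched with scikit-learn's `GroupKFold`, using station identifiers as blocks so no station appears in both training and test folds (synthetic data; the real predictors come from MODIS imagery and a DEM):

```python
# Spatial block cross-validation: hold out whole stations per fold.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(2)
n_stations, per_station = 20, 30
station = np.repeat(np.arange(n_stations), per_station)
# Hypothetical predictors: satellite temperature, elevation, day-of-year cosine.
X = rng.uniform(size=(n_stations * per_station, 3))
y = 15 + 10 * X[:, 0] - 6 * X[:, 1] + 3 * X[:, 2] + rng.normal(0, 1, len(station))

cv = GroupKFold(n_splits=4)  # each fold withholds entire stations
scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=2),
                         X, y, groups=station, cv=cv, scoring="r2")
print("per-fold R2:", scores, " mean:", scores.mean())
```

Blocking by station gives a more honest estimate of performance at unseen locations than a random (leave-one-out) split, which is why the paper reports both.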


2020 ◽  
Vol 12 (5) ◽  
pp. 41-51
Author(s):  
Shaimaa Mahmoud ◽  
Mahmoud Hussein ◽  
Arabi Keshk

Opinion mining of social network data is considered one of the most important research areas because large numbers of users interact with different topics on such networks. This paper discusses the problem of predicting future product ratings from users’ comments. Researchers have approached this problem with machine learning algorithms (e.g. Logistic Regression, Random Forest Regression, Support Vector Regression, Simple Linear Regression, Multiple Linear Regression, Polynomial Regression and Decision Tree). However, the accuracy of these techniques still needs to be improved. In this study, we introduce an approach for predicting future product ratings using LR, RFR, and SVR. Our data set consists of tweets and their ratings from 1 to 5. The main goal of our approach is to improve prediction accuracy over existing techniques. SVR can predict the future product rating with a Mean Squared Error (MSE) of 0.4122, the Linear Regression model with an MSE of 0.4986, and Random Forest Regression with an MSE of 0.4770, which is better than the accuracy of existing approaches.
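The model comparison the abstract reports amounts to fitting each regressor and ranking test-set MSE; a minimal sketch (hypothetical sentiment features, not the tweet data set):

```python
# Compare SVR, Linear Regression and Random Forest Regression by MSE
# on a synthetic 1-to-5 product-rating target.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(size=(500, 4))  # hypothetical per-product sentiment features
y = np.clip(1 + 2.4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.3, 500), 1, 5)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)
models = {
    "SVR": SVR(),
    "LR": LinearRegression(),
    "RFR": RandomForestRegressor(n_estimators=200, random_state=3),
}
mse = {name: mean_squared_error(y_te, m.fit(X_tr, y_tr).predict(X_te))
       for name, m in models.items()}
print(mse)  # lower MSE = better rating prediction
```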


2019 ◽  
Vol 12 (8) ◽  
pp. 4211-4239 ◽  
Author(s):  
Sharad Vikram ◽  
Ashley Collier-Oxandale ◽  
Michael H. Ostertag ◽  
Massimiliano Menarini ◽  
Camron Chermak ◽  
...  

Abstract. Advances in ambient environmental monitoring technologies are enabling concerned communities and citizens to collect data to better understand their local environment and potential exposures. These mobile, low-cost tools make it possible to collect data with increased temporal and spatial resolution, providing data on a large scale with unprecedented levels of detail. This type of data has the potential to empower people to make personal decisions about their exposure and support the development of local strategies for reducing pollution and improving health outcomes. However, calibration of these low-cost instruments has been a challenge. Often, a sensor package is calibrated via field calibration. This involves colocating the sensor package with a high-quality reference instrument for an extended period and then applying machine learning or other model fitting technique such as multiple linear regression to develop a calibration model for converting raw sensor signals to pollutant concentrations. Although this method helps to correct for the effects of ambient conditions (e.g., temperature) and cross sensitivities with nontarget pollutants, there is a growing body of evidence that calibration models can overfit to a given location or set of environmental conditions on account of the incidental correlation between pollutant levels and environmental conditions, including diurnal cycles. As a result, a sensor package trained at a field site may provide less reliable data when moved, or transferred, to a different location. This is a potential concern for applications seeking to perform monitoring away from regulatory monitoring sites, such as personal mobile monitoring or high-resolution monitoring of a neighborhood. We performed experiments confirming that transferability is indeed a problem and show that it can be improved by collecting data from multiple regulatory sites and building a calibration model that leverages data from a more diverse data set. 
We deployed three sensor packages to each of three sites with reference monitors (nine packages total) and then rotated the sensor packages through the sites over time. Two sites were in San Diego, CA, with a third outside of Bakersfield, CA, offering varying environmental conditions, general air quality composition, and pollutant concentrations. When compared to prior single-site calibration, the multisite approach exhibits better model transferability for a range of modeling approaches. Our experiments also reveal that random forest is especially prone to overfitting and confirm prior results that transfer is a significant source of both bias and standard error. Linear regression, on the other hand, although it exhibits relatively high error, does not degrade much in transfer. Bias dominated in our experiments, suggesting that transferability might be easily increased by detecting and correcting for bias. Also, given that many monitoring applications involve the deployment of many sensor packages based on the same sensing technology, there is an opportunity to leverage the availability of multiple sensors at multiple sites during calibration to lower the cost of training and better tolerate transfer. We contribute a new neural network architecture, termed split-NN, that splits the model into two stages, in which the first stage corrects for sensor-to-sensor variation and the second stage uses the combined data of all the sensors to build a model for a single sensor package. The split-NN modeling approach outperforms multiple linear regression, traditional two- and four-layer neural networks, and random forest models. Depending on the training configuration, compared to random forest the split-NN method reduced error by 0 %–11 % for NO2 and 6 %–13 % for O3.
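The core transferability finding can be illustrated with a toy sketch: a calibration model trained at a single site, whose temperature range differs from the deployment site's, transfers worse than one trained on two climatically different sites (purely synthetic signals with an assumed linear temperature cross-sensitivity, not the paper's sensor data):

```python
# Single-site vs. multisite calibration transfer for a low-cost sensor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)

def site(n, temp_mean, temp_sd):
    """Simulate one site: raw signal depends on concentration AND temperature."""
    conc = rng.uniform(0, 50, n)
    temp = rng.normal(temp_mean, temp_sd, n)
    signal = conc + 0.5 * temp + rng.normal(0, 0.5, n)
    return np.column_stack([signal, temp]), conc

Xa, ya = site(300, 20, 1)   # warm calibration site
Xb, yb = site(300, 5, 1)    # cool calibration site
Xc, yc = site(300, 12, 2)   # deployment (transfer) site

single = RandomForestRegressor(n_estimators=200, random_state=4).fit(Xa, ya)
multi = RandomForestRegressor(n_estimators=200, random_state=4).fit(
    np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

mse_single = mean_squared_error(yc, single.predict(Xc))
mse_multi = mean_squared_error(yc, multi.predict(Xc))
print("single-site MSE:", mse_single, " multisite MSE:", mse_multi)
```

The single-site forest never sees the deployment site's temperature range, so its learned temperature correction does not extrapolate, exactly the overfitting-to-conditions failure mode the paper describes.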


Longevity depends on various facets such as the economic growth of a country and the health innovations of its region. Along with predicting life expectancy, we also examine how susceptible a particular continent is to certain chronic diseases. These factors have a robust impact on the potential life span of the population. We study the biological and economic aspects of continents and their countries to predict the life expectancy of the population and to estimate the probability of a continent harbouring long-standing diseases such as measles and HIV/AIDS. Our research builds on the dependency, or correlation, of life expectancy on various factors, including health factors as well as economic factors. Two machine learning algorithms, simple linear regression and multiple linear regression, are used for predicting life expectancy over different continents, whereas the decision tree and random forest algorithms were applied to classify the likelihood of occurrence of disease. On comparing the various algorithms, we infer that multiple linear regression produces the most accurate estimates of the average life expectancy of a population given the current features of the continent, such as the adult mortality rate, alcohol consumption rate, infant deaths, the GDP of the country, average percentage expenditure of the population on health care and treatments, schooling rate, and other such features. In addition, we study five diseases: HIV/AIDS, measles, diphtheria, hepatitis B and polio. The experiments concluded that, in the majority of cases, random forest produces better classification results based on the economic factors of the various countries across continents.
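The classification side, estimating a disease likelihood from economic and health indicators, can be sketched with a random forest classifier (purely synthetic indicators, not the study's data set):

```python
# Disease-likelihood classification from hypothetical country indicators.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
# Hypothetical features: GDP per capita, health expenditure, schooling rate.
X = rng.uniform(size=(600, 3))
# Assumption for the toy data: outbreaks are likelier where GDP and
# health expenditure are both low.
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.2, 600) < 0.9).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=5).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # per-class likelihood of occurrence
print("accuracy:", clf.score(X_te, y_te))
```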


2020 ◽  
Vol 214 ◽  
pp. 02050
Author(s):  
Zhen Sun ◽  
Shangmei Zhao

This paper analyzed and compared the forecasting performance of three machine learning algorithms (multiple linear regression, random forest and an LSTM network) for stock prices, using the closing price data of a NASDAQ ETF and data on statistical factors. The test results show that prediction from the closing price data is better than that from the statistical factors, although the difference is not significant. Multiple linear regression is the most suitable for stock price forecasting. Random forest is second, and is prone to overfitting. The forecast performance of the LSTM network is the worst, with the highest RMSE and MAPE values.
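The two error measures used to rank the models are worth pinning down; a minimal sketch (the price arrays are placeholders, not the NASDAQ ETF data):

```python
# RMSE and MAPE, the two metrics used to compare the forecasts.
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

closing = [100.0, 200.0]    # placeholder closing prices
forecast = [110.0, 190.0]   # placeholder model forecasts
print(rmse(closing, forecast))  # 10.0
print(mape(closing, forecast))  # approximately 7.5
```

Because MAPE divides by the true price, it weights errors on cheap assets more heavily than RMSE does, so the two rankings need not agree.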


Author(s):  
Jun Pei ◽  
Zheng Zheng ◽  
Hyunji Kim ◽  
Lin Song ◽  
Sarah Walworth ◽  
...  

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to assess the importance of the GARF potential in the RF models: (1) a scrambled probability function set, obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions as GARF but has fixed peak heights. The accuracy comparison of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.
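The central trick, letting a random forest rank the relevance of atom-pairwise terms, comes down to reading its impurity-based feature importances; a generic sketch (synthetic features standing in for pairwise-interaction scores, not the GARF potential itself):

```python
# Rank hypothetical "atom pair" features by random forest importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
# Six synthetic pairwise-interaction scores; by construction only
# feature 0 separates native poses (y=1) from decoys (y=0).
X = rng.normal(size=(500, 6))
y = (X[:, 0] + rng.normal(0, 0.3, 500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=6).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print("features ranked by relevance:", ranking)
```

On real data the same ranking identifies which pair terms of the potential carry the discriminative signal, which is how the paper isolates peak positions from well depths.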


2021 ◽  
Author(s):  
Christian Thiele ◽  
Gerrit Hirschfeld ◽  
Ruth von Brachel

Abstract Registries of clinical trials are a potential source for scientometric analysis of medical research and serve important functions for the research community and the public at large. Clinical trials that recruit patients in Germany are usually registered in the German Clinical Trials Register (DRKS) or in international registries such as ClinicalTrials.gov. Furthermore, the International Clinical Trials Registry Platform (ICTRP) aggregates trials from multiple primary registries. We queried the DRKS, ClinicalTrials.gov, and the ICTRP for trials with a recruiting location in Germany. Trials that were registered in multiple registries were linked using the primary and secondary identifiers and a Random Forest model based on various similarity metrics. We identified 35,912 trials that were conducted in Germany. The majority of the trials were registered in multiple databases. 32,106 trials were linked using primary IDs, 26 were linked using the Random Forest model, and 10,537 internal duplicates on the ICTRP were identified using the Random Forest model after finding pairs with matching primary or secondary IDs. In cross-validation, the Random Forest model increased the F1-score from 96.4% to 97.1% compared to a linkage based solely on secondary IDs on a manually labelled data set. 28% of all trials were registered in the German DRKS. 54% of the trials on ClinicalTrials.gov, 43% of the trials on the DRKS and 56% of the trials on the ICTRP were pre-registered. The proportion of pre-registered studies and the proportion of studies registered in the DRKS increased over time.
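The linkage step, classifying candidate record pairs as matches from similarity metrics, can be sketched as follows (hand-made similarity features, not the DRKS/ICTRP data; F1 is the score the paper reports):

```python
# Record linkage as binary classification over pair-similarity features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 400
is_match = rng.integers(0, 2, n)
# Hypothetical per-pair features: title similarity, sponsor similarity,
# and agreement of registration dates.
title_sim = np.where(is_match, rng.uniform(0.7, 1.0, n), rng.uniform(0.0, 0.6, n))
sponsor_sim = np.where(is_match, rng.uniform(0.5, 1.0, n), rng.uniform(0.0, 0.7, n))
date_match = np.where(is_match, rng.random(n) < 0.9, rng.random(n) < 0.2)
X = np.column_stack([title_sim, sponsor_sim, date_match])

X_tr, X_te, y_tr, y_te = train_test_split(X, is_match, random_state=7)
clf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```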


2019 ◽  
Vol 12 (3) ◽  
pp. 1209-1225 ◽  
Author(s):  
Christoph A. Keller ◽  
Mat J. Evans

Abstract. Atmospheric chemistry models are a central tool to study the impact of chemical constituents on the environment, vegetation and human health. These models are numerically intense, and previous attempts to reduce the numerical cost of chemistry solvers have not delivered transformative change. We show here the potential of a machine learning (in this case random forest regression) replacement for the gas-phase chemistry in atmospheric chemistry transport models. Our training data consist of 1 month (July 2013) of output of chemical conditions together with the model physical state, produced from the GEOS-Chem chemistry model v10. From this data set we train random forest regression models to predict the concentration of each transported species after the integrator, based on the physical and chemical conditions before the integrator. The choice of prediction type has a strong impact on the skill of the regression model. We find best results from predicting the change in concentration for long-lived species and the absolute concentration for short-lived species. We also find improvements from a simple implementation of chemical families (NOx = NO + NO2). We then implement the trained random forest predictors back into GEOS-Chem to replace the numerical integrator. The machine-learning-driven GEOS-Chem model compares well to the standard simulation. For ozone (O3), errors from using the random forests (compared to the reference simulation) grow slowly and after 5 days the normalized mean bias (NMB), root mean square error (RMSE) and R² are 4.2 %, 35 % and 0.9, respectively; after 30 days the errors increase to 13 %, 67 % and 0.75, respectively. The biases become largest in remote areas such as the tropical Pacific where errors in the chemistry can accumulate with little balancing influence from emissions or deposition. Over polluted regions the model error is less than 10 % and has significant fidelity in following the time series of the full model.
Modelled NOx shows similar features, with the most significant errors occurring in remote locations far from recent emissions. For other species, such as inorganic bromine species and short-lived nitrogen species, errors become large, with NMB, RMSE and R² reaching >2100 %, >400 % and <0.1, respectively. This proof-of-concept implementation takes 1.8 times more time than the direct integration of the differential equations, but optimization and software engineering should allow substantial increases in speed. We discuss potential improvements in the implementation, some of its advantages from both a software and hardware perspective, its limitations, and its applicability to operational air quality activities.
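The "choice of prediction type" point, that for long-lived species it is better to predict the concentration change over the step than the absolute concentration, can be sketched like this (toy tendencies, not GEOS-Chem output):

```python
# Predicting the tendency (delta) vs. the absolute concentration for a
# long-lived species whose value is large but slow-moving.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n = 400
conc = rng.uniform(0, 1000, n)        # pre-integrator concentration
X_phys = rng.uniform(size=(n, 2))     # physical state before the integrator
delta = 2 * X_phys[:, 0] - X_phys[:, 1] + rng.normal(0, 0.1, n)  # small tendency
conc_next = conc + delta              # post-integrator concentration

X = np.column_stack([conc, X_phys])
X_tr, X_te, i_tr, i_te = train_test_split(X, np.arange(n), random_state=8)

# (a) predict the absolute post-integrator concentration directly
abs_rf = RandomForestRegressor(n_estimators=200, random_state=8).fit(
    X_tr, conc_next[i_tr])
# (b) predict only the change, then add it to the known starting value
dlt_rf = RandomForestRegressor(n_estimators=200, random_state=8).fit(
    X_tr, delta[i_tr])

mse_abs = mean_squared_error(conc_next[i_te], abs_rf.predict(X_te))
mse_dlt = mean_squared_error(conc_next[i_te], X_te[:, 0] + dlt_rf.predict(X_te))
print("absolute-target MSE:", mse_abs, " delta-target MSE:", mse_dlt)
```

Predicting the absolute value forces the forest to approximate the identity map over the species' whole dynamic range, while the delta target keeps the learning problem small and smooth.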


Author(s):  
Mert Gülçür ◽  
Ben Whiteside

Abstract This paper discusses micromanufacturing process quality proxies called “process fingerprints” in micro-injection moulding for establishing in-line quality assurance and machine learning models for Industry 4.0 applications. The process fingerprints that we present in this study are purely physical proxies of product quality, and their selection needs a tangible rationale based on criteria such as sensitivity, cost-effectiveness, and robustness. The proposed methods and selection reasons for the process fingerprints are justified by analysing the temporally collected data with respect to the microreplication efficiency. The extracted process fingerprints were also used in a multiple linear regression scenario, where they bring actionable insights for creating traceable and cost-effective supervised machine learning models in challenging micro-injection moulding environments. The multiple linear regression model demonstrated 84% accuracy in predicting the quality of the process, which is significant given the extreme process conditions and product features concerned.
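One way to compare fingerprint sensitivity in such a multiple-linear-regression setting is to standardize the fingerprints and read off coefficient magnitudes (a generic sketch; the fingerprint names here are invented, not the paper's):

```python
# Standardized MLR coefficients as a sensitivity ranking of fingerprints.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
# Hypothetical fingerprints: peak cavity pressure, injection work,
# melt-temperature integral (different natural units and scales).
X = rng.normal(size=(200, 3)) * [5.0, 1.0, 2.0] + [100.0, 10.0, 230.0]
quality = (0.8 * (X[:, 0] - 100) / 5 + 0.1 * (X[:, 2] - 230) / 2
           + rng.normal(0, 0.05, 200))

Xs = StandardScaler().fit_transform(X)  # puts all fingerprints on one scale
mlr = LinearRegression().fit(Xs, quality)
# After standardization the coefficients are directly comparable.
print(dict(zip(["p_peak", "w_inj", "T_int"], mlr.coef_.round(3))))
```

Without the standardization step, raw coefficients would mix units and say nothing about which fingerprint the quality responds to most strongly.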

