Statistical modelling and inference

Author(s):  
Max A. Little

The modern view of statistical machine learning and signal processing is that the central task is one of finding good probabilistic models for the joint distribution over all the variables in the problem. We can then make 'queries' of this model, also known as inferences, to determine optimal parameter values or signals. Hence, the importance of statistical methods to this book cannot be overstated. This chapter is an in-depth exploration of what this probabilistic modelling entails, the origins of the concepts involved, how to perform inferences and how to test the quality of a model produced this way.
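As a minimal, self-contained illustration of such a query (not taken from the chapter itself), the sketch below infers the mean of a Gaussian with known noise variance under a conjugate Gaussian prior; the data and prior settings are hypothetical.

```python
import numpy as np

# Hypothetical data assumed drawn from N(mu, sigma^2) with known sigma^2
x = np.array([1.2, 0.7, 1.5, 0.9, 1.1])
sigma2 = 0.25          # known observation noise variance (assumed)
mu0, tau2 = 0.0, 1.0   # Gaussian prior on mu: N(mu0, tau2) (assumed)

# Conjugate update: the posterior over mu is also Gaussian
n = len(x)
post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
post_mean = post_var * (x.sum() / sigma2 + mu0 / tau2)

# Two typical "queries" of the model: a point estimate and its uncertainty
print(f"posterior mean of mu: {post_mean:.3f}")
print(f"posterior std of mu:  {post_var ** 0.5:.3f}")
```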

Author(s):  
Max A. Little

This chapter provides an overview of generating samples from random variables with a given (joint) distribution, and of using these samples to find quantities of interest from digital signals. This task plays a fundamental role in many problems in statistical machine learning and signal processing. For example, effectively simulating the behaviour of a statistical model offers a viable alternative to solving the optimization problems that arise from some models for signals with large numbers of variables.
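As a minimal sketch of the idea, not drawn from the chapter, the expectation of a function of a random variable can be approximated by averaging that function over samples; the standard normal distribution and the squaring function below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from an assumed model, here X ~ N(0, 1)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Monte Carlo estimate of E[f(X)] for f(x) = x^2 (true value is 1 for N(0, 1))
estimate = np.mean(samples ** 2)
print(f"Monte Carlo estimate of E[X^2]: {estimate:.4f}")
```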


Author(s):  
Max A. Little

Digital signal processing (DSP) is one of the ‘foundational’ engineering topics of the modern world, without which technologies such as the mobile phone, television, CD and MP3 players, WiFi and radar would not be possible. A relative newcomer by comparison, statistical machine learning is the theoretical backbone of exciting technologies such as automatic techniques for car registration plate recognition, speech recognition, stock market prediction, defect detection on assembly lines, robot guidance and autonomous car navigation. Statistical machine learning exploits the analogy between intelligent information processing in biological brains and sophisticated statistical modelling and inference. DSP and statistical machine learning are of such wide importance to the knowledge economy that both have undergone rapid changes and seen radical improvements in scope and applicability. Both make use of key topics in applied mathematics such as probability and statistics, algebra, calculus, graphs and networks. Intimate formal links exist between the two subjects, and the resulting overlaps can be exploited to produce new DSP tools of surprising utility, highly suited to the contemporary world of pervasive digital sensors and high-powered yet cheap computing hardware. This book gives a solid mathematical foundation to, and details the key concepts and algorithms in, this important topic.


2021 ◽  
Vol 11 (15) ◽  
pp. 6955
Author(s):  
Andrzej Rysak ◽  
Magdalena Gregorczyk

This study investigates the use of the differential transform method (DTM) for integrating the fractional-order Rössler system. Preliminary studies of the integer-order Rössler system, with reference to other well-established integration methods, made it possible to assess the quality of the method and to determine the optimal parameter values that should be used when integrating a system with different dynamic characteristics. Bifurcation diagrams obtained for the fractional Rössler system show that, compared to RK4 scheme-based integration, the DTM results are more resistant to changes in the fractionality of the system.
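For context, a minimal sketch of the RK4 baseline applied to the integer-order Rössler system is shown below (the DTM itself is not reproduced here); the parameter values a = b = 0.2, c = 5.7, the step size and the initial state are common textbook choices rather than those necessarily used in the study.

```python
import numpy as np

def rossler(state, a=0.2, b=0.2, c=5.7):
    """Right-hand side of the integer-order Rössler system."""
    x, y, z = state
    return np.array([-y - z, x + a * y, b + z * (x - c)])

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Integrate a single trajectory (step size and initial state are assumptions)
dt, steps = 0.01, 50_000
state = np.array([1.0, 1.0, 0.0])
trajectory = np.empty((steps, 3))
for i in range(steps):
    state = rk4_step(rossler, state, dt)
    trajectory[i] = state
```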


2021 ◽  
pp. 108-119
Author(s):  
D. V. Shalyapin ◽  
D. L. Bakirov ◽  
M. M. Fattahov ◽  
A. D. Shalyapina ◽  
V. G. Kuznetsov

In domestic and world practice, despite the measures applied and developed to improve the quality of well casing, leaky structures remain a problem in almost 50 % of completed wells. Analysis of actual data using classical methods of statistical analysis (regression and variance analyses) does not allow the process to be modelled with sufficient accuracy, which requires the development of a new approach to studying the casing process. It is proposed to use machine learning and neural network modelling methods to identify the most important parameters and their synergistic impact on the target variables that affect the quality of well casing. The formulas needed to translate the numerical results of acoustic and gamma-gamma cementometry into categorical variables, in order to improve the quality of the probabilistic models, are determined. A database consisting of 93 parameters for 934 wells in fields located in Western Siberia has been formed. The cementing of production casing strings of horizontal wells in four stratigraphic arches is analysed, and the most significant variables and the patterns of their influence on the target indicators are established. Recommendations are formulated to improve the quality of well casing by correcting for the effects of acoustic and gamma-gamma logging on the results.
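A minimal sketch of the kind of preprocessing described, converting numerical acoustic cementometry readings into categorical bond-quality classes before probabilistic modelling, might look as follows; the thresholds and class labels are purely illustrative and are not the formulas derived in the study.

```python
import pandas as pd

# Hypothetical acoustic cementometry readings (relative amplitude, %)
acoustic = pd.Series([12.0, 35.0, 60.0, 82.0, 47.0])

# Illustrative thresholds only; the paper derives its own translation formulas
bins = [0, 20, 50, 80, 100]
labels = ["good bond", "partial bond", "poor bond", "no bond"]
bond_quality = pd.cut(acoustic, bins=bins, labels=labels)

print(bond_quality)
```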


Author(s):  
Soo Min Kwon ◽  
Anand D. Sarwate

Statistical machine learning algorithms often involve learning a linear relationship between dependent and independent variables. This relationship is modeled as a vector of numerical values, commonly referred to as weights or predictors. These weights allow us to make predictions, and the quality of these weights influences the accuracy of our predictions. However, when the dependent variable inherently possesses a more complex, multidimensional structure, it becomes increasingly difficult to model the relationship with a vector. In this paper, we address this issue by investigating machine learning classification algorithms with multidimensional (tensor) structure. By imposing tensor factorizations on the predictors, we can better model the relationship, as the predictors would take the form of the data in question. We empirically show that our approach works more efficiently than the traditional machine learning method when the data possesses both an exact and an approximate tensor structure. Additionally, we show that estimating predictors with these factorizations also allows us to solve for fewer parameters, making computation more feasible for multidimensional data.
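As a minimal sketch of the general idea under one common choice of factorization, a rank-R CP (CANDECOMP/PARAFAC) decomposition of the weight tensor yields a linear score without ever forming the full weight tensor; the dimensions, rank and random data below are placeholders, and the paper's actual factorization and training procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-way data sample, e.g. a small image stack of shape I x J x K
I, J, K, R = 8, 8, 4, 3        # R is the assumed CP rank
X = rng.normal(size=(I, J, K))

# Rank-R CP factors of the weight tensor W = sum_r a_r (outer) b_r (outer) c_r
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))

# Linear score <W, X> computed directly from the factors
score = sum(np.einsum("ijk,i,j,k->", X, A[:, r], B[:, r], C[:, r])
            for r in range(R))

# Parameter count: R*(I+J+K) for the factors versus I*J*K for an unstructured weight
print(f"score = {score:.3f}, params = {R * (I + J + K)} vs {I * J * K}")
```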


Author(s):  
Katherine Dagon ◽  
Benjamin M. Sanderson ◽  
Rosie A. Fisher ◽  
David M. Lawrence

Abstract. Land models are essential tools for understanding and predicting terrestrial processes and climate–carbon feedbacks in the Earth system, but uncertainties in their future projections are poorly understood. Improvements in physical process realism and the representation of human influence arguably make models more comparable to reality but also increase the degrees of freedom in model configuration, leading to increased parametric uncertainty in projections. In this work we design and implement a machine learning approach to globally calibrate a subset of the parameters of the Community Land Model, version 5 (CLM5) to observations of carbon and water fluxes. We focus on parameters controlling biophysical features such as surface energy balance, hydrology, and carbon uptake. We first use parameter sensitivity simulations and a combination of objective metrics including ranked global mean sensitivity to multiple output variables and non-overlapping spatial pattern responses between parameters to narrow the parameter space and determine a subset of important CLM5 biophysical parameters for further analysis. Using a perturbed parameter ensemble, we then train a series of artificial feed-forward neural networks to emulate CLM5 output given parameter values as input. We use annual mean globally aggregated spatial variability in carbon and water fluxes as our emulation and calibration targets. Validation and out-of-sample tests are used to assess the predictive skill of the networks, and we utilize permutation feature importance and partial dependence methods to better interpret the results. The trained networks are then used to estimate global optimal parameter values with greater computational efficiency than is achieved by hand-tuning efforts, and at an increased spatial scale relative to previous studies that optimize at a single site. By developing this methodology, our framework can help quantify the contribution of parameter uncertainty to overall uncertainty in land model projections.
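A heavily simplified sketch of the emulate-then-calibrate workflow is given below, with scikit-learn's MLPRegressor standing in for the authors' feed-forward networks and a toy function standing in for CLM5; the parameter ranges and the observational target are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def toy_land_model(p):
    """Placeholder for an expensive model run (CLM5 in the paper): 3 params -> 2 outputs."""
    return np.array([np.sin(p[0]) + p[1] ** 2, p[2] * p[0] - p[1]])

# Stand-in for a perturbed parameter ensemble
params = rng.uniform(-1, 1, size=(200, 3))
outputs = np.array([toy_land_model(p) for p in params])

# Train a feed-forward network to emulate the parameter -> output map
emulator = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
emulator.fit(params, outputs)

# Calibrate: find parameters whose emulated output best matches an "observation"
obs = np.array([0.3, -0.1])  # hypothetical observational target

def misfit(p):
    return float(np.sum((emulator.predict(p.reshape(1, -1))[0] - obs) ** 2))

result = minimize(misfit, x0=np.zeros(3), method="Nelder-Mead")
print("calibrated parameters:", result.x)
```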


2021 ◽  
Vol 50 (Supplement_1) ◽  
Author(s):  
Carla Bernardo ◽  
David Gonzalez-Chica ◽  
Jackie Roseleur ◽  
Luke Grzeskowiak ◽  
Nigel Stocks

Abstract

Focus and outcomes for participants

Modern technologies offer innovative ways of monitoring health outcomes. Electronic medical records (EMRs) stored in primary care databases provide comprehensive data on infectious and chronic conditions, such as diagnoses, prescribed medications, vaccinations, laboratory results, and clinical assessments. Moreover, they allow the possibility of creating a retrospective cohort that can be tracked over time. This rich source of data can be used to generate results that support health policymakers to improve access, reduce health costs, and increase the quality of care. The symposium will discuss the use and future of routinely collected EMR databases in monitoring health outcomes, using as an example studies based on the MedicineInsight program, a large Australian general practice database including more than 3.5 million patients. This symposium welcomes epidemiologists, researchers and health policymakers who are interested in primary care settings, big data analysis, and artificial intelligence.

Rationale for the symposium, including for its inclusion in the Congress

EMRs are becoming an important tool for monitoring health outcomes in different high-income countries and settings. However, most countries lack a national primary care database collating EMRs for research purposes. Monitoring of population health conditions is usually performed through surveys, surveillance systems, or censuses, which tend to be expensive or performed over long time intervals. In contrast, EMR databases are a useful and low-cost method to monitor health outcomes and have shown consistent results compared to other data sources. Although these databases only include individuals attending primary health settings, they tend to resemble the sociodemographic distribution from census data, since in countries such as Australia up to 90% of the population visit these services annually. Results from primary care-based EMRs can be used to inform practices and improve health policies. Analysis of EMRs can be used to identify, for example, those with undiagnosed medical conditions or patients who have not received recommended screenings or immunisations, thereby assessing the impact of government programmes. At a practice level, healthcare staff can have better access to comprehensive patient histories, improving monitoring of people with certain conditions, such as chronic cardiac, respiratory, metabolic, neurological, or immunological diseases. This information provides feedback to primary care providers about the quality of their care and might help them develop targeted strategies for the areas or groups most in need. Another benefit of EMRs is the possibility of using statistical modelling and machine learning to improve prediction of health outcomes and medical management, supporting general practitioners in deciding on the best management approach. In Australia, the MedicineInsight program is a large general practice database that since 2011 has been routinely collecting information from over 650 general practices varying in size, billing methods, and type of services offered, and from all Australian states and regions. In the last few years, diverse researchers have used MedicineInsight to investigate infectious and chronic diseases, immunization coverage, prescribed medications, medical management, and temporal trends in primary care. Despite being initially created for monitoring how medicines and medical tests are used, MedicineInsight has overcome some of the legal, ethical, social and resource-related barriers associated with the use of EMRs for research purposes through the involvement of a data governance committee responsible for the ethical, privacy and security aspects of any research using this data, and through applying data quality criteria to its data extraction. This symposium will discuss advances in the use of primary care databases for monitoring health outcomes, using as an example the research activities performed based on the Australian MedicineInsight program. These discussions will also cover challenges in the use of this database and possible methodological innovations, such as statistical modelling or machine learning, that could be used to improve monitoring of the epidemiology and management of health conditions.

Presentation program

The use of large general practice databases for monitoring health outcomes in Australia: infectious and chronic conditions (Professor Nigel Stocks)
How routinely collected electronic health records from MedicineInsight can help inform policy, research and health systems to improve health outcomes (Ms Rachel Hayhurst)
Influenza-like illness in Australia: how can we improve surveillance systems in Australia using electronic medical records? (Dr Carla Bernardo)
Long term use of opioids in Australian general practice (Dr David Gonzalez)
Using routinely collected electronic health records to evaluate Quality Use of Medicines for women’s reproductive health (Dr Luke Grzeskowiak)
The use of electronic medical records and machine learning to identify hypertensive patients and factors associated with controlled hypertension (Ms Jackie Roseleur)

Names of presenters

Professor Nigel Stocks, The University of Adelaide
Ms Rachel Hayhurst, NPS MedicineWise
Dr Carla Bernardo, The University of Adelaide
Dr David Gonzalez-Chica, The University of Adelaide
Dr Luke Grzeskowiak, The University of Adelaide
Ms Jackie Roseleur, The University of Adelaide


ADMET & DMPK ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 29-77 ◽  
Author(s):  
Alex Avdeef

The accurate prediction of the solubility of drugs is still problematic. It was thought for a long time that the shortfalls had been due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of these suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving the intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility have at best hovered around a root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky’s general solubility equation (GSE), (b) the Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of the data is still out of reach. The data quality is not the limiting factor in prediction, and the statistical machine learning methodologies are probably up to the task. Possibly what’s missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds). Also, new descriptors which can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
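For context, Yalkowsky’s general solubility equation (GSE) referred to above can be written as log10 S0 = 0.5 − 0.01(MP − 25) − log10 Kow, where MP is the melting point in °C and Kow is the octanol–water partition coefficient; a direct transcription with made-up inputs is sketched below.

```python
def gse_log_solubility(melting_point_c: float, log_kow: float) -> float:
    """Yalkowsky general solubility equation: log10 of intrinsic solubility (mol/L)."""
    return 0.5 - 0.01 * (melting_point_c - 25.0) - log_kow

# Hypothetical compound: melting point 150 °C and logKow (logP) of 2.5
print(gse_log_solubility(150.0, 2.5))  # -> -3.25
```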


Author(s):  
Max A. Little

Statistical machine learning and signal processing are topics in applied mathematics, which are based upon many abstract mathematical concepts. Defining these concepts clearly is the most important first step in this book. The purpose of this chapter is to introduce these foundational mathematical concepts. It also justifies the statement that much of the art of statistical machine learning, as applied to signal processing, lies in the choice of convenient mathematical models that happen to be useful in practice. Convenient in this context means that the algebraic consequences of the choice of mathematical modelling assumptions are in some sense manageable. The seeds of this manageability are the elementary mathematical concepts upon which the subject is built.


2021 ◽  
Vol 23 (07) ◽  
pp. 62-70
Author(s):  
Nagesh B ◽  
Dr. M. Uttara Kumari

Audio processing is an important branch of the signal processing domain. It deals with the manipulation of audio signals to achieve tasks such as filtering, data compression, speech processing and noise suppression, which improve the quality of the audio signal. For applications such as natural language processing, speech generation and automatic speech recognition, conventional algorithms are not sufficient; machine learning or deep learning algorithms are needed so that audio signal processing can be achieved with good results and accuracy. This paper reviews the various algorithms used by researchers in the past and indicates the appropriate algorithm for each respective application.

