Neural Networks for Conditional Probability Estimation

Forecasting Beyond Point Predictions

Dirk Husmeier

Paperback

Published: 22nd February 1999
Ships: 5 to 9 business days
$169.73
or 4 easy payments of $42.43

Conventional applications of neural networks usually predict a single value as a function of given inputs. In forecasting, for example, a standard objective is to predict the future value of some entity of interest on the basis of a time series of past measurements or observations. Typical training schemes aim to minimise the sum of squared deviations between predicted and actual values (the 'targets'), by which, ideally, the network learns the conditional mean of the target given the input. If the underlying conditional distribution is Gaussian, or at least unimodal, this may be a satisfactory approach. However, for a multimodal distribution the conditional mean does not capture the relevant features of the system, and the prediction performance will, in general, be very poor. This calls for a more powerful and sophisticated model which can learn the whole conditional probability distribution.

Chapter 1 demonstrates that even for a deterministic system with 'benign' Gaussian observational noise, the conditional distribution of a future observation, conditional on a set of past observations, can become strongly skewed and multimodal. In Chapter 2, a general neural network structure for modelling conditional probability densities is derived, and it is shown that a universal approximator for this extended task requires at least two hidden layers. A training scheme is developed from a maximum likelihood approach in Chapter 3, and the performance of this method is demonstrated on three stochastic time series in Chapters 4 and 5.
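The failure of the conditional mean under multimodality is easy to see numerically. The following minimal Python/NumPy sketch is an illustration only, not the book's DSM or GM-RVFL networks: the two-branch synthetic data, the bin-wise averaging, and the gm_nll helper are constructions introduced here for demonstration. It draws targets from a bimodal conditional distribution, shows that the least-squares optimum (the conditional mean) falls between the two modes, and compares the negative log-likelihood of a two-component Gaussian mixture density with that of a single Gaussian centred on the conditional mean.

import numpy as np

# Synthetic data with a bimodal conditional distribution p(y|x):
# given x, the target follows either the +x or the -x branch with
# equal probability, plus a small amount of Gaussian noise.
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0.0, 1.0, n)
branch = rng.integers(0, 2, n)
y = np.where(branch == 1, x, -x) + 0.05 * rng.normal(size=n)

# The least-squares optimum at each x is the conditional mean E[y|x],
# estimated here by bin-wise averaging; it lies near zero, between the
# two modes, and is therefore almost never observed.
bins = np.linspace(0.0, 1.0, 11)
idx = np.digitize(x, bins) - 1
cond_mean = np.array([y[idx == k].mean() for k in range(10)])
print("binned conditional means:", np.round(cond_mean, 3))

def gm_nll(y, means, sigmas, weights):
    """Average negative log-likelihood of y under a Gaussian mixture."""
    comp = weights * np.exp(-0.5 * ((y[:, None] - means) / sigmas) ** 2) \
           / (np.sqrt(2.0 * np.pi) * sigmas)
    return -np.log(comp.sum(axis=1)).mean()

# A two-component mixture whose component means follow the two branches
# captures both modes of p(y|x) ...
means = np.stack([x, -x], axis=1)
print("two-component mixture NLL:",
      round(gm_nll(y, means, np.full((1, 2), 0.05), np.full((1, 2), 0.5)), 3))

# ... whereas a single Gaussian centred on the conditional mean (~0) does not.
print("single-Gaussian NLL:",
      round(gm_nll(y, np.zeros((n, 1)), np.full((1, 1), y.std()),
                   np.ones((1, 1))), 3))

On this synthetic data the binned conditional means come out close to zero, between the two branches at +x and -x, and the mixture should attain a markedly lower negative log-likelihood than the single Gaussian. This is the motivation, developed in the chapters listed below, for training networks to output full conditional densities rather than point predictions.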

List of Figures p. xxi
Introduction p. 1
Conventional forecasting and Takens' embedding theorem p. 1
Implications of observational noise p. 5
Implications of dynamic noise p. 9
Example p. 10
Conclusion p. 16
Objective of this book p. 16
A Universal Approximator Network for Predicting Conditional Probability Densities p. 21
Introduction p. 21
A single-hidden-layer network p. 22
An additional hidden layer p. 23
Regaining the conditional probability density p. 25
Moments of the conditional probability density p. 26
Interpretation of the network parameters p. 28
Gaussian mixture model p. 29
Derivative-of-sigmoid versus Gaussian mixture model p. 30
Comparison with other approaches p. 31
Predicting local error bars p. 31
Indirect method p. 31
Complete kernel expansion: Conditional Density Estimation Network (CDEN) and Mixture Density Network (MDN) p. 32
Distorted Probability Mixture Network (DPMN) p. 32
Mixture of Experts (ME) and Hierarchical Mixture of Experts (HME) p. 33
Soft histogram p. 33
Summary p. 34
Appendix: The moment generating function for the DSM network p. 35
A Maximum Likelihood Training Scheme p. 39
The cost function p. 39
A gradient-descent training scheme p. 43
Output weights p. 45
Kernel widths p. 47
Remaining weights p. 48
Interpretation of the parameter adaptation rules p. 49
Deficiencies of gradient descent and their remedy p. 51
Summary p. 54
Appendix p. 55
Benchmark Problems p. 57
Logistic map with intrinsic noise p. 57
Stochastic combination of two stochastic dynamical systems p. 60
Brownian motion in a double-well potential p. 63
Summary p. 67
Demonstration of the Model Performance on the Benchmark Problems p. 69
Introduction p. 69
Logistic map with intrinsic noise p. 71
Method p. 71
Results p. 73
Stochastic coupling between two stochastic dynamical systems p. 75
Method p. 75
Results p. 77
Auto-pruning p. 78
Brownian motion in a double-well potential p. 80
Method p. 80
Results p. 82
Comparison with other approaches p. 82
Conclusions p. 83
Discussion p. 84
Random Vector Functional Link (RVFL) Networks p. 87
The RVFL theorem p. 87
Proof of the RVFL theorem p. 89
Comparison with the multilayer perceptron p. 93
A simple illustration p. 95
Summary p. 96
Improved Training Scheme Combining the Expectation Maximisation (EM) Algorithm with the RVFL Approach p. 99
Review of the Expectation Maximisation (EM) algorithm p. 99
Simulation: Application of the GM network trained with the EM algorithm p. 104
Method p. 104
Results p. 105
Discussion p. 108
Combining EM and RVFL p. 109
Preventing numerical instability p. 112
Regularisation p. 117
Summary p. 118
Appendix p. 118
Empirical Demonstration: Combining EM and RVFL p. 121
Method p. 121
Application of the GM-RVFL network to predicting the stochastic logistic-kappa map p. 122
Training a single model p. 122
Training an ensemble of models p. 126
Application of the GM-RVFL network to the double-well problem p. 129
Committee selection p. 130
Prediction p. 131
Comparison with other approaches p. 132
Discussion p. 134
A simple Bayesian regularisation scheme p. 137
A Bayesian approach to regularisation p. 137
A simple example: repeated coin flips p. 139
A conjugate prior p. 140
EM algorithm with regularisation p. 142
The posterior mode p. 143
Discussion p. 145
The Bayesian Evidence Scheme for Regularisation p. 147
Introduction p. 147
A simple illustration of the evidence idea p. 150
Overview of the evidence scheme p. 152
First step: Gaussian approximation to the probability in parameter space p. 152
Second step: Optimising the hyperparameters p. 153
A self-consistent iteration scheme p. 154
Implementation of the evidence scheme p. 155
First step: Gaussian approximation to the probability in parameter space p. 156
Second step: Optimising the hyperparameters p. 157
Algorithm p. 159
Discussion p. 160
Improvement over the maximum likelihood estimate p. 160
Justification of the approximations p. 161
Final remark p. 162
The Bayesian Evidence Scheme for Model Selection p. 165
The evidence for the model p. 165
An uninformative prior p. 168
Comparison with MacKay's work p. 171
Interpretation of the model evidence p. 172
Ockham factors for the weight groups p. 173
Ockham factors for the kernel widths p. 174
Ockham factor for the priors p. 175
Discussion p. 176
Demonstration of the Bayesian Evidence Scheme for Regularisation p. 179
Method and objective p. 179
Initialisation p. 179
Different training and regularisation schemes p. 180
Pruning p. 181
Large Data Set p. 181
Small Data Set p. 183
Number of well-determined parameters and pruning p. 185
Automatic self-pruning p. 185
Mathematical elucidation of the pruning scheme p. 189
Summary and Conclusion p. 191
Network Committees and Weighting Schemes p. 193
Network committees for interpolation p. 193
Network committees for modelling conditional probability densities p. 196
Weighting Schemes for Predictors p. 198
Introduction p. 198
A Bayesian approach p. 199
Numerical problems with the model evidence p. 199
A weighting scheme based on the cross-validation performance p. 201
Demonstration: Committees of Networks Trained with Different Regularisation Schemes p. 203
Method and objective p. 203
Single-model prediction p. 204
Committee prediction p. 207
Best and average single-model performance p. 207
Improvement over the average single-model performance p. 209
Improvement over the best single-model performance p. 210
Robustness of the committee performance p. 210
Dependence on the temperature p. 211
Dependence on the temperature when including biased models p. 212
Optimal temperature p. 213
Model selection and evidence p. 213
Advantage of under-regularisation and over-fitting p. 215
Conclusions p. 215
Automatic Relevance Determination (ARD) p. 221
Introduction p. 221
Two alternative ARD schemes p. 223
Mathematical implementation p. 224
Empirical demonstration p. 227
A Real-World Application: The Boston Housing Data p. 229
A real-world regression problem: The Boston house-price data p. 230
Prediction with a single model p. 231
Methodology p. 231
Results p. 232
Test of the ARD scheme p. 234
Methodology p. 234
Results p. 234
Prediction with network committees p. 236
Objective p. 236
Methodology p. 237
Weighting scheme and temperature p. 238
ARD parameters p. 239
Comparison between the two ARD schemes p. 240
Number of kernels p. 240
Bayesian regularisation p. 241
Network complexity p. 241
Cross-validation p. 242
Discussion: How overfitting can be useful p. 242
Increasing diversity p. 244
Bagging p. 245
Nonlinear Preprocessing p. 246
Comparison with Neal's results p. 248
Conclusions p. 249
Summary p. 251
Appendix: Derivation of the Hessian for the Bayesian Evidence Scheme p. 255
Introduction and notation p. 255
A decomposition of the Hessian using EM p. 256
Explicit calculation of the Hessian p. 258
Discussion p. 265
References p. 267
Index p. 273
Table of Contents provided by Syndetics. All Rights Reserved.

ISBN: 9781852330958
ISBN-10: 1852330953
Series: Perspectives in Neural Computing
Audience: General
Format: Paperback
Language: English
Number Of Pages: 275
Published: 22nd February 1999
Publisher: Springer London Ltd
Country of Publication: GB
Dimensions (cm): 23.52 x 15.6 x 1.8
Weight (kg): 0.47