I Supervised learning 5
1 Linear regression 8
1.1 LMS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 The normal equations . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Matrix derivatives . . . . . . . . . . . . . . . . . . . . . 13
1.2.2 Least squares revisited . . . . . . . . . . . . . . . . . . 14
1.3 Probabilistic interpretation . . . . . . . . . . . . . . . . . . . . 15
1.4 Locally weighted linear regression (optional reading) . . . . . . 17
2 Classification and logistic regression 20
2.1 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Digression: the perceptron learning algorithm . . . . . . . . . 23
2.3 Multi-class classification . . . . . . . . . . . . . . . . . . . . . 24
2.4 Another algorithm for maximizing ℓ(θ) . . . . . . . . . . . . . 27
3 Generalized linear models 29
3.1 The exponential family . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Constructing GLMs . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Ordinary least squares . . . . . . . . . . . . . . . . . . 32
3.2.2 Logistic regression . . . . . . . . . . . . . . . . . . . . 33
4 Generative learning algorithms 34
4.1 Gaussian discriminant analysis . . . . . . . . . . . . . . . . . . 35
4.1.1 The multivariate normal distribution . . . . . . . . . . 35
4.1.2 The Gaussian discriminant analysis model . . . . . . . 38
4.1.3 Discussion: GDA and logistic regression . . . . . . . . 40
4.2 Naive Bayes (optional reading) . . . . . . . . . . . . . . . . . . 41
4.2.1 Laplace smoothing . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Event models for text classification . . . . . . . . . . . 46
5 Kernel methods 48
5.1 Feature maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 LMS (least mean squares) with features . . . . . . . . . . . . . 49
5.3 LMS with the kernel trick . . . . . . . . . . . . . . . . . . . . . 49
5.4 Properties of kernels . . . . . . . . . . . . . . . . . . . . . . . . 53
6 Support vector machines 59
6.1 Margins: intuition . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Notation (optional reading) . . . . . . . . . . . . . . . . . . . . 61
6.3 Functional and geometric margins (optional reading) . . . . . . 61
6.4 The optimal margin classifier (optional reading) . . . . . . . . 63
6.5 Lagrange duality (optional reading) . . . . . . . . . . . . . . . 65
6.6 Optimal margin classifiers: the dual form (optional reading) . . 68
6.7 Regularization and the non-separable case (optional reading) . 72
6.8 The SMO algorithm (optional reading) . . . . . . . . . . . . . . 73
6.8.1 Coordinate ascent . . . . . . . . . . . . . . . . . . . . . 74
6.8.2 SMO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
II Deep learning 79
7 Deep learning 80
7.1 Supervised learning with non-linear models . . . . . . . . . . . 80
7.2 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3 Modules in modern neural networks . . . . . . . . . . . . . . . 92
7.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.4.1 Preliminaries on partial derivatives . . . . . . . . . . . 99
7.4.2 General strategy of backpropagation . . . . . . . . . . 102
7.4.3 Backward functions for basic modules . . . . . . . . . . 105
7.4.4 Backpropagation for MLPs . . . . . . . . . . . . . . . . 107
7.5 Vectorization over training examples . . . . . . . . . . . . . . . 109
III Generalization and regularization 112