


\title{Neural Networks: First Lecture}

The book and slides cover this in detail; these are just some quick notes.

\section{Neural Networks}

When counting layers we do not include the input layer, as that is just the inputs directly.

Each neuron can have a \emph{bias}, which is a value not given by the layer above.
This can be used to change the point at which it activates, thus shifting the input to the activation function.
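
As a small sketch of the role of the bias (the symbols $x_i$, $w_i$, $b$, $g$ and $a$ are notation assumed here, not taken from the slides), a neuron with inputs $x_i$, weights $w_i$, bias $b$ and activation function $g$ computes
\[
    a = g\Bigl(\sum_i w_i x_i + b\Bigr)
\]
so for a threshold-like $g$ that activates at $0$, the neuron activates when $\sum_i w_i x_i > -b$.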

We can also let a neural network output the parameters of a Gaussian distribution, that is, $\mu$ and $\sigma$.

\section{Learning}

\subsection{Loss Functions}

We use $loss$ for a single loss function, and $Loss$ for the summed loss over all outputs.

Minimizing the \emph{log} error function amounts to learning with maximum likelihood.
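
Spelled out, assuming the log error is the negative log-likelihood (the notation $P_{\mathbf{w}}$ for the distribution defined by the network output is introduced here for illustration):
\[
    loss = -\log P_{\mathbf{w}}(t \mid \mathbf{x})
\]
so the weights that minimize the loss are those that make the observed target most likely.
For a network that outputs $\mu$ and $\sigma$ of a Gaussian, this becomes
\[
    loss = \log \sigma + \frac{(t - \mu)^2}{2\sigma^2} + \frac{1}{2}\log 2\pi
\]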

The \emph{cross-entropy} function is used with discrete outputs, where $o_i$ is the predicted probability of the target $t_i$.
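
For reference, a common form of the cross-entropy loss, assuming the targets $t_i$ are one-hot (exactly one $t_i$ is $1$ and the rest are $0$) and the outputs $o_i$ sum to $1$, is
\[
    Loss = -\sum_i t_i \log o_i
\]
For example, with $\mathbf{t} = (0, 1, 0)$ and $\mathbf{o} = (0.2, 0.7, 0.1)$ only the middle term is nonzero, giving $Loss = -\log 0.7 \approx 0.357$ (natural logarithm).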

\subsection{Gradient Descent}

Gradient is used interchangeably with derivative.
The gradient can be used to find the point in the input space (weights) where the error function is low.

We then move in the opposite direction of the gradient, scaled by a \emph{learning factor}, $\eta$.
When we start off with some random initial weights, it is not guaranteed that we will reach the globally optimal weights.
\[
    \mathbf{w}' = \mathbf{w} - \eta \nabla_{\mathbf{w}} Loss(\mathbf{w})
\]
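
As a one-dimensional illustration (the loss $Loss(w) = w^2$ and the values below are made up for this example), take $w = 1$ and $\eta = 0.1$.
The gradient is $2w = 2$, so
\[
    w' = 1 - 0.1 \cdot 2 = 0.8
\]
and $Loss$ drops from $1$ to $0.64$.
Repeating the step multiplies $w$ by $0.8$ each time, so in this example the weight approaches the minimum at $w = 0$.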

The problem is that the derivative of these networks is hard to compute directly.
This is why we introduce back propagation.

\subsection{Back Propagation}

Back propagation is a way of calculating the derivative using the chain rule.

One of the books uses the \emph{Jacobian matrix}, which is just a matrix holding partial derivatives.
This is an alternative way of representing what is done in the slides.
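
For reference, the standard definition (with $\mathbf{f}$, $\mathbf{g}$ and $\mathbf{x}$ as notation assumed here): for a function $\mathbf{f}$ from $n$ inputs to $m$ outputs, the Jacobian is the $m \times n$ matrix with entries
\[
    \bigl[J_{\mathbf{f}}(\mathbf{x})\bigr]_{ij} = \frac{\partial f_i}{\partial x_j}
\]
and the chain rule for a composition becomes a matrix product:
\[
    J_{\mathbf{f} \circ \mathbf{g}}(\mathbf{x}) = J_{\mathbf{f}}(\mathbf{g}(\mathbf{x})) \, J_{\mathbf{g}}(\mathbf{x})
\]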

When calculating the \emph{back propagation} we rely on computational graphs.
Here we represent our network as a computational graph.
We then walk backwards through the graph, computing the derivative at each step using the chain rule and partial derivatives.

With this we can also find the partial derivative with respect to intermediate values and inputs, which is interesting for testing
sensitivity.
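
A minimal worked example, assuming a single sigmoid neuron with squared error (this setup and the symbols $x$, $w$, $b$, $z$, $o$, $t$ and $\sigma$ are chosen for illustration, not taken from the slides): the computational graph is
\[
    z = wx + b, \qquad o = \sigma(z), \qquad loss = \frac{1}{2}(o - t)^2
\]
Walking backwards through the graph gives
\[
    \frac{\partial loss}{\partial o} = o - t, \qquad
    \frac{\partial loss}{\partial z} = (o - t)\,\sigma'(z)
\]
\[
    \frac{\partial loss}{\partial w} = \frac{\partial loss}{\partial z}\, x, \qquad
    \frac{\partial loss}{\partial b} = \frac{\partial loss}{\partial z}, \qquad
    \frac{\partial loss}{\partial x} = \frac{\partial loss}{\partial z}\, w
\]
Here $\partial loss / \partial z$ is the derivative with respect to an intermediate value, and $\partial loss / \partial x$ is the derivative with respect to the input.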

\subsection{Momentum}

Here we introduce a \emph{velocity} or \emph{momentum}, which determines the direction of movement.
We then use it to update the weights.

This velocity is calculated from its previous value, which often lets us converge much faster.
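
One common formulation (the velocity $\mathbf{v}$ and momentum coefficient $\alpha$ are notation assumed here; the slides may define the update differently) is
\[
    \mathbf{v}' = \alpha \mathbf{v} - \eta \nabla_{\mathbf{w}} Loss(\mathbf{w}), \qquad
    \mathbf{w}' = \mathbf{w} + \mathbf{v}'
\]
with $0 \le \alpha < 1$.
Setting $\alpha = 0$ recovers plain gradient descent, while for $\alpha$ close to $1$ steps in a consistent direction build up.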

\subsection{Vanishing Gradients}

When using a lot of nested sigmoid functions we can often end up with a really low update value for the weights.
This is a problem, and therefore we introduce the \emph{ReLU} function, whose derivative is $1$ when the input is positive, so the gradient passes straight through.
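
To see why, recall that the sigmoid $\sigma(z) = 1/(1 + e^{-z})$ has derivative
\[
    \sigma'(z) = \sigma(z)\,(1 - \sigma(z)) \le \frac{1}{4}
\]
so the chain rule through $n$ nested sigmoids contributes a product of $n$ factors that are each at most $\frac{1}{4}$ (ignoring the weights, which also enter the product).
For example, $n = 10$ gives a factor of at most $(1/4)^{10} \approx 10^{-6}$.
The ReLU, $\max(0, z)$, instead has derivative $1$ for $z > 0$ and $0$ for $z < 0$.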