author | Julian T <julian@jtle.dk> | 2022-03-14 14:36:28 +0100 |
---|---|---|
committer | Julian T <julian@jtle.dk> | 2022-03-14 14:36:28 +0100 |
commit | b33baeb47aa29805dea2509dc2dafdadc86d76bc | |
tree | da4116a56dfc6db2a47a868b74f8ca90c2cf2ba6 | |
parent | 07ce62e7e273b0a212301518d01496fe5e59a184 | |
Add notes for ml lec 6
-rw-r--r-- | sem8/ml/lec6.tex | 64 |
1 files changed, 64 insertions, 0 deletions
diff --git a/sem8/ml/lec6.tex b/sem8/ml/lec6.tex
new file mode 100644
index 0000000..0729052
--- /dev/null
+++ b/sem8/ml/lec6.tex
@@ -0,0 +1,64 @@
+\title{Neural Networks first lecture}
+
+There is a lot about this in the book and slides, so these are just some quick notes.
+
+\section{Neural Networks}
+
+When counting layers we do not include the input layer, as it just passes the inputs through directly.
+
+Each neuron can have a \emph{bias}, which is an extra value that is not produced by the previous layer.
+This can be used to change the point at which the neuron activates, by shifting the input to the activation function.
+
+We can also let a neural network output a specification of a Gaussian distribution, that is $\mu$ and $\sigma$.
+
+\section{Learning}
+
+\subsection{Loss Functions}
+
+We use $loss$ for the loss on a single output, and $Loss$ for the summed loss over all outputs.
+
+Minimising the \emph{log} error function corresponds to learning with maximum likelihood.
+
+The \emph{cross-entropy} function is used with discrete outputs, where $o_i$ is the predicted probability of a specific target $t_i$.
+
+\subsection{Gradient Descent}
+
+Gradient is used interchangeably with derivative.
+It can be used to find a place in the input space (the weights) where the error function is low.
+
+We move in the opposite direction of the gradient, scaled by a \emph{learning factor} $\eta$, which gives the update rule below.
+When we start off with some random initial weights, it is not guaranteed that we will reach the globally optimal weights.
+\[
+    \mathbf{w}' = \mathbf{w} - \eta \nabla_{\mathbf{w}} Loss(\mathbf{w})
+\]
+
+This has a problem, which is why we introduce back propagation:
+the derivative of these networks is hard to compute directly.
+
+\subsection{Back Propagation}
+
+Back propagation is a way of calculating the derivative by using the chain rule.
+
+One of the books uses the \emph{Jacobian matrix}, which is just a matrix holding all the partial derivatives.
+This is an alternative way of representing what is done in the slides.
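+
+As a small worked sketch of the chain rule (the single-neuron setup and the symbols $x$, $b$ and $z$ are assumptions for illustration, not taken from the slides), take one sigmoid neuron with a squared error:
+\[
+    z = w x + b, \qquad o = \sigma(z), \qquad loss = \frac{1}{2} (o - t)^2
+\]
+The derivative with respect to the weight is then the product of the local derivatives:
+\[
+    \frac{\partial loss}{\partial w}
+    = \frac{\partial loss}{\partial o} \cdot \frac{\partial o}{\partial z} \cdot \frac{\partial z}{\partial w}
+    = (o - t) \, \sigma(z) (1 - \sigma(z)) \, x
+\]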
+
+When calculating the \emph{back propagation} we rely on computational graphs.
+Here we represent our network as a computational graph.
+We then walk backwards through it, working out the derivative with respect to the inputs of each step, which only takes the chain rule and partial derivatives.
+
+With this we can also find the partial derivative with respect to intermediate values and inputs, which is interesting for testing
+sensitivity.
+
+\subsection{Momentum}
+
+Here we introduce a \emph{velocity} or \emph{momentum}, which we use to determine the direction of movement.
+We then use this to update the weights.
+
+This velocity is calculated from its previous value together with the current gradient, meaning that we often converge much faster.
+
+\subsection{Vanishing Gradients}
+
+When using a lot of nested sigmoid functions we can often end up with a really small update value for the weights.
+This is a problem, and therefore we introduce the \emph{relu} function, which just passes the derivative right through when the input is positive.
+
+
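+To see roughly why the sigmoid update gets so small (a sketch that assumes a chain of $n$ sigmoid units and ignores the weights between them, which is a simplification), note that
+\[
+    \sigma'(z) = \sigma(z) (1 - \sigma(z)) \leq \frac{1}{4}
+\]
+so the chain rule multiplies $n$ such factors, and their product is at most $(1/4)^n$.
+For \emph{relu}, written here as $\max(0, z)$, the corresponding factor is $1$ for $z > 0$ and $0$ for $z < 0$,
+so it does not shrink the gradient when the input is positive.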