- Activation functions
- Sigmoid
- tanh
- ReLU
- Leaky ReLU
- Parametric ReLU (PReLU)
- Exponential Linear Units (ELU)
- Maxout Neuron
- Hierarchical Softmax
- Multi -label/-attribute/-tag loss
- Initialization
- w_i = 0 (all weights=0)
- small random numbers W = 0.01*np.random.randn(D, H)
- large random numbers (e.g. randn around 1)
- Xavier initialization (Glorot et al., 2010)
- Hyperparameter Search
- Cross Validation
- Autodifferentiation
- Gradient Descent
- Learning rate decay
- Higher-Order Optimization
- Model Ensembles
- Regularization
- L2 Regularization
- L1 Regularization
- Dropout
- Bias regularization
- Data Augmentation
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
- Batch Normalization (Ioffe and Szegedy, 2015)
- CNN Architectures
- RNNs
- Char-level RNN
- Backpropagation through time
- Image Captioning
- Image Captioning with Attention
- Vanilla RNN Gradient Flow
- Long Short Term Memory (LSTM)
- Detection and Segmentation
- Semantic Segmentation
- Upsampling Methods
- Classification + Localization
- Object Detection
- Instance Segmentation
- Visualizing & Understanding
- Visualizing Convnet filter weights
- Maximally activating patches
- Cluster by L2-nn in feature space
- Activation Visualization
- Occlusion Experiments
- Saliency Maps
- Dimensionality Reduction
- Guided Backpropagation
- Synthesizing images with Gradient Ascent
- Deep Dream
- Feature inversion
- Neural Texture Synthesis
- Neural Style Transfer
- Generative Models
- Fully visible belief network (explicit density)
- Variational Autoencoders
- Generative Adversarial Networks
- Deep Reinforcement Learning
- Markov Decision Process \((\mathcal{S, A, R}, \mathbb{P}, \gamma)\)
- Q-learning
- Experience Replay
- Policy Gradients
- Actor-Critic
- Summary
- Efficient Methods and Hardware for Deep Learning
- Pruning
- Weight Sharing (Trained Quantization)
- Huffman Coding
- SqueezeNet
- Quantization
- Low Rank Approximation
- Binary/Ternary Net
- Winograd Convolution
- Parallelism
- Mixed Precision Training
- Model Distillation
- Adversarial Examples and Adversarial Training

Problems:

- vanishing gradient
- not zero centered as the image of the sigmoid function is \((0, 1)\)

\((\textrm{I}) := \sigma(\sum w_ix_i+b)\) is always all-positive, this causes the gradient updates all moving in the same direction. The gradient of \((\textrm{I})\) (with respect to \(w\)) is \(\sigma'(\sum w_ix_i+b) (\sum w_i x_i+b)'\) according to the chain rule. The first term \(\sigma'(\sum w_ix_i+b)\) is some arbitrary gradient, the gradient of \((\sum w_i x_i+b)\) is \(x\). If \(x\) is the output of a previous layer with sigmoid activation it is always all-positive. - \(\textrm{exp}()\) is computationally expensive

Problems:

- vanishing gradient

- does not saturate
- very computationally efficient
- converges fast
- more biologically plausible

Problems:

- not zero centered
- no gradient for negative values → bad initialization can cause relus to be "dead", a too high learning rate can also cause values to jump around and relus to die

relu neurons can be initialized with slightly positive bias (e.g. 0.01) to decrease the likelihood of it being dead

Leaky ReLUs don't die, better gradient behaviour

\(\alpha\) is learned

- all benefits of relu
- closer to zero-mean outputs

Problems:

- more saturating than relus
- computationally expensive because of \(\textrm{exp}()\)

- does not have the form of dot product, then linearity. Instead two sets of weights and biases
- generalizes ReLU and LeakyReLU
- Linear regime, does not die or saturate!

Problems:

- doubles the number of weights and biases

- binary classifier for each category
- logistic regression classifier for every attribute

`W = 0.01*np.random.randn(D, H)`

```
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in) # layer initialization
```

- try to get the variance of the input to be the same as the variance of the output
- assumes linear activation (for example in the active region of tanh/sigmoid) → doesn't work with relu

To work with relu use this (note additional "/2") (He et al., 2015). This accounts for the 50% of neurons that are dead.

```
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in/2) # layer initialization
```

- Optimize in log space!
`lr = 10**uniform(-3, -6)`

{0.001, 0.01, 0.1, 1} vs {0.001, 0.334, 0.667, 1}
- Use random search instead of grid search. Usually one variable is more important than the others, if we use random search we try many more different values for this important variable.

- split up computation in elementary operations ("gates" in the computation graoh) for which the local derivative is known
- chain the gradients of elementary operations together using the chain rule
- the elementary operations in the computation graph don't have to be atomic. For example the sigmoid function \(\sigma(x)=\frac{1}{1+e^{-x}}\) has derivative \(\frac{d\sigma}{dx}=\frac{e^{-x}}{(1+e^{-x})^2}=(1-\sigma(x))\sigma(x)\). As the derivative is very simple it might make sense to have the entire sigmoid function as a singular gate in the computation graph instead of chaining \(e^x\), \(\frac{1}{x}\), \(x^2\),.. operations together.
- don't create the entire computation graph for the gradient. Just compute it directly

See this.

Some quotes:

"

"We discussed the importance of

- mini batch size: "We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2."
- mini batches: "The reason this works well is that the examples in the training data are correlated. To see this, consider the extreme case where all 1.2 million images in ILSVRC are in fact made up of exact duplicates of only 1000 unique images (one for each class, or in other words 1200 identical copies of each image). Then it is clear that the gradients we would compute for all 1200 identical copies would all be the same, and when we average the data loss over all 1.2 million images we would get the exact same loss as if we only evaluated on a small subset of 1000."

- Loss changes quickly in one direction and slowly in another (loss landscape looks like a taco shell) causes SGD to oscillate/jitter. Loss function is said to have a high
*condition number*at this point (ratio of largest to smallest singular value of the Hessian matrix is large).
Problem is worse in higher dimensions: If we have thousands or millions of singular values of the (much larger) hessian matrix the ratio of the largest to smallest one will be quite large.
- saddle points/local minima, some/all components of the gradient are zero → network doesn't learn. In practice since in high-dim spaces local minima/maxima are very rare, saddle points are much more problematic since they are very common and the network doesn't learn very fast if many of the gradient's components are zero.

Nesterov momentum uses lookahead position (after adding velocity) to compute gradient. Formulation of NAG is a bit annoying since we usually want to compute the loss function and gradient at the same point. Change of variables fixes this

Does momentum causes us to skip over minima (when the minimum lies in a very narrow basin)? Yes, but usually those sharp minima are minima that we want to avoid anyway because they lead to overfitting. If we increase our training set, those sharp minima may disappear, while large flat minima will remain.

```
python
grad_squared = 0
while True:
dx = compute_gradient(x)
grad_squared += dx
```*dx
x -= lr * dx / (np.sqrt(grad_squared) + 1e-7)

Problem: grad_squared is never "reset" → step size decrease over the course of training

```
python
grad_squared = 0
while True:
dx = compute_gradient(x)
grad_squared = decay_rate
```* grad_squared + (1-decay_rate)*dx*dx
x -= lr * dx / (np.sqrt(grad_squared) + 1e-7)

Adam without bias correction:

```
first_moment = 0
second_moment = 0
while True:
dx = compute_gradient(x)
first_moment = beta1
```* first_moment + (1+beta1) * dx # Momentum
second_moment = beta2 * second_moment + (1-beta2) * dx*dx
x -= lr * first_moment / (np.sqrt(second_moment) + 1e-7)

```
first_moment = 0
second_moment = 0
for t in range(num_iterations):
dx = compute_gradient(x)
first_moment = beta1
```* first_moment + (1+beta1) * dx # Momentum
second_moment = beta2 * second_moment + (1-beta2) * dx*dx # AdaGrad/RMSProp
first_unbias = first_moment / (1-beta1 *** t)
second_unbias = second_moment / (1-beta2 ** t)
x -= lr first_unbias / (np.sqrt(second_unbias) + 1e-7)

Hessian Matrix for Newton parameter update has O(n^2) elements, inverting takes O(n^3)

Instead there are less expensive versions

- Quasi-Newton methods (BGFS)
- L-BFGS (Limited memory BFGS)

works well with full batch deterministic sgd but not in the stochastic min-batch setting

- Train multiple independent models (possibly with different hyperparameters)
- At test time average their results

→ A few percent higher performance

Alternatively:

- take multiple snapshots of a single model during training and average those snapshots in an ensemble (maybe with extreme learning rate schedule to explore different regions of the parameter space within the same training run)
- Polyak averaging

`W += -lambda * W`

towards zero."- Distributes representation over many more neurons, network doesn't become dependent on any single neuron
- Large ensemble of models within a single model

At test time we want to remove the stochasticity of dropout:

Instead approximate this by multiplying by dropout probability to rescale all activations. "Drop at train time, scale at test time"

- Horizontal Flips
- Random crops and scales
- Color jitter for contrast & brightness (more advanced with PCA to determine direction of jitter)
- translation/rotation/stretching/shearing/lens distortions...

Normalize by empirical mean and variance (independently for each dimension, applied batch-wise).

Can be applied to fully connected and convolutional layers. In convolutional layers we normalize across all spatial locations. (For conv only one scalar mean and variance, for fully connected possibly one scalar mean and variance for each feature)

BN avoids saturation, but maybe we want some saturation? Add parameters \(\beta\) and \(\gamma\) to control mean and variance. In this way we don't simply normalize the layer output but instead give the layer explicit control over it's output's mean and variance. (layer can just recover identity mapping)

- improves gradient flow through the network
- allows higher learning rates
- reduces strong dependence on initialization
- acts as regularization since each image (/item) is dependent on all other items in the batch. (since the normalization is done per batch). "It is no longer producing deterministic values for a given training example and is tying all of the inputs in a batch together"

Why should data everywhere in the network (and during preprocessing) be normalized?

**LeNet-5:**one of the first convnets that was successfully used in practice (standard conv, pool, ... , fc architecture)**AlexNet:**first large scale convnet that won imagenet competition. Architecture similar to LeNet, first use of ReLU, used dropout and data augmentation, SGD+momentum, l2 weight decay, 7-CNN ensemble**VGGNet:**(VGG16, VGG19 with 16 and 19 layers respectively), won imagenet, very small conv filters (3x3, 5x5). Stacker smaller filter are less computationally expensive but have the same effective receptive field.**GoogLeNet:**2014 winner, introduced inception module, no fc layers, 12x less parameters than alexnet**ResNet:**much deeper than any previous model: 152 layers, 2015 winner, residual connections make it easier to recover the identity mapping (and more explicit), batchnorm, xavier:2 init, sgd+momentum, weight decay, no dropout used, learning rate schedule Residual Blocks and L2 normalization: l2 normalization encourages weights to be close to zero, this might not actually make a lot of sense with normal convnets but with resnets this causes unused layers to just become identity mappings. Think of what happens with the gradient at "additive gates". The gradient is the sum of the gradients of both branches.

Other architectures:

**"Network in Network"****"Identity Mappings in Deep Residual Networks":**improved ResNet block design**"Wide Residual Networks":**Argues that residuals are the important factor, not depth. Uses wider residual blocks and fewer layers. 50-layer wide ResNet outperforms 152-layer original ResNet. Increasing width instead of depth is more computationally efficient (parallelizable)**"Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)":**multiple parallel pathways within a residual block**"Deep Networks with Stochastic Depth":**Motivation: reduce vanishing gradients and training time through shallower networks during training. Like dropout but for entire layers. Use full network at test time.**"FractalNet: Ultra-Deep Neural Networks without Residuals":**argues that key is transitioning effectively from shallow to deep and residual representations are not necessary. Fractal architecture with both shallow and deep paths to output. Trained with dropping out sub-paths, full network at test time**DenseNet "Densely Connected Convolutional Networks":**dense blocks where each layer is connected to every other layer in feedforward fashion. Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse**"SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and less than 0.5Mb Model Size"**

Number of layers of a network: generally given as the number of weight layers (conv, fc)

\(h_t=f_W(h_{t-1}, x_t)\)

for example:

\(h_t=\tanh(W_{hh}h_{t-1}+W_{xh}x_t)\) \(y_t=W_{hy}h_t\)

or with biases:

\(h_t=\tanh(W_{hh}h_{t-1}+W_{xh}x_t+b_h)\) \(y_t=W_{hy}h_t+b_y\)

(Using \(\tanh\) generally causes the vanishing gradient problem but avoids the problem that unbounded activations in RNNs may explode, see this and this.

One-to-one, one-to-many, many-to-one, many-to-many architectures

- Feed initial primer/prompt through the network
- At each step apply softmax to output and sample from the resulting probability distribution. Then feed this back into the network as input. (Instead of sampling we could also take the argmax which may increase stability but reduces sample diversity)

```
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
```

Hard-attention is not differentiable. "Need to use something slightly fancier than vanilla backprob."

Slight hack to fix exploding gradients:

`grad_norm = np.sum(grad`*grad)
if grad_norm > threshold:
grad *= (threshold / grad_norm)

better gradient flow: gradient flows through forget gate operation but this can be different for each timestep so we are not repeatedly multiplying by the same matrix. Also the forget gate is also in the range [0, 1] due to the sigmoid. Furthermore we have no non-linearity in the gradient path.

- Idea: Sliding window approach: Split image into tiny crops and classify each one by a conventional cnn classifier. Extremely inefficient
- Idea: fully convolutional network. Very computationally expensive since we have convolutional layers with large height/width and depth.
- Solution: fully convolutional network with down- and upsampling

**Unpooling**(nearest neighbor/bed of nails)**Max Unpooling**: similar to bed of nails unpooling above but we choose the spots based on the max-positions of a previous max-pooling layer. May improve image sharpness compared to e.g. nn unpooling.**Transpose Convolution:**learnable upsampling Forward pass of transpose convolution is the same operation as the backward pass of a normal convolutionOverlapping patches are added (not averaged) as a result of convolution's definition as a matrix multiplication

Why is it called transposed conv?

Use for example 4x4 stride=2 transposed conv or 2x2 stride=2 transposed conv to avoid checkerboard artefacts (or resize conv).

Solution: single convnet that outputs classification score (apply softmax loss) and (x, y, w, h) bounding box coordinates (apply l2/l1/any regression loss)

Related task: human pose estimation (e.g. network outputs the position of 14 fixed joints)

Sliding window approach intractable since there are too many possible crops Solution: "Region Proposals" (for example with "Selective Search") to propose about 2000 possible image crops that could contain objects (high noise, but high recall).

**R-CNN:**create 2000 region proposals, scale them to a fixed size, apply convnet to classify and create bounding box correction
Problems: training (84h) and inference (47s/img with vgg16) is slow, using fixed instead of learned region proposals **Faster R-CNN:**Fixes main problem with R-CNN (slowness) by computing feature maps for the entire image first (and only once) and then takes crops from region proposals (which are based on the image) out of the computed feature maps. Faster R-CNN's computation time is dominated by computing 2000 region proposals (0.32 sec excl, 2.3 sec incl region proposals)**Faster R-CNN:**insert Region Proposal Network to predict proposals from features**YOLO / SSD (Single-Shot MultiBox Detector)**

Summary:

classification + bounding box regression + segmentation mask for each region proposal

Can also do pose estimation

Visualizing later layers doesn't work well since activations in later layers are not in terms of pixels of the input image but rather in terms of activations of previous layers:

Can be used for segmentation without supervision

**PCA**(Principal Component Analysis)**t-SNE**(t-distributed stochastic neighbor embedding)

Collect maximally activating patches, then run guided backprop to find pixels that maximally affect the label

By adding additional priors we can generate more realistic images.

+ jiggle image as regularization to increase spatial smoothness

Use gram matrices for neural texture synthesis: pass target image through cnn and compute gram matrix. Then use gradient descent to train new image to have a similar gram matrix. (Or use weighted average of multiple gram matrices at different layers)

Pass style image through network to get gram matrices Pass content image to get features

Then run feature inversion and neural texture synthesis in parallel

- We can also use multiple scale images
- Fast style transfer: train neural network to do style transfer in one step instead of needing to do thousands of forward and backward passes for each image

- clustering
- dimensionality reduction
- feature learning (e.g. autoencoders)
- density estimation (approximate prob. density function of distribution based on data)

- Explicit density estimation: explicitly define and solve for p_model
- Implicit density estimation: Learn model that can sample from p_model but does not explicitly define the distribution.

Why generative models:

- generating images, artwork, audio samples
- generative models of time-series data can be used for simulation and planning (reinforcement learning applications)
- generating data from latent representations

$$ p(x) = \prod_{i=1}^n p(x_i | x_1, \ldots, x_{i-1}) $$

Then learn to generate the image pixel by pixel (computes probability distribution over all pixel intensities)

PixelCNN: still generate image pixels sequentially but don't use an RNN and instead use a CNN to look at the context region around the next pixel.

For both PixelRNN and PixelCNN density estimation is explicit because the network predicts a probability distribution of possible values for each pixel and by the chain rule for the entire image.

- Train autoencoder on a large amount of unlabeled data
- Throw away decoder
- Train classifier on top of the (feature) encoder with smaller amount of labeled data. (and possibly finetune encoder)

$$ p_\theta(x) = \int p_\theta(z) p_\theta(x|z)dz $$

(this integral is intractable because we cannot sum over

Instead optimize variational lower bound ("ELBO") → See lecture for exact discussion and derivations

Instead of sampleing from a complex high dimensional distribution, sample from a simpler one and transform it into a complex one using a neural network

To train a GAN we alternate between the generator and discriminator objective.

We want to minimize the generator objective in the graph above. Unfortunately we only get a strong gradient when the generator is already very good. Instead optimize the following objective

We are able to do interpolation and interpretable vector math with latent vectors.

- \(\mathcal{S}\) set of possible states
- \(\mathcal{A}\) set of possible actions
- \(\mathcal{R}\) distribution of reward given for (state, action) pair
- \(\mathbb{P}\) transition probability i.e. distribution over next states given current (state, action) pair
- \(\gamma\) discount factor

satisfies the Markov property

A

The optimal policy \(\pi^\ast\) corresponds to taking the best action in any state as specified by \(Q^\ast\)

→ Instead use a function approximator to estimate the action-value function

The network outputs the approximated Q-value for each action at the current state:

Instead:

- continually update a replay memory table of transitions (\(s_t, a_t, r_t, s_{t+1}\)) as game episodes are played
- Train Q-network on random minibatches of transitions from the replay memory instead of consecutive samples

Class of parametrized policies: $$ \Pi=\{\pi_\theta, \theta\in \mathbb{R}^m\} $$

Value of a policy $$ J(\theta) = \mathbb{E}\Big[\sum_{t\ge 0} \gamma^tr_t|\pi_\theta\Big] $$

We want to find the optimal policy $$ \theta^\ast = \arg \max_\theta J(\theta) $$

When we differentiate this however becomes untractable: $$ \nabla_\theta J(\theta) = \int_\tau r(\tau)\nabla_\theta p(\tau;\theta)d\tau $$

Use the fact that: \(\nabla_\theta p(\tau; \theta)=p(\tau; \theta) \frac{\nabla_\theta p(\tau; \theta)}{p(\tau; \theta)} = p(\tau; \theta) \nabla_\theta \log p(\tau; \theta)\)

Inject this back: $$ \nabla_\theta J(\theta) = \int_\tau (r(\tau)\nabla_\theta \log p(\tau; \theta))p(\tau; \theta)d\tau = \mathbb{E}_{\tau\sim p(\tau;\theta)}[r(\tau)\nabla_\theta \log p(\tau; \theta)] $$ We have taken a gradient of an expectation and transformed it into an expectation of gradients.

Simple baseline: constant moving average of rewards experienced from all trajectories

From REINFORCE, the \(J(\theta)\) estimator is $$ \nabla_\theta J(\theta) = \sum_{t\ge 0} r(\tau)\nabla_\theta \log \pi_\theta(a_t|s_t) $$ In Actor-Critic it is $$ \nabla_\theta J(\theta) = \sum_{t\ge 0} (Q^\pi(s_t, a_t)-V^\pi(s_t))\nabla_\theta \log \pi_\theta(a_t|s_t) $$ The

Q-learning methods are more sample efficient but do not work in all cases

Applications of RL:

- Recurrent Attention Models (RAM) uses REINFORCE
- AlphaGo uses Policy Gradients

Pruning happens in humans too (50 trillion synapses as newborns, 1000 trillion at 1yo, 500 trillion as adolescents).

As pruning removes the neurons with the least impact (those with the smallest weights) this changes the weight distribution.

With trained quantization a special training procedure is needed (that trains only the centroids):

How many bits do we actually need?

With pruning + quantization we can reduce the AlexNet memory footprint by about 97% with minimal reduction in accuracy

Achieves AlexNet level accuracy with 50x fewer parameters. Can be further compressed with the methods above to a total of 510x fewer parameters.

Decompose a conv layer with \(d\) filters with filter size \(k\times k \times c\) to

- A layer with \(d'\) filters (\(k\times k \times c\))
- A layer with \(d\) filters (\(1 \times 1 \times d'\))

Variables and some operations should be kept in 32 bits for numerical reasons

Take the "output of geometric ensemble" and divide score by a temperature to get a much softer distribution $$ p_i = \frac{\exp \frac{z_i}{T}}{\sum_j \exp \frac{z_i}{T}} $$

If this were the case however, each adversarial example would be unique to a specific trained model. But often, a single adversarial example can fool many different models (or different trained instances of the same model) in the same way.

Furthermore by taking the difference between the original example and the adversarial example we get a direction in input space that we can use to turn almost all items in the dataset into adversarial examples.

→ systematic example, not random effect

Modern neural nets are piecewise linear

- nonlinear interactions between parameters and the output
- piecewise linear (with relatively few pieces) interactions between the input and the output

→ Optimization problems that optimize the input of the model are much easier than problems that optimize the parameters

We can visualize this by putting \((\text{example})+\epsilon(\text{fixed random unit vector})\) through a relu convnet model. Only in the center the network returns the correct class, automobile.

In large images with many pixels we can make very small changes to each pixel but travel very far in vector space (as measured by the l2 norm).

When doing adversarial experiments make sure not to apply too large changes that actually change the true class of the image. For example use max-norm (L∞ norm) that constrains the maximum change for each pixel.

With the max norm the total change to the image can be large but the change can't be concentrated on a small area.

\(\tilde x = x+ \epsilon \text{ sign}(\nabla_x J(X))\)

Adversarial examples live in more or less linear subspaces (it was initially thought they live in "little tiny pockets"):

→ Only need to get the direction right, don't need to find exact coordinates in space to go into adversarial subspace (in the images, can go upwards first, then to the right)

Adversarial examples are not just noise (noise has very little effect compared to adversarial examples). Adding random noise rarely changes the classification decision:

Since the procedure in the image above was performed on the test set, in some cases the network is already wrong, even without adding any noise (Usually in those cases adding the noise doesn't make the model correct).

Neural networks are wrong almost everywhere and behave reasonably only on a really thin manifold around the data.

- Adversarial attacks work for RL too (so far we have only been able to make them fail at their task, not perform a different task/action)
- shallow RBF networks resist adversarial pertubations fairly well

These attacks are possible cross-technique and cross-training instance