Activation functions





like sigmoid but zero centered


\(\textrm{relu}(x) = \textrm{max}(0, x)\)


relu neurons can be initialized with slightly positive bias (e.g. 0.01) to decrease the likelihood of it being dead

Leaky ReLU

\(\textrm{relu}(x) = \textrm{max}(0.01x, x)\)
Leaky ReLUs don't die, better gradient behaviour

Parametric ReLU (PReLU)

\(\textrm{relu}(x) = \textrm{max}(\alpha x, x)\)
\(\alpha\) is learned

Exponential Linear Units (ELU)

\(f(x) = \left\{ \begin{array} {ll} x & \textrm{if} \, x > 0 \\ \alpha (\textrm{exp}(x)-1) & \textrm{if} \, x \le 0 \\ \end{array}\right.\)


Maxout Neuron

\(\textrm{max}(w^T_1x+b_1, w^T_2x+b_2)\)


Hierarchical Softmax

Decomposes labels into a tree, where each label is a path along the tree. A softmax classifier is trained at every node of the tree to select either the left or right branch.

Multi -label/-attribute/-tag loss


w_i = 0 (all weights=0)

all neurons do the same operation and have the same gradient, then they update all in the same way → all neurons do essentially the same thing

small random numbers W = 0.01*np.random.randn(D, H)

Works okay for small networks, in deeper networks since all values are multiplied by 0.01 at each layer the values get smaller and smaller until at the end of the network they are extremely tiny/zero. Same thing happens for gradients in the other direction → network doesn't learn

large random numbers (e.g. randn around 1)

almost all neurons completely saturated with (tanh)

Xavier initialization (Glorot et al., 2010)

W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in) # layer initialization

To work with relu use this (note additional "/2") (He et al., 2015). This accounts for the 50% of neurons that are dead.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in/2) # layer initialization

Cross Validation


See this.

Some quotes:
"Unintuitive effects and their consequences. Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input. Note that in linear classifiers where the weights are dot producted wTxiwTxi (multiplied) with the inputs, this implies that the scale of the data has an effect on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you’d have to lower the learning rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle ways! And having intuitive understanding for how the gradients flow can help you debug some of these cases."

"We discussed the importance of staged computation for practical implementations of backpropagation. You always want to break up your function into modules for which you can easily derive local gradients, and then chain them with chain rule. Crucially, you almost never want to write out these expressions on paper and differentiate them symbolically in full, because you never need an explicit mathematical equation for the gradient of the input variables. Hence, decompose your expressions into stages such that you can differentiate every stage independently (the stages will be matrix vector multiplies, or max operations, or sum operations, etc.) and then backprop through the variables one step at a time."

Gradient Descent

Problems with SGD

SGD + Momentum, Nesterov Momentum

Many of the problems with SGD are solved by SGD+momentum. Momentum tracks "velocity" and applies friction \(\rho\). This helps to avoid oscillations and accelerates traversal of local minima and saddle points.

Nesterov momentum uses lookahead position (after adding velocity) to compute gradient. Formulation of NAG is a bit annoying since we usually want to compute the loss function and gradient at the same point. Change of variables fixes this

Does momentum causes us to skip over minima (when the minimum lies in a very narrow basin)? Yes, but usually those sharp minima are minima that we want to avoid anyway because they lead to overfitting. If we increase our training set, those sharp minima may disappear, while large flat minima will remain.


grad_squared = 0
while True:
	dx = compute_gradient(x)
	grad_squared += dxdx
	x -= lr  dx / (np.sqrt(grad_squared) + 1e-7)
Accelerates movement along dimensions of slow movement (weak gradient) and slows movement along dimensions of fast movement (strong gradient). Division by a large/small number.

Problem: grad_squared is never "reset" → step size decrease over the course of training


fixed the problem with AdaGrad by decaying the grad_squared term
grad_squared = 0
while True:
	dx = compute_gradient(x)
	grad_squared = decay_rate  grad_squared + (1-decay_rate)dxdx
	x -= lr  dx / (np.sqrt(grad_squared) + 1e-7)


Momentum: build up velocity by adding averaging the gradients AdaGrad & RMSProp: build up estimate of squared gradient and divide by those Adam: combines both
Adam without bias correction:
first_moment = 0
second_moment = 0
while True:
	dx = compute_gradient(x)
	first_moment = beta1  first_moment + (1+beta1)  dx # Momentum
    second_moment = beta2  second_moment + (1-beta2)  dxdx
	x -= lr  first_moment / (np.sqrt(second_moment) + 1e-7)
Problem: At the first training steps first and second moment estimates are still close to zero. This causes us to take too large steps in the beginning. (because of the division by sqrt second_moment)

Full Adam adds bias correction to fix this:
first_moment = 0
second_moment = 0
for t in range(num_iterations):
	dx = compute_gradient(x)
	first_moment = beta1  first_moment + (1+beta1)  dx # Momentum
    second_moment = beta2  second_moment + (1-beta2)  dxdx # AdaGrad/RMSProp
    first_unbias = first_moment / (1-beta1  t)
    second_unbias = second_moment / (1-beta2  t)
	x -= lr  first_unbias / (np.sqrt(second_unbias) + 1e-7)
One problem with Adam: The taco shell optimization landscape problem above (under "problems with sgd") is only partly fixed by adam. Specifically only in cases where the "taco shell axes" are aligned with coordinate axes. In other cases where the valley is rotated Adam still has the same problem.

Learning rate decay

Common with sgd/momentum, less commonly used with adam.
Exponential Decay: \(\alpha = \alpha_0 e^{-kt}\)
1/t decay: \(\alpha = \alpha_0/(1+kt)\)
Step decay: e.g. decay learning rate by half every few epochs

Higher-Order Optimization

We can use higher order derivatives instead of only the first one. This has some advantages (we don't necessarily need a learning rate since we can directly step to the minimum of our quadratic approximation) but is very computationally expensive.

Hessian Matrix for Newton parameter update has O(n^2) elements, inverting takes O(n^3)

Instead there are less expensive versions

works well with full batch deterministic sgd but not in the stochastic min-batch setting

Model Ensembles

  1. Train multiple independent models (possibly with different hyperparameters)
  2. At test time average their results

→ A few percent higher performance



L2 Regularization

"The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: W += -lambda * W towards zero."[1]

L1 Regularization

"The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1." [1]


At test time we want to remove the stochasticity of dropout:
Instead approximate this by multiplying by dropout probability to rescale all activations. "Drop at train time, scale at test time"

Inverted Dropout: Drop and scale at train time, don't do anything at test time.

Bias regularization

Uncommon since bias parameters don't interact with the data multiplicatively, therefore don't control how individual data dimensions influence the final output

Data Augmentation

Transform images/data in such a way that the label is preserved


Instead of randomly zeroing out neuron activations, zero out some of the values in the weight matrix instead.

Fractional Max Pooling

Use multiple different pooling maps for each pooling layer

Stochastic Depth

During training drop some layers

Batch Normalization (Ioffe and Szegedy, 2015)

"We want gaussian activations → just make them so"

Normalize by empirical mean and variance (independently for each dimension, applied batch-wise).

Can be applied to fully connected and convolutional layers. In convolutional layers we normalize across all spatial locations. (For conv only one scalar mean and variance, for fully connected possibly one scalar mean and variance for each feature)

BN avoids saturation, but maybe we want some saturation? Add parameters \(\beta\) and \(\gamma\) to control mean and variance. In this way we don't simply normalize the layer output but instead give the layer explicit control over it's output's mean and variance. (layer can just recover identity mapping)

Why should data everywhere in the network (and during preprocessing) be normalized?

CNN Architectures

Other architectures:

Number of layers of a network: generally given as the number of weight layers (conv, fc)


Current state \(h_t\), state transition function \(f_W\), current input \(x_t\), current output \(y_t\). The recurrence relation is of the form:

\(h_t=f_W(h_{t-1}, x_t)\)

for example:

\(h_t=\tanh(W_{hh}h_{t-1}+W_{xh}x_t)\) \(y_t=W_{hy}h_t\)

or with biases:

\(h_t=\tanh(W_{hh}h_{t-1}+W_{xh}x_t+b_h)\) \(y_t=W_{hy}h_t+b_y\)

(Using \(\tanh\) generally causes the vanishing gradient problem but avoids the problem that unbounded activations in RNNs may explode, see this and this.
One-to-one, one-to-many, many-to-one, many-to-many architectures
Sequence to sequence: Many-to-one encoder + One-to-many decoder

Char-level RNN

Generating text with a char-level rnn:

char-level rnn example: karpathy/ What are unnormalized log probabilities in See this The softmax function is \(\exp{z_k}/\sum_i{\exp{z_i}}\). \(z\) are called the unnormalized log probabilities because they are not yet normalized by \(\sum_i{\exp{z_i}}\) and \(z_k = \log (\exp z_k)\).
ys[t] =, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars

Backpropagation through time

Forward pass through entire sequence to compute loss, backward pass through entire sequence to compute gradient

Truncated BPTT: Forward and backward through smaller chunks of the whole sequence (carry over hidden state from previous chunk to the beginning of the next chunk)

Image Captioning

Feed image through convnet (for example inceptionv4 without the last two layers) to get a feature vector \(v\). Then run an rnn with the hidden state recurrence relation \(h_t=\tanh(W_{hh}h_{t-1}+W_{xh}x_t+W_{ih}v)\) which incorporates the image information at every timestep. (Alternatively just use the feature vector as the initial hidden state of a language model rnn instead of the changed recurrence relation)

Image Captioning with Attention

Instead of a single feature vector for the entire image, the convnet now extracts one feature vector for each patch of the image (e.g. 14x14 feature map for a 300x300px image), which is then fed into the lstm language model. In addition to the vocabulary prob. distribution the lstm model also outputs an (attention) distribution over all locations in the image. As the next input the model then receives the word/char sampled from the first prob. distr. and the (14x14) feature vector map weighted by the attention distribution.
Soft-attention: weighted combination of features from all image locations
Hard attention: Select exactly one location in the image to look at
Hard-attention is not differentiable. "Need to use something slightly fancier than vanilla backprob."

Visual question answering: use rnn to summarize question, cnn to summarize image, then use another rnn to generate answer based on both summaries.

Multilayer RNNs: hidden states from one rnn (one for each timestep) are passed to another rnn

Vanilla RNN Gradient Flow

When backpropping through matrix multiply with \(W\) we multiply by the transpose of \(W\). Therefore in a deep RNN when computing the gradient we have repeated multiplication by W and application of tanh. This could cause our gradients to either explode or vanish. If \(W\) is a scalar then the gradient explodes or vanishes depending on whether \(W > 1\) or \(W < 1\). In the non-scalar case it depends on whether the largest singular value \(> 1\) or \(< 1\).

Slight hack to fix exploding gradients: gradient clipping
grad_norm = np.sum(gradgrad)
if grad_norm > threshold:
    grad = (threshold / grad_norm)
(grad_norm might be the l2 norm)

Long Short Term Memory (LSTM)

explicit forget, input, update gates
better gradient flow: gradient flows through forget gate operation but this can be different for each timestep so we are not repeatedly multiplying by the same matrix. Also the forget gate is also in the range [0, 1] due to the sigmoid. Furthermore we have no non-linearity in the gradient path.
GRU: LSTM variant

Detection and Segmentation

Semantic Segmentation

Task: Classify every pixel of an image

Upsampling Methods

Why is it called transposed conv?
Use for example 4x4 stride=2 transposed conv or 2x2 stride=2 transposed conv to avoid checkerboard artefacts (or resize conv).

Classification + Localization

Task: single object classification + localization
Solution: single convnet that outputs classification score (apply softmax loss) and (x, y, w, h) bounding box coordinates (apply l2/l1/any regression loss)
Multi-task loss: weighted sum of multiple losses, each for a (possibly) different task. It's difficult to set this weighting parameter well because the parameter directly changes the value of the loss function (we can't just look which hyperparameter setting causes the loss after training to be the lowest because the parameter directly influences the loss scale). Recommended to find a different metric (accuracy,..) by which to judge the hyperparameter.
Regression loss: basically any loss other than cross-entropy or softmax (e.g. l1, l2,..). Losses for continuous outputs rather than classification categories
Related task: human pose estimation (e.g. network outputs the position of 14 fixed joints)

Object Detection

Task: multiple objects, draw bounding boxes and classify
Sliding window approach intractable since there are too many possible crops Solution: "Region Proposals" (for example with "Selective Search") to propose about 2000 possible image crops that could contain objects (high noise, but high recall).


Instance Segmentation

Mask R-CNN
classification + bounding box regression + segmentation mask for each region proposal
Can also do pose estimation

Visualizing & Understanding

Visualizing Convnet filter weights

We can easily visualize the filters of the first layer of a convnet (these are the filter weights):
Visualizing later layers doesn't work well since activations in later layers are not in terms of pixels of the input image but rather in terms of activations of previous layers:

Maximally activating patches

One way to visualize which input images maximally excite a specific neuron is to iterate over a dataset and look for the image that maximally excites that neuron.

Cluster by L2-nn in feature space

We can also cluster similar images from a dataset by looking for l2-nn in feature space. L2 dist in feature space is related to semantic difference, L2 dist in pixel space is related to pixel similarity.

Activation Visualization

We can also look at activation values (here one filter in one layer seems to be activated at neurons over the persons face):

Occlusion Experiments

Saliency Maps

Gradient of unnormalized class score with respect to image pixels, take absolute value and max over RGB channels
Can be used for segmentation without supervision

Dimensionality Reduction

Guided Backpropagation

Only backprop positive gradients through each ReLU. This way we only keep track of positive influences throughout the network → Images come out nicer
Collect maximally activating patches, then run guided backprop to find pixels that maximally affect the label

Synthesizing images with Gradient Ascent

Use gradient ascent (with a natural image regularizer) to maximize neuron value/label value of a synthetic image.
By adding additional priors we can generate more realistic images.

Deep Dream

Amplify activations of entire layer of neurons by performing gradient descent on the image.
+ jiggle image as regularization to increase spatial smoothness

Feature inversion

Reconstruct image from feature vector at a specific layer

Neural Texture Synthesis

Gram Matrix: pick two feature vectors of shape \(W \times H\), matrix product create co-occurrence matrix between different features. Then take the average of this matrix over all these w, h feature vector pairs
Use gram matrices for neural texture synthesis: pass target image through cnn and compute gram matrix. Then use gradient descent to train new image to have a similar gram matrix. (Or use weighted average of multiple gram matrices at different layers)

Neural Style Transfer

combines feature inversion and neural texture synthesis (gram matrices) to apply a given style to a content image
Pass style image through network to get gram matrices Pass content image to get features
Then run feature inversion and neural texture synthesis in parallel

Generative Models

Unsupervised learning

Generative Models (density estimation) Given training data ~ p_data, generate new samples from distribution p_model, where p_data ≈ p_model

Why generative models:

Fully visible belief network (explicit density)

Use chain rule for probabilities to decompose the likelihood of an image into a product of 1d distributions.

PixelRNN defines the tractable density function:
$$ p(x) = \prod_{i=1}^n p(x_i | x_1, \ldots, x_{i-1}) $$
Then learn to generate the image pixel by pixel (computes probability distribution over all pixel intensities)

PixelCNN: still generate image pixels sequentially but don't use an RNN and instead use a CNN to look at the context region around the next pixel.
For both PixelRNN and PixelCNN density estimation is explicit because the network predicts a probability distribution of possible values for each pixel and by the chain rule for the entire image.

Variational Autoencoders

General Autoencoders: Autoencoders allow us to take advantage of large amount of unlabeled data and carry over this knowledge to smaller supervised tasks.
  1. Train autoencoder on a large amount of unlabeled data
  2. Throw away decoder
  3. Train classifier on top of the (feature) encoder with smaller amount of labeled data. (and possibly finetune encoder)

Variational Autoencoders (probabilistic spin to traditional autoencoders) define intractable density function with latent \(z\) (expectation over all possible values of \(z\))
$$ p_\theta(x) = \int p_\theta(z) p_\theta(x|z)dz $$
(this integral is intractable because we cannot sum over all \(z\))
Instead optimize variational lower bound ("ELBO") → See lecture for exact discussion and derivations

Generative Adversarial Networks

GANs allow us to sample directly from the (implicit) distribution without modelling it explicitly. Trained through a 2-player game.
Instead of sampleing from a complex high dimensional distribution, sample from a simpler one and transform it into a complex one using a neural network
To train a GAN we alternate between the generator and discriminator objective.
We want to minimize the generator objective in the graph above. Unfortunately we only get a strong gradient when the generator is already very good. Instead optimize the following objective
We are able to do interpolation and interpretable vector math with latent vectors.

Deep Reinforcement Learning

Markov Decision Process \((\mathcal{S, A, R}, \mathbb{P}, \gamma)\)

satisfies the Markov property

A policy is a function \(\pi\colon \mathcal{S}→\mathcal{A}\) The optimal policy \(\pi^\ast\) maximizes the cumulative future discounted reward $$ \pi^\ast = \arg \max_\pi \Bigg[\sum_{t\ge 0}\gamma^t r_t|\pi \Bigg] $$ with \(s_0\sim p(s_0), a_t\sim\pi(\cdot|s_t), s_{t+1}\sim p(\cdot|s_t, a_t)\).

Trajectory: \(s_0, a_0, r_0, s_1, a_1, r_1,\ldots\)

Value function at state \(s\) is the expected cumulative reward from following the policy from state \(s\). $$ V^\pi(s) = \mathbb{E}\Bigg[\sum_{t\ge0} \gamma^t r_t | s_0 = s, \pi\Bigg] $$ Q-value function at state \(s\) and action \(a\) is the expected cumulative reward from taking action \(a\) and the following the policy $$ Q^\pi(s, a) = \mathbb{E}\Bigg[\sum_{t\ge0} \gamma^t r_t | s_0 = s, a_0=a, \pi\Bigg] $$ The optimal Q-value function is the maximum cumulative reward achievable from taking action \(a\) and then following a policy. $$ Q^\ast(s, a) = \max_\pi \mathbb{E}\Bigg[\sum_{t\ge0} \gamma^t r_t | s_0 = s, a_0=a, \pi\Bigg] $$ \(Q^\ast\) satisfies the Bellman equation $$ Q^\ast(s,a) = \mathbb{E}_{s'\sim\epsilon}[r+\gamma\max_{a'} Q^\ast(s', a')|s, a] $$ If the optimal state-action values for the next step \(Q^\ast(s', a')\) are known then the optimal strategy is to take the action that maximizes the expected value of \(r+\gamma Q^\ast(s', a')\)

The optimal policy \(\pi^\ast\) corresponds to taking the best action in any state as specified by \(Q^\ast\)

Value iteration algorithm: use bellman equation as an iterative update

→ Instead use a function approximator to estimate the action-value function


Approximate the Q-function through a neural network and train it to approximately satisfy the Bellman equation:
The network outputs the approximated Q-value for each action at the current state:

Experience Replay

When learning from sequential samples the samples are strongly correlated which leads to inefficient learning


Policy Gradients

Problem with Q-learning: the Q-function can be very complicated (high dimensional state, lots of possible actions) → Just learning the policy directly can be much simpler

Class of parametrized policies: $$ \Pi=\{\pi_\theta, \theta\in \mathbb{R}^m\} $$
Value of a policy $$ J(\theta) = \mathbb{E}\Big[\sum_{t\ge 0} \gamma^tr_t|\pi_\theta\Big] $$
We want to find the optimal policy $$ \theta^\ast = \arg \max_\theta J(\theta) $$
REINFORCE $$ J(\theta) = \mathbb{E}_{\tau\sim p(\tau;\theta)}[r(\tau)] = \int_\tau r(\tau)p(\tau;\theta)d\tau $$ where \(r(\tau)\) is the reward of a trajectory \(\tau=(s_0, a_0, r_0, s_1,\ldots)\)
When we differentiate this however becomes untractable: $$ \nabla_\theta J(\theta) = \int_\tau r(\tau)\nabla_\theta p(\tau;\theta)d\tau $$
Use the fact that: \(\nabla_\theta p(\tau; \theta)=p(\tau; \theta) \frac{\nabla_\theta p(\tau; \theta)}{p(\tau; \theta)} = p(\tau; \theta) \nabla_\theta \log p(\tau; \theta)\)

Inject this back: $$ \nabla_\theta J(\theta) = \int_\tau (r(\tau)\nabla_\theta \log p(\tau; \theta))p(\tau; \theta)d\tau = \mathbb{E}_{\tau\sim p(\tau;\theta)}[r(\tau)\nabla_\theta \log p(\tau; \theta)] $$ We have taken a gradient of an expectation and transformed it into an expectation of gradients.
Simple baseline: constant moving average of rewards experienced from all trajectories


The raw values of the rewards \(r(\tau)\) are not necessarily meaningful. For example when they are all positive, we would just try to keep pushing all action probabilities upwards. Instead, the actor-critic algorithm incorporates Q-learning into policy gradients and uses \(Q^\pi(s_t, a_t)-V^\pi(s_t)\) as the measure of "happiness" that we get from taking action \(a_t\) in state \(s_t\).
From REINFORCE, the \(J(\theta)\) estimator is $$ \nabla_\theta J(\theta) = \sum_{t\ge 0} r(\tau)\nabla_\theta \log \pi_\theta(a_t|s_t) $$ In Actor-Critic it is $$ \nabla_\theta J(\theta) = \sum_{t\ge 0} (Q^\pi(s_t, a_t)-V^\pi(s_t))\nabla_\theta \log \pi_\theta(a_t|s_t) $$ The advantage function is \(A^\pi(s,a) =Q^\pi(s, a)-V^\pi(s)\) and denotes how much better the action was than expected.


Policy gradient methods are stable and very general but suffer from high variance
Q-learning methods are more sample efficient but do not work in all cases

Applications of RL:

Efficient Methods and Hardware for Deep Learning

The following methods drastically decrease the memory footprint (by 10x for inception and resnet, about 40x for others) while increasing speed and decreasing energy usage (by about 3x) with no hit on accuracy.


By iteratively pruning and retraining weights, the majority of the neurons can be removed with very little impact on performance.
Pruning happens in humans too (50 trillion synapses as newborns, 1000 trillion at 1yo, 500 trillion as adolescents).
As pruning removes the neurons with the least impact (those with the smallest weights) this changes the weight distribution.

Weight Sharing (Trained Quantization)

Use k-means clustering to cluster similar values in the weight matrices and replace them with a single quantized shared weight. (8x less memory with 4bit instead of 32 bit)
With trained quantization a special training procedure is needed (that trains only the centroids):
How many bits do we actually need?
With pruning + quantization we can reduce the AlexNet memory footprint by about 97% with minimal reduction in accuracy

Huffman Coding

Use Huffman coding to let the binary representation of frequently used quantized weights to use fewer bits and use more bits for infrequently used weights.


Reduced version of Inception with just two pathways per block.
Achieves AlexNet level accuracy with 50x fewer parameters. Can be further compressed with the methods above to a total of 510x fewer parameters.


Train with float then quantize weights and activations and use for inference

Low Rank Approximation

Layer responses lie in a low-rank subspace
Decompose a conv layer with \(d\) filters with filter size \(k\times k \times c\) to

Binary/Ternary Net

Train with full precision, only use weights {-1, 0, 1} at inference time.

Winograd Convolution

Alternative way to perform convolution that performs 2.25 fewer FMAs (fused multiply-add ops)


Data parallelism: run multiple training examples in parallel
Model parallelism: split model over multiple processors

Mixed Precision Training

Many operations can be performed in 16 bit mode with little change in accuracy.
Variables and some operations should be kept in 32 bits for numerical reasons

Model Distillation

Given multiple large teacher networks (e.g. Googlenet + Vggnet + Resnet) model distillation aims to distill all their knowledge into a smaller student net.
Take the "output of geometric ensemble" and divide score by a temperature to get a much softer distribution $$ p_i = \frac{\exp \frac{z_i}{T}}{\sum_j \exp \frac{z_i}{T}} $$

Adversarial Examples and Adversarial Training

Almost all machine learning models (not just neural networks) are highly vulnerable to adversarial examples.

Adversarial Examples from Overfitting

One theory was that adversarial attacks are possible due to overfitting. Complex neural networks with lots of parameters may just have random gaps in their decision areas/boundaries.
If this were the case however, each adversarial example would be unique to a specific trained model. But often, a single adversarial example can fool many different models (or different trained instances of the same model) in the same way.
Furthermore by taking the difference between the original example and the adversarial example we get a direction in input space that we can use to turn almost all items in the dataset into adversarial examples.
→ systematic example, not random effect

Adversarial Examples from Underfitting

Another theory is that adversarial attacks are possible due to underfitting
Modern neural nets are piecewise linear

→ Optimization problems that optimize the input of the model are much easier than problems that optimize the parameters

We can visualize this by putting \((\text{example})+\epsilon(\text{fixed random unit vector})\) through a relu convnet model. Only in the center the network returns the correct class, automobile.
In large images with many pixels we can make very small changes to each pixel but travel very far in vector space (as measured by the l2 norm).

When doing adversarial experiments make sure not to apply too large changes that actually change the true class of the image. For example use max-norm (L∞ norm) that constrains the maximum change for each pixel.

With the max norm the total change to the image can be large but the change can't be concentrated on a small area.

Fast Gradient Sign Method

Maximize \(J(x, \theta) + (\tilde x-x)^\top \nabla_x J(x)\) subject to \(||\tilde x- x||_\infty \le \epsilon\).
\(\tilde x = x+ \epsilon \text{ sign}(\nabla_x J(X))\)
Adversarial examples live in more or less linear subspaces (it was initially thought they live in "little tiny pockets"):
→ Only need to get the direction right, don't need to find exact coordinates in space to go into adversarial subspace (in the images, can go upwards first, then to the right)

Adversarial examples are not just noise (noise has very little effect compared to adversarial examples). Adding random noise rarely changes the classification decision:
Since the procedure in the image above was performed on the test set, in some cases the network is already wrong, even without adding any noise (Usually in those cases adding the noise doesn't make the model correct).

Neural networks are wrong almost everywhere and behave reasonably only on a really thin manifold around the data.

Transferability Attack

A transferability attack takes place when the attacker creates adversarial examples against a substitute model and uses the created examples to attack a target model. This is necessary when the attacker does not have direct access to the target model.
These attacks are possible cross-technique and cross-training instance

Failed Defenses