The relationship between a convolution operation’s input shape, kernel size, stride, padding and its output shape can be confusing at times.
The tutorial’s objective is threefold. In an affine transformation, a vector is received as input and is multiplied with a matrix to produce an output (to which a bias vector is usually added before passing the result through a nonlinearity).
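The vector-matrix computation described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `affine` and `relu`, the choice of ReLU as the nonlinearity, and the numeric values are assumptions made for the example, not part of the original text.

```python
def affine(weights, bias, x):
    """Return W x + b, with the matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Elementwise nonlinearity (ReLU chosen here purely as an example)."""
    return [max(0.0, v) for v in values]


# Hypothetical 2x3 weight matrix, bias vector and 3-element input.
W = [[1.0, -2.0, 0.5],
     [0.0, 1.0, 1.0]]
b = [0.5, -6.0]
x = [2.0, 1.0, 4.0]

pre_activation = affine(W, b, x)   # matrix product plus bias
output = relu(pre_activation)      # nonlinearity applied last
```

Note that the output has two elements because `W` has two rows, regardless of any structure in `x`.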
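The size of the output will be equal to the number of steps made, plus one, accounting for the initial position of the kernel. More formally, the following relationship can be inferred:

Relationship 1. For any input size $i$ and kernel size $k$, and for stride $s = 1$ and zero padding $p = 0$, the output size is $o = (i - k) + 1$.

To factor in zero padding (i.e., only restricting to $s = 1$), let’s consider its effect on the effective input size: padding with $p$ zeros changes the effective input size from $i$ to $i + 2p$.
In the general case, Relationship 1 can then be used to infer the following relationship:

Relationship 2. For any $i$, $k$ and $p$, and for $s = 1$, the output size is $o = (i - k) + 2p + 1$.

All relationships derived so far only apply for unit-strided convolutions.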
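The unit-stride relationships above can be checked with a small helper. This is a minimal sketch: the function name `output_size_unit_stride` is hypothetical, and the symbols follow the common convention of `i` for input size, `k` for kernel size, `p` for zero padding on each side and `o` for output size along one axis.

```python
def output_size_unit_stride(i, k, p=0):
    """Output size along one axis for stride 1: o = (i - k) + 2p + 1."""
    if k > i + 2 * p:
        raise ValueError("kernel does not fit inside the padded input")
    return (i - k) + 2 * p + 1


# No zero padding: a size-3 kernel on a size-4 input.
no_padding = output_size_unit_stride(4, 3)

# Zero padding of 2 on each side: the effective input size is 5 + 2*2 = 9.
with_padding = output_size_unit_stride(5, 4, p=2)
```

Padding only enters through the effective input size, so the padded case is the unpadded formula applied to `i + 2 * p`.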
A discrete convolution is sparse (only a few input units contribute to a given output unit) and reuses parameters (the same weights are applied to multiple locations in the input).
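Sparsity and parameter reuse become visible when a one-dimensional, unit-stride, unpadded convolution is written out as a matrix. The sketch below is illustrative only: `conv_matrix` is a hypothetical helper, the kernel values are made up, and the kernel is applied without flipping (the cross-correlation convention common in deep learning).

```python
def conv_matrix(kernel, input_size):
    """Matrix whose product with a 1-D input equals a unit-stride,
    unpadded convolution of that input with `kernel`."""
    k = len(kernel)
    rows = input_size - k + 1          # one row per kernel placement
    matrix = [[0.0] * input_size for _ in range(rows)]
    for r in range(rows):
        for j, w in enumerate(kernel):
            matrix[r][r + j] = w       # same weights, shifted by one column
    return matrix


kernel = [1.0, 2.0, 3.0]
M = conv_matrix(kernel, 5)

# Each output unit depends on only len(kernel) input units (sparsity).
nonzeros_per_row = [sum(1 for v in row if v != 0.0) for row in M]
```

Every row contains the same three weights and zeros elsewhere, whereas a fully-connected layer of the same shape would have fifteen independent parameters.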
Here is an example of a discrete convolution: a kernel slides across the input feature map.
In a convolutional layer the output size depends on the input size; this contrasts with fully-connected layers, whose output size is independent of the input size.
Additionally, so-called transposed convolutional layers (also known as fractionally strided convolutional layers, or – wrongly – as deconvolutions) have been employed in more and more work as of late, and their relationship with convolutional layers has been explained with various degrees of clarity.
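The size of the output is again equal to the number of steps made, plus one, accounting for the initial position of the kernel. From this, the following relationship can be inferred:

Relationship 5. For any input size $i$, kernel size $k$ and stride $s$, and for zero padding $p = 0$, the output size is $o = \lfloor (i - k) / s \rfloor + 1$.

The most general case (convolving over a zero padded input using non-unit strides) can be derived by applying Relationship 5 on an effective input of size $i + 2p$, in analogy to what was done for Relationship 2: for any $i$, $k$, $p$ and $s$, the output size is $o = \lfloor (i + 2p - k) / s \rfloor + 1$.

As before, the floor function means that in some cases a convolution will produce the same output size for multiple input sizes.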
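The general case can be sketched as a single function. The name `conv_output_size` is hypothetical, and the symbols `i`, `k`, `s`, `p` (input size, kernel size, stride, zero padding per side, all along one axis) are assumed notation; Python's integer division `//` plays the role of the floor.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis: floor((i + 2p - k) / s) + 1."""
    if s < 1:
        raise ValueError("stride must be a positive integer")
    if k > i + 2 * p:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


strided = conv_output_size(5, 3, s=2)              # no padding, stride 2
strided_padded = conv_output_size(5, 3, s=2, p=1)  # padding 1, stride 2

# The floor makes two different input sizes map to the same output size.
same_output = conv_output_size(6, 3, s=2, p=1)
```

With `i = 6` the last column of the padded input is never covered by the kernel, which is why it yields the same output size as `i = 5`.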
Because of that, this section will focus on a simplified setting. One way of defining the output size in this case is by the number of possible placements of the kernel on the input.
Let’s consider the width axis: the kernel starts on the leftmost part of the input feature map and slides by steps of one until it touches the right side of the input.
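The sliding procedure along the width axis can be simulated directly by enumerating where the kernel's left edge lands. This is a sketch under stated assumptions: `kernel_positions` is a hypothetical helper, positions are zero-based offsets, and no zero padding is applied.

```python
def kernel_positions(input_width, kernel_width):
    """Left-edge offsets visited by a kernel sliding in steps of one."""
    positions = []
    left = 0
    # Stop once the kernel's right edge would pass the input's right edge.
    while left + kernel_width <= input_width:
        positions.append(left)
        left += 1
    return positions


positions = kernel_positions(4, 3)
steps_made = len(positions) - 1     # moves after the initial placement
output_width = steps_made + 1       # one output per placement
```

Counting placements this way gives the output width as the number of steps made plus one.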
For each output channel, each input channel is convolved with a distinct part of the kernel and the resulting set of feature maps is summed elementwise to produce the corresponding output feature map.
The result of this procedure is a set of output feature maps, one for each output channel, that is the output of the convolution.
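The multi-channel procedure described above can be sketched in plain Python. The helper names `conv2d_single` and `conv2d_multi`, the nested-list layout (`inputs[in_ch][row][col]`, `kernels[out_ch][in_ch][row][col]`) and the sample values are assumptions for illustration; the sketch uses unit strides, no padding, and no kernel flipping.

```python
def conv2d_single(image, kernel):
    """Unit-stride, unpadded 2-D convolution of one channel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + a][c + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv2d_multi(inputs, kernels):
    """Convolve every input channel with its own kernel slice, then sum
    the resulting maps elementwise, once per output channel."""
    outputs = []
    for per_output in kernels:
        maps = [conv2d_single(x, k) for x, k in zip(inputs, per_output)]
        summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]
        outputs.append(summed)
    return outputs


inputs = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # input channel 0
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],   # input channel 1
]
kernels = [
    [[[1, 0], [0, 1]], [[1, 1], [1, 1]]],   # output channel 0
    [[[0, 1], [1, 0]], [[0, 0], [0, 0]]],   # output channel 1
]
feature_maps = conv2d_multi(inputs, kernels)
```

The number of output feature maps equals the number of output channels in `kernels`, and each one has the spatial size given by the single-channel case.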