What are Channels and Kernels?
To understand what channels and kernels are, we must first understand what we mean by features.
Features
If I had to give a layman's definition, I would say a feature is something as simple as a dog's nose or facial features which can be used to identify a certain class of images. But these are features that we as humans can visually recognise and comprehend. There are many features characteristic of a particular class that cannot be comprehended by humans, but which emerge as a result of model training. By features, we mean spatial relationships between pixels that are specific to a certain class.
Kernels
A kernel is something that is used to identify/locate a feature. Kernels do so with the help of the convolution operation and are usually matrices of size 3x3. For example, a 3x3 kernel used to identify lines might look like the one below.
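As an illustration (this specific matrix is my own example, not from the original post), here is a 3x3 kernel that responds strongly to vertical lines:

```python
import numpy as np

# a vertical-line detector: each row sums to zero, so flat image
# regions produce 0 while a bright vertical stripe aligned with the
# middle column produces a strong positive response
vertical_line_kernel = np.array([[-1, 2, -1],
                                 [-1, 2, -1],
                                 [-1, 2, -1]])
```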
Channels/Feature Maps/Feature Bucket
Kernels or filters are used to identify features in the input to a certain layer. The outputs we get are called channels. They are the aggregation of certain features found (or not found) in the image. These channels in turn become the input to the next layer.
The key difference between a channel and a kernel is that if A * B = C (where * denotes convolution), then A and C are channels and B (the multiplier) is the kernel.
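A minimal NumPy sketch of that relationship (the names A, B, and C mirror the sentence above; strictly speaking, deep learning frameworks implement cross-correlation and call it convolution):

```python
import numpy as np

A = np.random.rand(6, 6)            # input channel, 6x6
B = np.random.rand(3, 3) - 0.5      # some 3x3 kernel

# valid convolution: output shrinks to (6 - 3 + 1) = 4 per dimension
C = np.zeros((4, 4))                # output channel
for i in range(4):
    for j in range(4):
        C[i, j] = (A[i:i+3, j:j+3] * B).sum()
```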
Why should we nearly always use 3x3 kernels?
Before the invention of backpropagation, kernel values were set manually to detect features (poor interns!!). A 2x2 kernel would be so small that it could detect where a feature begins, but it would need another kernel to detect where the feature ends. Yes, a 5x5 kernel would be better at detecting spatial relationships between pixel values, but the total number of parameters increases drastically compared to a 3x3 kernel (25/9 ≈ 2.8). It's better to use more 3x3 filters instead of 5x5 or larger kernels to reach the same receptive field, as the sketch below shows.
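A quick sanity check of that parameter argument (my own arithmetic, assuming single-channel kernels and no bias): n stacked 3x3 convolutions with stride 1 cover a (2n+1)x(2n+1) receptive field.

```python
# compare a single kxk kernel against stacked 3x3 kernels
# covering the same receptive field (stride 1, no padding)
for k in [5, 7, 9]:
    n = (k - 1) // 2   # number of stacked 3x3 convolutions needed
    print(f"{k}x{k}: {k*k} params vs {n} stacked 3x3 = {n*9} params")
```

This prints 25 vs 18, 49 vs 27, and 81 vs 36, so the savings grow with kernel size.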
The most important reason is that 3x3 convolutions are hardware-accelerated on NVIDIA GPUs.
[Image: a 3x3 kernel that can be used to detect a line at an angle]
[Image: 3x3 kernels that can be used to detect a triangle]
How are kernels initialized?
The most common method to initialise kernels is to sample the parameters of a kernel from a normal distribution.
Another popular method is Xavier initialisation, which sets a layer's weights to values chosen from a random uniform distribution bound between ±sqrt(6 / (n_in + n_out)), where n_in and n_out are the number of incoming and outgoing connections of the layer.
Although in the research paper this method was only tested on the CIFAR-10 dataset, it showed better results there. I'm not sure it works well in every setting.
Good kernel initialisation prevents gradients from vanishing or exploding.
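A minimal NumPy sketch of both schemes (the shapes and the standard deviation are illustrative choices on my part, not prescribed values):

```python
import numpy as np

def normal_init(shape, std=0.05):
    """Sample kernel weights from a zero-mean normal distribution."""
    return np.random.normal(0.0, std, shape)

def xavier_uniform(shape, fan_in, fan_out):
    """Sample kernel weights uniformly from [-limit, +limit]."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, shape)

# 32 kernels of size 3x3 over 16 input channels
w = xavier_uniform((32, 16, 3, 3), fan_in=16 * 3 * 3, fan_out=32 * 3 * 3)
```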
How many times do we need to perform 3x3 convolution operations to reach close to 1x1 from 199x199? (Type each layer output like 199x199 > 197x197 ...)
We need to perform 3x3 convolutions exactly 99 times to go from 199x199 to 1x1, since each 3x3 convolution (stride 1, no padding) shrinks the output by 2 in each dimension.
An input of 3x3 —> 1 convolution to reach 1x1
An input of 5x5 —> 2 convolutions to reach 1x1
An input of 7x7 —> 3 convolutions to reach 1x1
... and so on
Therefore we need to find the index at which 199 falls in the arithmetic progression 3, 5, 7, ...
Since the n-th term is 3 + (n - 1) * 2 = 2n + 1, setting 2n + 1 = 199 gives n = 99. It is the 99th term, which is also the answer.
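The question also asks to type out each layer output; a small script (my own, just to generate the chain) prints it:

```python
sizes = list(range(199, 0, -2))               # 199, 197, ..., 3, 1
print(" > ".join(f"{s}x{s}" for s in sizes))  # 199x199 > 197x197 > ... > 1x1
print(len(sizes) - 1)                         # 99 convolutions
```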
What happens during the Training of a DNN?
A lot of calculations!!!
The whole point of training a DNN is for the model parameters, which start from a certain initialisation, to update their values through gradient descent. Therefore, a lot of gradients flow backwards, updating the parameters until the loss function converges to a minimum or we run out of input images to train on.
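A minimal PyTorch sketch of that loop (the toy model and the random data are mine, purely illustrative):

```python
import torch
import torch.nn as nn

# a toy network: one 3x3 convolution layer followed by a classifier
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),   # 28x28 -> 26x26
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 26 * 26, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 1, 28, 28)        # fake batch of images
y = torch.randint(0, 10, (16,))       # fake labels

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                   # gradients flow backwards
    optimizer.step()                  # parameters update via gradient descent
```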