Observations
- There's a strong spatial inductive bias built into CNNs (locality and translation equivariance, loosely "invariance to shift and scale"), which is why CNNs stay competitive with Transformers (a reference for this statement is in the Transformer survey paper). Transformer models make up for the missing bias with extensive training regimes, large datasets and larger models. (It would be good if we mention this in the paper somewhere.) A short equivariance demo is sketched after this list.
- The CoreML library was used to run the mobile tests on an iPhone 12; a rough conversion sketch is also included after this list.
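As a quick illustration of the inductive bias mentioned above, the snippet below (a minimal PyTorch sketch, not something from the paper) checks that a convolution is translation-equivariant: shifting the input shifts the output in the same way. Circular padding is used so the equality is exact; with zero padding it holds everywhere except near the borders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# 3x3 convolution with circular padding so the shift argument is exact.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 3, 32, 32)

shift = (5, 7)  # shift by 5 pixels down, 7 pixels right
shift_after = torch.roll(conv(x), shifts=shift, dims=(2, 3))   # conv, then shift
shift_before = conv(torch.roll(x, shifts=shift, dims=(2, 3)))  # shift, then conv

print(torch.allclose(shift_after, shift_before, atol=1e-6))  # True
```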
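For reference, this is roughly how a PyTorch model is exported to CoreML with coremltools for that kind of on-device testing. The exact export settings the authors used are not stated in the paper, and the model below is just a small placeholder standing in for a MobileViT variant.

```python
import torch
import torch.nn as nn
import coremltools as ct

# Placeholder network; a real MobileViT variant would go here.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 1000),
).eval()

example_input = torch.rand(1, 3, 256, 256)       # ImageNet-style input
traced = torch.jit.trace(model, example_input)   # TorchScript trace for conversion

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 256, 256))],
    convert_to="mlprogram",
)
mlmodel.save("MobileViT.mlpackage")  # open in Xcode / benchmark on-device
```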
Good things about the paper
- The paper has two significant contributions: the MobileViT block itself (the convolution-plus-attention design described in the next point) and the demonstration that it works as a general-purpose backbone.
- They have also experimented with MobileViT as the backbone for downstream tasks like detection and segmentation, and report competitive results on both.
- The architecture itself is simple. It starts with a couple of MobileNetV2 blocks which downsample the input. After this, a self-attention layer is applied to the processed feature map (note that the input and output shapes are the same for these layers). This output is then concatenated with the output of a parallel convolution branch, and point-wise convolutions are applied to the concatenated tensor. This whole pattern is used twice (only two transformer layers); a simplified sketch is given right after this list.
- The idea of fusing attention and convolutional outputs with the help of another convolutional layer is interesting.
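A simplified PyTorch sketch of the fusion pattern described above. This is not the authors' exact MobileViT block (patch unfolding/folding, normalization and the transformer feed-forward parts are omitted, and the layer sizes are made up), but it shows the attention-plus-convolution fusion idea.

```python
import torch
import torch.nn as nn


class ConvAttentionFusion(nn.Module):
    """Self-attention on a feature map, fused with a parallel conv branch."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.local_conv = nn.Conv2d(channels, channels, 3, padding=1)      # parallel conv branch
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)       # point-wise fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape

        # Global branch: treat every spatial position as a token (shape is preserved).
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)  # self-attention
        attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)

        # Local branch: plain convolution on the same input.
        conv_out = self.local_conv(x)

        # Concatenate both branches and fuse with a 1x1 (point-wise) convolution.
        return self.fuse(torch.cat([attn_out, conv_out], dim=1))


if __name__ == "__main__":
    block = ConvAttentionFusion(channels=64)
    feat = torch.randn(2, 64, 16, 16)   # feature map after MobileNetV2-style downsampling
    print(block(feat).shape)            # torch.Size([2, 64, 16, 16])
```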
Bad things about the paper
- Just introducing transformer layers at two places in the model and then calling it a "ViT" makes little sense; the model is clearly convolutional in nature. The authors themselves mention that a significant share of the parameters comes from these 2 layers, yet there is no experiment showing that the performance boost actually comes from them. For example, they could replace the attention layers with convolutions and run a couple of experiments to show that the model does not perform as well without attention (see the ablation sketch after this list).
- They say they use the swish activation function for the entire model. Yes, it tends to perform better than a simple piecewise-linear activation like ReLU, but swish(x) = x * sigmoid(x) costs a sigmoid (an exponential) per activation; for an architecture meant to be deployed on edge devices, it would arguably be better to add a few more parameters than to spend computation on a more expensive non-linearity.
- The exact FLOP count is not reported for any variant of the model; they only mention that it is roughly half the FLOPs of DeiT on the ImageNet dataset. (A sketch of how the FLOPs could be measured directly is given after this list.)
- The final paragraph of the paper, labelled Discussion, mentions that even though the model is smaller than some well-known CNNs, it is slower than them on mobile devices, largely because transformer operations do not enjoy the same device-level optimizations that convolutions do.
- They haven't used positional embeddings in their transformer layers.
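A hypothetical sketch of the ablation suggested above: keep the rest of the block identical and swap the self-attention mixer for a shape-preserving convolution, then train both variants under the same recipe and compare accuracy. The module names and sizes here are made up.

```python
import torch
import torch.nn as nn


class SpatialSelfAttention(nn.Module):
    """Global token mixing: self-attention over all spatial positions."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)


def make_mixer(kind: str, channels: int) -> nn.Module:
    """Return a module mapping (B, C, H, W) -> (B, C, H, W)."""
    if kind == "attention":
        return SpatialSelfAttention(channels)                     # attention variant
    if kind == "conv":
        return nn.Conv2d(channels, channels, 3, padding=1)        # local-only baseline
    raise ValueError(kind)


# Both variants are drop-in replacements for each other, so the rest of the
# model and the training recipe can stay identical for the comparison.
x = torch.randn(2, 64, 16, 16)
for kind in ("attention", "conv"):
    print(kind, make_mixer(kind, 64)(x).shape)
```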
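On the FLOPs point: exact counts can be measured directly even if the paper doesn't report them. Below is a sketch using fvcore (our tool choice, not something the paper mentions, and the model is again a placeholder); note that fvcore counts one multiply-add as one FLOP, so the convention should be stated next to any reported number.

```python
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis

# Placeholder model standing in for a MobileViT variant.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 1000),
)

# FLOPs depend on the input resolution, so fix it explicitly.
flops = FlopCountAnalysis(model, torch.randn(1, 3, 256, 256))
print(f"{flops.total() / 1e6:.1f} MFLOPs")
```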
Fun Fact
- Layer norm is used in transformer models partly because the batch size has to be kept very small due to the memory footprint of large transformer models (I have myself used 2 or 4 as the batch size), and batch norm is not effective when the batch statistics are estimated from so few samples; layer norm instead normalizes each sample over its own features, independent of the batch size. Learning-rate warmup is used for a related reason: training is unstable in the early steps. A small demo is sketched below.
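A quick PyTorch demo of the batch-size point: BatchNorm's output for a given sample depends on the other samples in the (tiny) batch, while LayerNorm normalizes each sample over its own features and is unaffected by batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)  # 4 samples, feature dim 8

bn = nn.BatchNorm1d(d, affine=False)
ln = nn.LayerNorm(d, elementwise_affine=False)

# Normalize the same first two samples inside batches of different sizes.
bn_small = bn(x[:2])        # batch of 2
bn_large = bn(x)[:2]        # same samples inside a batch of 4
print(torch.allclose(bn_small, bn_large))   # False: stats depend on the batch

ln_small = ln(x[:2])
ln_large = ln(x)[:2]
print(torch.allclose(ln_small, ln_large))   # True: per-sample statistics
```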