Report – Week 8, Chapters 10–11: CNNs & Residual Neural Networks
Presentation Date: 15.12.2025
Presenter: Aisaiah Pellecer
Discussion Lead: Fadi Dalbah
Report by: Aisaiah Pellecer
Chapter 10: Convolutional Neural Networks
Summary
Convolutional networks consist of convolutional layers, where each hidden unit is computed as a weighted sum of nearby inputs, plus a bias, passed through an activation function. Richer information is retained by repeating this operation with different weights and biases to create multiple channels at each spatial position.
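This computation can be sketched in a few lines of NumPy. The sketch below is illustrative (a "valid" 1D convolution with a ReLU activation); the function and variable names are not from the chapter.

```python
import numpy as np

def conv1d_layer(x, weights, biases):
    """Valid 1D convolution: each hidden unit is an activation of a
    weighted sum of nearby inputs plus a bias, one row per channel.

    x:       (length,) input signal
    weights: (channels, kernel) one shared kernel per channel
    biases:  (channels,) one bias per channel
    returns: (channels, length - kernel + 1) hidden units
    """
    channels, kernel = weights.shape
    out_len = len(x) - kernel + 1
    h = np.empty((channels, out_len))
    for c in range(channels):
        for i in range(out_len):
            # weighted sum of nearby inputs + bias, then ReLU activation
            h[c, i] = np.maximum(0.0, weights[c] @ x[i:i + kernel] + biases[c])
    return h

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([[1.0, -1.0], [0.5, 0.5]])   # two channels, kernel size 2
b = np.array([0.0, 1.0])
h = conv1d_layer(x, w, b)
print(h.shape)  # (2, 3)
```

Note that the same kernel `weights[c]` is applied at every spatial position, which is exactly the weight sharing discussed below.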
Most CNN models, such as encoder–decoder models, use a sequence of layers that change the spatial size of the data as well as the number of channels (by upsampling or downsampling). Toward the end of the network, fully connected layers are often used to produce the final output.
The structure of these models offers key performance benefits that overcome the limitations of fully connected networks.
- The translational equivariance of convolutional layers is especially powerful for image-based tasks: the representation of an object remains consistent when the object shifts position in the image. In other words, the model encodes an inductive bias that suits image classification, object detection, and semantic segmentation.
- Unlike in fully connected networks, weights and biases are shared across every spatial position. As a result, there are far fewer parameters, and larger inputs (e.g., larger images) can be handled.
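The parameter savings from weight sharing can be made concrete with a quick back-of-the-envelope calculation. The layer sizes below are illustrative choices, not figures from the chapter.

```python
# Parameter counts for one layer mapping a 224x224 RGB image to
# 64 feature maps of the same spatial size (illustrative sizes).

# Fully connected: every input unit connects to every output unit.
fc_params = (224 * 224 * 3) * (224 * 224 * 64)  # weights alone, ~4.8e11

# Convolutional: one 3x3 kernel per output channel, shared across
# all spatial positions, plus one bias per channel.
conv_params = 3 * 3 * 3 * 64 + 64

print(fc_params)
print(conv_params)  # 1792
```

The convolutional layer needs roughly eight orders of magnitude fewer parameters, which is why larger inputs become feasible at all.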
Common CNN Architectures
- Image Classification: AlexNet
- Object Detection: YOLO (You Only Look Once) Network
- Semantic Segmentation: Encoder–Decoder Networks or Hourglass Networks
Chapter 11: Residual Neural Networks
Residual neural networks use residual blocks (or residual layers), which allow much deeper networks to be trained: each block computes an additive change to the current representation instead of transforming it directly.
Residual connections allow the final output to be expressed as the sum of the original input and the outputs of multiple shorter networks. This enables information from the original input to be preserved as the network depth increases.
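The additive structure can be sketched directly. The following is a minimal NumPy sketch of a two-layer residual block; the names and the use of plain matrix layers (rather than convolutions) are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, b1, W2, b2):
    """A residual block computes an additive change f(x) to the
    current representation: output = x + f(x), rather than
    replacing x outright."""
    f = W2 @ relu(W1 @ x + b1) + b2  # a small two-layer network
    return x + f                     # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((8, 4)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((4, 8)) * 0.1, np.zeros(4)

y = residual_block(x, W1, b1, W2, b2)
# If the weights are all zero, the block reduces to the identity,
# so the original input is preserved exactly.
```

Because the block's output is `x` plus a correction, stacking many blocks keeps a direct path from the input to the final output.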
Limitations of Sequential Processing
In the previous chapter, it was noted that convolutional neural networks have specific features that allow them to outperform fully connected networks. However, their depth cannot be arbitrarily increased: performance tends to degrade as the network becomes deeper. This limitation is a direct result of sequential processing, where the output of each layer feeds only into the next layer in a strict sequence.
As networks grow deeper, small changes to parameters in early layers can cause large and unpredictable changes in the loss gradients. Even with proper weight initialization that avoids vanishing or exploding gradients, optimization algorithms rely on finite step sizes, which can move the model to regions of the loss surface with unrelated gradients. As a result, the loss surface can become highly irregular (known as shattered gradients), making optimization increasingly difficult in deep sequential networks.
Exploding Gradients and Batch Normalization
While residual blocks help enable deeper network training and mitigate the problem of vanishing gradients, they are not immune to exploding gradients. Forward propagation can also become unstable, as the variance of the activations may increase exponentially with depth.
These issues are addressed using Batch Normalization, which is applied independently to each hidden unit. Batch Normalization first standardizes each activation to zero mean and unit variance across the batch, then shifts and rescales it by an offset and scale that are learned during training. This makes the network invariant to rescaling of the weights and biases that contribute to each activation.
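The per-unit computation is short enough to sketch. The following is a minimal NumPy version of the training-time forward pass (parameter names `gamma`/`delta` are illustrative):

```python
import numpy as np

def batch_norm(h, gamma, delta, eps=1e-5):
    """Batch normalization across a batch, independently per unit:
    standardize to zero mean / unit variance, then apply a learned
    rescaling (gamma) and shift (delta)."""
    mean = h.mean(axis=0)
    var = h.var(axis=0)
    h_hat = (h - mean) / np.sqrt(var + eps)  # standardized activations
    return gamma * h_hat + delta

# Batch of 4 examples, 2 hidden units at very different scales.
h = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
out = batch_norm(h, gamma=np.ones(2), delta=np.zeros(2))
# With gamma=1, delta=0 each unit ends up with ~zero mean and
# ~unit variance, regardless of the scale of its inputs.
```

Rescaling the incoming weights scales `mean` and `var` by the same factor, so `h_hat` is unchanged, which is the invariance mentioned above.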
Cost of Batch Normalization: Batch Normalization introduces two additional learned parameters (an offset and a scale) at each hidden unit, which increases the model size. However, Batch Normalization enables:
- Stable Forward Propagation: The variance of the activations at initialization grows only linearly with the number of residual blocks, rather than exponentially.
- Higher Learning Rates: The loss surface and its gradients change more smoothly and predictably, so larger optimization steps can be taken and solutions that generalize well can be found.
- Regularization: Batch Normalization introduces noise during training, which improves generalization.
Common Residual Architectures
- ResNets
- DenseNets
- U-Nets and Hourglass Networks
Discussion Notes
What are the uses for 1D-CNNs?
1D-CNNs can be used for time series or any other sequential data, especially when only immediate context is needed for interpretation. Where larger context is needed, transformers are nowadays the usual choice.
What is stride good for?
While a stride greater than one is used for downsampling between layers, it simply discards data. Pooling is often a preferable way to downsample, since it does not subsample "mindlessly".
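A tiny 1D example illustrates the difference (the signal values are made up for illustration):

```python
import numpy as np

x = np.array([0.0, 9.0, 0.0, 0.0, 7.0, 0.0])

strided = x[::2]                       # stride 2: keep positions 0, 2, 4
pooled = x.reshape(-1, 2).max(axis=1)  # max pool over windows of size 2

print(strided)  # [0. 0. 7.]  -- the strong response 9.0 is lost
print(pooled)   # [9. 0. 7.]  -- the strongest response in each window survives
```

Strided subsampling keeps whatever happens to fall on the sampled positions, while max pooling keeps the strongest response in each window.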
What is a 1x1 convolution used for?
A 1x1 convolution is used to decrease or increase the number of channels without changing the spatial dimensions.
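Concretely, a 1x1 convolution is a per-position linear map over the channel dimension. A minimal NumPy sketch (names and shapes are illustrative):

```python
import numpy as np

def conv1x1(x, W):
    """A 1x1 convolution mixes channels independently at each spatial
    position: a linear map from C_in to C_out channels that leaves
    height and width unchanged."""
    # x: (C_in, H, W_sp), W: (C_out, C_in)
    return np.einsum('oc,chw->ohw', W, x)

x = np.ones((64, 8, 8))       # 64 channels on an 8x8 feature map
W = np.ones((16, 64)) / 64.0  # project 64 channels down to 16
y = conv1x1(x, W)
print(y.shape)  # (16, 8, 8)
```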
What are other applications for CNNs other than images?
CNNs are sensible for any spatial or sequential data where the immediate context between nearby data points matters.
What is the interpretation of a skip connection in a ResNet?
A skip connection lets the network learn how much each layer should contribute to the output, so layers whose residual change is not useful can effectively be skipped. Skip connections also help prevent the vanishing gradient problem.
Can we identify good/bad layers by removing them permanently and see how it affects the results?
While it should be possible to do so, the effort will probably not be worth it.