Image Classification: On Robustness and Transferability of Convolutional Neural Networks
Article Summary of “On Robustness and Transferability of Convolutional Neural Networks” by Josip Djolonga, et al.
Modern deep convolutional neural networks (CNNs) are often criticized for failing to generalize under distribution shifts. With today's breakthroughs in transfer learning, however, these networks can adapt to new distributions given only a few examples.
This short summary is based on the paper "On Robustness and Transferability of Convolutional Neural Networks" by Josip Djolonga, et al., which studies the out-of-distribution and transfer performance of modern image classification CNNs, examining the training data, the model size, and the data preprocessing pipelines.
In this study, the authors find that increasing both the training set size and the model size substantially improves robustness to distribution shifts. The study also demonstrates that simple changes in preprocessing, such as the image resolution, can make a big difference in robustness. Lastly, the authors outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset for analyzing common factors of variation.
Summary
The study comprehensively examines image classification models' generalization to out-of-distribution (OOD) data (without adaptation) and their transfer learning performance (with adaptation) in the low-data regime.
Summary of the article's main contributions:
1. Presents an analysis of existing OOD metrics and benchmarks across a variety of models
- Classification accuracy of image classification models
- Analysis of OOD generalization metrics and transfer learning benchmarks
- Demonstrates that much of the variance in these metrics is explained by ImageNet validation set accuracy
2. Analyzes OOD robustness
- Analyzes the effects of training set size, model scale, training schedule, and testing resolution
3. Introduces a novel dataset for fine-grained OOD analysis to quantify robustness
- Explores common factors of variation: object size, location, and rotation angle
Distribution Shifts: From IID to OOD
Before jumping in, let's understand what OOD is and where it comes from. We can start with changes in distribution. In the classical framework, machine learning (ML) rests on the hypothesis of independently and identically distributed (IID) data, which assumes your test data has the same distribution as your training data. In the real world this is rarely the case. In his 2019 ICML keynote, Yoshua Bengio quoted Facebook AI Research lead Leon Bottou: "nature does not shuffle data, we shouldn't!" Bengio went on to explain that when we collect data we should not shuffle it; shuffling destroys valuable information about the distributions inherent in the data, and instead of destroying that information we should use it to learn how the world changes.
The problem of dataset shift, also called "out-of-distribution" (OOD) generalization, has a lot in common with transfer learning. In transfer learning, we build models that improve performance on a target task by reusing data from a related problem. OOD generalization, in contrast, breaks with the traditional IID hypothesis that your test data has the same distribution as your training data: the goal is to generalize to changes in the data distribution.
Robustness of Image Classification Models
Correcting for dataset shift is a classical machine learning problem. A dataset shift is present when we train on samples from P_train(X, Y) but are tested under a different distribution P_test(X, Y). The study focuses on covariate shift, i.e., the case where the conditionals agree, P_train(Y|X) = P_test(Y|X), but the marginals P_train(X) and P_test(X) may differ.
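To make the definition concrete, here is a minimal sketch of covariate shift, assuming NumPy and scikit-learn; the periodic labelling rule and the uniform input ranges are illustrative choices, not anything from the paper. The rule linking X to Y never changes, only the region of X the model sees does, yet accuracy collapses.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def label(x):
    # P(Y|X) is identical for train and test: y = 1 iff (x mod 2) < 1.
    return (np.mod(x, 2.0) < 1.0).astype(int).ravel()

x_train = rng.uniform(0.0, 2.0, size=(2000, 1))  # P_train(X)
x_test = rng.uniform(4.0, 6.0, size=(2000, 1))   # P_test(X): shifted support

clf = KNeighborsClassifier(n_neighbors=5).fit(x_train, label(x_train))
print("in-distribution accuracy:", clf.score(x_train, label(x_train)))
print("accuracy under covariate shift:", clf.score(x_test, label(x_test)))
```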
Several benchmark datasets are used to examine robustness with regard to covariate shifts.
The ImageNet-C and ImageNet-P datasets both contain images that have been artificially corrupted and perturbed to varying degrees.
The ImageNet-A dataset contains natural adversarial examples, which CNNs fail to predict well.
The ImageNet-V2 dataset provides a freshly collected test set for the ImageNet benchmark.
The ObjectNet dataset is a sizable real-world test set for object recognition with controls for backgrounds, rotations, and imaging viewpoints.
The Vid-Robust datasets contain frames obtained from videos, capturing natural, temporal perturbations.
ANALYSIS OF EXISTING ROBUSTNESS AND TRANSFER METRICS
Although numerous robustness metrics have been developed for a variety of distinct causes of brittleness, the relationships between these metrics are not well understood. Despite the close connection between robustness and transferability, there has been little investigation into how predictive their respective metrics are of one another.
Experiment Design
To answer these questions, 39 distinct models were evaluated on 23 different robustness metrics and 19 different transfer tasks. See the Appendix for the experiment details.
So how well does ImageNet accuracy predict performance on OOD data?
The results show that all metrics are highly correlated with one another, with a median Spearman's coefficient of 0.9. The metrics also correlate strongly with accuracy on the ImageNet validation set, with a median Spearman's coefficient of 0.89 and a minimum of 0.84.
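For illustration, this is roughly how such a rank-correlation analysis can be computed with SciPy; the per-model accuracy numbers below are made up and only show the mechanics, not the paper's results.

```python
from scipy.stats import spearmanr

# Hypothetical per-model accuracies (one entry per model), purely illustrative.
imagenet_val = [0.76, 0.79, 0.81, 0.83, 0.85, 0.87]
imagenet_v2  = [0.64, 0.67, 0.70, 0.72, 0.75, 0.78]
objectnet    = [0.26, 0.31, 0.29, 0.36, 0.40, 0.44]

rho_v2, _ = spearmanr(imagenet_val, imagenet_v2)
rho_obj, _ = spearmanr(imagenet_val, objectnet)
print(f"Spearman(ImageNet, ImageNet-V2): {rho_v2:.2f}")
print(f"Spearman(ImageNet, ObjectNet):   {rho_obj:.2f}")
```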
Can robustness metrics discriminate between models?
Results show that the majority of the metrics studied provide little improvement in model discriminability over ImageNet accuracy. This finding is conditioned on the size and makeup of the model collection studied, and it may differ for a different set of models.
How Related are OOD Robustness and Transfer Metrics?
Robustness correlates with ImageNet accuracy, and better ImageNet models transfer better. The direct association between transferability and robustness is weak (0.12), whereas the correlation between robustness and ImageNet accuracy is substantial (0.73). No metric predicts transferability well for the task groups that require more distant transfer (Specialized, Structured); raw ImageNet accuracy is the best predictor of transfer to structured tasks. The results imply that robustness metrics do not correlate with difficult transfer tasks, at least not more strongly than ImageNet accuracy does.
Experiment Summary
We have seen that many popular robustness metrics are highly correlated with one another. Some metrics, especially those not based on ImageNet, add little beyond ImageNet accuracy when it comes to telling models apart. Transferability is also related to ImageNet accuracy, and therefore to robustness. There is a correlation, but transfer exposes problems that are not always related to robustness. Moreover, ImageNet accuracy correlates better with every group of transfer tasks than any of the robustness metrics. Since all of these metrics appear related, the study next looks at strategies that work well for ImageNet and checks whether they also help on the newer robustness benchmarks.
THE EFFECTIVENESS OF SCALE FOR OOD GENERALIZATION
Increasing the amount of pre-training data, the model size, and the number of training steps yields only modest gains in ImageNet accuracy. Scaling along these axes, however, has recently been shown to lead to large improvements in transfer learning performance. This motivates a thorough study of the impact of pre-training data size, model size, training schedule, and input resolution on transfer performance and robustness.
EFFECT OF MODEL SIZE, TRAINING SET SIZE, AND TRAINING SCHEDULE
The authors evaluate large convolutional networks (ResNet-101x3) pre-trained on subsamples of ImageNet-21K. Scaling up the model, the data, and the number of training iterations all lead to better robustness metrics.
EFFECT OF THE TESTING RESOLUTION
To avoid overfitting, images are typically cropped randomly during training, using a wide range of crop sizes and aspect ratios. At test time, in contrast, images are usually resized so that the shorter side matches a predetermined value and a fixed-size central crop is taken. Consequently, there is a discrepancy between the object sizes seen at training and testing time. The accuracy of various architectures improves as the testing resolution increases.
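As a hedged sketch of this train/test preprocessing mismatch, the snippet below uses torchvision's standard transforms; the 224/256 crop sizes are the common ImageNet defaults rather than values prescribed by the paper.

```python
from torchvision import transforms

# Training: random crops over a wide range of scales and aspect ratios.
train_preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Testing: resize the shorter side to a fixed value, then take a central crop.
# Evaluating at a larger test resolution (e.g. 384 instead of 224) reduces the
# object-size mismatch and often improves accuracy.
def eval_preprocess(test_resolution=224):
    return transforms.Compose([
        transforms.Resize(int(test_resolution * 256 / 224)),
        transforms.CenterCrop(test_resolution),
        transforms.ToTensor(),
    ])
```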
Building a model that withstands common sources of variation, such as the position, size, and orientation of the object, is challenging. Ideal models would be robust to noise introduced by changes in context, object size, and orientation. For a reliable diagnosis of failure modes, one needs to be able to vary the test data along these dimensions. However, no large-scale systematic data-gathering effort is feasible, because the number of possible combinations of such factors of variation grows combinatorially.
SYNTHETIC DATASET
A synthetic dataset was created to test model performance as object position, size, and orientation change. The collection contains objects pasted onto clean backgrounds. Objects were extracted from OpenImages using their segmentation masks. Since the study concerns ImageNet-trained models, only classes that can be mapped to ImageNet were used, and all occluded, truncated, or incompletely labeled objects were removed, leaving 614 object instances across 62 classes. Nature-themed backgrounds were collected from Pexels.com (whose license allows reuse of photos with modifications); backgrounds containing salient objects, such as animals or people, were manually filtered out, leaving 867 backgrounds.
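A composition pipeline of this kind could look roughly like the Pillow sketch below; the file paths, default position, scale, and angle are placeholders, not the authors' actual implementation.

```python
from PIL import Image

def compose(background_path, object_path, mask_path,
            center=(320, 240), scale=0.5, angle=30):
    bg = Image.open(background_path).convert("RGB")
    obj = Image.open(object_path).convert("RGB")
    mask = Image.open(mask_path).convert("L")  # segmentation mask of the object

    # Resize the object (and its mask) to a fraction of the background width.
    w = int(bg.width * scale)
    h = int(obj.height * w / obj.width)
    obj, mask = obj.resize((w, h)), mask.resize((w, h))

    # Rotate both, then paste the object so that it is centered at `center`.
    obj = obj.rotate(angle, expand=True)
    mask = mask.rotate(angle, expand=True)
    top_left = (center[0] - obj.width // 2, center[1] - obj.height // 2)
    bg.paste(obj, top_left, mask)
    return bg
```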
A SYSTEMATIC STUDY ON THE EFFECT OF SCALE ON COMMON FACTORS OF VARIATION
Pre-training large CNNs on large-scale datasets makes the models largely invariant to object rotation and size. The goal is to build models that are resilient to a wide range of input variations, such as object location, size, and rotation, and the synthetic dataset allows exactly this kind of fine-grained evaluation. Each analysis focuses on a single factor of variation (such as the position of the object's center) and examines its effect over a uniform grid. As the pre-training set grows, model performance becomes increasingly invariant to the location, size, and rotation of objects.
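In the same spirit, a single-factor sweep can be sketched as follows: vary one factor (here, the rotation angle) over a uniform grid while holding the others fixed, and record accuracy at each grid point. The `model`, the `compose` helper from the earlier sketch, and the list of labeled samples are all assumed to exist; this is not the authors' evaluation code.

```python
import numpy as np

def sweep_rotation(model, samples, angles=np.linspace(0, 360, 19)):
    """Accuracy of `model` as the object's rotation angle varies on a grid.

    `samples` is an iterable of (background_path, object_path, mask_path, label).
    """
    results = []
    for angle in angles:
        correct = 0
        for background_path, object_path, mask_path, label in samples:
            image = compose(background_path, object_path, mask_path, angle=angle)
            correct += int(model.predict(image) == label)
        results.append((angle, correct / len(samples)))
    return results
```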
The OOD generalization and transferability of image classifiers were examined, and the study demonstrates that scaling the model and the data, combined with a simple training recipe, yields substantial improvements. However, a considerable performance gap remains when models are tested on OOD data, and scaling is unlikely to be the sole way to close it. Moreover, this approach depends on the availability of curated datasets and substantial computational resources, which is not always feasible.
Therefore, the authors conclude that transfer learning, also known as "train once, apply many times," is the most promising paradigm for OOD robustness in the near future. A limitation of this study is that it considers image classification models tuned to the ImageNet label space and built with the aim of optimizing accuracy on the ImageNet test set. Existing research indicates no overfitting to ImageNet, but it is possible that these models share correlated failure modes on datasets that reflect ImageNet's biases.
Please find a link to the full paper here: https://www.arxiv-vanity.com/papers/2007.08558/