ImageNet database

Quoting from ILSVRC paper

ImageNet populates 21,841 synsets of WordNet with an average of 650 manually verified and full resolution images. As a result, ImageNet contains 14,197,122 annotated images organized by the semantic hierarchy of WordNet (as of August 2014).

ImageNet-1k dataset

The standard 50,000 image ImageNet-1k dataset.

The ILSVRC (ImageNet Large Scale Visual Recognition Challenge) ran from 2010-2017. This challenge provided the teams with a subset of ImageNet database, called ILSVRC-2012 or ImageNet-1k or ImageNet (I think ILSVRC-2012 is the correct name, but people also refer to this dataset by the later two names).

Dataset creation process

  1. Selecting categories:- The 1000 categories were manually (based on heuristics related to WordNet hierarchy). Also, to include fine-grained classification in the dataset the authors included 120 categories of dog breeds (this is why ImageNet models generally dream about dogs).

  2. Selecting candidate images:- Taken directly from ImageNet database. They basically did search queries for each category (synset) on several image search engines. The queries were also translated to Chinese, Spanish, Dutch and Italian to increase the diversity of the images.

Important: Step 2 introduces the problem of inaccurate annotations because we don’t know whether the search engines are correct or not.
  1. Annotating images:- Amazon Mechanical Turk (AMT) was used to label the images. Each user was a given a set of candidate images and the definition of the target category (synset). The users were then asked to verify if the image contained the category. There was also a quality control system setup which you can read in the paper.

Problems with ImageNet

Are we done with ImageNet? goes into the details. But the main problem is the classification task. There are a lot of images in the dataset which have multiple classes or classes with multiple meanings. This is shown in the figure below taken from the above paper.

From the figure, we can see that multi-label classification would be a better option to train on the dataset. Deep learning models are generally considered to be robust to some of the noise, so maybe that is why we can still train using the classification and get the awesome results.

One thing that the above image shows is that blindly following the top-1 accuracy on validation dataset is not a good idea. As at that point a model is simply learning to overfit or learn which class to predict for the validation images.

We cannot do anything about the training dataset. As collecting new labels for the images would be a big project on its own, but we can try to test if the model really generalized or not by coming up with newer validation datasets. For some datasets we would prioritize robustness, as generalization also means that a model should be robust to unseen changes in the training dataset.

Another problem is we do not have access to ImageNet test data. This means people have to resort to validation results to infer which model works better (in terms of accuracy). The main problem here is the extensive hyperparameter tuning on the validation set.

ImageNetv2 Matched Frequency

An ImageNet test set of 10,000 images sampled from new images. Care was taken to replicate the original ImageNet curation/sampling process.

In the paper, the authors observed a drop of 11%-14% in accuracy for the models they tested. The main reason for this is extensive hyperparameter tuning on the validation set.

This paper solves this problem by collecting 10,000 new images (10 for each class) from Flickr. These images are much harder than the original ImageNet validation images. There are three versions of the dataset available which you can check on source link (the difference is in the method to select the 10 images for each class). MatchedFrequency dataset is used in the rwightman repo.


50,000 non photographic (or photos of such) images (sketches, doodles, mostly monochromatic) covering all 1000 ImageNet classes.

In this dataset we penalize the predictive power of the networks by discarding predictive signals such as color and texture that can be obtained from local receptive fields and rely instead on the global structure of the image.

This dataset basically consists of black and white sketches, doodles of the 1000 classes. This dataset focuses on model robustness by defining robustness to generalize to structure of the categories (i.e. low-frequency signal).

I don't think high accuracy on this dataset should be the primary goal. The reasoning being the hardware has gone pretty strong and using RGB images is not that expensive, so I don't see any point as to why we should penalize our models by taking out color and texture to check the robustness of the model (as we will never have that input during inference). Instead we should find ways to check robustness in the original RGB domain (check the next datasets for this).

ImageNet-A / ImageNet-O

7500 images covering 200 of the 1000 classes. The images are naturally occurring adversarial examples that confuse typical ImageNet classifiers. Typical ResNet-50 will score 0% top-1 accuracy on this dataset.

There are two datasets introduced in the paper with different purposes

  1. ImageNet-Adversarial (ImageNet-A): Contains 7500 images which are naturally adversarial (200 classes out of 1000 in ImageNet). Classifiers should be able to classify the images correctly.
  2. ImageNet-Out-of-Distribution-Detection (ImageNet-O): Contains 2000 images with classes that are not in ImageNet-1k dataset (out-of-distribution). Classifiers should output low-confidence predictions on the images.

How ImageNet-A is constructed?

First, a lot of images for the 200 classes of ImageNet were collected from the Internet. Then all the images correctly classified by ResNet-50 are removed from the dataset (reason for 0% top-1 acc using ResNet-50). Finally, a subset of high quality images are selected for the final dataset.

How ImageNet-O is constructed?

ImageNet database was used to get the images (excluding the 1000 classes in ImageNet-1k). Then ResNet-50 is used to select the images for which the model predicts high-confidence predictions.

ImageNet-A and ImageNet-O are good datasets to check the robustness of the models. ImageNet-A can be tested automatically. In case of ImageNet-O we have to come up with our own evaluation strategy. Manually looking at the predictions is possible. Plotting a histogram of model confidence predictions is also a possibility to get a sense of the confidence values.

I want to test a new thing for image classifier models, where we use sigmoid instead of softmax. The problem with softmax is that it forces one prediction to a large value, where as with sigmoid all the predictions are independent of each other. This can also help counter the problem of multiple classes in images of ImageNet dataset. I still have to explore this method, on how to train the model.

ImageNet-C / ImageNet-P

Evaluates performance on common corruptions and perturbations that may happen in real-life. 3.75 million images in ImageNet-C.

How ImageNet-C is constructed?

ImageNet-C consists of 15 common corruptions (Gaussian Noise, Shot Noise, Impulse Noise, Defocus Blur, Frosted Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic, Pixelate, JPEG) with 5 levels of severity, resulting in 75 distinct corruptions for each image. These corruptions are applied to all 50,000 validation images of the original ImageNet-1k dataset, resulting in 3.75 million images.

This assumes that you did not use any of these corruptions as data augmentation during the training phase.

How ImageNet-P is constructed?

Perturbations hear mean applying the same corruption successively on the previous applied image. This dataset measures classifier's prediction stability, reliability and consistency in case of minor change in input image.

For each image we generate 30 frames of perturbation (from 10 corruptions) 5 levels of severity resulting in total of 7.5 million images. Starting with a clean ImageNet image apply brighness corruption to it and then apply brightness corruption on the current image and keep doing it for 30 times.

In the paper, the authors also introduce a metric for ImageNet-P. Check the paper for its details.

The only problem with this dataset is size. The authors of the paper have also created a new dataset ImageNet-R which we can use instead.

ImageNet-"Real Labels"

New labels for the original ImageNet-1k intended to improve on mistakes in the original annotation process.

This dataset can be easily summarized with Figure 3.

This dataset provides new labels for the validation set of original ImageNet-1k dataset (50,000 images).

In the paper, the authors propose a new metric Real accuracy as we cannot use top-1 accuracy for this multi-label dataset. This metric measures the precision of the model's top-1 prediction, which is deemed correct if it is included in the set of labels, and incorrect otherwise.


Renditions of 200 ImageNet classes resulting in 30,000 images for testing robustness.

The dataset consists of artistic renditions (art, cartoons, graffiti, embroidery, graphics, origami, paintings, patterns, sculptures and more) of object classes. These kind of images are not included in the original ImageNet dataset (it consists of photos of real objects only).


This paper also introduces a new data augmentation technique called DeepAugment. Take any image-to-image network (like image autoencoder or a superresolution network) and pass the images through it. Now, distort the internal weights of the network (zeroing, negating, transposing, ...) and the images you get as output can be used for training.

I highly recommend reading this paper.

What to use?

For the training phase, I would still use the original ImageNet-1k validation dataset. The top-1 accuracy on this dataset does not matter much due to incorrect labels. Next I would use ImageNetv2 Matched Frequency dataset as a proxy of test set. In the end, when everything is done I would get the results of ImageNet Real labels dataset to get a real sense of the accuracy of the model.

To test the robustness of the model, I would work with ImageNet-Adversarial and ImageNet-Rendition datasets. With ImageNet-A we can get the accuracy of the model for really hard images. With ImageNet-R we can get results for out-of-distribution images. If the results on these datasets are satisfactory ImageNet-O can also be used for further testing.


  • rwightman/pytorch-image-models link
  • WordNet: a lexical database for English link
  • ImageNet: A large-scale hierarchical image database link
  • ImageNet Large Scale Visual Recognition Challenge link
  • Learning Robust Global Representations by Penalizing Local Predictive Power link
  • Natural Adversarial Examples link
  • Benchmarking Neural Network Robustness to Common Corruptions and Perturbations link
  • Are we done with ImageNet? link
  • The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization link