In the past year I have worked with a few projects that use vision datasets. People might give them fancy names, but in simple terms, they are just datasets of images and videos annotated with text descriptions. For the purpose of training large models, projects almost always start with open source datasets. They are the most accessible, and more likely than not, they make up a large share of the full training dataset. There are two common additional sources: commercial (e.g. iStock, Shutterstock, Leixa, Alamy, ArtStation, Adobe Stock) and internal datasets (e.g. JFT-300M, JFT-3B, IG-1B). I am compiling most of the well known open source datasets into a few tables to showcase how vision datasets have evolved over time.

Image Datasets before ImageNet

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| FERET | 2000 | 14k | curated | NIST | faces |
| Caltech-101 | 2003 | 9k | curated | Caltech | objects |
| Yale Face B+ | 2005 | 16k | curated | Yale | faces |
| Caltech-256 | 2007 | 30k | curated | Caltech | objects |
| Oxford-5k | 2007 | 5k | curated | Oxford | buildings |
| LFW | 2007 | 12k | web | UMass | Labeled Faces in the Wild |
| PASCAL VOC | 2007 | 30k | curated | Microsoft | objects |
| MIRFLICKR-25K | 2008 | 25k | Flickr | Leiden | objects, concepts |
| Tiny Images | 2008 | 80m | web | MIT | classification |
| NUS-WIDE | 2009 | 270k | Flickr | NUS | classification |
| SUN | 2009 | 130k | web | MIT | scenes, places, and objects |
| ImageNet | 2009 | 14m | web | Princeton | classification |

Table 1: Early Image Datasets

ImageNet kick-started the revolution of training large, deep neural network models on large amounts of images. Image datasets existed before ImageNet, but they were much smaller in scale, tended to cover a narrow, specific purpose, and all relied on human curation and annotation. ImageNet’s key differentiator was its size: it was considerably larger than previous datasets, yet its labels were still annotated by humans, maintaining accuracy. TinyImages is a curious case because it also had scale, but it did not have nearly the impact that ImageNet had. The difference was that ImageNet retained each image’s full resolution, used human annotation, and hosted a very popular benchmark competition.

Image Datasets

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| TinyImages | 2008 | 80m | web | MIT | |
| ImageNet | 2009 | 14m | web | Princeton | |
| SBU Captions | 2011 | 1m | Flickr | Stony Brook | |
| Flickr30k | 2014 | 31k | Flickr | Urbana-Champaign | |
| YFCC100M | 2015 | 100m | Flickr | Yahoo | |
| Open Images | 2016 | 9m | web | Google | |
| JFT-300M | 2017 | 300m | internal | Google | not public |
| CC3M | 2018 | 3m | web | Google | Conceptual Captions |
| IG-1B | 2019 | 1b | Instagram | Meta | not public |
| LAION-400M | 2021 | 400m | Common Crawl | LAION | |
| CC12M | 2021 | 12m | web | Google | |
| RedCaps | 2021 | 12m | Reddit | U Michigan | |
| WIT | 2021 | 11m | Wikipedia | Google | |
| WebImageText | 2021 | 400m | web | OpenAI | not released (CLIP) |
| ALIGN | 2021 | 1.8b | web | Google | not released (ALIGN) |
| JFT-3B | 2021 | 3b | internal | Google | not public |
| LAION-5B | 2022 | 5b | Common Crawl | LAION | |
| COYO-700M | 2022 | 700m | Common Crawl | Kakao Brain | |
| CommonPool | 2023 | 12.8b | Common Crawl | LAION | DataComp |

Table 2: Web Datasets

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| CIFAR-10 | 2009 | 60k | TinyImages | CIFAR | classification |
| ImageNet | 2009 | 14m | web | Princeton | classification |
| MS-COCO | 2014 | 300k | web | Microsoft | segmentation; context; multiple captions |
| Visual Genome | 2015 | 100k | MS-COCO, YFCC100M | Stanford | relationships; bounding boxes |
| Open Images | 2016 | 9m | web | Google | relationships; segmentation; labels |
| Hierarchical Paragraphs | 2017 | 15k | Visual Genome | Stanford | "Hierarchical Approach for Generating Descriptive Image Paragraphs"; dense paragraphs |
| Localized Narratives | 2020 | 900k | MS-COCO, Open Images | Google | voice and mouse movement |
| Hateful Memes | 2020 | 10k | Getty | Meta | |
| Crossmodal-3600 | 2022 | 3.6k | curated | Google | 36 languages |
| Segment Anything | 2023 | 11m | licensed | Meta | segmentation |
| DCI | 2023 | 7.8k | Segment Anything | Meta | Densely Captioned Images (DCI); multiple rounds of human annotation |
| Vision2UI | 2024 | 3m | web | Peking University | UI images |
| DOCCI | 2024 | 15k | curated | Google | highly detailed descriptions; donated by one person |
| IIW | 2024 | 9k | curated | DeepMind | ImageInWords; highly detailed; sequentially refined |

Table 3: Curated Datasets

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| CLEVR | 2017 | 100k | generated images | Stanford, Meta | visual question answering |
| LAION-COCO | 2022 | 600m | LAION-2B | LAION | recaptioning |
| LAION-Translate | 2022 | 3b | LAION-5B | LAION | machine translation |
| COYO-Labeled-300M | 2022 | 300m | COYO | Kakao Brain | machine labels; ImageNet labels |
| PixelProse | 2023 | 16m | LAION, CC12M, RedCaps | U Maryland | dense captions; using Google Gemini |
| Pick-a-Pic | 2023 | 500k | generated images | Stability AI | user prompts; generated images |
| DAC | 2023 | 3m | CC3M | IBM | Dense and Aligned Captions; LLM expansion; segmented parts |
| Recap-DataComp-1B | 2024 | 1.3b | CommonPool | UC Santa Cruz | recaptioned with LLaVA |

Table 4: Example Synthetic Datasets

Image datasets have scaled from the tens of millions of images in ImageNet to tens of billions in recent years. Model and training dataset sizes have grown exponentially with transformer-based deep learning models. Companies such as Facebook and Google first experimented with internal datasets in the hundreds-of-millions to billions range (e.g. IG-1B and JFT-300M). It took a few years for the open source community to catch up. LAION adopted similar automatic scraping and filtering techniques to curate LAION-400M, and later LAION-2B and CommonPool. These images became the backbone data source for image generators and multimodal models.
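
As a rough illustration of that filtering step, here is a minimal sketch of CLIP-score filtering in the spirit of LAION-400M, assuming the off-the-shelf openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers. The 0.3 cosine-similarity cutoff follows the LAION-400M release notes; this is not the exact pipeline LAION ran, which batches the same check over millions of scraped URLs.

```python
# Minimal CLIP-score filter: keep an image-text pair only if the CLIP cosine
# similarity clears a threshold (~0.3 in the LAION-400M write-up).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    return clip_similarity(image, caption) >= threshold
```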

The latest dataset is the 12.8B-sample CommonPool. We are probably reaching the limit of what can be scraped from the Common Crawl archive. It is possible that we get to 100 billion, but in my opinion that is the upper bound of the public domain. There are other data sources that could break this scaling limit. Social media platforms such as Instagram hold hundreds of billions of posts. The total image count in personal photo archives (e.g. Google Photos, Apple Photos) should be well into the trillions, and they contain much higher quality photos. DOCCI, for example, shows that a single researcher can curate more than 10,000 high quality images in a few years. IoT devices could be a rich source of image data as well.

The quality and diversity of the images in these datasets resemble the content of LAION-400M. This is evidenced by how similar generated images look and feel across the latest models and products. The contents of datasets such as YFCC100M, LAION-2B, CommonPool, and COYO are remarkably similar, and one could extract similar subsets from each of them. For example, if one were to filter for humans, aesthetics, size, aspect ratio, and text-image alignment, the resulting subsets would contain very similar images. I suspect that is just the nature of images in the public digital domain; it is a reflection of the evolution of digital media. More likely than not, the Google and Facebook private datasets look similar to the open source, web-scraped versions.
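
A sketch of what such a subset extraction might look like, assuming a metadata table with precomputed per-image columns. The column names and thresholds here are hypothetical, though LAION and COYO releases ship comparable per-sample metadata in parquet shards.

```python
# Hypothetical subset extraction over precomputed per-image metadata.
# Column names (width, height, aesthetic_score, clip_score, has_face) and
# thresholds are illustrative assumptions, not any dataset's actual schema.
import pandas as pd

def filter_subset(meta: pd.DataFrame) -> pd.DataFrame:
    aspect = meta["width"] / meta["height"]
    keep = (
        (meta["width"] >= 512) & (meta["height"] >= 512)  # minimum size
        & aspect.between(0.75, 1.33)                      # near-square aspect ratio
        & (meta["aesthetic_score"] >= 5.0)                # aesthetic predictor output
        & (meta["clip_score"] >= 0.3)                     # text-image alignment
        & meta["has_face"]                                # contains humans
    )
    return meta[keep]

# Usage (hypothetical file name):
# subset = filter_subset(pd.read_parquet("metadata-shard-0000.parquet"))
```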

The text descriptions are still under active development, and there are a few varieties. The first generation of text was merely for classification: object and concept labels, as in TinyImages, ImageNet, and SUN. The second variety is descriptions scraped from the alt attribute of the HTML img tag; these tend to be short, noisy, and can contain irrelevant text. The third variety is typified by DCI, IIW, and DOCCI. In the last couple of years it has been all about vision-language models, which means the text should be general and contain as much detail as possible. The descriptions in DCI, IIW, and DOCCI are human annotated through multiple rounds; they are impossibly detailed, purposely and painstakingly written, and time consuming to create. These datasets are rare and small, in the 10k range [1]. The fourth type is recaptioning. This is a big trend: most synthetic data focuses on recaptioning, and many image, video, and multimodal training datasets choose recaptioned descriptions over the raw text [2].
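
A minimal recaptioning sketch, assuming the public Salesforce/blip-image-captioning-base checkpoint as a stand-in. Real pipelines (LAION-COCO with BLIP plus CLIP re-ranking, Recap-DataComp-1B with LLaVA, PixelProse with Gemini) follow the same loop with much larger models.

```python
# Minimal recaptioning sketch: replace noisy alt-text with a model-generated
# caption. BLIP here is a small stand-in for the larger captioners used in
# production recaptioning pipelines.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def recaption(image_path: str, max_new_tokens: int = 50) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)

# Usage: each (url, alt_text) pair becomes (url, alt_text, recaption(local_path)),
# and training can mix or prefer the synthetic caption over the raw alt-text.
```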

Video Datasets

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| UCF-101 | 2012 | 13k clips | YouTube | U of Central Florida | classification |
| ActivityNet | 2015 | 30k videos | YouTube | KAUST | classification |
| Kinetics-600 | 2018 | 500k clips | YouTube | DeepMind | classification |
| SSv2 | 2018 | 220k clips | crowdsourced | Qualcomm | action recognition; video understanding |
| HowTo100M | 2019 | 1.2m videos, 136m clips | YouTube | DeepMind | instructional videos |
| MovieNet | 2020 | 1,000 movies | movies | CUHK | |
| WebVid-10M | 2021 | 10m clips | Shutterstock | Oxford | low resolution; poor visual quality; watermarks |
| MERLOT | 2021 | 6m videos, 180m clips | YouTube | Allen Institute for AI | aims to be generic, diverse; YT-Temporal-180M |
| CelebV-HQ | 2022 | 15k videos | YouTube | Nanyang Technological University | focus on humans |
| HD-VILA-100M | 2022 | 3.3m videos, 100m clips | YouTube | Microsoft | aims to be diverse |
| VidChapters-7M | 2023 | 817k videos, 7m chapters | YouTube | Meta | |
| HD-VG-130M | 2024 | 1.5m videos, 130m clips | YouTube | Microsoft | |
| Panda-70M | 2024 | 3.3m videos, 70m clips | HD-VILA-100M | Snap | recaptioned |
| MERLOT Reserve | 2024 | 20m videos | YouTube | Allen Institute for AI | YT-Temporal-1B |

Table 5: Video Datasets

Almost all of the open source video datasets use YouTube as the data source; all the significant datasets are of this variety. Their creation process is simple: scrape YouTube IDs based on some criteria. For example, CelebV-HQ focuses on celebrities, HowTo100M focuses on how-to videos, and HD-VILA and MERLOT attempt to cover diverse topics. The videos are then cut into short clips of 5 to 20 seconds, usually on scene-change boundaries. Different datasets might apply additional filtering, metadata processing, and recaptioning. YouTube is the singularly important data source for video datasets. If Google were to pursue serious lawsuits against major AI model developers and open source researchers in the near future, it could impact many research projects and video AI products. There are alternative sources of video, such as movies, broadcast studio archives, short-video platforms, and alternative video platforms. Still, YouTube is king.
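
A sketch of that scrape-and-cut loop, using yt-dlp, PySceneDetect, and ffmpeg as stand-ins for whatever tooling each dataset team actually used; the 5-to-20-second bounds mirror the range mentioned above.

```python
# Sketch of the download / scene-cut pipeline described above: fetch a YouTube
# video by id, detect scene changes, and keep clips between 5 and 20 seconds.
import subprocess
from scenedetect import detect, ContentDetector
from yt_dlp import YoutubeDL

def download(video_id: str) -> str:
    path = f"{video_id}.mp4"
    with YoutubeDL({"format": "mp4", "outtmpl": path}) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={video_id}"])
    return path

def cut_clips(video_path: str, min_len: float = 5.0, max_len: float = 20.0) -> list[str]:
    clips = []
    for i, (start, end) in enumerate(detect(video_path, ContentDetector())):
        duration = end.get_seconds() - start.get_seconds()
        if not (min_len <= duration <= max_len):
            continue  # drop scenes outside the target clip length
        out = f"{video_path.removesuffix('.mp4')}_clip{i:04d}.mp4"
        # Stream copy for speed; a real pipeline might re-encode for frame-accurate cuts.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start.get_seconds()), "-t", str(duration),
             "-i", video_path, "-c", "copy", out],
            check=True,
        )
        clips.append(out)
    return clips

# Usage: clips = cut_clips(download("<youtube_id>"))
```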

In theory, video datasets should be orders of magnitude larger than image datasets. In practice, open source video datasets are comparable to image datasets, or slightly smaller. Each YouTube video is roughly 30 MB, and the largest video dataset (MERLOT Reserve) is about 500 TB, which is comparable to LAION-2B. This is partly due to the compute resources those dataset creation teams have access to. Large industrial labs such as OpenAI are certainly working to create datasets an order of magnitude larger than these. It also indicates that video datasets are still in their early stages. Unlike image datasets, I expect video datasets to dramatically increase in scale in the coming years, if not the coming months. Video models should still have a lot of room to grow. It is exciting to see what kind of video editing will be enabled by the next generation of video models in the next few years.
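
A quick back-of-envelope check of those numbers, with the per-video size as a rough assumption:

```python
# Back-of-envelope storage estimate for the figures above (assumed averages).
videos = 20_000_000        # MERLOT Reserve source videos (see Table 5)
mb_per_video = 30          # rough average size of a downloaded YouTube video
total_tb = videos * mb_per_video / 1_000_000
print(f"~{total_tb:.0f} TB of raw video")   # ~600 TB, the same ballpark as the ~500 TB cited above
```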


Footnotes

  1. It would be interesting to see a team devote substantial human resources to creating a 500k dataset with such detailed descriptions and use it to train a captioning model. That is one experimental design no one has attempted yet to push the limits of vision-language models.
  2. Model interdependence is a fascinating phenomenon that is not widely discussed yet. Most generated outputs look remarkably similar. The models' architectures are similar, their training data are similar, and they use a lot of the same underlying models to process data or create synthetic data; recaptioning is the most common example. From an information-theoretic standpoint, the models all contain similar knowledge, so it is not surprising that their results are similar. It is tempting to think they will invariably converge as models get larger and more capable, unless they start from different information.

