In the past year I have worked with a few projects that use vision datasets. People might give them fancy names, but in simple terms, they are just datasets of images and videos annotated with text descriptions. For the purpose of training large models, projects almost always start with open source datasets. They are the most accessible, and more likely than not, they make up a large share of the full training data. There are two other common sources: commercial datasets (e.g. iStock, Shutterstock, Leixa, Alamy, ArtStation, Adobe Stock) and internal datasets (e.g. JFT-300M, JFT-3B, IG-1B). I have compiled most of the well-known open source datasets into a few tables to show how vision datasets have evolved over time.
Image Datasets before ImageNet
Name | Released | Size | Source | Org | Notes |
---|---|---|---|---|---|
FERET | 2000 | 14k | curated | NIST | faces |
caltech-101 | 2003 | 9k | curated | Caltech | objects |
Yale Face B+ | 2005 | 16k | curated | Yale | faces |
Caltech-256 | 2007 | 30k | curated | Caltech | objects |
Oxford-5k | 2007 | 5k | curated | Oxford | buildings |
LFW | 2007 | 12k | web | U Mass. | Labeled Faces in the Wild |
PASCAL VOC | 2007 | 30k | curated | Microsoft | objects |
MIRFLICKR-25K | 2008 | 25k | flickr | Leiden | objects, concepts |
TinyImages | 2008 | 80m | web | MIT | classification |
NUS-WIDE | 2009 | 270k | flickr | NUS | classification |
SUN | 2009 | 130k | web | MIT | environmental scenes, places, and objects |
ImageNet | 2009 | 14m | web | Princeton | classification |
Table 1: Early Image Datasets
ImageNet kick-started the revolution of training large, deep neural network models on large amounts of images. Image datasets existed before ImageNet, but they were much smaller in scale, they tended to cover a narrow, specific purpose, and they all involved human curation and annotation. ImageNet's key differentiator was its size: it was considerably larger than previous datasets, while its labels were still annotated by humans, which kept them accurate. TinyImages is a curious case because it also had scale, yet it had nowhere near the impact that ImageNet had. The difference was that ImageNet retained each image's full resolution, used human annotation, and hosted a very popular benchmark competition.
Image Datasets
Name | Released | Size | Source | Org | Notes |
---|---|---|---|---|---|
TinyImages | 2008 | 80m | web | MIT | |
ImageNet | 2009 | 14m | web | Princeton | |
SBU Captions | 2011 | 1m | flickr | Stony Brook | |
Flickr30k | 2014 | 31k | flickr | Urbana-Champaign | |
yfcc100m | 2015 | 100m | flickr | Yahoo | |
open images | 2016 | 9m | web | Google | |
JFT300m | 2017 | 300m | internal | Google | not public |
cc3m | 2018 | 3m | web | Google | Conceptual Captions |
IG-1B | 2019 | 1b | Instagram | Meta | not public |
LAION 400m | 2021 | 400m | common crawl | LAION | |
cc12m | 2021 | 12m | web | Google | |
redcap | 2021 | 12m | reddit | U Michigan | |
WIT | 2021 | 11m | wikipedia | Google | |
WebImageText | 2021 | 400m | web | OpenAI | not released; used for CLIP |
ALIGN | 2021 | 1.8b | web | Google | not released; used for ALIGN |
JFT3B | 2021 | 3b | internal | Google | not public |
LAION 5B | 2022 | 5b | common crawl | LAION | |
Coyo 700m | 2022 | 700m | common crawl | Kakao Brain | |
CommonPool | 2023 | 12.8b | common crawl | LAION | DataComp |
Table 2: Web Datasets
Name | Released | Size | Source | Org | Notes |
---|---|---|---|---|---|
CIFAR-10 | 2009 | 60k | TinyImages | CIFAR | classification |
ImageNet | 2009 | 14m | web | Princeton | classification |
MS-COCO | 2014 | 300k | web | Microsoft | segmentation; context; multiple captions |
Visual Genome | 2015 | 100k | ms-coco, yfcc100m | Stanford | relationships; bounding boxes |
Open Images | 2016 | 9m | web | Google | relationships; segmentation; labels |
Hierarchical Paragraphs | 2017 | 15k | visual genome | Stanford | Hierarchical Approach for Generating Descriptive Image Paragraphs; dense paragraphs |
Localized Narratives | 2020 | 900k | ms-coco, open images | Google | voice and mouse movement |
Hateful Memes | 2020 | 10k | Getty | Meta | |
Crossmodal-3600 | 2022 | 3.6k | curated | Google | 36 languages |
Segment Anything | 2023 | 11m | licensed | Meta | segmentation |
DCI | 2023 | 7.8k | segment anything | Meta | Densely Captioned Images (DCI); multiple rounds of human annotations |
vision2ui | 2024 | 3m | web | Peking | ui images |
DOCCI | 2024 | 15k | curated | Google | highly detailed descriptions; donated by one person |
IIW | 2024 | 9k | curated | Deepmind | ImageInWords; highly detailed; sequentially refined |
Table 3: Curated Datasets
Name | Released | Size | Source | Org | Notes |
---|---|---|---|---|---|
clevr | 2017 | 100k | generated images | Stanford, Meta | visual question answering |
LAION-coco | 2022 | 600m | LAION 2b | LAION | recaptioning |
LAION-translate | 2022 | 3b | LAION 5b | LAION | machine translation |
COYO-Labeled-300M | 2022 | 300m | coyo | Kakao Brain | machine labels; ImageNet labels |
Pixel-Prose | 2023 | 16m | LAION, cc12m, redcap | U Maryland | dense captions; using Google Gemini |
Pick-a-Pic | 2023 | 500k | generated images | Stability AI | user prompts; generated images |
DAC | 2023 | 3m | cc3m | IBM | Dense and Aligned Captions; LLM expansion; segmented parts |
Recap-Datacomp-1B | 2024 | 1.3b | common-pool | UC Santa Cruz | recaptioned with LLaVA |
Table 4: Example Synthetic Datasets
Image datasets have scaled from the tens of millions with ImageNet to the tens of billions in recent years. Model and training dataset sizes have grown exponentially, driven by transformer-based deep learning models. Companies such as Facebook and Google first experimented with internal datasets in the hundreds of millions to billions range (e.g. IG-1B and JFT-300M). It took a few years for the open source community to catch up. LAION adopted similar automatic scraping and filtering techniques to curate LAION-400M, and later LAION-2B and CommonPool. These images became the backbone data sources for all image generators and multimodal models.
The latest dataset is the 12.8B-sample CommonPool. We are probably reaching the limit of scraping the Common Crawl archive. It is possible that we get to 100 billion, but in my opinion, that is the upper bound of the public domain. There are other data sources that could break this scaling limit. Social media platforms such as Instagram hold hundreds of billions of posts. The total image count in personal photo archives (e.g. Google Photos, Apple Photos) should be well into the trillions, and they contain much higher quality photos. DOCCI, for example, was built from more than 10,000 high-quality images that a single researcher curated over a few years. IoT devices could be a rich source of image data as well.
The quality and the diversity of the images in these datasets resemble the content of LAION-400M. This is evidenced by how similar generated images look and feel across the latest models and products. The contents of datasets such as YFCC100M, LAION-2B, CommonPool, and COYO are remarkably similar, and one could extract similar subsets from each of them. For example, if one were to filter for humans, aesthetics, size, aspect ratio, and text-image alignment, the filtered subsets would contain very similar images. I suspect that is just the nature of images in the public digital domain; it is a reflection of the evolution of digital media. More likely than not, the Google and Facebook private datasets look similar to the open source, web-scraped versions.
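As a rough illustration of that kind of filtering, here is a minimal sketch in Python. The metadata fields, thresholds, and sample records are assumptions made up for illustration; they are not the exact criteria used by LAION, DataComp, or any other published pipeline.

```python
# Minimal sketch of metadata-based image filtering.
# Field names and thresholds are illustrative assumptions, not the
# exact criteria used by any published dataset.

def keep(record: dict) -> bool:
    """Return True if an image-text record passes all filters."""
    width, height = record["width"], record["height"]
    if min(width, height) < 256:                        # minimum resolution
        return False
    if max(width, height) / min(width, height) > 2.0:   # drop extreme aspect ratios
        return False
    if record["clip_similarity"] < 0.28:                # text-image alignment score
        return False
    if record["aesthetic_score"] < 5.0:                 # learned aesthetic score
        return False
    if not record["contains_person"]:                   # keep only images with people
        return False
    return True

records = [
    {"width": 512, "height": 512, "clip_similarity": 0.31,
     "aesthetic_score": 6.2, "contains_person": True},
    {"width": 120, "height": 900, "clip_similarity": 0.15,
     "aesthetic_score": 3.0, "contains_person": False},
]
print([keep(r) for r in records])  # [True, False]
```

In real pipelines the alignment score typically comes from a CLIP model and the aesthetic score from a learned aesthetic predictor, both run over billions of candidate images.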
The text descriptions are still under active development. We have a few varieties. The first generation of text was merely for classification: objects and concepts. Examples are TinyImages, ImageNet, and SUN. The second variety is descriptions scraped from the `alt` attribute of the HTML `img` tag. These descriptions tend to be short, noisy, and can contain irrelevant text. The third variety is typified by DCI, IIW, and DOCCI. In the last couple of years, the focus has shifted to vision-language models, which want text that is general and as detailed as possible. The descriptions in DCI, IIW, and DOCCI are human annotated through multiple rounds; they are purposely and painstakingly written to include an enormous amount of detail, and they are time consuming to create. These datasets are rare and small, in the 10k range[^1]. The fourth variety is recaptioning. This is a big trend: most synthetic data focuses on recaptioning, and many image, video, and multimodal training datasets choose recaptioned descriptions over the raw text[^2].
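To make the second variety concrete, here is a minimal sketch of harvesting (image URL, alt text) pairs from a web page using Python's standard-library HTML parser. This is only an illustration of the idea; web-scale pipelines such as the ones behind the Common Crawl-derived datasets add URL deduplication, language detection, and heavy filtering on top.

```python
# Minimal sketch: extract (image URL, alt text) pairs from an HTML page.
# Real web-scale pipelines apply far more cleaning and deduplication.
from html.parser import HTMLParser

class AltTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []  # list of (src, alt) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src = attrs.get("src")
        alt = (attrs.get("alt") or "").strip()
        if src and alt:                     # keep only images with non-empty alt text
            self.pairs.append((src, alt))

page = '<p><img src="cat.jpg" alt="a cat on a sofa"><img src="spacer.gif" alt=""></p>'
parser = AltTextExtractor()
parser.feed(page)
print(parser.pairs)  # [('cat.jpg', 'a cat on a sofa')]
```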
Video Datasets
Name | Released | Size | Source | Org | Notes |
---|---|---|---|---|---|
ucf-101 | 2012 | 13k clips | youtube | U of Central Florida | classification |
activity-net | 2015 | 30k videos | youtube | KAUST | classification |
kinetics 600 | 2018 | 500k clips | youtube | Deepmind | classification |
ssv2 | 2018 | 220k clips | crowd sourced | Qualcomm | action recognition; video understanding |
how-to-100m | 2019 | 1.2m videos, 136m clips | youtube | Deepmind | instruction videos |
movienet | 2020 | 1000 movies | movies | CUHK | |
WebVid-10M | 2021 | 10m clips | Shutterstock | Oxford | low resolution; poor visuals; watermarks |
merlot | 2021 | 6m videos, 180m clips | youtube | Allen Institute for AI | aims to be generic and diverse; YT-Temporal-180M |
celebv-hq | 2022 | 15k videos | youtube | Nanyang Technological University | focus on humans |
hd-vila-100m | 2022 | 3.3m videos, 100m clips | youtube | Microsoft | aims to be diverse |
VidChapters-7M | 2023 | 817k videos, 7m chapters | youtube | Meta | |
HD-VG-130M | 2024 | 1.5m videos, 130m clips | youtube | Microsoft | |
panda 70m | 2024 | 3.3m videos, 70m clips | HD-VILA-100M | Snap | recaptioned |
merlot reserve | 2024 | 20m videos | youtube | Allen Institute for AI | YT-Temporal-1B |
Table 5: Video Datasets
Almost all open source video datasets use YouTube as the data source; all the significant datasets are of this variety. Their creation process is simple: scrape YouTube IDs based on some criteria. For example, celebv-hq focuses on celebrities, how-to-100m focuses on how-to videos, and the HD-VILA and Merlot datasets attempt to cover diverse topics. The videos are then cut into short clips of 5 to 20 seconds, usually on scene-change boundaries, and different datasets might apply additional filtering, metadata processing, and recaptioning. YouTube is the singularly important data source for video datasets. If Google were to pursue serious lawsuits against major AI model developers and open source researchers in the near future, it could impact many research projects and video AI products. There are alternative sources of video, such as movies, broadcast studio archives, short video platforms, and alternative video platforms. Still, YouTube is king.
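For illustration, here is a minimal sketch of the clip-extraction step, assuming the PySceneDetect library for scene-boundary detection and an ffmpeg binary for cutting; the actual tools, thresholds, and clip lengths vary between datasets.

```python
# Minimal sketch: split a downloaded video into 5-20 second clips at scene boundaries.
# Assumes `pip install scenedetect[opencv]` and an ffmpeg binary on PATH.
import subprocess
from scenedetect import detect, ContentDetector

VIDEO = "video.mp4"

# Detect scene boundaries based on content changes between frames.
scenes = detect(VIDEO, ContentDetector())

for i, (start, end) in enumerate(scenes):
    duration = end.get_seconds() - start.get_seconds()
    if not 5.0 <= duration <= 20.0:    # keep only clips in the 5-20 s range
        continue
    out = f"clip_{i:04d}.mp4"
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start.get_seconds()),
        "-t", str(duration),
        "-i", VIDEO,
        "-c", "copy",                  # stream copy; re-encode for frame-accurate cuts
        out,
    ], check=True)
```

Stream copy keeps the cut fast but snaps to keyframes; datasets that need frame-accurate clips re-encode instead.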
In theory, video datasets should be orders of magnitude larger than image datasets. In practice, open source video datasets are comparable to image datasets, or slightly smaller. Each YouTube video is roughly 30MB, and the largest video dataset (Merlot Reserve) is about 500TB, which is comparable to LAION-2B. This is partly due to the compute resources the dataset creation teams have access to. Large industrial labs such as OpenAI are certainly working to create datasets an order of magnitude larger than these. It also indicates that video datasets are still in their early stages. Unlike image datasets, I expect video datasets to increase their scale dramatically in the coming years, if not the coming months. Video models should still have a lot of room to grow. It is exciting to see what kind of video editing will be enabled by the next generation of video models in the next few years.
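A quick back-of-envelope check of that comparison, where the per-item sizes and counts are rough assumptions rather than measured numbers:

```python
# Back-of-envelope storage estimates; per-item sizes are rough assumptions.
videos = 20e6            # ~20m videos in YT-Temporal-1B (Merlot Reserve)
mb_per_video = 30        # ~30 MB per YouTube video, as assumed in the text
video_tb = videos * mb_per_video / 1e6
print(f"video corpus: ~{video_tb:.0f} TB")   # ~600 TB

images = 2e9             # ~2b images in LAION-2B
mb_per_image = 0.2       # assumed average compressed image size
image_tb = images * mb_per_image / 1e6
print(f"image corpus: ~{image_tb:.0f} TB")   # ~400 TB
```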
Footnotes
[^1]: It would be interesting to see a team devote a lot of human resources to creating a 500k dataset with such detailed descriptions and use it to train a captioning model. That is one experimental design no one has attempted, and it could push the limits of vision-language models.
[^2]: Model interdependence is a fascinating phenomenon that is not widely discussed yet. Most generated outputs look remarkably similar. The architectures are similar, the training data are similar, and the models use a lot of the same underlying models to process data or create synthetic data; recaptioning is the most common example. From an information-theoretic standpoint, the models all contain similar knowledge, so it is not surprising their results are similar. It is tempting to conclude that they will invariably converge as models get larger and more capable, unless they start from different information.