In the past year I have worked with a few projects that use vision datasets. People might give them fancy names, but in simple terms, they are just datasets of images and videos annotated with text descriptions. For the purpose of training large models, projects almost always start with open source datasets. They are the most accessible, and more likely than not, they make up a large share of the full training dataset. There are two common additional sources: commercial (e.g. iStock, Shutterstock, Leixa, Alamy, ArtStation, Adobe Stock) and internal datasets (e.g. JFT-300M, JFT-3B, IG-1B). I am compiling most of the well known open source datasets into a few tables to showcase how vision datasets have evolved over time.

Image Datasets before ImageNet

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| FERET | 2000 | 14k | curated | NIST | faces |
| Caltech-101 | 2003 | 9k | curated | Caltech | objects |
| Yale Face B+ | 2005 | 16k | curated | Yale | faces |
| Caltech-256 | 2007 | 30k | curated | Caltech | objects |
| Oxford-5k | 2007 | 5k | curated | Oxford | buildings |
| LFW | 2007 | 12k | web | UMass | Labeled Faces in the Wild |
| PASCAL VOC | 2007 | 30k | curated | Microsoft | objects |
| MIRFLICKR-25K | 2008 | 25k | Flickr | Leiden | objects, concepts |
| Tiny Images | 2008 | 80m | web | MIT | classification |
| NUS-WIDE | 2009 | 270k | Flickr | NUS | classification |
| SUN | 2009 | 130k | web | MIT | scenes, places, and objects |
| ImageNet | 2009 | 14m | web | Princeton | classification |

Table 1: Early Image Datasets

ImageNet kick-started the revolution of training large, deep neural network models on large amounts of images. Image datasets existed before ImageNet, but they were much smaller in scale, tended to cover a narrow, specific purpose, and all relied on human curation and annotation. ImageNet’s key differentiator was its size: it was considerably larger than previous datasets, yet its labels were still annotated by humans, maintaining accuracy. TinyImages is a curious case because it also had scale, but it did not have nearly the impact that ImageNet had. The difference was that ImageNet retained each image’s full resolution, used human annotation, and hosted a very popular benchmark competition.

Image Datasets

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| TinyImages | 2008 | 80m | web | MIT | |
| ImageNet | 2009 | 14m | web | Princeton | |
| SBU Captions | 2011 | 1m | Flickr | Stony Brook | |
| Flickr30k | 2014 | 31k | Flickr | Urbana-Champaign | |
| YFCC100M | 2015 | 100m | Flickr | Yahoo | |
| Open Images | 2016 | 9m | web | Google | |
| JFT-300M | 2017 | 300m | internal | Google | not public |
| CC3M | 2018 | 3m | web | Google | Conceptual Captions |
| IG-1B | 2019 | 1b | Instagram | Meta | not public |
| LAION-400M | 2021 | 400m | Common Crawl | LAION | |
| CC12M | 2021 | 12m | web | Google | |
| RedCaps | 2021 | 12m | Reddit | U Michigan | |
| WIT | 2021 | 11m | Wikipedia | Google | |
| WebImageText | 2021 | 400m | web | OpenAI | not released (CLIP) |
| ALIGN | 2021 | 1.8b | web | Google | not released (ALIGN) |
| JFT-3B | 2021 | 3b | internal | Google | not public |
| LAION-5B | 2022 | 5b | Common Crawl | LAION | |
| COYO-700M | 2022 | 700m | Common Crawl | Kakao Brain | |
| CommonPool | 2023 | 12.8b | Common Crawl | LAION | DataComp |

Table 2: Web Datasets

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| CIFAR-10 | 2009 | 60k | TinyImages | CIFAR | classification |
| ImageNet | 2009 | 14m | web | Princeton | classification |
| MS-COCO | 2014 | 300k | web | Microsoft | segmentation; context; multiple captions |
| Visual Genome | 2015 | 100k | MS-COCO, YFCC100M | Stanford | relationships; bounding boxes |
| Open Images | 2016 | 9m | web | Google | relationships; segmentation; labels |
| Hierarchical Paragraphs | 2017 | 15k | Visual Genome | Stanford | "Hierarchical Approach for Generating Descriptive Image Paragraphs"; dense paragraphs |
| Localized Narratives | 2020 | 900k | MS-COCO, Open Images | Google | voice and mouse movement |
| Hateful Memes | 2020 | 10k | Getty | Meta | |
| Crossmodal-3600 | 2022 | 3.6k | curated | Google | 36 languages |
| Segment Anything | 2023 | 11m | licensed | Meta | segmentation |
| DCI | 2023 | 7.8k | Segment Anything | Meta | Densely Captioned Images (DCI); multiple rounds of human annotation |
| Vision2UI | 2024 | 3m | web | Peking University | UI images |
| DOCCI | 2024 | 15k | curated | Google | highly detailed descriptions; donated by one person |
| IIW | 2024 | 9k | curated | DeepMind | ImageInWords; highly detailed; sequentially refined |

Table 3: Curated Datasets

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| CLEVR | 2017 | 100k | generated images | Stanford, Meta | visual question answering |
| LAION-COCO | 2022 | 600m | LAION-2B | LAION | recaptioning |
| LAION-Translate | 2022 | 3b | LAION-5B | LAION | machine translation |
| COYO-Labeled-300M | 2022 | 300m | COYO | Kakao Brain | machine labels; ImageNet labels |
| PixelProse | 2023 | 16m | LAION, CC12M, RedCaps | U Maryland | dense captions; using Google Gemini |
| Pick-a-Pic | 2023 | 500k | generated images | Stability AI | user prompts; generated images |
| DAC | 2023 | 3m | CC3M | IBM | Dense and Aligned Captions; LLM expansion; segmented parts |
| Recap-DataComp-1B | 2024 | 1.3b | CommonPool | UC Santa Cruz | recaptioned with LLaVA |

Table 4: Example Synthetic Datasets

Image datasets have scaled from the tens of millions of images in ImageNet to tens of billions in recent years. Model and training dataset sizes have grown exponentially with transformer-based deep learning models. Companies such as Facebook and Google first experimented with internal datasets in the hundreds-of-millions to billions range (e.g. IG-1B and JFT-300M). It took a few years for the open source community to catch up. LAION adopted similar automatic scraping and filtering techniques to curate LAION-400M, and later LAION-2B and CommonPool. These images became the backbone data source for image generators and multimodal models.
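
As a rough illustration of that filtering step, here is a minimal sketch of CLIP-score filtering in the spirit of LAION-400M, assuming the off-the-shelf openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers. The 0.3 cosine-similarity cutoff follows the LAION-400M release notes; this is not the exact pipeline LAION ran, which batches the same check over millions of scraped URLs.

```python
# Minimal CLIP-score filter: keep an image-text pair only if the CLIP cosine
# similarity clears a threshold (~0.3 in the LAION-400M write-up).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    return clip_similarity(image, caption) >= threshold
```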

The latest dataset is the 12.8B-sample CommonPool. We are probably reaching the limit of what can be scraped from the Common Crawl archive. It is possible that we get to 100 billion, but in my opinion that is the upper bound of the public domain. There are other data sources that could break this scaling limit. Social media platforms such as Instagram hold hundreds of billions of posts. The total image count in personal photo archives (e.g. Google Photos, Apple Photos) should be well into the trillions, and they contain much higher quality photos. DOCCI, for example, shows that a single researcher can curate more than 10,000 high quality images in a few years. IoT devices could be a rich source of image data as well.

The quality and diversity of the images in these datasets resemble the content of LAION-400M. This is evidenced by how similar generated images look and feel across the latest models and products. The contents of datasets such as YFCC100M, LAION-2B, CommonPool, and COYO are remarkably similar, and one could extract similar subsets from each of them. For example, if one were to filter for humans, aesthetics, size, aspect ratio, and text-image alignment, the resulting subsets would contain very similar images. I suspect that is just the nature of images in the public digital domain; it is a reflection of the evolution of digital media. More likely than not, the Google and Facebook private datasets look similar to the open source, web-scraped versions.
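
A sketch of what such a subset extraction might look like, assuming a metadata table with precomputed per-image columns. The column names and thresholds here are hypothetical, though LAION and COYO releases ship comparable per-sample metadata in parquet shards.

```python
# Hypothetical subset extraction over precomputed per-image metadata.
# Column names (width, height, aesthetic_score, clip_score, has_face) and
# thresholds are illustrative assumptions, not any dataset's actual schema.
import pandas as pd

def filter_subset(meta: pd.DataFrame) -> pd.DataFrame:
    aspect = meta["width"] / meta["height"]
    keep = (
        (meta["width"] >= 512) & (meta["height"] >= 512)  # minimum size
        & aspect.between(0.75, 1.33)                      # near-square aspect ratio
        & (meta["aesthetic_score"] >= 5.0)                # aesthetic predictor output
        & (meta["clip_score"] >= 0.3)                     # text-image alignment
        & meta["has_face"]                                # contains humans
    )
    return meta[keep]

# Usage (hypothetical file name):
# subset = filter_subset(pd.read_parquet("metadata-shard-0000.parquet"))
```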

The text descriptions are still under active development, and there are a few varieties. The first generation of text was merely for classification: object and concept labels, as in TinyImages, ImageNet, and SUN. The second variety is descriptions scraped from the alt attribute of the HTML img tag; these tend to be short, noisy, and can contain irrelevant text. The third variety is typified by DCI, IIW, and DOCCI. In the last couple of years it has been all about vision-language models, which means the text should be general and contain as much detail as possible. The descriptions in DCI, IIW, and DOCCI are human annotated through multiple rounds; they are impossibly detailed, purposely and painstakingly written, and time consuming to create. These datasets are rare and small, in the 10k range [1]. The fourth type is recaptioning. This is a big trend: most synthetic data focuses on recaptioning, and many image, video, and multimodal training datasets choose recaptioned descriptions over the raw text [2].
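
A minimal recaptioning sketch, assuming the public Salesforce/blip-image-captioning-base checkpoint as a stand-in. Real pipelines (LAION-COCO with BLIP plus CLIP re-ranking, Recap-DataComp-1B with LLaVA, PixelProse with Gemini) follow the same loop with much larger models.

```python
# Minimal recaptioning sketch: replace noisy alt-text with a model-generated
# caption. BLIP here is a small stand-in for the larger captioners used in
# production recaptioning pipelines.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def recaption(image_path: str, max_new_tokens: int = 50) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)

# Usage: each (url, alt_text) pair becomes (url, alt_text, recaption(local_path)),
# and training can mix or prefer the synthetic caption over the raw alt-text.
```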

Video Datasets

| Name | Released | Size | Source | Org | Notes |
|---|---|---|---|---|---|
| UCF-101 | 2012 | 13k clips | YouTube | U of Central Florida | classification |
| ActivityNet | 2015 | 30k videos | YouTube | KAUST | classification |
| Kinetics-600 | 2018 | 500k clips | YouTube | DeepMind | classification |
| SSv2 | 2018 | 220k clips | crowdsourced | Qualcomm | action recognition; video understanding |
| HowTo100M | 2019 | 1.2m videos, 136m clips | YouTube | DeepMind | instructional videos |
| MovieNet | 2020 | 1,000 movies | movies | CUHK | |
| WebVid-10M | 2021 | 10m clips | Shutterstock | Oxford | low resolution; poor visual quality; watermarks |
| MERLOT | 2021 | 6m videos, 180m clips | YouTube | Allen Institute for AI | aims to be generic, diverse; YT-Temporal-180M |
| CelebV-HQ | 2022 | 15k videos | YouTube | Nanyang Technological University | focus on humans |
| HD-VILA-100M | 2022 | 3.3m videos, 100m clips | YouTube | Microsoft | aims to be diverse |
| VidChapters-7M | 2023 | 817k videos, 7m chapters | YouTube | Meta | |
| HD-VG-130M | 2024 | 1.5m videos, 130m clips | YouTube | Microsoft | |
| Panda-70M | 2024 | 3.3m videos, 70m clips | HD-VILA-100M | Snap | recaptioned |
| MERLOT Reserve | 2024 | 20m videos | YouTube | Allen Institute for AI | YT-Temporal-1B |

Table 5: Video Datasets

Almost all of the open source video datasets use YouTube as the data source; all the significant datasets are of this variety. Their creation process is simple: scrape YouTube IDs based on some criteria. For example, CelebV-HQ focuses on celebrities, HowTo100M focuses on how-to videos, and HD-VILA and MERLOT attempt to cover diverse topics. The videos are then cut into short clips of 5 to 20 seconds, usually on scene-change boundaries. Different datasets might apply additional filtering, metadata processing, and recaptioning. YouTube is the singularly important data source for video datasets. If Google were to pursue serious lawsuits against major AI model developers and open source researchers in the near future, it could impact many research projects and video AI products. There are alternative sources of video, such as movies, broadcast studio archives, short-video platforms, and alternative video platforms. Still, YouTube is king.
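
A sketch of that scrape-and-cut loop, using yt-dlp, PySceneDetect, and ffmpeg as stand-ins for whatever tooling each dataset team actually used; the 5-to-20-second bounds mirror the range mentioned above.

```python
# Sketch of the download / scene-cut pipeline described above: fetch a YouTube
# video by id, detect scene changes, and keep clips between 5 and 20 seconds.
import subprocess
from scenedetect import detect, ContentDetector
from yt_dlp import YoutubeDL

def download(video_id: str) -> str:
    path = f"{video_id}.mp4"
    with YoutubeDL({"format": "mp4", "outtmpl": path}) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={video_id}"])
    return path

def cut_clips(video_path: str, min_len: float = 5.0, max_len: float = 20.0) -> list[str]:
    clips = []
    for i, (start, end) in enumerate(detect(video_path, ContentDetector())):
        duration = end.get_seconds() - start.get_seconds()
        if not (min_len <= duration <= max_len):
            continue  # drop scenes outside the target clip length
        out = f"{video_path.removesuffix('.mp4')}_clip{i:04d}.mp4"
        # Stream copy for speed; a real pipeline might re-encode for frame-accurate cuts.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start.get_seconds()), "-t", str(duration),
             "-i", video_path, "-c", "copy", out],
            check=True,
        )
        clips.append(out)
    return clips

# Usage: clips = cut_clips(download("<youtube_id>"))
```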

In theory, video datasets should be orders of magnitude larger than image datasets. In practice, open source video datasets are comparable to image datasets, or slightly smaller. Each YouTube video is roughly 30 MB, and the largest video dataset (MERLOT Reserve) is about 500 TB, which is comparable to LAION-2B. This is partly due to the compute resources those dataset creation teams have access to. Large industrial labs such as OpenAI are certainly working to create datasets an order of magnitude larger than these. It also indicates that video datasets are still in their early stages. Unlike image datasets, I expect video datasets to dramatically increase in scale in the coming years, if not the coming months. Video models should still have a lot of room to grow. It is exciting to see what kind of video editing will be enabled by the next generation of video models in the next few years.
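
A quick back-of-envelope check of those numbers, with the per-video size as a rough assumption:

```python
# Back-of-envelope storage estimate for the figures above (assumed averages).
videos = 20_000_000        # MERLOT Reserve source videos (see Table 5)
mb_per_video = 30          # rough average size of a downloaded YouTube video
total_tb = videos * mb_per_video / 1_000_000
print(f"~{total_tb:.0f} TB of raw video")   # ~600 TB, the same ballpark as the ~500 TB cited above
```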


Footnotes

  1. It would be interesting to see a team devote substantial human resources to creating a 500k dataset with such detailed descriptions and use it to train a captioning model. That is one experimental design no one has attempted yet to push the limits of vision-language models.
  2. Model interdependence is a fascinating phenomenon that is not widely discussed yet. Most generated outputs look remarkably similar. The models' architectures are similar, their training data are similar, and they use a lot of the same underlying models to process data or create synthetic data; recaptioning is the most common example. From an information-theoretic standpoint, the models all contain similar knowledge, so it is not surprising that their results are similar. It is tempting to think they will invariably converge as models get larger and more capable, unless they start from different information.

