How can you access the data?

The Common Screens dataset lives on Amazon S3. You can download or hotlink the image files free of charge over HTTP(S), directly from S3, or through CloudFront CDN delivery.

S3 Bucket Information

The AWS S3 bucket is s3://common-screens, hosted in the us-west-2 (Oregon) region.

CloudFront CDN distribution

The CloudFront distribution domain name is

Image name from domain name

The image name is derived from the domain name by converting every non-alphanumeric character to a hyphen “-” and appending the suffix “–.jpeg”. For example, a domain such as aws.amazon.com becomes aws-amazon-com–.jpeg. The image can be downloaded or hotlinked using the URL–.jpeg
or via S3 over HTTPS using–.jpeg.
The S3 bucket prefix is s3://common-screens/data/jpeg/ for all JPEG images.
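
The naming rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the function names are hypothetical, the suffix is copied exactly as it appears in the text (it may be a typographic rendering of two plain hyphens, so check it against a real object name), and "alphanumeric" is assumed to mean ASCII letters and digits only.

```python
import re

# Suffix copied as it appears in the text above. It may be a typographic
# rendering of "--.jpeg"; verify against a real file name before relying on it.
IMAGE_SUFFIX = "–.jpeg"

# Bucket prefix for JPEG images, as stated in the text above.
S3_JPEG_PREFIX = "s3://common-screens/data/jpeg/"


def image_name(domain: str, suffix: str = IMAGE_SUFFIX) -> str:
    """Replace each non-alphanumeric character with '-' and append the suffix.

    Assumption: only ASCII letters and digits count as alphanumeric, and the
    domain's letter case is left unchanged.
    """
    return re.sub(r"[^A-Za-z0-9]", "-", domain) + suffix


def s3_uri(domain: str, suffix: str = IMAGE_SUFFIX) -> str:
    """Build the S3 URI of the JPEG screenshot for a domain."""
    return S3_JPEG_PREFIX + image_name(domain, suffix)


if __name__ == "__main__":
    print(image_name("aws.amazon.com"))
    print(s3_uri("aws.amazon.com"))
```

If the real suffix turns out to be different, pass it explicitly, for example `image_name("aws.amazon.com", suffix="--.jpeg")`.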

If you want to link to the source servers, the URL will be–.jpeg


Bucket structure

/data — all image and ocr data
/data/jpeg — jpeg images (mapped to cloudfront distribution)
/data/png — png images (not available yet)
/data/tiff — tiff images (not available yet)
/data/ocr — OCR text files, named as the JPEG image name with .txt appended (in processing)
/metadata — metadata for images and domains, in csv and gz format
/source-data — source domain names used to generate all screenshots, in csv and gz format
/machine-learning — machine learning models for IAB classification of text, image processing, etc.
/docs — documentation
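
As an illustration of how the layout above relates the image and OCR prefixes, the sketch below maps a JPEG object key to the key of its OCR text file. The function name is hypothetical, and it assumes that ".txt" is appended to the full image name, including its .jpeg extension, which is one reading of the listing. S3 object keys carry no leading slash, so the prefixes are written without one.

```python
JPEG_PREFIX = "data/jpeg/"
OCR_PREFIX = "data/ocr/"


def ocr_key(jpeg_key: str) -> str:
    """Map a JPEG object key to the key of its OCR text file.

    Assumption: the OCR file name is the full JPEG file name with ".txt"
    appended, stored under the data/ocr/ prefix.
    """
    if not jpeg_key.startswith(JPEG_PREFIX):
        raise ValueError(f"not a JPEG key under {JPEG_PREFIX!r}: {jpeg_key!r}")
    file_name = jpeg_key[len(JPEG_PREFIX):]
    return OCR_PREFIX + file_name + ".txt"


if __name__ == "__main__":
    # Hypothetical key, used only to show the mapping.
    print(ocr_key("data/jpeg/example-com.jpeg"))
```

Since the OCR data is marked as still in processing, a key produced this way may not exist yet for every image.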


The metadata is updated every month as new domains are added. It contains the IAB category and page language, both determined by machine learning, along with other useful domain profile data. The metadata is essentially the title, description and keywords captured from the primary webpage of the domain.
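
For readers who want to work with the metadata, here is a minimal sketch of reading a gzip-compressed CSV file with the Python standard library. It assumes the metadata files are UTF-8 CSV files compressed with gzip and that they have a header row. The column names used in the sample (`domain`, `language`, `iab_category`) are hypothetical placeholders, not the actual schema.

```python
import csv
import gzip
import io
from typing import BinaryIO, Dict, Iterator


def read_metadata(stream: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per row from a gzip-compressed CSV with a header row."""
    with gzip.open(stream, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.DictReader(text)


# In-memory sample with hypothetical columns, standing in for a real file.
sample = gzip.compress(
    b"domain,language,iab_category\n"
    b"example.com,en,Technology\n"
    b"example.org,fr,News\n"
)

rows = list(read_metadata(io.BytesIO(sample)))
english = [row["domain"] for row in rows if row["language"] == "en"]
print(english)
```

To read a downloaded file instead of the sample, open it in binary mode and pass the file object to `read_metadata`.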

ML Training Datasets

(English-language IAB classification training dataset, manually verified)
(Multilingual IAB classification training dataset, derived by translating the manually verified English dataset)