How can you access the data?

The Common Screens dataset is hosted on Amazon S3. You can download or hotlink the image files for free over HTTPS or S3.

S3 bucket information

The AWS S3 bucket is s3://common-screens in the us-west-2 (Oregon) region.

Image name from domain name

The image name is derived from a domain name by lowercasing it, replacing every run of non-alphanumeric characters with a single hyphen, and appending --.jpeg. For example, a427.com becomes a427-com--.jpeg.

Example URLs

S3 HTTPS:   https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/a427-com--.jpeg
S3 URI:     s3://common-screens/data/jpeg/a427-com--.jpeg
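
The two example forms above point at the same object, so one can be derived from the other. The following is a minimal sketch, assuming the virtual-hosted-style URL pattern shown in the HTTPS example; the function name s3_uri_to_https is hypothetical and not part of any Common Screens tooling.

```python
from urllib.parse import urlparse

# Region taken from the bucket information above.
REGION = "us-west-2"


def s3_uri_to_https(uri: str, region: str = REGION) -> str:
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc is the bucket name; path is the object key with a leading slash.
    return f"https://{parsed.netloc}.s3.{region}.amazonaws.com{parsed.path}"


print(s3_uri_to_https("s3://common-screens/data/jpeg/a427-com--.jpeg"))
```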

Rule

Start with the host name, not a full page path.
Lowercase the host.
Replace each run of characters outside a-z and 0-9 with one hyphen.
Append --.jpeg.
Prepend https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/ to form the full URL.

JavaScript

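// Build the Common Screens JPEG filename from a host name or URL.
function commonScreensFilename(host) {
  return host
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\//, '')   // drop the scheme, if present
    .split(/[\/?#]/)[0]            // drop any path, query, or fragment
    .replace(/:\d+$/, '')          // drop a port number, if present
    .replace(/[^a-z0-9]+/g, '-') + '--.jpeg'; // each run of other characters becomes one hyphen
}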

const filename = commonScreensFilename('a427.com');
const url = 'https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/' + filename;

Python

import re
from urllib.parse import urlparse

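def common_screens_filename(value):
    """Build the Common Screens JPEG filename from a host name or URL."""
    value = value.strip()
    # Without a scheme, urlparse treats the input as a path, so add one if needed.
    parsed = urlparse(value if '://' in value else 'http://' + value)
    host = (parsed.hostname or value).lower()
    # Collapse each run of characters outside a-z and 0-9 into one hyphen.
    return re.sub(r'[^a-z0-9]+', '-', host) + '--.jpeg'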

filename = common_screens_filename('a427.com')
url = f'https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/{filename}'

AWS CLI

aws s3 cp --no-sign-request \
  s3://common-screens/data/jpeg/a427-com--.jpeg \
  ./screenshots/a427-com--.jpeg
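
If the AWS CLI is not available, the same file can be fetched over the public HTTPS endpoint. The sketch below uses only the Python standard library; the function name download_screenshot and the default screenshots directory are assumptions made for illustration. A domain that has no image in the dataset would likely surface as an urllib.error.HTTPError.

```python
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/"


def download_screenshot(filename: str, dest_dir: str = "screenshots") -> Path:
    """Fetch one JPEG over HTTPS and save it under dest_dir."""
    dest = Path(dest_dir) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of holding the whole image in memory.
    with urllib.request.urlopen(BASE_URL + filename, timeout=30) as resp, \
            open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


# Example call (performs a network request):
# download_screenshot("a427-com--.jpeg")
```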

Bucket structure

Prefix	Contents
`/data/jpeg`	JPEG images available through the public S3 HTTPS endpoint.
`/data/png`	PNG images when available.
`/data/tiff`	TIFF images when available.
`/data/ocr`	OCR text files, named by appending `.txt` to the JPEG image name.
`/metadata`	Metadata for images and domains in CSV and gzip formats.
`/source-data`	Source domain names used to generate screenshots.
`/machine-learning`	Machine learning datasets and models for IAB classification and image processing.
`/docs`	Documentation files.
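
As a rough illustration of the naming described in the table, the sketch below builds the S3 URIs for a JPEG and its OCR text file. It assumes object keys carry no leading slash, as in the example URLs, and that the OCR file sits directly under the ocr prefix; the helper names are hypothetical.

```python
BUCKET = "common-screens"


def jpeg_uri(jpeg_filename: str) -> str:
    """S3 URI of a JPEG screenshot, e.g. a427-com--.jpeg."""
    return f"s3://{BUCKET}/data/jpeg/{jpeg_filename}"


def ocr_uri(jpeg_filename: str) -> str:
    """S3 URI of the OCR text: the JPEG image name with .txt appended."""
    return f"s3://{BUCKET}/data/ocr/{jpeg_filename}.txt"


print(jpeg_uri("a427-com--.jpeg"))
print(ocr_uri("a427-com--.jpeg"))
```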

Metadata

The metadata is updated monthly with new domains. It includes IAB category, page language, title, description, and keywords captured from each domain's primary webpage.

File	Legacy size
common-screens-2022-12.csv.gz	2.7 GB
common-screens-2022-header.txt
common-screens-with-meta-2022-12.csv.gz	7.3 GB
common-screens-with-meta-2022-header.txt
common-screens-jpeg-filenames-2022-12.txt.gz	408 MB
common-screens-2022-12-mysqldump.sql.gz	7.3 GB
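
Because the metadata files are several gigabytes compressed, it is usually safer to stream them than to load them whole. The sketch below is one way to do that with the standard library. It assumes, without confirmation from the dataset documentation, that each header file holds a single comma-separated line of column names, that the CSV itself has no header row, and that both are UTF-8; the function names are hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterator, List


def read_header(header_path: str) -> List[str]:
    """Read column names from a header file (assumed: one CSV line)."""
    with open(header_path, encoding="utf-8", newline="") as f:
        return next(csv.reader(f))


def iter_metadata_rows(csv_gz_path: str, header_path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per row without decompressing the whole file to disk."""
    fieldnames = read_header(header_path)
    with gzip.open(csv_gz_path, "rt", encoding="utf-8", newline="") as f:
        yield from csv.DictReader(f, fieldnames=fieldnames)
```

After downloading a metadata file and its header file locally, rows can be consumed one at a time, for example with for row in iter_metadata_rows('common-screens-2022-12.csv.gz', 'common-screens-2022-header.txt').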

ML training datasets

English IAB classification training dataset

Multilingual IAB classification training dataset