Common Screens

How can you access the data?

The Common Screens dataset is hosted on Amazon S3. You can download or hotlink the image files for free over HTTPS or through S3.

S3 bucket information

The AWS S3 bucket is s3://common-screens in the us-west-2 (Oregon) region.

Image name from domain name

The image name is derived from a domain name by lowercasing it, replacing every run of non-alphanumeric characters with a single hyphen, and appending --.jpeg. For example, a427.com becomes a427-com--.jpeg.

Example URLs

S3 HTTPS:   https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/a427-com--.jpeg
S3 URI:     s3://common-screens/data/jpeg/a427-com--.jpeg
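
The two example URLs address the same object. As a minimal sketch, assuming the bucket follows the standard virtual-hosted-style S3 endpoint pattern (which the HTTPS example above matches), an S3 URI can be converted to its HTTPS form as shown here. The function name s3_uri_to_https and its region default are illustrative and not part of any Common Screens tooling.

```python
from urllib.parse import quote, urlparse

def s3_uri_to_https(s3_uri, region="us-west-2"):
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    key = parsed.path.lstrip("/")
    # quote() percent-encodes unsafe characters but leaves "/" and "-" intact
    return f"https://{parsed.netloc}.s3.{region}.amazonaws.com/{quote(key)}"

print(s3_uri_to_https("s3://common-screens/data/jpeg/a427-com--.jpeg"))
```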

Rule

  1. Start with the host name, not a full page path.
  2. Lowercase the host.
  3. Replace each run of characters outside a-z and 0-9 with one hyphen.
  4. Append --.jpeg.
  5. Prepend https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/ to form the full URL.

JavaScript

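function commonScreensFilename(host) {
  return host
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\//, '')  // drop a leading scheme, if present
    .split('/')[0]                // drop any path
    .split('?')[0]                // drop any query string
    .split('#')[0]                // drop any fragment
    // collapse each run of non-alphanumeric characters into one hyphen
    .replace(/[^a-z0-9]+/g, '-') + '--.jpeg';
}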

const filename = commonScreensFilename('a427.com');
const url = 'https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/' + filename;

Python

import re
from urllib.parse import urlparse

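def common_screens_filename(value):
    # urlparse only recognizes the host when a scheme is present, so add one if missing
    parsed = urlparse(value if '://' in value else 'http://' + value)
    host = (parsed.hostname or value).strip().lower()
    # Collapse each run of non-alphanumeric characters into one hyphen
    return re.sub(r'[^a-z0-9]+', '-', host) + '--.jpeg'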

filename = common_screens_filename('a427.com')
url = f'https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/{filename}'

AWS CLI

aws s3 cp --no-sign-request \
  s3://common-screens/data/jpeg/a427-com--.jpeg \
  ./screenshots/a427-com--.jpeg
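
Where the AWS CLI is not installed, the same object can be fetched over the public HTTPS endpoint. The following is a sketch using only the Python standard library. The function name download and the default screenshots directory are illustrative choices, and the code does no retrying or error handling beyond what urllib raises.

```python
import shutil
import urllib.request
from pathlib import Path

def download(url, dest_dir="screenshots"):
    """Stream url into dest_dir, naming the file after the last URL path segment."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    # copyfileobj streams in chunks, so the image is never held fully in memory
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target

# Example call (performs a network request):
# download("https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/a427-com--.jpeg")
```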

Bucket structure

Prefix             Contents
/data/jpeg         JPEG images available through the public S3 HTTPS endpoint.
/data/png          PNG images when available.
/data/tiff         TIFF images when available.
/data/ocr          OCR text files with the JPEG image name appended with .txt.
/metadata          Metadata for images and domains in CSV and gzip formats.
/source-data       Source domain names used to generate screenshots.
/machine-learning  Machine learning datasets and models for IAB classification and image processing.
/docs              Documentation files.
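
The /data/ocr entry says that an OCR file carries the JPEG image name with .txt appended. A minimal sketch of deriving both S3 URIs for a host follows. It assumes that OCR objects sit directly under data/ocr/ in the same way that JPEGs sit under data/jpeg/, which the listing above does not spell out. The function name screenshot_and_ocr_uris is hypothetical.

```python
import re

BUCKET = "s3://common-screens"

def screenshot_and_ocr_uris(host):
    """Return (jpeg_uri, ocr_uri) for a host name."""
    # Same naming rule as for images: lowercase, hyphenate, append --.jpeg
    name = re.sub(r"[^a-z0-9]+", "-", host.strip().lower()) + "--.jpeg"
    # Assumption: the OCR key is the JPEG file name plus ".txt" under data/ocr/
    return f"{BUCKET}/data/jpeg/{name}", f"{BUCKET}/data/ocr/{name}.txt"

jpeg_uri, ocr_uri = screenshot_and_ocr_uris("a427.com")
print(jpeg_uri)
print(ocr_uri)
```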

Metadata

The metadata is updated monthly with new domains. It includes IAB category, page language, title, description, and keywords captured from each domain's primary webpage.

File                                          Legacy size
common-screens-2022-12.csv.gz                 2.7 GB
common-screens-2022-header.txt
common-screens-with-meta-2022-12.csv.gz       7.3 GB
common-screens-with-meta-2022-header.txt
common-screens-jpeg-filenames-2022-12.txt.gz  408 MB
common-screens-2022-12-mysqldump.sql.gz       7.3 GB
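
Because the CSV files are several gigabytes compressed, it is usually more practical to stream them than to decompress them fully. The sketch below makes two assumptions that are not confirmed here: each header .txt file holds a single comma-separated line of column names, and the matching .csv.gz file has no header row of its own. The function names and local paths are placeholders.

```python
import csv
import gzip

def read_header(header_path):
    """Read column names from the header file (assumed: one comma-separated line)."""
    with open(header_path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))

def iter_rows(csv_gz_path, header_path):
    """Yield one dict per metadata row, decompressing the file as it is read."""
    columns = read_header(header_path)
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as f:
        # fieldnames is passed explicitly because the data file is assumed headerless
        for row in csv.DictReader(f, fieldnames=columns):
            yield row
```

With local copies of the files, usage would look like iter_rows("common-screens-2022-12.csv.gz", "common-screens-2022-header.txt"), consumed in a for loop so that only one row is in memory at a time.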

ML training datasets

English IAB classification training dataset

Multilingual IAB classification training dataset