How can you access the data?

The Common Screens dataset is hosted on Amazon S3. You can download or hotlink the image files free of charge over HTTP(S), through S3, or through the CloudFront CDN.

S3 Bucket Information

The AWS S3 bucket is s3://common-screens, hosted in the us-west-2 (Oregon) region.
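
Objects in a bucket can generally also be addressed over HTTPS using S3's virtual-hosted-style URL form, which combines the bucket name and region into the host name. The following is a minimal sketch of that conversion; the function name `s3_uri_to_https` is hypothetical, and the sketch assumes the object is publicly readable.

```python
from urllib.parse import quote, urlparse

REGION = "us-west-2"  # region stated for the common-screens bucket


def s3_uri_to_https(uri: str, region: str = REGION) -> str:
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    # Percent-encode the key but leave the "/" separators untouched.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"
```

Here `example.jpeg` in a call such as `s3_uri_to_https("s3://common-screens/data/jpeg/example.jpeg")` would be a placeholder file name, not a real object in the bucket.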

CloudFront CDN distribution

The CloudFront distribution domain name is dqh5x5k6xg3n1.cloudfront.net

Image name from domain name

The image name is derived from the domain name by converting every non-alphanumeric character to a hyphen “-” and appending the suffix “–.jpeg”. For example, aws.amazon.com becomes aws-amazon-com–.jpeg. The image can be downloaded or hotlinked using the URL
https://dqh5x5k6xg3n1.cloudfront.net/aws-amazon-com–.jpeg
or via S3 over HTTPS using
https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/aws-amazon-com–.jpeg
The prefix s3://common-screens/data/jpeg/ applies to all JPEG images in the bucket.

To link to the source servers instead, use the URL
https://data.commonscreens.com/aws-amazon-com–.jpeg
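
The naming rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's own code: the helper names `image_name` and `image_urls` are hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, each offending character is replaced individually, and case is left unchanged. The suffix is copied verbatim from the text (an en dash followed by `.jpeg`); if the real file names use a different separator, for instance two plain hyphens, pass that as `suffix` instead.

```python
import re

CDN_BASE = "https://dqh5x5k6xg3n1.cloudfront.net/"
S3_BASE = "https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/"
SOURCE_BASE = "https://data.commonscreens.com/"

# Suffix exactly as printed in the text: U+2013 (en dash) + ".jpeg".
# Assumption: this matches the stored file names; override if it does not.
SUFFIX = "\u2013.jpeg"


def image_name(domain: str, suffix: str = SUFFIX) -> str:
    """Replace each character that is not an ASCII letter or digit with '-'."""
    return re.sub(r"[^A-Za-z0-9]", "-", domain) + suffix


def image_urls(domain: str, suffix: str = SUFFIX) -> dict:
    """Build the three URL variants listed above for one domain."""
    name = image_name(domain, suffix)
    return {
        "cdn": CDN_BASE + name,
        "s3": S3_BASE + name,
        "source": SOURCE_BASE + name,
    }
```

Because the suffix as printed contains a non-ASCII character, some HTTP clients may need the file name percent-encoded (for example with `urllib.parse.quote`) before the request is sent.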

Bucket structure

/data - all image and OCR data
/data/jpeg - JPEG images (mapped to the CloudFront distribution)
/data/png - PNG images (not available yet)
/data/tiff - TIFF images (not available yet)
/data/ocr - OCR text files, named as the JPEG image name with .txt appended (still being processed)
/metadata - metadata for images and domains, in CSV and gz format
/source-data - source domain names used to generate all screenshots, in CSV and gz format
/machine-learning - machine learning models for IAB classification of text, image processing, etc.
/docs - documentation
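
As a rough illustration of how the image and OCR prefixes relate, the sketch below derives both object keys from one JPEG file name. The helper name `object_keys` is hypothetical, and the OCR key layout is an assumption read off the description above (JPEG name plus `.txt` under `data/ocr/`); OCR files may not exist yet for a given image.

```python
PREFIXES = {
    "jpeg": "data/jpeg/",
    "ocr": "data/ocr/",
}


def object_keys(jpeg_name: str) -> dict:
    """Return the assumed S3 keys for an image and its OCR text file."""
    return {
        "jpeg": PREFIXES["jpeg"] + jpeg_name,
        # Assumed naming: the JPEG file name with ".txt" appended.
        "ocr": PREFIXES["ocr"] + jpeg_name + ".txt",
    }
```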

Metadata

The metadata is updated every month as new domains are added. It contains the IAB category and page language, both determined by machine learning, along with other useful domain profile fields. The core of the metadata is the title, description and keywords captured from the primary webpage of each domain.

https://common-screens.s3.us-west-2.amazonaws.com/metadata/common-screens-2022-12.csv.gz (2.7 GB)
https://common-screens.s3.us-west-2.amazonaws.com/metadata/common-screens-2022-header.txt
https://common-screens.s3.us-west-2.amazonaws.com/metadata/common-screens-with-meta-2022-12.csv.gz (7.3 GB)
https://common-screens.s3.us-west-2.amazonaws.com/metadata/common-screens-with-meta-2022-header.txt
https://common-screens.s3.us-west-2.amazonaws.com/metadata/common-screens-jpeg-filenames-2022-12.txt.gz (408 MB)
https://common-screens.s3.us-west-2.amazonaws.com/metadata/common-screens-2022-12-mysqldump.sql.gz (7.3 GB)
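
Since the metadata files are several gigabytes each, it is usually more practical to stream them than to load them whole. The sketch below pairs a gzip-compressed CSV stream with the column names from its separate header file. It rests on assumptions the page does not confirm: that the header file is a single comma-separated line, that the data file itself has no header row, and that the text is UTF-8. The function name `iter_metadata` and the column names in the demo are hypothetical.

```python
import csv
import gzip
import io


def iter_metadata(gz_stream, header_line):
    """Yield one dict per CSV row, keyed by the names in header_line."""
    # Parse the header with the csv module so quoted names are handled.
    columns = next(csv.reader([header_line.strip()]))
    with gzip.open(gz_stream, mode="rt", encoding="utf-8", newline="") as text:
        for row in csv.reader(text):
            yield dict(zip(columns, row))


# Demo on in-memory data; "domain" and "language" are made-up column names.
sample = gzip.compress(b"example.com,en\nexample.org,fr\n")
rows = list(iter_metadata(io.BytesIO(sample), "domain,language\n"))
```

For real use, `gz_stream` could be a previously downloaded `.csv.gz` file opened in binary mode, with `header_line` read from the matching header `.txt` file.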

ML Training Datasets


https://common-screens.s3.us-west-2.amazonaws.com/machine-learning/text-classification-training-dataset-en.csv (manually verified English-language IAB classification training dataset)

https://common-screens.s3.us-west-2.amazonaws.com/machine-learning/text-classification-training-dataset-multi-lingual.csv (multilingual IAB classification training dataset, derived by translating the manually verified English dataset)