How can you access the data?
The Common Screens dataset lives on Amazon S3. You can download or hotlink the image files free of charge over HTTP(S), directly from S3, or through the CloudFront CDN.
S3 Bucket Information
The AWS S3 bucket is s3://common-screens, hosted in the us-west-2 (Oregon) region.
CloudFront CDN distribution
The CloudFront distribution domain name is dqh5x5k6xg3n1.cloudfront.net.
Image name from domain name
The image name is derived from the domain name by converting every non-alphanumeric character to a hyphen “-” and appending the suffix “–.jpeg”. For example, aws.amazon.com becomes aws-amazon-com–.jpeg. The image can be downloaded or hotlinked using the URL
https://dqh5x5k6xg3n1.cloudfront.net/aws-amazon-com–.jpeg
or via S3 over HTTPS using
https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/aws-amazon-com–.jpeg
All JPEG images share the S3 bucket prefix s3://common-screens/data/jpeg/.
If you want to link to the source servers, the URL is
https://data.commonscreens.com/aws-amazon-com–.jpeg
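The naming rule and the three URL forms above can be sketched in Python as follows. This is a minimal illustration, not an official client: the function names are hypothetical, "non-alphanumeric" is assumed to mean anything outside ASCII letters and digits, and the suffix is copied exactly as this page displays it (an en dash followed by .jpeg). If the real file names use a different suffix, for example a double hyphen that was typographically converted, pass it through the suffix parameter.

```python
import re
from urllib.parse import quote

# Base URLs taken from this page.
CDN_BASE = "https://dqh5x5k6xg3n1.cloudfront.net/"
S3_BASE = "https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/"
SOURCE_BASE = "https://data.commonscreens.com/"

# Suffix as displayed on this page: en dash (U+2013) + ".jpeg".
# Assumption: this is the literal suffix; override it if the stored names differ.
SUFFIX_AS_SHOWN = "\u2013.jpeg"


def image_name(domain, suffix=SUFFIX_AS_SHOWN):
    """Replace every character outside A-Z, a-z, 0-9 with "-" and append the suffix."""
    return re.sub(r"[^A-Za-z0-9]", "-", domain) + suffix


def image_urls(domain, suffix=SUFFIX_AS_SHOWN):
    """Return the CDN, S3, and source-server URLs for a domain's screenshot."""
    # Percent-encode the name so any non-ASCII character is URL-safe.
    name = quote(image_name(domain, suffix))
    return {
        "cdn": CDN_BASE + name,
        "s3": S3_BASE + name,
        "source": SOURCE_BASE + name,
    }


if __name__ == "__main__":
    for label, url in image_urls("aws.amazon.com").items():
        print(label, url)
```

The sketch does not lowercase or otherwise normalize the domain, since the page does not say whether that is expected.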
Bucket structure
/data — all image and ocr data
/data/jpeg — jpeg images (mapped to cloudfront distribution)
/data/png — png images (not available yet)
/data/tiff — tiff images (not available yet)
/data/ocr — OCR text files, named after the JPEG image with .txt appended (still being processed)
/metadata — metadata for images and domains, in CSV and gz format
/source-data — source domain names used to generate all screenshots, in CSV and gz format
/machine-learning – machine learning models for IAB classification of text, image processing, etc.
/docs — documentation
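One way to explore the layout above is to list the keys under a prefix through the S3 REST API (ListObjectsV2) over plain HTTPS. The sketch below assumes the bucket permits anonymous listing, which this page does not state, and that the directories shown above map to key prefixes without the leading slash (for example "metadata/"). The function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Virtual-hosted-style endpoint for the bucket, as used in the URLs on this page.
BUCKET_ENDPOINT = "https://common-screens.s3.us-west-2.amazonaws.com/"
# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(prefix, max_keys=10):
    """Build a ListObjectsV2 request URL for keys under the given prefix."""
    query = urlencode({"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)})
    return BUCKET_ENDPOINT + "?" + query


def parse_keys(xml_text):
    """Extract object keys from a ListObjectsV2 XML response body."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(S3_NS + "Key")]


def list_keys(prefix, max_keys=10):
    """Fetch one page of keys. Needs network access and public list permission."""
    with urllib.request.urlopen(list_url(prefix, max_keys), timeout=30) as response:
        return parse_keys(response.read().decode("utf-8"))
```

Calling list_keys("metadata/") would return up to ten key names if anonymous listing is allowed; otherwise the request fails with an HTTP error, and the files can still be fetched individually by their known URLs.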
Metadata
The metadata is updated every month as new domains are added. It contains the IAB category and page language, both determined by machine learning, along with other useful domain profile information. At its core, the metadata is the title, description, and keywords captured from the primary webpage of each domain.
ML Training Datasets
https://common-screens.s3.us-west-2.amazonaws.com/machine-learning/text-classification-training-dataset-en.csv (manually verified English-language IAB classification training dataset)
https://common-screens.s3.us-west-2.amazonaws.com/machine-learning/text-classification-training-dataset-multi-lingual.csv (multilingual IAB classification training dataset, derived by translating the manually verified English dataset)
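Since this page does not document the columns of the training datasets, a reasonable first step is to look at the header and a few rows before building anything on top of them. The sketch below streams a CSV and reads only its first rows. It assumes the files are UTF-8 encoded and start with a header row, neither of which is confirmed here. The function names are hypothetical.

```python
import csv
import io
import urllib.request
from itertools import islice

# URL taken from this page.
TRAINING_EN_URL = (
    "https://common-screens.s3.us-west-2.amazonaws.com/"
    "machine-learning/text-classification-training-dataset-en.csv"
)


def peek_csv(text_stream, n_rows=5):
    """Return (header, first n_rows data rows) without assuming column names."""
    reader = csv.reader(text_stream)
    header = next(reader, [])  # empty list if the stream has no rows at all
    rows = list(islice(reader, n_rows))
    return header, rows


def peek_remote_csv(url, n_rows=5, encoding="utf-8"):
    """Stream a remote CSV and read only its first rows. Needs network access."""
    with urllib.request.urlopen(url, timeout=30) as response:
        text = io.TextIOWrapper(response, encoding=encoding, newline="")
        return peek_csv(text, n_rows)
```

For example, peek_remote_csv(TRAINING_EN_URL, 3) would return the header and the first three rows, which is enough to decide how to map columns to text and labels.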