How can you access the data?
The Common Screens dataset is hosted on Amazon S3. You can download or hotlink the image files for free over HTTPS or the S3 protocol.
S3 bucket information
The AWS S3 bucket is s3://common-screens in the us-west-2 Oregon region.
Image name from domain name
The image name is derived from a domain name by lowercasing it, replacing every run of non-alphanumeric characters with a single hyphen, and appending --.jpeg. For example, a427.com becomes a427-com--.jpeg.
Example URLs
S3 HTTPS: https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/a427-com--.jpeg
S3 URI: s3://common-screens/data/jpeg/a427-com--.jpeg
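The two forms above address the same object, so one can be derived from the other. A minimal Python sketch of that mapping, assuming the virtual-hosted-style endpoint shown in the HTTPS example; the function name s3_uri_to_https is illustrative and not part of any official tooling:

```python
from urllib.parse import quote

BUCKET = "common-screens"
REGION = "us-west-2"


def s3_uri_to_https(uri: str) -> str:
    """Map s3://common-screens/<key> to the matching public HTTPS URL."""
    prefix = f"s3://{BUCKET}/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a {BUCKET} URI: {uri!r}")
    key = uri[len(prefix):]
    # Percent-encode unusual characters; '/' is kept as the path separator.
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/{quote(key)}"


print(s3_uri_to_https("s3://common-screens/data/jpeg/a427-com--.jpeg"))
```

For the example object this prints the same HTTPS URL listed above.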
Rule
- Start with the host name, not a full page path.
- Lowercase the host.
- Replace each run of characters outside a-z and 0-9 with one hyphen.
- Append --.jpeg.
- Join it with https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/.
JavaScript
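// Accepts a bare host or a full URL and returns the image file name.
function commonScreensFilename(host) {
  return host
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\//, '') // drop an http(s) scheme if present
    .split('/')[0]               // drop any path
    .split('?')[0]               // drop any query string
    .split('#')[0]               // drop any fragment
    .replace(/[^a-z0-9]+/g, '-') + '--.jpeg';
}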
const filename = commonScreensFilename('a427.com');
const url = 'https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/' + filename;
Python
import re
from urllib.parse import urlparse


def common_screens_filename(value):
    # urlparse only finds a hostname when a scheme is present, so add one if missing.
    parsed = urlparse(value if '://' in value else 'http://' + value)
    host = (parsed.hostname or value).strip().lower()
    # Collapse each run of non-alphanumeric characters into a single hyphen.
    return re.sub(r'[^a-z0-9]+', '-', host) + '--.jpeg'


filename = common_screens_filename('a427.com')
url = f'https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/{filename}'
AWS CLI
aws s3 cp --no-sign-request \
  s3://common-screens/data/jpeg/a427-com--.jpeg \
  ./screenshots/a427-com--.jpeg
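When the AWS CLI is not available, the same object can be fetched over the public HTTPS endpoint. A minimal sketch using only the Python standard library; the helper names screenshot_url and download_screenshot, the 30-second timeout, and the default screenshots directory are illustrative choices, not part of the dataset's tooling:

```python
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://common-screens.s3.us-west-2.amazonaws.com/data/jpeg/"


def screenshot_url(filename: str) -> str:
    """Build the public HTTPS URL for a JPEG file name."""
    return BASE_URL + filename


def download_screenshot(filename: str, out_dir: str = "screenshots") -> Path:
    """Stream one JPEG to out_dir and return the local path."""
    target = Path(out_dir) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    # urlopen raises urllib.error.HTTPError if the object is missing.
    with urllib.request.urlopen(screenshot_url(filename), timeout=30) as resp:
        with open(target, "wb") as fh:
            shutil.copyfileobj(resp, fh)
    return target
```

Calling download_screenshot('a427-com--.jpeg') would write the file to ./screenshots/a427-com--.jpeg, mirroring the CLI command above.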
Bucket structure
| Prefix | Contents |
|---|---|
| /data/jpeg | JPEG images available through the public S3 HTTPS endpoint. |
| /data/png | PNG images when available. |
| /data/tiff | TIFF images when available. |
| /data/ocr | OCR text files with the JPEG image name appended with .txt. |
| /metadata | Metadata for images and domains in CSV and gzip formats. |
| /source-data | Source domain names used to generate screenshots. |
| /machine-learning | Machine learning datasets and models for IAB classification and image processing. |
| /docs | Documentation files. |
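The JPEG and OCR prefixes can be combined with the image name rule to build object keys for a domain. The sketch below is hedged: it assumes the OCR file name is the full JPEG name, extension included, with .txt appended (for example a427-com--.jpeg.txt), which is one reading of the table and has not been confirmed against the bucket. The function names are illustrative.

```python
import re


def jpeg_name(host: str) -> str:
    """Apply the image name rule to a bare host name."""
    return re.sub(r"[^a-z0-9]+", "-", host.strip().lower()) + "--.jpeg"


def object_keys(host: str) -> dict:
    """Return the assumed S3 keys for a host's JPEG and OCR text."""
    name = jpeg_name(host)
    return {
        "jpeg": f"data/jpeg/{name}",
        # Assumption: ".txt" is appended to the complete JPEG file name.
        "ocr": f"data/ocr/{name}.txt",
    }


print(object_keys("a427.com"))
```

PNG and TIFF keys are left out because the source does not state how those files are named.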
Metadata
The metadata is updated monthly with new domains. It includes IAB category, page language, title, description, and keywords captured from each domain's primary webpage.
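Since the metadata ships as gzip-compressed CSV, it can be read row by row without unpacking it to disk first. A minimal sketch, assuming UTF-8 text and a header row; the column names in the sample (domain, language, title) are made up for illustration and may not match the real files:

```python
import csv
import gzip
import io


def read_metadata_rows(raw: bytes):
    """Yield each row of a gzip-compressed CSV as a dict keyed by the header."""
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8", newline="") as fh:
        yield from csv.DictReader(fh)


# Hypothetical in-memory sample standing in for a downloaded metadata file.
sample = gzip.compress(b"domain,language,title\na427.com,en,Example\n")
for row in read_metadata_rows(sample):
    print(row["domain"], row["language"])
```

Check the header row of an actual metadata file before relying on any particular column name.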