Any text classification engine will need a good training dataset, the more accurate the training dataset is the higher the accuracy of the classification exercise.
Common Screens provides a manually created training dataset available as follows
https://common-screens.s3.us-west-2.amazonaws.com/machine-learning/text-classification-training-dataset-en.csv (English language IAB Classification manually verified training dataset)
https://common-screens.s3.us-west-2.amazonaws.com/machine-learning/text-classification-training-dataset-multi-lingual.csv (Multilingual IAB Classification training dataset derived by translating the english manually verified dataset)
Following is a simplest example of such classification in PHP, the approach will be same for other ML languages also. This example uses the teamtnt\tntsearch project https://github.com/teamtnt/tntsearch
use TeamTNT\TNTSearch\Classifier\TNTClassifier; $classifier = new TNTClassifier(); $classifier->learn("A great game", "Sports"); $classifier->learn("The election was over", "Not sports"); $classifier->learn("Very clean match", "Sports"); $classifier->learn("A clean but forgettable game", "Sports"); $guess = $classifier->predict("It was a close election"); var_dump($guess['label']); //returns "Not sports"
You essentially will load the training model into a database and iterate through rows feeding them into $classifier->learn(“A clean but forgettable game”, “Sports”); And then iterate through the target metadata file and predict the category for each of the domains using a combination of title, description and keywords.
You can train once and reuse the generated model again and again by saving the model and loading it again as needed or keepng in in memory for faster processing.
Saving the classifier
$classifier = new TNTClassifier(); $classifier->load('sports.cls');