The topic is Hate Speech Detection and I’m using this Kaggle competition: kaggle.com/c/detecting-insults-in-social-commentary
From what I understand, there are 3 relevant datasets for the purpose of binary text classification:
- train.csv: consists of the columns “Insult”, “Date” and “Comment”; because of the “Insult” column, this dataset is labeled (0 = OTHER; 1 = TOXIC)
- test_with_solutions.csv: consists of the columns “Insult”, “Date”, “Comment” and “Usage”; because of the “Insult” column, this dataset is labeled as well
- test.csv: consists of the columns “ID”, “Date” and “Comment”; this dataset is not labeled
I’m somewhat familiar with train_test_split for creating “unseen” data for the classifier, but what confuses me is that this dataset already seems to be split into train, dev and test sets.
My assumption is that train.csv is the train dataset (such wow :p), test_with_solutions.csv is my dev dataset and test.csv is my test dataset. But how can I evaluate my classifier on test.csv if it has no “Insult” column with the labels? Is that because of the Kaggle competition?
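To make the setup concrete, here is a minimal sketch of how I picture the three files being used. This is not my actual code. The CSV contents are tiny made-up stand-ins that only reuse the column names listed above, and the word-set "model" is a toy placeholder for a real classifier. The sketch also assumes that the unlabeled file is only meant for producing predictions, not for scoring them locally.

```python
import csv
import io
from collections import Counter

# Made-up stand-ins for the three files; only the column names match the real data.
TRAIN_CSV = '''Insult,Date,Comment
1,placeholder,"you are an idiot"
0,placeholder,"thanks for the helpful link"
1,placeholder,"what an idiot move"
0,placeholder,"nice article thanks"
'''
DEV_CSV = '''Insult,Date,Comment,Usage
1,placeholder,"such an idiot",placeholder
0,placeholder,"thanks a lot",placeholder
'''
TEST_CSV = '''ID,Date,Comment
1,placeholder,"you idiot"
2,placeholder,"helpful article"
'''


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def fit(rows):
    """Toy model: keep words seen in more insults than non-insults."""
    counts = Counter()
    for row in rows:
        sign = 1 if row["Insult"] == "1" else -1
        for word in set(row["Comment"].lower().split()):
            counts[word] += sign
    return {word for word, score in counts.items() if score > 0}


def predict(insult_words, comment):
    """Return 1 if the comment shares a word with the learned set, else 0."""
    return 1 if insult_words & set(comment.lower().split()) else 0


def accuracy(insult_words, rows):
    """Fraction of correct predictions; requires a labeled 'Insult' column."""
    hits = sum(predict(insult_words, r["Comment"]) == int(r["Insult"]) for r in rows)
    return hits / len(rows)


train = read_rows(TRAIN_CSV)  # labeled: used for fitting
dev = read_rows(DEV_CSV)      # labeled: can be scored locally
test = read_rows(TEST_CSV)    # unlabeled: predictions only

model = fit(train)
dev_accuracy = accuracy(model, dev)
test_predictions = {r["ID"]: predict(model, r["Comment"]) for r in test}
print(dev_accuracy, test_predictions)
```

With this layout, accuracy() works on the first two files but would fail on the third one, because there is no “Insult” column to compare against. That is exactly the part I’m unsure about.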
If it helps, I can also post my code 🙂
Can someone help me?