kaggle-lshtc from bamine – Github Help

Code for Large Scale Hierarchical Text Classification competition.

www.kaggle.com/c/lshtc

A centroid-based flat classifier.

Prediction

  1. Select the k candidate classes nearest to the query using a nearest centroid classifier.
  2. For each candidate class, use a binary classifier to judge whether the query should be accepted into that class.

(predict.cpp)

predict1

Selects the k candidate classes whose centroids are closest to the query.
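
The candidate selection could be sketched as follows. This is not the code in predict.cpp: dense vectors, the dot product as the similarity, and the name top_k_classes are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Dot product of two dense vectors of equal length. For L2-normalized
// vectors this equals their cosine similarity.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Hypothetical helper: returns the indices of the k centroids most similar
// to the query, ordered from most to least similar.
std::vector<std::size_t> top_k_classes(
    const std::vector<std::vector<double>>& centroids,
    const std::vector<double>& query,
    std::size_t k) {
    typedef std::pair<double, std::size_t> Scored;  // (similarity, class index)
    std::vector<Scored> scored;
    scored.reserve(centroids.size());
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        scored.push_back(Scored(dot(centroids[c], query), c));
    }
    k = std::min(k, scored.size());
    // Sort only the first k entries, by descending similarity.
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const Scored& l, const Scored& r) { return l.first > r.first; });
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < k; ++i) result.push_back(scored[i].second);
    return result;
}
```

std::partial_sort avoids fully sorting every class when only k candidates are needed.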

predict2

Selects the classes whose binary classifier returns p > 0.5. (The binary classifier is implemented as logistic regression.)
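
A minimal sketch of the acceptance step, assuming one logistic regression model per class with dense weights. The LogisticModel struct and the function names are hypothetical and are not taken from predict.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-class model: one weight per feature plus a bias term.
struct LogisticModel {
    std::vector<double> weights;
    double bias;
};

// Probability that x belongs to the class: sigmoid(w . x + b).
double predict_probability(const LogisticModel& model, const std::vector<double>& x) {
    assert(model.weights.size() == x.size());
    double z = model.bias;
    for (std::size_t i = 0; i < x.size(); ++i) z += model.weights[i] * x[i];
    return 1.0 / (1.0 + std::exp(-z));
}

// Keeps only the candidate classes whose classifier returns p > 0.5.
// models is indexed by class id.
std::vector<std::size_t> accept_classes(
    const std::vector<std::size_t>& candidates,
    const std::vector<LogisticModel>& models,
    const std::vector<double>& x) {
    std::vector<std::size_t> accepted;
    for (std::size_t cls : candidates) {
        assert(cls < models.size());
        if (predict_probability(models[cls], x) > 0.5) accepted.push_back(cls);
    }
    return accepted;
}
```

Since sigmoid(z) > 0.5 exactly when z > 0, the exponential could be skipped when only the accept/reject decision is needed.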

predict3

Training

For each data point:

  1. Select the k classes nearest to the data point using the nearest centroid classifier.
  2. Add the data point as training data to the dataset of each selected class.

(prefetch.cpp)
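
The dataset construction could look roughly like the sketch below. It is not the code in prefetch.cpp. It assumes the k candidate classes of each document are already computed, and it assumes a document is a positive example for a class only when the class is one of its true labels (the README does not state how examples are labeled). The Example struct and build_class_datasets are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical training example: which document it is, and whether the
// document truly belongs to the class (assumed labeling rule).
struct Example {
    std::size_t doc_id;
    bool positive;
};

// candidates[d]: the k classes whose centroids are nearest to document d.
// labels[d]:     the true classes of document d.
// Returns, for each class, the examples its binary classifier would train on.
std::map<int, std::vector<Example>> build_class_datasets(
    const std::vector<std::vector<int>>& candidates,
    const std::vector<std::set<int>>& labels) {
    assert(candidates.size() == labels.size());
    std::map<int, std::vector<Example>> datasets;
    for (std::size_t d = 0; d < candidates.size(); ++d) {
        for (int cls : candidates[d]) {
            const bool positive = labels[d].count(cls) > 0;
            datasets[cls].push_back(Example{d, positive});
        }
    }
    return datasets;
}
```

With this scheme each class only sees documents that fall near its centroid, so every per-class dataset stays small compared with the full training set.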

For each class:

  1. Train the binary classifier on the class's own dataset.

(train.cpp)
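
As an illustration of training one binary classifier, here is a logistic regression fitted with plain stochastic gradient descent. The optimizer, the learning rate, the epoch count and all names are assumptions. The README only says that the classifier is logistic regression, so train.cpp may work differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical training example: dense features and a 0/1 label.
struct LabeledVector {
    std::vector<double> x;
    int y;
};

// Hypothetical per-class model: one weight per feature plus a bias term.
struct LogisticModel {
    std::vector<double> weights;
    double bias;
};

// sigmoid(w . x + b)
double probability(const LogisticModel& m, const std::vector<double>& x) {
    assert(m.weights.size() == x.size());
    double z = m.bias;
    for (std::size_t i = 0; i < x.size(); ++i) z += m.weights[i] * x[i];
    return 1.0 / (1.0 + std::exp(-z));
}

// Fits the model by SGD on the log-loss, starting from zero weights.
LogisticModel train_logistic(const std::vector<LabeledVector>& data,
                             std::size_t dim, double learning_rate, int epochs) {
    LogisticModel model{std::vector<double>(dim, 0.0), 0.0};
    for (int e = 0; e < epochs; ++e) {
        for (const LabeledVector& ex : data) {
            // Gradient of the log-loss with respect to z is (p - y).
            const double error = ex.y - probability(model, ex.x);
            for (std::size_t i = 0; i < dim; ++i) {
                model.weights[i] += learning_rate * error * ex.x[i];
            }
            model.bias += learning_rate * error;
        }
    }
    return model;
}
```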

train1
train2

What are the features

A variant of TF-IDF is used.

tf = log(number_of_term_occurs_in_document + 1)
idf = log(total_number_of_documents / (number_of_documents_containing_term + 1)) + 5
tfidf = tf * idf

The feature vector is then normalized by its L2 norm.
(code: tfidf_transformer.hpp)
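
The formulas above can be sketched as follows. This is not the code in tfidf_transformer.hpp: the sparse map representation, the function names and the use of the natural logarithm are assumptions (the README does not state the base of the log).

```cpp
#include <cassert>
#include <cmath>
#include <map>

// Sparse vector: term id -> weight.
typedef std::map<int, double> SparseVector;

// tf * idf for one term, following the formulas above (natural log assumed).
double tfidf_weight(int term_count, int doc_freq, int total_docs) {
    const double tf = std::log(term_count + 1.0);
    const double idf = std::log(static_cast<double>(total_docs) / (doc_freq + 1.0)) + 5.0;
    return tf * idf;
}

// term_counts: term id -> occurrences in this document.
// doc_freqs:   term id -> number of documents containing the term.
// Returns the L2-normalized TF-IDF vector of the document.
SparseVector tfidf_vector(const std::map<int, int>& term_counts,
                          const std::map<int, int>& doc_freqs,
                          int total_docs) {
    assert(total_docs > 0);
    SparseVector v;
    double squared_norm = 0.0;
    for (const auto& entry : term_counts) {
        const auto it = doc_freqs.find(entry.first);
        const int df = (it == doc_freqs.end()) ? 0 : it->second;  // unseen term
        const double w = tfidf_weight(entry.second, df, total_docs);
        v[entry.first] = w;
        squared_norm += w * w;
    }
    const double norm = std::sqrt(squared_norm);
    if (norm > 0.0) {
        for (auto& entry : v) entry.second /= norm;
    }
    return v;
}
```

The + 1 in the idf denominator keeps the division defined for terms with a document frequency of zero.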

What is the metric for the centroid classifier

Cosine similarity is used.
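
For reference, cosine similarity between two sparse vectors can be computed as in the sketch below. The map-based SparseVector type and the function names are assumptions for illustration, not the repository's own data structures.

```cpp
#include <cassert>
#include <cmath>
#include <map>

// Sparse vector: term id -> weight.
typedef std::map<int, double> SparseVector;

// Dot product: iterate over the shorter vector and look up in the longer one.
double dot(const SparseVector& a, const SparseVector& b) {
    const SparseVector& shorter = a.size() <= b.size() ? a : b;
    const SparseVector& longer = a.size() <= b.size() ? b : a;
    double sum = 0.0;
    for (const auto& entry : shorter) {
        const auto it = longer.find(entry.first);
        if (it != longer.end()) sum += entry.second * it->second;
    }
    return sum;
}

// cos(a, b) = (a . b) / (|a| |b|); defined here as 0 when either vector is zero.
double cosine_similarity(const SparseVector& a, const SparseVector& b) {
    const double norm_a = std::sqrt(dot(a, a));
    const double norm_b = std::sqrt(dot(b, b));
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    const double result = dot(a, b) / (norm_a * norm_b);
    // By Cauchy-Schwarz the value lies in [-1, 1] up to rounding error.
    assert(std::fabs(result) <= 1.0 + 1e-9);
    return result;
}
```

When both vectors are already L2-normalized, the two norms are 1 and the cosine similarity reduces to the dot product.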

  • Ubuntu 13.10
  • g++ 4.8.1
  • make
  • 32GB RAM

Please edit SETTINGS.h first.

make
./prefetch
./train
./predict

NOTE: ./prefetch is very slow; processing time will probably exceed 15 hours.

Running the Validation Test

./vt_prefech
./vt_train
./validation

Simple k-NN baseline

running the validation test.

generating the sumission.txt.

Simple Nearest Centroid Classifier

running the validation test.

generating the sumission.txt.

Figure

Model    LBMaF     Training Time   Prediction Time
k-NN     0.23088   n/a             10 minutes
NCC      0.28931   80 seconds      2 hours
NCC+BC   0.33025   15 hours        2 hours
