Kaggle Jane Street competition

Kaggle hosts many competitions sponsored by hedge funds. This may be a new form of rat race, or perhaps the sponsors really do want to pick up some ideas from Kagglers.
This time we look at the recently concluded competition sponsored by Jane Street.

This competition is a classification problem: for every row of data (a trading opportunity), we must decide whether to act (action). If the trade is executed, its gain is return * weight, and these gains are summed for each day to give a daily total.

In investing, many people pursue a high Sharpe ratio, which accounts for both return and risk. The competition computes one from the daily totals.

The final evaluation metric is built from these two quantities.
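To make the metric concrete, here is a minimal sketch of how it could be computed. It follows the definitions as I understand them from the competition's evaluation page: the daily total is p_i = sum_j(weight_ij * resp_ij * action_ij), the annualized Sharpe-like term is t = (sum_i p_i / sqrt(sum_i p_i^2)) * sqrt(250 / |i|) with |i| the number of distinct dates, and the score is u = min(max(t, 0), 6) * sum_i p_i. The function name and argument names are my own, not part of any official API.

```python
import numpy as np
import pandas as pd


def utility_score(date, weight, resp, action):
    """Sketch of the competition utility; names here are illustrative."""
    pnl = np.asarray(weight) * np.asarray(resp) * np.asarray(action)
    # p_i: total weighted return of the executed trades on each date
    p = pd.DataFrame({"date": date, "pnl": pnl}).groupby("date")["pnl"].sum().to_numpy()
    denom = np.sqrt(np.sum(p ** 2))
    if denom == 0:
        return 0.0  # no trade taken (or every daily total is zero)
    # t: annualized Sharpe-like ratio over the |i| distinct dates
    t = p.sum() / denom * np.sqrt(250 / len(p))
    # u: t is clipped to [0, 6] before scaling the total return
    return float(min(max(t, 0.0), 6.0) * p.sum())


# Tiny made-up example: two dates, four trading opportunities
score = utility_score(
    date=[0, 0, 1, 1],
    weight=[1.0, 2.0, 1.0, 1.0],
    resp=[0.1, -0.05, 0.2, 0.1],
    action=[1, 0, 1, 1],
)
```

Because t is clipped at zero, a strategy whose daily totals are negative overall scores 0 rather than a negative value.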

As usual, let's take a look at a high-scoring open-source solution.

The first part imports a bunch of packages as usual. One thing worth mentioning: because the dataset is large, the code uses datatable to read the data. datatable is said to be an efficient multithreaded data-processing tool that far outperforms pandas.

import datatable as dtable
train = dtable.fread('/kaggle/input/jane-street-market-prediction/train.csv').to_pandas()

The features provided this time are again anonymized, so their meaning is unknown. The code manually divides them into 4 types according to each feature's distribution: Linear, Noisy, Negative and Hybrid.
The final feature set consists of all the original features plus, as constructed features, the mean of the features within each type.
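A minimal sketch of that feature construction is shown here. The post does not list which feature belongs to which type, so the column names and group assignments in the demo are purely hypothetical, and `add_group_means` is a helper name of my own.

```python
import pandas as pd


def add_group_means(df, feature_groups):
    """Append one '<group>_mean' column per feature group (illustrative helper)."""
    out = df.copy()
    for name, cols in feature_groups.items():
        # Row-wise mean over the group's columns; pandas skips NaN by default
        out[f"{name}_mean"] = out[cols].mean(axis=1)
    return out


# Hypothetical data and grouping, for illustration only
demo = pd.DataFrame({
    "feature_1": [1.0, 2.0],
    "feature_2": [3.0, 4.0],
    "feature_3": [0.0, 10.0],
})
groups = {"linear": ["feature_1", "feature_2"], "noisy": ["feature_3"]}
features = add_group_means(demo, groups)
```

The original columns are kept, so the model sees both the raw features and the per-type averages.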


The training set does not directly give us a label, i.e. whether or not to act. What it provides is 5 returns over different time horizons (resp, resp_1, ..., resp_4). The code constructs the label by majority vote: if at least 3 of the 5 returns are positive, the trade is executed (action=1).

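import numpy as np

resp_cols = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']
# One 0/1 column per return horizon: 1 where that return is positive
y = np.stack([(train[c] > 0).astype('int') for c in resp_cols]).T

# Act when the majority (at least 3 of 5) of the returns are positive
train['action'] = (y.mean(axis=1) > 0.5).astype('int')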

For the model, the solution uses XGBoost and tunes its hyperparameters with HyperOpt.
For cross-validation, it uses PurgedGroupTimeSeriesSplit, a scheme suited to this kind of time-series problem. The figure in the original post shows the idea intuitively: the validation set always comes after the training set, separated from it by a short gap.
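The splitting idea can be sketched in a few lines. This is a simplified stand-in written for illustration, not the PurgedGroupTimeSeriesSplit class used in the solution. It assumes the group labels (for example, dates) sort in time order, and the function and parameter names are my own.

```python
import numpy as np


def purged_group_time_series_split(groups, n_splits=5, group_gap=1):
    """Yield (train_idx, test_idx) pairs with a gap of whole groups between them."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)  # sorted; assumed to be chronological
    n_groups = len(unique_groups)
    test_size = n_groups // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough groups for the requested number of splits")
    for i in range(n_splits):
        test_start = n_groups - (n_splits - i) * test_size
        train_end = test_start - group_gap  # drop the groups just before the test fold
        if train_end <= 0:
            raise ValueError("group_gap leaves no training groups")
        train_groups = unique_groups[:train_end]
        test_groups = unique_groups[test_start:test_start + test_size]
        yield (np.flatnonzero(np.isin(groups, train_groups)),
               np.flatnonzero(np.isin(groups, test_groups)))


# 10 hypothetical trading days with 2 rows each
dates = np.repeat(np.arange(10), 2)
splits = list(purged_group_time_series_split(dates, n_splits=3, group_gap=1))
```

Splitting by whole dates keeps rows from the same day on one side of the split, and the gap reduces leakage from returns whose horizon extends past the end of the training window.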

As you can see, the open-source solution is not complicated, and there is still plenty of room for improvement, such as analyzing the meaning of the features, using neural network models, or directly optimizing the evaluation metric defined by the competition. However, in the financial world, where the signal-to-noise ratio is low, whether these methods actually help remains an open question.
