Kaggle hosts many competitions sponsored by hedge funds. Perhaps this has become a new kind of rat race ("involution"), or perhaps the sponsors really do want to pick up some ideas from Kagglers.
This time we study the recently concluded competition sponsored by Jane Street.
This competition is a classification problem. For each row of data (a trading opportunity), we must decide whether to act (action). If the trade is executed, its profit is return * weight, and these profits are summed for each day:
In investing, many people pursue a high Sharpe ratio (which accounts for both return and risk). It is calculated as follows:
Our final evaluation metric is:
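The three quantities described here can be sketched in code. The sketch assumes the definitions from the competition's evaluation page as we recall them: the daily profit is p_i = sum over trades j of weight_ij * resp_ij * action_ij, the Sharpe-like ratio is t = (sum p_i / sqrt(sum p_i^2)) * sqrt(250 / |i|) with |i| the number of distinct days, and the final utility is u = min(max(t, 0), 6) * sum p_i. The function name utility_score and the toy numbers are ours, for illustration only.

```python
import numpy as np

def utility_score(date, weight, resp, action):
    """Sketch of the utility: daily profit, annualized Sharpe-like t, t clipped to [0, 6]."""
    date = np.asarray(date)
    weight = np.asarray(weight, dtype=float)
    resp = np.asarray(resp, dtype=float)
    action = np.asarray(action, dtype=float)

    # p_i: sum of weight * resp * action over all trades on day i
    _, day_index = np.unique(date, return_inverse=True)
    p = np.bincount(day_index, weights=weight * resp * action)

    # t: total profit over its root-sum-of-squares, annualized with 250 trading days
    denom = np.sqrt(np.sum(p ** 2))
    if denom == 0:
        return 0.0  # no trade taken on any day, so the utility is zero
    t = p.sum() / denom * np.sqrt(250 / len(p))

    # u: clip t to [0, 6], then scale by the total profit
    return float(min(max(t, 0.0), 6.0) * p.sum())

# Toy example (made-up numbers): two days, two trading opportunities each
u = utility_score(
    date=[0, 0, 1, 1],
    weight=[1.0, 2.0, 1.0, 1.0],
    resp=[0.01, -0.02, 0.03, 0.01],
    action=[1, 0, 1, 1],
)
```

Note that a strategy with a negative t scores zero, and a very high t is capped at 6, so the metric rewards total profit only when it is earned consistently across days.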
Next, let's take a look at a high-scoring open-source solution.
3.1 Import Packages
The first part imports a bunch of packages as usual. One thing worth mentioning: because the dataset is large this time, the code reads it with datatable, which is said to be an efficient multithreaded data-processing tool that outperforms pandas.
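import datatable as dtable
import numpy as np  # needed later when building the label
# fread parses the large CSV with multiple threads; convert to pandas afterwards
train = dtable.fread('/kaggle/input/jane-street-market-prediction/train.csv').to_pandas()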
3.2 Feature Engineering
The features provided this time are again anonymized, so their meaning is unknown. The code manually groups them into four types according to the distribution of each feature: Linear, Noisy, Negative and Hybrid.
The final feature set consists of all the original features, plus the mean of each feature type as additional constructed features.
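A minimal sketch of this group-mean construction follows. The grouping in feature_groups is hypothetical (which feature falls into which type has to come from inspecting the distributions), the data is random toy data, and add_group_means is our own helper name rather than anything from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the anonymized feature columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(5, 6)),
    columns=[f"feature_{i}" for i in range(1, 7)],
)

# Hypothetical assignment of features to the four types
feature_groups = {
    "linear": ["feature_1", "feature_2"],
    "noisy": ["feature_3"],
    "negative": ["feature_4"],
    "hybrid": ["feature_5", "feature_6"],
}

def add_group_means(df, groups):
    """Return a copy of df with one extra column per group: the row-wise mean of that group."""
    out = df.copy()
    for name, cols in groups.items():
        out[f"{name}_mean"] = df[cols].mean(axis=1)
    return out

# All original columns are kept, and four mean columns are appended
features = add_group_means(toy, feature_groups)
```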
3.3 Determining the Label
The training set does not give us the label directly, i.e. whether or not to act. What it provides is the return (resp) over 5 different time windows. The code constructs the label by majority vote: if at least 3 of the 5 returns are positive, the trade is executed (action=1).
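resp_cols = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']
# 1 where a return is positive, 0 otherwise; one column per time window
y = np.stack([(train[c] > 0).astype('int') for c in resp_cols]).T
# act when more than half of the 5 returns are positive
train['action'] = (y.mean(axis=1) > 0.5).astype('int')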
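To see what the 0.5 threshold does, here is the same rule applied to three made-up rows. The numbers are invented purely for illustration; only the labeling logic is taken from the code above.

```python
import numpy as np
import pandas as pd

resp_cols = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']

# Made-up rows with 3, 2 and 5 positive returns respectively
toy = pd.DataFrame(
    [[0.01, 0.02, -0.01, 0.03, -0.02],
     [0.01, -0.02, -0.01, 0.03, -0.02],
     [0.01, 0.02, 0.01, 0.03, 0.02]],
    columns=resp_cols,
)

y = np.stack([(toy[c] > 0).astype('int') for c in resp_cols]).T
toy['action'] = (y.mean(axis=1) > 0.5).astype('int')
print(toy['action'].tolist())  # [1, 0, 1]
```

With 5 windows, the mean of the positive indicators exceeds 0.5 exactly when 3 or more returns are positive, so the second row (2 positives, mean 0.4) is labeled 0.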
3.4 Model Training
For the model, the solution uses an XGBoost model, with its hyperparameters tuned by HyperOpt.
For cross-validation, it uses PurgedGroupTimeSeriesSplit, a scheme suited to time-series problems like this one. As the figure below illustrates, the validation set always comes after the training set, separated from it by a short gap.
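The idea can be sketched in a few lines. This is a simplified illustration, not the PurgedGroupTimeSeriesSplit class used in the notebook: the function name and the parameters n_splits and group_gap are our own assumptions, and each group is taken to be one trading day.

```python
import numpy as np

def purged_group_time_series_split(groups, n_splits=3, group_gap=1):
    """Yield (train_idx, valid_idx); validation days follow the training days after a gap."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)  # sorted distinct days
    n_groups = len(unique_groups)
    valid_size = n_groups // (n_splits + 1)
    if valid_size == 0:
        raise ValueError("not enough groups for the requested number of splits")

    for k in range(n_splits):
        # Validation blocks are consecutive and end at the last day
        valid_start = n_groups - (n_splits - k) * valid_size
        valid_groups = unique_groups[valid_start:valid_start + valid_size]
        # Training stops group_gap days before validation starts (the purge)
        train_groups = unique_groups[:max(valid_start - group_gap, 0)]
        train_idx = np.flatnonzero(np.isin(groups, train_groups))
        valid_idx = np.flatnonzero(np.isin(groups, valid_groups))
        yield train_idx, valid_idx

# Toy example: 8 days with 2 rows each
dates = np.repeat(np.arange(8), 2)
splits = list(purged_group_time_series_split(dates, n_splits=3, group_gap=1))
for train_idx, valid_idx in splits:
    print(np.unique(dates[train_idx]), np.unique(dates[valid_idx]))
```

Splitting by whole days keeps rows from the same day on one side of the split, and the gap reduces leakage from returns whose time windows overlap the start of the validation period.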
As you can see, this open-source solution is not complicated, and there is still plenty of room for improvement, such as analyzing the meaning of the features, using neural network models, or directly optimizing the evaluation metric defined by the competition. However, in the financial world, where the signal-to-noise ratio is low, whether these methods actually help remains an open question.