# Kaggle Jane Street competition

## 1 Introduction

Kaggle hosts many competitions sponsored by hedge funds. This may be a new form of rat race, or perhaps the sponsors genuinely want to pick up some ideas from Kagglers.
This time we look at the recently concluded competition sponsored by Jane Street.

## 2 Introduction to the competition

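This competition is a classification problem: for each row of data (a trading opportunity), we must decide whether to act (`action`). If the trade is executed, its profit is `return * weight`, where the return is called `resp` in the data. Summing over all trades $j$ on day $i$ gives the daily profit:

$$p_i = \sum_j \mathrm{weight}_{ij} \cdot \mathrm{resp}_{ij} \cdot \mathrm{action}_{ij}$$

In investing, many people aim for a high Sharpe ratio, which accounts for both return and risk. The competition uses an annualized variant, where $|i|$ is the number of distinct days:

$$t = \frac{\sum_i p_i}{\sqrt{\sum_i p_i^2}} \cdot \sqrt{\frac{250}{|i|}}$$

The final evaluation metric is the utility score:

$$u = \min(\max(t, 0), 6) \cdot \sum_i p_i$$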
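As a rough sketch, the metric can be computed offline as shown below. It assumes the definition from the competition's evaluation page: daily profit is the sum of `weight * resp * action`, the annualized Sharpe-like ratio `t` is clipped to the range 0 to 6, and the result is multiplied by total profit. The function name `utility_score` and its arguments are illustrative, not part of any official API.

```python
import numpy as np

def utility_score(date, weight, resp, action):
    """Sketch of the competition utility from per-trade arrays (names are illustrative)."""
    date = np.asarray(date)
    # Profit of each trade; zero whenever action == 0
    pnl = np.asarray(weight, dtype=float) * np.asarray(resp, dtype=float) * np.asarray(action)
    # p_i: total profit of each distinct day
    days, day_index = np.unique(date, return_inverse=True)
    p = np.bincount(day_index, weights=pnl, minlength=len(days))
    sum_p = p.sum()
    sum_p2 = np.sum(p ** 2)
    if sum_p2 == 0:
        return 0.0  # no trades taken, or every day netted to zero
    # Annualized Sharpe-like ratio, assuming 250 trading days per year
    t = sum_p / np.sqrt(sum_p2) * np.sqrt(250 / len(days))
    # Clip t to [0, 6], then scale by total profit
    return float(min(max(t, 0.0), 6.0) * sum_p)

# Toy data: two days, four trading opportunities (values are made up)
u = utility_score(
    date=[0, 0, 1, 1],
    weight=[1.0, 2.0, 1.0, 1.0],
    resp=[0.1, -0.05, 0.2, 0.1],
    action=[1, 1, 1, 0],
)
```

Because `t` is clipped at zero, a strategy with negative total profit scores 0 rather than a negative number.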

## 3 Specific code

As usual, let's take a look at a high-scoring open-source notebook.

### 3.1 Import packages

The first part imports the usual set of packages. One thing worth mentioning: because the dataset is large this time, the code reads it with datatable, an efficient multithreaded data-processing library that is said to far outperform pandas.

```python
import datatable as dtable

train = dtable.fread('/kaggle/input/jane-street-market-prediction/train.csv').to_pandas()
```

### 3.2 Feature Engineering

The features provided this time are again anonymous, so their meaning is unknown. The notebook manually sorts them into four groups according to each feature's distribution: Linear, Noisy, Negative, and Hybrid.
The final feature set consists of all the original features plus one constructed feature per group, namely the mean of that group's features.

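```python
# f_Linear, f_Noisy, f_Negative and f_Hybrid are the lists of column names in each group.
# The row-wise mean of each group becomes one new feature.
train['f_Linear'] = train[f_Linear].mean(axis=1)
train['f_Noisy'] = train[f_Noisy].mean(axis=1)
train['f_Negative'] = train[f_Negative].mean(axis=1)
train['f_Hybrid'] = train[f_Hybrid].mean(axis=1)
```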
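To make the group-mean construction concrete, here is a small self-contained sketch on a toy frame. The column names, values, and the assignment of columns to groups are all made up for illustration; the actual notebook's groupings are not reproduced here.

```python
import pandas as pd

# Toy frame with four anonymous features (values are made up)
toy = pd.DataFrame({
    "feature_1": [1.0, 2.0],
    "feature_2": [3.0, 4.0],
    "feature_3": [0.5, -0.5],
    "feature_4": [1.5, 0.5],
})

# Hypothetical grouping of columns; only two of the four groups are shown
groups = {
    "f_Linear": ["feature_1", "feature_2"],
    "f_Noisy": ["feature_3", "feature_4"],
}

# One new column per group: the row-wise mean of that group's features
for name, cols in groups.items():
    toy[name] = toy[cols].mean(axis=1)

print(toy[["f_Linear", "f_Noisy"]])
```

Note that pandas' `mean` skips missing values by default, so a row with some missing features in a group still gets a group mean.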

### 3.3 Determine the label

The training set does not give us the label directly, i.e. whether or not to act. What it provides instead are returns (`resp`) over 5 different time windows. The code builds the label by majority vote: if at least 3 of the 5 returns are positive, execute the trade (`action=1`).

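```python
import numpy as np

resp_cols = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']
# One 0/1 column per return window: 1 where the return is positive
y = np.stack([(train[c] > 0).astype('int') for c in resp_cols]).T
# Act when more than half of the five windows are positive (i.e. at least 3)
train['action'] = (y.mean(axis=1) > 0.5).astype('int')
```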
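A quick check on made-up numbers shows how the threshold behaves. This sketch uses an equivalent pandas formulation rather than the notebook's `np.stack` version; the return values are invented for illustration.

```python
import pandas as pd

resp_cols = ["resp", "resp_1", "resp_2", "resp_3", "resp_4"]

# Three toy trading opportunities (returns are made up)
toy = pd.DataFrame(
    [
        [0.01, 0.02, 0.03, -0.01, -0.02],   # 3 of 5 positive
        [0.01, 0.02, -0.03, -0.01, -0.02],  # 2 of 5 positive
        [0.01, 0.02, 0.03, 0.01, 0.02],     # 5 of 5 positive
    ],
    columns=resp_cols,
)

# Share of positive returns per row, then a majority threshold
positive_share = (toy[resp_cols] > 0).mean(axis=1)
toy["action"] = (positive_share > 0.5).astype(int)

print(toy["action"].tolist())  # [1, 0, 1]
```

With five windows the share can only be a multiple of 0.2, so `> 0.5` is the same as requiring at least three positive returns.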

### 3.4 Model training

For the model, the solution uses XGBoost, with its hyperparameters tuned by HyperOpt.
For cross-validation, it uses PurgedGroupTimeSeriesSplit, a scheme suited to time-series problems like this one. Its behavior is easiest to see in a diagram of the folds: in every fold the validation set comes after the training set, separated by a short gap.
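The splitting idea can be sketched in a few lines of NumPy. This is a simplified illustration of a purged, group-aware time-series split, not the PurgedGroupTimeSeriesSplit class used in the notebook; the function name `purged_group_splits` and its parameters are assumptions made for this example. Groups stand for trading days.

```python
import numpy as np

def purged_group_splits(groups, n_splits=3, gap=1):
    """Yield (train_idx, test_idx) with training groups strictly before the test
    groups and the last `gap` groups before each test block left out."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)  # sorted, so order follows time
    if len(unique_groups) < n_splits + 1:
        raise ValueError("need at least n_splits + 1 distinct groups")
    # Cut the ordered groups into n_splits + 1 blocks; block 0 is only ever trained on
    blocks = np.array_split(unique_groups, n_splits + 1)
    for k in range(1, n_splits + 1):
        test_groups = blocks[k]
        # Everything strictly before the test block is a training candidate
        train_groups = unique_groups[unique_groups < test_groups[0]]
        if gap > 0:
            train_groups = train_groups[:-gap]  # purge the groups nearest the test block
        train_idx = np.flatnonzero(np.isin(groups, train_groups))
        test_idx = np.flatnonzero(np.isin(groups, test_groups))
        yield train_idx, test_idx

# Toy example: 8 days with 2 rows each
dates = np.repeat(np.arange(8), 2)
for train_idx, test_idx in purged_group_splits(dates, n_splits=3, gap=1):
    print("train days", np.unique(dates[train_idx]), "test days", np.unique(dates[test_idx]))
```

The gap matters here because the longer-horizon returns of one day overlap with the following days, so training rows right next to the validation period could leak information into it.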

## 4 Summary

As you can see, the open-source solution is not complicated, and there is still plenty of room for improvement, such as analyzing what the features mean, using neural network models, or optimizing the competition's evaluation metric directly. However, in a financial world with a low signal-to-noise ratio, whether these methods actually help remains an open question.