Numerai is a hedge fund that gives you an obfuscated financial dataset which you use to predict new, unseen market data; if your predictions are correct you can earn money in the form of NMR*. In essence, it is a fund that has crowdsourced most of its quant division.
I wrote a now slightly dated article a few years ago that is still relevant if you want more background:
Numerai walkthrough: Quantitative Analysis & Machine learning for fun and profit.
If you believe that action and practice are better than theory and have little to no experience in the field of machine…
And straight from the horse’s mouth 🐴 :
* It is unfortunately not as straightforward as predict right and make money: you first need to buy NMR (a cryptocurrency) with your own money and then stake it on your predictions. If your predictions are wrong you lose part of your stake; if they are correct you earn a percentage on it. You can read more about payouts here: Staking and Payouts. While we are here I should also mention that “practical” is a relative term; some people are curing disease with AI while we are predicting the stock market, so to some this might not seem practical. I mean practical in the sense that this is a current, real problem with consequences (you can make/lose money), as opposed to an imagined or academic one like predicting house prices in the 80s.
The actual problem
Historically some statistical models have fared better in the competition; XGBoost, for instance, is what Numerai itself uses as an example, and it consistently ranks in the top 50. But, as they themselves mention, having a single type of model does not help diversify their meta model, so I wanted to try Keras. This is the result, which hopefully also serves as a Keras/Numerai beginner’s tutorial/crash course. Let’s start with an overview of the datasets you are given:
The training dataset is simply a set of features and targets (or labels) you can use to train your models; the id and data_type columns are there for organizational purposes, and era is available if you want to use it (we won’t). Note also how clean the data is: it represents a decade or so of market data, but it’s nearly impossible to know what each column/row represents in real life, and the values are discrete (i.e. 1.00, 0.75, 0.50, 0.25, 0.00).
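To make that layout concrete, here is a tiny mock of what the training data looks like. The column names and values are illustrative stand-ins (not real Numerai data), assuming the dataset's feature_&lt;group&gt;&lt;n&gt; naming:

```python
import numpy as np
import pandas as pd

# Illustrative mock of the training data layout (not real Numerai data).
rng = np.random.default_rng(42)
levels = [0.0, 0.25, 0.5, 0.75, 1.0]  # the discrete values mentioned above
n_rows = 4

train = pd.DataFrame({
    "id": [f"n{i:06d}" for i in range(n_rows)],          # row identifier
    "era": ["era1"] * n_rows,                            # time grouping, unused here
    "data_type": ["train"] * n_rows,
    "feature_intelligence1": rng.choice(levels, n_rows),
    "feature_charisma1": rng.choice(levels, n_rows),
    "target": rng.choice(levels, n_rows),                # the label we will learn to predict
})
print(train)
```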
This one is a bit more complex (plus a chonker), but once you are in the know it’s not that bad. The main idea is that you are expected to predict the target column based on the feature columns; note that you need to submit predictions for the whole table (you can overwrite the validation targets). There are 3 data types:

test: Used for backtesting by Numerai, not to be confused with the common train/test split.

validation: An extra year or so of monthly and weekly data you can use to validate or train.

live: Unseen data, the main event. You are scored on these predictions over a 4-week period (i.e. your targets are 4 weeks into the future but you are scored during those 4 weeks; presumably this emulates a stock portfolio you hold for 4 weeks while getting daily P/L results). Read more here: Reputation.

For an in-depth exploration of the dataset also check out: Numerai Analysis and Tips
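Once the tournament file is loaded, picking those slices apart is a one-liner per data type. A sketch using a small stand-in dataframe:

```python
import pandas as pd

# Stand-in for the tournament dataframe; only data_type matters for this sketch.
tournament = pd.DataFrame({
    "id": ["n1", "n2", "n3", "n4"],
    "data_type": ["test", "validation", "live", "live"],
    "feature_a": [0.5, 0.25, 0.75, 0.0],
    "target": [0.5, 1.0, None, None],  # live rows have no known target
})

validation = tournament[tournament["data_type"] == "validation"]  # has targets; usable for validation/training
live = tournament[tournament["data_type"] == "live"]              # what you are actually scored on

print(len(validation), len(live))
```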
We’ll come back to the dataset with some working code, but before we can do that we need an ML pipeline we can use with Keras. This is a skinny version of mine (it mostly fits in a single file):
*Ideally you'd only need to train once and save your model, but the tournament changes so often that I've opted to retrain every week (and just upgrade my hardware so it can keep up and keep me warm). This is also necessary if you use validation data to train, since it changes weekly. Also note that the validation step can take many different forms: you can do cross-validation with the training dataset, use some or all of the validation rows found in the Tournament dataset, or simply use the loss metrics from the model as your compass and let the live data be your ultimate validation.
One last thing I feel needs to be explained is what the neural network will be doing. In short, it will be doing regression:
We start by feeding the NN our Training Dataset features, then we add heavily connected inner layers where the training happens via the loss function. When we want to predict, we feed the trained NN features from the Tournament Dataset and it spews out a prediction (i.e. a target/label). Without getting into the mathematical weeds, the inner layers of the neural network gradually modify the weights between them so that the inputs get matched to the correct outputs, or rather get closer to them, by measuring an error and trying to minimize it (by adjusting the weights in the connections between layers). If you are new to neural networks this might be a lot to digest; here are a couple of links that might help you: Making a simple neural network. Google’s Machine Learning crash course.
Some starter code
I will divide each portion of the above-mentioned pipeline into code chunks with some discussion, but don’t worry, I’ll link to the full script later.
1. Load datasets and extract the things you’ll need:
To get the datasets you can download them manually from the site or through the fully fledged API (numerapi).
The rest is pretty simple: load the Training and Tournament datasets into dataframes and separate the features from the targets/labels. If you wanted to do cross-validation (CV) or train on the validation dataset, you would split that here. You might also run into memory issues during the loading and later prediction stages; about 32GB of RAM seems to work just fine, but if you can’t swing that you can tweak the loading dtypes (see the forum) or use a hosted environment like Colab or Compute. Hey, at least you don’t need to clean the data 😐.
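My loading step boils down to something like the sketch below. The dtype trick, reading feature/target columns as float32 instead of the default float64, roughly halves memory use, which helps with the RAM issues mentioned above (the file names at the bottom are assumptions; adjust them to wherever you unzipped the data):

```python
import pandas as pd

def load_data(path):
    """Load a Numerai CSV, reading feature/target columns as float32 to save RAM."""
    columns = pd.read_csv(path, nrows=0).columns  # peek at the header only
    dtypes = {c: "float32" for c in columns
              if c.startswith("feature") or c.startswith("target")}
    df = pd.read_csv(path, dtype=dtypes)
    feature_names = [c for c in columns if c.startswith("feature")]
    return df, feature_names

# Assumed file names; adjust to your download location.
# training_data, feature_names = load_data("numerai_training_data.csv")
# tournament_data, _ = load_data("numerai_tournament_data.csv")
# X_train = training_data[feature_names].values
# y_train = training_data["target"].values
```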
2. Define a NN in Keras:
The build_model convenience function is where you add the previously described NN architecture: an input layer with the shape of our features plus 2 extra layers, the last of which serves as the output; its sigmoid activation makes sure the values are between 0 and 1. Also note that this is where you can define optimizers (i.e. the learning rate) and loss metrics.
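A minimal sketch of what such a build_model can look like; the layer size, learning rate, and choice of Adam here are my placeholder assumptions, not the article's exact values:

```python
from tensorflow import keras

def build_model(n_features, layer_size=64, lr=0.001):
    """Small fully connected regression network with a sigmoid output in [0, 1]."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),        # input layer shaped like our features
        keras.layers.Dense(layer_size, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),    # output squashed into 0..1
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse")                           # mean squared error for regression
    return model

model = build_model(n_features=10)  # use the number of feature columns in your dataframe
print(model.output_shape)
```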
3. Train the NN :
The train_model function, like build_model, is just a way to bundle Keras classes and callbacks; model.fit does the training and the rest are mostly parameters.
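A sketch of such a train_model wrapper, exercised here on random data shaped like the real features/targets; the hyperparameter defaults are placeholders:

```python
import numpy as np
from tensorflow import keras

def train_model(model, X, y, epochs=10, batch_size=32, validation_split=0.1):
    """Thin wrapper around model.fit; returns the History object for inspection."""
    return model.fit(X, y,
                     epochs=epochs,
                     batch_size=batch_size,
                     validation_split=validation_split,
                     verbose=0)

# Smoke test on random data (stand-in for the real training dataframe).
model = keras.Sequential([
    keras.layers.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(200, 10).astype("float32")
y = np.random.choice([0.0, 0.25, 0.5, 0.75, 1.0], 200).astype("float32")
history = train_model(model, X, y, epochs=3)
print(len(history.history["loss"]))  # one loss entry per epoch
```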
You can validate in a number of ways. The simplest and laziest is to monitor your training losses, predict, and then upload your predictions, which will be scored across a number of metrics by Numerai:
And on the site after uploading:
The Keras API, like other statistics packages, has a convenient predict function; all we have to do is call it on our model and feed it the Tournament Dataset features. Your submission file consists of 2 columns: the ids and the predictions (which should be between 0 and 1).
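Put together, prediction and the submission file look roughly like this, using a tiny untrained model and fake ids purely for illustration:

```python
import numpy as np
import pandas as pd
from tensorflow import keras

# Tiny untrained stand-in for your real trained model.
model = keras.Sequential([
    keras.layers.Input(shape=(10,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")

X_tournament = np.random.rand(5, 10).astype("float32")   # stand-in tournament features
ids = [f"n{i:06d}" for i in range(5)]                    # stand-in id column

preds = model.predict(X_tournament, verbose=0).ravel()   # sigmoid output: already in 0..1

submission = pd.DataFrame({"id": ids, "prediction": preds})
submission.to_csv("submission.csv", index=False)
print(submission.columns.tolist())
```

Note that Numerai has renamed the prediction column header across tournament versions, so match whatever the current round's example predictions file uses.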
As promised you can find the whole code in this repo :
Where to go from here
Out of the box the performance of this NN is probably not great. I say probably because a big component of the Numerai tournament is monitoring your live performance over the next 4 weeks, and beyond that you need to monitor your models over a longer time frame, somewhere between 3–6 months.
What to optimize for, or which parameters to tune, is also tricky. You can use the validation data, eras, or features (some, all, or a combination of them); just be mindful that you could end up overfitting, which can hurt your scores and your stake.
Here are the parameters/options from this example you can play with:

lr > Learning rate

epochs > Epochs to train

batch_size > Sample size from the training DF

layer_size > NN layer size

Additionally you can change things like your model optimizer, per-layer activation functions, loss functions, and early stopping. Final Note/Tip: You might be tempted to train your models until MSE converges smoothly to zero, but that might not give the best performance; the example with 2,500 epochs, for instance, did worse than one with fewer than half that many.
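One way to avoid that over-training trap is Keras's EarlyStopping callback, which halts training once the monitored loss stops improving. A sketch on random stand-in data (the patience value is my assumption):

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(200, 10).astype("float32")
y = np.random.choice([0.0, 0.25, 0.5, 0.75, 1.0], 200).astype("float32")

# Stop when validation loss hasn't improved for 10 epochs, keeping the best weights.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=10,
                                           restore_best_weights=True)
history = model.fit(X, y,
                    validation_split=0.2,
                    epochs=500,          # upper bound; early stopping usually cuts this short
                    callbacks=[early_stop],
                    verbose=0)
print(len(history.history["loss"]))
```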
And that’s it: a minimal, basic Keras regression example for the Numerai competition.
Thanks for reading !