The world of machine learning is not new, and there are numerous machine learning model types out there. Different models work in different ways. There are a few heuristics which can help you select the right model type for your case. However, a typical approach is beautiful and simple: for every task at hand, try all the models which seem reasonable (see below) and see what works best.
Currently, TrendSpider supports the following types of machine learning models:
- Naive Bayes (complexity 1 on a scale of 1 to 4)
- Logistic Regression (complexity 2 of 4)
- K-Nearest Neighbors (complexity 3 of 4)
- Random Forest (complexity 4 of 4)
These models differ in complexity. Random Forest is the most complex and powerful of them; Naive Bayes is the simplest. The word “powerful” is the key here, and it’s a double-edged sword. The more powerful the model is, the more complex patterns it can capture. For example, linear models can only capture the simplest forms of relationships, while complex models like RF can easily do things like “if A and B, or C and not D, or if E and F and G”. That’s the good edge of the sword.
However, no model can tell which patterns are “true” and which are just noise. Powerful models are far more prone to learning random noise when applied to inputs which carry no easily detectable “true” pattern. That’s the edge of the sword you did not want, but it’s here.
The less powerful the model is, the more forgiving it is when it comes to noisy data. Give a simple model noisy data and it will tell you straight that it can’t identify any clear relation. A complex model will try hard and find patterns anyway. Even if no truly persistent patterns exist, the model will find something. It’s an inaccurate analogy, but you can think of it as an LLM hallucinating (and LLMs are damn complex models).
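To make this concrete, here is a minimal sketch of that behavior. It uses scikit-learn (an assumption for illustration; TrendSpider’s actual engine isn’t exposed as code) and purely random data, so any “pattern” a model finds here is noise by definition:

```python
# A powerful model vs. a simple one on pure noise (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))     # five random "indicators" with no real signal
y = rng.integers(0, 2, size=500)  # labels are literally a coin flip

X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))

# Typical outcome: the Random Forest scores near 1.0 on the data it has seen
# (it memorized the noise) but around 0.5 on unseen data, while logistic
# regression stays near 0.5 on both -- it tells you straight that there is
# no pattern to find.
```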
The critical thing to understand is that models deal with numbers. They have no feelings and they don’t really “know” what you’re doing here. “True persistent pattern” is a concept invented by humans, and the machine learning model per se can’t operate with big abstract ideas like that. Models can do numbers, that’s it.
General advice when picking a model is as follows.
- If you know what you’re doing, then devote significant effort to feature engineering and go with the simplest models possible.
- If you don’t really know what you’re doing, then use medium-to-complex models and prepare to do a lot of grinding (train-backtest-discard) to compensate for the lack of knowledge.
3.1 Naive Bayes (complexity: 1 of 4)
This model bases its predictions on signals from your inputs, assuming that the inputs are not correlated with each other. It can be a good idea to use this model if you use a set of indicators which are very different in nature (e.g., one momentum indicator, one volume indicator, one derivative like “speed of indicator change”). Use this model if you believe that your signal can be captured by conditions like “if at least 3 of 5 inputs (independently) signal Entry”.
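For illustration, here is a minimal sketch of the idea using scikit-learn’s GaussianNB. This is an assumption made for the example (TrendSpider doesn’t expose its internals), and the indicator values below are made up:

```python
# Minimal Naive Bayes sketch; the indicator columns are hypothetical
# placeholders for inputs that are "very different in nature".
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row is one candle: [momentum indicator, volume indicator, rate of change]
X_train = np.array([
    [0.8, 1.2,  0.05],
    [0.1, 0.9, -0.02],
    [0.7, 1.5,  0.04],
    [0.2, 0.8, -0.03],
])
y_train = np.array([1, 0, 1, 0])  # 1 = "signal candle", 0 = "no signal"

model = GaussianNB().fit(X_train, y_train)

new_candle = np.array([[0.75, 1.3, 0.03]])
print(model.predict(new_candle))        # e.g. [1]
print(model.predict_proba(new_candle))  # per-class confidence
```

Note how each input column contributes to the prediction on its own, which is exactly why this model pairs well with independent indicators.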
3.2 Logistic Regression (complexity: 2 of 4)
This model bases its predictions on signals from your inputs, assuming that the inputs can have correlations and dependencies with each other, but that these dependencies are rather simple. Use this model if you believe that your signal can be captured by rather simple (linear) relations between your inputs.
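A minimal sketch of the same idea, again with scikit-learn and made-up numbers: the model learns a weighted sum of the inputs, which is the “simple (linear) relation” described above.

```python
# Minimal logistic regression sketch with hypothetical indicator values.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one candle: [RSI, MFI] (values are invented for the example)
X_train = np.array([[30.0, 25.0], [70.0, 80.0], [25.0, 30.0], [75.0, 70.0]])
y_train = np.array([1, 0, 1, 0])  # 1 = "signal", 0 = "no signal"

model = LogisticRegression().fit(X_train, y_train)

# The learned signal is one weight per input plus an intercept, pushed
# through a sigmoid -- a simple linear relation between the inputs.
print(model.coef_, model.intercept_)
print(model.predict_proba([[35.0, 28.0]]))  # confidence for a new candle
```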
3.3 K-Nearest Neighbors (complexity: 3 of 4)
KNN is easily explained with an example. Assume we have a bunch of candles (the training data set) and for each of them we know whether we want it to be a signal or not. Assume we mark “signal candles” green and “no signal candles” red.
Assume we only have 2 inputs for the model: RSI and MFI. If so, then we can paint our entire training data set on a chart, where Y = RSI and X = MFI. Every candle will look like a dot on this chart, with X = “MFI value at this candle” and Y = “RSI at this candle”. Here’s a chart like that.
Any time the KNN model needs to make a prediction for a new candle (see the black dot), it checks the K dots which are the nearest neighbors to the new candle on this chart. In the picture above, violet lines illustrate which neighbors will be selected for K=5.
It then checks what portion of these were marked as “signal”. This “portion” value will be the resulting confidence for the signal on the new candle. In the picture above, it will be ⅘ = 80%.
KNN is harder to grasp when it comes to more than 3 inputs (e.g., for 7 inputs we’re talking about a 7-dimensional space), but the general idea remains.
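Here is how the two-input RSI/MFI example above might look in code, a sketch using scikit-learn’s KNeighborsClassifier with hypothetical values and K = 5:

```python
# KNN sketch mirroring the RSI/MFI picture; all values are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training candles as dots on the chart: each row is [MFI, RSI]
X_train = np.array([
    [20, 25], [22, 30], [25, 28], [30, 35], [28, 27],  # "signal" (green)
    [70, 75], [72, 68], [75, 80], [68, 72], [80, 78],  # "no signal" (red)
])
y_train = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# For a new candle (the "black dot"), the model finds the 5 nearest dots
# and returns the portion of them marked "signal" as the confidence.
new_candle = np.array([[26, 29]])
print(model.predict_proba(new_candle))  # here all 5 neighbors are "signal",
                                        # so the confidence is 5/5 = 100%
```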
3.4 Random Forest (complexity: 4 of 4)
Random Forest is an ensemble-based model, meaning that every RF model consists of a number of smaller models. These smaller models are named Decision Trees. Hence the "Forest". Beautiful, right?
During training, the RF model creates multiple decision trees by randomly picking different parts of the learning data set and building an "if-then" tree for each part. The tree derives proper signals from its given subset of data. So each tree looks at its own portion of data and learns how to make decisions based on the patterns it finds.
When it's time to generate signals, all the decision trees analyze the data at hand and vote. The most common suggestion is taken as the final answer, and the share of trees that voted "there's a signal" defines the overall model's confidence.
Parameters of RF models include things like "the number of trees it can grow" and "the max complexity of every tree", and then there are more in-depth parameters which you should figure out yourself.
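For illustration, here's a sketch with scikit-learn's RandomForestClassifier (an assumption, not TrendSpider's actual configuration), where the two parameters just named map to n_estimators and max_depth, and the data is invented:

```python
# Random Forest sketch on hypothetical indicator data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))  # four hypothetical indicator inputs
# A toy "if A and B" rule as the ground truth the trees must discover:
y_train = ((X_train[:, 0] > 0) & (X_train[:, 1] > 0)).astype(int)

model = RandomForestClassifier(
    n_estimators=100,  # the number of trees it can grow
    max_depth=3,       # the max complexity of every tree
    random_state=0,
).fit(X_train, y_train)

# predict_proba plays the role of the vote described above: scikit-learn
# averages the per-tree probabilities, which behaves like the share of
# trees saying "there's a signal".
new_candle = rng.normal(size=(1, 4))
print(model.predict_proba(new_candle))
```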