Merry Christmas and Happy New Year to everyone on this fantastic board.
So I've done a thing! Or, at least I think I have.
The past few weeks I've been trying to dig into machine learning and advanced prediction algorithms. Since Trump is taking office in January, I figured I would start by taking stock history for all the stocks on the NYSE and NASDAQ since 2016 and use it to train my machine learning model. The sheer number of stocks involved forced me to filter out any stock that traded below 1 million shares a day or under 1 dollar a share. This left me with around 1,250 stocks to look at. Maybe when I get home and onto my PC, I can start relaxing this requirement a bit, but at the moment, this is the best my MacBook can handle.
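For anyone curious, the liquidity filter itself is the easy part. Here's a rough sketch of the idea in pandas — the tickers, numbers, and column names (`avg_daily_volume`, `last_price`) are all made up for illustration:

```python
import pandas as pd

# Hypothetical universe snapshot: average daily volume and last price per ticker.
universe = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "avg_daily_volume": [2_500_000, 400_000, 9_000_000, 1_200_000],
    "last_price": [12.50, 3.20, 0.85, 45.00],
})

# Keep stocks trading over 1M shares/day at $1+ per share.
liquid = universe[
    (universe["avg_daily_volume"] > 1_000_000) & (universe["last_price"] >= 1.0)
]
print(liquid["ticker"].tolist())  # ['AAA', 'DDD']
```

CCC gets dropped despite huge volume because it's under a dollar, which is the point of stacking both conditions.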
The model uses the Random Forest algorithm mixed with LSBoost for gradient boosting. I'm hardly an expert at all this, but this is what ChatGPT suggested I use. Random Forest is used to find broad patterns, while LSBoost uses iterative learning for detailed refinement.
I'm using 11 indicators at the moment to train my model.
- EMA5
- EMA10
- Volatility
- RSI
- EMA5 gradient
- 3-day trading range (high)
- 3-day trading range (low)
- RSI gradient
- Approximate volatility
- RSI divergence (bullish)
- RSI divergence (bearish)
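If you want to follow along at home, most of that list boils down to a few lines of pandas each. Here's roughly what computing the bulk of it looks like from a daily close series — the window lengths (14-day RSI, 10-day volatility, etc.) are common defaults, not necessarily the ones I used:

```python
import numpy as np
import pandas as pd

# Toy close-price series standing in for one stock's daily history.
rng = np.random.default_rng(1)
close = pd.Series(100 + rng.normal(scale=1.0, size=60).cumsum())

ema5 = close.ewm(span=5, adjust=False).mean()
ema10 = close.ewm(span=10, adjust=False).mean()

# Volatility as a rolling std of daily returns (one common definition).
volatility = close.pct_change().rolling(10).std()

# 14-day Wilder-style RSI.
delta = close.diff()
gain = delta.clip(lower=0).ewm(alpha=1 / 14, adjust=False).mean()
loss = (-delta.clip(upper=0)).ewm(alpha=1 / 14, adjust=False).mean()
rsi = 100 - 100 / (1 + gain / loss)

# Day-over-day gradients and the 3-day trading range.
features = pd.DataFrame({
    "ema5": ema5, "ema10": ema10, "volatility": volatility, "rsi": rsi,
    "ema5_grad": ema5.diff(), "rsi_grad": rsi.diff(),
    "high_3d": close.rolling(3).max(), "low_3d": close.rolling(3).min(),
}).dropna()
print(features.shape[1])  # 8 feature columns
```

The divergence features are fuzzier to pin down in code (they compare price swings against RSI swings), so I've left them out of the sketch.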
I'm considering Bollinger Bands in the next iteration, and eventually I'd like to do some Monte Carlo simulations based on lines of support/resistance...but baby steps.
When Random Forest runs, it identifies which inputs mattered most to its learning. Generally speaking, you don't want a single variable to dominate the model, and you want to avoid feeding it anything directly tied to price, to avoid overfitting. Random Forest is apparently better at resisting overfitting than other algorithms, but you still have to be diligent. In my case, instead of raw EMAs, I normalized them first to hide the raw price from the model. Below is how the features broke down and how relevant each was to the predictions.
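To make the normalization idea concrete: express each EMA relative to the current price, so the model sees a unitless distance instead of dollars, then read the relevance ranking off `feature_importances_` after fitting. The "EMA divided by price, minus one" formula below is one common choice, not necessarily exactly mine:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
price = 100 + rng.normal(size=300).cumsum()

def ema(x, span):
    """Simple exponential moving average."""
    alpha = 2 / (span + 1)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    return out

# Normalized: % distance of each EMA from price, so no raw dollar levels leak in.
ema5_norm = ema(price, 5) / price - 1.0
ema10_norm = ema(price, 10) / price - 1.0

X = np.column_stack([ema5_norm, ema10_norm])
y = np.diff(price, append=price[-1])  # toy target: next-day change

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Relative importance of each feature; the values sum to 1.
print(dict(zip(["ema5_norm", "ema10_norm"], rf.feature_importances_.round(2))))
```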
Part of the process with Random Forest is setting the number of trees the model uses to learn and tracking the out-of-bag (OOB) error after each iteration. For me, the OOB error started off pretty low and trended down with each run. I think I was using something like 100 trees in this version. This part makes me a little wary, in that it seems too good to be true. Honestly, I'd feel better with more error, since this level of perfection is a sign of overfitting.
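For reference, this tree-count-vs-OOB tracking is a standard scikit-learn pattern: grow the forest incrementally with `warm_start=True` and record `1 - oob_score_` (the OOB R² flipped into an error-like number) at each checkpoint. A sketch on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.2, size=300)

# warm_start=True means each fit() call adds trees instead of retraining,
# so we can track the OOB error as the forest grows up to 100 trees.
rf = RandomForestRegressor(warm_start=True, oob_score=True,
                           bootstrap=True, random_state=0)
oob_errors = []
for n_trees in range(25, 125, 25):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    oob_errors.append(1 - rf.oob_score_)

print(len(oob_errors))  # 4 checkpoints: 25, 50, 75, 100 trees
```

If that curve flattens out early, extra trees are mostly wasted compute; if it's near zero from the very first checkpoint, that's the too-good-to-be-true smell I mentioned.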
Then the moment of truth....how well does the model predict what actually happened?
The blue is the historical data. The red is what the model is predicting will happen. It's cheating, because it's using the known price to determine the variables and then using the variables to predict the price. Still, it shows that the model is able to track the price with the indicators. The green curve shows where the model thinks the stock is headed over the next 40 days and is the true forecast. After each daily prediction, it updates the price and all the features that went into the original model. I'll say, it does a poor job of capturing the day-to-day variation, and I'm not sure how I feel about that. I was hoping for something super fancy that attempts to capture the ups and downs, but as it stands you really have to use it as a general trend tool and not so much for day trading right now.
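The green-curve forecast is a recursive loop: predict one day ahead, append the implied price, recompute every feature from the extended series, repeat 40 times. Here's the shape of that loop on toy data — `make_features` is a made-up stand-in for my real indicator set, not the actual code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
prices = list(100 + rng.normal(scale=0.5, size=120).cumsum())

def make_features(p):
    """Toy feature vector from the last 10 prices (stand-in for the indicators)."""
    p = np.asarray(p[-10:])
    returns = np.diff(p) / p[:-1]
    return np.array([returns.mean(), returns.std(), p[-1] / p.mean() - 1])

# Fit on history: features known at day t-1 -> return realized on day t.
X = np.array([make_features(prices[:i]) for i in range(10, len(prices))])
y = np.array([prices[i] / prices[i - 1] - 1 for i in range(10, len(prices))])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Recursive 40-day forecast: predict a return, append the implied price,
# recompute the features from the extended series, and go again.
sim = prices.copy()
for _ in range(40):
    r = model.predict(make_features(sim).reshape(1, -1))[0]
    sim.append(sim[-1] * (1 + r))

forecast = sim[len(prices):]
print(len(forecast))  # 40 forecast days
```

This structure also explains the smoothness: the forest averages many trees, so each predicted return sits near the middle of what it's seen, and feeding those averaged steps back in washes out the day-to-day wiggle.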
So how did the model do at predicting the price compared to actuals? Note, I did not feed it ANY 2024 data past January 4th, so everything that happened this year is a completely authentic result. Here, I've sorted all 1,250 stocks by % return and am just showing the top 10.
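The ranking step itself is trivial — something like this, with made-up tickers and returns standing in for the real results:

```python
import pandas as pd

# Hypothetical per-ticker results: realized % return over the holding period.
results = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "pct_return": [12.4, 88.1, 5.0, 41.7, 19.3],
})

# Sort descending by % return and keep the leaders (top 10 in my real run).
top = results.sort_values("pct_return", ascending=False).head(3)
print(top["ticker"].tolist())  # ['BBB', 'DDD', 'EEE']
```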
If you blindly bought the stocks the code suggested, you would have ended up +122% for the year. Not all stocks hit their targets. Some of them hit within my 40-day prediction period. Some of them hit sometime after. None of them were losers. I've got my computer churning away on 2025 now, if you're interested. Might take a few more hours to run, but I was giddy this morning to see these results.