We’ve now defined functions to collect relevant stock articles, analyze those articles’ sentiment, and interpret stock trends with historical data. These three parts now come together into one procedure that generates the training dataset we will use for Deep Learning. In Part 6 of our Deep Learning for Stock Market Prediction series, we learn how to use the functions defined earlier to create that dataset.
Although simple to code, creating the dataset proved very difficult because of the tedious process of finding credible articles on which to base our entire algorithm. I chose Apple (AAPL) as the basis for our dataset, and collected 53 articles covering events and announcements relevant to Apple.
An example of URL collection
Putting these article URLs into one large list, we can loop over it with Python’s for statement and extract important information such as the text and date from each article. This lets us get the sentiment of the text, as well as the source of the article.
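The loop described above could be sketched roughly as follows. This is only an illustration: `collect_articles`, `source_id`, and the `SOURCE_IDS` table are hypothetical names, the domains and IDs are made up, and `fetch` and `score` stand in for the URLtoText() and sentiment functions from earlier parts, whose exact signatures are assumed here.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: the series encodes each source as an integer,
# but these particular domains and IDs are made up for illustration.
SOURCE_IDS = {"example.com": 0, "example.org": 1}


def source_id(url, table=SOURCE_IDS):
    """Map an article URL to an integer source ID, or None if unknown."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return table.get(domain)


def collect_articles(urls, fetch, score):
    """Gather one record per URL.

    fetch(url) is assumed to return (text, date) and score(text) a number;
    both are placeholders for the functions defined earlier in the series.
    """
    records = []
    for url in urls:
        text, date = fetch(url)
        records.append({
            "url": url,
            "date": date,
            "sentiment": score(text),
            "source": source_id(url),
        })
    return records
```

Passing the fetching and scoring functions in as arguments keeps the loop itself easy to try out with stand-in functions before pointing it at real articles.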
The important thing is making sure that the article has a date. We can do this by taking the datetime variable from our URLtoText() function and making sure it isn’t None (Python’s null value). We then check that the date the article was written appears as a logged date in our Historical Stock CSV. We do this so that the article’s publication has a better chance of being the cause of the upward or downward trend that follows in the data.
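As a rough sketch of these two checks, assuming the historical CSV has a "Date" column in YYYY-MM-DD format (the actual layout of the file is not shown here, so treat the column name and format as assumptions). The helper names `load_trading_dates` and `is_usable` are hypothetical.

```python
import csv
import io
from datetime import date, datetime


def load_trading_dates(csv_file, column="Date", fmt="%Y-%m-%d"):
    """Read every logged date from an open historical-price CSV into a set."""
    reader = csv.DictReader(csv_file)
    return {datetime.strptime(row[column], fmt).date() for row in reader}


def is_usable(article_date, trading_dates):
    """True only if the article has a date and that date is logged in the CSV."""
    if article_date is None:
        return False
    # datetime is a subclass of date, so reduce it to a plain date first.
    if isinstance(article_date, datetime):
        article_date = article_date.date()
    return article_date in trading_dates


# Made-up rows standing in for the real historical CSV.
sample = io.StringIO("Date,Close\n2018-01-02,100.0\n2018-01-03,101.5\n")
logged = load_trading_dates(sample)
print(is_usable(datetime(2018, 1, 2, 9, 30), logged))
print(is_usable(None, logged))
```

Loading the dates into a set once makes each per-article lookup cheap, which matters little for 53 articles but keeps the check simple.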
If both of these conditions are met, we use our getSlope() function to find the slope of the stock’s data points around the date the article was written. If the slope is greater than 0, the line of best fit moves upwards (a positive relationship between time and price), so we assign slope the value 1. If the slope is less than 0, the line of best fit moves downwards (a negative relationship), so we assign slope the value 0.
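To make the labelling step concrete, here is a minimal sketch. It is not the getSlope() function from the earlier part: `fit_slope` is a hypothetical stand-in that fits an ordinary least-squares line through evenly spaced prices, and returning None for a perfectly flat line is an assumption, since the text only covers slopes above and below 0.

```python
def fit_slope(prices):
    """Least-squares slope of prices against their positions 0, 1, 2, ..."""
    n = len(prices)
    if n < 2:
        raise ValueError("need at least two prices to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(prices) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(prices))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def label_from_slope(slope):
    """Return 1 for an upward trend and 0 for a downward trend."""
    if slope > 0:
        return 1
    if slope < 0:
        return 0
    return None  # flat line: not covered in the text, so left undecided here


# Made-up closing prices around an article's date.
print(label_from_slope(fit_slope([100.0, 101.0, 103.0, 104.5])))
```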
Obvious long-term up/down trends in GE’s stock
This is very important for our deep learning model. Because the model tries to predict between two classes (Stock Will Go Up and Stock Will Go Down), it is technically predicting either a 1 or a 0 (which ties into what we did in the paragraph above). Our AI model will then analyze an article and its source, and return a real decimal number between 0 and 1, which shows how confident it is that the stock will go up. A return of 0.45 means the AI model isn’t that sure, but is leaning towards a future downward trend in the data. A return of 0.90 means the model is pretty sure the stock will go up.
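One way to read such an output in code is shown below. The 0.5 cut-off and the `interpret` helper are assumptions made for illustration; the model itself is only built later in the series.

```python
def interpret(probability, threshold=0.5):
    """Turn a model output in [0, 1] into a direction and a confidence.

    Values at or above the threshold count as "up"; the confidence is the
    probability the model assigns to the direction it picked.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("expected a value between 0 and 1")
    if probability >= threshold:
        return "up", probability
    return "down", 1.0 - probability


for output in (0.45, 0.90):
    direction, confidence = interpret(output)
    print(f"{output:.2f} -> {direction} (confidence {confidence:.2f})")
```

With this reading, 0.45 comes out as a weak "down" and 0.90 as a strong "up", matching the two examples above.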
Back in our code, we use the following to append a row to our CSV data file:
    import csv

    # Append one row per article; newline='' avoids blank rows on Windows (Python 3).
    with open('name.csv', 'a', newline='') as csvfile:
        filewriter = csv.writer(csvfile, delimiter=',')
        filewriter.writerow([sentiment, source, slope])
This will create a CSV file that, viewed in Excel, would look like this:
Remember that the first column is Sentiment (+/- level of a text), the second column is Source (an integer designating the source of the article), and the third column is Slope (1/0 for an up/down trend in the data).
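For a quick check that the file has this three-column layout, the rows can be read back and converted to numbers. This sketch assumes the file has no header row and that every field is numeric; `parse_rows` is a hypothetical helper, and the sample rows are made up.

```python
import csv
import io


def parse_rows(csvfile):
    """Read (sentiment, source, slope) rows from an open CSV file."""
    rows = []
    for row in csv.reader(csvfile):
        if not row:
            continue  # skip blank lines
        sentiment, source, slope = row
        rows.append((float(sentiment), int(source), int(slope)))
    return rows


# Made-up rows in the same layout as the dataset file.
sample = io.StringIO("0.25,1,1\n-0.4,0,0\n")
for sentiment, source, slope in parse_rows(sample):
    print(sentiment, source, slope)
```

With a real file, the same function can be called on `open('name.csv', newline='')`.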
With this CSV, we are ready to split the data and prepare it for our Deep Learning Model!
Any questions can be left in the comments! 😃
Thanks for the awesome tutorial. What is an acceptable confidence level – if 0.9 is pretty high, how about 0.8 or 0.7?
This is all up to personal preference. Anything above 0.75 is generally considered good, as it means that most of the media is saying good things about that stock.