Purpose: Collect Articles about a Stock and Get Their Body Text
Functions Defined: GetNewsURLs(), URLtoText(), and SourceInt()
A) SourceInt()
For the sake of a well-organized dataset, it would be optimal to have articles categorizable by source – in other words, we should define a list of trusted sources and ONLY collect articles from those sources. Here’s a list of sources we have found that generally write good content about stocks:
Let’s write a function that, given a URL, returns which source the URL is from:
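Here is a minimal sketch of what such a function could look like. The 13 domains below are placeholders standing in for the trusted list, not the actual sources chosen above – swap in your own:

```python
from urllib.parse import urlparse

# Placeholder domains – substitute the 13 trusted sources you actually chose.
TRUSTED_DOMAINS = [
    'bloomberg.com', 'reuters.com', 'cnbc.com', 'marketwatch.com',
    'fool.com', 'seekingalpha.com', 'forbes.com', 'barrons.com',
    'investors.com', 'thestreet.com', 'benzinga.com', 'zacks.com',
    'wsj.com',
]

def SourceInt(url):
    """Return 1-13 for a trusted source, 0 for anything else."""
    domain = urlparse(url).netloc.lower()
    for i, trusted in enumerate(TRUSTED_DOMAINS, start=1):
        # Match the domain itself or any subdomain of it.
        if domain == trusted or domain.endswith('.' + trusted):
            return i
    return 0
```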
Note that this code gives each “Trusted Source” an integer from 1 to 13, and any other source a 0. This function is multi-purpose: we will use it in two different places while developing our model.
B) GetNewsURLs()
We need a simple way to, inputting any Stock Ticker, get a list of recent articles written about that company, and hopefully a financial evaluation of it. The easiest way to get this information is Google News, which cumulates articles from all over the internet – and all you need is a search term! News Corpus Builder (Documentation HERE) has proven to be the best Python package suited for the job, so let’s start our function here:
def GetNewsURLs(Ticker):
    links = NewsGen.google_news_search(Ticker + ' stocks', 100)
Given the ticker TSLA, NCB will search “TSLA stocks” and return the top 100 articles, which should all have been written in the last couple of days. We then need a way to store these URLs in a URL list – but caution: we only want articles from the trusted sources we defined above. Using our SourceInt() function, if a URL returns a score of 0, we know it is not on the “Trusted List” – and thus we won’t include it in our URL list.
The following code does this:
    URLList = []
    for i in range(len(links)):
        URLLink = links[i][1]
        if not (SourceInt(URLLink) == 0):
            URLList.append(URLLink)
    return URLList
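To see the trusted-source filter in isolation, here is a runnable sketch with a mocked `links` list (title/URL tuples, matching the shape the search results come back in) and a stand-in `SourceInt()` – both are placeholders, not the real search output or trusted list:

```python
# Mocked search results: (title, url) tuples.
links = [
    ('Good article', 'https://trusted.example.com/tsla-up'),
    ('Spam article', 'https://random.example.org/clickbait'),
]

def SourceInt(url):
    # Stand-in: the real version maps 13 trusted domains to 1-13.
    return 1 if 'trusted.example.com' in url else 0

# Same filtering loop as in GetNewsURLs().
URLList = []
for i in range(len(links)):
    URLLink = links[i][1]
    if not (SourceInt(URLLink) == 0):
        URLList.append(URLLink)

print(URLList)  # only the trusted URL survives the filter
```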
C) URLtoText()
Finally, we want a function that takes each news article URL and returns the body text of the article, as well as the date it was published (for creating our Training Dataset). The Newspaper package (Documentation HERE) is one of the best packages for this job. It’s quite simple, too:
from newspaper import Article

def URLtoText(url):
    article = Article(url)
    article.download()
    article.parse()
    return (article.text, article.publish_date)
Not much to explain here. article.text conveniently provides you with the body text, and article.publish_date gets you the date it was written!
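With all three functions in place, a natural next step is to glue them together into one collection loop. This is a hypothetical sketch, not part of the tutorial's code; the two stubs below stand in for the real GetNewsURLs() and URLtoText() so the sketch runs without network access – replace them with the versions defined above:

```python
def GetNewsURLs(Ticker):
    # Stub: pretend the news search returned one trusted URL.
    return ['https://trusted.example.com/%s-earnings' % Ticker.lower()]

def URLtoText(url):
    # Stub: the real version downloads and parses the page with Newspaper.
    return ('Body text for ' + url, '2020-01-01')

def CollectArticles(Ticker):
    """Collect (url, text, date) rows for every trusted article about Ticker."""
    rows = []
    for url in GetNewsURLs(Ticker):
        try:
            text, date = URLtoText(url)
            rows.append((url, text, date))
        except Exception:
            continue  # skip pages that fail to download or parse
    return rows

print(CollectArticles('TSLA'))
```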
By this point you should have 3 working functions: SourceInt(), GetNewsURLs(), and URLtoText(). We’ve got our article data. Now what do we do? In the next part of the tutorial, we will learn how to create some pretty awesome dictionaries to classify our articles as Positive or Negative!
Any questions can be left below for Ethan to answer!