Exploring Crypto Market Sentiment Analysis Using NLP For The Purpose of Price Prediction

Marvin Lee
Jun 22, 2021

Sentiment analysis is a technique that uses natural language processing (NLP) and statistics to determine the attitude a given text expresses toward a specific topic.

With respect to investing, sentiment analysis is often used to ascertain market, or investor, sentiment. The words bullish and bearish are used frequently to describe overall market sentiment, or the individual sentiment of a person.

In this respect, one could say the stock market and crypto market both operate in the same way:

Sentiment abounds.

Crypto and blockchain technology are developing rapidly. Even with both fundamental and technical analysis in hand, sentiment analysis can add significant context and perspective when one wants a well-rounded sense of the market, especially if the goal is to predict price movement.

In this post, I share some of my findings from exploring crypto market sentiment analysis with natural language processing.

I decided to analyze and perform machine learning on crypto publication headlines to determine whether any predictive power could be gained, mostly because I thought it would be fun, but also because of the undeniable attention that both publishers and readers give to headlines.

They can even change the way we think.

My question was:

Could a collection of headlines from a given week be used to predict whether the price of a crypto asset went up or down?

The Process For Performing Sentiment Analysis On Crypto Headlines

Here are most of the steps I took in the process:

  1. Collected article headlines — and their publication dates — from 3 top crypto publications, using the keyword “Bitcoin” for the search
  2. Pulled historical price data on Litecoin (2014–present)
  3. Performed some natural language processing on the headlines, including the removal of stopwords (common words that wouldn’t add much value to the modeling process)
  4. Organized the article headlines according to the week they were published
  5. Created a feature that described the weekly price change of Litecoin, and then transformed this feature into a categorical variable that described whether the price of Litecoin went up or down in a given week
  6. Performed some machine learning: split the data into train and test sets, then used Random Forest, Logistic Regression and XGBoost models to determine the most important features and to see whether any of the models could predict the weekly price direction with precision
  7. Imported WordCloud from the wordcloud library (a standalone Python package, separate from NLTK) to create word clouds for the most frequent words associated with both upward and downward price movement for Litecoin
  8. Determined the words that had the most significance to price movement (this is different from frequency)
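
To make steps 3 to 5 concrete, here is a minimal sketch using only the Python standard library. It is not the code from my project: the function names, the tiny stopword list, the sample headlines and the prices are all made up for illustration, and labelling a week by comparing its first and last closing price is just one possible definition of "weekly price change".

```python
from collections import defaultdict
from datetime import date

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "on", "for", "and", "is", "as"}

def clean(headline):
    """Lowercase each word, strip surrounding punctuation, and drop stopwords."""
    words = (w.strip(".,:;!?'\"()").lower() for w in headline.split())
    return [w for w in words if w and w not in STOPWORDS]

def week_key(day):
    """Identify a week by its ISO (year, week number) pair."""
    iso = day.isocalendar()
    return (iso[0], iso[1])

def weekly_documents(headlines):
    """Merge the cleaned words of all headlines published in the same week."""
    docs = defaultdict(list)
    for published, text in headlines:
        docs[week_key(published)].extend(clean(text))
    return dict(docs)

def weekly_labels(closes):
    """Label a week 1 if its last close is above its first close, else 0."""
    by_week = defaultdict(list)
    for day, price in sorted(closes):
        by_week[week_key(day)].append(price)
    return {week: int(prices[-1] > prices[0]) for week, prices in by_week.items()}

# Invented sample data, for illustration only.
headlines = [
    (date(2021, 6, 14), "Bitcoin jumps as the market rallies"),
    (date(2021, 6, 16), "Analyst reveals a new Bitcoin target"),
    (date(2021, 6, 22), "Bitcoin drops on mining news"),
]
closes = [
    (date(2021, 6, 14), 170.0),
    (date(2021, 6, 18), 175.0),
    (date(2021, 6, 21), 160.0),
    (date(2021, 6, 25), 150.0),
]

docs = weekly_documents(headlines)
labels = weekly_labels(closes)
```

Each week ends up as one bag of words paired with one up-or-down label, which is the shape a classifier needs.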

After completing these steps, this is the word cloud consisting of the words most frequent when there was upward price movement after a week:

And here’s the word cloud for words most frequent when there was downward price movement after a week:

Notice anything interesting?

Well, one interesting observation is that the word ‘bitcoin’ is prevalent in both the upward and the downward word cloud, which is to be expected given that the headlines were collected by searching for “Bitcoin”. This shows us that although the word ‘bitcoin’ is prevalent, it doesn’t help us much in understanding market sentiment, let alone in making any predictions.

It’s safe to say that there isn’t much difference between the two, right?
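
A word cloud is essentially a picture of word counts, so the same comparison can be sketched with a plain counter. This is a hypothetical illustration with invented weekly documents, not my actual data; `top_words` and `shared_top_words` are names I made up for the example.

```python
from collections import Counter

def top_words(weekly_docs, labels, label, n=3):
    """Most common words across all weeks that carry the given label."""
    counts = Counter()
    for week, words in weekly_docs.items():
        if labels.get(week) == label:
            counts.update(words)
    return counts.most_common(n)

def shared_top_words(weekly_docs, labels, n=3):
    """Words that rank in the top n for both up weeks (1) and down weeks (0)."""
    up = {word for word, _ in top_words(weekly_docs, labels, 1, n)}
    down = {word for word, _ in top_words(weekly_docs, labels, 0, n)}
    return up & down

# Invented sample data, for illustration only.
weekly_docs = {
    "week-1": ["bitcoin", "jumps", "bitcoin", "rally"],
    "week-2": ["bitcoin", "drops", "bitcoin", "selloff"],
}
labels = {"week-1": 1, "week-2": 0}

shared = shared_top_words(weekly_docs, labels)
```

A word that lands in `shared` dominates both classes, so on its own it cannot separate up weeks from down weeks.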

OK, well now let’s take a look at the word cloud of the words that had the greatest significance, or rather the greatest feature importance, after running an XGBoost model with hyperparameters chosen by grid search:

These are the most significant words, or, the words that were the most useful during the modeling process for the purpose of making a prediction.

Observe anything interesting?

Well, one thing I thought was interesting is the large presence of action verbs, with words like reveals, jumps, calls, drops, continues and more.

One important note here is that this word cloud doesn’t necessarily tell us that these words refer to upward or downward weekly price movement for Litecoin. It just establishes the significance of the words to the modeling.

However, upon checking for the presence of the word ‘reveals’ within both the green and red word clouds (the upward and downward clouds above), interestingly, I found this:

The word ‘reveals’ shows up 7 times in weeks when the price went up, as opposed to 1 time in weeks when the price went down.

So, can we say that, in the context of this dataset, the word ‘reveals’ seems to have a bullish sentiment?
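
One rough way to probe that question is to ask how surprising a 7-to-1 split would be if ‘reveals’ were equally likely to appear in up and down weeks. The sketch below does that arithmetic. The 50/50 baseline is an assumption: it only holds if up and down weeks contributed similar amounts of headline text, which I haven't verified here.

```python
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

up_count, down_count = 7, 1          # counts of 'reveals' from the word clouds
total = up_count + down_count

share_up = up_count / total          # 7 / 8 = 0.875
one_sided = binomial_tail(up_count, total)   # (8 + 1) / 256, about 0.035
```

So 87.5% of the occurrences fall in up weeks, and a split at least this lopsided would happen about 3.5% of the time under the 50/50 assumption. With only 8 occurrences, and many other words being compared at once, that is a hint rather than evidence.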

Even if we could, I imagine it might be wise to ask:

Does a bullish sentiment for ‘reveals’ mean a bullish sentiment for the market? And, assuming this does mean a bullish market sentiment, does this mean the prediction is that the market goes up next, or that it goes down? For how long?

Exploration in data science often leads to more questions, but each one is perhaps a step closer to something useful.

If you’d like to check out the GitHub repository for this, you can find it here.
