In Brief
- The quantitative equity research team recently researched and implemented an enhancement to the research analyst sentiment score utilized within the Blended Research portfolios.
- FinBERT is materially better than the previous tool at contextualizing text within a document and quantifying its sentiment.
- By leveraging sophisticated tools such as FinBERT to analyze MFS’ proprietary datasets, we believe we can offer a differentiated alpha signal.
MFS® Blended Research strategies are equity investment strategies managed using an alpha signal that combines both quantitative and fundamental inputs. The fundamental alpha includes several features that capture research analysts’ views on the stocks they cover. This includes their ratings (buy, hold, sell), but there are also two other components to the signal: a conviction boost for names held by the analyst in MFS’ research strategies and a sentiment score that uses natural language processing (NLP) to ‘read’ the analysts’ notes and evaluate their sentiment.
The quantitative equity research team recently researched and implemented an enhancement to this sentiment score. Prior to this research, the sentiment was estimated by an NLP algorithm called a “Bag-of-Words” model, which scores text by referencing a dictionary that categorizes words as having positive or negative sentiment. The model we upgraded to is a large language model called FinBERT. Relative to Bag-of-Words, FinBERT is materially better at contextualizing text within a document and quantifying its sentiment.1
In this paper, we’ll outline what the two models are and how they work. We will also provide an example from our proprietary library of historical analyst notes, which shows why we believe FinBERT is more effective at capturing analyst sentiment.
An Overview of Natural Language Processing Models
The Bag-of-Words approach previously used by the Blended Research strategies uses a financial lexicon, developed by Tim Loughran and Bill McDonald at the University of Notre Dame, which tags words as either positive or negative based on how they are typically used in financial documents. The count of those words in each note is then tallied to measure the note’s sentiment. The Loughran McDonald dictionary is a relatively short lexicon because it tries to avoid misclassifying words that can be viewed differently in a business setting than in plain-language English. For example, most lexicons consider ‘vice’ to be a negative word, but given ‘vice president’ is a phrase that is often used in a business context, ‘vice’ isn’t included in the Loughran McDonald dictionary and is considered neutral. The advantage of a Bag-of-Words sentiment measurement is that it’s easy to implement, easy to understand and lets the user control which words are assigned a positive or negative value. The disadvantage is that its simplicity prevents it from understanding context and makes it sensitive to the particular words included in the lexicon.
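The mechanics can be sketched in a few lines of Python. The tiny lexicons and the per-word normalization below are illustrative assumptions for the sketch; the production model references the far larger Loughran McDonald dictionary.

```python
# Minimal Bag-of-Words sentiment sketch. These two small word sets are
# hypothetical stand-ins for the Loughran McDonald dictionary, which tags
# thousands of words as positive or negative in a financial context.
POSITIVE = {"strong", "growth", "improve", "upgrade"}
NEGATIVE = {"decline", "weak", "loss", "downgrade"}

def bag_of_words_score(text: str) -> float:
    """Return (positive count - negative count) / total words."""
    words = [w.strip(".,").lower() for w in text.split()]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

# A sentence with no tagged words scores exactly 0 (neutral), illustrating
# the model's sensitivity to the lexicon.
print(bag_of_words_score("orders were strong this quarter"))   # positive
print(bag_of_words_score("international bookings increased"))  # 0.0: nothing tagged
```

Note how the second sentence, which a human would read as positive, is scored neutral simply because none of its words appear in the lexicon; this is the sensitivity the example later in the paper illustrates.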
Exhibit 1: Comparison of the two models
|               | Bag of Words                      | FinBERT                              |
|---------------|-----------------------------------|--------------------------------------|
| Advantages    | Easy to implement                 | Can better understand context        |
|               | Easy to understand                | More familiar with financial jargon  |
|               | User control over tagging method  | Better able to measure sentiment     |
| Disadvantages | Doesn't understand context        | More complex                         |
FinBERT is a large language model (LLM) based on Google’s BERT (Bidirectional Encoder Representations from Transformers) model. BERT models are widely used in language-related tasks, such as predicting the next word in a text message or email and helping chatbots answer questions. Large language models are composed of multiple neural network “layers,” or computational blocks, that work in tandem to process the input text and generate output. FinBERT is fine-tuned on a large corpus of financial text and trained to predict sentiment using the Financial PhraseBank dataset from Malo et al. (2014).2 The fine-tuning makes FinBERT’s language encoding layers more familiar with financial jargon, while the sentiment layer teaches it to classify sentiment as positive or negative; this final layer is what determines the writer’s attitude toward the subject.
BERT was developed as a language model to encode and predict language and is trained to represent words and sentences and the relationships between them. By fine-tuning the model on financial-specific text and adding a sentiment layer, FinBERT takes advantage of BERT’s ability to understand plain English and focuses it on the task of measuring the sentiment of financial text. The advantage of FinBERT is that it’s not sensitive to a lexicon and can understand plain English, context and complex relationships that aren’t captured by Bag-of-Words; the disadvantage is that it’s more complex, making it harder to understand exactly what is driving any given score. For more details on FinBERT’s development, see Araci (2019).3
It’s important to note that while FinBERT is a large language model, it is not a generative model like ChatGPT. As a result, it doesn’t suffer from some of the stability issues of generative models (e.g., it doesn’t hallucinate answers), and given the same inputs, it will produce the same output every time.
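FinBERT classifies text as positive, negative or neutral and reports a probability for each class (in practice, such probabilities can be obtained from the publicly released FinBERT checkpoint via the Hugging Face transformers library). A common convention, assumed here for illustration rather than taken from the MFS implementation, collapses the three probabilities into a single signed score in [-1, 1]:

```python
# Hedged sketch: collapse FinBERT's three class probabilities into one signed
# sentiment score. The convention (positive probability minus negative
# probability) is a common one, assumed here; it is not a documented MFS formula.
def signed_score(p_positive: float, p_negative: float, p_neutral: float) -> float:
    """Return a score in [-1, 1]: +1 is fully positive, -1 fully negative."""
    total = p_positive + p_negative + p_neutral  # normalize defensively
    return (p_positive - p_negative) / total

# Hypothetical class probabilities for a clearly positive sentence:
print(signed_score(0.85, 0.05, 0.10))  # close to 0.8
```

Under this convention a sentence the model finds mostly neutral still lands near 0, so confident positive or negative classifications dominate the score.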
Comparing the Models Using an MFS Analyst Research Note
When evaluating the two models we looked at both their ability to effectively measure sentiment and the forward returns associated with the scores when used as a systematic quantitative factor. The returns to the FinBERT model outperformed those of the Bag-of-Words model, but the most important difference we noticed was in its ability to measure sentiment in a way that is more consistent with how we, as humans, read the notes.
Take the following note as an example, written by an MFS analyst regarding a US technology company and defense contractor in November 2016:
Paragraph 1 – “Posted an in-line quarter after normalizing the tax rate. Organic revs were down 2%, but the rate of decline appears to have bottomed. Orders were strong with B2B 1.17x.
Paragraph 2 – I had been worried about high margin tactical radio sales, and this quarter increased for the first time in three quarters, with B2B 1.22x versus 0.92x last quarter. I caution bookings across all its businesses, which are lumpy, but this is enough evidence for me that things are bottoming. International radio bookings increased almost 30% sequentially. The US radio business was always set to grow in 2018 given the wins, but now the ensuing bathtub doesn’t look so deep. The rest of the business should start to grow organically, and portfolio pruning continues.
Paragraph 3 – The team continue to execute the synergy plan (margins +50 bps to 13.7%), while the declines in the total business are rapidly decelerating. Valuation still looks okay at 17x CY17. I see a pathway to $1B run rate FCF by next year, which places the shares at approximately 8% yield. Upgrade to a 1.”
The note is clearly positive on the company and the analyst is stating that the forward prospects for the business look good and is upgrading it to a ‘Buy’ rating. FinBERT accurately scores this note as positive, while the Bag-of-Words model scores it as negative.
Exhibit 2: The scores for FinBERT and Bag-of-Words broken out by paragraph
|                    | FinBERT Score | FinBERT Sentiment | Bag-of-Words Score | Bag-of-Words Sentiment |
|--------------------|---------------|-------------------|--------------------|------------------------|
| Paragraph 1        | -0.58         | negative          | -0.08              | negative               |
| Paragraph 2        | 0.90          | positive          | -0.04              | negative               |
| Paragraph 3        | 0.82          | positive          | -0.05              | negative               |
| Overall Note Score | 0.38          | positive          | -0.06              | negative               |
Note that FinBERT and Bag-of-Words scores are not on the same scale, but both are centered at 0, with values near 0 indicating a neutral tone, positive numbers implying positive sentiment and negative numbers implying negative sentiment.
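The overall note scores in the exhibit are consistent with a simple average of the paragraph-level scores. The averaging rule is our inference from the table, not a documented aggregation formula:

```python
# The overall note scores in Exhibit 2 are consistent with a simple mean of the
# paragraph-level scores (an inference from the table's numbers, not a
# documented MFS aggregation rule).
finbert_paragraphs = [-0.58, 0.90, 0.82]
bow_paragraphs = [-0.08, -0.04, -0.05]

def note_score(paragraph_scores: list) -> float:
    """Average paragraph-level sentiment into one note-level score."""
    return sum(paragraph_scores) / len(paragraph_scores)

print(round(note_score(finbert_paragraphs), 2))  # 0.38, as in Exhibit 2
print(round(note_score(bow_paragraphs), 2))      # -0.06, as in Exhibit 2
```

Averaging also shows why the two models diverge on this note: one negative FinBERT paragraph is outweighed by two strongly positive ones, while Bag-of-Words scores every paragraph mildly negative.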
While both models view the first paragraph as negative, driven by “organic revs were down,” the FinBERT model picks up on the positives in the final two paragraphs. Bag-of-Words takes a neutral stance on many of the sentences because none of their words happen to be tagged in the Loughran McDonald financial lexicon, which causes it to miss important indicators of sentiment. For example, the phrases “upgrade to a 1,” “the rest of the business should start to grow organically” and “international radio bookings increased almost 30% sequentially” are considered neutral under Bag-of-Words because none of the words are tagged. FinBERT accurately views all these sentences positively and effectively captures how the positives in this note outweigh the negatives.
Bag-of-Words can be sensitive to the lexicon used in the scoring, and for shorter notes such as this example, a small number of sentences can drive the score because the model takes a neutral view on most of them. FinBERT captures the sentiment more like a human reader would. While some words or sentences in the example could be viewed as negative, the main point of the note is that the analyst has a positive forward view on the stock.
Scoring Accuracy Comparison
As part of the analysis, we looked at the notes where the two models disagreed the most and scored them by hand as positive, negative or neutral. In these cases, FinBERT’s scores were not only more correlated with our hand-scored values but also agreed with our positive or negative scores 85% of the time.
Exhibit 3: Model scores versus MFS Quant team scores
|                                    | FinBERT | Bag-of-Words |
|------------------------------------|---------|--------------|
| Correlation With Quant Team Scores | 0.43    | 0.22         |
| Total % Correct                    | 85%     | 38%          |
The Blended Research Advantage
While FinBERT is a more sophisticated and effective approach to measuring sentiment, it’s also worth highlighting that the main advantage in terms of an investment edge is not the model itself, but the data it’s being applied to. MFS employs a global team of fundamental analysts to analyze stocks and the sentiment scores are being calculated on the proprietary dataset of analyst notes available only to MFS investors. By leveraging sophisticated tools such as FinBERT to analyze proprietary datasets, we believe we can offer a differentiated alpha signal that gives exposure to the insights developed by our fundamental teams.
Endnotes
1 Sentiment here refers to how the natural language processing models interpret the tone of text, much as a human reader would. It is different from the sentiment factor utilized within the MFS Blended Research Quantitative Alpha model.
2 Malo, P., Sinha, A., Korhonen, P., Wallenius, J. and Takala, P. (2014), Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts. J Assn Inf Sci Tec, 65: 782–796. https://doi.org/10.1002/asi.23062.
3 D. Araci, “Finbert: Financial sentiment analysis with pre-trained language models,” arXiv preprint arXiv:1908.10063, 2019.
The views expressed are those of the author(s) and are subject to change at any time. These views are for informational purposes only and should not be relied upon as a recommendation to purchase any security or as a solicitation or investment advice. No forecasts can be guaranteed. Past performance is no guarantee of future results.
MFS’ investment analysis, development and use of quantitative models, and selection of investments may not produce the intended results and/or can lead to an investment focus that results in underperforming portfolios with similar investment strategies and/or the markets in which the portfolio invests. The proprietary and third party quantitative models used by MFS may not produce the intended results for a variety of reasons, including the factors used, the weight placed on each factor, changing sources of market return, changes from the market factors’ historical trends, and technical issues in the development, application, and maintenance of the models (e.g., incomplete or inaccurate data, programming/software issues, coding errors and technology failures).