Automated sentiment analysis makes poor showing in accuracy test
Tests of a range of social media monitoring tools conducted by research consultancy FreshMinds found that comments were, on average, correctly categorised only 30% of the time.
FreshMinds’ experiment involved tools from Alterian, Biz360, Brandwatch, Nielsen, Radian6, Scoutlabs and Sysomos. The products were tested on how well they assessed comments made about the coffee chain Starbucks, with the comments also having been manually coded.
On aggregate, the results look good, said FreshMinds. Accuracy levels were between 60% and 80% when the automated tools were reporting whether a brand mention was positive, negative or neutral.
“However, this masks what is really going on here,” writes Matt Rhodes, a director of sister company FreshNetworks, in a blog post. “In our test case on the Starbucks brand, approximately 80% of all comments we found were neutral in nature.
“For brands, the positive and negative conversations are of most importance and it is here that automated sentiment analysis really fails,” Rhodes said.
Excluding the neutral comments, FreshMinds manually coded conversations that the tools judged to be either positive or negative in tone. “We were shocked that, without ‘training the tools’, they could be so wrong,” said the firm. “While positive sentiment was more consistently categorised than negative, not one tool achieved the 60-80% accuracy we saw at the aggregate level.
“To get real value from any social media monitoring tool, ongoing human refinement and interpretation is essential,” said the company.
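The imbalance Rhodes describes is easy to illustrate. In the toy sketch below, the 80% neutral share comes from the article, while the 10/10 split of the remaining comments is a hypothetical assumption; it shows how a classifier that labels everything "neutral" matches the tools' aggregate accuracy while identifying no opinionated comments at all.

```python
# Hypothetical gold-standard labels: 80 neutral, 10 positive, 10 negative
# (the 80% neutral share is from the article; the 10/10 split is assumed).
gold = ["neutral"] * 80 + ["positive"] * 10 + ["negative"] * 10
predicted = ["neutral"] * 100  # a trivial majority-class "classifier"

# Aggregate accuracy across all comments
overall = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Accuracy on just the opinionated (non-neutral) comments
opinionated = [(g, p) for g, p in zip(gold, predicted) if g != "neutral"]
non_neutral = sum(g == p for g, p in opinionated) / len(opinionated)

print(overall)      # 0.8 -> as good as the tools' best aggregate scores
print(non_neutral)  # 0.0 -> yet it never spots a positive or negative comment
```

This is why excluding neutral comments, as FreshMinds did, is the more revealing test.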
The full whitepaper can be downloaded online here. Get the lowdown on social media monitoring here.

Research Live is published by MRS.
The Market Research Society (MRS) exists to promote and protect the research sector, showcasing how research delivers impact for businesses and government.
22 Comments
Annie Pettit
15 years ago
It all comes down to doing the work to get the automated systems working as well as possible. The amount of validation work that must go into creating an accurate automated sentiment analysis system is simply enormous and ongoing. Systems that do not incorporate ongoing validity mechanisms cannot improve and will only worsen over time as speech and language change with the times. What this says to me is buyer beware and buyer do your homework. Ask your vendor if they validate their engines, how they do it, and how often they do it. Annie Pettit, Chief Research Officer www.conversition.com
Mark Westaby
15 years ago
These findings will come as no surprise to companies that use automated analysis properly. Using automated analysis for individual pieces of coverage and without 'training' the software is never going to produce good results; and, in this respect, the Freshminds study is itself flawed because, frankly, they should understand that. Equally, the companies studied should not be offering generic automated analysis services for exactly this reason, so in that respect the study is valid. In fact, automated analysis used properly can achieve remarkably accurate results. Something the study does not do, of course, is compare the use of properly trained and correctly used automated analysis against humans to analyse, say, 1000 pieces of online coverage in real-time, which is increasingly required in today's highly connected world. Had they done so the automated analysis would win hands-down. In other words it's 'horses for courses' and this study really should have pointed that out.
Nikki Wright, FreshMinds
15 years ago
Thanks for your comments, Matt. Many of our clients come to us having tried social media monitoring for themselves and discovered the issues we highlighted in our report. Through the research it was our intention to see how the tools varied without such 'training', as this is not always consistent and is certainly not always used by clients. We've had some great feedback, particularly from the tool providers, and we plan to update this research shortly.
Jo Shaw
15 years ago
Couldn't agree more. Social media measurement has a LONG way to go, and no number of funky dashboards and black-box algorithms is going to make a difference until some pretty fundamental weaknesses have been addressed. http://tiny.cc/d9ld5
Mike Daniels
15 years ago
As regular visitors to this site may recall, Mark Westaby and I debated this very question - whether automated tools could provide sufficiently accurate sentiment analysis to support critical business decisions - earlier this year. This study supports what is now generally considered a settled view – that automated analysis tools cannot, and generally do not pretend to deliver the same levels of sentiment accuracy as well trained, fully briefed human analysts. However, as others have pointed out, there are inadequacies in this study. But I would contend that these research issues do not detract from the central finding that when it comes to sentiment, automated tools are simply not as accurate or consistent as humans. Faced with this finding, proponents of automated tools, as Mark is, often retreat into justifying their use by virtue of their benefits in "real time" analysis. However, in practice, real time analysis is really only necessary in crisis or rapid response situations. And the paradox is that in these situations, whilst actionable results can be achieved by automated tools, there is actually no need for sentiment analysis. Crises are marked by specific topics under discussion – it is these that need to be tracked. Their very presence will indicate where remedial or defensive action may be required – no sentiment required... In more strategic contexts, where business insight and support for business outcomes are critical, delivering accurate, reliable, consistent and robust sentiment analysis from trained human analysts massively outweighs the constant nagging doubt about the consistency and accuracy of data from an automated platform. In our experience, owners of valuable brands simply cannot, and indeed do not take the risk of using such potentially inaccurate data in determining the performance of their assets. 
The noticeable swing back to human-derived analytics from companies previously using automated only tools is tangible proof of that particular pudding. As a sidebar, I would strongly dispute the study's view that neutral coverage is somehow less important than positive/negative sentiment , especially in relation to building and sustaining brands and reputation – and even more so in a competitive context. There are plenty of research studies showing that neutral brand visibility helps build awareness, and, more importantly also serves to build up reputational "trust bank" reserves...
Brian Tarran
15 years ago
Thanks for all the comments. Just to pick up on Mike Daniels' reference to his head-to-head debate with Mark Westaby on people vs. machine analysis – that piece can be found here: http://bit.ly/gLHJ8
David Geddes
15 years ago
First, we see automated sentiment scoring as part of our business process to assist analysts rather than as a stand-alone tool. Second, the white paper is not especially transparent about the statistical tools and methods used to arrive at its conclusions. It is meaningless to lump all the systems together and lament that only 30% of the posts were scored accurately. Likewise, it is not meaningful to say that the best system achieved around 50% accuracy. How was this calculated? Third, I continue to be amazed by the success achieved by computer scientists in their models using automated sentiment scoring. Fourth, I am surprised by the claim that Twitter is easier to rate due to short text length. All academic and research papers I have read state the opposite. Finally, are we falling into a trap of feeling that we have to provide a sentiment score for everything to achieve the results we need? I am regularly impressed with results reported by academics where they use manual scoring on a small sample of stories (say 1,000). Why do business clients always want scoring of everything? Is this overkill? Should we instead revert to an appropriate sample-based research design to address specific client questions?
Mark Evans
15 years ago
Brian, While automated sentiment technology isn't perfect, it is improving on a steady basis as the technology evolves. At the same time, it is important to recognize that technology does a lot of grunt work in processing millions of conversations - something that couldn't be done manually. As well, there is a role for people to play alongside automated sentiment technology to make sure that the results can be edited or tweaked to reflect context, sarcasm, etc. In many respects, social media sentiment works effectively if there is a solid marriage between technology and people. cheers, Mark Mark Evans Director of Communications Sysomos Inc.
Katie Paine
15 years ago
Unfortunately the study tested the most popular, but least reliable of the systems available. I'm convinced that PR people would rather measure what is easy to measure than to measure accurately. Just as an FYI, we routinely test humans against humans to ensure a minimum 90% intercoder reliability score and THEN test automated sentiment analysis against that. The only system that comes close is SAS's Social Media Analysis, but that's in part because they are a client of ours and used our coding instructions to design their system.
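Paine's human-vs-human baseline can be sketched as simple percent agreement between two coders (the labels below are hypothetical; real studies often add chance-corrected measures such as Cohen's kappa or Krippendorff's alpha):

```python
def percent_agreement(coder_a, coder_b):
    """Share of items on which two coders assign the same label."""
    assert len(coder_a) == len(coder_b)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical codings of the same ten comments by two analysts
coder_a = ["pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos", "neu", "neg"]
coder_b = ["pos", "neg", "neu", "pos", "pos", "neg", "neu", "pos", "neu", "neg"]

print(percent_agreement(coder_a, coder_b))  # 0.9 -> meets a 90% threshold
```

Only once the human panel clears a threshold like this does it make sense to score a machine against it.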
Aditi Muralidharan
15 years ago
Trying them out "without training" makes no sense, and if I were a company using this sort of software to analyze my brand I'd make sure to train it first. To anyone who's familiar with the literature on this topic, it's not at all surprising that untrained accuracies would be abysmal. I agree with Mark Westaby, it's been demonstrated over and over again that an automatic sentiment analyzer needs to be trained to avoid being hopelessly bad, so this study is flawed.
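To make the idea of "training" concrete, here is a minimal sketch, not any vendor's actual method, of a tiny Naive Bayes classifier learned from a handful of hypothetical hand-labelled examples; real systems train on thousands of domain-specific documents.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labelled training comments (domain: coffee-chain mentions)
train = [
    ("love the new latte", "positive"),
    ("great coffee and friendly staff", "positive"),
    ("terrible service and cold coffee", "negative"),
    ("worst latte ever", "negative"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def predict(text):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("friendly staff and great latte"))  # "positive" on this toy data
```

Untrained, such a model has no word statistics at all, which is exactly why out-of-the-box accuracy on an unseen brand and domain can be abysmal.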
Mark Westaby, Spectrum
15 years ago
I love these conversations with Mike D -- I'm still trying to drag him into the 21st century but he insists on living in the past (at least he paid for lunch last time)! OK, Mr D, here we go... Re the comment about real-time analysis there are mega benefits to being able to do this over and above any need for 'crisis or rapid response' situations for those of us who want to do serious outcome measurement. Let me give you a simple but absolutely fundamental example that really makes the point: by analysing in real-time we can very accurately see the impact of different communication activities on key outcomes. For instance, we can monitor and analyse online/social media on an hourly basis (we can actually do this more frequently but usually find this suffices). So, when a client's ad breaks on TV we can measure its impact within the hour, or hours, that it appears; and when that same client makes a press announcement we can do exactly the same. Likewise for direct mail, digital campaigns, sales promotions, whatever. Real-time analysis enables us -- and more importantly our clients -- to see exactly what is generating outcomes. It also provides an incredibly rich source of data we can then use for statistical analysis, such as regression, to validate our findings (or not as the case might be). There is just no way this could be done without real-time analysis. In addition, most of our clients want reports on at least a daily basis and they want great detail, as standard. Try doing even daily analysis with human analysts in huge detail across blogs, web, forums, Twitter, and you'll be in serious trouble. Last but not least, why such a fixation on the accuracy of human analysis? Everybody assumes that all human analysis is accurate but, do you know what, it ain't! In fact automated analysis is FAR more accurate than humans for the detailed analysis we do, which is now about insight rather than evaluation. 
Frankly, I'm delighted so many of you still believe human analysis is best. It just means those of us who use automated analysis properly (and I emphasise the properly) will have an even greater share of the insight market. That's the way it's going, folks!
Eric Sanders
15 years ago
Thanks for this analysis and insight; it confirms the findings we've seen amongst our clients and further validates the approach we've taken at Crimson Hexagon. Our approach eliminates the failings of keyword and linguistics analysis, and instead uses a statistics-based approach that provides the dual benefit of increased accuracy (97% hand-validated) and language agnosticism. Although we don't yet have the reach of some of the companies you evaluated, we work with several dozen of the world's leading brands who have also found the mainline "buzz" tools wanting when it comes to true opinion analysis. We'd welcome the opportunity to participate in your study, or answer any questions you might have.
Tom H. C. Anderson
15 years ago
This is why we at Anderson Analytics developed the 'Validation through Triangulation' methodology several years ago. However, these numbers are still surprisingly high. We conduct validation tests all the time. I think it's amusing though that we assume humans are 100% accurate. They certainly are NOT. What I've always said is that the most important benefit of text analytics is dependability (i.e. CONSISTENT results). This is where human analysis fails.
Subrata
15 years ago
This study hardly provides any details about how they went about their analysis. Unless we see the reliability of the raters (intra- and inter-rater) used to manually code the conversations, any comparison to human coders is meaningless. There are very sophisticated text processing algorithms available out there that could be used for this task. However, if the rudimentary description of the existing algorithms (on Page 7) is accurate, then the findings are not surprising at all.
Tom H. C. Anderson
15 years ago
Example Text Analytics Meme from the #NGMR Meme Contest: http://www.tomhcanderson.com/wp-content/uploads/2010/06/terminator_nextgen.jpg
Pascal Soucy
15 years ago
I agree with many comments posted here that training is generally a must to get acceptable results. There are some exceptions though. My company does sentiment analysis in a specific domain: the monitoring of restaurant and hospitality mentions. In this domain, we have two kinds of mentions: formal reviews (on sites such as Yelp and TripAdvisor), for which we get accuracy over 90% (between positive, negative and neutral), and general mentions (on Twitter or blogs, for instance), where the accuracy is a bit lower (around 80%, but we're still working to improve it). Obviously, we do not train specifically for each individual restaurant or hotel to guess sentiment. Guessing the sentiment correctly is not the only challenge; disambiguation of the brand (is the comment really about Starbucks?) can be a real problem too. Pascal Soucy VP Products, InfoGlutton
srw
15 years ago
Working in applied sentiment analysis research, I can only say: beware of social media providers! Every domain and brand to monitor is different and there is no silver bullet. The challenge is to mix real people with algorithms, but in the startup phase the people are more important.
Audrey of Infinit Outsourcing
15 years ago
While automated sentiment analysis presents a great solution, at the moment the technology has yet to catch up with the human mind's ability to learn, analyze and understand. While you may be able to 'teach' computers to 'learn', they do not have the adaptability that the human mind possesses. While automating tasks may be faster, the very reason you require the data is to get a snapshot of the current state of your market. If you base business decisions on inaccurate data, where will that leave you? You cannot totally eliminate humans from the equation. Companies don't just need speed, or just accuracy; they need both.
Michelle
15 years ago
Hi Brian, I'm always a fan of Fresh Networks and their work. I met Matt Rhodes at a conference in London and it's a pleasure to see their reports. With regards to this particular study, I have to say we completely agree. While there are excellent tools that are making advances in the semantic web, "ongoing human refinement and interpretation" is exactly where we stand. There needs to be human analysis of unstructured data from the web in order to fully understand the trends, why something is more popular in one country vs. another, what a company should address as being the most urgent, etc. Only sorry I hadn't seen this post sooner, Brian, it's a good one :) Best, Michelle @Synthesio
Richard Foley
14 years ago
I wish I had seen this earlier; it is an interesting post, and it touches on a very important and much-discussed topic: accuracy. I do want to point out a few issues and comments about this post though. 1. Your flip-the-coin analogy is slightly flawed. You note that many of the comments are neutral. If that is an option, and I assume an even split between classes, then I should only have a 33% chance of guessing correctly, which is closer to the 30% mentioned, i.e. not that far off a coin toss. 2. Without knowing how you classified the documents it is pretty hard to analyze how correct this study is. 3. Calculating accuracy is an interesting topic and prone to much misunderstanding. I just blogged about calculating accuracy: http://bit.ly/dKbK5h But as everyone has stated before, you really do need a properly trained set of data to do good analytics; this is true for any analytics, not just text. For full disclosure, I am the product manager of text analytics at SAS. All in all your post allows for great conversation and discussion, and touches on important issues in text analytics. Thanks, Richard Foley, Product Manager, Text Analytics at SAS
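Foley's first point is easy to check numerically: with three equally likely classes, uniform random guessing lands near 33%, close to the study's ~30% figure (all data below is simulated, not from the study):

```python
import random

random.seed(42)  # deterministic simulation
labels = ["positive", "negative", "neutral"]

# Simulated gold labels and uniformly random guesses for 30,000 comments
gold = [random.choice(labels) for _ in range(30000)]
guesses = [random.choice(labels) for _ in range(30000)]

accuracy = sum(g == p for g, p in zip(gold, guesses)) / len(gold)
print(round(accuracy, 2))  # ~0.33, i.e. one chance in three
```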
Anon
11 years ago
Great article. It was referenced here http://www.scribd.com/doc/189572739/God-Does-Not-Play-Dice-with-Social-Sentiments
Michalis Michael
11 years ago
There are over 300 social media monitoring tools out there, but they have not been developed for market research; they were developed to serve the PR discipline. They have managed to get away with low sentiment accuracy for so many years because PR managers either ignored the inaccuracies and used the data as if they were accurate, did manual enhancements, or simply focussed on buzz and ignored sentiment. With the right combination of machine learning algorithms and computational linguistic methods it is without a doubt possible to consistently achieve over 85% sentiment accuracy in any language, in an automated way. It is hard work to set it up and it needs a little maintenance from time to time, but it is absolutely possible. Michalis A. Michael CEO-DigitalMR www.digital-mr.com