Tracking opinions on EU referendum through data


The United Kingdom’s withdrawal from the European Union, known as Brexit, is one of the most significant events of the 21st century. It has changed the lives of millions of people inside and outside the UK, with political, economic, and social consequences.

Since the Prime Minister of the United Kingdom announced a referendum on the country’s continued membership of the European Union, the process has been long and complex, with ups and downs, tough negotiations, and opinions shifting in all directions.

The Brexit referendum had a very tight result (51.9% in favor of leaving against 48.1% in favor of remaining) and a turnout of 72.2%. Although the vote took place in 2016, the official exit did not happen until 2020. Even so, the process is not over yet, as the EU and the UK must negotiate new trade agreements, testing the unity of the country as it faces the entire European Union.

So how could this happen? Can we identify the main reasons? How did people really feel about Brexit? How did prominent speakers influence the population? The Kava-ADA team has put itself in the shoes of investigative journalists to look for answers and draw conclusions from a data analysis of quotes.


We have divided the project into four questions, which are our main goals:

  • How did people really feel towards Brexit?
  • What arguments did the members of each group use to support their beliefs?
  • Which are the dominant features of the speakers in each group?
  • How did the opinion towards Brexit change during the 5-year span? Did the arguments of each group also change?

Data feasibility

The project is mainly based on the analysis of quotes from the Quotebank dataset. This data source is described best by its makers:

Quotebank is a dataset of 178 million unique, speaker-attributed quotations that were extracted from 196 million English news articles crawled from over 377 thousand web domains between August 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.
(Vaucher, Timoté, Spitz, Andreas, Catasta, Michele, & West, Robert. (2021). Quotebank: A Corpus of Quotations from a Decade of News (1.0) [Data set]. Zenodo.)

In our case, we use data collected between January 2015 and April 2020. We chose 2015 as the start because it was the year of the General Election in the United Kingdom, in which the Conservative Party (which won a majority in the House of Commons) promised an in-out referendum on the United Kingdom’s membership of the European Union. This lets us study the debate across all these years, during which many other Brexit-related events took place (such as the official departure of the UK on the 31st of January 2020).

In addition to Quotebank, we use another dataset, Wikidata, to enrich the data about the speakers of the quotes. This dataset, meant primarily for use in Wikimedia projects like Wikipedia or Wiktionary, contains properties and references describing an item, e.g., a person or a country. Because Quotebank uses Wikidata QIDs to refer to speakers, we can easily link speakers to their attributes in Wikidata. Wikidata entries can contain an arbitrary number of features and references, so we have decided to use only a small set of attributes, which we can then use for demographic analysis. These attributes are gender, date of birth, nationality, occupation, political party, academic degree, political support, and religion.
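The QID-based link between the two datasets can be sketched as a simple lookup. This is a minimal illustration with made-up records: the field names (`qids`, `gender`, `nationality`), the placeholder QIDs and the choice of taking the first QID when several candidates exist are all assumptions, not necessarily what the project did.

```python
# Toy Wikidata attribute table keyed by QID (placeholder IDs and values).
wikidata = {
    "Q_A": {"gender": "female", "nationality": "United Kingdom"},
    "Q_B": {"gender": "male", "nationality": "France"},
}

# Toy quotes; each carries a list of candidate speaker QIDs (assumed layout).
quotes = [
    {"quotation": "We will deliver on the vote.", "qids": ["Q_A"]},
    {"quotation": "We regret the decision.", "qids": ["Q_B", "Q_C"]},
    {"quotation": "No comment.", "qids": []},
]

def enrich(quote, attributes):
    """Attach speaker attributes to a quote via its first QID, if known."""
    qid = quote["qids"][0] if quote["qids"] else None
    speaker = attributes.get(qid, {})  # unknown or missing QID -> no attributes
    return {**quote, "qid": qid, **speaker}

enriched = [enrich(q, wikidata) for q in quotes]
```

Quotes without a resolvable QID are kept but carry no demographic attributes, so they can be filtered out later when a demographic breakdown is needed.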

Analyzing sentiments

How did people really feel towards Brexit?

The Brexit process was triggered by the majority vote to leave in the 2016 referendum. However, this did not mean that the debate about Brexit was over. Polls suggest that voters remained divided on the issue until the start of 2020. It is therefore interesting to analyze people’s feelings towards the decision and to understand whether this ‘shock’ result was really a shock.

To achieve this goal, we begin by classifying the speakers according to the sentiment of their quotes. This will help us identify the opinions that speakers have about Brexit and the percentages of positive and negative opinions that exist. To do so, we will use sentiment analysis.

Sentiment analysis is the use of natural language processing (NLP), text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information (Wikipedia).

This process allows us to statistically work out whether a piece of text is positive, negative or neutral.

There are mainly two sentiment analysis approaches:

  • Polarity-based: pieces of text are classified as either positive or negative.
  • Valence-based: The intensity of the sentiment is taken into account.

We decided to use VADER (Valence Aware Dictionary and sEntiment Reasoner), "a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains".

This library is not tied to a single domain and is known for being effective in general use. For each quote, VADER gives us a `compound` statistic, an aggregated sentiment score normalized between -1 (most negative) and +1 (most positive).
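A minimal sketch of how `compound` scores can be turned into the three sentiment classes. The ±0.05 cut-offs are the ones commonly suggested in VADER's documentation; whether the project used exactly these thresholds is an assumption, and the helper names are hypothetical.

```python
from collections import Counter

def score_quotes(texts):
    """Return VADER compound scores (in [-1, 1]) for a list of texts.

    Requires the `vaderSentiment` package; it is imported lazily so the
    labelling helpers below also work without it.
    """
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(t)["compound"] for t in texts]

def label(compound, threshold=0.05):
    """Map a compound score to a sentiment label (threshold is an assumption)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def shares(compounds):
    """Percentage of positive, negative and neutral scores."""
    counts = Counter(label(c) for c in compounds)
    total = len(compounds)
    return {k: 100 * counts[k] / total for k in ("positive", "negative", "neutral")}

# Hypothetical compound scores standing in for real VADER output.
example = shares([0.6, 0.2, -0.4, 0.0])
```

With real data, `shares(score_quotes(list_of_quotes))` would give percentages of the kind shown in a sentiment pie chart.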

The pie chart below shows the percentages of positive, negative and neutral quotes obtained after the sentiment analysis. If you hover over the different slices you can also see the exact number of quotes. By using the dropdown you can select other countries and see how percentages change.

We can see that the results are quite close to reality. Globally, negative opinions (35.2%) are less common than positive ones (51.1%), and a small share of opinions is neutral (13.7%).

For a finer-grained view, we can look at the percentages within some countries of interest:

  • UK: the results are quite similar to the global ones: 50.9% positive, 36% negative and 13.1% neutral. This matches what we expected, as there is a small increase in negative sentiment and a decrease in neutral sentiment, which suggests that the atmosphere is slightly more polarized within the country.
  • EU countries and associates: they show percentages around 50% for positive quotes, 30%-35% for negative quotes and 15% for neutral ones. Other countries like Japan, the USA or Canada follow the same pattern.
  • Scotland and Wales: these regions show a higher percentage of negative sentiment, around 43%. Positive and negative percentages are roughly even here, which reflects the general dissatisfaction of these regions with Brexit.

It is also interesting to analyze how sentiment changes over the years, as shown in the graph below. Positivity rises as the Brexit campaign progresses, reaching its highest point in 2017, once the decision was announced. During the negotiation period (2017-2020), however, positivity decreases and the percentage of negative quotes increases accordingly. We will perform a deeper analysis in Q4: Brexit over time.
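The yearly evolution described above amounts to computing sentiment percentages per year. A minimal sketch, assuming each quote has already been reduced to a hypothetical `(year, label)` pair; the records below are invented for illustration:

```python
from collections import Counter, defaultdict

def yearly_shares(records):
    """records: iterable of (year, label) pairs -> {year: {label: percent}}."""
    by_year = defaultdict(Counter)
    for year, lab in records:
        by_year[year][lab] += 1
    result = {}
    for year, counts in sorted(by_year.items()):
        total = sum(counts.values())
        result[year] = {lab: 100 * n / total for lab, n in counts.items()}
    return result

# Invented toy records, not project data.
records = [
    (2016, "positive"), (2016, "negative"),
    (2017, "positive"), (2017, "positive"), (2017, "positive"), (2017, "negative"),
]
trend = yearly_shares(records)
```

Normalizing within each year matters because the number of quotes varies a lot from year to year; raw counts would mostly reflect media attention rather than sentiment.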

Analyzing arguments

What arguments did the members of each group use to support their beliefs?

Arguments presented during the referendum campaign covered politics, economics and national identity. Our goal is to identify these topics within the quotes and to be able to explain the situation through our data. To achieve this goal, we use two different approaches:

Word frequency approach

The ability to count the frequency of terms in a text is one of the key steps in NLP. We are going to look at the word frequency, i.e., what words are repeated most often in the quotes of each sentiment group, to better understand the arguments used by the members of each category.

Word frequency analysis is carried out using FreqDist from nltk. It gives us lists of the most frequent samples (single words, bigrams and trigrams) within the quotes. To clarify:

  • a bigram is a pair of words that are next to each other in a quote
  • a trigram is a trio of words that are next to each other in a quote
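The counting step can be sketched with the standard library alone. Here `collections.Counter` stands in for nltk's `FreqDist` (which is itself a `Counter` subclass and also offers `most_common`); the tokenized quotes are invented and assumed to be already lower-cased and lemmatized.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (as tuples) of adjacent tokens in one quote."""
    return list(zip(*(tokens[i:] for i in range(n))))

def ngram_counts(quotes, n):
    """Count n-grams quote by quote, so that no n-gram spans two quotes."""
    counts = Counter()
    for tokens in quotes:
        counts.update(ngrams(tokens, n))
    return counts

# Invented, pre-tokenized quotes.
quotes = [
    ["leave", "european", "union"],
    ["leave", "eu"],
    ["vote", "leave", "eu"],
]
bigrams = ngram_counts(quotes, 2)
trigrams = ngram_counts(quotes, 3)
```

Calling `bigrams.most_common(k)` then yields the k most frequent bigrams with their counts, which is the kind of list a frequency barplot is built from.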

As a first step, we analyze word frequency over the whole set of quotes. The following barplot shows the samples and their frequency of appearance. By using the dropdown menu you can switch between word, bigram or trigram.

Unsurprisingly, ‘Brexit’ is the most repeated word with 99773 appearances, followed by ‘eu’ with 23554 and ‘leave’ with 20002. Among the bigrams, ‘leave eu’ came up the most with 9320 appearances, followed by ‘european union’ with 7457 and ‘leave european’ with 4417. Among the trigrams, ‘leave european union’ appears the most, a total of 4384 times, followed by ‘get brexit do’ with 1491 appearances and ‘vote leave eu’ with 953.

This analysis does not give us much insight into how people perceive Brexit in general. For a more in-depth analysis, we split the quotes into two groups: positive and negative sentiment. We then compare the two sets of samples and keep those that appear in one group but not in the other.
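One simple way to find samples that differ between the two groups is to compare their top-k lists. This is only a sketch of the idea: the `top_k` cut-off, the function name and all counts below are assumptions made for illustration.

```python
from collections import Counter

def distinctive(counts_a, counts_b, top_k=20):
    """Samples in the top_k of counts_a that are absent from the top_k of counts_b."""
    top_b = {sample for sample, _ in counts_b.most_common(top_k)}
    return [s for s, _ in counts_a.most_common(top_k) if s not in top_b]

# Invented counts, not project data.
positive = Counter({"leave eu": 50, "great opportunity": 12, "new opportunity": 9})
negative = Counter({"leave eu": 40, "brexit fiasco": 11, "real risk": 7})

only_positive = distinctive(positive, negative)
only_negative = distinctive(negative, positive)
```

Shared samples such as "leave eu" drop out, leaving the expressions that characterize each group.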

If we analyze the plots above, we can conclude the following:

The different samples from the negative group show a feeling of dissatisfaction with the situation.

  • Bigrams like “brexit disastrous”, “idiotic brexit” and “brexit fiasco” are three of the five most frequent. They show that this group sees the process as something crazy or idiotic that does not make sense. The sample “threat leave” exemplifies the perception of Brexit as a threat. The fourth most frequent sample, “dead water”, reflects the belief that during 2019 Brexit was “dead in the water” because of the impossibility of reaching agreements with the EU.
  • If we analyze trigrams, the most frequent sample, “real risk brexit”, shows that Brexit is seen as a risk to society. Other samples like “diversion idiotic brexit” or “need diversion idiotic” continue the trend seen in the bigram analysis, in which Brexit is seen as meaningless.

The different samples of the positive group show the conception of Brexit as a hope and an opportunity.

  • Bigrams like “new opportunity” and “great opportunity” show that positive speakers think of Brexit as a chance to improve their country. The sample “whole unite” sums up the feeling of unity among Brexit supporters.
  • Among the trigrams, the most frequent sample is “unite kingdom leaf”. This strange sample is an artifact of the lemmatization process, which has turned “leaves” into “leaf”; the original phrase is “United Kingdom leaves”, which clearly refers to the UK leaving the EU. Other samples like “without even sketch”, “even sketch plan” and “sketch plan carry” appear 61, 61 and 53 times respectively.

To sum up, this first analysis allows us to extract the most frequent expressions used by each opinion group rather than the arguments. It is possible to vaguely relate these expressions to the thoughts of the different opinion groups but not as an absolute truth. Therefore, we will carry out the second analysis using the LDA tool to achieve a better result.

LDA approach

In this second approach we use an LDA (Latent Dirichlet Allocation) model. The results are easier to interpret interactively and visually, so we employ the pyLDAvis package, which allows us to:

  • Better understand and interpret individual topics.
  • Better understand the relationships between the topics.

For the topic interpretation, you can select each topic to view its top most frequent and/or “relevant” terms, using different values of the λ parameter.

  • When λ = 1, the terms are ranked by their frequency inside the topic, so the most frequent terms appear at the top of the list.
  • When λ = 0, more weight is given to words that appear almost only in that topic and not in the rest, that is, more ‘unique’ words, which help you understand what sets the topic apart. This can help when you’re trying to assign a human-interpretable name or “meaning” to each topic.
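The effect of λ can be made concrete with the relevance score defined by Sievert and Shirley (2014) for LDAvis: relevance(w, t | λ) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). The sketch below applies it to two terms with invented probabilities; the `rank` helper is hypothetical.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term for a topic: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Invented probabilities: (p(term | topic), p(term) in the whole corpus).
terms = {
    "brexit": (0.20, 0.18),    # frequent everywhere
    "backstop": (0.05, 0.01),  # rarer, but concentrated in this topic
}

def rank(terms, lam):
    """Order terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

by_frequency = rank(terms, 1.0)    # pure in-topic frequency
by_uniqueness = rank(terms, 0.0)   # pure lift, favoring topic-specific terms
```

With λ = 1 the ubiquitous term comes first, while with λ = 0 the topic-specific term does; intermediate values blend the two criteria.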

For the relationships between topics, you can explore the Intertopic Distance Plot. Here, the pyLDAvis package projects the topics onto the two most important principal components (PC1, PC2) and places them according to their ‘closeness’. This helps you see how topics are related to each other and which ones overlap, meaning that they share many words or characteristics. When several topics overlap heavily, it may indicate that a broader topic encompassing all of them should be defined.
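To give an idea of what ‘closeness’ between topics can mean, the sketch below computes the Jensen-Shannon divergence between topic-word distributions. To our knowledge this is the distance pyLDAvis uses by default before projecting to two dimensions, but treat that as an assumption about the library's defaults; the toy distributions are invented.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same vocabulary."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Invented topic-word distributions over a four-word vocabulary.
topic_a = [0.50, 0.40, 0.10, 0.00]
topic_b = [0.45, 0.45, 0.10, 0.00]  # shares most of its mass with topic_a
topic_c = [0.00, 0.10, 0.40, 0.50]  # uses mostly different words

close = js_divergence(topic_a, topic_b)
far = js_divergence(topic_a, topic_c)
```

Topics with a small divergence would be drawn near each other or overlapping, while topics with a large divergence end up far apart on the map.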

Positive topics

The following plot shows the topics for the positive quotes. For example, if you set lambda = 0.66 (optimal value), select a topic and hover over the words, you can distinguish topics like:

  • Topic 2: related to business and investment.
  • Topic 3: dominated by the Brexit date in October.
  • Topic 6: related to labour and trade.
  • Topic 9: the agreement related to Northern Ireland seems to be important.

Negative topics

The following plot shows the topics for the negative quotes. If we analyze the Intertopic Distance Map, we can highlight the similarities between topics 12, 2 and 6 on PC1. They include terminology that could refer to the uncertainty that Brexit introduces in the economy, in the markets and in employment. Terms like EU, the UK and Scotland are mentioned, as well as ‘risk’ and ‘hard’ in association with Brexit.

On the other hand, PC2 highlights terms related to government action on Brexit, but also to the ideas or decisions of the people. That is, issues more related to decision-making and less to economic aspects.

Analyzing speakers

Which are the dominant features of the speakers in each group?

Wikidata is an open and free knowledge base that collects structured data to support services such as Wikipedia. The repository mainly contains items, each of which has a unique item identifier (QID). Quotebank (our main dataset) also contains this unique identifier (QID) for each of the speakers, so the two datasets can easily be connected through this common identifier.

We have matched each quote and obtained the following information about the speaker: gender, date of birth, nationality, occupation, political party, academic degree, political support, and religion.

First, we represent the most active speakers by number of quotes during the studied period. The following plot shows three variables:

  • The mean sentiment of the speaker’s quotes (x axis).
  • The standard deviation of the sentiment of the speaker’s quotes (y axis). This tells us how much each speaker’s sentiment varies around the mean: the closer a speaker’s bubble is to the top, the more the sentiment of their quotes varies.
  • The total number of quotes of the speaker (size of each speaker’s bubble).

The first thing we can observe is that the most active speakers sit towards the right side, meaning that they have a more positive sentiment. On the other hand, there are no active speakers with a very negative sentiment about Brexit. This is a key point of our analysis, because it means that the most active speakers (and therefore those with a higher impact) have a more positive feeling about Brexit. Secondly, we can identify some important speakers, such as Boris Johnson or Theresa May, who are the most active ones. Feel free to explore the plot and see the sentiment of each speaker.
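The three quantities behind each bubble (mean, standard deviation and number of quotes per speaker) can be sketched as follows. The rows are invented, the speaker names are placeholders, and using the population standard deviation (`pstdev`) rather than the sample one is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

def speaker_stats(rows):
    """rows: (speaker, compound) pairs -> {speaker: (mean, std, n_quotes)}."""
    scores = defaultdict(list)
    for speaker, compound in rows:
        scores[speaker].append(compound)
    return {s: (mean(v), pstdev(v), len(v)) for s, v in scores.items()}

# Invented compound scores for two placeholder speakers.
rows = [("speaker_a", 0.5), ("speaker_a", 0.1), ("speaker_b", -0.2)]
stats = speaker_stats(rows)
```

A speaker with a single quote gets a standard deviation of 0, so in practice it makes sense to plot only speakers above some minimum number of quotes.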

Now it is time to look at the geographical distribution of the quotes by the nationality of the speakers who made them. It is possible to filter the quotes by positive, negative, or neutral sentiment. Unsurprisingly, the most active country is the United Kingdom, followed by the United States. Keep in mind that the dataset we are using contains only quotes in English, so quotes about Brexit made in other languages are not represented.

In the following plot, we show absolute values for different demographic characteristics of the speakers, split into strong positive and strong negative sentiment groups. Only speakers with a strong positive or negative sentiment are taken into account. We represent gender, occupation, and membership in a political party.

  • Gender: in both groups, the male/female split is almost identical, so we cannot see significant gender differences between speakers with different sentiment toward Brexit. However, in both groups the speakers are mainly male (above 80%), which shows that the current political environment is dominated by men.

In both occupation and political party, the total number of speakers is larger for strong positive sentiment (7379 speakers) than for strong negative sentiment (5233 speakers). This is consistent with what we saw in the Top-50 most active speakers, where the most active ones have a more positive sentiment toward Brexit.

  • Occupation: in absolute numbers, all occupations have more strongly positive than strongly negative speakers. We have also analyzed the percentage of each occupation inside each group (positive or negative) to find differences between the two groups. Occupations such as economists, journalists, and writers account for a significantly larger share of the negative group than of the positive group, and lawyers for a very slightly larger share. This is informative, especially for economists, as one of the main sources of pessimism about Brexit is its economic viability and consequences. On the other hand, politicians, actors, football players, researchers, businesspeople, and television actors account for a slightly larger share of the positive group. This may suggest that occupations more related to emotions (entertainment or sport) represent a greater percentage of the positive group, whereas occupations related to reasoning about and analyzing the Brexit situation (such as economists) represent a greater percentage of the negative group.
  • Political party: as with occupations, all political parties have a larger absolute number of speakers with strong positive sentiment. However, looking at the percentages inside each group (positive/negative), parties such as the Scottish National Party, the Labour Party, and the Liberal Democrats account for a greater percentage of the negative group than of the positive group.
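The difference between absolute counts and within-group percentages can be illustrated with a small sketch. The numbers are invented so that every occupation has more positive than negative speakers in absolute terms while its relative weight still differs between groups; the function name is hypothetical.

```python
from collections import Counter

def within_group_share(speakers):
    """speakers: (group, occupation) pairs -> {group: {occupation: percent of that group}}."""
    by_group = {}
    for group, occupation in speakers:
        by_group.setdefault(group, Counter())[occupation] += 1
    return {
        g: {occ: 100 * n / sum(c.values()) for occ, n in c.items()}
        for g, c in by_group.items()
    }

# Invented speakers: 16 in the positive group, 6 in the negative group.
speakers = (
    [("positive", "politician")] * 12 + [("positive", "economist")] * 4
    + [("negative", "politician")] * 3 + [("negative", "economist")] * 3
)
share = within_group_share(speakers)
```

In this toy case there are more positive than negative economists (4 against 3), yet economists make up 25% of the positive group and 50% of the negative group, which is the kind of relative difference discussed above.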

Brexit over time

How did the opinion towards Brexit change during the 5-year span?

In previous sections, we discussed the overall picture of the Brexit debate: what percentage of speakers supported or opposed Brexit, what their arguments were, and what their demographics were. However, this overall picture compresses more than five years of discussions and arguments into a single graph. Given media cycles, emerging topics, and the constantly changing dynamics of a campaign (and later, of negotiations), it is worth revisiting the previous questions to see how the answers changed over time, looking at three periods side by side.

We will divide the analysis into three main eras:

  • Before the referendum (January 1st, 2015 to February 20th, 2016): the official campaign only began on February 20, 2016 (source), but this does not mean that discussions about the United Kingdom’s membership of the EU did not arise before that date. As mentioned above, a referendum between a newly negotiated agreement with the EU and leaving the Union was promised in the Conservative manifesto for the 2015 election, so it is interesting to look at how arguments and demographics changed between the start of 2015 and the start of the referendum campaign.
  • The campaign (February 21st, 2016 to June 23rd, 2016): as the campaign progressed, from February 2016 to June 2016, we may see more organised messaging from both sides, as each side was organised under its own umbrella organisation, Britain Stronger In Europe and Vote Leave. This means that those who argued for their side could use the resources of their umbrella organisation, alongside various party-oriented organisations (like Labour Leave or Conservatives In) or independent ones (like Leave.EU or Scientists for EU). This is also a time when both sides may want to reach out directly to voters, leading to more activity on both sides.
  • After the Leave vote (June 24th, 2016 to January 31st, 2020): with the Leave victory, the discussion shifted from ‘if’ to ‘how’. From the discussion about when to trigger Article 50 (of the Treaty on European Union, which describes the procedure for leaving the EU), to the level of integration with the EU post-Brexit, to the Northern Ireland issue and how leaving the EU fits within the framework of the Good Friday Agreement, we may expect more arguments around the minutiae of the Brexit process.
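Assigning each quote to one of the three eras is a simple date-range lookup. A minimal sketch using the boundaries listed above; the `era_of` helper and the era labels are hypothetical names.

```python
from datetime import date

# Era boundaries as listed above (inclusive on both ends).
ERAS = [
    ("before the referendum", date(2015, 1, 1), date(2016, 2, 20)),
    ("the campaign", date(2016, 2, 21), date(2016, 6, 23)),
    ("after the Leave vote", date(2016, 6, 24), date(2020, 1, 31)),
]

def era_of(day):
    """Return the era name for a date, or None if it falls outside all three eras."""
    for name, start, end in ERAS:
        if start <= day <= end:
            return name
    return None

referendum_day = era_of(date(2016, 6, 23))
```

Since the data runs until April 2020, quotes dated after January 31st, 2020 fall outside the three eras and get `None` here; how to treat them is a separate decision.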

All of these events are summarized in the interactive timeline below. Take a look at it and feel free to navigate through it and follow the chronological narrative of Brexit. The three main eras contain the different events that build up to major occurrences.


Summary of the project

Throughout this project, we have applied methods covered in the course to the Quotebank dataset. The goal was to filter relevant quotes, connect them to the identities of their speakers, and quantify them in terms of sentiment, topics mentioned and demographics. Questions Q1-Q3 gave us a general picture of the Brexit debate, while question Q4 delved deeper into this picture, trying to connect what we had seen with real-life events of the debate.

Results obtained

The results obtained give us some insight into the debate surrounding Britain’s decision to leave the European Union. Our data paint a picture of a discussion based mostly on positive sentiment. Statements with positive sentiment tended to lean towards topics such as achieving Brexit and Free Trade Agreements, while negative statements focused on the fear of a no-deal Brexit and on general uncertainty. Both types of statements also touched on the Good Friday Agreement and the island of Ireland. Looking at the demographics of our speakers, we noticed that they tended to be British, overwhelmingly male, and politicians. Looking at how the answers to these questions changed over time, we were able to identify spikes in the number of quotes recorded and connect them to important events of the Brexit process, such as the announcement of the referendum date, the first Withdrawal Agreement being agreed, or the General Elections after the referendum. We have also seen a shift in arguments over time: from borders, to the economy, to Northern Ireland and the form of the Withdrawal Agreement. Finally, we looked at demographics over time and noted the interesting patterns the data revealed.

About the team

Alicia Soria, Álvaro Bautista, Kamil Czerniak, Víctor González