Bringing innovation to political science research through computing

Portuguese researchers have created a tool that allows documents spanning 40 years of Portuguese democracy to be accessed and searched online. The scientific community – and even journalists – have already recognized its benefits.

Using computational methods in studies concerning national politics is becoming more and more frequent, but much remains to be done. Joana Gonçalves de Sá and Paulo Almeida, from LIP’s Social Physics and Complexity (SPAC) group, in collaboration with some colleagues, set out to make Portuguese democracy searchable on the Internet. In a practical way, it was as if they had found an untidy library of political texts – parliamentary debates and electoral programs – and organized it so that anyone could easily find the information they were looking for. “There was no way to do a simple search on the parliament’s website, we couldn’t understand which topics were discussed in which debates. So we thought about developing a tool that would be useful for the scientific community in general,” explains the researcher, emphasizing that, to do so, they used resources from the National Infrastructure for Distributed Computing (INCD).

The first step was to create a database (the corpus) of documents. “Basically, it is a collection of texts that is structured in a searchable way,” explains Paulo Almeida. They needed to collect all the information on the parliament’s website regarding parliamentary debates. “We had to use a program that visits the site’s pages and collects and saves relevant information so as not to harm any service,” says Joana Gonçalves de Sá. Although documents were available from the time when Portugal was still a monarchy, LIP researchers wanted to focus on the debates that took place after 1976, when the country was already a young democracy.

Once the files were gathered, they had to be processed programmatically: “they came in HTML and we had to process each one of them until we got to XML, which is a semi-structured text format that uses tags,” Paulo Almeida adds. Through these tags, it was possible to divide the sessions into their various moments and to identify the different speakers and the parties they belonged to. Along the way, the challenges kept piling up, especially when it came to correcting as many spelling mistakes as possible and teaching the computer to detect the speakers.
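As a rough illustration of this tagging step, the sketch below turns a few transcript lines into XML in which each speech carries its speaker and party as attributes. It is not the researchers' actual pipeline: the input line format ("Name (PARTY): text"), the function name transcript_to_xml and the tag and attribute names are all assumptions made for the example, and the HTML is assumed to have already been reduced to plain lines.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical line format: "Speaker Name (PARTY): speech text".
# The real parliamentary transcripts may be laid out quite differently.
SPEAKER_LINE = re.compile(
    r"^(?P<name>[^():]+?)\s*\((?P<party>[^()]+)\):\s*(?P<text>.*)$"
)


def transcript_to_xml(lines, date):
    """Build a <session> element with one <speech> child per intervention."""
    session = ET.Element("session", {"date": date})
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_LINE.match(line)
        if match:
            # A new speaker starts a new <speech> element.
            current = ET.SubElement(
                session,
                "speech",
                {"speaker": match["name"], "party": match["party"]},
            )
            current.text = match["text"]
        elif current is not None:
            # Lines without a speaker prefix continue the previous speech.
            current.text = f"{current.text} {line}".strip()
    return session


if __name__ == "__main__":
    demo = ["Ana Costa (PS): Bom dia.", "Obrigada.", "Rui Lopes (PSD): Muito bem."]
    print(ET.tostring(transcript_to_xml(demo, "1999-10-25"), encoding="unicode"))
```

The names and sentences in the demo are invented. In practice, most of the effort described above would go into the cases a simple pattern like this misses, such as misspelled names or inconsistent speaker prefixes.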


“It took us two years to have a final corpus, with all the debates, with a minimum of spelling mistakes, and with all the speakers and their respective political forces identified,” Joana Gonçalves de Sá admits, noting that the work was never done on a full-time basis and that there was no associated funding, but that it was possible to carry it out with “a little bit of money from other projects and a lot of goodwill from the researchers involved.”

Without the resources provided by the INCD, both researchers acknowledge that this project would have been extremely difficult to carry out. The infrastructure provided all the support free of charge – and still does: “The parliamentary corpus is hosted in a virtual machine in the INCD infrastructure, which allows it to be accessible on the Internet without maintenance costs for the research group and that is important for us,” says Paulo Almeida. “It really made our work easier,” added Joana Gonçalves de Sá.

Four decades of Portuguese democracy transcribed into a machine have culminated in a website that makes research easier and that can serve researchers and the population in general. “We’ve added a search engine to the corpus that allows us to extract the information the user is looking for,” explains Paulo Almeida. “If someone wants to know how many times a word was said during a parliamentary session, they can get not only a graph showing how many times it was mentioned, when and who said it, but also the frequency with which it was used,” Joana Gonçalves de Sá exemplifies, noting that it is also possible to “extract a complete text and understand in what context that same word was used”.
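The kind of query described here can be sketched in a few lines. The example below counts how often a word appears, who said it, and its relative frequency among all words spoken. It is only an illustration under assumptions: the function name count_word and the dictionary keys "speaker" and "text" are hypothetical, and the actual search engine behind the website may work differently.

```python
import re
from collections import Counter


def count_word(speeches, word):
    """Return (total hits, hits per speaker, relative frequency) for a word.

    `speeches` is an iterable of dicts with the hypothetical keys
    'speaker' and 'text'.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    by_speaker = Counter()
    total_words = 0
    for speech in speeches:
        hits = len(pattern.findall(speech["text"]))
        if hits:
            by_speaker[speech["speaker"]] += hits
        # Count every word so the hits can be expressed as a frequency.
        total_words += len(re.findall(r"\w+", speech["text"]))
    total = sum(by_speaker.values())
    frequency = total / total_words if total_words else 0.0
    return total, by_speaker, frequency
```

Grouping the same counts by session date instead of by speaker would give the data needed for a graph of mentions over time.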

The first study using the corpus

In 2018, the researchers decided to test the database they had been working on since 2016 and rolled up their sleeves to apply it to a study on Portuguese democracy. The analysis, entitled “Spot the differences, a computational approach to inferring party positions from electoral manifestos, parliamentary discourses, and voting patterns,” covers the period from 1999 – when the Left Bloc (BE) elected its first deputy – to 2019 and, in addition to BE, covers four other parties: the center-left Socialist Party (PS), the center-right Social Democratic Party (PSD), the Portuguese Communist Party (PCP) and the right-wing Democratic and Social Center party (CDS-PP). “With Lília Perfeito, also from LIP’s SPAC, Manuel Marques Pita from Lusófona University and Sofia Serra da Silva from ICS/UL, we compared debates, party votes and measures stated in the electoral programs,” explains Joana Gonçalves de Sá. “Basically, we used three different corpora – the parliamentary debates, the electoral programs and the voting records – to try to understand how the parties align in these areas,” she adds.

The findings revealed that, as far as the relative positioning of the parties is concerned, there are two very well established party blocs – the left-wing one, which includes BE and PCP, and the right-wing one, with PSD and CDS-PP. These blocs are now much more polarized than they used to be. The PS, on the other hand, varies in its position, moving between these two extremes.

The speeches, the votes and the content of the respective electoral programs underline even more clearly the existence of these two blocs, which very clearly separate the right from the left. “Within each bloc, the parties are almost indistinguishable,” notes Joana Gonçalves de Sá. “PCP and BE vote the same way more than 90% of the time, as is the case between the PSD and the CDS-PP. And even in terms of discourse, we have great difficulty in separating them,” she continues.
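A figure such as “vote the same way more than 90% of the time” is, in its simplest form, a pairwise agreement rate. The sketch below shows one way to compute it, assuming each vote is recorded as a mapping from party to its position. The function name agreement_rates and the data layout are hypothetical, and the study itself may define and compute agreement differently.

```python
from itertools import combinations


def agreement_rates(votes):
    """Share of votes in which each pair of parties took the same position.

    `votes` is a list of dicts mapping a party label to its position,
    e.g. 'for', 'against' or 'abstain' (an assumed encoding).
    Only votes in which both parties took part are counted for a pair.
    """
    parties = sorted({party for vote in votes for party in vote})
    rates = {}
    for a, b in combinations(parties, 2):
        shared = [vote for vote in votes if a in vote and b in vote]
        if shared:
            same = sum(vote[a] == vote[b] for vote in shared)
            rates[(a, b)] = same / len(shared)
    return rates
```

Pairs with rates close to 1 would fall into the same bloc, while a party whose rates with both blocs change from one legislature to the next would show the kind of movement described for the PS.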

These results caught the attention of journalists from Visão magazine, who decided to publish a story based on this study developed by LIP researchers and colleagues. This coverage encouraged more people to request access to the corpus in order to carry out journalistic work. The next step will be to make it known to the general public, so that everyone knows that this tool exists and is more than capable of contributing to innovation in the way political science is done in Portugal.