“The goal is to be able to automatically read enciphered, historical manuscripts”

2019-03-26

Beáta Megyesi, senior lecturer in Computational Linguistics at the Department of Linguistics and Philology.

Meet Beáta Megyesi. As Senior Lecturer in Computational Linguistics, she is head of research for an interdisciplinary project to develop research in historical cryptology that has modern applications. Find out what the project is about.

“We develop automated methods for reading secret, enciphered, historical manuscripts for which we no longer have an encryption key. The goal is to enable a historian, for example, to upload an enciphered manuscript and automatically have the writing rendered into plain text. To do this, we need experts from various disciplinary domains – from cryptographers and computer scientists to historians and linguists.”

Are there many enciphered, historical manuscripts?
“We don’t really know how many there are, among other things because the enciphered manuscripts that we cannot read are difficult to categorise by libraries. But historians usually estimate that about one per cent of historical texts are enciphered, and most are currently unreadable. And people probably chose to encrypt the most significant information, making this important for historical research.”

What are the enciphered manuscripts usually about?
“Intelligence information for people in power, but also gossip. For example, the Pope’s envoy might write home to the Vatican about who partied with whom at a royal court. And it could be extremely important information for political purposes. The best modern comparison might be WikiLeaks material.”

How will you go about it?
“The first step is to collect ciphertexts and authentic keys via a website. That is easier said than done because they are hard to find. Many keys have also been destroyed for security reasons or stored in various places without any connection to the enciphered communication. If there is no contemporary key but someone has solved the encryption, we also want a description of how the encryption was broken, because we automate what cryptographers do.”

In 2011 researchers succeeded in reading the Copiale Cipher, an enciphered, handwritten manuscript from the 1730s which brought to light a text related to the secret society “the Oculist order”. The solution of the Copiale Cipher gave inspiration to building research into historical cryptology.

What is the next step in the project?
“The project includes two major technical challenges. First, we need to input the manuscript digitally, using image analysis from a photograph of the manuscript. This is not as easy as it might sound, because the manuscripts are handwritten and can contain many different types of symbols, numbers, letters and diacritical signs from different alphabets and also unique symbols taken from the zodiac or alchemy.

The next challenge is to automatically solve the encryption and produce a readable text. This means, for example, understanding the method used to encrypt the text and the language in which the underlying text is written. That’s probably the most difficult challenge.

But we also need to find ways to automatically solve other problems, such as spelling and grammar that are not uniform in historical texts. It was also common in the past, exactly like today, to mix different languages. In addition, language changes over time, so we need different language models, depending on the manuscript’s age.”
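To make the idea of solving a cipher with the help of a language model concrete, here is a minimal sketch in Python. It is not the project's method. It assumes a simple Caesar shift, which is far easier than the historical ciphers described above, and it uses a tiny table of approximate English letter frequencies as a stand-in for a language model. The names `shift_text`, `log_score` and `crack_caesar`, and the frequency values, are illustrative assumptions.

```python
import math

# Approximate English letter frequencies (illustrative values, an assumption
# for this sketch; a real system would use a model fitted to period texts).
ENGLISH_FREQ = {
    "e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
    "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060, "d": 0.043,
    "l": 0.040, "u": 0.028,
}
FLOOR = 0.005  # assumed probability for every letter not listed above


def shift_text(text, shift):
    """Move each lowercase letter back by `shift` positions in the alphabet."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
        else:
            out.append(ch)  # leave spaces and other characters untouched
    return "".join(out)


def log_score(text):
    """Sum of log-probabilities of the letters under the frequency table."""
    return sum(
        math.log(ENGLISH_FREQ.get(ch, FLOOR)) for ch in text if "a" <= ch <= "z"
    )


def crack_caesar(ciphertext):
    """Try all 26 shifts and keep the candidate that looks most like English."""
    candidates = [(log_score(shift_text(ciphertext, s)), s) for s in range(26)]
    _, best_shift = max(candidates)
    return best_shift, shift_text(ciphertext, best_shift)


# Hypothetical example: encipher with a shift of 3, then recover it.
plain = "the envoy sent a secret letter to the court"
cipher = shift_text(plain, -3)
print(crack_caesar(cipher))
```

Swapping in a frequency table for another language or period changes which candidate scores best, which is one reason different language models are needed for manuscripts of different ages.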

Can the methods you develop be used for something else?
“Absolutely. That is perhaps the main point of the project, which involves basic research in a number of areas. Since the project teaches us how to automatically solve a number of text analysis problems, it becomes easier to analyse such texts as students’ national standardised tests in Swedish.”

How can the interpretation of historical manuscripts facilitate analyses of national standardised tests in Swedish?
“Problems in the texts resemble each other. For example, students mix languages, using English words in a Swedish text; they misspell; and they do not always use correct grammar. Historical texts commonly include abbreviations. Punctuation marks like periods and commas occur more rarely, similar to the way we often write in informal contexts today, especially in social media.

All this entails problems for automatic analysis of various, more or less formal, texts, where methods from this research project can help with solutions.”

Can you give more examples of applications?
“The methods can help us to automatically find sensitive information in texts, such as personal names and places, which are often enciphered in a special way, and this in turn can be used in what is called pseudonymisation.”

What is pseudonymisation?
“We often anonymise sensitive information in medical or social sciences research, for example, so it cannot be connected to the individual. But under the law the individuals can subsequently require that the information be erased. If we have anonymised the information, we cannot subsequently determine which information belongs to which people.

Pseudonymisation means masking or replacing sensitive information about the individual, but with a key that makes it possible to reconnect the information to the individual if necessary. These methods become highly relevant under the new General Data Protection Regulation (GDPR) for protecting personal information.”
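The difference between anonymisation and pseudonymisation can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not a description of the project's tools or a GDPR-compliant implementation. The function names `pseudonymise` and `erase_individual`, the record layout and the example names are all hypothetical.

```python
import secrets


def pseudonymise(records, field):
    """Replace `field` in every record with a random pseudonym.

    Returns the masked records and a key (pseudonym -> original value).
    The key is what separates this from anonymisation; in practice it
    would be stored separately from the data and protected.
    """
    key = {}      # pseudonym -> original value
    reverse = {}  # original value -> pseudonym, so one person gets one pseudonym
    masked = []
    for record in records:
        original = record[field]
        if original not in reverse:
            pseudonym = "P-" + secrets.token_hex(4)
            while pseudonym in key:  # avoid an (unlikely) collision
                pseudonym = "P-" + secrets.token_hex(4)
            reverse[original] = pseudonym
            key[pseudonym] = original
        masked.append({**record, field: reverse[original]})
    return masked, key


def erase_individual(masked, key, field, name):
    """Use the key to find and drop every record belonging to `name`."""
    pseudonyms = {p for p, original in key.items() if original == name}
    remaining = [r for r in masked if r[field] not in pseudonyms]
    for p in pseudonyms:
        del key[p]  # the link to the individual is erased as well
    return remaining


# Hypothetical example data.
records = [
    {"name": "Anna", "answer": "yes"},
    {"name": "Erik", "answer": "no"},
    {"name": "Anna", "answer": "maybe"},
]
masked, key = pseudonymise(records, "name")
remaining = erase_individual(masked, key, "name", "Anna")
print(len(masked), len(remaining))
```

Without the key, the masked records could not be traced back to “Anna”, so an erasure request could not be carried out; with the key, her records can be found and removed.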

Facts

The interdisciplinary research project “DECRYPT: Decryption of historical manuscripts” aims to develop a new interdisciplinary research subject in historical cryptology. The project is funded by the Swedish Research Council with a grant of SEK 29.5 million over a seven-year period. Beáta Megyesi, senior lecturer in Computational Linguistics at the Department of Linguistics and Philology at Uppsala University, heads the project.

The project includes researchers from six different disciplines: Computational Linguistics, Cryptology, Image Processing, Computer Science, History and Linguistics.

DECRYPT is a continuation of the previous DECODE research project in which researchers began developing technical solutions for reading enciphered historical manuscripts.

The General Data Protection Regulation (GDPR) went into effect throughout the European Union in May 2018 with the aim of creating a consistent and equal level of protection for personal data.
