Large amounts of data require new tools
The amount of data has increased tremendously over the past ten years. More and more information is stored digitally and is available to many. At the same time, there is a need for tools to analyse all this data and create new knowledge. This is especially true within biology, where new technology has led to an explosion of data.
‘A completely new working method has emerged over the past decade,’ says Ola Spjuth.
He is a researcher at SciLifeLab in Uppsala and leads the Uppnex project, which has built up large computing resources for what’s known as “next-generation sequencing” - i.e., large-scale gene analysis.
The technology makes it possible to quickly obtain the DNA sequence from samples from humans, plants and animals. It is useful in cancer research, pharmaceutical research and biology - and generates lots of data.
‘One run on a single sample can generate billions of bases (the letters A, T, C and G), and it's not like reading a book: it takes days, weeks or in some instances months of calculations,’ says Ola Spjuth.
‘Suddenly, researchers were swimming in data and hard drives were stacked up on the lab benches, so a decision was made to make a joint effort to solve the problems.’
Uppmax was already in place at Uppsala University, with high-performance computers that served researchers in fields such as physics and chemistry. In 2010 the server room was expanded with Uppnex for biological research. It is the part that has expanded the fastest and it’s still growing.
A new server room was recently opened at SciLifeLab, which is directly linked to Uppnex. All in all there is currently data from over 800 different projects and a storage capacity of 7 petabytes, roughly 7,000 times what fits on a typical hard drive.
‘We have recently tripled the computing capacity and increased the storage capacity fivefold, and apart from purchasing computers, we have built up a high level of expertise,’ says Ola Spjuth.
The new technical possibilities have led to many new research results, for example the mapping of the dog and flycatcher genomes. Sequencing is used in medicine to increase knowledge about cancer, hereditary diseases and resistant bacteria.
In practical terms, researchers send their samples to the sequencing platform, which, after sequencing, stores and analyses the results on Uppnex. The researchers then receive a project account where they can log in. They then continue to work on their data on Uppnex, instead of on their own computers.
‘It has been successful as we have a strong focus on the users. This differs from the high-performance computing in physics, where the researchers are more self-sufficient. All of a sudden, we have hundreds of biologists who need to use the technology but do not know much about computers,’ says Ola Spjuth.
He adds that they have invested a lot in support and training. Together with SciLifeLab, they offer a course where researchers learn the basics of using large-scale computer systems and try logging in to and using Uppnex.
‘It is usually overbooked.’ A great many research groups now employ bioinformaticians, but research leaders also need to understand how it works.
Many projects run for a long time. Mapping the genome of an organism, for example, is just a starting point for further studies. Ola Spjuth expects the data volume to continue to increase.
‘Projects are getting bigger and more people want to sequence. At the same time, the process is faster and we can get more and more data. Projects expand 5-10 times on the drives during the analyses and biologists like to save all their data. It’s a massive challenge to be able to scale up storage and analyses.’
Consequently, developments place new demands on the research infrastructure. This concerns both the ability to store data and to analyse the information.
‘We will need to develop new methods and tools,’ says Ingela Nyström.
She is a professor at the Department of Information Technology and coordinator of Essence, a strategic research initiative that is run from Uppsala University. Lund and Umeå Universities are also involved in the initiative.
‘Essence assembles researchers who wish to improve their research with e-scientific methods. Strong research both in the field and in method development is required in order for this to work,’ says Ingela Nyström.
She sees research being able to answer completely new questions now that it’s possible to process larger amounts of data.
‘But also old problems: researchers will now be able to revisit problems that were set aside ten years ago. Previously it might have been possible to study 100 molecules when perhaps 1 million molecules were needed to get a realistic picture. Ten years ago this was not possible, but today we can do a lot more full-scale experiments.’
Essence part-funds 25-30 different projects and invests SEK 26 million each year in research within a wide range of fields, from materials physics to linguistics. Common to all these is that they make use of large amounts of data, but also that they process the data in an advanced manner.
‘E-science is more than standard methods. All of our researchers in Essence use one of the computer centres and need more than what is on the desktop.’
For example, computer support is needed to sort out what is relevant data and to quickly find significant information. As for calculations, it is a question of performing as much as possible in parallel while keeping track of the calculations so that any errors are kept under control.
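As a rough illustration of running calculations in parallel while keeping track of errors, here is a minimal Python sketch. It is not taken from any Essence project: the function `analyse`, the sample values and the worker count are all hypothetical, and a thread pool is used only to keep the example short (CPU-heavy work would more likely use separate processes or a cluster scheduler).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyse(sample):
    """Hypothetical stand-in for one independent calculation."""
    if sample < 0:
        raise ValueError(f"invalid sample: {sample}")
    return sample * sample


def run_all(samples, workers=4):
    """Run every calculation in parallel; record successes and failures separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Remember which input each submitted job belongs to.
        futures = {pool.submit(analyse, s): s for s in samples}
        for future in as_completed(futures):
            sample = futures[future]
            try:
                results[sample] = future.result()
            except Exception as exc:
                # One failed calculation is logged here and does not stop the rest.
                errors[sample] = exc
    return results, errors


results, errors = run_all([1, 2, -3, 4])
```

After the run, `results` holds the calculations that finished and `errors` shows which input failed and why, so a single bad input neither goes unnoticed nor brings down the whole batch.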
The vision for Essence is to build a “toolbox” for researchers, which can be used to customise solutions according to the problem to be solved. And here, researchers can benefit from collaboration and sharing with each other.
‘If methods have been created that work for one problem, perhaps they’ll work on another,’ says Ingela Nyström.
One of the experts in the field is Sverker Holmgren, professor at the Department of Information Technology. He has researched computer-based tools and methods for some time, and in recent years needs have changed radically. Initially it was a question of smart calculations and simulations; now it is also about how large amounts of data are to be managed and analysed.
People talk about “big data”, and the explosion of data from many different sources is a new challenge for IT researchers.
‘Computer simulations are an established activity; now we need to develop the analysis of data as well as ways to store and manage data. This requires metadata that describes the data, and agreement on how data should be marked up. It's a whole new world!’
Data should not only be saved, but also made available for research. He has links to the “Research Data Alliance”, a global project brought about to build an “internet” for researchers where research data can be stored and simultaneously made available to others.
‘This requires data to be marked up in the same manner with a common standard.’
He sees serious challenges ahead, and above all, it is about working in a more interdisciplinary way.
‘We need completely new tools and we need to join the different areas of application with computing science and mathematics. Essence plays an important role here.’
The actual base, the research infrastructure, is the same across disciplines, but each research field needs to develop its own methods and tools on top of it.
‘The hard drives and the computers are the same, but the further you move up, the more specific different areas become. In the next few years we will need to develop a completely new type of tool, and this will demand more than a short-term approach.’
These thoughts are also shared by Ola Spjuth at SciLifeLab. He has researched the future of biological research and what is required to keep up with developments.
‘Biologists require much more storage space than traditional users. They frequently work with a large number of smaller sub-problems that require a great deal of primary storage to process. They are also more impatient: while physicists are accustomed to the time it takes, biologists want it to go fast.’
One way to store large amounts of data is to do as Google does and spread the data across all computers. You can then send calculations to different places and calculate in parallel.
‘Within research it is not as easy to divide information, as much is interrelated, for example within a chromosome, so it requires more advanced methods in order to be useful.’
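The idea of spreading data out and calculating in parallel, and the reason it gets harder when the data is interrelated, can be sketched in a few lines of Python. This is an illustrative toy, not how Uppnex or Google actually work: the sequence is made up, the function names are hypothetical, and the "machines" are simply slices of a string processed one after another on a single computer.

```python
from collections import Counter
from functools import reduce


def split(sequence, n_shards):
    """Cut a sequence into contiguous pieces, one per (imaginary) machine."""
    size = -(-len(sequence) // n_shards)  # ceiling division
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def count_bases(shard):
    """'Map' step: each machine counts the bases in its own piece."""
    return Counter(shard)


def merge(a, b):
    """'Reduce' step: partial counts are simply added together."""
    return a + b


sequence = "ATGCGATACGCTTGA"  # made-up example data
shards = split(sequence, 3)   # three pieces of five bases each
total = reduce(merge, map(count_bases, shards), Counter())

# Counting single bases splits cleanly: `total` matches a count over the
# whole sequence. A pattern that straddles a boundary between pieces does not.
motif = "GAT"
whole = sequence.count(motif)                            # search the full sequence
per_shard = sum(shard.count(motif) for shard in shards)  # search each piece alone
```

Here the base counts from the three pieces add up to the right answer, but the motif `GAT` lies across the boundary between the first and second piece, so searching each piece separately misses it. Handling that kind of dependency, for example by letting pieces overlap, is one instance of the more advanced methods the quote refers to.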
Today researchers must keep up with developments.
‘We are trying to keep at the forefront with the methods we employ. If we do not have the latest software available, then Swedish researchers will automatically fall half a year behind those at the forefront of research,’ says Ola Spjuth.
---
FACTS/Computer resources at Uppsala
UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science) is Uppsala University's resource for high performance computers, large-scale storage and expertise in high performance computer usage. It was established in 2003 as one of six centres within the national infrastructure SNIC (Swedish National Infrastructure for Computing), which Uppsala University hosts.
UPPNEX stands for “UPPmax NEXt generation sequencing Cluster & Storage”, and is a project at UPPMAX that offers computation and storage resources as a national resource within next-generation sequencing (NGS), primarily within Science for Life Laboratory (SciLifeLab).
eSSENCE is a strategic research programme in e-Science that is run as a collaboration between Uppsala University, Lund University and Umeå University. eSSENCE was initiated by the government to support research that is strategically important for society and industry. The vision is to lift Swedish e-Science to the highest international level, by building a creative research environment where new tools and applications are developed. eSSENCE also interacts with industry.
Annica Hulth