Equipment, tools, and applications

Below are some of the resources and equipment at CDHU which are used in projects and workshops. You are also welcome to visit our GitHub page for further resources we have been working on at CDHU and with our collaborators. Feel free to contact us regarding your project or research needs, and keep an eye on our workshop events!

VR equipment

The CDHU has three Meta Oculus Quest 2 headsets, which can be used for custom 3D virtual reality visualisations as well as for exploring VR representations in popular culture (VR games and experiences).

 

NodeGoat GO and NodeGoat server

The NodeGoat server hosted at the CDHU runs an installation of NodeGoat that supports multiple parallel projects, each accessible to multiple users. The CDHU administers the research environments, in which users can model data and configure datasets collaboratively or alone. NodeGoat supports geographic and temporal visualisation and includes a built-in network analysis tool.

 

Recogito and Recogito server

Semantic annotation without the pointy brackets! Work on texts and images, identify and mark named entities, use your data in other tools or connect to other data on the Web. Recogito offers semantic annotation and connections to online data without the need to learn to code, as well as a customised connection to gazetteers for Mediterranean archaeology.

 

Computation and storage

Local computation and storage capabilities at the CDHU are housed on three servers: “Beast” and “Aurora”, which are used internally for fast computation and information processing, and “Beauty”, which is used for network attached storage.

  • 1 Dell workstation/server (“Beast”), used internally for computation in the CDHU technical infrastructure. CPU: 2x Intel(R) Xeon(R) Gold 6240R @ 2.40GHz, 96 threads. GPU: 2x Nvidia Turing T4, 16GB VRAM. RAM: 64GB. Storage: Disk array set up in ZFS mirror mode, providing around 12TB of usable storage.
  • 1 Dell workstation/server (“Beauty”), used for network attached storage in the CDHU technical infrastructure. CPU: Intel(R) Xeon(R) Gold 6226R @ 2.90GHz, 32 threads. RAM: 64GB. Storage: Disk array set up in ZFS parity 3, providing 50TB of usable storage.
  • 1 Dell workstation/server (“Aurora”), used internally for computation at the CDHU. CPU: 2x Intel(R) Xeon(R) Platinum 8260 @ 2.40GHz, 96 threads. GPU: 3x Nvidia RTX A5000, 24GB VRAM. RAM: 1TB. Storage: 15TB in total.

 

Document scanner and paper guillotine

Research conducted at the CDHU in collaboration with the Department of History of Science and Ideas currently uses a Canon G2090 document scanner for fast mass scanning of document stacks up to A3 size. The machine scans at 300-600 dpi, at ~200 pages per minute, using an automatic document feeder. Used together with the guillotine, which removes the spines from books and bound volumes, it enables fast, though destructive, digitisation of cultural heritage materials.

 

Sketchfab

Sketchfab is a 3D asset website used to publish, share, discover, buy and sell 3D, VR and AR content. It provides a viewer based on WebGL and WebXR that displays 3D models on the web, viewable in any mobile browser, desktop browser or virtual reality headset. Check out our Sketchfab, where you can find some of the 3D models we've developed. The models include 3D scans of environments, primarily archaeological excavations, that are part of research projects in which Uppsala University is involved. You can also find reconstructions of historical environments, such as the fortified Late Helladic I settlement at Malthi in northern Messenia, Greece, and the 18th-century village of Ekeby outside Uppsala.

For a full overview of CDHU software, scripts and models go to our GitHub page.

AttentionHTR model

AttentionHTR is an attention-based sequence-to-sequence model for handwritten word recognition. To overcome the scarcity of training data, this work uses models pre-trained on scene text images as a starting point for tailoring the handwriting recognition models. Source code and pre-trained models are available on GitHub.

 

Marginalia and machine learning (PyTorch)

To detect and read handwritten notes in document margins, we have a PyTorch implementation of a Handwritten Text Recognition (HTR) system that focuses on the automatic detection and recognition of handwritten marginalia. A Faster R-CNN network detects the marginalia and AttentionHTR recognises the words. The data come from early printed book collections with handwritten marginalia held in the Uppsala University Library. Source code and pre-trained models are available on GitHub. This is a work in progress.

 

Word rain: Semantically motivated word clouds

Word rain is a software library for text visualisation, built on shallow and deep neural networks. It is a work in progress led by the CDHU in collaboration with Språkbanken Sam. Source code is available on GitHub.

 

Libralinked: Modelisation of Scandinavian library data

These scripts scrape the website of the National Library of Sweden, generate interactive graphics from the results, and display them as HTML. The graphic generation can also be generalised to other data, provided the data are sufficiently structured, for instance in a CSV file.

 

Epub text-extraction tool

The EPUB text-extraction tool extracts textual data from EPUB books. The scripts convert EPUB files into plain-text (.txt) files and compute basic statistics such as the number of words and the most frequent words.
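
As a rough illustration of the kind of processing described above, and not the CDHU tool itself, the following minimal Python sketch treats an EPUB as what it is, a ZIP archive of (X)HTML files, and strips the markup to get plain text. The function names `epub_to_text` and `word_stats` are hypothetical, and the sketch reads files in archive name order rather than following the reading order declared in the EPUB package.

```python
import re
import zipfile
from collections import Counter
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text content, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def epub_to_text(epub_path):
    """Return the concatenated text of all (X)HTML files in an EPUB.

    Simplification (assumption of this sketch): files are read in archive
    name order, not in the reading order given by the package document.
    """
    chunks = []
    with zipfile.ZipFile(epub_path) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = _TextCollector()
                markup = archive.read(name).decode("utf-8", errors="replace")
                collector.feed(markup)
                chunks.append(" ".join(collector.parts))
    return "\n".join(chunks)


def word_stats(text, top_n=10):
    """Return (total word count, list of the top_n most frequent words)."""
    words = re.findall(r"\w+", text.lower())
    return len(words), Counter(words).most_common(top_n)
```

The text returned by `epub_to_text` can be written to a .txt file and passed to `word_stats`; a real corpus would probably also need stop-word filtering before the frequency list becomes informative.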

 

Scripts and notebooks for scraping SOU PDFs

This code scrapes the URLs of all the PDFs from Kungliga Biblioteket (the National Library of Sweden) and outputs a CSV file. The repository includes a notebook that turns the CSV file into a download script; it also sanitises and normalises filenames.
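
The step from a CSV of URLs to a download script with cleaned-up filenames could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the repository's notebook: the column name `url`, the use of `curl`, and the function names `sanitise_filename` and `build_download_script` are all hypothetical.

```python
import csv
import io
import re
import shlex
import unicodedata
from urllib.parse import unquote, urlsplit


def sanitise_filename(url):
    """Derive a safe, normalised local filename from a URL."""
    # Take the last path segment and decode percent-escapes.
    name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    # Decompose accented characters and drop the combining marks (ä -> a).
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Replace every run of other characters with a single underscore.
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("._").lower()
    return name or "unnamed.pdf"


def build_download_script(csv_text, url_column="url"):
    """Turn CSV rows into a shell script with one curl call per URL.

    The column name "url" is an assumption of this sketch.
    """
    lines = ["#!/bin/sh", "set -eu"]
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = row[url_column].strip()
        if url:
            target = shlex.quote(sanitise_filename(url))
            lines.append(f"curl -fL -o {target} {shlex.quote(url)}")
    return "\n".join(lines) + "\n"
```

Two different URLs can map to the same sanitised name, so a production version would need to detect collisions, for example by appending a counter.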

 

Scripts for optical character recognition in batches

This repository contains various scripts and tools for preparing PDFs (bursting, converting, renaming) and running OCR on them using Tesseract OCR. We also have an OCR program based on Pytesseract, a Python wrapper for Tesseract. It includes language models to improve OCR performance.
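
A batch workflow of this kind can be sketched in a few lines of Python. The example below is a hedged illustration rather than the repository's code: it assumes the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed, uses the Swedish language model `swe` as a placeholder default, and the function name `ocr_pdfs` is hypothetical. Each PDF is first split into one image per page, and each page image is then passed to Tesseract.

```python
import subprocess
from pathlib import Path


def ocr_pdfs(pdf_paths, out_dir, lang="swe", dpi=300, run=subprocess.run):
    """Rasterise each PDF to page images, then OCR every page.

    Returns the text files Tesseract is expected to write. The `run`
    parameter exists only so the external calls can be swapped out
    when testing; it defaults to subprocess.run.
    """
    text_files = []
    for pdf in map(Path, pdf_paths):
        # One working directory per PDF keeps page images apart.
        page_dir = Path(out_dir) / pdf.stem
        page_dir.mkdir(parents=True, exist_ok=True)
        # pdftoppm writes one image per page: page-1.png, page-2.png, ...
        # (page numbers are zero-padded for longer documents).
        run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(page_dir / "page")],
            check=True,
        )
        for image in sorted(page_dir.glob("page-*.png")):
            base = image.with_suffix("")
            # Tesseract appends ".txt" to the output base itself.
            run(["tesseract", str(image), str(base), "-l", lang], check=True)
            text_files.append(base.with_suffix(".txt"))
    return text_files
```

Calling `ocr_pdfs(["report.pdf"], "ocr_out")` would leave the page images and their text files under `ocr_out/report/`. The resolution and language settings usually need tuning to the material being scanned.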

 

BerryBERT

BerryBERT is a BERT text classification model for Finnish OCR texts, originally used for research on the commodification of wild lingonberries. The work is part of the CDHU's Pilot Projects 2021-2022, in a project titled "Text Mining Commodification: The Geography Of the Nordic Lingonberry Rush, 1860-1910". Source code is available on GitHub.
