Karolina Smolinska Garbulowska: Elucidation of complex diseases by machine learning
- Date: 26 January 2022, 13:00
- Location: A1:111A, Biomedicinskt centrum (BMC), Husargatan 3, Uppsala
- Type: Thesis defence
- Respondent: Karolina Smolinska Garbulowska
- Opponent: Tuuli Lappalainen
- Supervisors: Jan Komorowski, Claes Wadelius
- Research subject: Bioinformatics
- DiVA
Abstract
Making models of complex health-related problems interpretable is a crucial task that is often neglected in machine learning (ML). The sheer amount of available data complicates the problem further. The focal point of my research was building and applying specialized tools that identify relevant descriptors (features and their values). These tools cover a spectrum of methods that originate in ML, statistics and network visualization.
In the first part of the thesis, we predicted regulatory elements with a potential impact on gene expression by incorporating several annotation tracks. Then, we created the funMotifs framework, which enables the identification and analysis of functional transcription factor (TF) motifs in a tissue-specific manner (Paper I). The TF motifs were described by different chromatin signals from various genomics platforms. Afterwards, these signals were combined into a functional score for each motif using logistic regression.
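To illustrate how logistic regression can fold several chromatin signals into one score, here is a minimal sketch. It is not the funMotifs implementation: the signal names, weights and intercept are hypothetical values chosen only for illustration, whereas in practice the coefficients would be fitted to training data.

```python
import math

# Hypothetical coefficients for illustration only; not taken from funMotifs.
WEIGHTS = {"dnase": 1.2, "tf_chip": 2.0, "h3k27ac": 0.8}
INTERCEPT = -2.5


def functional_score(signals):
    """Combine per-motif signals into a score between 0 and 1.

    `signals` maps a signal name to its value; missing signals count as 0.
    """
    z = INTERCEPT + sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
    # Logistic (sigmoid) function maps the linear combination to (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    motif = {"dnase": 1.0, "tf_chip": 1.0}
    print(round(functional_score(motif), 3))
```

A motif supported by more signals receives a higher score, so a threshold on the score could separate candidate functional motifs from the rest.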
Subsequently, funMotifs was used to characterize a map of regulatory mutations and regulatory elements in 37 cancer types from 2,515 samples (Paper II). We were able to identify 5,749 mutated regulatory elements containing 11,962 regulatory mutations. Additionally, we identified several dysregulated cancer-associated genes near the mutated elements. Finally, enrichment of cancer-related pathways was observed for the genes linked with the mutated elements.
In the second part, we focused on interpretable ML modeling with rule-based classifiers. A rule-based model (RBM) consists of a set of IF-THEN rules, which are legible and make it possible to identify combinations of descriptors. To analyze RBMs, we created the R.ROSETTA R package, a wrapper around ROSETTA (Paper III). As a result, R.ROSETTA gained several additional functionalities that simplify the validation and interpretation of RBMs.
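The idea of an RBM as a set of IF-THEN rules can be sketched in a few lines. This is a toy illustration, not the R.ROSETTA API: the rules, the feature names and the simple majority vote are all assumptions made for brevity.

```python
from collections import Counter

# Hypothetical rules: (IF conditions, THEN decision).
# Each condition is a descriptor, i.e. a feature paired with a value.
RULES = [
    ({"gene_A": "high", "gene_B": "low"}, "case"),
    ({"gene_C": "high"}, "control"),
    ({"gene_A": "high"}, "case"),
]


def matches(conditions, sample):
    """Return True if the sample satisfies every condition of a rule."""
    return all(sample.get(feature) == value for feature, value in conditions.items())


def classify(sample, rules=RULES, default=None):
    """Classify a sample by majority vote over the rules that fire.

    Majority voting is an assumption of this sketch; other voting
    schemes are possible.
    """
    votes = Counter(decision for conditions, decision in rules if matches(conditions, sample))
    if not votes:
        return default
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify({"gene_A": "high", "gene_B": "low", "gene_C": "low"}))
```

Because every prediction can be traced back to the rules that fired, the combinations of descriptors behind a decision remain readable.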
Visual inspection of RBMs is an essential step towards identifying the interesting descriptors of a classifier. To support the analysis of complex RBMs, we created VisuNet, an R tool for rule network (RN) visualization (Paper IV). These networks are constructed from the IF-THEN rules that constitute an RBM: nodes are the descriptors in the rules, and an edge connects two nodes if the corresponding descriptors occur in the same rule. By creating an RN for an RBM, we are able to use network concepts to analyze complex health-related processes. We applied VisuNet to various datasets to illustrate the properties of the tool.
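The rule network construction described above can be sketched as follows. This is not VisuNet code: the function name, the example rules and the choice of edge weight (the number of rules in which two descriptors co-occur) are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations


def build_rule_network(rules):
    """Build a rule network from the IF parts of a set of rules.

    Each rule is a dict mapping a feature to a value. Nodes are
    descriptors ("feature=value"); an edge joins two descriptors that
    occur in the same rule. The edge weight counts co-occurrences,
    which is an assumption of this sketch.
    """
    nodes = set()
    edges = defaultdict(int)
    for conditions in rules:
        descriptors = sorted(f"{feature}={value}" for feature, value in conditions.items())
        nodes.update(descriptors)
        # Every pair of descriptors within one rule becomes an edge.
        for a, b in combinations(descriptors, 2):
            edges[(a, b)] += 1
    return nodes, dict(edges)


if __name__ == "__main__":
    example_rules = [
        {"gene_A": "high", "gene_B": "low"},
        {"gene_A": "high", "gene_C": "high"},
        {"gene_C": "high"},
    ]
    nodes, edges = build_rule_network(example_rules)
    print(sorted(nodes))
    print(edges)
```

With the network in this form, standard graph measures such as node degree can point to descriptors that take part in many rules.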
In our studies, we showed the importance of identifying relevant descriptors for biological problems. Moreover, our methods may contribute to a better understanding of complex diseases.