David Black-Schaffer
Professor at the Department of Information Technology, Division of Computer Systems
- Telephone:
- +46 18 471 68 30
- Mobile phone:
- +46 76 824 20 17
- E-mail:
- david.black-schaffer@it.uu.se
- Visiting address:
- Hus 10, Regementsvägen 10
- Postal address:
- Box 337
751 05 UPPSALA
- Academic merits:
- Docent
- CV:
- Download CV
- ORCID:
- 0000-0002-8250-8574
Short presentation
My research focuses on approaches for moving data more efficiently in computer systems, using both software and hardware techniques. Our results have been commercialized through a startup and incorporated into industry standards. Prior to joining Uppsala, I contributed to the OpenCL standard while working at Apple, Inc. I have received multiple teaching awards and led a startup whose platform has helped over 80,000 students.
As of June 2023 I am the Dean of Research for the Faculty of Science and Technology.
Keywords
- performance
- computer architecture
- memory systems
- simulation
- runtimes
- scheduling
- efficiency
- active teaching
- commercialization
Biography
I received my PhD in Electrical Engineering from Stanford University in 2008. My PhD thesis was on programming for real-time embedded processing on many-core processors, carried out in the Concurrent VLSI Architecture Group working with William Dally. After my PhD I worked at Apple on the development of the first OpenCL implementation for heterogeneous parallel processing across CPUs and GPUs, and then as a postdoctoral researcher in computer architecture in the Department of Information Technology at Uppsala University. I was appointed assistant professor in 2010 in the architecture research group at Uppsala, working on parallel programming systems and optimizations as part of the UPMARC research center. I received the docent title in 2014 and was promoted to full professor in 2017.
At Uppsala University, I was the Research Responsible Professor for the Computer Architecture and Communications Systems program from 2020 to 2022, and I have been the head of the Division of Computer Systems since 2022 and the department representative to the faculty Advisory Committee for Research since 2021.
I have been very active in flipped-classroom teaching. In particular, I led the ScalableLearning project from 2012 to 2020, which developed an online system to support at-home and in-class flipped-classroom teaching used by over 80,000 students. My active teaching techniques have been recognized by the Uppsala Engineering and Science Student Union Pedagogical Prize (2012), the Uppsala University Pedagogical Prize (2016), and the Uppsala Technical Physics Students' Teaching Award (2019).
I have also worked to bring my research results into industry, both through startups and industrial collaboration. My colleague Erik Hagersten and I commercialized our power-efficient memory system designs through a startup, which was subsequently acquired by a major international corporation. I have also worked with my colleague Chang Hyun Park and collaborators at Arm Ltd. in the UK to get our memory system designs into the specification for future Arm processors.
Grants and Awards
- Knut and Alice Wallenberg Foundation, Wallenberg Academy Fellowship Prolongation (2020-2025)
- Swedish Research Council (VR) Project Grant (2019-2024)
- European Research Council ERC Starting Grant (2017-2022)
- Uppsala University Pedagogical Prize (2016)
- Swedish Foundation for Strategic Research (SSF), Smart Systems Framework Grant (Co-PI, 2016-2021) Automating System SpEcific Model-Based LEarning (ASSEMBLE)
- Knut and Alice Wallenberg Foundation, Wallenberg Academy Fellow (2016-2021)
- Swedish Research Council (VR), Young Researcher Project Grant (2015-2018)
- Swedish Foundation for Strategic Research (SSF), Future Research Leaders (2013-2018)
- EU FP7, Addressing Energy in Parallel Technologies (Co-PI, 2013-2016)
- Uppsala University, Pedagogical Development Grant for Flipped Classroom (2013)
- Swedish Research Council (VR), Framework Grant (Co-PI, 2012-2017)
- Uppsala Union of Engineering and Science Students, Teaching Award (2012)
- Stanford University, Centennial Teaching Assistant Award (2004)
- Stanford University, Hugh Hildreth Skilling Teaching Assistant Award (2003)
Teaching
- Computer Architecture 1 (To view the interactive online course lectures, register at ScalableLearning and join with the enrollment key YRLRX-25436.)
- Sample: Introduction to Digital Logic Design (88 minutes)
- Sample: Introduction to Virtual Memory (70 minutes)
- Parallel Programming for Efficiency (MSc level)
- Sample: Power and Energy in Computer Systems (52 minutes)
- Introduction to Computer Architecture Research (PhD level)
Presentations
- Predicting Next-Generation Multicore Performance in a Fraction of a Second (Keynote, SICS Multicore Day, 2015)
- GPUs: The Hype, The Reality, and the Future (Keynote, SICS Multicore Day, 2013) PDF
- Flipped Classroom Teaching in an Introductory CS Course (KTH, 2013) PDF
- Resource Sharing in Multicore Processors (Keynote, Ericsson Software Research Day 2011)
- Introduction to OpenCL PDF
- Optimizing OpenCL PDF
- GPU Architectures for Non-Graphics People PDF

Publications
Recent publications
CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM
Part of Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2026
- DOI for CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM
- Download full text (pdf) of CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM
Hiding Page Fault Latencies in Graph Processing Applications that Cannot Fit in Memory
Part of 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2026, 2026
Mark-Scavenge: Waiting for Trash to Take Itself Out
Part of Proceedings of the ACM on Programming Languages, 2024
- DOI for Mark-Scavenge: Waiting for Trash to Take Itself Out
- Download full text (pdf) of Mark-Scavenge: Waiting for Trash to Take Itself Out
Mutator-Driven Object Placement using Load Barriers
Part of Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes (MPLR 2024), p. 14-27, 2024
- DOI for Mutator-Driven Object Placement using Load Barriers
- Download full text (pdf) of Mutator-Driven Object Placement using Load Barriers
All publications
Articles in journal
Mark-Scavenge: Waiting for Trash to Take Itself Out
Part of Proceedings of the ACM on Programming Languages, 2024
- DOI for Mark-Scavenge: Waiting for Trash to Take Itself Out
- Download full text (pdf) of Mark-Scavenge: Waiting for Trash to Take Itself Out
Exploring the Latency Sensitivity of Cache Replacement Policies
Part of IEEE Computer Architecture Letters, p. 93-96, 2023
- DOI for Exploring the Latency Sensitivity of Cache Replacement Policies
- Download full text (pdf) of Exploring the Latency Sensitivity of Cache Replacement Policies
Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores
Part of ACM Transactions on Architecture and Code Optimization (TACO), 2022
A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006
Part of ACM Transactions on Architecture and Code Optimization (TACO), 2021
Early Address Prediction: Efficient Pipeline Prefetch and Reuse
Part of ACM Transactions on Architecture and Code Optimization (TACO), 2021
- DOI for Early Address Prediction: Efficient Pipeline Prefetch and Reuse
- Download full text (pdf) of Early Address Prediction: Efficient Pipeline Prefetch and Reuse
Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit
Part of Journal of Signal Processing Systems, p. 379-397, 2019
- DOI for Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit
- Download full text (pdf) of Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit
Analyzing performance variation of task schedulers with TaskInsight
Part of Parallel Computing, p. 11-27, 2018
Exploring scheduling effects on task performance with TaskInsight
Part of Supercomputing frontiers and innovations, p. 91-98, 2017
Part of IEEE Transactions on Computers, p. 3537-3551, 2016
Part of Svenska Dagbladet, 2013
Chapters in book
Efficient cache modeling with sparse data
Part of Processor and System-on-Chip Simulation, p. 193-209, Springer, 2010
Conference papers
CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM
Part of Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2026
- DOI for CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM
- Download full text (pdf) of CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM
Hiding Page Fault Latencies in Graph Processing Applications that Cannot Fit in Memory
Part of 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2026, 2026
Mutator-Driven Object Placement using Load Barriers
Part of Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes (MPLR 2024), p. 14-27, 2024
- DOI for Mutator-Driven Object Placement using Load Barriers
- Download full text (pdf) of Mutator-Driven Object Placement using Load Barriers
Protean: Resource-efficient Instruction Prefetching
Part of The International Symposium on Memory Systems (MEMSYS '23), p. 1-13, 2023
- DOI for Protean: Resource-efficient Instruction Prefetching
- Download full text (pdf) of Protean: Resource-efficient Instruction Prefetching
Large-scale Graph Processing on Commodity Systems: Understanding and Mitigating the Impact of Swapping
Part of The International Symposium on Memory Systems (MEMSYS '23), p. 1-11, 2023
- DOI for Large-scale Graph Processing on Commodity Systems: Understanding and Mitigating the Impact of Swapping
- Download full text (pdf) of Large-scale Graph Processing on Commodity Systems: Understanding and Mitigating the Impact of Swapping
Faster Functional Warming with Cache Merging
Part of Proceedings of System Engineering for Constrained Embedded Systems, DroneSE and RAPIDO 2023, p. 39-47, 2023
Every Walk's a Hit: Making Page Walks Single-Access Cache Hits
Part of Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’22), February 28 – March 4, 2022, Lausanne, Switzerland, 2022
- DOI for Every Walk's a Hit: Making Page Walks Single-Access Cache Hits
- Download full text 1 (pdf) of Every Walk's a Hit: Making Page Walks Single-Access Cache Hits
- Download full text 2 (pdf) of Every Walk's a Hit: Making Page Walks Single-Access Cache Hits
Architecturally-independent and time-based characterization of SPEC CPU 2017
Part of 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), p. 107-109, 2020
- DOI for Architecturally-independent and time-based characterization of SPEC CPU 2017
- Download full text 1 (pdf) of Architecturally-independent and time-based characterization of SPEC CPU 2017
- Download full text 2 (pdf) of Architecturally-independent and time-based characterization of SPEC CPU 2017
Modeling and Optimizing NUMA Effects and Prefetching with Machine Learning
Part of ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, 2020
- DOI for Modeling and Optimizing NUMA Effects and Prefetching with Machine Learning
- Download full text (pdf) of Modeling and Optimizing NUMA Effects and Prefetching with Machine Learning
Perforated Page: Supporting Fragmented Memory Allocation for Large Pages
Part of Proceedings of the 47th Annual ACM/IEEE International Symposium on Computer Architecture (ISCA), p. 913-925, 2020
- DOI for Perforated Page: Supporting Fragmented Memory Allocation for Large Pages
- Download full text (pdf) of Perforated Page: Supporting Fragmented Memory Allocation for Large Pages
Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors
Part of 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), p. 424-434, 2020
Efficient temporal and spatial load to load forwarding
Part of Proc. 26th International Symposium on High-Performance Computer Architecture, 2020
Efficient thread/page/parallelism autotuning for NUMA systems
Part of ICS '19, p. 342-353, 2019
- DOI for Efficient thread/page/parallelism autotuning for NUMA systems
- Download full text (pdf) of Efficient thread/page/parallelism autotuning for NUMA systems
Filter caching for free: The untapped potential of the store-buffer
Part of Proc. 46th International Symposium on Computer Architecture, p. 436-448, 2019
- DOI for Filter caching for free: The untapped potential of the store-buffer
- Download full text (pdf) of Filter caching for free: The untapped potential of the store-buffer
FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors
Part of 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), p. 716-721, 2019
- DOI for FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors
- Download full text (pdf) of FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors
Freeway: Maximizing MLP for Slice-Out-of-Order Execution
Part of 2019 25th IEEE International Symposium On High Performance Computer Architecture (HPCA), p. 558-569, 2019
- DOI for Freeway: Maximizing MLP for Slice-Out-of-Order Execution
- Download full text (pdf) of Freeway: Maximizing MLP for Slice-Out-of-Order Execution
Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs
Part of Proc. International Symposium on Performance Analysis of Systems and Software, p. 1-11, 2018
- DOI for Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs
- Download full text (pdf) of Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs
Tail-PASS: Resource-based Cache Management for Tiled Graphics Rendering Hardware
Part of Proc. 16th International Conference on Parallel and Distributed Processing with Applications, p. 55-63, 2018
Dynamically Disabling Way-prediction to Reduce Instruction Replay
Part of 2018 IEEE 36th International Conference on Computer Design (ICCD), p. 140-143, 2018
Adaptive cache warming for faster simulations
Part of Proc. 9th Workshop on Rapid Simulation and Performance Evaluation, 2017
- DOI for Adaptive cache warming for faster simulations
- Download full text (pdf) of Adaptive cache warming for faster simulations
A split cache hierarchy for enabling data-oriented optimizations
Part of Proc. 23rd International Symposium on High Performance Computer Architecture, p. 133-144, 2017
Addressing energy challenges in filter caches
Part of Proc. 29th International Symposium on Computer Architecture and High Performance Computing, p. 49-56, 2017
Understanding the interplay between task scheduling, memory and performance
Part of Proc. Companion 8th ACM International Conference on Systems, Programming, Languages, and Applications, p. 21-23, 2017
TaskInsight: Understanding task schedules effects on memory and performance
Part of Proc. 8th International Workshop on Programming Models and Applications for Multicores and Manycores, p. 11-20, 2017
Analyzing Graphics Workloads on Tile-based GPUs
Part of Proc. 20th International Symposium on Workload Characterization, p. 108-109, 2017
A graphics tracing framework for exploring CPU+GPU memory systems
Part of Proc. 20th International Symposium on Workload Characterization, p. 54-65, 2017
POSTER: Putting the G back into GPU/CPU Systems Research
Part of 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), p. 130-131, 2017
Formalizing data locality in task parallel applications
Part of Algorithms and Architectures for Parallel Processing, p. 43-61, 2016
Characterizing Task Scheduling Performance Based on Data Reuse
Part of Proc. 9th Nordic Workshop on Multi-Core Computing, 2016
Spatial and Temporal Cache Sharing Analysis in Tasks
2016
Data placement across the cache hierarchy: Minimizing data movement with reuse-aware placement
Part of Proc. 34th International Conference on Computer Design, p. 117-124, 2016
Partitioning GPUs for Improved Scalability
Part of Proc. 28th International Symposium on Computer Architecture and High Performance Computing, p. 42-49, 2016
Long Term Parking (LTP): Criticality-aware Resource Allocation in OOO Processors
Part of Proc. 48th International Symposium on Microarchitecture, p. 334-346, 2015
StatTask: Reuse distance analysis for task-based applications
Part of Proc. 7th Workshop on Rapid Simulation and Performance Evaluation, p. 1-7, 2015
Full speed ahead: Detailed architectural simulation at near-native speed
Part of Proc. 18th International Symposium on Workload Characterization, p. 183-192, 2015
Micro-Architecture Independent Analytical Processor Performance and Power Modeling
Part of 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), p. 32-41, 2015
AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance
Part of Proc. 24th International Conference on Parallel Architectures and Compilation Techniques, p. 367-378, 2015
- DOI for AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance
- Download full text (pdf) of AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance
The Direct-to-Data (D2D) Cache: Navigating the cache hierarchy with a single lookup
Part of Proc. 41st International Symposium on Computer Architecture, p. 133-144, 2014
Fix the code. Don't tweak the hardware: A new compiler approach to Voltage–Frequency scaling
Part of Proc. 12th International Symposium on Code Generation and Optimization, p. 262-272, 2014
Bandwidth Bandit: Quantitative Characterization of Memory Contention
Part of Proc. 11th International Symposium on Code Generation and Optimization, p. 99-108, 2013
Shared Resource Sensitivity in Task-Based Runtime Systems
Part of Proc. 6th Swedish Workshop on Multi-Core Computing, 2013
TLC: A tag-less cache for reducing dynamic first level cache energy
Part of Proceedings of the 46th International Symposium on Microarchitecture, p. 49-61, 2013
Towards more efficient execution: a decoupled access-execute approach
Part of Proc. 27th ACM International Conference on Supercomputing, p. 253-262, 2013
- DOI for Towards more efficient execution: a decoupled access-execute approach
- Download full text (pdf) of Towards more efficient execution: a decoupled access-execute approach
Modeling performance variation due to cache sharing
Part of Proc. 19th IEEE International Symposium on High Performance Computer Architecture, p. 155-166, 2013
- DOI for Modeling performance variation due to cache sharing
- Download full text (pdf) of Modeling performance variation due to cache sharing
Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models
Part of PARMA 2013, 4th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures, 2013
Phase Behavior in Serial and Parallel Applications
Part of International Symposium on Workload Characterization (IISWC'12), 2012
Phase Guided Profiling for Fast Cache Modeling
Part of International Symposium on Code Generation and Optimization (CGO'12), p. 175-185, 2012
Efficient techniques for predicting cache sharing and throughput
Part of Proc. 21st International Conference on Parallel Architectures and Compilation Techniques, p. 305-314, 2012
- DOI for Efficient techniques for predicting cache sharing and throughput
- Download full text (pdf) of Efficient techniques for predicting cache sharing and throughput
Bandwidth bandit: Quantitative characterization of memory contention
Part of Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, p. 457-458, 2012
Cache Pirating: Measuring the Curse of the Shared Cache
Part of Proc. 40th International Conference on Parallel Processing, p. 165-175, 2011
A simple statistical cache sharing model for multicores
Part of Proc. 4th Swedish Workshop on Multi-Core Computing, p. 31-36, 2011
A simple model for tuning tasks
Part of Proc. 4th Swedish Workshop on Multi-Core Computing, p. 45-49, 2011
Using hardware transactional memory for high-performance computing
Part of Proc. 25th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, p. 1660-1667, 2011
Fast modeling of shared caches in multicore systems
Part of Proc. 6th International Conference on High Performance and Embedded Architectures and Compilers, p. 147-157, 2011
StatCC: a statistical cache contention model
Part of Proc. 19th International Conference on Parallel Architectures and Compilation Techniques, p. 551-552, 2010
Block-Parallel Programming for Real-time Embedded Applications
Part of Proc. 39th International Conference on Parallel Processing, p. 297-306, 2010
- DOI for Block-Parallel Programming for Real-time Embedded Applications
- Download full text (pdf) of Block-Parallel Programming for Real-time Embedded Applications
Reports
Faster Functional Warming with Cache Merging
2022
Minimizing Replay under Way-Prediction
2019
Perf-Insight: A Simple, Scalable Approach to Optimal Data Prefetching in Multicores
2015
Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed
2014
- Download full text (pdf) of Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed
Quantitative Characterization of Memory Contention
2012
Cache Pirating: Measuring the curse of the shared cache
2011
Computing Systems: Research Challenges Ahead: The HiPEAC Vision 2011/2012
2011