I use GPUs every day at work, but I’ve always struggled to understand what’s happening under the hood. Most of my work goes through high-level libraries and frameworks, which abstract away the scheduling, memory hierarchies, and execution details. I knew GPUs were fast, but I didn’t really understand why. At university, exercises in reverse engineering CPUs helped me understand cache hierarchies, instruction latencies, and performance quirks. I decided to try something similar on an Nvidia Jetson Orin Nano: treat the GPU as a black box and reverse engineer its microarchitecture through simple experiments. ...