100 days of CUDA

Installation instructions (how I do it; the full command sequence is sketched below)
  • Create a mamba environment
  • mamba install python=3.12 (pycuda does not work on 3.13 yet)
  • mamba install cuda
  • pip install pycuda
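
Put together, it looks roughly like this (the environment name cuda100 and the nvidia channel are my choices here; adjust to your setup):

mamba create -n cuda100 python=3.12   # pycuda does not work on 3.13 yet
mamba activate cuda100
mamba install -c nvidia cuda          # CUDA toolkit: nvcc, headers, libraries
pip install pycuda
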
Progress
  • Day 0 playing with PyCUDA
  • Day 1 playing with NVCC, vector addition
  • Day 2 RGB to grayscale
  • Day 3 RGB blur
  • Day 4 Naive matmul+exercises
  • Day 5 Matrix-vector multiplication
  • Day 6 Tiled matmul
  • Day 7 Tiled matmul - experiments
  • Day 8 Tiled matmul - thread coarsening
  • Day 9 Naive conv2d with arbitrary number of channels
  • Day 10 faster conv2d
  • Day 11 conv2d with shared memory
  • Day 12 conv2d with shared memory + halo

Some CUDA (or C) quirks to note:

Signed-unsigned comparison is dumb

uint32_t a =  1;
int32_t  j = -1;
j >= a   // true: j is converted to uint32_t, becoming 4294967295
j +  a   // 0: the sum wraps around in unsigned arithmetic

This is C's "usual arithmetic conversions" at work: when a signed and an unsigned integer of the same rank meet in an expression, the signed operand is converted to unsigned. :/
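
A self-contained way to check it (plain C, no CUDA involved):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = 1;
    int32_t  j = -1;
    // j is converted to uint32_t, so -1 becomes 4294967295
    printf("j >= a: %d\n", j >= a); // prints 1 (true)
    printf("j + a:  %u\n", j + a);  // prints 0 (unsigned wrap-around)
    return 0;
}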

Benchmarking

Run this script before benchmarking to lock the GPU and memory clocks, which helps avoid thermal throttling and unstable timings. The clock values below are for my GPU; nvidia-smi -q -d SUPPORTED_CLOCKS lists the ones yours supports.


sudo nvidia-smi -pm 1                   # Enable persistence mode
sleep 2

sudo nvidia-smi -lgc 1000,1000          # Lock GPU clocks (min,max in MHz) to prevent frequency scaling
sudo nvidia-smi -lmc 5000,5000          # Lock memory clocks (min,max in MHz)
sudo nvidia-smi --auto-boost-default=0  # Disable auto-boost
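
To undo this after a benchmarking session (-rgc and -rmc are the reset counterparts of -lgc and -lmc):

sudo nvidia-smi -rgc   # Reset GPU clocks to default
sudo nvidia-smi -rmc   # Reset memory clocks to default
sudo nvidia-smi -pm 0  # Disable persistence mode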