csl-iisc/gpu-profiling-tutorial

CUDA Profiling Lightning-Tutorial

Welcome! This repo contains the 30-minute Nsight Systems (nsys) + Nsight Compute (ncu) tutorial you can run on any CUDA-capable Linux box (or WSL / Docker image that has the CUDA Toolkit ≥ 11.0 installed).


1. What's Inside

| File / Target | Purpose |
|---|---|
| gemm.cu (alias naive_gemm.cu) | Baseline matrix-multiply (global memory only). |
| nvtx_gemm.cu | Same math + single NVTX range around the kernel. |
| pinned_gemm.cu | Adds pinned host memory (overlap). |
| tiled_gemm.cu | Shared-memory tiled kernel with tunable #define BLOCK 16/32. |
| gemm_debug_friendly.cu | Tutorial version with 3 kernels: correct, no-bounds, no-sync for debugging practice. |
| Makefile | make all or make naive/nvtx/pinned/tiled/debug builds individual demos. |
| run.sh | One-click script that: 1. profiles each variant with nsys & ncu; 2. drops reports next to the binaries. |
| *.ncu-rep | Pre-generated Nsight Compute reports you can open in the GUI if you have no GPU handy. |
| *.nsys-rep | Pre-generated Nsight Systems timelines (same idea). |

Tip: If you're on a headless cluster, copy the *.nsys-rep and *.ncu-rep files to your laptop and open them there in the Nsight Systems and Nsight Compute GUIs.


2. Prerequisites ⚙️

| Need | Version | Check |
|---|---|---|
| CUDA Toolkit | ≥ 11.0 | nvcc --version |
| Nsight Systems CLI & GUI | 2022.2 or later | nsys --version |
| Nsight Compute CLI & GUI | 2022.2 or later | ncu --version |
| GPU driver | Matching Toolkit | nvidia-smi |
| cuda-gdb | Included with CUDA | cuda-gdb --version |
| compute-sanitizer | Included with CUDA | compute-sanitizer --version |

Inside Docker? Use NVIDIA's nvcr.io/nvidia/cuda:<version>-devel container.


3. Build 🛠️

git clone <this-repo>.git
cd gpu-profiling-tutorial   # the clone directory is named after the repo
make            # builds all five binaries
# or individually:
make naive      # -> ./naive_gemm
make nvtx       # -> ./nvtx_gemm
make pinned     # -> ./pinned_gemm
make tiled      # -> ./tiled_gemm
make debug      # -> ./debug_gemm 

4. Quick Start 🚀

4.1 Run the baseline kernel

./naive_gemm 2048          # C = A×B (row-major), N=2048
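
For orientation, a baseline of this kind usually looks like the kernel below: one thread per output element, with every operand read straight from global memory. This is a minimal sketch and not the repo's gemm.cu; the names naive_gemm_sketch and run_naive_sketch are hypothetical, and the all-ones inputs are an assumption chosen so that every element of C should equal N.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread computes one element of C; all loads come from global memory.
__global__ void naive_gemm_sketch(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {                 // guard: the grid may overshoot n
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];   // row-major indexing
        C[row * n + col] = acc;
    }
}

// Hypothetical host driver (error checks trimmed for brevity).
// A and B are filled with ones, so every element of C should equal n.
bool run_naive_sketch(int n) {
    size_t count = size_t(n) * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hA(count, 1.0f), hB(count, 1.0f), hC(count, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    naive_gemm_sketch<<<grid, block>>>(dA, dB, dC, n);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (float v : hC)
        if (v != float(n)) return false;
    return true;
}
```

Calling run_naive_sketch(2048) from a main() mirrors the command above. The two host-to-device copies and the device-to-host copy are the transfers you would expect to see next to the kernel on the timeline.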

4.2 Profile with Nsight Systems

nsys profile --trace=cuda -o gemm_full ./naive_gemm 2048
# Produces gemm_full.nsys-rep (timeline incl. copies + kernel)

Want to capture only the kernel (NVTX range)?

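nsys profile --trace=cuda,nvtx --capture-range=nvtx \
             -o nvtx_only ./nvtx_gemm 2048
# Note: with --capture-range=nvtx, nsys also expects --nvtx-capture=<range-name>
# to say which range triggers collection; use the range name from nvtx_gemm.cu.
# If that name is not a registered NVTX string, setting the environment variable
# NSYS_NVTX_PROFILER_REGISTER_ONLY=0 may be needed as well.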
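
To make "a single NVTX range around the kernel" concrete, here is a minimal sketch of how such a range is typically written with the header-only NVTX v3 API that ships with the CUDA Toolkit. It is not the repo's nvtx_gemm.cu: the range name "gemm_kernel", the stand-in kernel scale_sketch, and the helper launch_in_nvtx_range are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>   // header-only NVTX v3

// Stand-in kernel; any kernel launch can be wrapped the same way.
__global__ void scale_sketch(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Hypothetical helper: wraps one kernel launch in a named NVTX range.
// Returns true if the launch and the synchronization both succeeded.
bool launch_in_nvtx_range(int n) {
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d, 0, n * sizeof(float));

    nvtxRangePushA("gemm_kernel");               // assumed range name; opens the range
    scale_sketch<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();   // launches are asynchronous, so wait
                                                 // before closing the range
    nvtxRangePop();                              // closes the range

    cudaFree(d);
    return err == cudaSuccess;
}
```

Synchronizing before nvtxRangePop() is a deliberate choice: without it, the range would only cover the launch call on the CPU side and could close before the kernel has finished on the GPU. On Linux, NVTX v3 builds usually need -ldl on the link line.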

4.3 Deep dive with Nsight Compute

ncu --set full -o tiled_report ./tiled_gemm 2048
# Creates tiled_report.ncu-rep  (open in GUI)

Text-dump the raw page:

ncu --import tiled_report.ncu-rep --page raw --csv \
    > tiled_raw.csv

5. Debugging Tutorial 🐛

The debug_gemm binary contains three kernel versions for learning CUDA debugging:

| Kernel Type | Bug | Command |
|---|---|---|
| correct | None - reference implementation | ./debug_gemm correct 1024 |
| no-bounds | Missing bounds checks → out-of-bounds reads | ./debug_gemm no-bounds 1000 |
| no-sync | Missing __syncthreads() → race conditions | ./debug_gemm no-sync 2048 |

5.0 Testing the Correct Version

./debug_gemm correct 1024
# Output: C[0] = 1024 (expected: 1024) ✓

5.1 Bug #1: Missing Bounds Checks

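The Problem: The grid is rounded up to whole blocks, so when N is not a multiple of the block size, some threads get a row or column index ≥ N. Without a bounds check, those threads read past the end of the matrices.

Debug with cuda-gdb: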

cuda-gdb --args ./debug_gemm no-bounds 1000

(gdb) break matmul_no_bounds
(gdb) run
(gdb) cuda thread (31,31,0) block (31,31,0)
(gdb) print row
# Shows: row = 1023 (but N = 1000, so out of bounds!)
(gdb) print N
# Shows: N = 1000
(gdb) print row * N + threadIdx.x
# Shows index > 1000000 (out of bounds!)
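
The numbers in that session follow from how the grid is sized. With N = 1000 and a 32×32 block (an assumption inferred from the thread (31,31,0) focus and row = 1023 = 31·32 + 31), the launch needs ceil(1000 / 32) = 32 blocks per dimension, which is 1024 × 1024 threads for a 1000 × 1000 matrix. The sketch below counts the surplus threads on the GPU; count_out_of_range and out_of_range_threads are hypothetical names and not part of the repo.

```cuda
#include <cuda_runtime.h>

#define BLOCK 32   // assumed block edge, inferred from the cuda-gdb session

// Each thread checks whether its (row, col) lies outside the n x n matrix.
__global__ void count_out_of_range(int n, unsigned int* count) {
    int row = blockIdx.y * BLOCK + threadIdx.y;
    int col = blockIdx.x * BLOCK + threadIdx.x;
    if (row >= n || col >= n)
        atomicAdd(count, 1u);   // without a guard, this thread would index past the matrix
}

// Hypothetical host helper: returns how many launched threads fall outside the matrix.
unsigned int out_of_range_threads(int n) {
    unsigned int* d_count = nullptr;
    unsigned int h_count = 0;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    int blocks = (n + BLOCK - 1) / BLOCK;     // ceil(n / BLOCK): 32 for n = 1000
    dim3 grid(blocks, blocks), block(BLOCK, BLOCK);
    count_out_of_range<<<grid, block>>>(n, d_count);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return h_count;
}
```

For N = 1000 the helper returns 1024² − 1000² = 48576, and for N = 1024 it returns 0, which is why a multiple of the block size hides the bug. The usual fix is the same test inverted: wrap the kernel body in if (row < N && col < N).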

5.2 Bug #2: Missing Synchronization

The Problem: Race condition - threads read from shared memory before other threads finish writing.

./debug_gemm no-sync 2048
# C[0] = unpredictable (changes each run)

Debug with compute-sanitizer racecheck:

compute-sanitizer --tool racecheck ./debug_gemm no-sync 128
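
For reference, a shared-memory tiled kernel normally needs two barriers per tile, and dropping either one produces the kind of hazard racecheck reports. The sketch below marks both; it is not the repo's kernel, the names tiled_gemm_sketch and run_tiled_sketch are hypothetical, and the all-ones inputs are an assumption chosen so that every element of C should equal N.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 16   // tile edge (the repo describes 16/32 as the tunable values)

__global__ void tiled_gemm_sketch(const float* A, const float* B, float* C, int n) {
    __shared__ float As[BLOCK][BLOCK];
    __shared__ float Bs[BLOCK][BLOCK];
    int row = blockIdx.y * BLOCK + threadIdx.y;
    int col = blockIdx.x * BLOCK + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + BLOCK - 1) / BLOCK; ++t) {
        int aCol = t * BLOCK + threadIdx.x;
        int bRow = t * BLOCK + threadIdx.y;
        // Out-of-range tile entries are zero-filled so they do not change the sum.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();   // barrier 1: the whole tile is written before anyone reads it

        for (int k = 0; k < BLOCK; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // barrier 2: all reads finish before the next tile overwrites
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host driver (error checks trimmed for brevity).
bool run_tiled_sketch(int n) {
    size_t count = size_t(n) * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hA(count, 1.0f), hB(count, 1.0f), hC(count, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK, BLOCK);
    dim3 grid((n + BLOCK - 1) / BLOCK, (n + BLOCK - 1) / BLOCK);
    tiled_gemm_sketch<<<grid, block>>>(dA, dB, dC, n);

    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (float v : hC)
        if (v != float(n)) return false;
    return true;
}
```

Note that no thread returns early: every thread in the block reaches both barriers on every iteration, and the bounds check is applied only to the loads and the final store.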

5.3 Quick Debug Reference

# Race conditions (missing __syncthreads)
compute-sanitizer --tool racecheck ./debug_gemm no-sync 128

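# Out-of-bounds accesses (memcheck is the default tool)
compute-sanitizer --tool memcheck ./debug_gemm no-bounds 1000

# Check for uninitialized memory
compute-sanitizer --tool initcheck ./debug_gemm no-bounds 1000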

# Interactive debugging
cuda-gdb --args ./debug_gemm no-bounds 1000

Common cuda-gdb commands:

(gdb) break matmul_no_bounds     # Set breakpoint
(gdb) run                        # Start execution
(gdb) cuda thread (x,y,z)        # Switch to specific thread
(gdb) cuda kernel block thread   # Show current location
(gdb) print threadIdx.x          # Inspect variables
(gdb) print row * N + col        # Evaluate expressions
(gdb) continue                   # Resume

6. What to Demo (cheat-sheet)

| Variant | Focus | Command snippet |
|---|---|---|
| naive_gemm | Timeline anatomy (copies ≫ kernel) | nsys profile --trace=cuda -o naive ./naive_gemm 2048 |
| nvtx_gemm | NVTX + --capture-range | nsys profile --trace=cuda,nvtx --capture-range=nvtx -o nvtx ./nvtx_gemm 2048 |
| pinned_gemm | Overlap copies with kernel | Compare naive.nsys-rep vs pinned.nsys-rep |
| tiled_gemm | SM occupancy / Roofline | ncu --set full -o tiled ./tiled_gemm 2048 |
| debug_gemm | Out-of-bounds memory access | compute-sanitizer ./debug_gemm no-bounds 1000 |
| debug_gemm | Race conditions | compute-sanitizer --tool racecheck ./debug_gemm no-sync 128 |
| debug_gemm | Interactive debugging | cuda-gdb --args ./debug_gemm no-bounds 1000 |
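
The pinned_gemm row depends on two ingredients: page-locked host memory and asynchronous copies issued on streams. The sketch below shows that pattern in isolation, under the assumption that the repo does something similar; add_one and pinned_overlap_sketch are hypothetical names, and the two-stream split is only one way to arrange the work.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel so there is some work to overlap with the copies.
__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical helper: splits a pinned buffer into two chunks and queues
// copy-in, kernel, copy-out on two streams so the chunks can overlap.
bool pinned_overlap_sketch(int n) {
    const int kStreams = 2;
    int chunk = n / kStreams;                  // assumes n is even, for brevity
    float *h = nullptr, *d = nullptr;
    // cudaMallocHost returns pinned (page-locked) host memory.
    if (cudaMallocHost(&h, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = float(i);

    cudaStream_t s[kStreams];
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < kStreams; ++i) {
        int off = i * chunk;
        size_t bytes = chunk * sizeof(float);
        cudaMemcpyAsync(d + off, h + off, bytes, cudaMemcpyHostToDevice, s[i]);
        add_one<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaError_t err = cudaDeviceSynchronize();   // wait for both streams

    bool ok = (err == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) ok = (h[i] == float(i) + 1.0f);

    for (int i = 0; i < kStreams; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

With pageable memory from malloc or std::vector, the same cudaMemcpyAsync calls are generally staged through an internal buffer and lose most of the overlap, which is the difference to look for when comparing the two timelines.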

7. Cleanup 🧹

make clean          # remove binaries
rm -f *.nsys-rep *.ncu-rep
