Welcome! This repo contains the 30-minute Nsight Systems (nsys) + Nsight Compute (ncu) tutorial you can run on any CUDA-capable Linux box (or WSL / Docker image that has the CUDA Toolkit ≥ 11.0 installed).

| File / Target | Purpose |
|---|---|
| `gemm.cu` (alias `naive_gemm.cu`) | Baseline matrix-multiply (global memory only). |
| `nvtx_gemm.cu` | Same math + single NVTX range around the kernel. |
| `pinned_gemm.cu` | Adds pinned host memory (overlap). |
| `tiled_gemm.cu` | Shared-memory tiled kernel with tunable `#define BLOCK 16/32`. |
| `gemm_debug_friendly.cu` | Tutorial version with 3 kernels: `correct`, `no-bounds`, `no-sync` for debugging practice. |
| `Makefile` | `make all` or `make naive/nvtx/pinned/tiled/debug` builds individual demos. |
| `run.sh` | One-click script that: (1) profiles each variant with `nsys` & `ncu`, (2) drops reports next to the binaries. |
| `*.ncu-rep` | Pre-generated Nsight Compute reports you can open in the GUI if you have no GPU handy. |
| `*.nsys-rep` | Pre-generated Nsight Systems timelines (same idea). |

Tip: If you're on a headless cluster, copy the `*.nsys-rep` and `*.ncu-rep` files to your laptop and open them there.

| Need | Version | Check |
|---|---|---|
| CUDA Toolkit | ≥ 11.0 | nvcc --version |
| Nsight Systems CLI & GUI | 2022.2 or later | nsys --version |
| Nsight Compute CLI & GUI | 2022.2 or later | ncu --version |
| GPU driver | Matching Toolkit | nvidia-smi |
| cuda-gdb | Included with CUDA | cuda-gdb --version |
| compute-sanitizer | Included with CUDA | compute-sanitizer --version |

Inside Docker? Use NVIDIA's `nvcr.io/nvidia/cuda:<version>-devel` container.

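
The checks in the prerequisites table can be run in one pass. The following is a minimal sketch and not part of the repo: `check_tools` is a hypothetical helper that only tests whether each command is on `PATH`. It does not verify version numbers, so the "Check" commands from the table are still worth running by hand.

```shell
# Hypothetical helper (not shipped with the repo): report which of the
# given command-line tools are reachable on PATH.
check_tools() (
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%-8s %s\n' "ok" "$tool"
        else
            printf '%-8s %s\n' "MISSING" "$tool"
            missing=$((missing + 1))
        fi
    done
    # Exit status is 0 only when every tool was found.
    [ "$missing" -eq 0 ]
)

# Tool names taken from the prerequisites table.
check_tools nvcc nsys ncu nvidia-smi cuda-gdb compute-sanitizer \
    || echo "Install the missing tools before running the tutorial."
```

The function body runs in a subshell, so its loop variables do not leak into the calling shell.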

```shell
git clone <this-repo>.git
cd cuda_profiling_tutorial
make            # builds all five binaries
# or individually:
make naive      # -> ./naive_gemm
make nvtx       # -> ./nvtx_gemm
make pinned     # -> ./pinned_gemm
make tiled      # -> ./tiled_gemm
make debug      # -> ./debug_gemm
```

```shell
./naive_gemm 2048   # C = A×B (row-major), N=2048
```

```shell
nsys profile --trace=cuda -o gemm_full ./naive_gemm 2048
# Produces gemm_full.nsys-rep (timeline incl. copies + kernel)
```

Want to capture only the kernel (NVTX range)?

```shell
nsys profile --trace=cuda,nvtx --capture-range=nvtx \
     -o nvtx_only ./nvtx_gemm 2048
```

```shell
ncu --set full -o tiled_report ./tiled_gemm 2048
# Creates tiled_report.ncu-rep (open in GUI)
```

Text-dump the raw page:

```shell
ncu --import tiled_report.ncu-rep --page raw --csv \
    > tiled_raw.csv
```

The `debug_gemm` binary contains three kernel versions for learning CUDA debugging:

| Kernel Type | Bug | Command |
|---|---|---|
| `correct` | None (reference implementation) | `./debug_gemm correct 1024` |
| `no-bounds` | Missing bounds checks → out-of-bounds reads | `./debug_gemm no-bounds 1000` |
| `no-sync` | Missing `__syncthreads()` → race conditions | `./debug_gemm no-sync 2048` |

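
A little arithmetic shows why `no-bounds` is run with N = 1000 while `correct` uses N = 1024. The sketch below assumes a 32×32 thread block (inferred from the thread and block coordinates used in this README's cuda-gdb session, not confirmed by the kernel source) and a grid rounded up to cover N. `grid_surplus` is a hypothetical helper name.

```shell
# Assumption: the kernels launch 32x32 thread blocks and round the grid up,
# i.e. grid = ceil(N / BLOCK). Neither value is confirmed by the repo.
grid_surplus() (
    n=$1
    block=$2
    grid=$(( (n + block - 1) / block ))   # blocks per dimension
    covered=$(( grid * block ))           # threads per dimension
    echo "N=$n grid=$grid threads=$covered surplus=$(( covered - n ))"
)

grid_surplus 1024 32   # N=1024 grid=32 threads=1024 surplus=0
grid_surplus 1000 32   # N=1000 grid=32 threads=1024 surplus=24

# Flat index touched by the last thread row when N=1000:
# row 1023, column 31 -> 1023*1000 + 31, versus N*N valid elements.
echo "index=$(( 1023 * 1000 + 31 )) limit=$(( 1000 * 1000 ))"
```

Under these assumptions, N = 1000 leaves 24 rows and 24 columns of threads outside the matrix, so a kernel without bounds checks indexes past the end of its buffers. With N = 1024 every thread maps to a valid element and the missing check goes unnoticed.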

```shell
./debug_gemm correct 1024
# Output: C[0] = 1024 (expected: 1024) ✓
```

Debug with cuda-gdb:

```text
cuda-gdb --args ./debug_gemm no-bounds 1000
(gdb) break matmul_no_bounds
(gdb) run
(gdb) cuda thread (31,31,0) block (31,31,0)
(gdb) print row
# Shows: row = 1023 (but N = 1000, so out of bounds!)
(gdb) print N
# Shows: N = 1000
(gdb) print row * N + threadIdx.x
# Shows index > 1000000 (out of bounds!)
```

The problem: a race condition, where threads read from shared memory before other threads finish writing.

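
A race like this tends to show up as output that varies between runs. As a rough way to observe that, the sketch below runs a command several times and counts the distinct outputs. `distinct_outputs` is a hypothetical helper, not part of the repo, and a count of 1 does not prove the kernel is race-free.

```shell
# Hypothetical helper: run a command RUNS times and print how many
# distinct outputs it produced (1 means every run printed the same text).
distinct_outputs() (
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Collapse each run's output onto one line so runs can be compared.
        "$@" 2>&1 | tr '\n' ' '
        echo
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
)

# Usage (assumes ./debug_gemm has been built with `make debug`):
if [ -x ./debug_gemm ]; then
    echo "distinct outputs: $(distinct_outputs 5 ./debug_gemm no-sync 2048)"
fi
```

Timing-dependent bugs can still produce identical output on some runs or GPUs, so treat this only as a quick symptom check. The racecheck tool is the more reliable diagnosis.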

```shell
./debug_gemm no-sync 2048
# C[0] = unpredictable (changes each run)
```

Debug with compute-sanitizer racecheck:

```shell
compute-sanitizer --tool racecheck ./debug_gemm no-sync 128
```

```shell
# Race conditions (missing __syncthreads)
compute-sanitizer --tool racecheck ./debug_gemm no-sync 128
# Check for uninitialized memory
compute-sanitizer --tool initcheck ./debug_gemm no-bounds 1000
# Interactive debugging
cuda-gdb --args ./debug_gemm no-bounds 1000
```

Common cuda-gdb commands:

```text
(gdb) break matmul_no_bounds      # Set breakpoint
(gdb) run                         # Start execution
(gdb) cuda thread (x,y,z)         # Switch to specific thread
(gdb) cuda kernel block thread    # Show current location
(gdb) print threadIdx.x           # Inspect variables
(gdb) print row * N + col         # Evaluate expressions
(gdb) continue                    # Resume
```

| Variant | Focus | Command snippet |
|---|---|---|
| `naive_gemm` | Timeline anatomy (copies ≫ kernel) | `nsys profile --trace=cuda -o naive ./naive_gemm 2048` |
| `nvtx_gemm` | NVTX + `--capture-range` | `nsys profile --trace=cuda,nvtx --capture-range=nvtx -o nvtx ./nvtx_gemm 2048` |
| `pinned_gemm` | Overlap copies with kernel | Compare `naive.nsys-rep` vs `pinned.nsys-rep` |
| `tiled_gemm` | SM occupancy / Roofline | `ncu --set full -o tiled ./tiled_gemm 2048` |
| `debug_gemm` | Out-of-bounds memory access | `compute-sanitizer ./debug_gemm no-bounds 1000` |
| `debug_gemm` | Race conditions | `compute-sanitizer --tool racecheck ./debug_gemm no-sync 128` |
| `debug_gemm` | Interactive debugging | `cuda-gdb --args ./debug_gemm no-bounds 1000` |

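
The nsys and ncu rows of the table can be strung together into a single profiling pass. The following is only a sketch of that idea and is not the repo's `run.sh`, whose contents are not shown here. `profile_cmds` and the `RUN` variable are hypothetical names, and the `pinned` command is an assumption based on the `pinned.nsys-rep` file the table refers to. By default the script prints the commands instead of running them.

```shell
# Hypothetical sketch: emit one profiling command per variant, using the
# command snippets from the table. N defaults to 2048.
profile_cmds() (
    n=${1:-2048}
    echo "nsys profile --trace=cuda -o naive ./naive_gemm $n"
    echo "nsys profile --trace=cuda,nvtx --capture-range=nvtx -o nvtx ./nvtx_gemm $n"
    # Assumed command: the table only names the resulting pinned.nsys-rep.
    echo "nsys profile --trace=cuda -o pinned ./pinned_gemm $n"
    echo "ncu --set full -o tiled ./tiled_gemm $n"
)

# Dry run by default; set RUN=1 to actually execute each command.
profile_cmds 2048 | while IFS= read -r cmd; do
    if [ "${RUN:-0}" = "1" ]; then
        sh -c "$cmd"
    else
        echo "would run: $cmd"
    fi
done
```

Keeping the command list separate from the loop that executes it makes it easy to review what will be profiled before spending GPU time on it.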

```shell
make clean                    # remove binaries
rm -f *.nsys-rep *.ncu-rep
```