CUDA Profiling Lightning-Tutorial

Welcome! This repo contains a 30-minute Nsight Systems (nsys) + Nsight Compute (ncu) tutorial that you can run on any CUDA-capable Linux box (or a WSL / Docker environment) with CUDA Toolkit ≥ 11.0 installed.

1. What's Inside

| File / Target | Purpose |
| --- | --- |
| `gemm.cu` (alias `naive_gemm.cu`) | Baseline matrix-multiply (global memory only). |
| `nvtx_gemm.cu` | Same math plus a single NVTX range around the kernel. |
| `pinned_gemm.cu` | Adds pinned host memory so copies can overlap with the kernel. |
| `tiled_gemm.cu` | Shared-memory tiled kernel with a tunable `#define BLOCK` (16 or 32). |
| `gemm_debug_friendly.cu` | Debugging-practice version with three kernels: correct, no-bounds, and no-sync. |
| `Makefile` | `make all` builds everything; `make naive`, `make nvtx`, `make pinned`, `make tiled`, or `make debug` builds an individual demo. |
| `run.sh` | One-click script that (1) profiles each variant with `nsys` and `ncu` and (2) drops the reports next to the binaries. |
| `*.ncu-rep` | Pre-generated Nsight Compute reports you can open in the GUI if you have no GPU handy. |
| `*.nsys-rep` | Pre-generated Nsight Systems timelines (same idea). |

Tip: If you're on a headless cluster, copy the `*.nsys-rep` and `*.ncu-rep` files to your laptop and open them there.

2. Prerequisites ⚙️

| Need | Version | Check |
| --- | --- | --- |
| CUDA Toolkit | ≥ 11.0 | `nvcc --version` |
| Nsight Systems CLI & GUI | 2022.2 or later | `nsys --version` |
| Nsight Compute CLI & GUI | 2022.2 or later | `ncu --version` |
| GPU driver | Matching Toolkit | `nvidia-smi` |
| cuda-gdb | Included with CUDA | `cuda-gdb --version` |
| compute-sanitizer | Included with CUDA | `compute-sanitizer --version` |

Inside Docker? Use NVIDIA's nvcr.io/nvidia/cuda:<version>-devel container.

3. Build 🛠️

git clone <this-repo>.git
cd cuda_profiling_tutorial
make            # builds all five binaries
# or individually:
make naive      # -> ./naive_gemm
make nvtx       # -> ./nvtx_gemm
make pinned     # -> ./pinned_gemm
make tiled      # -> ./tiled_gemm
make debug      # -> ./debug_gemm

4. Quick Start 🚀

4.1 Run the baseline kernel

./naive_gemm 2048          # C = A×B (row-major), N=2048

4.2 Profile with Nsight Systems

nsys profile --trace=cuda -o gemm_full ./naive_gemm 2048
# Produces gemm_full.nsys-rep (timeline incl. copies + kernel)

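Want to capture only the kernel (NVTX range)? Tell nsys which range triggers the capture:

nsys profile --trace=cuda,nvtx --capture-range=nvtx \
             --nvtx-capture="<range-name>" \
             -o nvtx_only ./nvtx_gemm 2048
# <range-name> is the name of the NVTX range used in nvtx_gemm.cu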
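
For reference, a single NVTX range around a kernel launch might look like the sketch below. It is only an illustration under assumptions: the `scale` kernel stands in for the GEMM kernel, and the range name `gemm_kernel` and the helper `launch_in_range` are hypothetical, not taken from `nvtx_gemm.cu`.

```cuda
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>  // header-only NVTX v3 that ships with the CUDA Toolkit

// Placeholder kernel standing in for the GEMM kernel.
__global__ void scale(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Launches the kernel inside an NVTX range named "gemm_kernel" (hypothetical name).
cudaError_t launch_in_range(float* d_x, int n) {
    nvtxRangePushA("gemm_kernel");              // range opens on the CPU timeline
    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
    cudaError_t err = cudaGetLastError();       // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();          // wait so the range spans the kernel's execution
    nvtxRangePop();                             // range closes
    return err;
}
```

The synchronize call sits inside the range on purpose: kernel launches are asynchronous, so without it the range would close as soon as the launch is queued rather than when the kernel finishes.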

4.3 Deep dive with Nsight Compute

ncu --set full -o tiled_report ./tiled_gemm 2048
# Creates tiled_report.ncu-rep  (open in GUI)

Text-dump the raw page:

ncu --import tiled_report.ncu-rep --page raw --csv \
    > tiled_raw.csv
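
One number worth cross-checking against the report is theoretical occupancy, which can also be queried from the CUDA runtime. The sketch below is an assumption-laden illustration: `tile_stub` is a hypothetical stand-in with the same block shape and static shared-memory pattern as a tiled kernel, and `theoretical_occupancy` is a helper name invented here. In practice you would pass the real kernel instead.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel: BLOCK x BLOCK threads and one BLOCK x BLOCK shared tile.
template <int BLOCK>
__global__ void tile_stub(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK][BLOCK];
    int col = blockIdx.x * BLOCK + threadIdx.x;
    int row = blockIdx.y * BLOCK + threadIdx.y;
    if (row < n && col < n) {
        tile[threadIdx.y][threadIdx.x] = in[row * n + col];
        out[row * n + col] = tile[threadIdx.y][threadIdx.x];
    }
}

// Returns theoretical occupancy in (0, 1] for a BLOCK x BLOCK thread block,
// or -1 if a CUDA call fails.
template <int BLOCK>
float theoretical_occupancy(int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;
    int blocks_per_sm = 0;
    // Static shared memory is accounted for automatically; 0 = no dynamic shared memory.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, tile_stub<BLOCK>, BLOCK * BLOCK, 0) != cudaSuccess)
        return -1.0f;
    return float(blocks_per_sm * BLOCK * BLOCK) / prop.maxThreadsPerMultiProcessor;
}
```

Calling `theoretical_occupancy<16>()` and `theoretical_occupancy<32>()` gives a quick way to see how the two `BLOCK` settings compare before opening the report.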

5. Debugging Tutorial 🐛

The debug_gemm binary contains three kernel versions for learning CUDA debugging:

| Kernel Type | Bug | Command |
| --- | --- | --- |
| `correct` | None - reference implementation | `./debug_gemm correct 1024` |
| `no-bounds` | Missing bounds checks → out-of-bounds reads | `./debug_gemm no-bounds 1000` |
| `no-sync` | Missing `__syncthreads()` → race conditions | `./debug_gemm no-sync 2048` |

5.1 Testing the Correct Version

./debug_gemm correct 1024
# Output: C[0] = 1024 (expected: 1024) ✓

5.2 Bug #1: Missing Bounds Checks

Debug with cuda-gdb:

cuda-gdb --args ./debug_gemm no-bounds 1000

(gdb) break matmul_no_bounds
(gdb) run
(gdb) cuda thread (31,31,0) block (31,31,0)
(gdb) print row
# Shows: row = 1023 (but N = 1000, so out of bounds!)
(gdb) print N
# Shows: N = 1000
(gdb) print row * N + threadIdx.x
# Shows index > 1000000 (out of bounds!)
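
The arithmetic behind this session: with 32×32 thread blocks (an assumption, but one consistent with thread (31,31,0) in block (31,31,0) having row 1023), N = 1000 needs ceil(1000/32) = 32 blocks per dimension. That launches 1024 rows of threads, so rows 1000 to 1023 fall outside the matrix. A minimal sketch of the usual guard follows; `matmul_checked`, `blocks_per_dim`, and `launch_checked` are hypothetical names, not the repo's.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed block edge

// Naive GEMM with the guard that a no-bounds variant would omit.
__global__ void matmul_checked(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;  // threads past the matrix edge do nothing
    float acc = 0.0f;
    for (int k = 0; k < N; ++k) acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Blocks per grid dimension: ceil(N / TILE).
inline int blocks_per_dim(int N) { return (N + TILE - 1) / TILE; }

// Host launcher: the grid is rounded up, so it may cover more than N rows and columns.
cudaError_t launch_checked(const float* dA, const float* dB, float* dC, int N) {
    dim3 block(TILE, TILE);
    dim3 grid(blocks_per_dim(N), blocks_per_dim(N));
    matmul_checked<<<grid, block>>>(dA, dB, dC, N);
    cudaError_t err = cudaGetLastError();                       // launch-configuration errors
    return err != cudaSuccess ? err : cudaDeviceSynchronize();  // execution errors
}
```

The early `return` is safe here because this kernel has no block-wide barrier. A kernel that calls `__syncthreads()` should instead keep all threads alive and guard only the loads and stores.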

5.3 Bug #2: Missing Synchronization

The Problem: a race condition. Threads read a shared-memory tile before the other threads in the block have finished writing it.

./debug_gemm no-sync 2048
# C[0] = unpredictable (changes each run)

Debug with compute-sanitizer racecheck:

compute-sanitizer --tool racecheck ./debug_gemm no-sync 128
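
For context, a shared-memory tiled GEMM typically needs two barriers per tile step, and dropping either one creates the kind of hazard racecheck is designed to flag. The sketch below is a generic illustration with hypothetical names (`matmul_tiled`, `launch_tiled`), not the repo's kernel.

```cuda
#include <cuda_runtime.h>

#define BLOCK 16  // tile edge; assumed here, the tutorial describes 16 or 32

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[BLOCK][BLOCK];
    __shared__ float Bs[BLOCK][BLOCK];
    int row = blockIdx.y * BLOCK + threadIdx.y;
    int col = blockIdx.x * BLOCK + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (N + BLOCK - 1) / BLOCK; ++t) {
        int a_col = t * BLOCK + threadIdx.x;
        int b_row = t * BLOCK + threadIdx.y;
        // Each thread loads one element of each tile; out-of-range slots become 0.
        As[threadIdx.y][threadIdx.x] = (row < N && a_col < N) ? A[row * N + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < N && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // barrier 1: tiles fully written before any thread reads them
        for (int k = 0; k < BLOCK; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // barrier 2: all reads done before the next iteration overwrites
    }
    if (row < N && col < N) C[row * N + col] = acc;
}

cudaError_t launch_tiled(const float* dA, const float* dB, float* dC, int N) {
    dim3 block(BLOCK, BLOCK);
    dim3 grid((N + BLOCK - 1) / BLOCK, (N + BLOCK - 1) / BLOCK);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
    cudaError_t err = cudaGetLastError();                       // launch-configuration errors
    return err != cudaSuccess ? err : cudaDeviceSynchronize();  // execution errors
}
```

Without barrier 1 a thread may read a tile slot that its neighbour has not written yet (read-after-write hazard); without barrier 2 a fast thread may overwrite a slot that a slower thread is still reading (write-after-read hazard).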

5.4 Quick Debug Reference

# Race conditions (missing __syncthreads)
compute-sanitizer --tool racecheck ./debug_gemm no-sync 128

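# Out-of-bounds accesses (memcheck is the default tool)
compute-sanitizer --tool memcheck ./debug_gemm no-bounds 1000

# Check for uninitialized memory
compute-sanitizer --tool initcheck ./debug_gemm no-bounds 1000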

# Interactive debugging
cuda-gdb --args ./debug_gemm no-bounds 1000

Common cuda-gdb commands:

(gdb) break matmul_no_bounds     # Set breakpoint
(gdb) run                        # Start execution
(gdb) cuda thread (x,y,z)        # Switch to specific thread
(gdb) cuda kernel block thread   # Show current location
(gdb) print threadIdx.x          # Inspect variables
(gdb) print row * N + col        # Evaluate expressions
(gdb) continue                   # Resume

6. What to Demo (cheat-sheet)

| Variant | Focus | Command snippet |
| --- | --- | --- |
| `naive_gemm` | Timeline anatomy (copies ≫ kernel) | `nsys profile --trace=cuda -o naive ./naive_gemm 2048` |
| `nvtx_gemm` | NVTX + `--capture-range` | `nsys profile --trace=cuda,nvtx --capture-range=nvtx -o nvtx ./nvtx_gemm 2048` |
| `pinned_gemm` | Overlap copies with kernel | Compare `naive.nsys-rep` vs `pinned.nsys-rep` |
| `tiled_gemm` | SM occupancy / Roofline | `ncu --set full -o tiled ./tiled_gemm 2048` |
| `debug_gemm` | Out-of-bounds memory access | `compute-sanitizer ./debug_gemm no-bounds 1000` |
| `debug_gemm` | Race conditions | `compute-sanitizer --tool racecheck ./debug_gemm no-sync 128` |
| `debug_gemm` | Interactive debugging | `cuda-gdb --args ./debug_gemm no-bounds 1000` |
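
The copy/kernel overlap in the `pinned_gemm` row depends on two ingredients: page-locked host buffers and asynchronous copies issued on non-default streams. The sketch below shows the pattern on a deliberately simple workload. Everything in it is an assumption for illustration: the `scale` kernel, the helper `overlapped_scale`, and the four-way chunking are hypothetical and not taken from `pinned_gemm.cu`.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Doubles n floats (all initialised to 1) in up to kChunks pieces, one stream per piece.
// Returns the last element after the round trip, or -1 on a CUDA error. Assumes n > 0.
float overlapped_scale(int n) {
    constexpr int kChunks = 4;
    float *h = nullptr, *d = nullptr;
    if (cudaMallocHost(&h, n * sizeof(float)) != cudaSuccess) return -1.0f;  // pinned host buffer
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) { cudaFreeHost(h); return -1.0f; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&streams[c]);

    int chunk = (n + kChunks - 1) / kChunks;
    for (int c = 0; c < kChunks; ++c) {
        int off = c * chunk;
        int len = (off + chunk <= n) ? chunk : n - off;
        if (len <= 0) break;  // fewer elements than chunks
        size_t bytes = len * sizeof(float);
        // Work in one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(d + off, h + off, bytes, cudaMemcpyHostToDevice, streams[c]);
        scale<<<(len + 255) / 256, 256, 0, streams[c]>>>(d + off, len, 2.0f);
        cudaMemcpyAsync(h + off, d + off, bytes, cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaError_t err = cudaDeviceSynchronize();
    float result = (err == cudaSuccess) ? h[n - 1] : -1.0f;

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return result;
}
```

In an Nsight Systems timeline, a pattern like this should show copy rows and kernel rows overlapping across streams, whereas copies from pageable memory tend to serialize with the kernel.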

7. Cleanup 🧹

make clean          # remove binaries
rm -f *.nsys-rep *.ncu-rep

csl-iisc/gpu-profiling-tutorial
