I will share my personal experience learning CUDA, which might be helpful.
Disclaimer: I don't claim this is a systematic way to learn it, and it is geared more toward academic work.
I got assigned to a project that required learning CUDA as part of my PhD. There was no one in my research group who had any experience with CUDA. I started with the standard NVIDIA courses (Getting Started with Accelerated Computing with CUDA C/C++; there is a Python version too).
This gave me a good introduction to the concepts and basic ideas, but after that I did most of my learning by trial and error. I tried a couple of online tutorials for specific things, and some books, but there was always a deprecated function here or there, or an API change that made things obsolete. Or things simply changed for your GPU, and now you have to be careful because you might be developing on a GPU that isn't compatible with the one you target in production, and you need things to work on both.
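To make the compatibility point concrete, one small thing that helps is checking the device's compute capability at runtime before taking a code path that needs newer hardware. A minimal sketch (the 7.0 threshold is just an arbitrary example):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("GPU: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    // Example guard: only use a newer feature path if the device is new enough
    // (7.0 is an arbitrary threshold chosen for illustration).
    if (prop.major < 7) {
        printf("Falling back to the portable code path.\n");
    }
    return 0;
}
```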
I think learning CUDA, for me, was an endeavor of pain and of going through "compute-sanitizer" and Nsight, because you will find that most of your time goes into debugging why things are running slower than you expect.
Take things slowly. Take a simple project that you know how to do without CUDA, then port it to CUDA, benchmark it against the CPU version, and try to optimize different aspects of it.
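A minimal sketch of what such a first port and benchmark can look like (SAXPY instead of a real project, arbitrary sizes and block dimensions, error handling omitted for brevity):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// The same elementwise operation, once as a plain CPU loop...
void saxpy_cpu(float a, const float* x, float* y, int n) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// ...and once as a CUDA kernel, one thread per element.
__global__ void saxpy_gpu(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;                      // ~16M elements, arbitrary size
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    auto t0 = std::chrono::steady_clock::now();
    saxpy_cpu(2.0f, x.data(), y.data(), n);
    auto t1 = std::chrono::steady_clock::now();
    printf("CPU: %.3f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    saxpy_gpu<<<grid, block>>>(2.0f, dx, dy, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU kernel: %.3f ms (excludes the memcpy cost)\n", ms);

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Compile it with nvcc and run the binary under compute-sanitizer first; only once it comes back clean is the timing comparison worth taking seriously.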
One piece of advice that can be helpful: don't think about optimization at the beginning. Start with correctness, then optimize. A working slow kernel beats a fast kernel that corrupts memory.
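One habit that supports "correct first" is wrapping every runtime call in an error check and synchronizing after kernel launches while debugging. A rough sketch of that pattern (the macro name is just a convention, nothing standard):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if any CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void dummy_kernel(float* out) { out[threadIdx.x] = 1.0f; }

int main() {
    float* d;
    CUDA_CHECK(cudaMalloc(&d, 32 * sizeof(float)));
    dummy_kernel<<<1, 32>>>(d);
    CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during execution
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

On top of that, running the same binary as `compute-sanitizer ./a.out` catches out-of-bounds and misaligned accesses that would otherwise silently corrupt memory.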
I can share a similar PhD story (the result is visible here: https://github.com/NX-AI/flashrnn). Back then I didn't find any tutorials that covered anything beyond the basics (which are still important). Once you have understood the working principles and architecture of a GPU, I would recommend the following workflow:

1. First create an environment in which you can actually test your kernels against baselines written in a higher-level language.
2. If you don't have an urgent project already, try to improve/re-implement existing problems (MatMul being the first example). Don't get caught up in wanting to handle all size cases. Take an example just to learn a certain functionality, rather than solving the whole problem, if it's just about learning.
3. Write the functionality you want in increasing complexity: write loops first, then parallelize these loops over the grid. Use global memory first, then put things into shared memory and registers. Use plain matrix multiplication first, then use mma (TensorCore) primitives to speed things up (see the small sketch after this list).
4. Iterate over the CUDA C Programming Guide. It covers all (or most) of the functionality you want to learn, but it can't just be read and memorized; you learn it when you apply it.
5. Depending on your use case, also consider higher-level abstractions like CUTLASS or ThunderKittens. And if your environment is jax/torch, try Triton first before going down to the CUDA level.
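As a rough illustration of step 3 (a generic example, not taken from flashrnn): the "loops first, then parallelize over the grid" progression for a square MatMul, still entirely in global memory; the shared-memory tiling and mma versions would build on top of this:

```cpp
#include <cuda_runtime.h>

// Step "loops first": the reference CPU triple loop for C = A * B,
// with square matrices of size n for simplicity.
void matmul_cpu(const float* A, const float* B, float* C, int n) {
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
            C[row * n + col] = acc;
        }
}

// Step "parallelize over the grid": the two outer loops become the thread
// index, and everything still goes through global memory. The next
// iterations would stage tiles of A and B in __shared__ memory and
// eventually use mma/TensorCore primitives.
__global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}
```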
Overall, it will involve some pain for sure. And mastering it, including PTX etc., will take a lot of time.
> I think learning CUDA, for me, was an endeavor of pain and of going through "compute-sanitizer" and Nsight, because you will find that most of your time goes into debugging why things are running slower than you expect.
This is so true it hurts.