many-core.group

Cambridge CUDA Course 25-27 May 2009


Around 40 people attended this six-lecture course on CUDA. Links to the presentations and the codes used can be found below.

GPUs are cheap, massively parallel, programmable compute devices that can be used for many general-purpose (non-graphics) tasks. They are a "good fit" for many scientific applications, and significant speedups (compared with contemporary CPUs) have been reported. The CUDA language makes NVIDIA GPUs accessible to developers through a series of extensions to C (with no mention of pixels or shading!). In this series of lectures, we aim to show: how to build and configure a CUDA computer; how to write some simple (and some less-simple) CUDA kernels; and how to go about optimising your CUDA code to achieve better performance.

Day 1 - Monday 25th May

Getting started - Graham Pullan

[Lecture 1]

This lecture introduces the hardware and software needed to run CUDA. What PCs and GPUs are suitable? How do we run our first CUDA "Hello World" program?

Downloads:


Threads - Graham Pullan

[Lecture 2]

In this lecture, a simple single-equation PDE (the heat conduction equation) is used as an example problem. We move from a naive CUDA implementation to a better-optimised one and examine the reasons for the changes in performance.
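
For orientation, the kind of update such a solver performs can be sketched on the CPU in plain Python. This is not the lecture code: the 1D grid, the explicit finite-difference scheme, the fixed boundary values and the names `heat_step` and `solve` are all assumptions made for illustration. Each pass of the inner loop is the work that a naive GPU version would typically hand to one thread.

```python
def heat_step(u, alpha):
    """Advance the temperature field u by one explicit time step.

    alpha = k * dt / dx**2 (assumed notation); this explicit 1D scheme
    is stable for alpha <= 0.5.
    """
    new = list(u)  # end values are copied unchanged (fixed boundaries)
    for i in range(1, len(u) - 1):
        # Each interior point depends only on its two neighbours,
        # so all points can be updated independently of one another.
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def solve(u, alpha, steps):
    """Apply heat_step repeatedly and return the final field."""
    for _ in range(steps):
        u = heat_step(u, alpha)
    return u


# Hypothetical test case: a bar held at 1.0 on the left and 0.0 on the right.
u0 = [0.0] * 11
u0[0] = 1.0
result = solve(u0, 0.25, 2000)
```

After enough steps the field should approach a straight line between the two boundary values, which gives a simple check for any GPU implementation of the same scheme.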

Downloads:


Day 2 - Tuesday 26th May

Developing kernels - Part 1 - Steven Gratton

[Lecture 3]

This lecture introduces the need for multi-kernel CUDA programs. The example application is Cholesky matrix factorisation.
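
A plain Python reference for Cholesky factorisation may help show where the need for several kernels could come from. This is a CPU sketch only, not the lecture code; the function name `cholesky` and the column-by-column ordering are assumptions. The outer loop over columns is inherently sequential, because column j needs every earlier column, while the entries within one column are independent of each other. One plausible GPU mapping, stated here as an assumption, is a separate kernel launch per step so that each launch acts as a global synchronisation point.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T = a.

    a is a symmetric positive-definite matrix given as a list of rows.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):  # sequential: column j needs columns 0..j-1
        # Step 1: the diagonal entry of column j.
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(s)
        # Step 2: entries below the diagonal. These do not depend on
        # one another, so they could be computed in parallel.
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l


# Hypothetical test matrix (symmetric positive definite).
a = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
factor = cholesky(a)
```

Multiplying `factor` by its transpose should reproduce `a`, which is a convenient correctness check for any faster implementation.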

Downloads:


Developing kernels - Part 2 - Steven Gratton

[Lecture 4]

To make further improvements to CUDA programs, we must take still more care to tune our code to the underlying hardware. In this lecture, the Cholesky factorisation code is improved by making changes to the underlying algorithm. We also look at the .ptx and .cubin codes.

Downloads:


Day 3 - Wednesday 27th May

CUDA with multiple GPUs - Tobias Brandvik

[Lecture 5]

In this lecture, we show how to run CUDA codes over multiple GPUs on multiple hosts. The basics of MPI and their application to grid calculation methods on the CPU and GPU are presented.
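
The idea behind splitting a grid calculation across processes can be sketched without MPI. In the plain Python below, a 1D grid is cut into two subdomains, each carrying one extra "halo" cell that mirrors its neighbour's edge value. Everything here is an assumption for illustration: the explicit heat-conduction update used as the stand-in grid calculation, the two-way split, and all function names. The list assignments in `exchange_halos` take the place of the message passing that a real multi-host code would do, for example with a call such as `MPI_Sendrecv`.

```python
def step_interior(u, alpha):
    """Explicit update of all points except the first and last."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def split_with_halos(u, cut):
    """Split u at index cut; each part gets one halo cell at the cut."""
    left = u[:cut + 1]    # owns u[0:cut], halo holds u[cut]
    right = u[cut - 1:]   # halo holds u[cut-1], owns u[cut:]
    return left, right


def exchange_halos(left, right):
    """Refresh halo cells (stand-in for an MPI send/receive pair)."""
    left[-1] = right[1]   # left halo  <- first point owned by right
    right[0] = left[-2]   # right halo <- last point owned by left


def run_decomposed(u, alpha, steps, cut):
    """Advance the grid using two subdomains and return the joined field."""
    left, right = split_with_halos(u, cut)
    for _ in range(steps):
        exchange_halos(left, right)
        left = step_interior(left, alpha)
        right = step_interior(right, alpha)
    # Drop the halo cells when stitching the subdomains back together.
    return left[:-1] + right[1:]


# Hypothetical test case: compare against a single-domain run.
u0 = [1.0] + [0.0] * 10
decomposed = run_decomposed(list(u0), 0.25, 50, 5)
```

Because each point sees the same neighbour values as in the undivided grid, the decomposed run should match a single-domain run of `step_interior` over the same number of steps.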

Downloads:


Application example - medical imaging registration - Richard Ansorge

[Lecture 6]

In the final lecture, a target application is discussed in more detail. In addition, the use of 2D and 3D texture memory is described and compared to the more usual global and shared memory types.

Downloads: