ECE408

ECE408 (Applied Parallel Programming) is a 4-credit-hour senior level course that satisfies the Advanced Computing elective requirement for Computer Engineering majors.

ECE408 is cross-listed as CS483.

Content Covered:

CUDA C Programming
NVIDIA Nsight Compute / Nsys profile tools
GPU vs CPU Architecture
State of the industry
Memory Bandwith Optimizations
Reductions and Convolution
Matrix Multiplication Techniques
Sparse Matrix Formats
Neural Networks
Atomics and Tiling

The primary framework used in this class is CUDA, a C-like language maintained by NVIDIA that supports programming on GPUs and GPU clusters.

This course has also included a few guest lectures in past semesters, from people who use parallel programming in their work or research. Guest lectures include those from NVIDIA, AMD, Intel and others.

Prerequisites

ECE220

ECE 220 is listed as a prerequisite, as most of this course is taught in C and requires a good understanding of pointers and memory management. For students not confident in their C skills, it may be wise to wait until taking ECE 391 or CS 225 in order to improve their programming skills first.

When To Take It

The course is typically offered during both the Fall and Spring semesters. Most students take this technical elective during their Junior or Senior year.

The class does not require a very high workload, so don't worry much about taking this with other hard ECE classes. Taking this class before 411 may be a good introduction to the differences between GPUs and CPUs.

In general, ECE 408 is a valuable class to take as someone wanting to work with low-level software or computer architecture. Parallel Programming and the differences between GPUs and CPUs are important topics to understand for any Computer Engineering student. For those interested in Nvidia internships, knowing CUDA will definitely give you an edge, but the concepts taught in the class extend to all companies doing anything with parallel programming.

Course Structure

Lecture is traditionally twice a week for 75 minutes each.

The assignments consist of 6-7 MPs that are all written in CUDA and executed on a GPU cluster (WebGPU or NCSA Delta Supercomputer). Each MP has a set of questions for each problem, worth 10% of the MP. The quiz questions are usually on canvas and either multiple choice or short answer. NCSA Delta is a public resource and as such queue times can be considerable especially during profiling for the Final Project.

The exams have multiple choice, short answer, fill in the blank and full page coding questions. Do not underestimate the exams. The content of the exams covers algorithms used in MPs as well as unused options brought up in lecture. Another significant topic is the effect of certain parameters and algorithms on the performance of the program.

This course also includes a final project where one designs and optimizes the forward propagation stage of a neural network. The final project is primarily graded on the checkpoint reports and optimizations used. There is a final competition at the end, with opportunities for extra credit for those with the fastest kernels. The project is split up into three checkpoints, or milestones. They are split up as follows:

Milestone 1: Design the CPU, naive GPU, and Matrix Unrolling (a method where a convolution filter and an input are converted into two matrices that can be multiplied together to produce the identical output) GPU Kernel.
Milestone 2: Implement a Kernel Fusion solution merging the three kernels that make up the Matrix Unrolling Milestone 1. Profile the CPU, GPU and Kernel Fusion implementations comparing them with NSight Systems and Compute.
Milestone 3: Add optimizations (both required ones and optional ones of one's choice) to speed up the GPU kernel. Write a report detailing all the optimiziations you've added.

Starting in Spring 2025 an alternative GPT-2 Transformer Project has been offered, replacing the Milestone 2 and Milestone 3 of the original project. The GPT-2 project is done in teams of 3-4 with the goal of building a fully-functional version of the model, diving deep into the transformer architecture. The project deliverables include the encoder kernel, layer normalization, GELU, as well as other required kernels.

Students can drop one lab and have an automatic 3-day extenstion on each lab, extenstions past the 3-days are only given under extenuating circumstances. No late credit is given for labs or project checkpoints. Sometimes an extension is given by default if the queue for lab tests is too high and students are not able to test their code in time for traffic reasons that are not their fault.

Grade breakdown (for Fall 2024):

Exam 1: 25%
Exam 2: 25%
Lab Assignments: 25%
Correctness (number of data sets passed) 90%
Quiz Questions 10% (Lab with the lowest score is dropped.)
Final Project: 25%

Instructors

Applied Parallel Programming has been taught by Professor Wen-Mei Hwu in the past, who designed the class material. It is currently being taught by Professor Sanjay Patel and Professor Volodymyr Kindratenko in Spring 2023. Kindratenko has taught the most recent semesters in Fall 2024 and Spring 2025. There are usually a few TAs who grade the MPs and Exams as well as hold office hours.

Course Tips

For 4 credit hours, this class is not terribly work intensive during the semester. The lecture material prepares students very well for the MPs, with important portions of the project are typically explained and analyzed thoroughly in lecture before the due date. Having a strong understanding of the problem and the parallel solution helps incredibly in test and MP performance. Do start early on the project milestones, as debugging is difficult and the traffic can get congested towards the milestone deadlines, which can be detrimental to profiling analysis and the project milestone report.

Do also pay attention in lectures (and actually attend), because the concepts on the exams can be difficult. Start early on studying during exam weeks and attend office hours frequently.

The course is doable with a decently-loaded schedule: taking it with courses like ECE391 and ECE411 might create a tough schedule, but the course is more managable with courses like ECE385 or ECE374.

Life After:

If you really enjoyed this course, check out ECE 492 (CS 420) - Parallel Progrmg: Sci & Engrg or ECE 508 - Manycore Parallel Algorithms.

If you are interested in working on the CUDA language itself or contributing to open parallelism platforms such as OpenMP, consider applying for internships at companies such as NVIDIA, AMD, and Intel.

If you are open to applying CUDA or parallel computing in industry, there are a wide variety of opportunities available.

In addition, expertise in parallel programming is highly valued in the academic community, particularly in research areas that deal with large amounts of data such as computational genomics and biomedical imaging. There are many opportunities for undergraduate and graduate students to get involved with research on campus if they are competent in parallel software design.

Infamous Topics

The Midterms: Do not underestimate the difficulty of the midterms.
The Textbook: As of Spring 2023, there were various errors in the textbook provided, be careful trusting the textbook too much.
Profiling: Writing code for the project is only half the battle: When writing optimizations, students are required to write a report about the optimizations they did for the project and show how their optimizations are correct. A few tools are introduced in the guest lecture to get students started: the main ones are Nsight Systems and Nsight Compute. Students are enabled with the tools to view their kernel results, perform profile analysis and analyze the improvements of their optimizations for the report. Make sure you profile your code and analyze your results carefully.

Resources

[CUDA C++ Programming Guide] (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)