ECE408
ECE408 (Applied Parallel Programming) is a 4-credit-hour senior level course that satisfies the Advanced Computing elective requirement for Computer Engineering majors.
ECE408 is cross-listed as CS483.
Content Covered:
- CUDA C Programming
- NVIDIA Nsight Compute and Nsight Systems (nsys) profiling tools
- GPU vs CPU Architecture
- State of the industry
- Memory Bandwidth Optimizations
- Reductions and Convolution
- Matrix Multiplication Techniques
- Sparse Matrix Formats
- Neural Networks
- Atomics and Tiling
The primary framework used in this class is CUDA, an extension of C/C++ maintained by NVIDIA for programming GPUs and GPU clusters.
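For those who have never seen CUDA, here is a minimal sketch of what a CUDA C kernel looks like (a simple vector add, similar in spirit to the early MPs; the names and launch configuration are illustrative, not course-provided code):

```cuda
#include <cuda_runtime.h>

// Each thread computes one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard threads in the last, partial block
        C[i] = A[i] + B[i];
}

// Host-side launch, assuming d_A, d_B, d_C were allocated with cudaMalloc:
//   vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
```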
In past semesters, the course has also included a few guest lectures from people who use parallel programming in their work or research, including speakers from NVIDIA, AMD, Intel, and others.
Prerequisites
ECE 220 is listed as a prerequisite, as most of this course is taught in C and requires a good understanding of pointers and memory management. Students not confident in their C skills may want to wait until after taking ECE 391 or CS 225 to strengthen their programming skills first.
When To Take It
The course is typically offered during both the Fall and Spring semesters. Most students take this technical elective during their Junior or Senior year.
The class does not require a very high workload, so don't worry much about taking it alongside other hard ECE classes. Taking this class before ECE 411 can be a good introduction to the differences between GPUs and CPUs.
In general, ECE 408 is a valuable class for anyone wanting to work with low-level software or computer architecture. Parallel programming and the differences between GPUs and CPUs are important topics for any Computer Engineering student to understand. For those interested in NVIDIA internships, knowing CUDA will definitely give you an edge, but the concepts taught in the class extend to any company doing parallel programming.
Course Structure
Lectures are traditionally held twice a week for 75 minutes each.
The assignments consist of 6-7 MPs that are all written in CUDA and executed on a GPU cluster (WebGPU). Each MP also includes a set of quiz questions worth 10% of that MP's grade; these are usually hosted on Canvas and are either multiple choice or short answer.
The exams have multiple choice, short answer, and code problems where you fill in missing parameters. Do not underestimate the exams. They cover the algorithms used in the MPs as well as alternative approaches brought up in lecture but not used in the MPs. Another significant topic is the effect of particular parameters and algorithms on the performance of a program.
This course also includes a final project where one designs and optimizes the forward-propagation stage of a neural network. The final project is primarily graded on the checkpoint reports and the optimizations used. There is a competition at the end, with extra credit available for those with the fastest kernels. The project is split into three checkpoints, or milestones:
- Milestone 1: Design the basic parallelism for a convolutional neural network
- Milestone 2: Optimize the convolution by converting it into a matrix multiplication problem using matrix unrolling, a method where the convolution filter and the input are converted into two matrices whose product is identical to the convolution output (see the sketch after this list)
- Milestone 3: Add optimizations (both required ones and optional ones of one's choice) to speed up the GPU kernel.
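Below is a rough sketch of the input-unrolling (im2col) idea behind Milestone 2, assuming a single input image with C channels of size H x W and a square K x K filter; the kernel name and indexing scheme are illustrative, not the course-provided skeleton:

```cuda
// Each thread copies one channel's K x K patch for one output position
// into a column of the unrolled matrix X_unroll (shape [C*K*K, H_out*W_out]).
__global__ void unroll_input(int C, int H, int W, int K,
                             const float *X, float *X_unroll) {
    int H_out = H - K + 1;
    int W_out = W - K + 1;
    int W_unroll = H_out * W_out;            // number of columns (output positions)

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < C * W_unroll) {
        int c = t / W_unroll;                // which input channel
        int col = t % W_unroll;              // which output position
        int h_out = col / W_out;
        int w_out = col % W_out;
        int row_base = c * K * K;            // first row of this channel's patch

        for (int p = 0; p < K; p++)
            for (int q = 0; q < K; q++)
                X_unroll[(row_base + p * K + q) * W_unroll + col] =
                    X[c * H * W + (h_out + p) * W + (w_out + q)];
    }
}
```

Multiplying the flattened filter weights (an M x C*K*K matrix) by X_unroll with a (tiled) matrix-multiplication kernel then produces the same output as the direct convolution.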
Students can drop one lab and can also use two late extensions combined across project checkpoints and labs. Beyond that, no late credit is given for labs or project checkpoints. Occasionally an extension is granted to everyone by default when the submission queue is so congested that students cannot test their code in time through no fault of their own.
Grade breakdown (for Spring 2023):
- Exam 1: 20%
- Exam 2: 20%
- Lab Assignments: 35% (5% each; the lab with the lowest score is dropped)
  - Correctness (number of data sets passed): 90%
  - Quiz questions: 10%
- Final Project: 25%
This grade breakdown has changed in recent semesters.
Grade breakdown (for Fall 2024):
- Exam 1: 25%
- Exam 2: 25%
- Lab Assignments: 25% (the lab with the lowest score is dropped)
  - Correctness (number of data sets passed): 90%
  - Quiz questions: 10%
- Final Project: 25%
Instructors
Applied Parallel Programming was taught in the past by Professor Wen-Mei Hwu, who designed the class material. In Spring 2023 it was taught by Professor Sanjay Patel and Professor Volodymyr Kindratenko; Kindratenko has also taught the most recent semesters (Spring 2024 and Fall 2024). There are usually a few TAs who grade the MPs and exams and hold office hours.
Course Tips
For 4 credit hours, this class is not terribly work intensive during the semester. The lecture material prepares students very well for the MPs, and important portions of the project are typically explained and analyzed thoroughly in lecture before the due date. Having a strong understanding of the problem and the parallel solution helps enormously on both exams and MPs. Do start early on the project milestones: debugging is difficult, and the cluster can get congested towards the milestone deadlines, which can be detrimental to profiling analysis and the milestone report.
Also pay attention in lectures (and actually attend), because the concepts on the exams can be difficult. Start studying early during exam weeks and attend office hours frequently.
The course is doable with a decently loaded schedule: taking it with courses like ECE 391 and ECE 411 might create a tough semester, but it is more manageable alongside courses like ECE 385 or ECE 374.
Life After
If you really enjoyed this course, check out ECE 492 (CS 420) - Parallel Programming: Science and Engineering, or ECE 508 - Manycore Parallel Algorithms.
If you are interested in working on the CUDA language itself or contributing to open parallelism platforms such as OpenMP, consider applying for internships at companies such as NVIDIA, AMD, and Intel.
If you are open to applying CUDA or parallel computing in industry, there are a wide variety of opportunities available.
In addition, expertise in parallel programming is highly valued in the academic community, particularly in research areas that deal with large amounts of data such as computational genomics and biomedical imaging. There are many opportunities for undergraduate and graduate students to get involved with research on campus if they are competent in parallel software design.
Infamous Topics
- The Midterms: Do not underestimate the difficulty of the midterms.
- The Textbook: As of Spring 2023, the provided textbook contained various errors, so be careful about trusting it too much.
- Profiling: Writing code for the project is only half the battle. When writing optimizations, students are required to write a report describing the optimizations they made and showing that those optimizations work as intended. A few tools are introduced in a guest lecture to get students started; the main ones are Nsight Systems and Nsight Compute. These tools let students view their kernel results, perform profiling analysis, and quantify the improvement from each optimization for the report. Make sure you profile your code and analyze your results carefully (a minimal event-timing sketch follows this list).
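As a lightweight complement to the Nsight tools, kernels can also be timed directly with CUDA events. A minimal sketch follows; the kernel, data size, and launch configuration are placeholders for whatever is actually being optimized:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute the kernel being optimized.
__global__ void my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);           // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```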
Resources
- [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)