51.511 Multimodal Generative AI
Course description
This course offers a comprehensive and practice-oriented introduction to the latest advances in multimodal generative AI, equipping students with the skills needed to critically assess, apply, and keep pace with state-of-the-art models beyond graduation. Emphasising language, vision, and audio modalities, the course covers foundational architectures such as transformers and diffusion models, together with modality-specific encoders and decoders. Students will gain hands-on experience with leading-edge models and APIs for tasks like text generation, image synthesis, speech and music generation, and cross-modal alignment.
Alongside these technical foundations, the course teaches students to critically evaluate generative models in real-world contexts, considering efficiency, scalability, accuracy, bias, resource requirements, and ethical implications. It also introduces the emerging paradigm of agentic AI, with practical labs using tools such as LangChain and the OpenAI Assistants API that enable students to prototype agents that plan, reason, and interact with external tools. Throughout the course, a strong emphasis is placed on rigorous evaluation methodologies, ensuring that students not only build with generative models but can also responsibly select and assess them for specific applications.
Instructors
Dorien Herremans, Zhang Wenxuan
Number of credits: 12