← Dictionary
AI Concepts·intermediate

Multimodal

An AI that can handle text plus images, audio, or video — not just typed input.

Early LLMs only did text in, text out. Modern models are multimodal — you can feed them images, PDFs, audio, sometimes video, and they'll reason about all of it alongside your text.

Practical examples:

  • Upload a screenshot of a bug and ask "what's wrong?"
  • Photograph a handwritten recipe and ask for the ingredient list as JSON.
  • Paste a chart and ask the model to describe the trend.

Mosaic students meet this early when they upload homework photos to claude.ai. "Multimodal" is just the technical name for that capability.

Related