What multimodality is
Before 2024, mainstream AI was mostly text. Multimodal means the model natively processes multiple input and output formats: text, images, audio, video. Not by stacking separate systems (OCR → text → LLM), but within a single unified architecture.
Vision: more than OCR
Modern vision in AI goes well beyond text extraction: document understanding (reading forms, charts, and diagrams), visual reasoning (answering questions about complex images), visual debugging (analyzing screenshots of code), and visual accessibility (describing images for blind and low-vision users).
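To make this concrete, here is a minimal sketch of document understanding: a single request that carries both an image and a question. It assumes the OpenAI Python SDK (v1+) with an API key in the environment; the file name "invoice.png" and the prompt are illustrative.

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local document image (hypothetical file) as a data URL.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the total amount and the due date from this invoice."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```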
Voice: real conversation
GPT-4o reset expectations with an average audio response time of roughly 320 ms, comparable to human response times in conversation, which makes basic voice interactions feel natural. The model also picks up tone, emotion, and ambient sound. It doesn't just transcribe; it understands the audio itself.
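A minimal sketch of audio as a first-class input, assuming the OpenAI Python SDK and an audio-capable chat model such as gpt-4o-audio-preview; the recording "meeting.wav" is hypothetical. (The low-latency voice experience described above runs over a dedicated streaming interface rather than a one-shot request like this.)

```python
import base64

from openai import OpenAI

client = OpenAI()

# Send the raw audio (hypothetical file), not a transcript, so tone and
# background context are available to the model.
with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",   # assumed audio-capable model
    modalities=["text"],            # ask for a text reply; audio output is also possible
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize this recording and describe the speaker's tone."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(completion.choices[0].message.content)
```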
Video: the next frontier
As input, video enables summarizing long meetings, analyzing scenes, and retrieving specific moments from sports footage; Gemini 2.0 currently leads here. As output (Sora, Veo), video is the most expensive frontier, but also the most creative.
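A minimal sketch of video as input, assuming the google-generativeai Python SDK and its File API; the file name, model name, and prompt are illustrative.

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or set GOOGLE_API_KEY

# Upload the recording (hypothetical file) and wait for server-side processing.
video = genai.upload_file(path="standup.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    [video, "Summarize the decisions made in this meeting and list the action items."]
)
print(response.text)
```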
Cross-modal: the real power
The interesting part isn't processing each modality separately; it's linking them: "describe this image AND tell me how it relates to this text", or "summarize this audio in chart form".
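For example, a single request can interleave a report excerpt with a chart image and ask the model to reconcile the two. A sketch, again assuming the OpenAI Python SDK; the file name and excerpt are made up.

```python
import base64

from openai import OpenAI

client = OpenAI()

with open("q3_revenue.png", "rb") as f:  # hypothetical chart image
    chart_b64 = base64.b64encode(f.read()).decode("utf-8")

excerpt = "Q3 revenue grew 18% year over year, driven mainly by the EU market."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Report excerpt: {excerpt}\n"
                     "Does the chart below support this claim? Point out any mismatch."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{chart_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```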
Practical use cases
Customer support: a customer sends a product photo plus a voice complaint, and the AI processes both. Education: students upload photographed notes and ask questions by voice. Medicine: a radiologist works with an AI that reads imaging alongside clinical text. Retail: visual search combined with natural-language queries.
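A sketch of the support scenario, assuming a pragmatic two-step flow: transcribe the voice complaint with Whisper, then send the transcript together with the product photo in one vision request. The SDK, model names, and file names are assumptions, not a prescribed implementation.

```python
import base64

from openai import OpenAI

client = OpenAI()

# Step 1: speech-to-text for the voice complaint (hypothetical file).
with open("complaint.m4a", "rb") as f:
    complaint = client.audio.transcriptions.create(model="whisper-1", file=f).text

# Step 2: transcript + product photo (hypothetical file) in a single request.
with open("product.jpg", "rb") as f:
    photo_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Customer complaint (transcribed): {complaint}\n"
                     "Based on the attached photo, what is the likely issue and the next step?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{photo_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```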
Conclusion
Multimodality isn't a "feature"; it's the new baseline. Modern applications expect users to interact naturally through text, voice, image, and video. Building text-only in 2026 means leaving half of the user experience on the table.