
Real multimodality: who processes audio and video natively in 2026

11 Jun 2026

"Multimodal" has become a marketing term. Almost every model released in 2026 describes itself as multimodal. But there is a fundamental technical difference between a model that accepts images via a module added after training and one that was trained with all modalities from the start. In production, this difference shows up in reasoning quality, response coherence, and especially the ability to process audio and video — not just static images.

What "native" means technically

Multimodal models can be built in two ways.

The first is early fusion: text, image, audio, and video data are converted into tokens from the beginning of the training pipeline. The model learns joint representations of all modalities simultaneously. When you ask the model about the content of a video, it reasons about video the same way it reasons about text — it is not a separate operation.

The second is late fusion: a text model receives the outputs of separate encoders (for vision, audio, and so on) attached as additional modules. This is easier to build, but the model does not learn deep relationships between modalities during pre-training. The result tends to be shallower multimodal reasoning and lower robustness when the modalities need to be integrated.

Most multimodal models from 2024 used late fusion. In 2026, the leaders migrated to early fusion — but adoption is not universal.
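The difference between the two constructions can be sketched as a toy data flow. This is a minimal illustration, not any vendor's architecture: the vocabulary sizes, the function names, and the discrete-token scheme for early fusion are all assumptions made for the example, and real systems may use continuous embeddings instead.

```python
# Hypothetical vocabulary sizes, chosen only for illustration.
VOCAB_SIZES = {"text": 1000, "image": 512, "audio": 256}

def build_offsets(sizes):
    """Give each modality its own id range inside one shared vocabulary."""
    offsets, start = {}, 0
    for modality, size in sizes.items():
        offsets[modality] = start
        start += size
    return offsets, start

def early_fusion_tokens(segments, offsets):
    """Early fusion: every modality becomes tokens in ONE sequence that a
    single model sees from the start of pre-training."""
    tokens = []
    for modality, ids in segments:
        tokens.extend(offsets[modality] + i for i in ids)
    return tokens

def project(matrix, vector):
    """Plain matrix-vector product, standing in for a learned adapter."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def late_fusion_inputs(text_embeddings, encoder_features, adapter):
    """Late fusion: a separate encoder produces features, and an adapter
    maps them into the embedding space of an existing text model."""
    projected = [project(adapter, f) for f in encoder_features]
    return projected + text_embeddings
```

In the early-fusion path the backbone is trained on the mixed sequence itself. In the late-fusion path only the adapter (and possibly the encoder) bridges the gap to a text model that was pre-trained without the other modalities.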

The capabilities map in May 2026

Gemini 3.1 Pro (Google): The current leader in complete native multimodality. It processes text, image, audio, video, and PDF in a unified context window of 1 million tokens, and it is the only frontier model that accepts all five input types natively in the same API call. Maximum output: 64K tokens. Cost: US$ 2.00 (input) / US$ 12.00 (output) per million tokens. GPQA Diamond: 94.3%.

GPT-5.5 (OpenAI): Text, images, and audio. Video support is limited — the model does not process video directly as a stream; frame extraction is still necessary for most cases. Enhanced computer use is the highlight of the April update.

Claude Opus 4.7 (Anthropic): Text and images only. Native audio and video are not supported. Anthropic's focus has been reasoning, code, and agentic capabilities rather than multimodal expansion. For use cases that require audio or video processing, Claude is not an option as of May 2026.

Llama 4 Scout/Maverick (Meta): Text and image via early fusion — the first open model with native multimodal MoE. Video and audio are not supported. For an open model, the level of text-image integration is superior to what was available before.

Gemma 4 E2B/E4B (Google): The only open source models with native video and audio, via early fusion, designed to run on edge devices. The limitation is size: they are models of 2 and 4 billion effective parameters, with more limited general capability than the larger models.

Grok 4.3 (xAI): Text, image, and native video, with direct processing of video files without frame extraction. Native video is one of the differentiators of 4.3 relative to 4.20. Audio is not listed as a supported modality.
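The map above can be encoded as a small lookup table for choosing a model by required input types. This is a sketch built only from the descriptions in this post. The dictionary, the modality labels, and the `models_supporting` helper are illustrative names, not any provider's API. Limited or partial support (such as video through frame extraction) is deliberately left out of the sets.

```python
# Native input modalities as described in this post (May 2026).
CAPABILITIES = {
    "Gemini 3.1 Pro": {"text", "image", "audio", "video", "pdf"},
    "GPT-5.5": {"text", "image", "audio"},           # video only via frame extraction
    "Claude Opus 4.7": {"text", "image"},
    "Llama 4 Scout/Maverick": {"text", "image"},
    "Gemma 4 E2B/E4B": {"text", "audio", "video"},   # text assumed; image not stated above
    "Grok 4.3": {"text", "image", "video"},          # audio not listed
}

def models_supporting(required):
    """Return the models whose native inputs cover every required modality."""
    required = set(required)
    return sorted(name for name, caps in CAPABILITIES.items() if required <= caps)

print(models_supporting({"audio", "video"}))
```

Treat the table as a snapshot to re-check against each provider's documentation, since modality support changes between releases.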

Why native audio changes use cases

Most applications that process audio today use a separate pipeline: transcription via Whisper or equivalent, followed by text processing by the LLM. It works, but it has limitations.

Transcription loses paralinguistic information — intonation, pauses, emotional emphasis. A model with native audio can infer mood, certainty, or hesitation directly from how something was said, not just from what was said. For analysis of customer support calls, interviews, and meetings, this dimension is relevant.
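A toy example makes the information loss concrete. The `Utterance` fields, the one-second pause threshold, and both functions are hypothetical stand-ins. The code does not show how a speech recognizer or a native audio model works, only what survives the transcript boundary.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    words: str
    pause_before_s: float   # silence before the speaker starts
    pitch_rising: bool      # e.g. a questioning intonation

def transcribe(utterances):
    """Stand-in for the transcription step: keeps the words, drops the rest."""
    return " ".join(u.words for u in utterances)

def hesitation_cues(utterances, min_pause_s=1.0):
    """Signals that exist in the audio but never reach a text-only model."""
    return [u.words for u in utterances if u.pause_before_s >= min_pause_s]

call = [
    Utterance("Yes", 0.1, False),
    Utterance("I guess that works", 2.3, True),
]
print(transcribe(call))        # the only thing the text LLM receives
print(hesitation_cues(call))   # lost in the two-step pipeline
```

The transcript reads as plain agreement, while the long pause and rising pitch before the second utterance suggest hesitation that only an audio-aware model could pick up.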

The transcription+text pipeline also adds latency and cost. For real-time applications — voice assistants, live transcription with simultaneous analysis — the latency of the dual pipeline can be prohibitive. Models with native audio eliminate a step.

Why native video changes use cases

Video is composed of frames (images) plus audio, plus the temporal dimension between frames. Extracting frames and processing them as individual images loses the temporal context — what changed between frame 1 and frame 100, the movement, the sequence of events.
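The effect of frame extraction can be illustrated with simple index arithmetic. The function names and the frame numbers below are made up for the example. The sketch only shows that an event shorter than the sampling stride can fall entirely between two extracted frames.

```python
def sampled_frames(total_frames, stride):
    """Indices kept when extracting one frame every `stride` frames."""
    return list(range(0, total_frames, stride))

def event_is_seen(event_start, event_end, total_frames, stride):
    """True if at least one extracted frame falls inside the event."""
    return any(
        event_start <= f <= event_end
        for f in sampled_frames(total_frames, stride)
    )

# An event spanning frames 40 to 60 in a 300-frame clip:
print(event_is_seen(40, 60, 300, stride=100))  # sparse sampling
print(event_is_seen(40, 60, 300, stride=25))   # denser sampling
```

Denser sampling catches the event but multiplies the number of images to process. Each image is still analyzed without an explicit link to its neighbors unless the pipeline adds one.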

Native video processing captures the temporal dimension. For security analysis (CCTV cameras), industrial monitoring, sports analysis, or recorded medical procedures, the sequence matters as much as the individual frame.

With its 1-million-token context, Gemini 3.1 Pro can process long videos, not just short clips. For reviewing video documentation, analyzing training recordings, or auditing processes, this opens up use cases that previously required full human analysis.
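A rough budget check shows how video length relates to the context window and to input cost. The 1-million-token window and the US$ 2.00 price come from this post (assuming the first listed price is the input price). The rate of 300 tokens per second of video is a placeholder assumption, not a documented figure, so substitute the value from the provider's documentation.

```python
CONTEXT_TOKENS = 1_000_000   # context window cited in this post
INPUT_PRICE_PER_M = 2.00     # US$ per million tokens, assumed to be the input price

def video_tokens(duration_s, tokens_per_second):
    """Approximate token count for a video of the given duration."""
    return int(duration_s * tokens_per_second)

def fits_in_context(duration_s, tokens_per_second, reserved_tokens=0):
    """Check whether the video plus reserved prompt tokens fits the window."""
    return video_tokens(duration_s, tokens_per_second) + reserved_tokens <= CONTEXT_TOKENS

def input_cost_usd(tokens):
    """Input cost at the per-million price above."""
    return tokens / 1_000_000 * INPUT_PRICE_PER_M

RATE = 300  # hypothetical tokens per second of video
tokens_45_min = video_tokens(45 * 60, RATE)
print(tokens_45_min, fits_in_context(45 * 60, RATE), input_cost_usd(tokens_45_min))
print(fits_in_context(60 * 60, RATE))
```

Under that placeholder rate, a 45-minute recording fits in the window and a 60-minute one does not. The real limit depends on the actual tokenization rate and on how many tokens are reserved for the prompt.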

The 2026 frontier: real-time video

What does not yet exist reliably in any commercially available model in May 2026 is live video processing — a real-time camera stream with simultaneous reasoning.

There are technical demonstrations and previews in some labs. But for stable production, the current frontier is recorded video, not a live stream. For industrial IoT with real-time cameras, there is still dependence on hybrid pipelines with specialized computer vision processing before the LLM.
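The hybrid pattern can be sketched as a cheap gate in front of the model. Here plain frame differencing stands in for the specialized computer vision stage. The frame representation (flat lists of grayscale values), the threshold, and the function names are assumptions made for illustration, and the call to the LLM itself is not shown.

```python
def mean_abs_diff(a, b):
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def frames_to_escalate(frames, threshold):
    """Indices of frames that changed enough from the previous frame to be
    worth forwarding to the (slower, costlier) multimodal model."""
    selected = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) >= threshold:
            selected.append(i)
    return selected

# Four tiny 4-pixel "frames": almost static, then a sudden change, then static.
stream = [
    [10, 10, 10, 10],
    [10, 11, 10, 10],
    [90, 80, 10, 10],
    [90, 80, 10, 10],
]
print(frames_to_escalate(stream, threshold=5))
```

Only the frame where the scene changes is escalated, which keeps the real-time path in lightweight code and reserves the model for the moments that need reasoning.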

This is the next barrier — and the race to overcome it is already underway.
