The Emergence of Deep Research Features
The advent of Deep Research features and AI-driven analysis has spurred a wave of models and services designed to simplify how businesses interpret the wide range of documents they rely on. Canadian AI company Cohere is positioning itself at the forefront of this trend, most recently with a visual model built specifically for enterprise applications.
Introducing Command A Vision
Cohere has introduced Command A Vision, a visual model aimed at enterprise use cases and built on the foundation of its Command A model. The 112-billion-parameter model extracts insights from visual data and supports precise, data-driven decisions through document optical character recognition (OCR) and image analysis. According to the company, “Whether it’s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges.”
Capabilities of Command A Vision
Command A Vision is adept at reading and analyzing the kinds of images enterprises encounter most often, including graphs, charts, diagrams, scanned documents, and PDFs. Designed for enterprise multimodal use cases, the model can interpret product manuals, analyze photographs, and answer questions about charts.
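As a rough illustration of how such a query might look in practice, the sketch below sends a chart image to the model through Cohere’s Python SDK. The model identifier, the image content-block schema, and the response fields are assumptions based on common multimodal chat conventions, not confirmed details; Cohere’s API reference is the authoritative source.

```python
# Minimal sketch (assumed schema): asking Command A Vision a question about a chart
# via Cohere's Python SDK. Model name, content-block format, and response fields
# are assumptions -- check Cohere's API reference for the exact interface.
import base64

import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")  # placeholder key

# Encode a local chart image as a base64 data URL (a common convention for
# passing images to multimodal chat APIs). File name is hypothetical.
with open("quarterly_revenue_chart.png", "rb") as f:
    image_data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

response = co.chat(
    model="command-a-vision-07-2025",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which quarter shows the largest revenue increase, and by how much?"},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }
    ],
)
print(response.message.content[0].text)
```

The same pattern would apply to scanned documents or diagram pages: encode the file, attach it alongside a natural-language question, and read back the model’s answer.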
Technical Specifications
Built on the architecture of Command A, Command A Vision requires two or fewer GPUs to run, like its text counterpart. It retains Command A’s text processing capabilities, so it can read text within images and understands at least 23 languages. Cohere says that, unlike other models, Command A Vision reduces enterprises’ total cost of ownership and is fully optimized for retrieval use cases.
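For teams planning a deployment, the sketch below shows what loading the open weights on a two-GPU node might look like, assuming the weights are published on Hugging Face and supported by the transformers image-text-to-text classes; the repository id and dtype choice are placeholders, not confirmed details.

```python
# Minimal sketch (assumed repo id and classes): loading the open weights across
# two GPUs with Hugging Face transformers. device_map="auto" shards the layers
# over the available devices so the ~112B-parameter model fits on a small node.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "CohereLabs/command-a-vision-07-2025"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory footprint
    device_map="auto",           # shard weights across the available GPUs
)
```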
Architectural Insights
Cohere developed Command A models, including the visual variant, using a LLaVA-style architecture that converts visual features into soft vision tokens. These tokens are divided into tiles, which are then passed to the Command A text tower, a dense 111-billion-parameter language model. As a result, a single image can consume up to 3,328 tokens. Cohere trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT), and post-training reinforcement learning from human feedback (RLHF). This approach maps image-encoder features into the language model’s embedding space.
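To make that description concrete, here is a toy, self-contained sketch of a LLaVA-style connector: tiled vision features are projected into soft tokens in the text tower’s embedding space, capped at the 3,328-token per-image budget, and concatenated with text embeddings. All dimensions and module names are invented for illustration and do not reflect Cohere’s actual implementation.

```python
# Illustrative sketch of a LLaVA-style connector: tiled image features are
# projected into "soft vision tokens" in the language model's embedding space
# and concatenated with text embeddings. All sizes are assumptions, except the
# 3,328-token per-image budget cited by Cohere.
import torch
import torch.nn as nn

VISION_DIM = 1024        # assumed vision-encoder feature width
TEXT_DIM = 8192          # assumed text-tower embedding width
MAX_IMAGE_TOKENS = 3328  # per-image token budget cited by Cohere

class VisionConnector(nn.Module):
    """Maps vision-encoder tile features to soft tokens the text tower can consume."""

    def __init__(self) -> None:
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(VISION_DIM, TEXT_DIM),
            nn.GELU(),
            nn.Linear(TEXT_DIM, TEXT_DIM),
        )

    def forward(self, tile_features: torch.Tensor) -> torch.Tensor:
        # tile_features: (num_tiles, patches_per_tile, VISION_DIM)
        tokens = self.projector(tile_features)  # -> (tiles, patches, TEXT_DIM)
        tokens = tokens.flatten(0, 1)           # merge tiles into one token sequence
        return tokens[:MAX_IMAGE_TOKENS]        # enforce the per-image budget

# Toy usage: 4 tiles of 832 patches each exactly fill the 3,328-token budget.
connector = VisionConnector()
soft_tokens = connector(torch.randn(4, 832, VISION_DIM))
text_embeddings = torch.randn(1, 128, TEXT_DIM)  # stand-in for the embedded prompt
fused = torch.cat([soft_tokens.unsqueeze(0), text_embeddings], dim=1)
print(fused.shape)  # (1, 3328 + 128, TEXT_DIM) -> what the text tower would process
```

In the real model, the projected tokens come from the image encoder and the fused sequence flows through the 111-billion-parameter text tower; the sketch only shows the shape of the interface between the two components.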
Benchmark Performance
Cohere conducted benchmark tests comparing Command A Vision against several notable models, including OpenAI’s GPT-4.1, Meta’s Llama 4 Maverick, and Mistral’s Pixtral Large and Mistral Medium 3. Command A Vision came out ahead across assessments including ChartQA, OCRBench, AI2D, and TextVQA, achieving an average score of 83.1% versus 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick, and 78.3% for Mistral Medium 3.
The Importance of Multimodal Models
Most large language models (LLMs) today are multimodal, capable of generating or understanding visual media such as photos and videos. However, enterprises typically rely on more graphical documents, such as charts and PDFs. Extracting information from these unstructured data sources can be challenging. As Deep Research gains traction, the demand for models that can read, analyze, and even download unstructured data has intensified.
Cohere is also releasing Command A Vision as open weights, hoping to attract enterprises that want to move away from closed or proprietary models. Initial feedback from developers has been promising, with many praising the model’s accuracy in reading handwritten notes in images.