The Emergence of Deep Research Features
The advent of Deep Research features and AI-driven analysis has spurred a wave of models and services designed to simplify how businesses interpret the wide range of documents they rely on. Canadian AI company Cohere is positioning itself at the forefront of this trend, most recently with a visual model built specifically for enterprise applications.
Introducing Command A Vision
Cohere has introduced Command A Vision, a visual model aimed at enterprise use cases and built on the foundation of its Command A model. The 112-billion-parameter model extracts insights from visual data and supports precise, data-driven decisions through document optical character recognition (OCR) and image analysis. According to the company, “Whether it’s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges.”
Capabilities of Command A Vision
Command A Vision is adept at reading and analyzing the kinds of images enterprises encounter most often, including graphs, charts, diagrams, scanned documents, and PDFs. Designed for enterprise multimodal use cases, the model can interpret product manuals, analyze photographs, and answer questions about charts.
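As a rough illustration of how such a query might look in practice, the sketch below sends a chart image to the model through Cohere’s Python SDK. The model identifier, the image content-block schema, and the response fields are assumptions based on common multimodal chat conventions, not confirmed details; Cohere’s API reference is the authoritative source.

```python
# Minimal sketch (assumed schema): asking Command A Vision a question about a chart
# via Cohere's Python SDK. Model name, content-block format, and response fields
# are assumptions -- check Cohere's API reference for the exact interface.
import base64

import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")  # placeholder key

# Encode a local chart image as a base64 data URL (a common convention for
# passing images to multimodal chat APIs). File name is hypothetical.
with open("quarterly_revenue_chart.png", "rb") as f:
    image_data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

response = co.chat(
    model="command-a-vision-07-2025",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which quarter shows the largest revenue increase, and by how much?"},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }
    ],
)
print(response.message.content[0].text)
```

The same pattern would apply to scanned documents or diagram pages: encode the file, attach it alongside a natural-language question, and read back the model’s answer.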
Technical Specifications
Built on the architecture of Command A, Command A Vision requires two or fewer GPUs to run, like its text counterpart. It retains Command A’s text processing capabilities, so it can read text within images and understands at least 23 languages. Cohere says that, unlike other models, Command A Vision reduces enterprises’ total cost of ownership and is fully optimized for retrieval use cases.
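For teams planning a deployment, the sketch below shows what loading the open weights on a two-GPU node might look like, assuming the weights are published on Hugging Face and supported by the transformers image-text-to-text classes; the repository id and dtype choice are placeholders, not confirmed details.

```python
# Minimal sketch (assumed repo id and classes): loading the open weights across
# two GPUs with Hugging Face transformers. device_map="auto" shards the layers
# over the available devices so the ~112B-parameter model fits on a small node.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "CohereLabs/command-a-vision-07-2025"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory footprint
    device_map="auto",           # shard weights across the available GPUs
)
```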
Architectural Insights
Cohere developed Command A models, including the visual variant, using a LLaVA-style architecture that converts visual features into soft vision tokens. These tokens are divided into tiles, which are then passed to the Command A text tower, a dense 111-billion-parameter language model. As a result, a single image can consume up to 3,328 tokens. Cohere trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT), and post-training reinforcement learning from human feedback (RLHF). This approach maps image-encoder features into the language model’s embedding space.
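To make that description concrete, here is a toy, self-contained sketch of a LLaVA-style connector: tiled vision features are projected into soft tokens in the text tower’s embedding space, capped at the 3,328-token per-image budget, and concatenated with text embeddings. All dimensions and module names are invented for illustration and do not reflect Cohere’s actual implementation.

```python
# Illustrative sketch of a LLaVA-style connector: tiled image features are
# projected into "soft vision tokens" in the language model's embedding space
# and concatenated with text embeddings. All sizes are assumptions, except the
# 3,328-token per-image budget cited by Cohere.
import torch
import torch.nn as nn

VISION_DIM = 1024        # assumed vision-encoder feature width
TEXT_DIM = 8192          # assumed text-tower embedding width
MAX_IMAGE_TOKENS = 3328  # per-image token budget cited by Cohere

class VisionConnector(nn.Module):
    """Maps vision-encoder tile features to soft tokens the text tower can consume."""

    def __init__(self) -> None:
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(VISION_DIM, TEXT_DIM),
            nn.GELU(),
            nn.Linear(TEXT_DIM, TEXT_DIM),
        )

    def forward(self, tile_features: torch.Tensor) -> torch.Tensor:
        # tile_features: (num_tiles, patches_per_tile, VISION_DIM)
        tokens = self.projector(tile_features)  # -> (tiles, patches, TEXT_DIM)
        tokens = tokens.flatten(0, 1)           # merge tiles into one token sequence
        return tokens[:MAX_IMAGE_TOKENS]        # enforce the per-image budget

# Toy usage: 4 tiles of 832 patches each exactly fill the 3,328-token budget.
connector = VisionConnector()
soft_tokens = connector(torch.randn(4, 832, VISION_DIM))
text_embeddings = torch.randn(1, 128, TEXT_DIM)  # stand-in for the embedded prompt
fused = torch.cat([soft_tokens.unsqueeze(0), text_embeddings], dim=1)
print(fused.shape)  # (1, 3328 + 128, TEXT_DIM) -> what the text tower would process
```

In the real model, the projected tokens come from the image encoder and the fused sequence flows through the 111-billion-parameter text tower; the sketch only shows the shape of the interface between the two components.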
Benchmark Performance
Cohere conducted benchmark tests comparing Command A Vision against several notable models, including OpenAI’s GPT-4.1, Meta’s Llama 4 Maverick, and Mistral’s Pixtral Large and Mistral Medium 3. Command A Vision came out ahead across assessments including ChartQA, OCRBench, AI2D, and TextVQA, achieving an average score of 83.1% versus 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick, and 78.3% for Mistral Medium 3.
The Importance of Multimodal Models
Most large language models (LLMs) today are multimodal, capable of generating or understanding visual media such as photos and videos. However, enterprises typically rely on more graphical documents, such as charts and PDFs. Extracting information from these unstructured data sources can be challenging. As Deep Research gains traction, the demand for models that can read, analyze, and even download unstructured data has intensified.
Cohere is also releasing Command A Vision as open weights, hoping to attract enterprises that want to move away from closed or proprietary models. Initial feedback from developers has been promising, with many praising the model’s accuracy in reading handwritten notes in images.