
CIRCLE - Marco Garosi

Large Multimodal Models as General In-Context Classifiers
CVPR Findings 2026

Large Multimodal Models, Open World, Image classification
A representation of CIRCLE (CIRCLE Iteratively Refines Contextual Learning Examples): starting from unannotated images, it first assigns a pseudo-label to each one independently. Next, it iteratively refines these labels by taking all the other images into account. As a result, CIRCLE produces a context that can be used to classify new inputs via In-Context Learning (ICL).
ABSTRACT

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.

• The first systematic analysis of ICL in LMMs for closed-world image classification.
• In-depth comparison of in-context LMMs with cache-based VLMs, showing that LMMs with ICL can match and even surpass VLMs.
• Introduction of CIRCLE, a new approach that enhances LMMs for open-world classification using only unlabeled images as ICL examples, iteratively refining their pseudo-labels.
• Extensive benchmarking of CIRCLE against naïve ICL, showing that the latter struggles in open-world settings.
• Performance improvements: CIRCLE substantially improves the performance of the base model, consistently surpassing VLMs and making a strong case for adopting LMMs for discriminative tasks.
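The iterative pseudo-labeling loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `classify` is a hypothetical stand-in for prompting an LMM with an (example, pseudo-label) context and a query, replaced here by a toy nearest-neighbor voter so the sketch is runnable, and the "images" are plain numbers.

```python
def circle(examples, classify, n_rounds=3):
    """Sketch of the CIRCLE loop: build an ICL context from unlabeled examples.

    `classify(x, context)` stands in for an LMM call that labels query x
    given `context`, a list of (example, pseudo_label) pairs.
    """
    # Step 1: pseudo-label every example independently (empty context = zero-shot).
    labels = [classify(x, []) for x in examples]

    # Step 2: iteratively re-label each example using all the others as context.
    for _ in range(n_rounds):
        new_labels = [
            classify(x, [(e, l) for j, (e, l) in enumerate(zip(examples, labels)) if j != i])
            for i, x in enumerate(examples)
        ]
        if new_labels == labels:  # pseudo-labels converged, stop early
            break
        labels = new_labels
    return list(zip(examples, labels))


def toy_classify(x, context):
    """Toy stand-in for the LMM: a fixed threshold when zero-shot,
    a majority vote among the 3 nearest in-context examples otherwise."""
    if not context:
        return "high" if x > 5 else "low"
    nearest = sorted(context, key=lambda pair: abs(pair[0] - x))[:3]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


# Build a context from unlabeled inputs, then classify new inputs via ICL.
context = circle([1, 2, 3, 8, 9, 10], toy_classify)
```

The refined context can then label unseen inputs purely in-context, e.g. `toy_classify(0, context)` and `toy_classify(12, context)`; in the paper the same role is played by the LMM conditioned on the pseudo-labeled examples.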

We provide additional details and complete per-dataset and per-model results in the supplementary material. You can find it at the end of the arXiv paper.

• Marco Garosi (DISI, University of Trento)
• Matteo Farina (DISI, University of Trento)
• Alessandro Conti (DISI, University of Trento)
• Massimiliano Mancini (DISI, University of Trento)
• Elisa Ricci (DISI, University of Trento, and Fondazione Bruno Kessler)

You can cite this work as:


@inproceedings{garosi2026circle,
title = {Large Multimodal Models as General In-Context Classifiers},
author = {Garosi, Marco and Farina, Matteo and Conti, Alessandro and
Mancini, Massimiliano and Ricci, Elisa},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition Findings},
year = {2026}
}