An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval | ✓ Link | 56.74 | | | RTD + LinCIR (CLIP G/14) | 2024-06-13 |
Language-only Efficient Training of Zero-shot Composed Image Retrieval | ✓ Link | 55.40 | | | LinCIR (CLIP G/14) | 2023-12-04 |
Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval | ✓ Link | 54.45 | | | SEIZE (CLIP G/14) | 2024-10-28 |
CoLLM: A Large Language Model for Composed Image Retrieval | ✓ Link | 49.9 | 39.1 | 60.7 | CoLLM (finetuned - BLIP-L/16) | 2025-03-25 |
SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval | | 49.24 | 38.45 | 60.03 | SCOT (WACV 2025) | 2025-01-12 |
CoVR-2: Automatic Data Construction for Composed Video Retrieval | ✓ Link | 48.3 | 38.15 | 58.44 | CoVR-BLIP-2 | 2023-08-28 |
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions | ✓ Link | 48.1 | 38 | 58.2 | MagicLens (CoCa L) | 2024-03-28 |
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval | ✓ Link | 47.34 | | | OSrCIR (CLIP G/14) | 2024-12-15 |
Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity | ✓ Link | 47.16 | | | WeiMoCIR (CLIP G/14) | 2024-09-07 |
Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval | ✓ Link | 46.42 | | | MTCIR (CLIP L/14) | 2023-11-13 |
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion | ✓ Link | 45.37 | | | CompoDiff (CLIP G/14) | 2023-03-21 |
CoLLM: A Large Language Model for Composed Image Retrieval | ✓ Link | 45.3 | 34.6 | 56.0 | CoLLM (Pretrained - BLIP-L/16) | 2025-03-25 |
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions | ✓ Link | 45.3 | | | MagicLens (CoCa B) | 2024-03-28 |
Zero-shot Composed Text-Image Retrieval | ✓ Link | 44.75 | | | TransAgg (Laion-CIR-Combined) | 2023-06-12 |
Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity | ✓ Link | 44.58 | | | WeiMoCIR (CLIP H/14) | 2024-09-07 |
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion | ✓ Link | 44.11 | | | CompoDiff (CLIP L/14) | 2023-03-21 |
LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval | ✓ Link | 43.98 | | | LDRE (CLIP G/14) | 2024-07-11 |
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval | ✓ Link | 42.87 | | | OSrCIR (CLIP B/32) | 2024-12-15 |
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval | ✓ Link | 42.82 | | | OSrCIR (CLIP L/14) | 2024-12-15 |
Vision-by-Language for Training-Free Compositional Image Retrieval | ✓ Link | 42.28 | | | CIReVL (CLIP G/14) | 2023-10-13 |
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions | ✓ Link | 41.6 | 30.7 | 52.5 | MagicLens (CLIP L) | 2024-03-28 |
Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity | ✓ Link | 41.27 | | | WeiMoCIR (CLIP L/14) | 2024-09-07 |
An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval | ✓ Link | 40.66 | | | RTD + LinCIR (CLIP L/14) | 2024-06-13 |
Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity | ✓ Link | 39.84 | | | WeiMoCIR (CLIP B/32) | 2024-09-07 |
CoLLM: A Large Language Model for Composed Image Retrieval | ✓ Link | 39.8 | 30.1 | 49.5 | CoLLM (Pretrained - CLIP-L/14) | 2025-03-25 |
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval | ✓ Link | 39.39 | | | iSEARLE-XL-OTI (CLIP L/14) | 2024-05-05 |
Vision-by-Language for Training-Free Compositional Image Retrieval | ✓ Link | 38.82 | | | CIReVL (CLIP B/32) | 2023-10-13 |
Vision-by-Language for Training-Free Compositional Image Retrieval | ✓ Link | 38.56 | | | CIReVL (CLIP L/14) | 2023-10-13 |
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval | ✓ Link | 38.35 | | | Context-I2W (CLIP L/14) | 2023-09-28 |
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval | ✓ Link | 38.24 | | | iSEARLE-XL (CLIP L/14) | 2024-05-05 |
Zero-Shot Composed Image Retrieval with Textual Inversion | ✓ Link | 37.76 | | | SEARLE-XL-OTI (CLIP L/14) | 2023-03-27 |
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions | ✓ Link | 36.85 | | | MagicLens (CLIP B) | 2024-03-28 |
Language-only Efficient Training of Zero-shot Composed Image Retrieval | ✓ Link | 36.39 | | | LinCIR (CLIP L/14) | 2023-12-04 |
Zero-Shot Composed Image Retrieval with Textual Inversion | ✓ Link | 35.90 | | | SEARLE-XL (CLIP L/14) | 2023-03-27 |
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval | ✓ Link | 34.93 | | | iSEARLE-OTI (CLIP B/32) | 2024-05-05 |
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval | ✓ Link | 34.60 | | | iSEARLE (CLIP B/32) | 2024-05-05 |
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval | ✓ Link | 34.20 | | | Pic2Word | 2023-02-06 |
Zero-Shot Composed Image Retrieval with Textual Inversion | ✓ Link | 32.71 | | | SEARLE (CLIP B/32) | 2023-03-27 |
Zero-Shot Composed Image Retrieval with Textual Inversion | ✓ Link | 32.39 | | | SEARLE-OTI (CLIP B/32) | 2023-03-27 |
"This is my unicorn, Fluffy": Personalizing frozen vision-language representations | ✓ Link | 28.51 | | | PALAVRA | 2022-04-04 |
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning | ✓ Link | | 31.36 | 50.78 | ImageScope (CLIP-ViT-L/14) | 2025-03-13 |
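
The table does not name its score columns, so the following is only an observation about the numbers, not a statement of the benchmark's metric definition. In the seven rows that report all three scores, the first score matches the arithmetic mean of the other two up to rounding. A minimal sketch that checks this, with values copied from those rows; the names `rows`, `mean_gap`, and `TOLERANCE` are illustrative assumptions:

```python
# Rows from the table above that report all three score columns.
# Tuple layout: (model, first score, second score, third score).
rows = [
    ("CoLLM (finetuned - BLIP-L/16)", 49.9, 39.1, 60.7),
    ("SCOT (WACV 2025)", 49.24, 38.45, 60.03),
    ("CoVR-BLIP-2", 48.3, 38.15, 58.44),
    ("MagicLens (CoCa L)", 48.1, 38.0, 58.2),
    ("CoLLM (Pretrained - BLIP-L/16)", 45.3, 34.6, 56.0),
    ("MagicLens (CLIP L)", 41.6, 30.7, 52.5),
    ("CoLLM (Pretrained - CLIP-L/14)", 39.8, 30.1, 49.5),
]

# Assumed tolerance: scores are reported to at most two decimals,
# so a gap of up to 0.01 is treated as rounding.
TOLERANCE = 0.01


def mean_gap(row):
    """Absolute gap between the first score and the mean of the other two."""
    _, first, second, third = row
    return abs(first - (second + third) / 2)


gaps = {row[0]: mean_gap(row) for row in rows}
all_consistent = all(gap <= TOLERANCE for gap in gaps.values())

for model, gap in gaps.items():
    print(f"{model}: gap = {gap:.3f}")
print("first score ~ mean of the other two:", all_consistent)
```

Rows that report only one score, or only the last two, are left out of the check because the relationship cannot be tested on them.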