CREPE: Can Vision-Language Foundation Models Reason Compositionally? | ✓ | 39.44 | 33.81 | 47.86 | 60.78 | ViT-L-14 (LAION400M) | 2022-12-13 |
CREPE: Can Vision-Language Foundation Models Reason Compositionally? | ✓ | 37.32 | 32.26 | 46.53 | 60.19 | ViT-B-16+240 (LAION400M) | 2022-12-13 |
CREPE: Can Vision-Language Foundation Models Reason Compositionally? | ✓ | 37.01 | 30.81 | 44.93 | 59.00 | ViT-B-16 (LAION400M) | 2022-12-13 |
CREPE: Can Vision-Language Foundation Models Reason Compositionally? | ✓ | 34.28 | 28.00 | 42.75 | 54.80 | ViT-B-32 (LAION400M) | 2022-12-13 |
CREPE: Can Vision-Language Foundation Models Reason Compositionally? | ✓ | 23.38 | 20.08 | 39.85 | 39.83 | RN50 (YFCC15M) | 2022-12-13 |
CREPE: Can Vision-Language Foundation Models Reason Compositionally? | ✓ | 23.26 | 19.96 | 34.88 | 45.27 | RN50 (CC12M) | 2022-12-13 |
CREPE: Can Vision-Language Foundation Models Reason Compositionally? | ✓ | 22.74 | 20.50 | 39.50 | 39.56 | RN101 (YFCC15M) | 2022-12-13 |
CREPE: Can Vision-Language Foundation Models Reason Compositionally? | ✓ | 9.09 | 9.09 | 20.00 | 14.29 | Random | 2022-12-13 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 44.5 | 92.1 | Swin-T (MosaiCLIP, CC-12M) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 44.4 | 92.6 | RN-50 (MosaiCLIP, CC-12M) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 41.5 | 48.8 | MosaiCLIP (YFCC-FT) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 41.4 | 82.0 | RN-50 (NegCLIP, CC-12M) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 40.9 | 72.4 | MosaiCLIP (CC-FT) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 39.6 | 80.3 | Swin-T (NegCLIP, CC-12M) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 39.5 | 39.8 | CLIP (YFCC-FT) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 39.0 | 38.8 | NegCLIP (YFCC-FT) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 38.3 | 36.4 | CLIP-FT (YFCC-FT) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 37.5 | 53.1 | NegCLIP (CC-FT) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 37.3 | 44.1 | Swin-T (CLIP, CC-12M) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 36.7 | 42.9 | RN-50 (CLIP, CC-12M) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 35.6 | 45.8 | CLIP-FT (CC-FT) | 2023-05-23 |
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | | | | 35.0 | 45.1 | CLIP (CC-FT) | 2023-05-23 |