Paper | Code | mAP | mIoU | Model Name | Release Date |
---|---|---|---|---|---|
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language | ✓ Link | 48.7 | 36.8 | DenseAV | 2024-06-09 |
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input | | 32.2 | 26.3 | DAVENet | 2018-04-04 |
Contrastive Audio-Visual Masked Autoencoder | ✓ Link | 27.2 | 19.9 | CAVMAE | 2022-10-02 |
ImageBind: One Embedding Space To Bind Them All | ✓ Link | 20.2 | 19.7 | ImageBind | 2023-05-09 |