Paper | Code | mAP | mIoU | Model Name | Release Date |
---|---|---|---|---|---|
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language | ✓ Link | 32.7 | 24.7 | DenseAV | 2024-06-09 |
Contrastive Audio-Visual Masked Autoencoder | ✓ Link | 26.0 | 17.0 | CAVMAE | 2022-10-02 |
ImageBind: One Embedding Space To Bind Them All | ✓ Link | 19.7 | 20.5 | ImageBIND | 2023-05-09 |
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input | | 16.8 | 18.1 | DAVENet | 2018-04-04 |