SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | ✓ Link | 95.0 | SyncVSR (Word Boundary) | 2024-06-18 |
Training Strategies for Improved Lip-reading | ✓ Link | 94.1 | 3D Conv + ResNet-18 + DC-TCN + KD (Ensemble & Word Boundary) | 2022-09-03 |
SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | ✓ Link | 93.2 | SyncVSR | 2024-06-18 |
Audio-Visual Speech Recognition based on Regulated Transformer and Spatio-Temporal Fusion Strategy for Driver Assistive Systems | ✓ Link | 89.57 | AVCRFormer | 2024-05-09 |
Accurate and Resource-Efficient Lipreading with Efficientnetv2 and Transformers | | 89.52 | 3D Conv + EfficientNetV2 + Transformer + TCN | 2022-05-23 |
Visual Speech Recognition in a Driver Assistance System | | 88.7 | Vosk + MediaPipe + LS + MixUp + SA + 3DResNet-18 + BiLSTM + Cosine WR | 2022-08-29 |
Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading | ✓ Link | 88.5 | 3D Conv + ResNet-18 + MS-TCN + Multi-Head Visual-Audio Memory | 2022-04-04 |
Towards Practical Lipreading with Distilled and Efficient Models | ✓ Link | 88.5 | 3D Conv + ResNet-18 + MS-TCN + KD (Ensemble) | 2020-07-13 |
Learn an Effective Lip Reading Model without Pains | ✓ Link | 88.4 | 3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR (Word Boundary) | 2020-11-15 |
Learn an Effective Lip Reading Model without Pains | ✓ Link | 85.5 | 3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR | 2020-11-15 |
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video | ✓ Link | 85.4 | 3D Conv + ResNet-18 + Bi-GRU + Visual-Audio Memory | 2022-04-04 |
Lipreading using Temporal Convolutional Networks | ✓ Link | 85.30 | 3D Conv + ResNet-18 + MS-TCN | 2020-01-23 |
Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition | ✓ Link | 85.02 | 3D Conv + ResNet-18 + Bi-GRU (Face Cutout) | 2020-03-06 |
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | ✓ Link | 85.0 | MoCo + Wav2Vec by SJTU LUMIA | 2022-02-24 |
Discriminative Multi-modality Speech Recognition | ✓ Link | 84.80 | 3D Conv + P3D-ResNet50 + TCN | 2020-05-12 |
Mutual Information Maximization for Effective Lip Reading | ✓ Link | 84.41 | 3D Conv + ResNet-18 + Bi-GRU | 2020-03-13 |
SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading | ✓ Link | 84.4 | SpotFast + Transformer + Product-Key memory | 2020-05-21 |
Deformation Flow Based Two-Stream Network for Lip Reading | ✓ Link | 84.13 | DFTN | 2020-03-12 |
Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading | | 83.5 | PCPG | 2020-03-09 |
End-to-end Audiovisual Speech Recognition | ✓ Link | 83.39 | 3D Conv + ResNet-34 + Bi-GRU | 2018-02-18 |
Multi-Grained Spatio-temporal Modeling for Lip-reading | | 83.34 | Multi-grained + Bi-ConvLSTM | 2019-08-30 |
Combining Residual Networks with LSTMs for Lipreading | ✓ Link | 83.00 | 3D Conv + ResNet-34 + Bi-LSTM | 2017-03-12 |