Image Captioning Demo

Multimodal model: Vision → Language