Abstract: Visual Grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. Existing VG methods use fixed image and text representations ...