PhyX is the first large-scale benchmark specifically designed to assess models' physical reasoning ability through realistic, visually grounded scenarios. PhyX includes 3,000 meticulously collected ...