Phi-4-reasoning-vision
Open-weight 15B multimodal model for thinking and GUI agents
Summary: Phi-4-reasoning-vision-15B is an open-weight multimodal model with 15B parameters and a mid-fusion architecture, trained on 200B multimodal tokens. It balances fast perception with deeper chain-of-thought reasoning, so it can handle complex math, science, and computer-use tasks efficiently.
What it does
The model processes high-resolution inputs and switches between direct perception for simple tasks and deeper reasoning for complex problems, which makes it suitable for building computer-use agents.
Who it's for
It is aimed at developers building multimodal reasoning systems, especially for math, science, and GUI agent applications.
Why it matters
It makes multimodal reasoning more efficient by answering simple perception tasks directly and reserving deep chain-of-thought reasoning for problems that need it, which helps with math, science, and GUI-interaction workloads.