Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, a GUI world model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. In particular, to address data scarcity, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision loop, resulting in over 80K high-quality screen-action pairs. To adapt existing VLMs to code-based prediction, we first perform SFT as a cold start for format and layout following, then apply Render-Aware Reinforcement Learning, which rewards the final rendered outcome to enforce visual-semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves top-performing next-UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation.
Figure 1. Illustration of Code2World. Given a current GUI observation and an action, Code2World predicts the next screenshot via renderable code generation.
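To make the setting in Figure 1 concrete, below is a minimal Python sketch of the prediction step, assuming a hypothetical VLM and headless renderer interface; the function and parameter names are illustrative placeholders, not the released API.

```python
# Hedged sketch of Code2World-style next-state prediction (illustrative names only):
# the model receives the current screenshot and an action, generates renderable HTML,
# and a headless browser turns that HTML into the predicted next screenshot.
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # e.g. "click", "scroll", "type"
    target: str        # element description or coordinates
    text: str = ""     # optional text payload for typing actions


def predict_next_screen(vlm, renderer, screenshot_png: bytes, action: Action) -> bytes:
    """Simulate the next GUI state as an image via renderable code generation."""
    prompt = (
        "You are a GUI world model. Given the current screen and the action "
        f"{action.kind} on '{action.target}' (text: '{action.text}'), "
        "output the full HTML/CSS of the resulting screen."
    )
    html = vlm.generate(image=screenshot_png, prompt=prompt)   # hypothetical VLM call
    return renderer.render(html, width=1080, height=2400)      # e.g. headless Chromium
```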
Figure 2. Left: Illustration of Data Synthesis. The high-fidelity AndroidCode dataset is curated via constrained initial synthesis and a visual-feedback revision loop, in which synthesized HTML is iteratively refined based on rendered visual discrepancies to ensure strict alignment (SigLIP score > 0.9). Right: Two-stage Model Optimization. The pipeline progresses from an SFT cold start to Render-Aware Reinforcement Learning (RARL). Using Group Relative Policy Optimization (GRPO), the model optimizes dual rewards, a visual-semantic reward (Rsem) and an action-consistency reward (Ract), derived directly from rendered outcomes to enforce structural and logical fidelity.
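The sketch below shows one way the visual-feedback revision loop on the left of Figure 2 could be organized; `synthesize_html`, `render`, `siglip_score`, and `revise_html` are assumed placeholder callables, not the released pipeline.

```python
# Hedged sketch of the visual-feedback revision loop used to curate AndroidCode:
# HTML drafted from a GUI screenshot is rendered, compared to the reference screen,
# and revised until the SigLIP similarity exceeds 0.9 or a retry budget runs out.
def refine_html(screenshot_png: bytes, synthesize_html, render, siglip_score,
                revise_html, threshold: float = 0.9, max_rounds: int = 5):
    html = synthesize_html(screenshot_png)                 # constrained initial synthesis
    score = 0.0
    for _ in range(max_rounds):
        rendered = render(html)                            # headless-browser rendering
        score = siglip_score(rendered, screenshot_png)     # visual similarity in [0, 1]
        if score >= threshold:
            return html, score                             # accept high-fidelity sample
        # Feed the rendered visual discrepancy back to the coder model for revision.
        html = revise_html(html, screenshot_png, rendered)
    return None, score                                     # discard low-fidelity samples
```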
Figure 3. Illustration of the "Propose, Simulate, Select" pipeline for the Code2World-enhanced GUI agent, exemplified by an AndroidWorld task. (1) Propose: The GUI agent generates K candidate actions, with red and green highlighting hallucinated or irrational reasoning and logically sound reasoning, respectively. (2) Simulate: Code2World predicts the execution result of each candidate via renderable code generation. (3) Select: By evaluating the rendered future states, the system identifies the potential failure in the original policy and rectifies the decision, ultimately selecting the optimal action that aligns with the user's intent.
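A minimal sketch of the "Propose, Simulate, Select" loop follows, assuming hypothetical `agent`, `world_model`, and `selector` objects; the names and scoring interface are illustrative, not the exact agent framework used in the paper.

```python
# Hedged sketch of the "Propose, Simulate, Select" loop from Figure 3:
# the policy agent proposes K candidate actions, the world model simulates each
# by generating and rendering the next screen, and a selector picks the action
# whose simulated outcome best matches the user's intent.
def propose_simulate_select(agent, world_model, selector, screenshot_png: bytes,
                            instruction: str, k: int = 3):
    candidates = agent.propose_actions(screenshot_png, instruction, k=k)    # Propose
    simulations = [
        (action, world_model.predict_next_screen(screenshot_png, action))   # Simulate
        for action in candidates
    ]
    # Select: score each rendered future state against the instruction and
    # keep the action with the most promising predicted outcome.
    best_action, _ = max(
        simulations,
        key=lambda pair: selector.score(pair[1], instruction),
    )
    return best_action
```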
Table 1. Quantitative comparison of various image and code generation models on Android Control (ID), which assesses basic capabilities on the same device, and GUI Odyssey (OOD), which tests generalization robustness across unseen devices and cross-app scenarios. The best scores are in bold and the second best are underlined.
Here are more qualitative comparisons of next GUI state generation between Code2World and three baselines.
Figure 4. Launch the email application from the home screen to access the inbox.
Figure 5. Click on "All News" button in the Cerebra Research application to view news content.
Figure 6. Mark a reminder task as completed by tapping the "Complete" button in the Reminder app.
Figure 7. Apply product filters by tapping the "Apply Filter" button in the e-commerce app to refresh the item list.
If you find our project useful, please star our repo and cite our paper as follows:
@article{zheng2026code2world,
title={Code2World: A GUI World Model via Renderable Code Generation},
author={Zheng, Yuhao and Zhong, Li'an and Wang, Yi and Dai, Rui and Liu, Kaikui and Chu, Xiangxiang and Lv, Linyuan and Torr, Philip and Lin, Kevin Qinghong},
journal={arXiv preprint arXiv:2602.09856},
year={2026}
}