
This paper introduces NextFlow, an autoregressive model for image generation and editing. It uses a decoder-only Transformer architecture and a multi-scale training approach to improve visual fidelity and reconstruction accuracy. To evaluate these capabilities, the authors present EditCanvas, a benchmark of more than 5,000 human-verified samples across 57 distinct tasks. The benchmark covers capabilities ranging from traditional image modifications, such as lighting changes and object removal, to subject-driven generation. The paper also details infrastructure optimizations, such as workload balancing and reinforcement learning techniques, which the authors report improve training efficiency. According to the authors, NextFlow outperforms existing diffusion and autoregressive frameworks at creating and refining complex visual content.