Ouroboros: Reinforcement-Guided Flow Matching

Kang Minkyu
Chae Jeongwoo
Kim Jinho

Abstract

Current text-to-image diffusion models still struggle to follow complex, compositional prompts, largely because no consistent feedback signal is available throughout the denoising trajectory. We introduce OUROBOROS, a reinforcement-guided flow matching framework that unifies critic optimization and generative diffusion in latent space. Inspired by ControlNet-style conditioning, our method incorporates a critic network that evaluates intermediate denoising steps via cross-attention and produces a reinforcement-based reward for each latent update. These reward signals are used to iteratively refine the flow trajectory, enabling robust text-image alignment in long-horizon, sparse-reward scenarios. To stabilize representation transitions across steps, we employ a cross-domain module inspired by ControlNet and CycleGAN that enforces dimension consistency and cyclic regularization between the encoder and decoder flows. We expect the framework to achieve scalable, bidirectional optimization that dynamically corrects latent paths while maintaining generation diversity.
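
As a rough illustration of the idea of scoring every intermediate latent and using that score to adjust the flow, the toy sketch below integrates a flow-matching ODE with Euler steps and adds the gradient of a reward to the velocity at each step. It is not the OUROBOROS architecture, which is not yet published here. All names (base_velocity, critic_reward, critic_grad, sample, guidance) are hypothetical, the latent is a 2-D vector, the "critic" is a hand-written quadratic rather than a learned cross-attention network, and plain gradient-based reward guidance stands in for the reinforcement-based update.

```python
import numpy as np

def base_velocity(x, t, data_mean):
    # Toy stand-in for a learned flow-matching velocity field (assumption):
    # the straight-line velocity that carries x to data_mean at t = 1.
    return (data_mean - x) / max(1.0 - t, 1e-3)

def critic_reward(x, prompt_target):
    # Hypothetical critic: reward grows as the latent approaches a
    # "prompt-aligned" point. A real critic would be a trained network.
    return -float(np.sum((x - prompt_target) ** 2))

def critic_grad(x, prompt_target):
    # Analytic gradient of critic_reward with respect to x.
    return -2.0 * (x - prompt_target)

def sample(x0, data_mean, prompt_target, steps=50, guidance=0.0):
    # Euler integration from t = 0 to t = 1. Each intermediate latent is
    # scored, and the reward gradient nudges the velocity when guidance > 0.
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    rewards = []
    for i in range(steps):
        t = i * dt
        v = base_velocity(x, t, data_mean)
        v = v + guidance * critic_grad(x, prompt_target)
        x = x + dt * v
        rewards.append(critic_reward(x, prompt_target))
    return x, rewards

# Illustrative values only (assumptions, not taken from any experiment).
x0 = np.array([-2.0, 1.0])
data_mean = np.array([1.0, 0.0])
prompt_target = np.array([1.5, 0.5])

x_plain, r_plain = sample(x0, data_mean, prompt_target, guidance=0.0)
x_guided, r_guided = sample(x0, data_mean, prompt_target, guidance=2.0)
```

In this toy setting the unguided trajectory ends at data_mean, while the guided one ends slightly closer to prompt_target. Because the base velocity dominates near t = 1, the reward only shifts the endpoint modestly, which hints at why a per-step signal along the whole trajectory, rather than a single correction at the end, is the object of interest.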

Motivation

Prompt for images with TIPO

1girl, mejiro ardan (umamusume), umamusume, ningen mame, solo, horse ears, animal ears, horse girl, tail, long hair, horse tail, blue hair, purple eyes, full body, white background, simple background, looking at viewer, braid, shirt, black footwear, white shirt, open mouth, breasts, smile, long sleeves, crown braid, waving, boots, toes, standing, blush, long shirt, t-shirt, bra strap, alternate costume, collarbone, arm up, short, barefoot, black shorts, medium breasts, medium breast reduction A girl raising her left arm while holding an apple. A snake is crawling in front of a girl's face. The background is white and there are blue ribbons flying around the girls. There are also several apples scattered on the ground near the girl masterpiece, newest, absurdres, safe

Prompt for images without TIPO

A girl raising her left arm while holding an apple. A snake is crawling in front of a girl.

Even state-of-the-art models often fail to accurately interpret and reflect the factual relationships described in a prompt. Conversely, when a prompt is followed precisely, the resulting images are often distorted or broken. This highlights a significant challenge in text-to-image generation: maintaining image quality while ensuring strict adherence to complex prompt details.

Architecture

Coming soon.

Results

Coming soon.

Acknowledgement

Maintained by Kang Minkyu. Part of this work has appeared on the author's homepage.