We present Story2Board, a training‑free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine‑tuning, enabling state‑of‑the‑art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off‑the‑shelf language model to convert free‑form stories into grounded panel‑level prompts. For evaluation, we propose the Rich Storyboard Benchmark and a Scene Diversity metric, which together quantify layout variation and background‑grounded storytelling in addition to consistency.
Our goal is to represent a narrative as a coherent sequence of storyboard panels -- each depicting a different scene while preserving the identity and appearance of the main character(s). To achieve this, we generate all panels in a batch, with a single panel serving as a shared visual reference. During generation, we maintain consistency through Latent Panel Anchoring, which replaces the reference region of each latent with the corresponding region from the batch’s reference panel, and we apply Reciprocal Attention Value Mixing (RAVM) to blend the reference’s features into the target panels’ self-attention layers. This combination enforces identity consistency while allowing each panel to retain its own scene layout and composition.
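To make the anchoring step concrete, the following is a minimal sketch of how the reference region could be copied into every panel's latent at each denoising step. It assumes panel latents are batched as [B, C, H, W] tensors and that a boolean mask marks the shared reference region; the function name, mask definition, and layout are illustrative assumptions, not the exact implementation.

```python
import torch

def anchor_reference_region(latents: torch.Tensor,
                            ref_index: int,
                            ref_mask: torch.Tensor) -> torch.Tensor:
    """Latent Panel Anchoring (sketch).

    latents:   [B, C, H, W] batch of panel latents at the current
               denoising step (layout is an assumption).
    ref_index: index of the reference panel within the batch.
    ref_mask:  [H, W] boolean mask marking the shared reference region
               (hypothetical; the actual region definition may differ).
    """
    anchored = latents.clone()
    ref_latent = latents[ref_index]                   # reference panel latent, [C, H, W]
    # Copy the reference region from the reference panel into every panel,
    # leaving the rest of each panel's latent (its own scene) untouched.
    anchored[:, :, ref_mask] = ref_latent[:, ref_mask]
    return anchored

# Usage (sketch): call once per denoising step, before the next model pass.
# latents = anchor_reference_region(latents, ref_index=0, ref_mask=ref_mask)
```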
RAVM details. While a reference panel helps maintain overall identity consistency across scenes, it alone is not enough to capture fine-grained details. Subtle features -- like the exact shape of a hand or the expression on a face -- can still drift. Our Reciprocal Attention Value Mixing (RAVM) addresses this by identifying individual tokens in the reference and target panels that strongly attend to each other across the batch. We then blend their value representations in self-attention, ensuring that the most semantically aligned regions reinforce each other’s appearance.
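As an illustration, the sketch below shows one way such reciprocal value mixing could be applied inside a single self-attention head, assuming reference and target panel tokens are concatenated along the sequence dimension. The selection threshold and the mixing weight alpha are assumptions made for exposition, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def ravm_mix_values(q, k, v, ref_slice, tgt_slice, alpha=0.5):
    """Reciprocal Attention Value Mixing (sketch).

    q, k, v:   [T, d] projections for one self-attention head over the
               concatenated reference + target panel tokens (layout assumed).
    ref_slice: slice selecting reference-panel tokens.
    tgt_slice: slice selecting target-panel tokens.
    alpha:     mixing weight (hypothetical; a schedule may be used instead).
    """
    scale = q.shape[-1] ** -0.5
    attn = F.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)   # [T, T]

    # Attention from target tokens to reference tokens, and vice versa.
    tgt_to_ref = attn[tgt_slice, ref_slice]                     # [T_tgt, T_ref]
    ref_to_tgt = attn[ref_slice, tgt_slice]                     # [T_ref, T_tgt]

    # Reciprocal attention: a pair scores highly only if both directions do.
    reciprocal = tgt_to_ref * ref_to_tgt.transpose(-1, -2)      # [T_tgt, T_ref]

    # For each target token, find its strongest reciprocal reference token.
    best_ref = reciprocal.argmax(dim=-1)                        # [T_tgt]
    strength = reciprocal.max(dim=-1).values                    # [T_tgt]

    # Blend value vectors only for strongly reciprocal pairs
    # (mean-based threshold is an assumption; the selection rule may differ).
    mix = strength > strength.mean()
    v_mixed = v.clone()
    v_tgt = v[tgt_slice]
    v_ref_matched = v[ref_slice][best_ref]                      # [T_tgt, d]
    v_tgt = torch.where(mix[:, None],
                        (1 - alpha) * v_tgt + alpha * v_ref_matched,
                        v_tgt)
    v_mixed[tgt_slice] = v_tgt
    return v_mixed
```

Because only the value vectors are altered, the attention pattern itself -- and hence where each panel places its content -- is left untouched.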
Unlike prior approaches that rely on coarse cross-attention maps to locate characters, RAVM operates directly at the token level, enabling a more precise and fine-grained blending scheme. Moreover, RAVM modifies only the value embeddings, affecting appearance without interfering with the scene layout generated by the model.