Story2Board: A Training‑Free Approach for Expressive Storyboard Generation

¹Hebrew University of Jerusalem, Israel
²OriginAI, Israel
³Bar-Ilan University, Israel
Story2Board teaser

Story2Board: Training-free storyboard generation that balances identity consistency with cinematic layout diversity.

Abstract

We present Story2Board, a training‑free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine‑tuning, enabling state‑of‑the‑art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off‑the‑shelf language model to convert free‑form stories into grounded panel‑level prompts. To evaluate, we propose the Rich Storyboard Benchmark and a Scene Diversity metric that quantify layout variation and background‑grounded storytelling, in addition to consistency.

Method

Our goal is to represent a narrative as a coherent sequence of storyboard panels -- each depicting a different scene while preserving the identity and appearance of the main character(s). To achieve this, we generate all panels in a batch, with one panel designated as a shared visual reference. During generation, we maintain consistency by replacing the reference-panel region of each latent with the corresponding region from the batch's reference panel, and we apply Reciprocal Attention Value Mixing (RAVM) to blend the reference's features into the target panels' self-attention layers. This combination enforces identity consistency while allowing each panel to retain its own scene layout and composition.
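To make the anchoring step concrete, below is a minimal PyTorch-style sketch of Latent Panel Anchoring. It assumes each batch element is a vertical two-panel image whose top half is the shared reference panel; the function name, tensor layout, and `ref_index` argument are illustrative assumptions, not the exact implementation.

```python
import torch

def latent_panel_anchoring(latents: torch.Tensor, ref_index: int = 0) -> torch.Tensor:
    """Sketch of Latent Panel Anchoring (LPA).

    Assumes `latents` has shape (batch, channels, height, width) and that each
    image is a two-panel layout whose top half is the shared reference panel.
    """
    ref_rows = latents.shape[2] // 2  # top half = reference-panel region
    anchored = latents.clone()
    # Copy the reference-panel region from the designated batch element into
    # every element, so all panels co-denoise against the same reference.
    anchored[:, :, :ref_rows, :] = latents[ref_index : ref_index + 1, :, :ref_rows, :]
    return anchored

# Usage sketch: apply after each denoising step (or DiT block) during co-denoising.
# latents = latent_panel_anchoring(latents, ref_index=0)
```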

Method overview: LLM Director, LPA co‑denoising, and decoding.
Fig. 3. Our training-free storyboard generation pipeline. (1) LLM Director: Decomposes the story into a shared reference panel prompt and scene-level prompts. (2) Co-denoising with LPA and RAVM: Generates a batch of two-panel images, syncing the reference panel across the batch after each DiT block and blending features for consistency. (3) Decode & crop: Produces the final storyboard panels.

RAVM details. While a reference panel helps maintain overall identity consistency across scenes, it alone is not enough to capture fine-grained details. Subtle features--like the exact shape of a hand or the expression on a face--can still drift. Our Reciprocal Attention Value Mixing (RAVM) addresses this by identifying individual tokens in the reference and target panels that strongly attend to each other across the batch. We then blend their value representations in self-attention, ensuring that the most semantically aligned regions reinforce each other’s appearance.

Unlike prior approaches that rely on coarse cross-attention maps to locate characters, RAVM operates directly at the token level, enabling a more precise and fine-grained blending scheme. Moreover, RAVM modifies only the value embeddings, which alters appearance without interfering with the scene layout generated by the model.
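The sketch below illustrates the token-level value mixing under simplifying assumptions: attention probabilities averaged over heads, an argmax match for each target token, and a hypothetical mixing weight `alpha`. It is not the exact implementation, only a concrete rendering of the idea of pairing tokens by reciprocal attention and softly blending their values.

```python
import torch

def ravm_mix_values(attn: torch.Tensor,
                    values: torch.Tensor,
                    ref_idx: torch.Tensor,
                    tgt_idx: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Sketch of Reciprocal Attention Value Mixing (RAVM).

    Assumptions (illustrative):
      - attn:    self-attention probabilities, shape (tokens, tokens),
                 averaged over heads, covering both panels of one image.
      - values:  value embeddings, shape (tokens, dim).
      - ref_idx, tgt_idx: indices of reference- and target-panel tokens
                 within the joint token sequence.
      - alpha:   mixing strength (hypothetical hyperparameter).
    """
    # Reciprocal attention: how strongly a reference token and a target token
    # attend to each other, in both directions.
    recip = attn[ref_idx][:, tgt_idx] * attn[tgt_idx][:, ref_idx].T  # (R, T)

    # For each target token, pick the reference token with the strongest
    # reciprocal attention (a simplification of "strongly reciprocal pairs").
    best_ref = recip.argmax(dim=0)  # (T,)

    # Softly blend the matched reference value into the target value; only the
    # value embeddings change, so the scene layout is left untouched.
    mixed = values.clone()
    mixed[tgt_idx] = (1 - alpha) * values[tgt_idx] + alpha * values[ref_idx[best_ref]]
    return mixed
```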

RAVM: mutual attention maps and value‑mixing between token pairs.
Fig. 4. Visualization of Reciprocal Attention Value Mixing (RAVM) in action. Left: A generated two-panel output from our method, with the top panel serving as the shared reference. The red and green circles mark semantically corresponding character features (the hand) in the reference and target panels, respectively. Right: Heatmaps showing reciprocal attention scores at denoising step 12 of 28. Top-right: For each token in the top panel, we compute its reciprocal attention with the green-circled token in the bottom panel. Bottom-right: The reverse--each token in the bottom panel is scored based on reciprocal attention with the red-circled token in the top panel. In both cases, the hand token in the opposite panel receives the strongest reciprocal attention, showing that RAVM identifies semantically aligned token pairs for value mixing. This reinforces visual consistency without altering spatial composition.

More Results