Command-line interface¶
After `pip install -e .`, the `token-heatmap` command is on your PATH.
Basic trace¶
```bash
token-heatmap trace \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --prompt "Explain diffusion models." \
  --max-new-tokens 80 \
  --temperature 0.8 \
  --top-p 0.95 \
  --min-k 8 --max-k 64 --mass-threshold 0.95 \
  --out outputs/
```
Output:

```text
outputs/
├── generated.txt
├── adaptive_token_trace.csv
├── adaptive_heatmap.png
├── entropy.png
├── selected_probability.png
└── raw_vs_processed.png
```
Run `token-heatmap trace --help` for the full flag list.
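As a mental model for the `--min-k` / `--max-k` / `--mass-threshold` trio above: per step, pick the smallest k within `[min-k, max-k]` whose cumulative top-k probability mass reaches the threshold. The sketch below illustrates that rule; it is a plausible reading of the flags, not the package's exact implementation.

```python
import numpy as np

def adaptive_k(probs: np.ndarray, min_k: int = 8, max_k: int = 64,
               mass_threshold: float = 0.95) -> int:
    """Smallest k in [min_k, max_k] whose top-k probability mass reaches
    mass_threshold; falls back to max_k when the mass never gets there."""
    sorted_probs = np.sort(probs)[::-1]           # probabilities, descending
    cumulative = np.cumsum(sorted_probs)
    # First index where the running mass crosses the threshold, plus one
    # to turn the index into a count.
    crossing = int(np.searchsorted(cumulative, mass_threshold)) + 1
    return max(min_k, min(crossing, max_k))

print(adaptive_k(np.array([0.90, 0.05, 0.03, 0.02])))  # peaked: clamped up to min_k -> 8
print(adaptive_k(np.full(1000, 1e-3)))                 # flat: clamped down to max_k -> 64
```

A peaked distribution would need only a couple of tokens, so `min-k` acts as a floor; a flat one would need hundreds, so `max-k` acts as a ceiling.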
Inspecting attention and the logit lens¶
`AttentionProbe` and `LogitLens` are wired through `token-heatmap trace` as
opt-in flags. They are off by default because they force eager attention,
which is significantly slower than the SDPA / FlashAttention kernels.
```bash
token-heatmap trace \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --prompt "Explain attention in one sentence." \
  --max-new-tokens 12 \
  --capture-attention --attention-layers all --attention-top-k 8 \
  --capture-full-attention \
  --capture-logit-lens --lens-layers 0,3,7,11 --lens-top-k 5 \
  --out outputs/inspect/
```
| Flag | Meaning |
|---|---|
| `--capture-attention` | Attach an `AttentionProbe`. Forces eager attention (slow). |
| `--attention-layers` | `all` (default) or comma-separated layer indices (e.g. `0,3,7,11`); see the parsing sketch after this table. |
| `--attention-top-k` | Top-k attended positions kept inline per head (default 8). |
| `--capture-full-attention` | Also write a Tier-2 `attention.<step>.npz` per step under `<out>/attention/`. Implies `--capture-attention`. |
| `--capture-logit-lens` | Attach a `LogitLens`. |
| `--lens-layers` | Same syntax as `--attention-layers`. |
| `--lens-top-k` | Top-k tokens retained per layer (default 8). |
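`--attention-layers` and `--lens-layers` (and `--activation-layers` further down) share one spec syntax. A minimal sketch of how such a value can be interpreted; `parse_layer_spec` is a hypothetical helper, not the CLI's internal name:

```python
def parse_layer_spec(spec: str, num_layers: int) -> list[int]:
    """Turn 'all' or a comma-separated string like '0,3,7,11'
    into a list of layer indices."""
    if spec == "all":
        return list(range(num_layers))
    return [int(index) for index in spec.split(",")]

print(parse_layer_spec("0,3,7,11", num_layers=24))  # [0, 3, 7, 11]
print(parse_layer_spec("all", num_layers=4))        # [0, 1, 2, 3]
```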
When attention or lens capture is on, the CLI also writes:
- `adaptive_token_trace.json` — schema-shaped trace including inline attention aggregates, per-layer logit-lens projections, and `attention_sidecar_ref` pointers. See `schema.md` for the layout.
- `attention_layer_head_grid.png` — per-step layer × head entropy grid (first step).
- `logit_lens.png` — per-layer top-k table (first step).
- `selected_rank_heatmap.png` — selected-token rank by layer × step.
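The Tier-2 sidecars that `--capture-full-attention` writes under `<out>/attention/` are ordinary NumPy `.npz` archives, so they can be inspected with `numpy` alone. A sketch that enumerates whatever arrays each archive holds, rather than assuming array names or step numbering:

```python
from pathlib import Path
import numpy as np

# List each per-step sidecar and the shapes of the arrays inside it.
for path in sorted(Path("outputs/inspect/attention").glob("attention.*.npz")):
    with np.load(path) as sidecar:
        print(path.name, {name: sidecar[name].shape for name in sidecar.files})
```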
A runnable end-to-end script lives at
`examples/qwen_attention_inspect.py`;
it runs on `Qwen/Qwen2.5-0.5B-Instruct` in under 90 s on CPU.
Capturing activations¶
`ActivationProbe` is the third opt-in capture path. When the flags below
are set, the CLI embeds `activation_metadata` and per-step `activations`
blocks directly into `adaptive_token_trace.json` (no sibling file).
```bash
token-heatmap trace \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --prompt "Explain attention in one sentence." \
  --max-new-tokens 12 \
  --capture-activations \
  --activation-layers all \
  --activation-submodules residual_post,mlp_out,o_proj \
  --activation-top-k 8 \
  --out outputs/activations/
```
| Flag | Meaning |
|---|---|
| `--capture-activations` | Attach an `ActivationProbe`. Off by default. |
| `--activation-layers` | `all` (default) or comma-separated decoder layer indices. |
| `--activation-submodules` | Comma-separated submodule keys (default `residual_post,mlp_out,o_proj`). Supported: `resid_pre`/`residual_pre`, `resid_post`/`residual_post`, `mlp_out`/`mlp.down_proj`, `o_proj`. |
| `--activation-top-k` | Top-k highest-magnitude neurons retained per (layer, submodule) pair (default 8). |
| `--capture-full-activations` | Reserved for the Tier-2 sidecar path; implies `--capture-activations`. |
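Since the activation data is embedded in the trace JSON rather than written to a sibling file, plain `json` is enough to poke at a capture. In the sketch below, only `activation_metadata` and the `activations` blocks are named by this page; the `steps` key and the inner layout are guesses, so treat `schema.md` as the authority:

```python
import json
from pathlib import Path

trace = json.loads(Path("outputs/activations/adaptive_token_trace.json").read_text())

print(trace.get("activation_metadata"))
for step in trace.get("steps", []):         # "steps" is an assumed key name
    if "activations" in step:
        print(sorted(step["activations"]))  # which entries the probe kept
        break
```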
Comparing two activation traces¶
```bash
token-heatmap diff outA/adaptive_token_trace.json outB/adaptive_token_trace.json \
  --out diff/ \
  --metric l2
```
The subcommand projects each input's activation subset, calls
`compare_activations` with `align="auto"`, and writes:
- `activation_diff.json` — schema-shaped diff payload (matches `web/activation-diff.schema.json`).
- `activation_delta.png` — stacked layer × step heatmap, one subplot per captured submodule, coloured by the chosen metric.
The CLI refuses to diff (non-zero exit) when the two parent traces have
different `metadata.prompt` values or when zero steps align between them;
these are the two ways "mismatched generations" surface after the
comparator runs.
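For intuition about what `--metric l2` colours each `activation_delta.png` cell with: a plausible per-cell computation is the L2 norm of the difference between the two aligned activation vectors. A sketch, not the comparator's code:

```python
import numpy as np

def l2_delta(a: np.ndarray, b: np.ndarray) -> float:
    """One heatmap cell: the L2 norm of the difference between the
    activation vectors traces A and B captured at the same
    (layer, submodule, step) position."""
    return float(np.linalg.norm(a - b))

# Two top-k neuron slices captured at the same position in both runs.
print(l2_delta(np.array([0.1, -0.4, 2.0]), np.array([0.0, -0.5, 1.5])))  # ~0.52
```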