Treatment generate
ml/models/mistral::generate
Configuration
⬡ mistral: ml/models/mistral::Mistral
Inputs
⇥ prompt: Stream<string>
Outputs
↦ generated: Stream<string>
Generate text from a Mistral model, one token fragment per stream item.
For each string received on prompt, generate enqueues an inference request and
emits the decoded tokens on generated as they arrive, one string per token.
Generation for a single prompt ends when the model produces </s> or when
max_new_tokens is reached; the next prompt is then dequeued.
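The stopping rule and prompt queue described above can be modelled in a few lines of Python. This is only a behavioural sketch, not the Mélodium implementation: simulate_generate, scripted_model and the history list are hypothetical names, and the sketch assumes that the </s> marker itself is not emitted on generated, which the text does not specify.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List

EOS = "</s>"  # end-of-sequence marker named in the text above


def scripted_model(tokens: Iterable[str]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for inference: replays a fixed token script."""
    script = iter(tokens)
    return lambda history: next(script)


def simulate_generate(
    prompts: Iterable[str],
    next_token: Callable[[List[str]], str],
    max_new_tokens: int,
) -> Iterator[str]:
    """Yield one string per token, one prompt at a time."""
    queue = deque(prompts)      # prompts wait here until the previous one ends
    history: List[str] = []     # stands in for the KV cache, kept across prompts
    while queue:
        history.append(queue.popleft())
        for _ in range(max_new_tokens):   # hard cap on tokens per prompt
            token = next_token(history)
            history.append(token)
            if token == EOS:              # assumption: EOS is not emitted
                break
            yield token


if __name__ == "__main__":
    model = scripted_model(["a", EOS, "b", "c", "d"])
    # First prompt stops at </s>, second stops at the 2-token cap.
    print(list(simulate_generate(["p1", "p2"], model, 2)))  # ['a', 'b', 'c']
```

The history list is never cleared between prompts, mirroring the statement that each prompt extends the existing state rather than resetting it.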
Conversation history is preserved across turns within the same generate instance:
each prompt extends the KV cache rather than resetting it. Multiple concurrent
generate instances share the single worker thread; KV state is saved and restored
on context switches, with no save/restore cost when only one conversation is active.
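The save-and-restore behaviour can be pictured with a toy single worker in Python. This is a sketch under assumptions, not a description of the actual internals: the Worker class, its fields and the switches counter are hypothetical, and a plain list stands in for the KV cache.

```python
from typing import Dict, List, Optional


class Worker:
    """Toy single worker that swaps per-conversation state only when needed."""

    def __init__(self) -> None:
        self.active: Optional[str] = None       # conversation currently loaded
        self.loaded: List[str] = []             # stands in for the live KV cache
        self.saved: Dict[str, List[str]] = {}   # parked state of other conversations
        self.switches = 0                       # how many times state was saved out

    def run(self, conversation: str, prompt: str) -> List[str]:
        if conversation != self.active:
            if self.active is not None:
                # Context switch: park the outgoing conversation's state.
                self.saved[self.active] = self.loaded
                self.switches += 1
            # Restore the incoming state, or start empty for a new conversation.
            self.loaded = self.saved.pop(conversation, [])
            self.active = conversation
        self.loaded.append(prompt)  # each prompt extends the state
        return list(self.loaded)
```

With a single conversation, the branch that saves and restores never runs, so switches stays at 0; it only grows when calls alternate between conversations.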
ℹ️ load must have completed successfully before any prompt is sent; otherwise,
prompts are silently discarded.
graph LR
T("generate()")
P["🟩 🟩 …"] -->|prompt| T
T -->|generated| G["🟩 🟩 🟩 🟩 …"]
style P fill:#ffff,stroke:#ffff
style G fill:#ffff,stroke:#ffff
use ml/repos/hf::HfHub
use ml/repos/hf::fetch
use ml/models/mistral::Mistral
use ml/models/mistral::load
use ml/models/mistral::generate
use std/engine/util::startup

treatment example()
  model hub: HfHub(repo_id = "mistralai/Mistral-7B-v0.1")
  model mistral: Mistral(temperature = 0.7, max_new_tokens = 256)
  input prompt: Stream<string>
  output generated: Stream<string>
{
    startup()
    fetch[hub=hub]()
    load[mistral=mistral]()
    generate[mistral=mistral]()

    startup.trigger -> fetch.trigger
    fetch.safetensors -> load.safetensors
    fetch.tokenizer -> load.tokenizer
    Self.prompt -> generate.prompt
    generate.generated -> Self.generated
}