Treatment generate
ml/models/mistral::generate
Configuration
⬡ mistral: ml/models/mistral::Mistral
Inputs
⇥ prompt: Stream<string>
Outputs
↦ generated: Stream<string>
Generate text from a Mistral model, one token fragment per stream item.
For each string received on prompt, enqueues an inference request and emits the
decoded token strings on generated as they arrive, one string per token. Generation
for a single prompt ends when the model produces </s> or when max_new_tokens is
reached. The next prompt is then dequeued.
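The per-prompt loop described above can be pictured with a small Python model. This is only a sketch of the documented behavior, not the treatment's implementation: `generate_stream`, `scripted_model`, and the `next_token` callback are hypothetical names, and the sketch assumes the `</s>` marker itself is not emitted on `generated`, which the text above does not specify.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Iterator, List

EOS = "</s>"  # end-of-sequence marker named in the description above


def generate_stream(
    prompts: Iterable[str],
    next_token: Callable[[str, int], str],  # hypothetical stand-in for the model
    max_new_tokens: int,
) -> Iterator[str]:
    """Yield one decoded token string at a time, prompt after prompt."""
    queue = deque(prompts)  # later prompts wait until the current one finishes
    while queue:
        prompt = queue.popleft()
        produced = 0
        while produced < max_new_tokens:
            token = next_token(prompt, produced)
            if token == EOS:
                # Assumption: the end marker stops generation and is not emitted.
                break
            yield token
            produced += 1


def scripted_model(replies: Dict[str, List[str]]) -> Callable[[str, int], str]:
    """Build a toy next_token function that replays fixed replies."""
    def next_token(prompt: str, step: int) -> str:
        reply = replies[prompt]
        return reply[step] if step < len(reply) else EOS
    return next_token
```

With a scripted reply that contains `</s>` partway through, generation for that prompt stops there; a reply longer than `max_new_tokens` is cut off at the limit, and the next prompt is only taken from the queue afterwards.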
Conversation history is preserved across turns within the same generate instance:
each prompt extends the KV cache rather than resetting it. Multiple concurrent
generate instances share the single worker thread; KV state is saved and restored
on context switches, with no save/restore cost when only one conversation is active.
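The save/restore behavior can be illustrated with a toy scheduler in Python. This is a conceptual sketch under stated assumptions, not the library's code: the `Worker` class, its fields, and the use of a plain list as a stand-in for KV state are all hypothetical.

```python
from typing import Dict, List, Optional


class Worker:
    """Toy model of one worker serving several conversations in turn."""

    def __init__(self) -> None:
        self.active: Optional[str] = None      # conversation whose state is loaded
        self.live_state: List[str] = []        # stand-in for the live KV cache
        self.saved: Dict[str, List[str]] = {}  # parked state of inactive conversations
        self.switches = 0                      # number of save/restore round trips

    def run(self, conversation: str, prompt: str) -> List[str]:
        if self.active != conversation:
            if self.active is not None:
                # Context switch: park the outgoing conversation's state.
                self.saved[self.active] = self.live_state
                self.switches += 1
            # Restore the incoming conversation's state, or start it empty.
            self.live_state = self.saved.pop(conversation, [])
            self.active = conversation
        # Each prompt extends the existing state rather than resetting it.
        self.live_state.append(prompt)
        return list(self.live_state)
```

While a single conversation keeps calling `run`, `switches` stays at zero, which mirrors the claim that a lone conversation pays no save/restore cost; interleaving two conversations increments it on every change of conversation.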
ℹ️ load must have completed successfully before any prompt is sent; otherwise, prompts
are silently discarded.
graph LR
T("generate()")
P["🟩 🟩 …"] -->|prompt| T
T -->|generated| G["🟩 🟩 🟩 🟩 …"]
style P fill:#ffffff,stroke:#ffffff
style G fill:#ffffff,stroke:#ffffff
use ml/repos/hf::HfHub
use ml/repos/hf::fetch
use ml/models/mistral::Mistral
use ml/models/mistral::load
use ml/models/mistral::generate
use std/engine/util::startup
treatment example()
  model hub: HfHub(repo_id = "mistralai/Mistral-7B-v0.1")
  model mistral: Mistral(temperature = 0.7, max_new_tokens = 256)
  input prompt: Stream<string>
  output generated: Stream<string>
{
    startup()
    fetch[hub=hub]()
    load[mistral=mistral]()
    generate[mistral=mistral]()

    startup.trigger -> fetch.trigger
    fetch.safetensors -> load.safetensors
    fetch.tokenizer -> load.tokenizer
    Self.prompt -> generate.prompt
    generate.generated -> Self.generated
}