Treatment generate

ml/models/mistral::generate


Configuration

⬡ mistral: ml/models/mistral::Mistral

Inputs

⇥ prompt: Stream<string>

Outputs

↦ generated: Stream<string>


Generate text from a Mistral model, one token fragment per stream item.

For each string received on prompt, enqueues an inference request and emits the decoded token strings on generated as they arrive, one string per token. Generation for a single prompt ends when the model produces </s> or when max_new_tokens is reached; the next prompt is then dequeued.
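The queueing and stopping rules above can be modelled in a few lines of Python. This is only a conceptual sketch, not the treatment's implementation: `next_token` and `toy_model` are hypothetical stand-ins for real inference, and the sketch assumes the `</s>` marker itself is not emitted, which the description does not state.

```python
from collections import deque
from typing import Callable, Iterable, Iterator

EOS = "</s>"  # end-of-sequence marker named in the description


def generate(prompts: Iterable[str],
             next_token: Callable[[str, int], str],
             max_new_tokens: int) -> Iterator[str]:
    """Yield one decoded token string per item, one prompt at a time."""
    queue = deque(prompts)  # pending inference requests, first in first out
    while queue:
        # The next prompt is dequeued only once the previous one has finished.
        prompt = queue.popleft()
        for step in range(max_new_tokens):  # hard cap on tokens per prompt
            token = next_token(prompt, step)
            if token == EOS:  # assumption: the marker ends generation silently
                break
            yield token


def toy_model(prompt: str, step: int) -> str:
    """Hypothetical model: echoes the prompt word by word, then stops."""
    words = prompt.split()
    return words[step] if step < len(words) else EOS
```

With `max_new_tokens = 2`, the prompt `"a b c"` is cut off after two tokens by the cap, while a larger cap lets the toy model end the turn itself by returning `</s>`.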

Conversation history is preserved across turns within the same generate instance: each prompt extends the KV cache rather than resetting it. Multiple concurrent generate instances share the single worker thread; KV state is saved and restored on context switches, with no save/restore cost when only one conversation is active.
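The save/restore behaviour can be pictured with a toy model. The Python sketch below is an illustration under stated assumptions, not the actual worker: the `Worker` class, its fields, and the use of a plain list in place of a KV cache are all hypothetical.

```python
from typing import Dict, List, Optional


class Worker:
    """Toy stand-in for the single worker shared by all generate instances."""

    def __init__(self) -> None:
        self.active: Optional[str] = None      # conversation whose state is live
        self.live: List[str] = []              # stand-in for the live KV cache
        self.saved: Dict[str, List[str]] = {}  # parked state of other conversations
        self.switches = 0                      # number of save/restore cycles

    def run(self, conversation: str, prompt: str) -> List[str]:
        if self.active != conversation:
            if self.active is not None:
                # Context switch: park the outgoing conversation's state.
                self.saved[self.active] = self.live
                self.switches += 1
            # Restore the incoming conversation, or start with empty history.
            self.live = self.saved.pop(conversation, [])
            self.active = conversation
        # Each prompt extends the existing history instead of resetting it.
        self.live.append(prompt)
        return list(self.live)
```

As long as only one conversation calls `run`, the switch branch never executes, which mirrors the claim that a single active conversation pays no save/restore cost.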

ℹ️ load must have completed successfully before any prompt is sent; otherwise, prompts are silently discarded.
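The consequence of this ordering rule is easy to miss because nothing fails visibly. The Python sketch below models it with a hypothetical event list; `accepted_prompts` and the `("load", "ok")` event shape are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple


def accepted_prompts(events: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the prompts that would reach the model, in arrival order."""
    loaded = False
    accepted: List[str] = []
    for kind, value in events:
        if kind == "load" and value == "ok":
            loaded = True           # the model is ready from this point on
        elif kind == "prompt" and loaded:
            accepted.append(value)  # prompt is queued for inference
        # Any prompt seen before a successful load is dropped without an error.
    return accepted
```

A prompt sent before the load event simply disappears from the result, so a graph that wires prompts in too early produces no output rather than an error.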

graph LR
     T("generate()")
     P["🟩 🟩 …"] -->|prompt|    T
     T            -->|generated| G["🟩 🟩 🟩 🟩 …"]

     style P fill:#ffff,stroke:#ffff
     style G fill:#ffff,stroke:#ffff

use ml/repos/hf::HfHub
use ml/repos/hf::fetch
use ml/models/mistral::Mistral
use ml/models/mistral::load
use ml/models/mistral::generate
use std/engine/util::startup

treatment example()
  model hub:     HfHub(repo_id = "mistralai/Mistral-7B-v0.1")
  model mistral: Mistral(temperature = 0.7, max_new_tokens = 256)
  input  prompt:    Stream<string>
  output generated: Stream<string>
{
    startup()
    fetch[hub=hub]()
    load[mistral=mistral]()
    generate[mistral=mistral]()

    startup.trigger    -> fetch.trigger
    fetch.safetensors  -> load.safetensors
    fetch.tokenizer    -> load.tokenizer
    Self.prompt        -> generate.prompt
    generate.generated -> Self.generated
}