Treatment generate

ml/models/mistral::generate


Configuration

⬡ mistral: ml/models/mistral::Mistral

Inputs

⇥ prompt: Stream<string>

Outputs

↦ generated: Stream<string>


Generate text from a Mistral model, one token fragment per stream item.

For each string received on prompt, enqueues an inference request and emits the decoded token strings on generated as they arrive, one string per token. Generation for a single prompt ends when the model produces </s> or when max_new_tokens is reached. The next prompt is then dequeued.
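The per-prompt loop described above can be modelled outside Mélodium. The following is a minimal Python sketch, not the treatment's actual implementation: `next_token` is a hypothetical stand-in for one model decoding step, and the sketch assumes that the `</s>` marker itself is not emitted on generated, which the description above does not state.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List

EOS = "</s>"  # end-of-sequence marker named in the description


def generate(prompts: Iterable[str],
             next_token: Callable[[str, List[str]], str],
             max_new_tokens: int) -> Iterator[str]:
    """Yield decoded token strings one at a time, prompt by prompt."""
    queue = deque(prompts)              # pending inference requests, FIFO
    while queue:
        prompt = queue.popleft()        # dequeue the next prompt
        produced: List[str] = []
        while len(produced) < max_new_tokens:
            # Hypothetical model step: returns the next decoded token string.
            token = next_token(prompt, produced)
            if token == EOS:            # model signalled end of sequence
                break                   # assumption: the marker is not emitted
            produced.append(token)
            yield token                 # one stream item per token
```

A prompt therefore yields at most `max_new_tokens` items, and the next prompt is only started once the current one has stopped.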

Conversation history is preserved across turns within the same generate instance: each prompt extends the KV cache rather than resetting it. Multiple concurrent generate instances share the single worker thread; KV state is saved and restored on context switches, with no save/restore cost when only one conversation is active.
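The save and restore behaviour can be illustrated with a toy model. This Python sketch is an assumption-laden illustration only: the `Worker` class and its fields are hypothetical, the KV cache is represented by a plain list of whitespace-separated words rather than tensors, and a "switch" is counted each time a resident conversation has to be saved so that another can run.

```python
from typing import Dict, List, Optional


class Worker:
    """Toy model of one worker shared by several conversations."""

    def __init__(self) -> None:
        self.active: Optional[str] = None     # conversation currently resident
        self.kv: List[str] = []               # stand-in for the live KV cache
        self.saved: Dict[str, List[str]] = {}  # parked state of other conversations
        self.switches = 0                     # number of save/restore cycles

    def _activate(self, conv: str) -> None:
        if conv == self.active:
            return                            # same conversation: no cost
        if self.active is not None:
            self.saved[self.active] = self.kv  # save the outgoing state
            self.switches += 1
        self.kv = self.saved.pop(conv, [])    # restore, or start empty
        self.active = conv

    def prompt(self, conv: str, text: str) -> int:
        """Extend conv's history with text and return its context length."""
        self._activate(conv)
        self.kv.extend(text.split())          # history grows, it is never reset
        return len(self.kv)
```

With a single conversation, `switches` stays at zero however many prompts are sent; it only increases when prompts from different conversations interleave.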

ℹ️ load must have completed successfully before any prompt is sent; prompts received before that are silently discarded.

graph LR
     T("generate()")
     P["🟩 🟩 …"] -->|prompt| T
     T -->|generated| G["🟩 🟩 🟩 🟩 …"]

     style P fill:#ffffff,stroke:#ffffff
     style G fill:#ffffff,stroke:#ffffff

use ml/repos/hf::HfHub
use ml/repos/hf::fetch
use ml/models/mistral::Mistral
use ml/models/mistral::load
use ml/models/mistral::generate
use std/engine/util::startup

treatment example()
  model hub:     HfHub(repo_id = "mistralai/Mistral-7B-v0.1")
  model mistral: Mistral(temperature = 0.7, max_new_tokens = 256)
  input  prompt:    Stream<string>
  output generated: Stream<string>
{
    startup()
    fetch[hub=hub]()
    load[mistral=mistral]()
    generate[mistral=mistral]()

    startup.trigger    -> fetch.trigger
    fetch.safetensors  -> load.safetensors
    fetch.tokenizer    -> load.tokenizer
    Self.prompt        -> generate.prompt
    generate.generated -> Self.generated
}