---
title: "Chat and Agents"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chat and Agents}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE, purl=FALSE}
# Every chunk needs a GGUF model (and usually a GPU), so this vignette is
# static: the code is shown but not run at build time.
knitr::opts_chunk$set(eval = FALSE, purl = FALSE)
```

llamaR turns a local GGUF model into a chat backend for the R ecosystem. You
can talk to it three ways, from lowest to highest level:

* **HTTP server**: `llama_serve_openai()` exposes an OpenAI-compatible API any
  client can hit (OpenCode, the `openai` SDK, `curl`).
* **ellmer `Chat`**: `chat_llamar()` returns an `ellmer::Chat`, so the whole
  ellmer / ragnar toolchain works against local inference.
* **Command-line example**: `inst/examples/chat.R` wraps both for quick use.

```{r, eval=FALSE, purl=FALSE}
library(llamaR)
```

---

## 1. The chat object: `chat_llamar()`

`chat_llamar()` returns an [ellmer](https://ellmer.tidyverse.org/) `Chat`. It
has two modes, picked by which argument you pass, the same DBI-style choice as
`DBI::dbConnect()` (connection parameters *or* a ready connection).

### Mode A: spawn a server for a model

Give it a model file and it starts `llama_serve_openai()` in a background
process (via the **callr** package), waits for it to come up, and points a
`Chat` at it. The server's lifetime is tied to the returned object: when it is
garbage-collected (or R exits) the process is killed.

```{r, eval=FALSE, purl=FALSE}
chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")
chat$chat("Why is the sky blue?")

chat_llamar_stop(chat)   # stop the spawned server (or just let GC do it)
```

Large models can take a while to load from disk; raise `timeout` (default
180s) if a 14B at Q8 doesn't come up in time:

```{r, eval=FALSE, purl=FALSE}
chat <- chat_llamar(model_path = "Qwen3-14B-Q8_0.gguf", timeout = 300)
```

### Mode B: connect to a running server

If you already run a server (in another process, or a pool of them), pass its
URL. No process is spawned.

```{r, eval=FALSE, purl=FALSE}
# In another process / shell:
#   llama_serve_openai("model.gguf", port = 11434L)

chat <- chat_llamar(base_url = "http://127.0.0.1:11434/v1")
chat$chat("Hello!")
```

### System prompt

```{r, eval=FALSE, purl=FALSE}
chat <- chat_llamar(
  model_path    = "Ministral-3B-Instruct.gguf",
  system_prompt = "You are a concise assistant. Answer in one sentence."
)
chat$chat("What is R?")
```

> **Under the hood.** `chat_llamar()` wraps `ellmer::chat_vllm()`, which talks
> to the server's `/v1/chat/completions` endpoint, the de-facto standard our
> server implements. (ellmer's `chat_openai()` targets OpenAI's newer
> `/v1/responses` API, which the server does not implement.)

---

## 2. The server: `llama_serve_openai()`

`chat_llamar(model_path=)` is a convenience wrapper; you can run the server
directly for non-R clients. It needs the optional **drogonR** package for the
HTTP/SSE layer.

```{r, eval=FALSE, purl=FALSE}
llama_serve_openai("model.gguf", port = 11434L, n_ctx = 8192L)
```

It blocks, serving:

* `GET /v1/models`
* `POST /v1/chat/completions` (both blocking and `stream = true`)

Point any OpenAI client at `http://127.0.0.1:11434/v1`:

```bash
curl http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"model","messages":[{"role":"user","content":"Hello"}]}'
```

A runnable launcher lives at `inst/examples/serve_openai.R`.

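
The same endpoint can also be called from R without going through ellmer. The
following is a minimal sketch, not part of llamaR: it assumes the **httr2**
package is installed, that the server accepts the standard OpenAI
chat-completions request body, and that the reply text sits at
`choices[[1]]$message$content` as in OpenAI's response format. The helper names
`build_chat_request()` and `ask_server()` are hypothetical.

```{r, eval=FALSE, purl=FALSE}
library(httr2)

# Hypothetical helper: build (but do not send) a chat-completions request.
# Assumption: the server follows the OpenAI request shape.
build_chat_request <- function(prompt,
                               base_url = "http://127.0.0.1:11434/v1",
                               model = "model") {
  body <- list(
    model    = model,
    messages = list(list(role = "user", content = prompt))
  )
  req <- request(base_url)
  req <- req_url_path_append(req, "chat", "completions")
  req_body_json(req, body)   # a JSON body makes this a POST when performed
}

# Hypothetical helper: send the request and extract the reply text.
# Assumption: OpenAI-style response, choices[[1]]$message$content.
ask_server <- function(prompt, ...) {
  resp <- req_perform(build_chat_request(prompt, ...))
  resp_body_json(resp)$choices[[1]]$message$content
}

# Building the request needs no running server; inspect the target URL.
req <- build_chat_request("Hello")
req$url
```

With a server running, `ask_server("Why is the sky blue?")` would perform the
request and return the reply as a character string. Splitting the build step
from the send step keeps the request inspectable (for example with
`httr2::req_dry_run()`) before any model is loaded.
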

### Connecting OpenCode

Add an OpenAI-compatible provider in `opencode.json` (see the one in this
repo) with `baseURL` set to `http://127.0.0.1:11434/v1` and the model id
matching what `/v1/models` reports.

---

## 3. The command-line example

`inst/examples/chat.R` wraps both modes for the terminal:

```bash
# Spawn a server for the model and open an interactive prompt
Rscript inst/examples/chat.R model.gguf

# Positional [port] [n_ctx], plus flags
Rscript inst/examples/chat.R model.gguf 11434 8192 \
  --system "Be concise." --timeout 300

# One-shot: a trailing message prints a single reply and exits
Rscript inst/examples/chat.R model.gguf "Why is the sky blue?"

# Connect to a server you already started
Rscript inst/examples/chat.R --url http://127.0.0.1:11434/v1
```

In interactive mode, type a message and press Enter; a blank line or Ctrl-D
quits. A spawned server is stopped automatically on exit.

---

## 4. ragnar: retrieval-augmented chat

Because `chat_llamar()` returns a real `ellmer::Chat`, it plugs into
[ragnar](https://ragnar.tidyverse.org/). Pair it with `embed_llamar()` (see
`vignette("getting-started")`) for a fully local RAG stack: local embeddings
for the store, local generation for the chat.

```{r, eval=FALSE, purl=FALSE}
library(ragnar)

store <- ragnar_store_create(
  location = "store.duckdb",
  embed    = embed_llamar(model = "embedding-model.gguf")
)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)

chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")
ragnar_register_tool_retrieve(chat, store)
chat$chat("What do the documents say about X?")
```

> **Note.** Tool calling is mediated by the OpenAI protocol, so it works only
> as far as the server implements it. The current server does not emit
> `tool_calls` yet, so a model will not autonomously invoke the registered
> retrieve tool. Plain chat and manual retrieval work today; automatic
> tool-driven retrieval is on the roadmap (see `TODO.md`).

---

## 5. Concurrency

The server is **single-sequence**: it handles one request at a time on the
main R thread. That is enough for a single local user or agent. For parallel
sessions, run a pool of servers on different ports and create one
`chat_llamar(base_url=)` per worker; the worker-pool architecture is described
in `TODO.md`.

```{r, eval=FALSE, purl=FALSE}
ports <- c(11434L, 11435L, 11436L)
chats <- lapply(ports, function(p)
  chat_llamar(base_url = sprintf("http://127.0.0.1:%d/v1", p)))
```

---

## See also

* `vignette("getting-started")`: the rest of the package.
* `?chat_llamar`, `?llama_serve_openai`
* `inst/examples/chat.R`, `inst/examples/serve_openai.R`