From a8320613a129e040be5b4c4c7c3ebbccf59f4dc8 Mon Sep 17 00:00:00 2001
From: Mikko Juola
Date: Mon, 20 Mar 2023 18:28:27 -0700
Subject: [PATCH] Fix some things in README.md after proofreading it and
 removing lies.

---
 README.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 99ff638..f34db17 100644
--- a/README.md
+++ b/README.md
@@ -105,8 +105,9 @@ The command line flags for this are:
 * `--inference-server-api-path` sets which path servers the API requests. The
   default path is `/rllama/v1/inference`
 * `--inference-server-prompt-cache-size` sets how many previous prompt
-  calculations should be cached. Default is 1000. This speeds up token
-  generation for prompts that were already requested before.
+  calculations should be cached. Default is 50. This speeds up token
+  generation for prompts that were already requested before, however it also
+  increases memory use as the cache gets more full.
 
 Prompts and flags related to token sampling are all ignored in inference
 server mode. Instead, they are obtained from each HTTP JSON API request.
@@ -123,7 +124,7 @@ Expects a JSON body and `Accept: application/json` or `Accept: text/jsonl`.
 
 The expected JSON is as follows:
 
-```json
+```
 {
     "temperature": <number, optional>
     "top_k": <integer, optional>
@@ -146,7 +147,7 @@ the probabilities for every token are returned instead.
 
 When no\_token\_sampling = false:
 
-```json
+```
 {<token>: {"p": <probability>, "is_end_token": bool, might not be present}}
 ```
 
@@ -160,14 +161,12 @@ When no\_token\_sampling = false:
 
 When no\_token\_sampling = true:
 
-```json
+```
 {<token>: {"p": <probability>, "is_end_token": bool, might not be present} \
 ,<token>: {"p": <probability>, "is_end_token": bool, might not be present} \
 ,...}
 ```
 
-Tokens where `p = 0` will not be present in the JSON output.
-
 If you want to implement your own token sampling, you may want to set
 `max_new_tokens=1` and `stop_at_end_token=false` to suppress rllama's own
 sampling behavior entirely.
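
For illustration, here is a minimal client sketch for the JSON API described in the hunks above. It asks the server to skip its own sampling (`no_token_sampling = true`, `max_new_tokens = 1`, `stop_at_end_token = false`) and reads back the per-token probabilities, which is the custom-sampling workflow the patched README hints at. The host and port (`localhost:8080`), the `prompt` field name, and the `Content-Type` header are assumptions not shown in the patch; the endpoint path, the `Accept` header, and the remaining field names come from the README text being modified.

```python
# Hedged sketch of a client for the rllama inference server JSON API.
# Assumptions (not confirmed by the patch above): the server listens on
# localhost:8080 and the request body carries the input text in a "prompt"
# field. The /rllama/v1/inference path and the sampling-related fields are
# taken from the README text in the hunks.
import json
import urllib.request

API_URL = "http://localhost:8080/rllama/v1/inference"  # host/port are assumed


def query_token_probabilities(prompt: str) -> dict:
    """POST one request and return the per-token probability map.

    With no_token_sampling=true, max_new_tokens=1 and stop_at_end_token=false
    the server is asked to suppress its own sampling and report probabilities
    for the candidate tokens instead, as described in the README.
    """
    body = {
        "temperature": 1.0,     # per-request sampling settings (values arbitrary)
        "top_k": 40,
        "prompt": prompt,       # assumed field name, not shown in the hunk
        "max_new_tokens": 1,
        "stop_at_end_token": False,
        "no_token_sampling": True,
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",  # standard HTTP, assumed accepted
            "Accept": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    probs = query_token_probabilities("Hello")
    # Greedy choice: pick the token with the highest reported probability.
    token, info = max(probs.items(), key=lambda kv: kv[1]["p"])
    print(token, info["p"], info.get("is_end_token", False))
```

Because the probabilities come back as an ordinary JSON object keyed by token, any sampling strategy (greedy, temperature, nucleus, or something more exotic) can be implemented entirely on the client side by repeating this one-token request loop.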