From a8320613a129e040be5b4c4c7c3ebbccf59f4dc8 Mon Sep 17 00:00:00 2001
From: Mikko Juola
Date: Mon, 20 Mar 2023 18:28:27 -0700
Subject: [PATCH] Fix some things in README.md after proofreading it and
 removing lies.

---
 README.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 99ff638..f34db17 100644
--- a/README.md
+++ b/README.md
@@ -105,8 +105,9 @@ The command line flags for this are:
 * `--inference-server-api-path` sets which path servers the API requests. The
   default path is `/rllama/v1/inference`
 * `--inference-server-prompt-cache-size` sets how many previous prompt
-  calculations should be cached. Default is 1000. This speeds up token
-  generation for prompts that were already requested before.
+  calculations should be cached. Default is 50. This speeds up token
+  generation for prompts that were already requested before, however it also
+  increases memory use as the cache gets more full.
 
 Prompts and flags related to token sampling are all ignored in inference
 server mode. Instead, they are obtained from each HTTP JSON API request.
@@ -123,7 +124,7 @@ Expects a JSON body and `Accept: application/json` or `Accept: text/jsonl`.
 
 The expected JSON is as follows:
 
-```json
+```
 {
     "temperature": <number, optional>
     "top_k": <integer, optional>
@@ -146,7 +147,7 @@ the probabilities for every token are returned instead.
 
 When no\_token\_sampling = false:
 
-```json
+```
 {<token>: {"p": <probability>, "is_end_token": bool, might not be present}}
 ```
 
@@ -160,14 +161,12 @@ When no\_token\_sampling = false:
 
 When no\_token\_sampling = true:
 
-```json
+```
 {<token>: {"p": <probability>, "is_end_token": bool, might not be present} \
 ,<token>: {"p": <probability>, "is_end_token": bool, might not be present} \
 ,...}
 ```
 
-Tokens where `p = 0` will not be present in the JSON output.
-
 If you want to implement your own token sampling, you may want to set
 `max_new_tokens=1` and `stop_at_end_token=false` to suppress rllama's own
 sampling behavior entirely.
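
For illustration, here is a minimal client sketch for the JSON API described in the hunks above. It asks the server to skip its own sampling (`no_token_sampling = true`, `max_new_tokens = 1`, `stop_at_end_token = false`) and reads back the per-token probabilities, which is the custom-sampling workflow the patched README hints at. The host and port (`localhost:8080`), the `prompt` field name, and the `Content-Type` header are assumptions not shown in the patch; the endpoint path, the `Accept` header, and the remaining field names come from the README text being modified.

```python
# Hedged sketch of a client for the rllama inference server JSON API.
# Assumptions (not confirmed by the patch above): the server listens on
# localhost:8080 and the request body carries the input text in a "prompt"
# field. The /rllama/v1/inference path and the sampling-related fields are
# taken from the README text in the hunks.
import json
import urllib.request

API_URL = "http://localhost:8080/rllama/v1/inference"  # host/port are assumed


def query_token_probabilities(prompt: str) -> dict:
    """POST one request and return the per-token probability map.

    With no_token_sampling=true, max_new_tokens=1 and stop_at_end_token=false
    the server is asked to suppress its own sampling and report probabilities
    for the candidate tokens instead, as described in the README.
    """
    body = {
        "temperature": 1.0,     # per-request sampling settings (values arbitrary)
        "top_k": 40,
        "prompt": prompt,       # assumed field name, not shown in the hunk
        "max_new_tokens": 1,
        "stop_at_end_token": False,
        "no_token_sampling": True,
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",  # standard HTTP, assumed accepted
            "Accept": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    probs = query_token_probabilities("Hello")
    # Greedy choice: pick the token with the highest reported probability.
    token, info = max(probs.items(), key=lambda kv: kv[1]["p"])
    print(token, info["p"], info.get("is_end_token", False))
```

Because the probabilities come back as an ordinary JSON object keyed by token, any sampling strategy (greedy, temperature, nucleus, or something more exotic) can be implemented entirely on the client side by repeating this one-token request loop.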