diff --git a/README.md b/README.md
index 2fac2cc..72b1073 100644
--- a/README.md
+++ b/README.md
@@ -264,6 +264,31 @@ I believe RWKV is performant because W is like repeatedly applying a diagonal ma
 Moreover it's possible to turn it into a continuous ODE (a bit similar to State Space Models). I will write about it later.

+## Multimodal ideas
+
+I have an idea for [text --> 32x32 RGB image] using an LM (transformer, RWKV, etc.). I will test it soon.
+
+Firstly, use LM loss (instead of L2 loss), so the image will not be blurry.
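+
+A minimal sketch of this loss (untested; dummy tensors stand in for real LM outputs):
+
+```python
+import torch
+import torch.nn.functional as F
+
+B = 4                                        # batch size (illustrative)
+logits = torch.randn(B, 1024, 512)           # LM output: a 512-way softmax per pixel
+targets = torch.randint(0, 512, (B, 1024))   # quantized image tokens (see below)
+
+# Cross-entropy keeps a full distribution over colors for each pixel, so
+# sampling stays sharp; L2 regression on RGB values would average over
+# plausible images and produce blurry output.
+loss = F.cross_entropy(logits.reshape(-1, 512), targets.reshape(-1))
+```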
+
+Secondly, use color quantization. For example, allow only 8 levels for each of R/G/B. Then the vocab size is 8x8x8 = 512 (for each pixel), instead of 2^24.
+Therefore a 32x32 RGB image = a length-1024 sequence with vocab 512 (image tokens), which is a typical input for usual LMs.
+(Later we can use diffusion models to upsample and generate RGB888 images. We might be able to use an LM for this too.)
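+
+A sketch of this quantization round-trip (untested; the function names are mine):
+
+```python
+import torch
+
+def img_to_tokens(img):
+    # img: float RGB tensor in [0, 1], shape (3, 32, 32)
+    levels = (img * 7).round().long().clamp(0, 7)  # 8 levels per channel
+    r, g, b = levels[0], levels[1], levels[2]
+    return (r * 64 + g * 8 + b).flatten()          # len-1024, vocab 8*8*8 = 512
+
+def tokens_to_img(tokens):
+    t = tokens.view(32, 32)
+    rgb = torch.stack([t // 64, (t // 8) % 8, t % 8])
+    return rgb.float() / 7                          # coarse RGB, ready for upsampling
+```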
+
+Thirdly, use 2D positional embeddings that are easy for the model to understand.
+For example, add one-hot X & Y coords to the first 64 (= 32+32) channels. Say the pixel is at x=8, y=20; then we add 1 to channel 8 and channel 52 (= 32+20).
+Moreover, we can probably add the float X & Y coords (normalized to the 0~1 range) to another 2 channels. Other periodic positional encodings might help too (I will test).
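+
+A sketch of these positional channels (untested; the channel layout is as described above, n_embd and the function name are illustrative):
+
+```python
+import torch
+
+def pos_channels(n_embd=256):
+    pe = torch.zeros(32, 32, n_embd)
+    y, x = torch.meshgrid(torch.arange(32), torch.arange(32), indexing='ij')
+    pe[y, x, x] = 1.0              # channels 0..31: one-hot X coord
+    pe[y, x, 32 + y] = 1.0         # channels 32..63: one-hot Y coord
+    pe[..., 64] = x / 31.0         # channel 64: float X in 0~1
+    pe[..., 65] = y / 31.0         # channel 65: float Y in 0~1
+    return pe.view(1024, n_embd)   # one row per pixel, added to token embeddings
+```
+
+For the pixel at x=8, y=20 this sets channel 8 and channel 52 to 1, matching the example above.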
+
+Finally, use RandRound (stochastic rounding) when doing the color quantization in the DataLoader.
+For example, if the float level is 4.578, then there is a 57.8% chance of using 5 and a 42.2% chance of using 4.
+We can allow both 4 and 5 in the prediction, but the loss will be higher (on average) if the prediction is 4.
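+
+A sketch of RandRound (untested); in the DataLoader it would replace the deterministic .round() in the quantization sketch above:
+
+```python
+import torch
+
+def rand_round(x):
+    # x: float color levels, e.g. 4.578 -> 5 with prob 0.578, else 4
+    f = torch.floor(x)
+    return (f + (torch.rand_like(x) < (x - f)).float()).long()
+
+# Sanity check: rand_round(torch.full((100000,), 4.578)).float().mean() ~ 4.578
+```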
+
+Multi-task training might help too. I will try this dataset format:
+[TxtFirst] [Desc of Img (txt tokens)] [Img] [img tokens]
+and sometimes
+[ImgFirst] [img tokens] [Txt] [Desc of Img (txt tokens)]
+Here [TxtFirst], [ImgFirst], [Img], and [Txt] are special tokens. The order of the imgs shall be randomized in the DataLoader, with random sampling of the full dataset, so sometimes the model will see the img tokens first and then the corresponding txt tokens, which is an [img -> txt] task. The model will also see partial imgs and partial txts. I think a char-level LM might help the model write correct text on images.
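+
+A sketch of how the DataLoader could build these samples (untested; the special-token ids are placeholders):
+
+```python
+import random
+
+TXT_FIRST, IMG_FIRST, IMG, TXT = 512, 513, 514, 515  # placeholder special-token ids
+
+def make_sample(txt_tokens, img_tokens):
+    # txt_tokens, img_tokens: lists of int token ids for one (caption, image) pair.
+    # Randomly pick the [txt -> img] or [img -> txt] direction for each pair.
+    if random.random() < 0.5:
+        return [TXT_FIRST] + txt_tokens + [IMG] + img_tokens
+    return [IMG_FIRST] + img_tokens + [TXT] + txt_tokens
+
+# Training windows are then sampled at random positions over the concatenated
+# samples, so the model also sees partial imgs and partial txts.
+```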
+
 ## How to sample a large dataset (for training)

 I am using a trick to sample the Pile deterministically yet randomly enough.