diff --git a/README.md b/README.md
index bbac99b..971e504 100644
--- a/README.md
+++ b/README.md
@@ -41,6 +41,8 @@ How it works: RWKV gathers information to a number of channels, which are also d
 
 **RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable)**. For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect. Moreover, you can fine-tune RWKV into a non-parallelizable RNN (then you can use outputs of later layers of the previous token) if you want extra performance.
 
+![RWKV-formula](RWKV-formula.png)
+
 Here are some of my TODOs. Let's work together :)
 
 * HuggingFace integration (check https://github.com/huggingface/transformers/issues/17230
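
The context paragraph in this hunk carries the key claim: because each channel's decay `w` is fixed with respect to the data, the recurrence `a_t = w * a_{t-1} + x_t` unrolls to `a_t = Σ_{s≤t} w^(t−s) x_s`, which can be computed for all timesteps at once. Below is a minimal NumPy sketch of that idea, not the repo's actual Torch/CUDA implementation; the function names, shapes, and decay values are illustrative:

```python
import numpy as np

def decay_scan_sequential(x, w):
    """RNN-style view: a_t = w * a_{t-1} + x_t, one step at a time."""
    a = np.zeros_like(x[0])
    out = []
    for x_t in x:
        a = w * a + x_t
        out.append(a)
    return np.stack(out)

def decay_scan_parallel(x, w):
    """Same result with no loop over time.

    Because w never depends on the data, the recurrence unrolls to
    a_t = sum_{s<=t} w^(t-s) * x_s, a fixed lower-triangular weighting
    over the time axis (in practice a convolution / parallel scan).
    """
    T = len(x)
    t = np.arange(T)
    exponent = t[:, None] - t[None, :]      # (T, T): t - s
    K = np.where(exponent[..., None] >= 0,  # keep s <= t only
                 w ** exponent[..., None],  # w^(t-s), per channel
                 0.0)                       # (T, T, C)
    return np.einsum('tsc,sc->tc', K, x)

T, C = 16, 4                        # 16 time steps, 4 channels (illustrative)
x = np.random.randn(T, C)
w = np.array([0.8, 0.5, 0.9, 0.3])  # trainable per-channel decays, fixed w.r.t. the data

assert np.allclose(decay_scan_sequential(x, w), decay_scan_parallel(x, w))
```

If `w` instead depended on the input at each step (a data-dependent gate, as in an ordinary gated RNN), the kernel `K` would itself be a function of the data and the closed form would disappear, which is exactly why such RNNs must run step by step.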