Policy
Our policy advances Multi-Agent Reinforcement Learning (MARL) by combining a centralized training procedure with distributed deployment and evaluation: training has global access to the collected trajectories, while at execution time every node runs its own copy of the policy on local information.
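To make the execution-time view concrete, the sketch below uses a hypothetical recurrent policy class NodePolicy with a GRU core and made-up dimensions; it is an illustration of the distributed-deployment idea, not the exact architecture used here. Every node holds an identical copy of the weights and steps the policy purely on local inputs (its observation, received messages, and its own hidden state), so no central coordinator is needed at deployment time.

import torch
import torch.nn as nn

class NodePolicy(nn.Module):
    """Shared policy: every node holds an identical copy of these weights."""
    def __init__(self, obs_dim, msg_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.core = nn.GRUCell(obs_dim + msg_dim, hidden_dim)
        self.msg_head = nn.Linear(hidden_dim, msg_dim)  # outgoing message to neighbors
        self.act_head = nn.Linear(hidden_dim, act_dim)  # action logits for this node

    def forward(self, obs, incoming_msg, h):
        # One local step: fold the observation and the received message into the hidden state.
        h = self.core(torch.cat([obs, incoming_msg], dim=-1), h)
        return self.msg_head(h), self.act_head(h), h

# Distributed evaluation: each node steps the shared policy on purely local inputs.
num_nodes, obs_dim, msg_dim, act_dim, hidden_dim = 3, 8, 4, 2, 64
policy = NodePolicy(obs_dim, msg_dim, act_dim, hidden_dim)
hidden = {n: torch.zeros(1, hidden_dim) for n in range(num_nodes)}
inbox = {n: torch.zeros(1, msg_dim) for n in range(num_nodes)}
observations = {n: torch.randn(1, obs_dim) for n in range(num_nodes)}
for n in range(num_nodes):
    msg, act_logits, hidden[n] = policy(observations[n], inbox[n], hidden[n])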
A critical innovation for asynchronous policies is making message gradients available. To this end, we reconstruct the hidden state of every node involved in the trajectory by recomputing it from scratch with the current parameters; note that all nodes share the same internal policy. This reconstruction makes training slightly "off-policy", since the decisions to send messages and to act may have been sampled under a different version of the policy. However, because the hidden state itself changes only slightly, if at all, we consider this a minor tradeoff.
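The following sketch illustrates this reconstruction step under the same illustrative assumptions as before (the hypothetical NodePolicy from the previous sketch, a ring-shaped communication graph, and a stored trajectory of per-node observations). Hidden states and messages are not replayed from storage; they are recomputed from scratch with the current shared parameters so that gradients flow through the messages, while the stored actions, sampled under possibly older parameters, are what introduces the mild off-policy effect described above.

import torch

def recompute_trajectory(policy, traj, num_nodes, msg_dim=4, hidden_dim=64):
    """Rebuild hidden states and messages from scratch with the current weights."""
    hidden = {n: torch.zeros(1, hidden_dim) for n in range(num_nodes)}
    inbox = {n: torch.zeros(1, msg_dim) for n in range(num_nodes)}
    logits_per_step = []
    for obs_t in traj["obs"]:  # obs_t maps node id -> observation tensor
        outbox, step_logits = {}, {}
        for n in range(num_nodes):
            msg, logits, hidden[n] = policy(obs_t[n], inbox[n], hidden[n])
            outbox[n] = msg
            step_logits[n] = logits
        # Deliver messages along an illustrative ring-shaped communication graph.
        inbox = {n: outbox[(n - 1) % num_nodes] for n in range(num_nodes)}
        logits_per_step.append(step_logits)
    return logits_per_step  # differentiable w.r.t. the recomputed messages

The recomputed logits can then be scored against the stored sampled actions to form the policy-gradient loss, so gradients reach both the action heads and the message heads of the shared policy.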
To further improve training, we give the acting node access to the sampled neighbor action. Care must be taken that this information is not leaked before the final step: although it may be tempting, exposing it prematurely can degrade policy training. This mechanism is a delicate balance that we continue to refine to optimize performance and learning efficiency.
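One simple way to enforce this constraint, sketched below with a hypothetical input-building helper and an illustrative input layout, is to zero-mask the extra input on every step except the final one.

import torch

def build_policy_input(obs, incoming_msg, neighbor_action, step, final_step):
    """Concatenate local features, zero-masking the neighbor action until the final step."""
    is_final = float(step == final_step)
    gated_action = neighbor_action * is_final  # all zeros on every non-final step
    return torch.cat([obs, incoming_msg, gated_action], dim=-1)

Because the gate is a plain multiplication, the input dimensionality stays fixed across steps while nothing about the sampled neighbor action can reach the policy before the final step.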