Environment
Our environment is built on a Multi-Agent Reinforcement Learning (MARL) foundation. Our goal, however, is to train asynchronous policies, which calls for an environment that closely mimics a truly asynchronous scenario. For training purposes, we still need to advance the environment in synchronous rounds.
The solution is straightforward: in each round, we randomly select the next node for execution. At this point we introduce a further simplification: we group the multiple MARL agents into a single agent, at the cost of some limitations on the reward. Although these limitations can largely be circumvented with additional effort (e.g., trajectory reward post-processing), they are acceptable when the problems to be solved are simple enough that the 1-step reward adequately represents the true reward an agent should receive for its next action.
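The following is a minimal sketch of such a training round, using illustrative names (`env`, `shared_policy`, and their methods are assumptions, not the actual API): a node is drawn at random for execution, and a single shared policy acts on its behalf, receiving that node's 1-step reward immediately.

```python
import random

# Minimal sketch (illustrative names only) of one synchronous training round
# that approximates asynchronous execution: a random node is selected, and a
# single shared policy acts for it, receiving that node's 1-step reward.
def run_round(env, shared_policy):
    node_id = random.choice(env.node_ids)       # randomly select the next acting node
    observation = env.observe(node_id)          # local view of the selected node
    action = shared_policy.act(observation)     # one policy serves all grouped agents
    reward, done = env.step(node_id, action)    # 1-step reward attributed to this node
    return node_id, action, reward, done
```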
Setup
We have N nodes in a graph G, tasked to solve a specific graph problem represented by the task T.
Depending on the task description, every node begins with an initial action, and a random starting node is chosen.
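A rough sketch of this setup, under stated assumptions (the `task.initial_action` hook and the random-graph placeholder for G are hypothetical, chosen only for illustration):

```python
import random
import networkx as nx

# Sketch of the setup: N nodes in a graph G, each node given a
# task-dependent initial action, and a random starting node chosen.
def initialize(num_nodes, task, edge_prob=0.3):
    graph = nx.gnp_random_graph(num_nodes, edge_prob)    # placeholder graph G
    initial_actions = {
        node: task.initial_action(node, graph)           # hypothetical task hook
        for node in graph.nodes
    }
    starting_node = random.choice(list(graph.nodes))     # random starting node
    return graph, initial_actions, starting_node
```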
The acting node has three options:
- It can choose to act.
- It can pick a new action.
- It can decide to inform its neighbors about the update.
Once these choices are made, the environment moves forward, rewards are computed, and a new acting node is selected. Although this setup may seem involved, it allows for efficient policy learning and accurate problem-solving in the MARL context.
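A sketch of one such round is shown below; the environment methods (`observe`, `set_action`, `notify_neighbors`, `advance`, `compute_reward`) and the `policy.decide` interface are assumptions introduced for illustration.

```python
import random
from enum import Enum

class Choice(Enum):
    ACT = 0            # execute the node's current action
    NEW_ACTION = 1     # pick a new action
    INFORM = 2         # inform neighbors about the update

# Hypothetical step function sketching one environment round: the acting node
# makes its choice, the environment advances, a reward is computed, and the
# next acting node is drawn at random.
def env_step(env, acting_node, policy):
    obs = env.observe(acting_node)
    choice, new_action = policy.decide(obs)           # e.g. (Choice.NEW_ACTION, action)
    if choice is Choice.NEW_ACTION:
        env.set_action(acting_node, new_action)       # replace the node's action
    elif choice is Choice.INFORM:
        env.notify_neighbors(acting_node)             # push the update to neighbors
    env.advance(acting_node)                          # environment moves forward
    reward = env.compute_reward(acting_node)          # compute the 1-step reward
    next_node = random.choice(list(env.graph.nodes))  # select the new acting node
    return reward, next_node
```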