
Environment

Our environment is built on a Multi-Agent Reinforcement Learning (MARL) foundation. However, our goal is to train asynchronous policies, which calls for an environment that closely mimics a truly asynchronous setting. For training, we still need to advance the environment in synchronous rounds.

The solution is to randomly select the next node for execution, a seemingly straightforward step. On top of this, we introduce a further simplification: multiple MARL agents can be grouped into a single agent, subject to some limitations on the reward. These limitations can largely be circumvented with additional effort (such as post-processing trajectory rewards), but they are acceptable when the problem is simple enough that the 1-step reward adequately represents the true reward an agent should receive for its next action.
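As a rough illustration of this round structure, the loop below emulates asynchrony by letting a single shared policy act for whichever node is drawn at random in each synchronous round. This is only a sketch: `env`, `policy`, and their methods are placeholder assumptions, not the actual interface.

```python
import random

def run_rounds(env, policy, num_rounds=100):
    """Emulate asynchrony: one randomly chosen node acts per synchronous round."""
    obs = env.reset()                        # e.g. {node_id: observation}
    for _ in range(num_rounds):
        node = random.choice(list(obs))      # randomly select the next acting node
        action = policy(obs[node])           # one grouped policy acts for all nodes
        obs, reward, done = env.step(node, action)
        # The 1-step reward for this node is used as-is; this is the simplification
        # that makes grouping many MARL agents into a single agent acceptable.
        if done:
            break
```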

Setup

We have N nodes in a graph G, tasked with solving a specific graph problem represented by the task T.

Depending on the task description, every node begins with an initial action, and a random starting node is chosen.
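A minimal sketch of this initialization, assuming the graph is an adjacency mapping and `task.initial_action` is a hypothetical helper that supplies the task-defined starting action:

```python
import random

def reset(graph, task):
    """Assign each node its initial action and pick a random starting node."""
    actions = {node: task.initial_action(node) for node in graph}
    acting_node = random.choice(list(graph))   # random starting node
    return actions, acting_node
```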

The acting node has three options:

  1. It can choose to act.
  2. It can pick a new action.
  3. It can decide to inform its neighbors about the update.

Once these steps are executed, the environment advances, rewards are computed, and a new acting node is selected. Although this setup may seem complex, it allows for efficient policy learning and accurate problem solving in the MARL context.
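To make a round concrete, here is a sketch of a single step under the same assumptions as above. The encoding of the three options as a discrete choice, the message format, and `task.reward` are illustrative guesses rather than the actual interface.

```python
import random

ACT, NEW_ACTION, NOTIFY = range(3)   # the three options available to the acting node

def step(graph, actions, acting_node, choice, value, task):
    """Advance one synchronous round and hand control to a new acting node."""
    messages = []
    if choice == NEW_ACTION:
        actions[acting_node] = value                     # pick a new action
    elif choice == NOTIFY:
        messages = [(acting_node, nbr, actions[acting_node])
                    for nbr in graph[acting_node]]       # inform neighbors of the update
    # choice == ACT: the node simply executes its current action (left abstract here)
    reward = task.reward(graph, actions, acting_node)    # 1-step reward
    next_acting_node = random.choice(list(graph))        # select the next acting node
    return actions, messages, reward, next_acting_node
```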