Benchmarking Algorithms¶
SustainDC supports a variety of reinforcement learning algorithms for benchmarking. This section provides an overview of the supported algorithms and highlights their differences.
Supported Algorithms¶
| Abbr. | Algorithm | Description | Use Case |
|---|---|---|---|
| PPO | Proximal Policy Optimization | Balances simplicity and performance with a clipped objective function for stable training (objective shown below the table) | Single-agent environments |
| IPPO | Independent PPO | Each agent operates independently with its own policy and value function | Multi-agent systems with distinct roles |
| MAPPO | Multi-Agent PPO | Uses a centralized value function for better coordination among agents | Cooperative multi-agent tasks |
| HAPPO | Heterogeneous Agent PPO | Designed for heterogeneous agents with different observation and action spaces | Complex environments with diverse agents |
| HATRPO | Heterogeneous Agent Trust Region Policy Optimization | Adapts TRPO to heterogeneous multi-agent settings for stable, robust policy updates | Complex environments requiring robust policy updates |
| HAA2C | Heterogeneous Agent Advantage Actor-Critic | Extends A2C to multi-agent settings with individual actor and critic networks | Scenarios with different types of observations and actions |
| HAD3QN | Heterogeneous Agent Dueling Double Deep Q-Network | Combines dueling networks and double Q-learning for stability and performance | Environments needing fine-grained action distinctions |
| HASAC | Heterogeneous Agent Soft Actor-Critic | Uses entropy regularization for exploration in multi-agent settings | Discrete action spaces and tasks requiring strong exploration |
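For reference, PPO and its multi-agent variants (IPPO, MAPPO, HAPPO) all maximize a clipped surrogate objective, which keeps each policy update close to the data-collecting policy. In the standard form below, $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping range:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$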
Differences and Use Cases¶
PPO vs. IPPO: PPO is for single-agent setups, while IPPO suits multi-agent environments with independent learning.
IPPO vs. MAPPO: IPPO treats agents independently; MAPPO coordinates agents with a centralized value function, ideal for cooperative tasks (see the critic sketch after this list).
MAPPO vs. HAPPO: Both use centralized value functions, but HAPPO is for heterogeneous agents with different capabilities.
HAPPO vs. HATRPO: HAPPO uses PPO-style clipped updates; HATRPO adapts TRPO's trust-region constraint for more stable and robust policy updates in heterogeneous settings (the constraint is shown below).
HAPPO vs. HAA2C: HAPPO is PPO-based; HAA2C extends A2C to multi-agent systems, offering stability and performance trade-offs.
HAA2C vs. HAD3QN: HAA2C is an actor-critic method; HAD3QN uses value-based learning with dueling architectures and double Q-learning.
HAD3QN vs. HASAC: HAD3QN is value-based; HASAC is an actor-critic method that uses entropy regularization for exploration, adapted here to discrete action spaces (entropy objective below).
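To make the IPPO/MAPPO distinction concrete, the sketch below (PyTorch, with illustrative agent names and observation sizes, not SustainDC's actual implementation) contrasts independent per-agent critics that see only local observations with a single centralized critic that sees the concatenated global observation:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # Small two-layer value head used for all critics in this sketch.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

# Hypothetical per-agent local observation sizes (for illustration only).
obs_dims = {"agent_ls": 10, "agent_dc": 8, "agent_bat": 6}

# IPPO: one independent critic per agent, conditioned only on that agent's local observation.
ippo_critics = {name: mlp(dim, 1) for name, dim in obs_dims.items()}

# MAPPO: a single centralized critic conditioned on the concatenated (global) observation;
# each agent's actor still acts from its own local observation.
mappo_critic = mlp(sum(obs_dims.values()), 1)

# Dummy local observations for a batch of size 1.
local_obs = {name: torch.randn(1, dim) for name, dim in obs_dims.items()}

ippo_values = {name: ippo_critics[name](obs) for name, obs in local_obs.items()}  # one value per agent
global_obs = torch.cat([local_obs[name] for name in obs_dims], dim=-1)
mappo_value = mappo_critic(global_obs)  # shared value estimate for all agents
```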
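The HAPPO/HATRPO contrast mirrors the single-agent PPO/TRPO one: instead of clipping the probability ratio, a TRPO-style update maximizes the unclipped surrogate objective subject to an explicit KL-divergence trust-region constraint, which HATRPO applies agent by agent in the heterogeneous setting:

$$
\max_\theta\ \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{subject to} \quad
\mathbb{E}_t\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta
$$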
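HASAC's exploration behavior comes from the maximum-entropy objective inherited from SAC: the expected return is augmented with a policy-entropy bonus weighted by a temperature $\alpha$, and in a discrete-action adaptation the entropy of the categorical policy can be computed exactly:

$$
J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big]
$$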
By supporting a diverse set of algorithms, SustainDC allows researchers to benchmark and compare the performance of various reinforcement learning approaches in the context of sustainable DC control.