Benchmarking Algorithms

SustainDC supports a variety of reinforcement learning algorithms for benchmarking. This section provides an overview of the supported algorithms and highlights their differences.

Supported Algorithms

| Abbr. | Algorithm | Description | Use Case |
|-------|-----------|-------------|----------|
| PPO | Proximal Policy Optimization | Balances simplicity and performance with a clipped objective function for stable training (the objective is written out below the table) | Single-agent environments |
| IPPO | Independent PPO | Each agent learns independently with its own policy and value function | Multi-agent systems with distinct roles |
| MAPPO | Multi-Agent PPO | Uses a centralized value function for better coordination among agents | Cooperative multi-agent tasks |
| HAPPO | Heterogeneous Agent PPO | Designed for heterogeneous agents with different observation and action spaces | Complex environments with diverse agents |
| HATRPO | Heterogeneous Agent Trust Region Policy Optimization | Adapts TRPO to heterogeneous multi-agent settings for stable, robust policy updates | Complex environments requiring robust policy updates |
| HAA2C | Heterogeneous Agent Advantage Actor-Critic | Extends A2C to multi-agent settings with individual actor and critic networks per agent | Scenarios with different types of observations and actions |
| HAD3QN | Heterogeneous Agent Dueling Double Deep Q-Network | Combines dueling networks and double Q-learning for stability and performance | Environments needing fine-grained action distinctions |
| HASAC | Heterogeneous Agent Soft Actor-Critic | Uses entropy regularization for exploration in multi-agent settings | Adapted to discrete action spaces; tasks with high exploration needs |
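For reference, the clipped surrogate objective mentioned in the PPO row (and reused by the PPO-based variants above) is the standard one:

```latex
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\left(
      r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
    \right)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

where \hat{A}_t is an advantage estimate and \epsilon is the clipping range. Clipping the probability ratio r_t(\theta) keeps each update close to the previous policy, which is the source of the training stability noted in the table.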

Differences and Use Cases

  • PPO vs. IPPO: PPO is for single-agent setups, while IPPO suits multi-agent environments with independent learning.

  • IPPO vs. MAPPO: IPPO treats agents independently; MAPPO coordinates agents with a centralized value function, ideal for cooperative tasks (see the sketch after this list for the structural difference).

  • MAPPO vs. HAPPO: Both use centralized value functions, but HAPPO is for heterogeneous agents with different capabilities.

  • HAPPO vs. HATRPO: HAPPO uses PPO-based updates; HATRPO adapts TRPO for more stable and robust policy updates in heterogeneous settings.

  • HAPPO vs. HAA2C: HAPPO uses PPO-style clipped updates; HAA2C extends A2C to multi-agent systems, offering a simpler update with different stability and performance trade-offs.

  • HAA2C vs. HAD3QN: HAA2C is an actor-critic method; HAD3QN uses value-based learning with dueling and double Q-learning.

  • HAD3QN vs. HASAC: HAD3QN is value-based; HASAC is an actor-critic method that uses entropy regularization for exploration, originally formulated for continuous action spaces and adapted to discrete action spaces in SustainDC.
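To make the independent vs. centralized critic distinction concrete, here is a minimal PyTorch sketch (purely illustrative; the agent count, observation sizes, and network shapes are assumptions, not SustainDC's actual configuration). An IPPO-style critic sees only its own agent's observation, while a MAPPO-style centralized critic sees the concatenated observations of all agents:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not SustainDC's actual spaces).
num_agents = 3   # e.g., three cooperating DC control agents
obs_dim = 16     # per-agent observation size
hidden = 64

def value_net(in_dim: int) -> nn.Module:
    """A small MLP mapping an observation (or joint observation) to a scalar value."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

# IPPO-style: each agent has its own critic conditioned only on its local observation.
independent_critics = [value_net(obs_dim) for _ in range(num_agents)]

# MAPPO-style: a single centralized critic conditioned on the joint observation.
centralized_critic = value_net(num_agents * obs_dim)

local_obs = [torch.randn(1, obs_dim) for _ in range(num_agents)]
joint_obs = torch.cat(local_obs, dim=-1)

ippo_values = [critic(obs) for critic, obs in zip(independent_critics, local_obs)]
mappo_value = centralized_critic(joint_obs)

print([tuple(v.shape) for v in ippo_values])  # three independent value estimates
print(tuple(mappo_value.shape))               # one joint value estimate
```

In both cases the policies themselves stay decentralized; the centralized critic only changes how advantages are estimated during training, which is what enables the tighter coordination noted for MAPPO (and, with per-agent heterogeneity, for HAPPO and HATRPO).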

By supporting a diverse set of algorithms, SustainDC allows researchers to benchmark and compare the performance of various reinforcement learning approaches in the context of sustainable data center (DC) control.
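In practice, such a comparison amounts to sweeping the algorithm choice (and random seeds) while holding the environment configuration fixed, then aggregating a common metric. The loop below is a purely hypothetical sketch of that workflow; the train() entry point, algorithm keys, and result fields are illustrative assumptions, not SustainDC's actual API:

```python
from statistics import mean

# Hypothetical benchmarking sweep (illustrative only).
ALGORITHMS = ["ppo", "ippo", "mappo", "happo", "hatrpo", "haa2c", "had3qn", "hasac"]
SEEDS = [0, 1, 2]

def train(algorithm: str, seed: int) -> dict:
    """Placeholder for a real training run; returns a dummy episodic return."""
    return {"algorithm": algorithm, "seed": seed, "episode_return": 0.0}

results = {}
for algo in ALGORITHMS:
    returns = [train(algo, seed)["episode_return"] for seed in SEEDS]
    results[algo] = mean(returns)

# Rank algorithms by mean return across seeds (higher is better).
for algo, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{algo:<8} mean return over {len(SEEDS)} seeds: {score:.2f}")
```

Comparisons like this are only meaningful when every algorithm sees the same environment configuration, reward weighting, and evaluation horizon.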