Benchmarking Algorithms

Sustain-Cluster currently employs the Soft Actor–Critic (SAC) algorithm, an off‐policy actor–critic method that maximizes a stochastic policy’s entropy‐augmented return. SAC interleaves two updates:

Policy (Actor) Update

\[J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\,a_t \sim \pi_\phi} \Bigl[\, \alpha\,\log \pi_\phi(a_t\mid s_t) \;-\; Q_\theta(s_t, a_t) \Bigr] \,. \tag{1}\]

Here:

  • \(\pi_\phi(a\mid s)\) denotes the stochastic policy parameterized by \(\phi\).

  • \(Q_\theta(s,a)\) is the soft Q-function parameterized by \(\theta\).

  • \(\alpha > 0\) is the temperature coefficient balancing exploration (via entropy) and exploitation.

  • \(\mathcal{D}\) is the replay buffer of past transitions.

Minimizing \(J_{\pi}\) encourages the policy to choose actions that both achieve high soft‐Q values and maintain high entropy, yielding robust exploration.
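To make Eq. (1) concrete, the following minimal PyTorch-style sketch estimates the actor loss from a minibatch of states. The `policy.sample` interface and the object names are illustrative assumptions, not Sustain-Cluster’s actual API.

```python
# Hypothetical sketch of the actor loss in Eq. (1); the policy.sample()
# interface and object names are illustrative assumptions.
import torch


def actor_loss(policy, q_net, states: torch.Tensor, alpha: float) -> torch.Tensor:
    """Monte-Carlo estimate of J_pi(phi) over a minibatch of states s_t ~ D."""
    # a_t ~ pi_phi(.|s_t), sampled with reparameterization, plus log pi_phi(a_t|s_t).
    actions, log_probs = policy.sample(states)
    # Soft Q-value Q_theta(s_t, a_t) of the sampled actions.
    q_values = q_net(states, actions)
    # alpha * log pi_phi(a_t|s_t) - Q_theta(s_t, a_t), averaged over the minibatch.
    return (alpha * log_probs - q_values).mean()
```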

Q-Function (Critic) Update

\[J_{Q}(\theta) = \mathbb{E}_{(s_t,a_t)\sim \mathcal{D}} \Bigl[\, \tfrac{1}{2}\bigl(Q_\theta(s_t,a_t) - y_t\bigr)^{2} \Bigr] \tag{2}\]

where the soft‐Bellman backup target is

\[y_t = r(s_t,a_t) \;+\; \gamma\, \mathbb{E}_{\substack{s_{t+1}\sim p(\cdot\mid s_t,a_t)\\ a_{t+1}\sim \pi_\phi}} \Bigl[ Q_{\bar\theta}(s_{t+1},a_{t+1}) \;-\; \alpha\,\log \pi_\phi(a_{t+1}\mid s_{t+1}) \Bigr] \,.\]

Here:

  • \(r(s_t,a_t)\) is the immediate reward at step \(t\).

  • \(\gamma \in [0,1)\) is the discount factor.

  • \(Q_{\bar\theta}\) is a target network with parameters \(\bar\theta\), updated via Polyak averaging to stabilize training.

  • \(p(s_{t+1}\mid s_t,a_t)\) is the environment’s transition probability.

By fitting \(Q_\theta\) to these targets, the critic learns to approximate the entropy‐regularized state‐action value. The temperature term \(\alpha\) again trades off between reward maximization and policy entropy.
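As a sketch of how the soft-Bellman target and critic loss of Eq. (2) might be computed in practice, the snippet below assumes the same hypothetical policy and critic interfaces as above, and adds a `(1 - done)` mask to stop bootstrapping at terminal transitions, a standard practical detail not shown in the equation.

```python
# Hypothetical sketch of the soft-Bellman target and critic loss in Eq. (2);
# the network/buffer interfaces are illustrative assumptions.
import torch
import torch.nn.functional as F


def critic_loss(q_net, target_q_net, policy, batch, gamma: float, alpha: float):
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        # a_{t+1} ~ pi_phi(.|s_{t+1}) and its log-probability.
        next_actions, next_log_probs = policy.sample(next_states)
        # Q_{theta_bar}(s_{t+1}, a_{t+1}) - alpha * log pi_phi(a_{t+1}|s_{t+1})
        soft_value = target_q_net(next_states, next_actions) - alpha * next_log_probs
        # y_t = r_t + gamma * soft_value; (1 - done) zeroes the bootstrap at episode ends.
        targets = rewards + gamma * (1.0 - dones) * soft_value
    # (1/2) * (Q_theta(s_t, a_t) - y_t)^2, averaged over the minibatch.
    return 0.5 * F.mse_loss(q_net(states, actions), targets)
```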

Overall, SAC proceeds by sampling minibatches from the replay buffer, performing a gradient descent step on the critic loss \(J_Q\), then updating the policy parameters \(\phi\) to minimize \(J_{\pi}\), and finally updating the target network parameters \(\bar\theta\) towards \(\theta\). This off‐policy, entropy‐regularized framework yields both sample efficiency and stable learning.
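Putting the pieces together, one training iteration could look like the sketch below, which reuses the hypothetical `actor_loss` and `critic_loss` functions from the earlier snippets; the optimizer and replay-buffer objects are again assumptions rather than names from the Sustain-Cluster codebase.

```python
# Illustrative sketch of one SAC iteration: critic step, actor step, then a
# Polyak-averaged target update. Reuses the hypothetical actor_loss/critic_loss above.
import torch


def polyak_update(target_net, net, tau: float = 0.005):
    """theta_bar <- tau * theta + (1 - tau) * theta_bar."""
    with torch.no_grad():
        for p_bar, p in zip(target_net.parameters(), net.parameters()):
            p_bar.mul_(1.0 - tau).add_(tau * p)


def sac_step(replay_buffer, policy, q_net, target_q_net,
             pi_optim, q_optim, gamma: float, alpha: float, batch_size: int = 256):
    # Sample a minibatch of transitions (s_t, a_t, r_t, s_{t+1}, done) from D.
    batch = replay_buffer.sample(batch_size)

    # 1) Gradient step on the critic loss J_Q (Eq. 2).
    q_optim.zero_grad()
    critic_loss(q_net, target_q_net, policy, batch, gamma, alpha).backward()
    q_optim.step()

    # 2) Gradient step on the actor loss J_pi (Eq. 1), using the batch states only.
    pi_optim.zero_grad()
    actor_loss(policy, q_net, batch[0], alpha).backward()
    pi_optim.step()

    # 3) Move the target parameters theta_bar towards theta.
    polyak_update(target_q_net, q_net)
```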

We are also extending Sustain-Cluster to support on‐policy methods such as Advantage Actor–Critic (A2C); this support is currently a work in progress. In future releases, we plan to integrate additional algorithms from the Ray RLlib ecosystem (see the Ray RLlib documentation) to enable a broader and more rigorous benchmarking suite.