Sustain-Cluster Environment
State Observation
At each timestep \(t\), the environment returns a detailed, variable-length observation \(s_t\). This observation is structured as a list of \(k_t\) task-feature vectors, one per pending task.
Per-Task Vector
By default (in `_get_obs` of `TaskSchedulingEnv`), each pending task \(i\) is represented by a concatenated feature vector of length \(4 + 5 + 5N\), where:
- Global Time Features (4 features): sine/cosine encoding of the day of year and hour of day.
- Task-Specific Features (5 features): origin DC ID, CPU-core requirement, GPU requirement, estimated duration, and time remaining until the SLA deadline.
- Per-Datacenter Features (\(5 \times N\) features): for each of the \(N\) datacenters, available CPU %, available GPU %, available memory %, current carbon intensity (kg CO₂/kWh), and current electricity price (USD/kWh).
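To make the layout concrete, here is a minimal sketch of how such a vector could be assembled. All field names (`origin_dc_id`, `cpu_avail_pct`, etc.) are illustrative placeholders, not the actual attributes used inside `_get_obs`:

```python
import numpy as np

def build_task_vector(frac_of_day, frac_of_year, task, dc_states):
    # Global time features (4): sine/cosine of day-of-year and hour-of-day.
    time_feats = [
        np.sin(2 * np.pi * frac_of_year), np.cos(2 * np.pi * frac_of_year),
        np.sin(2 * np.pi * frac_of_day),  np.cos(2 * np.pi * frac_of_day),
    ]
    # Task-specific features (5).
    task_feats = [
        task["origin_dc_id"], task["cpu_cores_req"], task["gpu_req"],
        task["duration_est"], task["time_to_sla"],
    ]
    # Per-datacenter features (5 per DC, for all N datacenters).
    dc_feats = []
    for dc in dc_states:
        dc_feats += [
            dc["cpu_avail_pct"], dc["gpu_avail_pct"], dc["mem_avail_pct"],
            dc["carbon_intensity"],  # kg CO2/kWh
            dc["price"],             # USD/kWh
        ]
    return np.asarray(time_feats + task_feats + dc_feats, dtype=np.float32)
```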
Because the number of pending tasks \(k_t\) can change between timesteps (in general \(k_t \neq k_{t+1}\)), the overall shape of \(s_t\) varies: \(s_t\) might be a list of 10 vectors (10 tasks), while \(s_{t+1}\) might contain only 5.
Handling Variability
- Off-policy SAC agents use a `FastReplayBuffer` that pads each list of observations to a fixed `max_tasks` length and applies masking during batch updates (see the sketch after this list).
- On-policy agents (e.g. A2C) can process the variable-length list sequentially during rollouts, aggregating per-task values into a single state value.
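The following is a minimal sketch of the pad-and-mask idea, assuming per-task vectors of a fixed feature dimension; it mirrors what such a buffer does internally, but the actual `FastReplayBuffer` API may differ:

```python
import numpy as np

def pad_and_mask(task_vectors, max_tasks, feat_dim):
    # Fixed-size buffers: rows beyond the k real tasks stay zero-padded.
    padded = np.zeros((max_tasks, feat_dim), dtype=np.float32)
    mask = np.zeros(max_tasks, dtype=bool)
    k = min(len(task_vectors), max_tasks)
    if k > 0:
        padded[:k] = np.stack(task_vectors[:k])
        mask[:k] = True
    return padded, mask

# During a batch update, padded rows are excluded via the mask, e.g.:
#   loss = (per_task_loss * mask).sum() / mask.sum()
```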
Customization
Users may override `_get_obs` to include additional information from `self.cluster_manager.datacenters` (e.g. pending queue lengths, detailed thermal state, forecasted CI) or `self.current_tasks`, crafting bespoke state representations tailored to their agent architecture, scheduling strategy, or reward function.
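As a sketch, a subclass could append one extra per-datacenter feature to each task vector. This assumes the default `_get_obs` returns a list of NumPy vectors; `pending_queue_len` is a hypothetical attribute used purely for illustration:

```python
import numpy as np

class CustomObsEnv(TaskSchedulingEnv):
    def _get_obs(self):
        obs = super()._get_obs()  # default list of 4 + 5 + 5N vectors
        # Append one extra feature per datacenter; `pending_queue_len`
        # is an assumed attribute, not part of the documented API.
        extra = np.asarray(
            [getattr(dc, "pending_queue_len", 0.0)
             for dc in self.cluster_manager.datacenters],
            dtype=np.float32,
        )
        return [np.concatenate([vec, extra]) for vec in obs]
```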
Action Space
At each timestep \(t\), the agent receives the list of \(k_t\) pending tasks and must output one discrete action per task:
\(a_i \in \{0, 1, \dots, N\}\)
- \(0\): defer the \(i\)-th task (it remains pending and is reconsidered at the next 15-minute step).
- \(j\) (where \(1 \le j \le N\)): assign the \(i\)-th task to datacenter \(j\), incurring any transmission cost or delay if \(j\) differs from the task's origin.
Since \(k_t\) varies over time, the action requirement per timestep is also variable-length. See State Observation for how existing RL examples accommodate this in both off-policy and on-policy settings.
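A minimal interaction loop with a random policy could look like the sketch below, assuming a Gymnasium-style `reset`/`step` interface and that the number of datacenters is recoverable from `self.cluster_manager.datacenters`; constructor arguments are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
env = TaskSchedulingEnv(...)  # constructor arguments omitted
N = len(env.cluster_manager.datacenters)

obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    k_t = len(obs)  # number of pending tasks this step
    # One action per task: 0 = defer, 1..N = assign to datacenter j.
    actions = [int(rng.integers(0, N + 1)) for _ in range(k_t)]
    obs, reward, terminated, truncated, info = env.step(actions)
```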
Reward Signal
After all \(k_t\) actions for timestep \(t\) are applied and the simulator advances, a single global scalar reward \(r_t\) is returned. This reward is computed by a configurable `RewardFunction` (see Reward Functions), which aggregates performance and sustainability metrics according to user-defined weights and objectives, for example:
- Minimizing operational cost
- Minimizing carbon footprint
- Minimizing total energy consumption
- Minimizing SLA violations
Users may extend or replace the default reward function to reflect custom operational goals and trade-offs.
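As an illustration, a custom reward weighting cost against carbon might look like the sketch below. The real `RewardFunction` interface may differ; `metrics` is assumed to carry aggregate per-step totals from the simulator, and the key names are hypothetical:

```python
class WeightedCostCarbonReward:
    """Illustrative reward: weighted negative sum of cost and carbon.

    `metrics` is an assumed dict of per-step totals; key names are
    placeholders, not the simulator's documented fields.
    """
    def __init__(self, w_cost=0.5, w_carbon=0.5):
        self.w_cost = w_cost
        self.w_carbon = w_carbon

    def __call__(self, metrics):
        # Negate so that lower cost and carbon yield higher reward.
        return -(self.w_cost * metrics["energy_cost_usd"]
                 + self.w_carbon * metrics["carbon_kg"])
```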