.. _environments:

Sustain-Cluster Environment
=====================

State Observation
-----------------

At each timestep :math:`t`, the environment returns a detailed, variable-length observation :math:`s_t`. This observation is structured as a list of :math:`k_t` task-feature vectors, one per pending task.

Per-Task Vector  
By default (in ``_get_obs`` of ``TaskSchedulingEnv``), each pending task :math:`i` is represented by a concatenated feature vector of length :math:`4 + 5 + 5N` where:

- **Global Time Features** (4 features):  
  Sine/cosine encoding of the day of year and hour of day.

- **Task-Specific Features** (5 features):  
  Origin DC ID, CPU-core requirement, GPU requirement, estimated duration, and time remaining until SLA deadline.

- **Per-Datacenter Features** (5 × N features):  
  For each of the :math:`N` datacenters: available CPU %, available GPU %, available memory %, current carbon intensity (kg CO₂/kWh), and current electricity price (USD/kWh).

Because the number of pending tasks :math:`k_t` can change between timesteps (i.e. :math:`k_t ≠ k_{t+1}`), the overall shape of :math:`s_t` varies. For example, :math:`s_t` might be a list of 10 vectors (10 tasks), while :math:`s_{t+1}` might be only 5 vectors (5 tasks).

Handling Variability  
- **Off-policy SAC agents** use a ``FastReplayBuffer`` that pads each list of observations to a fixed ``max_tasks`` length and applies masking during batch updates.  
- **On-policy agents** (e.g. A2C) can process the variable-length list sequentially during rollouts, aggregating per-task values into a single state value.

Customization  
Users may override ``_get_obs`` to include additional information from ``self.cluster_manager.datacenters`` (e.g. pending queue lengths, detailed thermal state, forecasted CI) or ``self.current_tasks`` to craft bespoke state representations tailored to their agent architecture, scheduling strategy, or reward function.

Action Space
------------

At each timestep :math:`t`, the agent receives the list of :math:`k_t` pending tasks and must output one discrete action per task::

  a_i ∈ {0, 1, …, N}

- **0**: defer the :math:`i`-th task (it remains pending and is reconsidered in the next 15-minute step).  
- **j** (where :math:`1 ≤ j ≤ N`): assign the :math:`i`-th task to datacenter :math:`j` (incurring any transmission cost or delay if :math:`j` differs from the task’s origin).

Since :math:`k_t` varies over time, the action requirement per timestep is also variable-length. See **State Observation** for how existing RL examples accommodate this in both off-policy and on-policy settings.

Reward Signal
-------------

After all :math:`k_t` actions for timestep :math:`t` are applied and the simulator advances, a single global scalar reward :math:`r_t` is returned. This reward is computed by a configurable **RewardFunction** (see :ref:`reward-functions`), which aggregates performance and sustainability metrics according to user-defined weights and objectives, for example:

- Minimizing operational cost  
- Minimizing carbon footprint  
- Minimizing total energy consumption  
- Minimizing SLA violations  

Users may extend or replace the default reward function to reflect custom operational goals and trade-offs.