Custom Workload Data

By default, Sustain-Cluster includes workload traces from Alibaba and Google data centers. These traces define the tasks that the simulated datacenters must process, providing a realistic and dynamic workload for benchmarking.

Data Source

The default workload traces are extracted from:

  • Alibaba 2020 GPU Cluster Trace (LINK)

Processed Dataset Format & Content

After preprocessing, the Alibaba trace is stored as a Pandas DataFrame in a binary pickle file:

data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl

Each row in this DataFrame represents a 15-minute arrival interval (UTC) and contains:

  • tasks_matrix (NumPy array of shape N×12, one row per task): detailed per-task features for all tasks arriving in that interval. Columns (in order; see the loading sketch after this list):

    1. job_id

    2. start_time (Unix timestamp)

    3. end_time (Unix timestamp)

    4. start_dt (Python datetime)

    5. duration_min (float)

    6. cpu_usage (%)

    7. gpu_wrk_util (%)

    8. avg_mem (GB)

    9. avg_gpu_wrk_mem (GB)

    10. bandwidth_gb

    11. weekday_name (e.g., “Monday”)

    12. weekday_num (0 = Monday … 6 = Sunday)
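
For orientation, the minimal sketch below loads the file and inspects one interval; it assumes tasks_matrix is stored as an ordinary DataFrame column, which may differ from the actual layout.

    import pandas as pd

    # Load the preprocessed year-long trace (path as documented above).
    df = pd.read_pickle("data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl")

    # Inspect the first 15-minute interval. Accessing tasks_matrix as a regular
    # DataFrame column is an assumption about the layout.
    first_interval = df.iloc[0]
    tasks = first_interval["tasks_matrix"]

    print(tasks.shape)   # (N, 12): one row per task, columns in the order listed above
    print(tasks[0])      # first task: job_id, timestamps, resource usage, ...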

Preprocessing Steps

To adapt the raw two-month trace for year-long, continuous simulation, we apply the following steps (a minimal sketch of the filtering and grouping follows the list):

  1. Duration filtering: drop all tasks shorter than 15 minutes.

  2. Temporal extension: replicate and blend daily/weekly patterns to expand two months → full year.

  3. Origin assignment: probabilistically assign each task to a datacenter region based on population weights and local time-of-day activity. (See utils/workload_utils.assign_task_origins and main paper § 7.3 for details.)

  4. Interval grouping: bucket tasks into 15-minute UTC intervals.
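
The sketch below illustrates steps 1 and 4 only; it is not the project's preprocessing code, and raw_tasks with its start_time/duration_min columns is a hypothetical input table.

    import pandas as pd

    # Hypothetical raw per-task table with Unix arrival times and durations.
    raw_tasks = pd.DataFrame({
        "start_time":   [1_577_836_800, 1_577_837_100, 1_577_840_100],
        "duration_min": [5.0, 42.0, 90.0],
    })

    # Step 1: duration filtering -- drop tasks shorter than 15 minutes.
    tasks = raw_tasks[raw_tasks["duration_min"] >= 15.0].copy()

    # Step 4: interval grouping -- floor each arrival time to its 15-minute UTC bucket.
    tasks["arrival_dt"] = pd.to_datetime(tasks["start_time"], unit="s", utc=True)
    tasks["interval"] = tasks["arrival_dt"].dt.floor("15min")

    for interval_start, group in tasks.groupby("interval"):
        print(interval_start, len(group), "tasks")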

Resource Normalization

During simulation, percentage-based resource requests (cpu_usage, gpu_wrk_util) and memory values are converted into actual resource units. This conversion is implemented in utils/workload_utils.extract_tasks_from_row.
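
The sketch below only illustrates the idea of that conversion; the per-node capacities and the scaling are made-up numbers, not the values used by extract_tasks_from_row.

    # Hypothetical per-node capacities, chosen only for illustration.
    CORES_PER_NODE = 48
    GPUS_PER_NODE = 8

    def normalize_task(cpu_usage_pct: float, gpu_wrk_util_pct: float) -> dict:
        """Convert percentage-based requests into absolute resource units."""
        return {
            "cpu_cores": cpu_usage_pct / 100.0 * CORES_PER_NODE,
            "gpus":      gpu_wrk_util_pct / 100.0 * GPUS_PER_NODE,
        }

    print(normalize_task(cpu_usage_pct=50.0, gpu_wrk_util_pct=25.0))
    # {'cpu_cores': 24.0, 'gpus': 2.0}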

Usage in Simulation

  • The simulation loop reads each 15-minute row from the DataFrame.

  • It queries the embedded tasks_matrix for that interval, converts percentages → units, and enqueues jobs into the cluster model (see the loop sketch below).
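
Schematically, the loop resembles the sketch below; the column indices come from the table above, while the job-dictionary shape and the enqueue step are illustrative rather than the simulator's actual API.

    import pandas as pd

    df = pd.read_pickle("data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl")

    for _, interval_row in df.iterrows():            # one row per 15-minute UTC interval
        pending_jobs = []
        for task in interval_row["tasks_matrix"]:    # one entry per task arriving in this interval
            pending_jobs.append({
                "job_id":       task[0],             # column 1: job_id
                "cpu_usage":    task[5],             # column 6: cpu_usage (%)
                "gpu_wrk_util": task[6],             # column 7: gpu_wrk_util (%)
            })
        # ...convert the percentages to resource units (see above) and
        # enqueue pending_jobs into the cluster model...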

Access & Distribution

  • The full pickle file is shipped inside a ZIP archive in data/workload/alibaba_2020_dataset/.

  • On first simulation run, if result_df_full_year_2020.pkl is missing but the ZIP is present, the code automatically extracts the pickle.

  • To swap in your own workload, place your processed .pkl file (same schema) into the same folder and update the path in your config (a schema-construction sketch follows the config line below):

    DEFAULT_CONFIG["workload_file"] = "data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl"
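
A custom pickle must reproduce the schema described above: one row per 15-minute UTC interval, each carrying a tasks_matrix with the twelve columns in the documented order. The sketch below builds a single-interval example; how the interval timestamp is stored (index vs. column) and the exact dtypes the loader expects are assumptions.

    import numpy as np
    import pandas as pd

    interval_start = pd.Timestamp("2020-01-01 00:00:00", tz="UTC")

    # One task row with the 12 columns in the documented order:
    # job_id, start_time, end_time, start_dt, duration_min, cpu_usage,
    # gpu_wrk_util, avg_mem, avg_gpu_wrk_mem, bandwidth_gb, weekday_name, weekday_num
    tasks_matrix = np.array([
        ["job_0", 1577836800, 1577838600, interval_start.to_pydatetime(), 30.0,
         50.0, 25.0, 16.0, 8.0, 1.2, "Wednesday", 2],
    ], dtype=object)

    custom_df = pd.DataFrame(
        {"tasks_matrix": [tasks_matrix]},
        index=pd.DatetimeIndex([interval_start]),   # assumed: interval start as the index
    )
    custom_df.to_pickle("data/workload/alibaba_2020_dataset/my_custom_workload.pkl")

    # Then point the config at it:
    # DEFAULT_CONFIG["workload_file"] = "data/workload/alibaba_2020_dataset/my_custom_workload.pkl"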