Custom Workload Data¶
By default, Sustain-Cluster includes workload traces from Alibaba and Google data centers. These traces are used to simulate the tasks that the datacenter needs to process, providing a realistic and dynamic workload for benchmarking purposes.
Data Source¶
The default workload traces are extracted from:
Alibaba 2020 GPU Cluster Trace (LINK)
Processed Dataset Format & Content¶
After preprocessing, the Alibaba trace is stored as a Pandas DataFrame in a binary pickle file:
data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl
Each row in this DataFrame represents a 15-minute arrival interval (UTC) and contains:
tasks_matrix (NumPy array of shape N×M): detailed per-task features for all N tasks arriving in that interval. Columns, in order (a short loading example follows this list):

- job_id
- start_time (Unix timestamp)
- end_time (Unix timestamp)
- start_dt (Python datetime)
- duration_min (float)
- cpu_usage (%)
- gpu_wrk_util (%)
- avg_mem (GB)
- avg_gpu_wrk_mem (GB)
- bandwidth_gb
- weekday_name (e.g., “Monday”)
- weekday_num (0 = Monday … 6 = Sunday)
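For orientation, a minimal sketch like the following can load the pickle and label one interval's tasks_matrix with the column names above. It assumes tasks_matrix is stored as a regular DataFrame column; adjust the access pattern if your layout differs:

```python
import pandas as pd

# Column order as documented above for tasks_matrix.
COLUMNS = [
    "job_id", "start_time", "end_time", "start_dt", "duration_min",
    "cpu_usage", "gpu_wrk_util", "avg_mem", "avg_gpu_wrk_mem",
    "bandwidth_gb", "weekday_name", "weekday_num",
]

df = pd.read_pickle("data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl")
first_interval = df.iloc[0]

# Wrap the raw NumPy matrix in a labelled DataFrame for easier inspection.
tasks = pd.DataFrame(first_interval["tasks_matrix"], columns=COLUMNS)
print(tasks[["job_id", "duration_min", "cpu_usage", "gpu_wrk_util"]].head())
```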
Preprocessing Steps¶
To adapt the raw two-month trace for year-long, continuous simulation, we apply:
- Duration filtering: drop all tasks shorter than 15 minutes.
- Temporal extension: replicate and blend daily/weekly patterns to expand the two-month trace into a full year.
- Origin assignment: probabilistically assign each task to a datacenter region based on population weights and local time-of-day activity (see utils/workload_utils.assign_task_origins and the main paper § 7.3 for details).
- Interval grouping: bucket tasks into 15-minute UTC intervals (a pandas sketch of the filtering and grouping steps follows this list).
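For illustration only, the duration filtering and interval grouping steps can be sketched with plain pandas as below; the raw column names and the three example tasks are hypothetical stand-ins, not taken from the actual preprocessing code:

```python
import pandas as pd

# Hypothetical raw task table with Unix-timestamp start/end columns.
raw = pd.DataFrame({
    "job_id": ["a", "b", "c"],
    "start_time": [1_577_836_800, 1_577_837_100, 1_577_840_000],
    "end_time":   [1_577_837_000, 1_577_838_900, 1_577_841_200],
})

# Duration filtering: keep only tasks that run for at least 15 minutes.
raw["duration_min"] = (raw["end_time"] - raw["start_time"]) / 60.0
raw = raw[raw["duration_min"] >= 15]

# Interval grouping: bucket tasks into 15-minute UTC arrival intervals.
raw["arrival_dt"] = pd.to_datetime(raw["start_time"], unit="s", utc=True)
raw["interval"] = raw["arrival_dt"].dt.floor("15min")
for interval, tasks in raw.groupby("interval"):
    print(interval, len(tasks), "tasks")
```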
Resource Normalization¶
During simulation, percentage-based resource requests (cpu_usage, gpu_wrk_util) and memory percentages are converted into actual resource units. This conversion is implemented in utils/workload_utils.extract_tasks_from_row.
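Conceptually, the conversion scales each percentage request by a node capacity. The snippet below is only a schematic stand-in for extract_tasks_from_row; the capacity constants are invented for illustration and do not reflect the simulator's actual hardware model:

```python
# Illustrative node capacities (assumptions, not values from the repository).
CORES_PER_NODE = 96
GPUS_PER_NODE = 8

def to_resource_units(cpu_usage_pct: float, gpu_wrk_util_pct: float) -> tuple[float, float]:
    """Convert percentage-based requests into core / GPU counts."""
    cores = cpu_usage_pct / 100.0 * CORES_PER_NODE
    gpus = gpu_wrk_util_pct / 100.0 * GPUS_PER_NODE
    return cores, gpus

print(to_resource_units(50.0, 25.0))  # -> (48.0, 2.0)
```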
Usage in Simulation¶
The simulation loop reads each 15-minute row from the DataFrame. It queries the embedded tasks_matrix for that interval, converts percentages to resource units, and enqueues the jobs into the cluster model.
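In outline, the loop looks roughly like the sketch below. The real implementation uses extract_tasks_from_row and the cluster model's own scheduling API, which are not reproduced here; this version only iterates and prints:

```python
import pandas as pd

df = pd.read_pickle("data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl")

for timestamp, row in df.iterrows():
    tasks = row["tasks_matrix"]  # per-task features for this 15-minute interval
    print(timestamp, len(tasks), "tasks arriving")
    # ...convert percentage requests to units and enqueue each task into the cluster model
```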
Access & Distribution¶
- The full pickle file is distributed alongside a ZIP archive in data/workload/alibaba_2020_dataset/.
- On the first simulation run, if result_df_full_year_2020.pkl is missing but the ZIP is present, the code automatically extracts the pickle.
- To swap in your own workload, place your processed .pkl file (same schema) into the same folder and update the path in your config (a worked example follows):

DEFAULT_CONFIG["workload_file"] = "data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl"
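As a worked example of the swap-in path, the sketch below builds a one-interval DataFrame following the documented schema and saves it next to the default dataset. The task values, the DatetimeIndex layout, and the file name my_workload.pkl are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# One hypothetical task; the column order must match the tasks_matrix schema above.
one_task = np.array([[
    "job-0001",                                   # job_id
    1_577_836_800, 1_577_838_600,                 # start_time, end_time (Unix)
    pd.Timestamp("2020-01-01 00:00", tz="UTC"),   # start_dt
    30.0,                                         # duration_min
    50.0, 25.0,                                   # cpu_usage (%), gpu_wrk_util (%)
    16.0, 8.0,                                    # avg_mem (GB), avg_gpu_wrk_mem (GB)
    1.2,                                          # bandwidth_gb
    "Wednesday", 2,                               # weekday_name, weekday_num
]], dtype=object)

interval = pd.Timestamp("2020-01-01 00:00", tz="UTC")
custom_df = pd.DataFrame({"tasks_matrix": [one_task]}, index=[interval])
custom_df.to_pickle("data/workload/alibaba_2020_dataset/my_workload.pkl")

# Then point the simulator at the new file, as in the config line above:
# DEFAULT_CONFIG["workload_file"] = "data/workload/alibaba_2020_dataset/my_workload.pkl"
```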