Custom Workload Data

By default, Sustain-Cluster includes workload traces from Alibaba and Google data centers. These traces define the tasks that the simulated datacenters must process, providing a realistic and dynamic workload for benchmarking.

Data Source

The default workload traces are extracted from:

  • Alibaba 2020 GPU Cluster Trace (LINK)

Processed Dataset Format & Content

After preprocessing, the Alibaba trace is stored as a Pandas DataFrame in a binary pickle file:

data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl

Each row in this DataFrame represents a 15-minute arrival interval (UTC) and contains:

  • tasks_matrix (NumPy array of shape N×12, one row per task): detailed per-task features for all tasks arriving in that interval. Columns (in order; see the loading sketch after this list):

    1. job_id

    2. start_time (Unix timestamp)

    3. end_time (Unix timestamp)

    4. start_dt (Python datetime)

    5. duration_min (float)

    6. cpu_usage (%)

    7. gpu_wrk_util (%)

    8. avg_mem (GB)

    9. avg_gpu_wrk_mem (GB)

    10. bandwidth_gb

    11. weekday_name (e.g., “Monday”)

    12. weekday_num (0 = Monday … 6 = Sunday)
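
For orientation, the minimal sketch below loads the file and inspects one interval; it assumes tasks_matrix is stored as an ordinary DataFrame column, which may differ from the actual layout.

    import pandas as pd

    # Load the preprocessed year-long trace (path as documented above).
    df = pd.read_pickle("data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl")

    # Inspect the first 15-minute interval. Accessing tasks_matrix as a regular
    # DataFrame column is an assumption about the layout.
    first_interval = df.iloc[0]
    tasks = first_interval["tasks_matrix"]

    print(tasks.shape)   # (N, 12): one row per task, columns in the order listed above
    print(tasks[0])      # first task: job_id, timestamps, resource usage, ...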

Preprocessing Steps

To adapt the raw two-month trace for year-long, continuous simulation, we apply the following steps (a minimal sketch of the filtering and grouping follows the list):

  1. Duration filtering: drop all tasks shorter than 15 minutes.

  2. Temporal extension: replicate and blend daily/weekly patterns to expand two months → full year.

  3. Origin assignment: probabilistically assign each task to a datacenter region based on population weights and local time-of-day activity. (See utils/workload_utils.assign_task_origins and main paper § 7.3 for details.)

  4. Interval grouping: bucket tasks into 15-minute UTC intervals.
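
The sketch below illustrates steps 1 and 4 only; it is not the project's preprocessing code, and raw_tasks with its start_time/duration_min columns is a hypothetical input table.

    import pandas as pd

    # Hypothetical raw per-task table with Unix arrival times and durations.
    raw_tasks = pd.DataFrame({
        "start_time":   [1_577_836_800, 1_577_837_100, 1_577_840_100],
        "duration_min": [5.0, 42.0, 90.0],
    })

    # Step 1: duration filtering -- drop tasks shorter than 15 minutes.
    tasks = raw_tasks[raw_tasks["duration_min"] >= 15.0].copy()

    # Step 4: interval grouping -- floor each arrival time to its 15-minute UTC bucket.
    tasks["arrival_dt"] = pd.to_datetime(tasks["start_time"], unit="s", utc=True)
    tasks["interval"] = tasks["arrival_dt"].dt.floor("15min")

    for interval_start, group in tasks.groupby("interval"):
        print(interval_start, len(group), "tasks")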

Resource Normalization

During simulation, percentage-based resource requests (cpu_usage, gpu_wrk_util) and memory values are converted into actual resource units. This conversion is implemented in utils/workload_utils.extract_tasks_from_row.
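
The sketch below only illustrates the idea of that conversion; the per-node capacities and the scaling are made-up numbers, not the values used by extract_tasks_from_row.

    # Hypothetical per-node capacities, chosen only for illustration.
    CORES_PER_NODE = 48
    GPUS_PER_NODE = 8

    def normalize_task(cpu_usage_pct: float, gpu_wrk_util_pct: float) -> dict:
        """Convert percentage-based requests into absolute resource units."""
        return {
            "cpu_cores": cpu_usage_pct / 100.0 * CORES_PER_NODE,
            "gpus":      gpu_wrk_util_pct / 100.0 * GPUS_PER_NODE,
        }

    print(normalize_task(cpu_usage_pct=50.0, gpu_wrk_util_pct=25.0))
    # {'cpu_cores': 24.0, 'gpus': 2.0}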

Usage in Simulation

  • The simulation loop reads each 15-minute row from the DataFrame.

  • It queries the embedded tasks_matrix for that interval, converts percentages → units, and enqueues jobs into the cluster model (see the loop sketch below).
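
Schematically, the loop resembles the sketch below; the column indices come from the table above, while the job-dictionary shape and the enqueue step are illustrative rather than the simulator's actual API.

    import pandas as pd

    df = pd.read_pickle("data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl")

    for _, interval_row in df.iterrows():            # one row per 15-minute UTC interval
        pending_jobs = []
        for task in interval_row["tasks_matrix"]:    # one entry per task arriving in this interval
            pending_jobs.append({
                "job_id":       task[0],             # column 1: job_id
                "cpu_usage":    task[5],             # column 6: cpu_usage (%)
                "gpu_wrk_util": task[6],             # column 7: gpu_wrk_util (%)
            })
        # ...convert the percentages to resource units (see above) and
        # enqueue pending_jobs into the cluster model...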

Access & Distribution

  • The full pickle file is shipped inside a ZIP archive in data/workload/alibaba_2020_dataset/.

  • On first simulation run, if result_df_full_year_2020.pkl is missing but the ZIP is present, the code automatically extracts the pickle.

  • To swap in your own workload, place your processed .pkl file (same schema) into the same folder and update the path in your config (a schema-construction sketch follows the config line below):

    DEFAULT_CONFIG["workload_file"] = "data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl"
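
A custom pickle must reproduce the schema described above: one row per 15-minute UTC interval, each carrying a tasks_matrix with the twelve columns in the documented order. The sketch below builds a single-interval example; how the interval timestamp is stored (index vs. column) and the exact dtypes the loader expects are assumptions.

    import numpy as np
    import pandas as pd

    interval_start = pd.Timestamp("2020-01-01 00:00:00", tz="UTC")

    # One task row with the 12 columns in the documented order:
    # job_id, start_time, end_time, start_dt, duration_min, cpu_usage,
    # gpu_wrk_util, avg_mem, avg_gpu_wrk_mem, bandwidth_gb, weekday_name, weekday_num
    tasks_matrix = np.array([
        ["job_0", 1577836800, 1577838600, interval_start.to_pydatetime(), 30.0,
         50.0, 25.0, 16.0, 8.0, 1.2, "Wednesday", 2],
    ], dtype=object)

    custom_df = pd.DataFrame(
        {"tasks_matrix": [tasks_matrix]},
        index=pd.DatetimeIndex([interval_start]),   # assumed: interval start as the index
    )
    custom_df.to_pickle("data/workload/alibaba_2020_dataset/my_custom_workload.pkl")

    # Then point the config at it:
    # DEFAULT_CONFIG["workload_file"] = "data/workload/alibaba_2020_dataset/my_custom_workload.pkl"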