.. _custom-workload-data:

Custom Workload Data
====================

By default, Sustain-Cluster includes workload traces from `Alibaba `_ and `Google `_ data centers. These traces supply the tasks that the datacenter must process, providing a realistic, dynamic workload for benchmarking.

Data Source
-----------

The default workload traces are extracted from:

- **Alibaba 2020 GPU Cluster Trace** (`LINK `_)

Processed Dataset Format & Content
----------------------------------

After preprocessing, the Alibaba trace is stored as a pandas DataFrame in a binary pickle file:

.. code-block:: text

   data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl

Each row in this DataFrame represents a **15-minute arrival interval** (UTC) and contains:

- **tasks_matrix** (NumPy array of shape N×12): per-task features for the N tasks arriving in that interval. Columns (in order):

  1. ``job_id``
  2. ``start_time`` (Unix timestamp)
  3. ``end_time`` (Unix timestamp)
  4. ``start_dt`` (Python datetime)
  5. ``duration_min`` (float)
  6. ``cpu_usage`` (%)
  7. ``gpu_wrk_util`` (%)
  8. ``avg_mem`` (GB)
  9. ``avg_gpu_wrk_mem`` (GB)
  10. ``bandwidth_gb``
  11. ``weekday_name`` (e.g., “Monday”)
  12. ``weekday_num`` (0 = Monday … 6 = Sunday)

A snippet for loading and inspecting this file is given under *Worked Examples* below.

Preprocessing Steps
-------------------

To adapt the raw two-month trace for year-long, continuous simulation, we apply:

1. **Duration filtering:** drop all tasks shorter than 15 minutes.
2. **Temporal extension:** replicate and blend daily/weekly patterns to extend the two-month trace to a full year.
3. **Origin assignment:** probabilistically assign each task to a datacenter region based on population weights and local time-of-day activity. (See :any:`utils/workload_utils.assign_task_origins` and main paper § 7.3 for details.)
4. **Interval grouping:** bucket tasks into 15-minute UTC intervals (steps 1 and 4 are sketched below).

Resource Normalization
----------------------

During simulation, percentage-based resource requests (``cpu_usage``, ``gpu_wrk_util``) and memory percentages are converted into actual resource units; a schematic conversion appears below. This is implemented in :any:`utils/workload_utils.extract_tasks_from_row`.

Usage in Simulation
-------------------

- The simulation loop reads each 15-minute row from the DataFrame.
- It queries the embedded ``tasks_matrix`` for that interval, converts percentages into absolute units, and enqueues the resulting jobs into the cluster model (see the loop sketch below).

Access & Distribution
---------------------

- The full pickle file is distributed alongside a ZIP archive in ``data/workload/alibaba_2020_dataset/``.
- On the first simulation run, if ``result_df_full_year_2020.pkl`` is missing but the ZIP is present, the code automatically extracts the pickle (sketched below).
- To swap in your own workload, place your processed ``.pkl`` file (same schema) into the same folder and update the path in your config:

.. code-block:: python

   DEFAULT_CONFIG["workload_file"] = "data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl"
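Worked Examples
---------------

The snippets below illustrate the sections above. They are minimal sketches, not part of the Sustain-Cluster API; any helper names, class names, or capacity values they introduce are hypothetical and flagged in comments.

**Loading and inspecting the processed DataFrame.** This assumes the per-interval array is stored under a column named ``tasks_matrix`` and follows the 12-column order documented above:

.. code-block:: python

   import pandas as pd

   df = pd.read_pickle(
       "data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl"
   )

   # Column order as documented in "Processed Dataset Format & Content".
   COLUMNS = [
       "job_id", "start_time", "end_time", "start_dt", "duration_min",
       "cpu_usage", "gpu_wrk_util", "avg_mem", "avg_gpu_wrk_mem",
       "bandwidth_gb", "weekday_name", "weekday_num",
   ]

   row = df.iloc[0]                    # first 15-minute UTC interval
   tasks = row["tasks_matrix"]         # NumPy array of shape (N, 12)
   first_task = dict(zip(COLUMNS, tasks[0]))
   print(len(tasks), "tasks; first job_id:", first_task["job_id"])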
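**Duration filtering and interval grouping (preprocessing steps 1 and 4).** A sketch assuming the raw per-task table already uses the documented ``start_time`` and ``duration_min`` columns; the actual pipeline lives in ``utils/workload_utils``:

.. code-block:: python

   import pandas as pd

   def bucket_tasks(raw: pd.DataFrame) -> pd.DataFrame:
       # Step 1: drop all tasks shorter than 15 minutes.
       kept = raw[raw["duration_min"] >= 15.0].copy()

       # Step 4: bucket tasks into 15-minute UTC arrival intervals.
       arrival = pd.to_datetime(kept["start_time"], unit="s", utc=True)
       kept["interval_15m"] = arrival.dt.floor("15min")

       # One row per interval, carrying the per-task feature matrix.
       grouped = kept.groupby("interval_15m").apply(lambda g: g.to_numpy())
       return grouped.rename("tasks_matrix").reset_index()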
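**Converting percentages to resource units.** The real conversion is implemented in :any:`utils/workload_utils.extract_tasks_from_row`; the sketch below only illustrates the idea, with hypothetical node capacities:

.. code-block:: python

   # Hypothetical capacities; real values come from the cluster configuration.
   CPU_CORES_PER_NODE = 96
   GPUS_PER_NODE = 8

   def to_resource_units(cpu_usage_pct: float, gpu_wrk_util_pct: float):
       """Map percentage-based requests to absolute core / GPU counts."""
       cores = cpu_usage_pct / 100.0 * CPU_CORES_PER_NODE
       gpus = gpu_wrk_util_pct / 100.0 * GPUS_PER_NODE
       return cores, gpus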
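**Driving the simulation loop.** A schematic version of the loop described in *Usage in Simulation*, reusing ``df`` and ``to_resource_units`` from the sketches above; the ``Job`` record and the plain list standing in for the cluster queue are hypothetical:

.. code-block:: python

   from dataclasses import dataclass

   @dataclass
   class Job:                        # hypothetical job record
       job_id: object
       cores: float
       gpus: float

   queue = []                        # stand-in for the cluster model's queue
   for _, row in df.iterrows():      # one row per 15-minute UTC interval
       for task in row["tasks_matrix"]:
           # cpu_usage and gpu_wrk_util sit at 0-based indices 5 and 6.
           cores, gpus = to_resource_units(float(task[5]), float(task[6]))
           queue.append(Job(job_id=task[0], cores=cores, gpus=gpus))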
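**ZIP fallback on the first run.** A sketch of the auto-extraction behavior described in *Access & Distribution*; the archive name is an assumption (taken to mirror the pickle's):

.. code-block:: python

   import zipfile
   from pathlib import Path

   pkl = Path("data/workload/alibaba_2020_dataset/result_df_full_year_2020.pkl")
   if not pkl.exists():
       # Assumed archive name; the ZIP ships in the same folder as the pickle.
       with zipfile.ZipFile(pkl.with_suffix(".zip")) as zf:
           zf.extractall(pkl.parent)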