External Input Data

SustainCluster integrates multiple real-world datasets to create realistic and challenging scheduling scenarios that reflect the dynamic nature of global infrastructure and environmental factors.

Summary of datasets:

Dataset             | Source                       | Description
--------------------|------------------------------|------------------------------------------------------------------
AI Workloads        | Alibaba Cluster Trace 2020   | Real-world GPU workload traces from Alibaba’s data centers.
Electricity Prices  | Electricity Maps, GridStatus | Historical electricity prices for various regions.
Carbon Intensity    | Electricity Maps             | Historical carbon intensity data (gCO₂eq/kWh) for various regions.
Weather             | Open-Meteo                   | Historical weather data (temperature, humidity) for cooling proxy.
Transmission Costs  | AWS, GCP, Azure              | Per-GB transfer costs between regions.

4.1 AI Workloads (Alibaba GPU Cluster Trace)

  • Source: We use the Alibaba Cluster Trace 2020, a real-world dataset of GPU jobs from a large production cluster operated by Alibaba PAI. The trace covers two months (July–August 2020) and more than 6 500 GPUs across ~1 800 machines. It contains training and inference jobs that use frameworks such as TensorFlow, PyTorch, and Graph-Learn. The jobs span a wide range of machine learning workloads, and each job may consist of multiple tasks with multiple instances.

  • Preprocessing:

    • Filtering: Remove very short tasks, keeping only those ≥ 15 minutes (typical of substantial training or inference workloads).

    • Temporal Extension: Extend the 2-month trace to cover a full year by replicating observed daily and weekly patterns.

    • Origin Assignment: Assign a probabilistic origin datacenter to each task based on regional population weights and local time-of-day activity boosts (simulating higher generation during business hours). See Section 7.3 for details.

    • Grouping: Aggregate tasks into 15-minute intervals based on their arrival times to align with the simulation timestep.

  • Dataset Format (After Cleaning): The cleaned dataset is saved as a pickled pandas DataFrame (.pkl file) with the following structure:

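    interval_15m      | tasks_matrix
    ------------------|--------------------------------------------------------------------------------------------
    2020-03-01 08:00  | [[job1, tstart, tend, start_dt, duration, cpu, gpu, mem, gpu_mem, bw, day_name, day_num], …]
    2020-03-01 08:15  | [[jobN, …]]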

    Where:

    • interval_15m: The 15-minute time window (UTC) in which the tasks start.

    • tasks_matrix: A NumPy array representing all tasks in that interval. Each task row includes:

      1. job_id: Unique task identifier.
      2. start_time: Start timestamp (Unix).
      3. end_time: End timestamp (Unix).
      4. start_dt: UTC datetime of start.
      5. duration_min: Task duration in minutes.
      6. cpu_usage: CPU request, stored as a percentage of one core (e.g., 600.0 → 6 cores).
      7. gpu_wrk_util: GPU request, stored as a percentage of one GPU (e.g., 50.0 → 0.5 GPUs).
      8. avg_mem: Memory used (GB).
      9. avg_gpu_wrk_mem: GPU memory used (GB).
      10. bandwidth_gb: Estimated input data size (GB).
      11. weekday_name: Day name (e.g., Monday).
      12. weekday_num: Integer from 0 (Monday) to 6 (Sunday).
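
The filtering and grouping steps, and the resulting interval_15m / tasks_matrix layout, can be sketched roughly as follows. This is a minimal illustration and not the repository's preprocessing script: the sample records are made up, only a subset of the task fields is included, and the variable names (raw, kept, grouped) are assumptions.

```python
import pandas as pd

# Hypothetical raw task records; values are made up for illustration.
raw = pd.DataFrame(
    {
        "job_id": ["job1", "job2", "job3", "job4"],
        "start_dt": pd.to_datetime(
            ["2020-07-01 08:02", "2020-07-01 08:10",
             "2020-07-01 08:20", "2020-07-01 08:25"],
            utc=True,
        ),
        "duration_min": [45.0, 5.0, 120.0, 30.0],
        "cpu_usage": [600.0, 100.0, 1200.0, 400.0],    # percent of one core
        "gpu_wrk_util": [50.0, 25.0, 100.0, 200.0],    # percent of one GPU
    }
)

MIN_DURATION_MIN = 15  # threshold from the filtering step

# Filtering: keep only tasks that run for at least 15 minutes.
kept = raw[raw["duration_min"] >= MIN_DURATION_MIN].copy()

# Grouping: floor each start time to the beginning of its 15-minute window.
kept["interval_15m"] = kept["start_dt"].dt.floor("15min")

# Pack the rows of each window into one NumPy array per interval.
task_cols = ["job_id", "start_dt", "duration_min", "cpu_usage", "gpu_wrk_util"]
rows = [
    {"interval_15m": interval, "tasks_matrix": group[task_cols].to_numpy()}
    for interval, group in kept.groupby("interval_15m")
]
grouped = pd.DataFrame(rows)

for _, row in grouped.iterrows():
    print(row["interval_15m"], row["tasks_matrix"].shape)
```

A frame shaped like this could then be written with grouped.to_pickle(...) and read back with pd.read_pickle(...); the actual file name used by the project is not assumed here.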

Resource Normalization

In the original Alibaba dataset, both CPU and GPU requirements are stored as percentages:

  • 600.0 = 6 vCPU cores

  • 50.0 = 0.5 GPUs

We keep this representation in the .pkl file. However, during task extraction and simulation, we normalize these values into actual hardware units using the logic in extract_tasks_from_row() (located in workload_utils.py):

job_name     = task_data[0]
duration     = float(task_data[4])
cores_req    = float(task_data[5]) / 100.0    # Convert percentage to core count
gpu_req      = float(task_data[6]) / 100.0    # Convert percentage to GPU count
mem_req      = float(task_data[7])            # Memory in GB
bandwidth_gb = float(task_data[9])            # Data transfer size in GB (index 8 is GPU memory)

task = Task(job_name, arrival_time, duration,
            cores_req, gpu_req, mem_req, bandwidth_gb)
tasks.append(task)
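
For readers who want to try the normalization outside the simulator, here is a self-contained sketch. The Task dataclass and the normalize_row() helper are hypothetical stand-ins, not the classes from workload_utils.py, and the sample row is invented. Indices follow the 12-field row layout documented above, where bandwidth_gb is the tenth field (index 9).

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for the simulator's Task class."""
    job_name: str
    arrival_time: str
    duration: float      # minutes
    cores_req: float     # CPU cores
    gpu_req: float       # GPUs
    mem_req: float       # GB
    bandwidth_gb: float  # GB


def normalize_row(task_data, arrival_time):
    """Convert one tasks_matrix row into a Task expressed in hardware units."""
    return Task(
        job_name=task_data[0],
        arrival_time=arrival_time,
        duration=float(task_data[4]),
        cores_req=float(task_data[5]) / 100.0,  # percentage -> core count
        gpu_req=float(task_data[6]) / 100.0,    # percentage -> GPU count
        mem_req=float(task_data[7]),
        bandwidth_gb=float(task_data[9]),
    )


# Invented example row in the documented field order.
row = ["job1", 1593590520, 1593593220, "2020-07-01 08:02", 45.0,
       600.0, 50.0, 12.0, 4.0, 1.5, "Wednesday", 2]

task = normalize_row(row, arrival_time="2020-07-01 08:00")
print(task.cores_req, task.gpu_req)  # 6.0 0.5
```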

4.2 Electricity Prices

  • Sources: Real historical electricity price data is collected from various sources, including Electricity Maps, GridStatus.io, and regional ISO APIs (e.g., CAISO, NYISO, ERCOT, OMIE).

  • Coverage: Data covers the years 2020–2024 for over 20 global regions corresponding to the supported datacenter locations.

  • Standardization: Prices are cleaned, converted to a standard format (UTC timestamp, USD/MWh), and aligned with the simulation’s 15-minute intervals. For simulation purposes, prices are typically normalized further (e.g., to USD/kWh).

  • Storage: Data is organized by region and year in CSV files located under data/electricity_prices/standardized/. See data/electricity_prices/README.md for details.
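
As an illustration of the standardization step, the sketch below aligns an hourly price series to 15-minute intervals and converts USD/MWh to USD/kWh. The prices are made up, and forward-filling each hourly value across its four quarter-hours is an assumption; the project's own scripts may align the data differently.

```python
import pandas as pd

# Illustrative hourly prices in USD/MWh (values are made up).
hourly = pd.Series(
    [40.0, 55.0, 32.5],
    index=pd.to_datetime(
        ["2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 02:00"],
        utc=True,
    ),
    name="price_usd_per_mwh",
)

# Align to the 15-minute timestep by repeating each hourly value (assumption).
quarter_hourly = hourly.resample("15min").ffill()

# 1 MWh = 1000 kWh, so divide by 1000 to get USD/kWh.
price_usd_per_kwh = (quarter_hourly / 1000.0).rename("price_usd_per_kwh")

print(price_usd_per_kwh.head(5))
```

Note that resampling only spans the first to the last timestamp, so the final hour contributes a single 15-minute point unless the series is extended first.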

4.3 Carbon Intensity

  • Source: Grid carbon intensity data (grams CO₂-equivalent per kWh) is sourced from the Electricity Maps API.

  • Coverage: Provides historical data (2021–2024) for the supported global regions.

  • Resolution: Data is typically available at hourly or sub-hourly resolution and is aligned with the simulation’s 15-minute timestep.

  • Units: Stored and used as gCO₂eq/kWh.
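
To make the units concrete, here is a small worked example: energy in kWh multiplied by carbon intensity in gCO₂eq/kWh gives grams of CO₂-equivalent. The helper names and all numbers are illustrative assumptions, and repeating each hourly value four times is only one simple way to match the 15-minute timestep.

```python
STEPS_PER_HOUR = 4  # 15-minute simulation timestep


def expand_hourly(hourly_values):
    """Repeat each hourly value once per 15-minute step."""
    return [v for v in hourly_values for _ in range(STEPS_PER_HOUR)]


def emissions_kg(energy_kwh_per_step, intensity_g_per_kwh):
    """Sum per-step emissions (kWh * g/kWh) and convert grams to kilograms."""
    if len(energy_kwh_per_step) != len(intensity_g_per_kwh):
        raise ValueError("both series must cover the same timesteps")
    grams = sum(e * ci for e, ci in zip(energy_kwh_per_step, intensity_g_per_kwh))
    return grams / 1000.0


hourly_ci = [400.0, 300.0]           # gCO2eq/kWh for two hours (made up)
step_ci = expand_hourly(hourly_ci)   # 8 quarter-hour values
energy = [0.5] * len(step_ci)        # 0.5 kWh consumed in every step (made up)

total = emissions_kg(energy, step_ci)
print(total)  # 0.5 * (4*400 + 4*300) g = 1400 g = 1.4 kg
```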

4.4 Weather

  • Source: Historical weather data, primarily air temperature and potentially wet-bulb temperature, is obtained via the Open-Meteo API.

  • Coverage: Data covers the years 2021–2024 for the supported datacenter locations.

  • Usage: Temperature data directly influences the simulated efficiency and energy consumption of datacenter cooling systems (HVAC).
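
As a rough sketch of how hourly air temperature for one location might be requested, the snippet below builds a query URL for Open-Meteo's historical archive endpoint. It only constructs the URL and makes no network call. The helper name and the coordinates are illustrative assumptions, and the project's own download scripts are not shown in this section and may differ.

```python
from urllib.parse import urlencode

# Open-Meteo historical weather endpoint.
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date):
    """Return a query URL for hourly 2 m air temperature (hypothetical helper)."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,   # ISO date, e.g. "2021-01-01"
        "end_date": end_date,
        "hourly": "temperature_2m",
    }
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params)}"


# Illustrative coordinates, not an actual configured datacenter location.
url = build_archive_url(37.37, -121.92, "2021-01-01", "2021-01-31")
print(url)
```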

4.5 Transmission Costs (per-GB)

  • Sources: Inter-region data transfer costs are based on publicly available pricing information from major cloud providers: AWS, GCP, and Azure.

  • Format: We compile this information into matrices representing the cost (in USD) to transfer 1 GB of data between different cloud regions. The specific matrix used depends on the cloud_provider configured in the simulation.

  • Storage: These cost matrices are stored as CSV files within the data/ directory and loaded by the transmission_cost_loader.py utility.
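
The lookup against such a matrix can be sketched as below. This is not the code in transmission_cost_loader.py: the CSV layout (an origin column followed by one column per destination), the region names, the prices, and both helper functions are assumptions made for illustration.

```python
import csv
import io

# Made-up cost matrix in USD per GB; rows are origins, columns are destinations.
CSV_TEXT = """origin,region_a,region_b,region_c
region_a,0.00,0.02,0.09
region_b,0.02,0.00,0.09
region_c,0.09,0.09,0.00
"""


def load_cost_matrix(handle):
    """Parse the CSV into a nested dict: matrix[origin][destination] -> USD/GB."""
    matrix = {}
    for row in csv.DictReader(handle):
        origin = row.pop("origin")
        matrix[origin] = {dest: float(cost) for dest, cost in row.items()}
    return matrix


def transfer_cost_usd(matrix, origin, destination, size_gb):
    """Cost of moving size_gb gigabytes from origin to destination."""
    return matrix[origin][destination] * size_gb


matrix = load_cost_matrix(io.StringIO(CSV_TEXT))
cost = transfer_cost_usd(matrix, "region_a", "region_b", 50.0)
print(cost)  # 0.02 USD/GB * 50 GB
```

In a setup like this, a task's bandwidth_gb value would play the role of size_gb when a task is routed away from its origin datacenter.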

4.6 Dataset Visualizations

To provide insights into the characteristics of the integrated datasets, the repository includes several visualizations (located in assets/figures/ and generated by scripts like plot_alibaba_workload_stats.py).

Workload Characteristics

  • Task Duration Distribution

  • Resource Usage Distributions

  • Task Load Heatmap

  • Hourly CPU Requests

  • Hourly GPU Requests

  • Hourly Memory Requests

  • Task Gantt Chart

Environmental Factors

  • Temperature Trends: Average daily temperature across selected datacenter regions (°C)

  • Carbon Intensity Trends: Average daily carbon intensity across selected datacenter regions (gCO₂eq/kWh)

  • Electricity Price Trends: Average hourly electricity price profile over a typical day (UTC time)

  • Carbon Intensity Daily Variation: Average hourly carbon intensity profile over a typical day (UTC time)