External Input Data¶
SustainCluster integrates multiple real-world datasets to create realistic and challenging scheduling scenarios that reflect the dynamic nature of global infrastructure and environmental factors.
Summary table of datasets:
Dataset |
Source |
Description |
---|---|---|
AI Workloads |
Alibaba Cluster Trace 2020 |
Real-world GPU workload traces from Alibaba’s data centers. |
Electricity Prices |
Electricity Maps, GridStatus |
Real-time electricity prices for various regions. |
Carbon Intensity |
Electricity Maps |
Real-time carbon intensity data (gCO₂eq/kWh) for various regions. |
Weather |
Open-Meteo |
Real-time weather data (temperature, humidity) for cooling proxy. |
Transmission Costs |
AWS, GCP, Azure |
Per-GB transfer costs between regions. |
4.1 AI Workloads (Alibaba GPU Cluster Trace)¶
Source: We use the Alibaba Cluster Trace 2020, a real-world dataset of GPU jobs from a large production cluster operated by Alibaba PAI. It covers two months (July–August 2020), including over 6 500 GPUs across ~1 800 machines. This trace contains training and inference jobs using frameworks like TensorFlow, PyTorch, and Graph-Learn. These jobs span a wide range of machine learning workloads, and each job may consist of multiple tasks with multiple instances.
Preprocessing: * Filtering: Remove very short tasks, keeping only those ≥ 15 minutes (typical of substantial training or inference workloads). * Temporal Extension: Extend the 2-month trace to cover a full year by replicating observed daily and weekly patterns. * Origin Assignment: Assign a probabilistic origin datacenter to each task based on regional population weights and local time-of-day activity boosts (simulating higher generation during business hours). See Section 7.3 for details. * Grouping: Aggregate tasks into 15-minute intervals based on their arrival times to align with the simulation timestep.
Dataset Format (After Cleaning): The cleaned dataset is saved as a Pandas .pkl DataFrame file with the following structure:
interval_15m
tasks_matrix
2020-03-01 08:00
- [[job1, tstart, tend, start_dt, duration, cpu, gpu, mem,
gpu_mem, bw, day_name, day_num], …]
2020-03-01 08:15
[[jobN, …]]
…
…
Where:
interval_15m
: The 15-minute time window (UTC) when the task starts.tasks_matrix
: A NumPy array representing all tasks in that interval. Each task row includes: 1. job_id: Unique task identifier. 2. start_time: Start timestamp (Unix). 3. end_time: End timestamp (Unix). 4. start_dt: UTC datetime of start. 5. duration_min: Task duration in minutes. 6. cpu_usage: Number of CPU cores requested (e.g., 600.0 → 6 cores). 7. gpu_wrk_util: Number of GPUs requested (e.g., 50.0 → 0.5 GPUs). 8. avg_mem: Memory used (GB). 9. avg_gpu_wrk_mem: GPU memory used (GB). 10. bandwidth_gb: Estimated input data size (GB). 11. weekday_name: Day name (e.g., Monday). 12. weekday_num: Integer from 0 (Monday) to 6 (Sunday).
Resource Normalization
In the original Alibaba dataset, both CPU and GPU requirements are stored as percentages:
600.0 = 6 vCPU cores
50.0 = 0.5 GPUs
We keep this representation in the .pkl file. However, during task extraction and simulation, we normalize these values into actual hardware units using the logic in extract_tasks_from_row() (located in workload_utils.py):
job_name = task_data[0]
duration = float(task_data[4])
cores_req = float(task_data[5]) / 100.0 # Convert percentage to core count
gpu_req = float(task_data[6]) / 100.0 # Convert percentage to GPU count
mem_req = float(task_data[7]) # Memory in GB
bandwidth_gb = float(task_data[8]) # Data transfer size in GB
task = Task(job_name, arrival_time, duration,
cores_req, gpu_req, mem_req, bandwidth_gb)
tasks.append(task)
4.2 Electricity Prices¶
Sources: Real historical electricity price data is collected from various sources, including Electricity Maps, GridStatus.io, and regional ISO APIs (e.g., CAISO, NYISO, ERCOT, OMIE).
Coverage: Data covers the years 2020–2024 for over 20 global regions corresponding to the supported datacenter locations.
Standardization: Prices are cleaned, converted to a standard format (UTC timestamp, USD/MWh), and aligned with the simulation’s 15-minute intervals. For simulation purposes, prices are typically normalized further (e.g., to USD/kWh).
Storage: Data is organized by region and year in CSV files located under data/electricity_prices/standardized/. See data/electricity_prices/README.md for details.
4.3 Carbon Intensity¶
Source: Grid carbon intensity data (grams CO₂-equivalent per kWh) is sourced from the Electricity Maps API.
Coverage: Provides historical data (2021–2024) for the supported global regions.
Resolution: Data is typically available at hourly or sub-hourly resolution and is aligned with the simulation’s 15-minute timestep.
Units: Stored and used as gCO₂eq/kWh.
4.4 Weather¶
Source: Historical weather data, primarily air temperature and potentially wet-bulb temperature, is obtained via the Open-Meteo API.
Coverage: Data covers the years 2021–2024 for the supported datacenter locations.
Usage: Temperature data directly influences the simulated efficiency and energy consumption of datacenter cooling systems (HVAC).
4.5 Transmission Costs (per-GB)¶
Sources: Inter-region data transfer costs are based on publicly available pricing information from major cloud providers: AWS, GCP, and Azure.
Format: We compile this information into matrices representing the cost (in USD) to transfer 1 GB of data between different cloud regions. The specific matrix used depends on the cloud_provider configured in the simulation.
Storage: These cost matrices are stored as CSV files within the data/ directory and loaded by the transmission_cost_loader.py utility.
4.6 Dataset Visualizations¶
To provide insights into the characteristics of the integrated datasets, the repository includes several visualizations (located in assets/figures/ and generated by scripts like plot_alibaba_workload_stats.py).
Workload Characteristics
Task Duration Distribution¶
Resource Usage Distributions¶
Task Load Heatmap¶
Hourly CPU Requests¶
Hourly GPU Requests¶
Hourly Memory Requests¶
Task Gantt Chart¶
Environmental Factors
Average daily temperature across selected datacenter regions (°C)¶
Average daily carbon intensity across selected datacenter regions (gCO₂eq/kWh)¶
Average hourly electricity price profile over a typical day (UTC time)¶
Average hourly carbon intensity profile over a typical day (UTC time)¶