Getting started with CMF

Purpose and Scope

This document gives an overview of the Common Metadata Framework (CMF), a system for collecting, storing, and querying metadata associated with Machine Learning (ML) pipelines. CMF takes a data-first approach: all artifacts (datasets, ML models, and performance metrics) are versioned and identified by their content hash, which enables distributed metadata tracking and collaboration across ML teams.

For detailed API documentation, see Core Library (cmflib). For deployment instructions, see Installation & Setup. For web user interface details, see CMF GUI.

System Architecture

CMF is designed as a distributed system that enables ML teams to track pipeline metadata locally and synchronize with a central server. The framework automatically tracks code versions, data artifacts, and execution metadata to provide end-to-end traceability of ML experiments.

Common Metadata Framework (CMF) has the following components:

  • cmflib: A Python library that captures and tracks metadata throughout your ML pipeline, including datasets, models, and metrics. It provides APIs for both logging metadata during execution and querying it later for analysis.
  • CMF Client: A command-line tool that synchronizes metadata with the CMF Server, manages artifact transfers to and from storage repositories, and integrates with Git for version control.
  • CMF Server with GUI: A centralized server that aggregates metadata from multiple clients and provides a web-based graphical interface for visualizing pipeline executions, artifacts, and lineage relationships, enabling teams to collaborate effectively.
  • Central Artifact Repositories: Storage backends (such as AWS S3, MinIO, or SSH-based storage) that host your datasets, models, and other pipeline artifacts.

System Interaction Flow

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#f5f5f5','primaryTextColor':'#37474f','primaryBorderColor':'#90a4ae','lineColor':'#78909c','fontSize':'14px','fontFamily':'system-ui, -apple-system, sans-serif'}}}%%
flowchart TB
    WEBUSER([Web Users & ML Teams])
    CMFCLIENT([CMF Client CLI])

    UI[Web Interface]
    SERVERBOX[CMF Server]

    DB[(Metadata Store)]
    ARTIFACTS[Artifact Repositories<br/><i>local / S3 / MinIO / SSH</i>]

    WEBUSER -->|Access| UI
    CMFCLIENT -->|Push Metadata| SERVERBOX
    SERVERBOX -->|Pull Metadata| CMFCLIENT

    UI -->|Request Data| SERVERBOX
    SERVERBOX -->|Response| UI

    SERVERBOX -->|Query & Store| DB
    DB -->|Query & Store| SERVERBOX

    CMFCLIENT -->|Push Artifacts| ARTIFACTS
    ARTIFACTS -->|Pull Artifacts| CMFCLIENT

    style WEBUSER fill:#e8eaf6,stroke:#5c6bc0,stroke-width:2px,color:#37474f
    style CMFCLIENT fill:#e0f2f1,stroke:#26a69a,stroke-width:2px,color:#37474f
    style UI fill:#f3e5f5,stroke:#ab47bc,stroke-width:2px,color:#37474f
    style SERVERBOX fill:#e8f5e9,stroke:#66bb6a,stroke-width:2.5px,color:#37474f
    style DB fill:#fce4ec,stroke:#ec407a,stroke-width:2px,color:#37474f
    style ARTIFACTS fill:#fff9c4,stroke:#ffca28,stroke-width:2px,color:#37474f

    linkStyle default stroke:#78909c,stroke-width:2px

Core Abstractions

CMF uses three primary abstractions to model ML pipeline metadata:

| Abstraction | Purpose | Implementation |
|-------------|---------|----------------|
| Pipeline | Groups related stages and executions | Identified by name in the cmflib.cmf.Cmf constructor |
| Context | Represents a stage type (e.g., "train", "test") | Created via the create_context() method |
| Execution | Represents a specific run of a stage | Created via the create_execution() method |

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#e3f2fd','primaryTextColor':'#546e7a','primaryBorderColor':'#90caf9','lineColor':'#cfd8dc','secondaryColor':'#f3e5f5','tertiaryColor':'#e8f5e9','fontSize':'13px','fontFamily':'system-ui, -apple-system, sans-serif'}}}%%
flowchart LR
    PIPELINE([Pipeline<br/>'mnist_experiment'])
    CONTEXT1([Context<br/>'download'])
    CONTEXT2([Context<br/>'train'])
    CONTEXT3([Context<br/>'test'])

    EXEC1[/Execution<br/>'download_data'/]
    EXEC2[/Execution<br/>'train_model'/]
    EXEC3[/Execution<br/>'evaluate_model'/]

    DATASET1[(Dataset<br/>'raw_data.csv')]
    MODEL1[(Model<br/>'trained_model.pkl')]
    METRICS1[(Metrics<br/>'accuracy: 0.95')]

    PIPELINE -.-> CONTEXT1
    PIPELINE -.-> CONTEXT2
    PIPELINE -.-> CONTEXT3

    CONTEXT1 -.-> EXEC1
    CONTEXT2 -.-> EXEC2
    CONTEXT3 -.-> EXEC3

    EXEC1 -.-> DATASET1
    EXEC2 -.-> MODEL1
    EXEC3 -.-> METRICS1

    style PIPELINE fill:#e3f2fd,stroke:#90caf9,stroke-width:2px,color:#546e7a
    style CONTEXT1 fill:#fff8e1,stroke:#ffcc80,stroke-width:2px,color:#d84315
    style CONTEXT2 fill:#fff8e1,stroke:#ffcc80,stroke-width:2px,color:#d84315
    style CONTEXT3 fill:#fff8e1,stroke:#ffcc80,stroke-width:2px,color:#d84315
    style EXEC1 fill:#f3e5f5,stroke:#ce93d8,stroke-width:2px,color:#7b1fa2
    style EXEC2 fill:#f3e5f5,stroke:#ce93d8,stroke-width:2px,color:#7b1fa2
    style EXEC3 fill:#f3e5f5,stroke:#ce93d8,stroke-width:2px,color:#7b1fa2
    style DATASET1 fill:#e8f5e9,stroke:#a5d6a7,stroke-width:2px,color:#388e3c
    style MODEL1 fill:#e8f5e9,stroke:#a5d6a7,stroke-width:2px,color:#388e3c
    style METRICS1 fill:#e8f5e9,stroke:#a5d6a7,stroke-width:2px,color:#388e3c

    linkStyle default stroke:#cfd8dc,stroke-width:1.5px,color:#cfd8dc,fill:none
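
The hierarchy in the diagram above can be sketched as a small data model. This is a toy illustration only: the `Pipeline`, `Context`, and `Execution` classes and the `record` helper below are hypothetical names invented for this sketch and are not cmflib classes. In cmflib itself these objects are created through the `Cmf` constructor, `create_context()`, and `create_execution()`.

```python
from dataclasses import dataclass, field


@dataclass
class Execution:
    """One concrete run of a stage, with the artifacts it produced."""
    name: str
    outputs: list[str] = field(default_factory=list)


@dataclass
class Context:
    """A stage type such as 'train' or 'test'."""
    name: str
    executions: list[Execution] = field(default_factory=list)


@dataclass
class Pipeline:
    """Groups the stages (contexts) of one experiment."""
    name: str
    contexts: dict[str, Context] = field(default_factory=dict)

    def record(self, stage: str, execution: str, outputs: list[str]) -> Execution:
        """Attach a new execution to a stage, creating the stage if needed."""
        context = self.contexts.setdefault(stage, Context(stage))
        run = Execution(execution, list(outputs))
        context.executions.append(run)
        return run


# Rebuild the example from the diagram (hypothetical helper, not the cmflib API).
pipeline = Pipeline("mnist_experiment")
pipeline.record("download", "download_data", ["raw_data.csv"])
pipeline.record("train", "train_model", ["trained_model.pkl"])
pipeline.record("test", "evaluate_model", ["accuracy: 0.95"])

for context in pipeline.contexts.values():
    for run in context.executions:
        print(f"{pipeline.name} / {context.name} / {run.name} -> {run.outputs}")
```

A context can hold many executions, which is why re-running the "train" stage adds a second `Execution` under the same `Context` rather than a new stage.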

Key Features

Distributed Metadata Tracking

CMF enables distributed teams to work independently while maintaining consistent metadata through content-addressable artifacts and Git-like synchronization:

  • Local Development: Each developer records metadata in a local ML Metadata (MLMD) database
  • Content Hashing: All artifacts are identified by their content hash, so the same artifact has the same identifier on every machine
  • Synchronization: The cmf metadata push and cmf metadata pull commands synchronize the local metadata with the central server
  • Artifact Storage: Support for MinIO, Amazon S3, SSH, and local storage backends
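
To illustrate why content hashing makes independent work mergeable, here is a minimal sketch. It rests on several assumptions: SHA-256 is chosen only for illustration (CMF delegates artifact hashing to DVC), plain dictionaries stand in for the local and central metadata stores, and the `content_id` and `push` functions are hypothetical helpers, not part of the CMF client.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a hex digest that depends only on the bytes, not on a path or host."""
    return hashlib.sha256(data).hexdigest()


def push(local: dict[str, dict], central: dict[str, dict]) -> int:
    """Copy records the central store lacks; return how many were added."""
    added = 0
    for artifact_id, record in local.items():
        if artifact_id not in central:
            central[artifact_id] = record
            added += 1
    return added


# Two developers register the same file under different local paths.
raw = b"label,pixel0,pixel1\n7,0,255\n"
alice = {content_id(raw): {"name": "data/raw_data.csv"}}
bob = {
    content_id(raw): {"name": "/tmp/copy_of_raw.csv"},
    content_id(b"model weights"): {"name": "trained_model.pkl"},
}

central: dict[str, dict] = {}
print(push(alice, central))  # 1: the dataset is new to the central store
print(push(bob, central))    # 1: only the model is new, the dataset is already known
print(len(central))          # 2
```

Because the identifier is derived from the bytes, the two developers do not need to agree on file names or locations for the central store to recognize that they used the same dataset.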

Automatic Version Tracking

CMF automatically captures:

  • Code Version: Git commit IDs for reproducibility
  • Data Version: DVC-managed artifact content hashes
  • Environment: Execution parameters and custom properties
  • Lineage: Input/output relationships between executions
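
Recording the inputs and outputs of every execution is what makes lineage recoverable later. The sketch below shows the idea with a plain dictionary. The `EXECUTIONS` table and the `upstream` function are hypothetical and exist only for this example; in CMF the relationships are stored in the metadata store and queried through cmflib.

```python
# Each execution lists the artifacts it read and the artifacts it wrote.
EXECUTIONS = {
    "download_data": {"inputs": [], "outputs": ["raw_data.csv"]},
    "train_model": {"inputs": ["raw_data.csv"], "outputs": ["trained_model.pkl"]},
    "evaluate_model": {"inputs": ["trained_model.pkl"], "outputs": ["metrics"]},
}


def upstream(artifact: str, executions: dict[str, dict]) -> list[str]:
    """Return the executions that contributed to an artifact, earliest first.

    Assumes the input/output graph is acyclic and that each artifact
    has at most one producing execution.
    """
    producers = {
        out: name for name, io in executions.items() for out in io["outputs"]
    }
    ordered: list[str] = []

    def visit(current: str) -> None:
        producer = producers.get(current)
        if producer is None or producer in ordered:
            return  # source artifact with no producer, or already visited
        for parent in executions[producer]["inputs"]:
            visit(parent)
        ordered.append(producer)

    visit(artifact)
    return ordered


print(upstream("metrics", EXECUTIONS))
# ['download_data', 'train_model', 'evaluate_model']
```

Walking the graph from an artifact back to its sources answers the reproducibility question "which runs, in which order, produced this result?".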

Query and Visualization

The system provides multiple interfaces for exploring metadata:

  • Programmatic: CmfQuery class for custom queries
  • Web UI: React-based interface for browsing artifacts and executions
  • Lineage Graphs: D3.js visualizations showing data flow between pipeline stages
  • TensorBoard Integration: Training metrics visualization