Getting started with cmf

Purpose and Scope

This document gives an overview of the Common Metadata Framework (CMF), a system for collecting, storing, and querying metadata associated with machine learning (ML) pipelines. CMF takes a data-first approach: every artifact (dataset, ML model, or performance metric) is versioned and identified by its content hash, which enables distributed metadata tracking and collaboration across ML teams.

For detailed API documentation, see Core Library (cmflib). For server deployment instructions, see Installation & Setup. For web user interface details, see cmf-gui.

System Architecture

CMF is designed as a distributed system that enables ML teams to track pipeline metadata locally and synchronize with a central server. The framework automatically tracks code versions, data artifacts, and execution metadata to provide end-to-end traceability of ML experiments.

Common Metadata Framework (CMF) has the following components:

Metadata Library: Exposes APIs to track pipeline metadata and to query the stored metadata.
cmf-client: Interacts with the cmf-server to pull or push metadata.
cmf-server with GUI: Interacts with remote cmf-clients and merges the metadata transferred by each client. The server also provides a GUI that renders the stored metadata.
Central Artifact Repositories: Host the code and data.

graph TB
    subgraph "Local Development Environment"
        CMF_CLIENT["**Metadata Library**<br/>cmflib.cmf.Cmf<br/>Main API Class"]
        CLI_TOOLS["**cmf-client**<br/>CLI Commands<br/>cmf init, push, pull"]
        LOCAL_MLMD[("Local MLMD<br/>SQLite Database")]
        DVC_GIT["DVC + Git<br/>Artifact Versioning"]
        NEO4J[("Neo4j<br/>Graph Database")]
    end

    subgraph "Central Infrastructure"
        CMF_SERVER["**cmf-server**<br/>FastAPI Application"]
        CENTRAL_MLMD[("PostgreSQL<br/>Central Metadata")]
        ARTIFACT_STORAGE[("Artifact Storage<br/>MinIO/S3/SSH")]
    end

    subgraph "Web Interface"
        REACT_UI["React Application<br/>Port 3000"]
        LINEAGE_VIZ["D3.js Lineage<br/>Visualization"]
        TENSORBOARD["TensorBoard<br/>Port 6006"]
    end

    CMF_CLIENT --> LOCAL_MLMD
    CMF_CLIENT --> DVC_GIT
    CMF_CLIENT --> NEO4J
    CLI_TOOLS --> CMF_SERVER
    CMF_SERVER --> CENTRAL_MLMD
    DVC_GIT --> ARTIFACT_STORAGE
    REACT_UI --> CMF_SERVER
    REACT_UI --> LINEAGE_VIZ
    CMF_SERVER --> TENSORBOARD
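
The push path in the diagram, where the server merges metadata arriving from several clients, can be sketched as follows. This is a minimal illustration and not cmf-server's actual merge logic: the `ArtifactRecord` type, the `merge_push` function, and the in-memory dictionary standing in for the central store are all hypothetical, and the sketch assumes that records are keyed by content hash.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactRecord:
    """Hypothetical metadata record for one artifact, keyed by content hash."""
    content_hash: str
    name: str
    properties: dict = field(default_factory=dict)


def merge_push(central: dict[str, ArtifactRecord], pushed: list[ArtifactRecord]) -> int:
    """Merge records pushed by one client; return how many were new to the store."""
    added = 0
    for record in pushed:
        existing = central.get(record.content_hash)
        if existing is None:
            # Store a copy so later merges never mutate the client's own record.
            central[record.content_hash] = ArtifactRecord(
                record.content_hash, record.name, dict(record.properties)
            )
            added += 1
        else:
            # Same content already known: keep one record, add any missing properties.
            for key, value in record.properties.items():
                existing.properties.setdefault(key, value)
    return added


central_store: dict[str, ArtifactRecord] = {}
alice = [ArtifactRecord("abc123", "raw_data.csv", {"owner": "alice"})]
bob = [
    ArtifactRecord("abc123", "raw_data.csv", {"rows": 1000}),
    ArtifactRecord("def456", "trained_model.pkl"),
]
new_from_alice = merge_push(central_store, alice)
new_from_bob = merge_push(central_store, bob)
```

Because both clients report the same hash for `raw_data.csv`, the central store ends up with one record for it rather than two, which is the property that makes independent local tracking mergeable.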

Core Abstractions

CMF uses three primary abstractions to model ML pipeline metadata:

| Abstraction | Purpose | Implementation |
| --- | --- | --- |
| Pipeline | Groups related stages and executions | Identified by name in the `cmflib.cmf.Cmf` constructor |
| Context | Represents a stage type (e.g., "train", "test") | Created via the `create_context()` method |
| Execution | Represents a specific run of a stage | Created via the `create_execution()` method |

graph LR
    PIPELINE["Pipeline<br/>'mnist_experiment'"] --> CONTEXT1["Context<br/>'download'"]
    PIPELINE --> CONTEXT2["Context<br/>'train'"]
    PIPELINE --> CONTEXT3["Context<br/>'test'"]

    CONTEXT1 --> EXEC1["Execution<br/>'download_data'"]
    CONTEXT2 --> EXEC2["Execution<br/>'train_model'"]
    CONTEXT3 --> EXEC3["Execution<br/>'evaluate_model'"]

    EXEC1 --> DATASET1["Dataset<br/>'raw_data.csv'"]
    EXEC2 --> MODEL1["Model<br/>'trained_model.pkl'"]
    EXEC3 --> METRICS1["Metrics<br/>'accuracy: 0.95'"]
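
To make the hierarchy concrete, the sketch below models the three abstractions with plain Python dataclasses and rebuilds the `mnist_experiment` example from the diagram. It is only a mental model: the `Pipeline`, `Context`, and `Execution` classes and the `stage` helper are hypothetical names chosen for illustration and are not cmflib's classes or storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Execution:
    """One specific run of a stage, with the artifacts it produced."""
    name: str
    outputs: list[str] = field(default_factory=list)


@dataclass
class Context:
    """A stage type such as "train" or "test"."""
    name: str
    executions: list[Execution] = field(default_factory=list)


@dataclass
class Pipeline:
    """Groups the stages (contexts) that belong to one experiment."""
    name: str
    contexts: dict[str, Context] = field(default_factory=dict)

    def stage(self, name: str) -> Context:
        # Reuse the context if this stage was already registered.
        return self.contexts.setdefault(name, Context(name))


pipeline = Pipeline("mnist_experiment")
pipeline.stage("download").executions.append(Execution("download_data", ["raw_data.csv"]))
pipeline.stage("train").executions.append(Execution("train_model", ["trained_model.pkl"]))
pipeline.stage("test").executions.append(Execution("evaluate_model", ["accuracy: 0.95"]))

for context in pipeline.contexts.values():
    for execution in context.executions:
        print(f"{pipeline.name} / {context.name} / {execution.name} -> {execution.outputs}")
```

Running a stage a second time would append another `Execution` to the same `Context`, which reflects the distinction in the table between a stage type and a specific run of it.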

Component Architecture

CMF Library (`cmflib`)

The `cmflib` package provides the primary API for metadata tracking through the `Cmf` class and its supporting modules:

graph TB
    subgraph "cmflib Package"
        CMF_CLASS["cmf.Cmf<br/>Main API Class"]
        METADATA_HELPER["metadata_helper.py<br/>MLMD Integration"]
        CMF_MERGER["cmf_merger.py<br/>Push/Pull Operations"]
        CMFQUERY["cmfquery.py<br/>Query Interface"]
        DATASLICE["dataslice.py<br/>Data Subset Tracking"]
    end

    subgraph "External Dependencies"
        MLMD[("ML Metadata<br/>SQLite/PostgreSQL")]
        DVC_SYSTEM["DVC<br/>Data Version Control"]
        GIT_SYSTEM["Git<br/>Code Version Control"]
        NEO4J_DB[("Neo4j<br/>Graph Database")]
    end

    CMF_CLASS --> METADATA_HELPER
    CMF_CLASS --> CMF_MERGER
    CMF_CLASS --> DATASLICE
    METADATA_HELPER --> MLMD
    CMF_CLASS --> DVC_SYSTEM
    CMF_CLASS --> GIT_SYSTEM
    CMF_CLASS --> NEO4J_DB
    CMF_MERGER --> CMFQUERY

Server and Web Components

The CMF server provides centralized metadata storage and a web interface for exploring ML pipeline lineage:

graph TB
    subgraph "cmf-server"
        FASTAPI_SERVER["FastAPI Server<br/>Port 8080"]
        GET_DATA["get_data.py<br/>Data Access Layer"]
        LINEAGE_QUERY["Lineage Query<br/>D3 Visualization"]
    end

    subgraph "UI Components"
        REACT_APP["React Application<br/>ui/ directory"]
        ARTIFACTS_PAGE["Artifacts Page<br/>Browse Datasets/Models"]
        EXECUTIONS_PAGE["Executions Page<br/>Browse Pipeline Runs"]
        LINEAGE_PAGE["Lineage Visualization<br/>D3.js Graphs"]
    end

    subgraph "Storage Layer"
        POSTGRES[("PostgreSQL<br/>Central MLMD")]
        TENSORBOARD_LOGS[("TensorBoard Logs<br/>Training Metrics")]
    end

    FASTAPI_SERVER --> GET_DATA
    FASTAPI_SERVER --> LINEAGE_QUERY
    REACT_APP --> FASTAPI_SERVER
    REACT_APP --> ARTIFACTS_PAGE
    REACT_APP --> EXECUTIONS_PAGE
    REACT_APP --> LINEAGE_PAGE
    GET_DATA --> POSTGRES
    FASTAPI_SERVER --> TENSORBOARD_LOGS

Key Features

Distributed Metadata Tracking

CMF enables distributed teams to work independently while maintaining consistent metadata through content-addressable artifacts and Git-like synchronization:

Local Development: Each developer records metadata in a local MLMD database
Content Hashing: Every artifact is identified by its content hash, so identical content maps to the same identifier on every machine
Synchronization: The `cmf metadata push` and `cmf metadata pull` commands synchronize local metadata with the central server
Artifact Storage: MinIO, Amazon S3, SSH, and local storage backends are supported
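
Content-based identification can be illustrated with a few lines of standard-library Python. The sketch below is not CMF's or DVC's implementation; `content_id` is a hypothetical helper, and MD5 is used as the default only on the assumption that it mirrors DVC's default file hash. The point it shows is that the identifier depends on the bytes alone, not on the file name or location.

```python
import hashlib
import tempfile
from pathlib import Path


def content_id(path: Path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the file's bytes, reading the file in chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    first = Path(workdir) / "raw_data.csv"
    copy = Path(workdir) / "renamed_copy.csv"
    first.write_bytes(b"hello")
    copy.write_bytes(b"hello")
    first_id = content_id(first)
    copy_id = content_id(copy)

# Two files with different names but identical bytes share one identifier.
same_content = first_id == copy_id
```

Reading in chunks keeps memory use flat for large datasets; the `chunk_size` value is an arbitrary choice for this sketch.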

Automatic Version Tracking

CMF automatically captures:

Code Version: Git commit IDs for reproducibility
Data Version: DVC-managed artifact content hashes
Environment: Execution parameters and custom properties
Lineage: Input/output relationships between executions
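
As a rough illustration of what capturing the code version and environment involves, the sketch below reads the current Git commit ID with `git rev-parse HEAD` and bundles it with the execution parameters. The `current_git_commit` and `describe_run` functions and the shape of the returned record are hypothetical; cmflib performs this capture internally, and its own mechanism is not shown in this document.

```python
import subprocess
from typing import Optional


def current_git_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the HEAD commit ID of repo_dir, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or repo_dir is not inside a repository with commits.
        return None
    return result.stdout.strip()


def describe_run(params: dict, repo_dir: str = ".") -> dict:
    """Bundle the code version and execution parameters into one record."""
    return {"git_commit": current_git_commit(repo_dir), "params": dict(params)}


run_record = describe_run({"learning_rate": 0.01, "epochs": 5})
```

Returning `None` instead of raising keeps the sketch usable outside a Git repository; a real tracker might prefer to fail loudly, since a missing commit ID weakens reproducibility.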

Query and Visualization

The system provides multiple interfaces for exploring metadata:

Programmatic: CmfQuery class for custom queries
Web UI: React-based interface for browsing artifacts and executions
Lineage Graphs: D3.js visualizations showing data flow between pipeline stages
TensorBoard Integration: Training metrics visualization
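
The kind of question a lineage graph answers, "which artifacts was this one derived from?", can be sketched as a breadth-first walk over input/output relationships. The example below does not use `CmfQuery`; the `EXECUTIONS` dictionary and the `upstream_artifacts` function are hypothetical stand-ins for metadata that CMF would store in MLMD.

```python
from collections import deque

# Hypothetical execution records: each lists the artifacts it read and wrote.
EXECUTIONS = {
    "download_data": {"inputs": [], "outputs": ["raw_data.csv"]},
    "train_model": {"inputs": ["raw_data.csv"], "outputs": ["trained_model.pkl"]},
    "evaluate_model": {"inputs": ["trained_model.pkl"], "outputs": ["metrics"]},
}


def upstream_artifacts(artifact: str, executions: dict) -> list[str]:
    """Return every artifact the given artifact was derived from, nearest first."""
    # Map each artifact to the execution that produced it.
    producer = {
        out: name for name, record in executions.items() for out in record["outputs"]
    }
    seen, order = set(), []
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        execution = producer.get(current)
        if execution is None:
            continue  # A source artifact with no recorded producer.
        for parent in executions[execution]["inputs"]:
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = upstream_artifacts("metrics", EXECUTIONS)
```

The `seen` set ensures each artifact is visited once, so the walk terminates even if the recorded relationships contain a cycle.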