Getting started with cmf
Purpose and Scope
This document provides a comprehensive overview of the Common Metadata Framework (CMF), which implements a system for collecting, storing, and querying metadata associated with Machine Learning (ML) pipelines. CMF adopts a data-first approach where all artifacts (datasets, ML models, and performance metrics) are versioned and identified by their content hash, enabling distributed metadata tracking and collaboration across ML teams.
For detailed API documentation, see Core Library (cmflib). For server deployment instructions, see Installation & Setup. For web user interface details, see cmf-gui.
System Architecture
CMF is designed as a distributed system that enables ML teams to track pipeline metadata locally and synchronize with a central server. The framework automatically tracks code versions, data artifacts, and execution metadata to provide end-to-end traceability of ML experiments.
Common Metadata Framework (CMF) has the following components:
- Metadata Library exposes APIs to track pipeline metadata. It also provides APIs to query the stored metadata.
- cmf-client interacts with the cmf-server to pull or push metadata.
- cmf-server with GUI interacts with remote cmf-clients and merges the metadata transferred by each client. This server also provides a GUI that can render the stored metadata.
- Central Artifact Repositories host the code and data.
 
graph TB
    subgraph "Local Development Environment"
        CMF_CLIENT["**Metadata Library**<br/>cmflib.cmf.Cmf<br/>Main API Class"]
        CLI_TOOLS["**cmf-client**<br/>CLI Commands<br/>cmf init, push, pull"]
        LOCAL_MLMD[("Local MLMD<br/>SQLite Database")]
        DVC_GIT["DVC + Git<br/>Artifact Versioning"]
        NEO4J[("Neo4j<br/>Graph Database")]
    end
    subgraph "Central Infrastructure"
        CMF_SERVER["**cmf-server**<br/>FastAPI Application"]
        CENTRAL_MLMD[("PostgreSQL<br/>Central Metadata")]
        ARTIFACT_STORAGE[("Artifact Storage<br/>MinIO/S3/SSH")]
    end
    subgraph "Web Interface"
        REACT_UI["React Application<br/>Port 3000"]
        LINEAGE_VIZ["D3.js Lineage<br/>Visualization"]
        TENSORBOARD["TensorBoard<br/>Port 6006"]
    end
    CMF_CLIENT --> LOCAL_MLMD
    CMF_CLIENT --> DVC_GIT
    CMF_CLIENT --> NEO4J
    CLI_TOOLS --> CMF_SERVER
    CMF_SERVER --> CENTRAL_MLMD
    DVC_GIT --> ARTIFACT_STORAGE
    REACT_UI --> CMF_SERVER
    REACT_UI --> LINEAGE_VIZ
    CMF_SERVER --> TENSORBOARD
Core Abstractions
CMF uses three primary abstractions to model ML pipeline metadata:
| Abstraction | Purpose | Implementation | 
|---|---|---|
| Pipeline | Groups related stages and executions | Identified by name in cmflib.cmf.Cmf constructor | 
| Context | Represents a stage type (e.g., "train", "test") | Created via create_context() method | 
| Execution | Represents a specific run of a stage | Created via create_execution() method | 
graph LR
    PIPELINE["Pipeline<br/>'mnist_experiment'"] --> CONTEXT1["Context<br/>'download'"]
    PIPELINE --> CONTEXT2["Context<br/>'train'"]
    PIPELINE --> CONTEXT3["Context<br/>'test'"]
    CONTEXT1 --> EXEC1["Execution<br/>'download_data'"]
    CONTEXT2 --> EXEC2["Execution<br/>'train_model'"]
    CONTEXT3 --> EXEC3["Execution<br/>'evaluate_model'"]
    EXEC1 --> DATASET1["Dataset<br/>'raw_data.csv'"]
    EXEC2 --> MODEL1["Model<br/>'trained_model.pkl'"]
    EXEC3 --> METRICS1["Metrics<br/>'accuracy: 0.95'"]
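The relationship between the three abstractions can be sketched with a small in-memory model. This is an illustrative stand-in, not cmflib itself: the `Pipeline`, `Context`, and `Execution` dataclasses and their fields are hypothetical, and only the method names `create_context` and `create_execution` echo the table above. The run name `train_model_retry` is likewise invented to show two executions sharing one context.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Execution:
    """One concrete run of a stage, with the artifacts it produced."""
    name: str
    outputs: list[str] = field(default_factory=list)


@dataclass
class Context:
    """A stage type such as 'train'; holds every run of that stage."""
    name: str
    executions: list[Execution] = field(default_factory=list)


@dataclass
class Pipeline:
    """Top-level grouping, identified by name (hypothetical toy class)."""
    name: str
    contexts: dict[str, Context] = field(default_factory=dict)

    def create_context(self, stage: str) -> Context:
        # Reuse an existing stage so repeated runs are grouped together.
        return self.contexts.setdefault(stage, Context(stage))

    def create_execution(self, stage: str, run_name: str) -> Execution:
        # Attach a new run to its stage, creating the stage if needed.
        execution = Execution(run_name)
        self.create_context(stage).executions.append(execution)
        return execution


pipeline = Pipeline("mnist_experiment")
pipeline.create_execution("download", "download_data").outputs.append("raw_data.csv")
pipeline.create_execution("train", "train_model").outputs.append("trained_model.pkl")
pipeline.create_execution("train", "train_model_retry").outputs.append("trained_model.pkl")

for context in pipeline.contexts.values():
    print(context.name, [run.name for run in context.executions])
```

The point of the sketch is the one-to-many shape: a pipeline owns several contexts, and each context accumulates executions over time.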
Component Architecture
CMF Library (cmflib)
The cmflib package provides the primary API for metadata tracking through the Cmf class and supporting modules:
graph TB
    subgraph "cmflib Package"
        CMF_CLASS["cmf.Cmf<br/>Main API Class"]
        METADATA_HELPER["metadata_helper.py<br/>MLMD Integration"]
        CMF_MERGER["cmf_merger.py<br/>Push/Pull Operations"]
        CMFQUERY["cmfquery.py<br/>Query Interface"]
        DATASLICE["dataslice.py<br/>Data Subset Tracking"]
    end
    subgraph "External Dependencies"
        MLMD[("ML Metadata<br/>SQLite/PostgreSQL")]
        DVC_SYSTEM["DVC<br/>Data Version Control"]
        GIT_SYSTEM["Git<br/>Code Version Control"]
        NEO4J_DB[("Neo4j<br/>Graph Database")]
    end
    CMF_CLASS --> METADATA_HELPER
    CMF_CLASS --> CMF_MERGER
    CMF_CLASS --> DATASLICE
    METADATA_HELPER --> MLMD
    CMF_CLASS --> DVC_SYSTEM
    CMF_CLASS --> GIT_SYSTEM
    CMF_CLASS --> NEO4J_DB
    CMF_MERGER --> CMFQUERY
Server and Web Components
The CMF server provides centralized metadata storage and a web interface for exploring ML pipeline lineage:
graph TB
    subgraph "cmf-server"
        FASTAPI_SERVER["FastAPI Server<br/>Port 8080"]
        GET_DATA["get_data.py<br/>Data Access Layer"]
        LINEAGE_QUERY["Lineage Query<br/>D3 Visualization"]
    end
    subgraph "UI Components"
        REACT_APP["React Application<br/>ui/ directory"]
        ARTIFACTS_PAGE["Artifacts Page<br/>Browse Datasets/Models"]
        EXECUTIONS_PAGE["Executions Page<br/>Browse Pipeline Runs"]
        LINEAGE_PAGE["Lineage Visualization<br/>D3.js Graphs"]
    end
    subgraph "Storage Layer"
        POSTGRES[("PostgreSQL<br/>Central MLMD")]
        TENSORBOARD_LOGS[("TensorBoard Logs<br/>Training Metrics")]
    end
    FASTAPI_SERVER --> GET_DATA
    FASTAPI_SERVER --> LINEAGE_QUERY
    REACT_APP --> FASTAPI_SERVER
    REACT_APP --> ARTIFACTS_PAGE
    REACT_APP --> EXECUTIONS_PAGE
    REACT_APP --> LINEAGE_PAGE
    GET_DATA --> POSTGRES
    FASTAPI_SERVER --> TENSORBOARD_LOGS
Key Features
Distributed Metadata Tracking
CMF enables distributed teams to work independently while maintaining consistent metadata through content-addressable artifacts and Git-like synchronization:
- Local Development: Each developer works with a local MLMD database
- Content Hashing: All artifacts are identified by their content hash, so the same content gets the same identifier on every machine
- Synchronization: cmf metadata push/pull commands sync with the central server
- Artifact Storage: Support for MinIO, Amazon S3, SSH, and local storage backends
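Why content hashing makes independent work mergeable can be shown with a toy example. The sketch below is not how cmf-server merges metadata; the dictionary-based stores, the `record_artifact` and `push` helpers, the file names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identify an artifact by the hash of its bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def record_artifact(store: dict, name: str, data: bytes) -> str:
    """Register an artifact in a metadata store keyed by content hash."""
    artifact_id = content_id(data)
    entry = store.setdefault(artifact_id, {"names": set()})
    entry["names"].add(name)
    return artifact_id


def push(local: dict, central: dict) -> None:
    """Merge a local store into the central one; equal hashes share one entry."""
    for artifact_id, entry in local.items():
        merged = central.setdefault(artifact_id, {"names": set()})
        merged["names"] |= entry["names"]


# Two developers track artifacts locally, without coordinating.
alice, bob, central = {}, {}, {}
record_artifact(alice, "data/raw.csv", b"a,b\n1,2\n")
record_artifact(bob, "datasets/raw_copy.csv", b"a,b\n1,2\n")  # same bytes, other path
record_artifact(bob, "data/extra.csv", b"a,b\n3,4\n")

push(alice, central)
push(bob, central)
print(len(central))  # 2: the identical files collapse into a single artifact
```

Because the key is derived from the bytes rather than the path or the machine, neither developer needs to ask the server for an identifier before working offline.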
 
Automatic Version Tracking
CMF automatically captures:
- Code Version: Git commit IDs for reproducibility
- Data Version: DVC-managed artifact content hashes
- Environment: Execution parameters and custom properties
- Lineage: Input/output relationships between executions
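As a rough picture of what such a captured record might contain, the sketch below gathers the four items listed above for a single run. It does not reproduce CMF's internal record format: the function names, the dictionary layout, and the use of SHA-256 in place of DVC-managed hashes are assumptions, and the only external call is the standard `git rev-parse HEAD` command.

```python
import hashlib
import subprocess


def current_git_commit():
    """Return the checked-out Git commit ID, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return None  # git executable not available
    return result.stdout.strip() if result.returncode == 0 else None


def hash_all(blobs):
    """Map each artifact name to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in blobs.items()}


def build_execution_record(name, params, inputs, outputs):
    """Collect code version, data versions, parameters, and lineage for one run."""
    return {
        "execution": name,
        "code_version": current_git_commit(),
        "parameters": dict(params),
        "inputs": hash_all(inputs),
        "outputs": hash_all(outputs),
    }


record = build_execution_record(
    name="train_model",
    params={"learning_rate": 0.01, "epochs": 5},
    inputs={"raw_data.csv": b"a,b\n1,2\n"},
    outputs={"trained_model.pkl": b"\x00model-bytes"},
)
print(record["code_version"], record["inputs"])
```

Outside a Git repository the `code_version` field is simply `None`, which keeps the sketch runnable anywhere.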
 
Query and Visualization
The system provides multiple interfaces for exploring metadata:
- Programmatic: CmfQuery class for custom queries
- Web UI: React-based interface for browsing artifacts and executions
- Lineage Graphs: D3.js visualizations showing data flow between pipeline stages
- TensorBoard Integration: Training metrics visualization
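A lineage query boils down to walking input/output relationships backwards from an artifact. The sketch below does this over a hand-written dictionary. It does not use the CmfQuery class, and the `EXECUTIONS` table and `upstream_artifacts` helper are hypothetical. The execution names follow the mnist example earlier in this document, while the input edges are assumed for illustration.

```python
from collections import deque

# Each execution lists the artifacts it read and the artifacts it wrote.
EXECUTIONS = {
    "download_data": {"inputs": [], "outputs": ["raw_data.csv"]},
    "train_model": {"inputs": ["raw_data.csv"], "outputs": ["trained_model.pkl"]},
    "evaluate_model": {
        "inputs": ["trained_model.pkl", "raw_data.csv"],
        "outputs": ["metrics"],
    },
}


def upstream_artifacts(artifact, executions):
    """Return every artifact that the given artifact was derived from."""
    # Index each artifact by the execution that produced it.
    producers = {out: run for run in executions.values() for out in run["outputs"]}
    seen, queue = set(), deque([artifact])
    while queue:
        current = queue.popleft()
        # Artifacts with no recorded producer have no upstream inputs.
        for parent in producers.get(current, {"inputs": []})["inputs"]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream_artifacts("metrics", EXECUTIONS)))
# ['raw_data.csv', 'trained_model.pkl']
```

A lineage graph in a UI is a rendering of the same edges; the traversal here is breadth-first, but any graph walk that tracks visited nodes would do.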