Complete API Documentation
Logging APIs
1. Library init call - Cmf()
This call initializes the library and creates a pipeline object with the provided name.
```python
cmf = cmf.Cmf(filename="mlmd", pipeline_name="Test-env")
# Returns a Context object of mlmd.proto.Context
```
Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| filename | String | Path to the SQLite file that stores the metadata |
| pipeline_name | String | Name to uniquely identify the pipeline. The name is the unique identifier for a pipeline; if a pipeline already exists with the same name, the existing pipeline object is reused |
| custom_properties | Dictionary (Optional) | Additional properties of the pipeline that need to be stored |
| graph | Bool (Optional) | If set to true, the library also stores the relationships in the provided graph database. The following environment variables should be set: NEO4J_URI, NEO4J_USER_NAME, NEO4J_PASSWD |
Return Object: mlmd.proto.Context

| Attribute | Type | Description |
| --- | --- | --- |
| create_time_since_epoch | int64 | Creation timestamp |
| custom_properties | repeated CustomPropertiesEntry | Custom properties |
| id | int64 | Unique identifier |
| last_update_time_since_epoch | int64 | Last update timestamp |
| name | string | Context name |
| properties | repeated PropertiesEntry | Properties |
| type | string | Context type |
| type_id | int64 | Type identifier |
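When graph=True is passed, lineage is additionally written to Neo4j. A minimal sketch, assuming the standard `from cmflib import cmf` import and placeholder Neo4j credentials:

```python
import os

from cmflib import cmf as cmf_lib

# Placeholder values; point these at your own Neo4j instance.
os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USER_NAME"] = "neo4j"
os.environ["NEO4J_PASSWD"] = "password"

# graph=True also records artifact/execution relationships in Neo4j.
cmf = cmf_lib.Cmf(filename="mlmd", pipeline_name="Test-env", graph=True)
```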
2. create_context - Creates a Stage with properties
A pipeline may include multiple stages. A unique name should be provided for every Stage in a pipeline.
```python
context = cmf.create_context(pipeline_stage="Prepare", custom_properties={"user-metadata1": "metadata_value"})
```
Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| pipeline_stage | String | Name of the pipeline stage |
| custom_properties | Dictionary (Optional) | Key-value pairs of additional stage properties that need to be stored |
Return Object: mlmd.proto.Context

| Attribute | Type | Description |
| --- | --- | --- |
| create_time_since_epoch | int64 | Creation timestamp |
| custom_properties | repeated CustomPropertiesEntry | Custom properties |
| id | int64 | Unique identifier |
| last_update_time_since_epoch | int64 | Last update timestamp |
| name | string | Context name |
| properties | repeated PropertiesEntry | Properties |
| type | string | Context type |
| type_id | int64 | Type identifier |
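The returned mlmd.proto.Context exposes the attributes listed above; a minimal sketch of inspecting it, continuing the example:

```python
context = cmf.create_context(pipeline_stage="Prepare")

# Attributes come straight from mlmd.proto.Context (see table above).
print(context.id)    # unique identifier assigned by the metadata store
print(context.name)  # context name
```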
3. create_execution - Creates an Execution with properties
A stage can have multiple executions. A unique name should be provided for every execution.
Properties of the execution can be passed as key-value pairs in the custom properties, e.g. the hyperparameters used for the execution.
```python
execution = cmf.create_execution(execution_type="Prepare",
                                 custom_properties={"Split": split, "Seed": seed})
# execution_type: String - Name of the execution
# custom_properties: Dictionary (Optional Parameter)
# Returns: Execution object of type mlmd.proto.Execution
```
Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| execution_type | String | Name of the execution |
| custom_properties | Dictionary (Optional) | Additional properties for the execution |
Return Object: mlmd.proto.Execution

| Attribute | Type | Description |
| --- | --- | --- |
| create_time_since_epoch | int64 | Creation timestamp |
| custom_properties | repeated CustomPropertiesEntry | Custom properties |
| id | int64 | Unique identifier |
| last_known_state | State | Last known execution state |
| last_update_time_since_epoch | int64 | Last update timestamp |
| name | string | Execution name |
| properties | repeated PropertiesEntry | Properties (Git_Repo, Context_Type, Git_Start_Commit, Pipeline_Type, Context_ID, Git_End_Commit, Execution Command, Pipeline_id) |
| type | string | Execution type |
| type_id | int64 | Type identifier |
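A stage can then be paired with one or more executions; a minimal sketch of the typical call order, with hypothetical hyperparameter values in place of `split` and `seed`:

```python
context = cmf.create_context(pipeline_stage="Prepare")
execution = cmf.create_execution(execution_type="Prepare",
                                 custom_properties={"Split": 0.8, "Seed": 42})
print(execution.id)  # unique identifier assigned by the metadata store
```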
4. log_dataset - Logs a Dataset and its properties
Tracks a dataset and its version. The version of the dataset is automatically obtained from the versioning software (DVC) and tracked as metadata.
```python
artifact = cmf.log_dataset("/repo/data.xml", "input", custom_properties={"Source": "kaggle"})
```
Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| url | String | The path to the dataset |
| event | String | Takes arguments INPUT or OUTPUT |
| custom_properties | Dictionary | The dataset properties |
Return Object: mlmd.proto.Artifact

| Attribute | Type | Description |
| --- | --- | --- |
| create_time_since_epoch | int64 | Creation timestamp |
| custom_properties | repeated CustomPropertiesEntry | Custom properties |
| id | int64 | Unique identifier |
| last_update_time_since_epoch | int64 | Last update timestamp |
| name | string | Artifact name |
| properties | repeated PropertiesEntry | Properties (Commit, Git_Repo) |
| state | State | Artifact state |
| type | string | Artifact type |
| type_id | int64 | Type identifier |
| uri | string | Artifact URI |
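A stage typically consumes one dataset and produces another; a minimal sketch, with hypothetical file paths that are assumed to be under DVC control:

```python
# Record what the execution read ...
cmf.log_dataset("artifacts/raw_data.xml", "input")

# ... and what it wrote, with optional custom properties.
cmf.log_dataset("artifacts/prepared_data.xml", "output",
                custom_properties={"rows": 10000})
```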
5. log_model - Logs a model and its properties.
```python
cmf.log_model(path="path/to/model.pkl",
              event="output",
              model_framework="SKlearn",
              model_type="RandomForestClassifier",
              model_name="RandomForestClassifier:default")
# Returns an Artifact object of type mlmd.proto.Artifact
```
Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| path | String | Path to the model file |
| event | String | Takes arguments INPUT or OUTPUT |
| model_framework | String | Framework used to create the model |
| model_type | String | Type of model algorithm used |
| model_name | String | Name of the algorithm used |
| custom_properties | Dictionary | The model properties |
Return Object: mlmd.proto.Artifact

| Attribute | Type | Description |
| --- | --- | --- |
| create_time_since_epoch | int64 | Creation timestamp |
| custom_properties | repeated CustomPropertiesEntry | Custom properties |
| id | int64 | Unique identifier |
| last_update_time_since_epoch | int64 | Last update timestamp |
| name | string | Artifact name |
| properties | repeated PropertiesEntry | Properties (commit, model_framework, model_type, model_name) |
| state | State | Artifact state |
| type | string | Artifact type |
| type_id | int64 | Type identifier |
| uri | string | Artifact URI |
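The same call records models on both sides of a pipeline; a minimal sketch in which a hypothetical model.pkl is logged as the output of a training stage and later as the input of an evaluation stage:

```python
# Training stage: the model file is a product of the execution.
cmf.log_model(path="model.pkl", event="output",
              model_framework="SKlearn",
              model_type="RandomForestClassifier",
              model_name="RandomForestClassifier:default")

# Evaluation stage: the same file is consumed as an input.
cmf.log_model(path="model.pkl", event="input",
              model_framework="SKlearn",
              model_type="RandomForestClassifier",
              model_name="RandomForestClassifier:default")
```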
6. log_execution_metrics - Logs the metrics for the execution
```python
cmf.log_execution_metrics(metrics_name="Training_Metrics", custom_properties={"auc": auc, "loss": loss})
```
Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| metrics_name | String | Name to identify the metrics |
| custom_properties | Dictionary | Metrics |
7. log_metric - Logs the per-step metrics for fine-grained tracking
The metrics provided are stored in a parquet file. The commit_metrics call adds the parquet file to the version control framework. The metrics written to the parquet file can be retrieved using the read_metrics call.
```python
# Can be called at every epoch or every step in the training.
# This is logged to a parquet file and committed at the commit stage.
while True:  # inside the training loop
    metawriter.log_metric("training_metrics", {"loss": loss})
metawriter.commit_metrics("training_metrics")
```
Arguments for log_metric:

| Argument | Type | Description |
| --- | --- | --- |
| metrics_name | String | Name to identify the metrics |
| custom_properties | Dictionary | Metrics |

Arguments for commit_metrics:

| Argument | Type | Description |
| --- | --- | --- |
| metrics_name | String | Name to identify the metrics |
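Putting log_metric and commit_metrics together, a minimal training-loop sketch; `metawriter` is the initialized Cmf object from section 1, and the epoch count and loss values are hypothetical stand-ins:

```python
for epoch in range(10):  # hypothetical epoch count
    loss = 1.0 / (epoch + 1)  # stand-in for a real training loss
    # Accumulates one row per call under the named metric.
    metawriter.log_metric("training_metrics", {"loss": loss})

# Writes the accumulated rows to a parquet file and versions it.
metawriter.commit_metrics("training_metrics")
```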
8. create_dataslice - Creates a dataslice
This helps to track a subset of the data. Currently supported only for file abstractions. For example, the accuracy of the model for a slice of data (gender, ethnicity, etc.).
```python
dataslice = cmf.create_dataslice("slice-a")
```
Arguments for create_dataslice:

| Argument | Type | Description |
| --- | --- | --- |
| name | String | Name to identify the dataslice |

Returns a Dataslice object.
9. add_data - Adds data to a dataslice
Currently supported only for file abstractions. Precondition: the parent folder containing the file should already be versioned.
```python
dataslice.add_data("data/raw_data/" + str(j) + ".xml")
```
Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| name | String | Name to identify the file to be added to the dataslice |
10. Dataslice commit - Commits the created dataslice
The created dataslice is versioned and added to the underlying data versioning software (DVC).
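Sections 8 through 10 combine as follows; a minimal sketch, assuming the dataslice object exposes a commit() method for this step and that the hypothetical files under data/raw_data/ sit in an already-versioned folder:

```python
dataslice = cmf.create_dataslice("slice-a")
for j in range(3):
    # Each file must live in a folder that is already versioned.
    dataslice.add_data("data/raw_data/" + str(j) + ".xml")

# Version the dataslice with the underlying data versioning software (DVC).
dataslice.commit()
```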