Training

Training Runtime provides full lifecycle control over a training run: starting, stopping, monitoring, and deleting it, while maintaining visibility into runtime behavior through logs and events. Each runtime instance has a runtime status that indicates its current state. This is different from the trial status. The runtime status can be one of the following:

  • Started: Runtime instance is in progress.
  • Aborted: Runtime instance was stopped manually.
  • Failed: Runtime instance is in the process of being terminated.
  • Completed: Runtime instance has completed successfully.
Figure: Training (snapshot of Protean AI Platform)

Start

To initiate a training run, the complete configuration must be provided. The system will validate the configuration and allocate the necessary resources before starting the training run.

Stop

Training can be halted manually. When halted, the system stops the training process abruptly, and the current state of the training run is not preserved. However, the system captures the state at a user-defined frequency (see Snapshot). The training run can be resumed at a later time.

note

We are working on a more robust solution for graceful shutdown.
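The interaction between an abrupt stop and periodic snapshots can be illustrated with a toy loop. This is only a sketch under stated assumptions: the function `train`, its parameters, and the JSON snapshot file are hypothetical and are not the platform's API or snapshot format. It assumes that a resumed run continues from the most recent snapshot, so any progress made after that snapshot is repeated.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def train(total_steps: int, snapshot_every: int, snapshot_path: Path,
          stop_at: Optional[int] = None) -> int:
    """Run a toy training loop and return the last step executed.

    `stop_at` simulates a manual stop at that step (hypothetical parameter).
    """
    # Resume from the last snapshot if one exists, otherwise start at step 0.
    step = 0
    if snapshot_path.exists():
        step = json.loads(snapshot_path.read_text())["step"]

    while step < total_steps:
        step += 1  # stands in for one real training step
        if step % snapshot_every == 0:
            # State is captured only at the user-defined frequency.
            snapshot_path.write_text(json.dumps({"step": step}))
        if stop_at is not None and step == stop_at:
            break  # abrupt stop: nothing is saved at this point
    return step


def demo() -> tuple:
    """Stop at step 60 with snapshots every 25 steps, then resume."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "snapshot.json"
        stopped_at = train(100, 25, path, stop_at=60)
        resumed_from = json.loads(path.read_text())["step"]
        final = train(100, 25, path)
        return stopped_at, resumed_from, final
```

In this sketch, `demo()` stops at step 60, finds the last snapshot at step 50, and the resumed run finishes at step 100. A smaller snapshot interval loses less work on a manual stop at the cost of more frequent writes.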

Logs and Events

Logs and events provide operational visibility into training runtimes. They help monitor execution, diagnose failures, and understand how a training run behaves over time at both instance and deployment scope.

Logs

Training Runtime provides detailed observability to help diagnose issues and understand runtime behavior. Logs can be accessed at the instance level, that is, for individual runtime instances. Logs include fine-tuning progress, errors, and warnings.

Figure: Logs (snapshot of Protean AI Platform)

Events

Events provide a structured view of significant runtime actions and state transitions. Events are organized into two categories:

  • Instance-level events show lifecycle and execution events for a specific instance
  • Deployment-level events summarize changes affecting the entire runtime

Events include:

  • Scheduling decisions
  • Health and readiness state changes
Figure: Events (snapshot of Protean AI Platform)

Progress

The progress of training is displayed in two progress bars and a graph, all updated in real time as training progresses. The progress bars show how far the training run has progressed, while the graph tracks the training metrics.

Figure: Progress (snapshot of Protean AI Platform)

Step Monitor

In this visualization, progress is shown as a continuous timeline of steps, with an indication of [Current Step] / [Total Steps]. It also marks the following structural boundaries:

  • Evaluation steps: Points where training pauses to run validation are highlighted with a vertical marker |.
  • Epoch boundaries: Points where an epoch ends are highlighted with a taller vertical marker |.
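The marker positions on such a timeline can be sketched as follows. The helper names are hypothetical, and the sketch assumes that evaluation runs at a fixed step interval and that every epoch has the same number of steps, neither of which is stated in this document.

```python
def step_markers(total_steps: int, eval_every: int, steps_per_epoch: int):
    """Return (evaluation_steps, epoch_boundaries) as lists of step numbers.

    Assumes a fixed evaluation interval and equal-length epochs.
    """
    steps = range(1, total_steps + 1)
    eval_steps = [s for s in steps if s % eval_every == 0]
    epoch_ends = [s for s in steps if s % steps_per_epoch == 0]
    return eval_steps, epoch_ends


def step_label(current_step: int, total_steps: int) -> str:
    """Format the [Current Step] / [Total Steps] indicator."""
    return f"{current_step} / {total_steps}"
```

For example, `step_markers(10, 4, 5)` places evaluation markers at steps 4 and 8 and epoch boundaries at steps 5 and 10.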

Batch Monitor

In this visualization, progress is shown as a continuous timeline of batches processed in the current epoch (e.g., 450/1000). The current epoch number is also indicated (e.g., E0 for epoch 0).
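One way to derive the epoch number and the per-epoch batch count from a running total is sketched below. The function name and the label layout are illustrative assumptions. The sketch also assumes zero-based epoch numbering (as in the E0 example) and a fixed number of batches per epoch.

```python
def batch_label(batches_done: int, batches_per_epoch: int) -> str:
    """Format a label such as 'E0 450/1000' from a running batch count.

    `batches_done` counts batches completed since the start of training.
    """
    # divmod gives the zero-based epoch and the batches completed within it.
    epoch, batch_in_epoch = divmod(batches_done, batches_per_epoch)
    return f"E{epoch} {batch_in_epoch}/{batches_per_epoch}"
```

With this convention, completing the last batch of an epoch rolls the label over to the next epoch with a count of 0.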

Trials

Trials are the individual training runs. Each trial is associated with a specific runtime instance. The trials page lists all training runs and shows the following information for each trial:

  • Trial Number
  • Trial Status
  • Completion percentage
  • Elapsed time
  • Estimated time
Figure: Trials (snapshot of Protean AI Platform)

Status

The trial status indicates the current state of the training run. The status can be one of the following:

  • Running: Training trial is in progress.
  • Completed: Training trial has completed successfully.
  • Failed: Training trial has failed.
  • Waiting: Training trial was stopped manually.

This is different from the runtime status, which indicates the current state of the runtime instance. The runtime status can be one of the following:

  • Started: Runtime instance is in progress.
  • Aborted: Runtime instance was stopped manually.
  • Failed: Runtime instance is in the process of being terminated.
  • Completed: Runtime instance has completed successfully.
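Because the two status sets overlap in name ("Completed" and "Failed" appear in both) but describe different things, client code may want to keep them as separate types. The enums below are a minimal sketch: the class and member names are hypothetical, and only the labels and their descriptions are taken from the lists above.

```python
from enum import Enum


class TrialStatus(Enum):
    """State of a training trial."""
    RUNNING = "Running"        # trial is in progress
    COMPLETED = "Completed"    # trial has completed successfully
    FAILED = "Failed"          # trial has failed
    WAITING = "Waiting"        # trial was stopped manually


class RuntimeStatus(Enum):
    """State of a runtime instance."""
    STARTED = "Started"        # instance is in progress
    ABORTED = "Aborted"        # instance was stopped manually
    FAILED = "Failed"          # instance is in the process of being terminated
    COMPLETED = "Completed"    # instance has completed successfully
```

Keeping the types distinct means `TrialStatus.FAILED` and `RuntimeStatus.FAILED` never compare equal, which avoids mixing up the two meanings.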

Completion Percentage

The completion percentage indicates the percentage of training steps completed.

Elapsed Time

The elapsed time indicates the time that the training has been running.

Estimated Time

The estimated time indicates the time remaining for the entire training to complete.
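The relationship between these three values can be sketched with simple arithmetic. The function names are hypothetical, and the estimate uses linear extrapolation from the average step time so far. That is an assumption for illustration, since this document does not describe how the platform computes its estimate.

```python
from typing import Optional


def completion_percentage(steps_done: int, total_steps: int) -> float:
    """Percentage of training steps completed."""
    return 100.0 * steps_done / total_steps


def estimated_remaining(elapsed_seconds: float, steps_done: int,
                        total_steps: int) -> Optional[float]:
    """Seconds remaining, assuming future steps take the average time so far."""
    if steps_done == 0:
        return None  # no basis for an estimate yet
    seconds_per_step = elapsed_seconds / steps_done
    return seconds_per_step * (total_steps - steps_done)
```

For instance, after 250 of 1000 steps in 600 seconds, the completion percentage is 25.0 and the linear estimate of the remaining time is 1800 seconds.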

Metrics & Visualization

Real-time graphs track the mathematical health of the training process across all epochs. The graphs plot the following metrics against the global step:

  • Learning Rate (LR): Visualizes the scheduler's behavior (e.g., warm-up and decay phases).
Figure: Learning Rate (snapshot of Protean AI Platform)
  • Gradient Norm: Used to detect exploding or vanishing gradients. A stable norm indicates healthy backpropagation.
Figure: Gradient Norm (snapshot of Protean AI Platform)
  • Training Loss: The average loss across all training examples in the current batch. Look for a downward trend with "noise" that smooths out over time.
Figure: Training Loss (snapshot of Protean AI Platform)
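To make the three metrics concrete, the sketch below shows one common way each could be computed. None of this is taken from the platform: the function names are hypothetical, the schedule shown (linear warm-up followed by linear decay to zero) is just one of many possible schedulers, the gradient norm is assumed to be the global L2 norm, and the exponential moving average is only one way to view the trend beneath a noisy loss curve.

```python
import math
from typing import List, Sequence


def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up to `base_lr`, then linear decay to zero.

    Assumes warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


def global_grad_norm(grads: Sequence[Sequence[float]]) -> float:
    """L2 norm over all gradient values, given one sequence per parameter."""
    return math.sqrt(sum(g * g for param in grads for g in param))


def smooth(losses: Sequence[float], alpha: float = 0.1) -> List[float]:
    """Exponential moving average; smaller alpha gives a smoother curve."""
    smoothed: List[float] = []
    average = None
    for loss in losses:
        average = loss if average is None else alpha * loss + (1 - alpha) * average
        smoothed.append(average)
    return smoothed
```

Under these assumptions, a gradient norm that grows by orders of magnitude between steps would suggest exploding gradients, while one that collapses toward zero early in training would suggest vanishing gradients.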