
Training Dataset

The Training Dataset Registry is a centralized library for the training datasets used across Protean AI. It lets you register, version, validate, and govern datasets used for fine-tuning, evaluation, and experimentation.

Just like models, datasets are first-class resources in Protean AI. The registry ensures datasets are reusable, versioned, auditable, and safely integrated with fine-tuning workflows.

Training Dataset Registry provides the following advantages:

  • All training datasets are available in a single, governed location
  • Consistent registration and versioning pattern across all datasets
  • Built-in integration with fine-tuning, evaluation, and data lineage
  • Full reproducibility by locking training jobs to immutable dataset revisions
  • Traceability of dataset usage across fine-tuning runs
  • Enterprise-grade access control and auditability

Dataset Configuration

To register a training dataset, you define the dataset's registration characteristics. This includes the following, along with additional metadata:

  • A dataset name to identify the dataset
  • The fine-tuning purpose that the dataset is designed for
  • The training dataset type, which defines the expected schema and format

See the screenshot below for an example of Training Dataset Configuration.

Figure: Training Dataset Configuration (snapshot of Protean AI Platform)

Name

The dataset name is a unique identifier for the training dataset.
It is used to identify the dataset in the registry and during fine-tuning configuration. The name must be unique across all training datasets in the registry and may contain alphanumeric characters.

Purpose

Purpose defines the type of task the dataset is designed to support.
It must align with the purpose of the model being fine-tuned.

| Purpose | Use Case |
|---|---|
| Chat | Conversational agents, assistants, tool calling, reasoning |
| Text Generation | Code completion, structured text generation |
| Similarity | Embedding training for similarity search and RAG |
| Reranking | Search result reranking, relevance scoring |
| Multi-label Classification | Tagging, moderation, topic assignment |
| Multi-class Classification | Intent detection, document categorization |

Defining the purpose ensures that the dataset is used with compatible models.
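
The compatibility rule can be illustrated with a minimal Python sketch. The `Purpose` enum mirrors the purposes listed in the table above; the `check_purpose` helper and the exact-match rule are assumptions made for illustration and are not the platform's API.

```python
from enum import Enum


class Purpose(Enum):
    """Purposes listed in the table above."""
    CHAT = "Chat"
    TEXT_GENERATION = "Text Generation"
    SIMILARITY = "Similarity"
    RERANKING = "Reranking"
    MULTI_LABEL_CLASSIFICATION = "Multi-label Classification"
    MULTI_CLASS_CLASSIFICATION = "Multi-class Classification"


def check_purpose(dataset_purpose: Purpose, model_purpose: Purpose) -> None:
    """Raise if the dataset and the model target different tasks.

    Hypothetical helper: assumes "aligned" means the two purposes are equal.
    """
    if dataset_purpose is not model_purpose:
        raise ValueError(
            f"Dataset purpose {dataset_purpose.value!r} does not match "
            f"model purpose {model_purpose.value!r}"
        )


# A chat dataset paired with a chat model passes silently.
check_purpose(Purpose.CHAT, Purpose.CHAT)

# A similarity dataset paired with a chat model is rejected.
try:
    check_purpose(Purpose.SIMILARITY, Purpose.CHAT)
except ValueError as err:
    print(err)
```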

Type

The Training Dataset Type defines:

  • The required schema
  • The loss function to use during training

Using an incorrect dataset type can silently degrade training quality or produce invalid results. For this reason, Protean AI is opinionated and strictly enforces the dataset format.
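
To make schema enforcement concrete, here is a minimal Python sketch of validating uploaded records against the fields a dataset type requires. The type names, the field names, and the JSON Lines file format are all assumptions made for illustration; the platform's actual types and schemas are not listed in this document.

```python
import json

# Hypothetical dataset types and their required fields. The real type names
# and schemas are defined by the platform and are not shown in this document.
REQUIRED_FIELDS = {
    "chat_messages": {"messages"},
    "sentence_pair_score": {"sentence1", "sentence2", "score"},
}


def validate_jsonl(lines, dataset_type):
    """Return a list of (line_number, problem) for records that break the schema."""
    required = REQUIRED_FIELDS[dataset_type]
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = required - set(record)
        if missing:
            problems.append((number, f"missing fields: {sorted(missing)}"))
    return problems


sample = [
    '{"sentence1": "a cat", "sentence2": "a kitten", "score": 0.9}',
    '{"sentence1": "a cat", "sentence2": "a car"}',
]
# The second record lacks "score", so it is reported as a problem.
print(validate_jsonl(sample, "sentence_pair_score"))
```

Collecting every problem instead of stopping at the first one lets an uploader fix a file in a single pass.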

Revisions

Training datasets are versioned using revisions.

A revision represents an immutable snapshot of the dataset at a specific point in time. Revisions ensure:

  • Full reproducibility of training runs
  • Safe evolution of datasets over time
  • Clear auditability of which data trained which model

Add

Adding a revision uploads a dataset file and creates a new dataset version. New revisions start in a draft state. While in draft, a revision can be modified: new records can be appended and existing records overwritten.

Behavior:

  • Only one upload can occur at a time
  • Uploaded files are validated against the dataset type
  • Revisions become visible immediately after upload

Once a revision is released, it becomes immutable and cannot be modified. Only released revisions can be used in fine-tuning jobs.
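
The draft and released states can be modeled as a small state machine. The following Python sketch uses a hypothetical in-memory `Revision` class (not a platform API) to show how a revision stays mutable while in draft and rejects changes once released.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    RELEASED = "released"


@dataclass
class Revision:
    """In-memory stand-in for a dataset revision (hypothetical model)."""
    records: list = field(default_factory=list)
    status: Status = Status.DRAFT

    def _require_draft(self):
        if self.status is not Status.DRAFT:
            raise PermissionError("released revisions are immutable")

    def append(self, new_records):
        """Add records to a draft revision."""
        self._require_draft()
        self.records.extend(new_records)

    def overwrite(self, new_records):
        """Replace all records of a draft revision."""
        self._require_draft()
        self.records = list(new_records)

    def release(self):
        """Freeze the revision; later modifications raise PermissionError."""
        self._require_draft()
        self.status = Status.RELEASED


rev = Revision()
rev.append([{"text": "first sample"}])
rev.release()
try:
    rev.append([{"text": "late sample"}])
except PermissionError as err:
    print(err)
```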

Append

Creates a new revision by appending new records to an existing revision. Use append when the dataset is being incrementally grown and newly collected samples are added.

Overwrite

Creates a new revision that fully replaces the dataset contents. Use overwrite when the existing data is no longer valid.

Create From Revision

Creates a new revision using an existing revision as the starting point. Use create from revision when branching from a dataset.

Release

Only released revisions can be used in fine-tuning jobs.

Releasing a revision indicates:

  • The dataset has been validated
  • Schema and content are correct
  • The revision is safe for training

Best practice: treat released revisions as training-grade artifacts.

Delete

Deletes a dataset revision. Use with care and remove only invalid or mistakenly uploaded revisions.

Info: Dataset revisions cannot be deleted if they are referenced by an active or completed fine-tuning job.
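
As an illustration of this rule, the Python sketch below refuses to delete a revision that a job still references. The data shapes (a dictionary of revisions, job records with `revision_id` and `status` keys) and the status strings are assumptions, not the platform's data model.

```python
def delete_revision(revisions, jobs, revision_id):
    """Remove revision_id from revisions unless a fine-tuning job references it.

    revisions: dict mapping revision id -> revision data (hypothetical store)
    jobs: iterable of dicts with "revision_id" and "status" keys (hypothetical)
    """
    blocking = [
        job for job in jobs
        if job["revision_id"] == revision_id
        and job["status"] in {"active", "completed"}
    ]
    if blocking:
        raise RuntimeError(
            f"revision {revision_id!r} is referenced by {len(blocking)} job(s)"
        )
    del revisions[revision_id]


revisions = {"r1": ["sample a"], "r2": ["sample b"]}
jobs = [{"revision_id": "r1", "status": "completed"}]

delete_revision(revisions, jobs, "r2")  # not referenced, so it is removed
try:
    delete_revision(revisions, jobs, "r1")  # referenced by a completed job
except RuntimeError as err:
    print(err)
```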

See the screenshot below for an example of Training Dataset Revisions.

Figure: Training Dataset Revisions (snapshot of Protean AI Platform)

Dataset Usage

Before a dataset can be used for fine-tuning:

  1. The dataset must be registered
  2. At least one revision must be uploaded
  3. A revision must be released

During fine-tuning configuration:

  • Training Dataset selects the dataset
  • Revision locks the exact dataset version

This guarantees:

  • Training reproducibility
  • Experiment traceability
  • Safe reuse across multiple runs
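
The three preconditions and the revision lock can be sketched in Python as follows. `FineTuningConfig`, `resolve`, and the registry layout are hypothetical names chosen for this example; they only mirror the Training Dataset and Revision fields described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuningConfig:
    """Hypothetical job config: a dataset name plus one pinned revision id."""
    training_dataset: str
    revision: str


def resolve(config, registry):
    """Return the records of the pinned revision, enforcing the preconditions.

    registry maps dataset name -> {revision id -> {"released": bool, "records": list}}.
    """
    if config.training_dataset not in registry:
        raise LookupError("dataset is not registered")
    revisions = registry[config.training_dataset]
    if config.revision not in revisions:
        raise LookupError("revision has not been uploaded")
    revision = revisions[config.revision]
    if not revision["released"]:
        raise ValueError("revision is still a draft; release it first")
    return revision["records"]


registry = {
    "supportchat": {
        "1": {"released": True, "records": ["sample a", "sample b"]},
        "2": {"released": False, "records": ["sample a", "sample b", "sample c"]},
    }
}
config = FineTuningConfig(training_dataset="supportchat", revision="1")
print(resolve(config, registry))
```

Because the config names one revision and released revisions are immutable, rerunning the same config resolves to the same records, even after newer revisions are added.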

Access Control

Access Control governs who can view, modify, and use training datasets.
Protean AI follows the principle of least privilege to ensure enterprise-grade security and governance.


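| Action ↓ / Role → | Admin | ModelAdmin | Owner | Viewer | User | Description |
|---|---|---|---|---|---|---|
| Create | Yes | Yes | NA | NA | Yes | Register a training dataset |
| Read | Yes | No | Yes | Yes | No | View training dataset and revisions |
| Update | Yes | No | Yes | No | No | Modify training dataset metadata and revisions |
| Delete | Yes | No | Yes | No | No | Remove training dataset and revisions |
| Manage Access | Yes | No | Yes | No | No | Grant or revoke permissions |
| Release | Yes | No | Yes | No | No | Release training dataset revisions |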
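
For readers who want to reason about the matrix programmatically, the Python sketch below transcribes the table into a lookup. Treating "NA" as a denial is an assumption made for this example, and `is_allowed` is a hypothetical helper, not a platform API.

```python
# Permission matrix transcribed from the table above.
# True = "Yes", False = "No", None = "NA" (not applicable).
ROLES = ("Admin", "ModelAdmin", "Owner", "Viewer", "User")
MATRIX = {
    "Create":        (True, True,  None, None,  True),
    "Read":          (True, False, True, True,  False),
    "Update":        (True, False, True, False, False),
    "Delete":        (True, False, True, False, False),
    "Manage Access": (True, False, True, False, False),
    "Release":       (True, False, True, False, False),
}


def is_allowed(role, action):
    """Return True only for an explicit "Yes"; "No" and "NA" both deny (assumption)."""
    return MATRIX[action][ROLES.index(role)] is True


print(is_allowed("Viewer", "Read"))     # True
print(is_allowed("Viewer", "Release"))  # False
```

Denying by default on anything other than an explicit "Yes" keeps the check consistent with the least-privilege principle stated above.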

Dataset Deletion

  • Deleting a dataset removes all associated revisions
  • A dataset cannot be deleted if it is referenced by an active or completed fine-tuning job

Workflow

  1. Define dataset name, purpose, and dataset type
  2. Upload initial dataset revision
  3. Validate dataset structure and samples
  4. Release revision
  5. Use released revision in fine-tuning

Result

After registration, the training dataset becomes available for authorized users and can be safely reused across fine-tuning workflows with full versioning, governance, and traceability.