Training Dataset

Training Dataset Registry is a centralized library for training datasets used across Protean AI. It enables you to register, version, validate, and govern datasets that are used for fine-tuning, evaluation, and experimentation.

Just like models, datasets are first-class resources in Protean AI. The registry ensures datasets are reusable, versioned, auditable, and safely integrated with fine-tuning workflows.

Training Dataset Registry provides the following advantages:

All training datasets are available in a single, governed location
Consistent registration and versioning pattern across all datasets
Built-in integration with fine-tuning, evaluation, and data-lineage
Full reproducibility by locking training jobs to immutable dataset revisions
Traceability of dataset usage across fine-tuning runs
Enterprise-grade access control and auditability

Dataset Configuration

To register a training dataset, you define the dataset's registration characteristics. This includes the following, along with additional metadata:

A dataset name to identify the dataset
The purpose of fine-tuning, the dataset is designed for
The training dataset type, which defines the expected schema and format

See the screenshot below for an example of Training Dataset Configuration.

Snapshot of Protean AI Platform

Name

The dataset name is a unique identifier for the training dataset.
It is used to identify the dataset in the registry and during fine-tuning configuration. The name must be unique across all training datasets in the registry and may contain alphanumeric characters.

Purpose

Purpose defines the type of task the dataset is designed to support.
It must align with the purpose of the model being fine-tuned.

Purpose	Use Case
Chat	Conversational agents, assistants, tool calling, reasoning
Text Generation	Code completion, structured text generation
Similarity	Embedding training for similarity search and RAG
Reranking	Search result reranking, relevance scoring
Multi-label Classification	Tagging, moderation, topic assignment
Multi-class Classification	Intent detection, document categorization

Defining the purpose ensures that the dataset is used with compatible models.

Type

The Training Dataset Type defines:

The required schema
The loss function to use during training

Using an incorrect dataset type can silently degrade training quality or produce invalid results, hence Prtean AI is opinionated and strictly enforces a dataset format.

Revisions

Training datasets are versioned using revisions.

A revision represents an immutable snapshot of the dataset at a specific point in time. Revisions ensure:

Full reproducibility of training runs
Safe evolution of datasets over time
Clear auditability of which data trained which model

Add

Adding a revision uploads a dataset file and creates a new dataset version. When revisions are created, they are in a draft state. In draft status, revisions can be modified (new records appended, old records overridden).

Behavior:

Only one upload can occur at a time
Uploaded files are validated against the dataset type
Revisions become visible immediately after upload

Once a revision is released, it becomes immutable and cannot be modified. Only release revisions can be used in fine-tuning jobs.

Append

Creates a new revision by appending new records to an existing revision. Use append when the dataset is being incrementally grown and newly collected samples are added.

Overwrite

Creates a new revision that fully replaces the dataset contents. Use overwrite when the existing data is no longer valid.

Create From Revision

Creates a new revision using an existing revision as the starting point. Use create from revision when branching from a dataset.

Release

Only released revisions can be used in fine-tuning jobs.

Releasing a revision indicates:

The dataset has been validated
Schema and content are correct
The revision is safe for training

Best practice: treat released revisions as training-grade artifacts.

Delete

Deletes a dataset revision. Use with care and remove only invalid or mistakenly uploaded revisions.

info

Dataset revisions cannot be deleted if they are referenced by an active or completed fine-tuning job.

See the screenshot below for an example of Training Dataset Revisions.

Snapshot of Protean AI Platform

Dataset Usage

Before a dataset can be used for fine-tuning:

The dataset must be registered
At least one revision must be uploaded
A revision must be released

During fine-tuning configuration:

Training Dataset selects the dataset
Revision locks the exact dataset version

This guarantees:

Training reproducibility
Experiment traceability
Safe reuse across multiple runs

Access Control

Access Control governs who can view, modify, and use training datasets.
Protean AI follows the principle of least privilege to ensure enterprise-grade security and governance.

Role→ Action↓	Admin	ModelAdmin	Owner	Viewer	User	Description
Create	Yes	Yes	NA	NA	Yes	Register a traning dataset
Read	Yes	No	Yes	Yes	No	View traning dataset and revisions
Update	YES	No	Yes	No	No	Modify traning dataset metadata and revisions
Delete	YES	No	Yes	No	No	Remove traning dataset and revisions
Manage Access	Yes	No	Yes	No	No	Grant or revoke permissions
Release	Yes	No	Yes	No	No	Release traning dataset.

Dataset Deletion

Deleting a dataset removes all associated revisions
A dataset cannot be deleted if it is referenced by an active or completed fine-tuning job

Workflow

Define dataset name, purpose, and dataset type
Upload initial dataset revision
Validate dataset structure and samples
Release revision
Use released revision in fine-tuning

Result

After registration, the training dataset becomes available for authorized users and can be safely reused across fine-tuning workflows with full versioning, governance, and traceability.

Dataset Configuration​

Name​

Purpose​

Type​

Revisions​

Add​

Append​

Overwrite​

Create From Revision​

Release​

Delete​

Dataset Usage​

Access Control​

Workflow​

Result​