
Dataset Format

When discussing AI model training, attention often focuses on models, accelerators, and optimization techniques. However, one of the most decisive factors influencing training quality is the structure of the training data itself. The way data is represented—its format, schema, and encoding—directly determines how it is consumed by the training pipeline and interpreted by the loss function. Well-designed data formats enable stable optimization, accurate loss computation, and predictable learning behavior, while poorly structured data can limit convergence and degrade model performance regardless of model size or hardware capacity. Efficiency is maximized when the dataset format aligns with the target task.

Clarity on the purpose and intended output determines the necessary dataset format. The table below summarizes the most common use cases and the dataset formats that work best for each.

info

Dataset formats are constrained based on training goals to ensure consistent structure, unambiguous semantics, and direct compatibility with the objective. Constraining formats eliminates the risk of applying incorrect dataset structures that can silently compromise training quality. This alignment between data representation and training objective enables predictable convergence, reproducible results, and reliable comparison across training runs.

| Purpose | Dataset | Use Case |
| --- | --- | --- |
| Chat | Conversational Language Modeling | Conversational Agent, Tool Calling, Thinking |
| Text Generation | Standard Prompt Completion | Code Completion, Sentence Completion (Used in email clients or writing tools) |
| Similarity | Sentence Pair With Score | Similarity Search, RAG embeddings |
| Reranking | Sentence Pair With Score | Search result reranking, retrieval refinement, relevance scoring, RAG ranking |
| Multi-label Classification | Single Sentence With Multi Labels | Tagging, topic assignment, content moderation, intent detection |
| Multi-class Classification | Single Sentence With Multi Classes | Intent classification, document categorization, sentiment analysis |


Chat

Chat is the process of modeling interactive dialogue to maintain conversational flow and context across multiple turns.

Conversational Language Modeling

Conversational language modeling is used to fine-tune instruction-tuned models for chat. Training data is structured as multi-turn conversations, where each turn represents a user input and a corresponding assistant response. Conversational formatting enables stable learning of dialogue flow, response relevance, and instruction adherence. It also ensures compatibility with chat templates and role-based message separation. If the chat template supports it, only tokens originating from the assistant role are included in the loss computation.

Each training sample is a CSV (comma separated value) record, where:

  • The first column is messages. It contains the json string of the conversation.
  • The second column is tools. It contains the json string of the tools used in the conversation.
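The two-column layout can be produced with Python's standard csv and json modules. The sketch below is illustrative (the conversation content is made up) and writes to an in-memory buffer rather than a file:

```python
import csv
import io
import json

# One training sample: a multi-turn conversation plus the tools it references.
messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 equals 4."},
]
tools = []  # no tools used in this conversation

# Write the record: column 1 is the messages JSON string, column 2 the tools JSON string.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([json.dumps(messages), json.dumps(tools)])

csv_record = buffer.getvalue()
```

csv.writer handles the quoting needed because the JSON strings themselves contain commas and double quotes.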

Question Answering

This is a simple question answering dataset, where the assistant answers the question based on the context.

  • The first column of the CSV record contains the json string of the conversation, e.g.:
        [
          { "role": "system", "content": "You are a helpful assistant." },
          { "role": "user", "content": "Explain gradient accumulation." },
          { "role": "assistant", "content": "Gradient accumulation simulates larger batch sizes by summing gradients over multiple steps." }
        ]
  • The second column is tools. It contains the json string of the tools used in the conversation. In question answering, tools are usually not used; in that case, the column should be left empty.

Download Sample

Tool Calling

This is a dataset that contains tool calls in the conversation.

  • The first column of the CSV record contains the json string of the conversation, e.g.:
        [
          {
            "role": "system",
            "content": "You are a helpful assistant with access to a tool to calculate interest."
          },
          {
            "role": "user",
            "content": "Hi, I need to calculate the interest on a loan. The principal amount is $5000, the interest rate is 5% and the loan period is 2 years."
          },
          {
            "role": "assistant",
            "content": "Sure, I can help with that. Let me calculate it for you."
          },
          {
            "role": "assistant",
            "content": "",
            "tool_calls": [
              {
                "name": "calculate_interest",
                "arguments": {
                  "principal": 5000,
                  "rate": 5,
                  "period": 2
                }
              }
            ]
          },
          {
            "role": "tool",
            "content": "{\"interest\": 500}",
            "name": "calculate_interest"
          },
          {
            "role": "assistant",
            "content": "The interest amount for your loan over a period of 2 years would be $500."
          }
        ]
  • The second column is tools. It contains the json string of the tools used in the conversation.
        [
          {
            "type": "function",
            "function": {
              "name": "calculate_interest",
              "description": "A function that calculates interest.",
              "parameters": {
                "type": "object",
                "properties": {
                  "principal": {
                    "type": "number",
                    "description": "Initial amount of money (principal). Must be greater than or equal to 0."
                  },
                  "rate": {
                    "type": "number",
                    "description": "Annual interest rate as a percentage (e.g. 5 for 5%)."
                  },
                  "period": {
                    "type": "number",
                    "description": "Time period in years."
                  }
                },
                "required": [
                  "principal",
                  "rate",
                  "period"
                ]
              }
            }
          }
        ]
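Mismatches between a conversation's tool calls and the tools column (unknown tool names, missing required arguments) can silently degrade training, so it is worth validating samples up front. The helper below is a minimal sketch of such a check, not part of any training pipeline:

```python
import json

def check_tool_calls(messages_json: str, tools_json: str) -> list[str]:
    """Return a list of problems found in one sample's tool calls (sketch)."""
    # Index tool definitions by function name.
    tools = {t["function"]["name"]: t["function"] for t in json.loads(tools_json)}
    problems = []
    for msg in json.loads(messages_json):
        for call in msg.get("tool_calls", []):
            fn = tools.get(call["name"])
            if fn is None:
                problems.append(f"unknown tool: {call['name']}")
                continue
            # Every required parameter must appear in the call's arguments.
            required = set(fn["parameters"].get("required", []))
            missing = required - set(call["arguments"])
            if missing:
                problems.append(f"{call['name']} missing: {sorted(missing)}")
    return problems
```

An empty return value means the sample's tool calls are consistent with the declared tool schema.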

Download Sample

Thinking

This is a dataset that contains chain-of-thought reasoning in the conversation. The formatting is highly dependent on the model and the chat template it uses. Different models may have different ways of separating the chain of thought from the final answer. Some models wrap the chain of thought in tags such as:

  • <|think|>...</|think|>
  • <|thinking|>...</|thinking|> <|actual_response|>...</|actual_response|>
  • <|begin_of_thought|>...<|end_of_thought|><|begin_of_solution|>...<|end_of_solution|>

The chat template must be checked before preparing the dataset, and the appropriate tags must be used.

  • The first column of the CSV record contains the json string of the conversation, e.g.:
        [
          {
            "role": "system",
            "content": "You are a helpful assistant. Use internal reasoning but do not expose it."
          },
          {
            "role": "user",
            "content": "Why is gradient accumulation useful when training large models?"
          },
          {
            "role": "assistant",
            "content": "<|thinking|>The question asks for a conceptual explanation. Key points: limited GPU memory, effective batch size, gradient stability. Reasoning should connect these without exposing internal deliberation.</|thinking|>\n\n<|actual_response|>Gradient accumulation allows multiple smaller batches to be processed before performing an optimization step. This simulates a larger effective batch size, improving gradient stability and convergence when hardware memory limits prevent using large batches directly.</|actual_response|>"
          }
        ]
  • The second column is tools. It contains the json string of the tools used in the conversation. See above example for tools.
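Before training, it can help to verify that every assistant turn uses the tag pair the target model's chat template expects. A minimal sketch, assuming the `<|thinking|>`/`<|actual_response|>` convention shown above:

```python
import re

# Patterns for the two tagged regions of an assistant message.
THINK = re.compile(r"<\|thinking\|>(.*?)</\|thinking\|>", re.DOTALL)
ANSWER = re.compile(r"<\|actual_response\|>(.*?)</\|actual_response\|>", re.DOTALL)

def split_reasoning(content: str) -> tuple[str, str]:
    """Split an assistant message into (chain-of-thought, final answer)."""
    think = THINK.search(content)
    answer = ANSWER.search(content)
    if not (think and answer):
        raise ValueError("assistant turn is missing the expected tag pair")
    return think.group(1).strip(), answer.group(1).strip()
```

For models using a different convention (for example `<|think|>...</|think|>`), only the two patterns need to change.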

Download Sample

caution

For the above formats to work for the chat objective, the model being fine-tuned must already be instruction tuned and contain a valid chat template. The chat template must support role-based message separation, tool calling, and chain of thought as necessary. This can be verified by checking the chat_template.json file in the model's repository on Hugging Face.

Text Generation

Text Generation is the process of synthesizing coherent natural language to complete a prompt.

Standard Prompt Completion

Text generation covers tasks such as summarization, question answering, and creative writing. Training data is structured as input prompts paired with the desired text completion. Unlike conversational modeling, which handles multi-turn state and role-specific tokens, standard text generation focuses on a direct mapping from a context (the prompt) to a continuation (the completion). This format is ideal for instruction tuning where the goal is a specific output format, code generation, or transforming the input text into a different representation without the overhead of dialogue history.

Each training sample is a CSV (comma separated value) record, where:

  • The first column is prompt. It contains the input instruction, context, or starting text.
  • The second column is completion. It contains the desired output text that the model should generate.

| prompt | completion |
| --- | --- |
| What is the capital of France? | The capital of France is Paris. |
| Write a Python function to add two numbers. | def add(a, b): return a + b |
| Summarize this: The sun is a star at the center of the Solar System. It is a nearly perfect sphere of hot plasma. | The Sun is the central star of our Solar System, composed mainly of hot plasma. |
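Assembling such a file with the standard library is straightforward; csv.writer takes care of quoting when prompts contain commas or quotes. The rows below mirror the examples in the table:

```python
import csv
import io

# Prompt/completion pairs taken from the examples above.
rows = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Write a Python function to add two numbers.", "def add(a, b): return a + b"),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["prompt", "completion"])  # header row
writer.writerows(rows)
csv_text = buffer.getvalue()
```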

Download Sample

Similarity

Similarity is the process of quantifying the semantic relationship between two pieces of text to determine their degree of resemblance.

Sentence Pair With Float Score

Sentence-pair with float score dataset format is used for fine-tuning embedding models for semantic search, clustering, and Retrieval-Augmented Generation (RAG). Training data is structured as pairs of sentences accompanied by a floating-point score that quantifies their semantic relationship. This formatting allows the model to calibrate its internal vector space, learning to minimize the distance between semantically close pairs while maximizing the distance between unrelated ones. By using continuous float scores rather than binary labels, the model captures nuanced degrees of relevance, distinguishing between exact matches, related concepts, and unrelated noise.

Each training sample is a CSV (comma separated value) record, where:

  • The first column is sentence1. It contains the anchor text, query, or first statement.
  • The second column is sentence2. It contains the comparison text, document passage, or response.
  • The third column is score. It contains a float value (typically normalized between 0.0 and 1.0) representing the degree of similarity.

| sentence1 | sentence2 | score |
| --- | --- | --- |
| A plane is taking off. | An air plane is taking off. | 1 |
| A man is playing a large flute. | A man is playing a flute. | 0.76 |
| Three men are playing chess. | Two men are playing chess. | 0.52 |
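Because the loss calibrates the embedding space against these scores, it is worth checking that every score parses as a float in the expected range before training. A minimal per-row validation sketch:

```python
def validate_similarity_row(row: list[str]) -> float:
    """Check one sentence1,sentence2,score record and return the parsed score."""
    if len(row) != 3:
        raise ValueError(f"expected 3 columns, got {len(row)}")
    score = float(row[2])  # raises ValueError on non-numeric scores
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} outside [0.0, 1.0]")
    return score
```

The [0.0, 1.0] bound matches the normalization convention described above; other datasets may use a different range.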

Download Sample

Sentence Pair With Entailment Class

In a sentence-pair entailment dataset, each example contains two sentences, a premise and a hypothesis, plus a label that describes the logical relationship between them. The label takes one of three values:

  • Entailment (1): the hypothesis is definitely true given the premise.
  • Contradiction (-1): the hypothesis is definitely false given the premise.
  • Neutral (0): the hypothesis is neither clearly true nor clearly false based on the premise; there is not enough information.

This format forces the model to learn distinct logical boundaries and decision surfaces, making it ideal for tasks requiring strict categorization of text relationships rather than relative ranking.

Each training sample is a CSV (comma separated value) record, where:

  • The first column is sentence1. It contains the premise, anchor, or first text.
  • The second column is sentence2. It contains the hypothesis, comparison, or second text.
  • The third column is class. It contains the categorical label (entailment = 1, neutral = 0, contradiction = -1) identifying the specific relationship type.

| sentence1 | sentence2 | class |
| --- | --- | --- |
| The device is charging. | The device is receiving power. | 1 |
| The system is offline. | The system is currently running. | -1 |
| The user logged in this morning. | The user updated their password. | 0 |
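The three labels map naturally to a small lookup table; a sketch of decoding the class column of a record:

```python
# Mapping from the integer class column to its relationship name.
ENTAILMENT_LABELS = {1: "entailment", 0: "neutral", -1: "contradiction"}

def decode_label(raw: str) -> str:
    """Map the class column value (-1, 0, or 1) to its relationship name."""
    value = int(raw)
    if value not in ENTAILMENT_LABELS:
        raise ValueError(f"unexpected class value: {value}")
    return ENTAILMENT_LABELS[value]
```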

Download Sample

note

We are working on the implementation of this dataset format.

Reranking

Reranking is the process of refining the order of retrieved items to ensure the most relevant results appear at the top. Unlike standard similarity models (which often encode sentences independently), reranking models typically function as cross-encoders, processing the query and document simultaneously to assess their specific compatibility. There are two formats for reranking datasets:

Sentence Pair With Score

Training data is structured as pairs of query-document sentences with an associated relevance score. Each training sample is a CSV (comma separated value) record, where:

  • The first column is sentence1. It contains the search query or question.
  • The second column is sentence2. It contains the candidate document, passage, or answer to be evaluated.
  • The third column is score. It contains a float value representing the relevance or quality of the match (higher values indicate higher relevance).

This format enables the model to learn fine-grained relevance signals, capturing complex linguistic dependencies and context that simple vector distance might miss, ensuring the most pertinent information appears at the top of the list.

| sentence1 | sentence2 | score |
| --- | --- | --- |
| How do I reset my password? | Go to settings, select 'Security', and click 'Reset Password' to send a recovery email. | 0.98 |
| How do I reset my password? | Passwords should be at least 8 characters long and contain special symbols. | 0.35 |
| How do I reset my password? | The cafeteria serves lunch from 12:00 PM to 2:00 PM. | 0.01 |

Why these scores?

  • 0.98 (Perfect Match): The document directly answers the specific question.
  • 0.35 (Partial Relevance): The document talks about "passwords" (keywords match), but it doesn't answer the question of how to reset it. A binary model might mistakenly flag this as "Relevant" (1), but a Score-based model can push it down the list.
  • 0.01 (Irrelevant): Completely unrelated topic.
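At inference time, a reranker trained on this format produces one score per query-document pair, and the candidates are reordered by that score. A sketch of the reordering step, with hard-coded scores standing in for model output:

```python
def rerank(scored_docs: list[tuple[str, float]]) -> list[str]:
    """Order candidate documents by descending relevance score."""
    return [doc for doc, _ in sorted(scored_docs, key=lambda pair: pair[1], reverse=True)]
```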

Download Sample

Sentence Pair With Binary Class

Reranking models can also be trained on datasets annotated with simple "relevant" vs. "irrelevant" judgments rather than granular scores. In this format, training data is structured as pairs of query-document sentences mapped to a binary label (0 or 1). This approach is effective for training cross-encoders on datasets where human annotators have only provided boolean feedback. Each training sample is a CSV (comma separated value) record, where:

  • The first column is sentence1. It contains the search query.
  • The second column is sentence2. It contains the candidate document or passage.
  • The third column is class. It contains an integer: 1 for relevant (positive pair) and 0 for irrelevant (hard negative pair).

| sentence1 | sentence2 | class |
| --- | --- | --- |
| How do I reset my password? | Go to settings, select 'Security', and click 'Reset Password' to send a recovery email. | 1 |
| How do I reset my password? | Passwords should be at least 8 characters long and contain special symbols. | 0 |
| How do I reset my password? | The cafeteria serves lunch from 12:00 PM to 2:00 PM. | 0 |

Key difference in training: the "hard negative" (row 2). Note that the second example is related (it talks about passwords), but it doesn't answer the specific question. In a binary format, this is explicitly marked as 0.

The Effect: This forces the model to learn that "related keywords" are not enough—it needs to find the actual answer to output a high probability. This is often more effective for training strict Rerankers than using intermediate float scores (like 0.5).
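If only float-scored data is available, it can be converted into this binary format by thresholding; the cutoff below is an arbitrary illustration, and pairs near the threshold may deserve manual review before being labeled:

```python
def to_binary_class(score: float, threshold: float = 0.5) -> int:
    """Map a float relevance score to a binary class label (sketch)."""
    return 1 if score >= threshold else 0
```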

Download Sample

note

We are working on the implementation of this dataset format.

Categorization

Categorization is the process of grouping items based on shared characteristics or usage, where items can belong to multiple categories.

Single Sentence Multi Label

A single-sentence multi-label classification dataset is designed so that a single text input can belong to multiple categories simultaneously (for example, a movie review that is both Action and Comedy). Training data is structured as single sentences mapped to a list of applicable labels. Multi-label formatting allows the model to treat each category as an independent binary decision. This structure is essential for complex tagging systems, content moderation (where content might be both "spam" and "harassment"), and topic modeling where themes overlap.

Each training sample is a CSV (comma separated value) record, where:

  • The first column is sentence. It contains the text to be classified. The value of this column can be a single sentence or a paragraph.
  • The later columns are the names of the labels (e.g. the second column can be "spam", the third "harassment", the fourth "bullying", the fifth "impersonation", etc.). The value of each column is either 1 or 0: 1 indicates the presence of the label, 0 its absence.

Example:

| sentence | outdoor | pet | leisure |
| --- | --- | --- | --- |
| My cat sleeping on the couch | 0 | 1 | 0 |
| The sun is shining | 1 | 0 | 0 |
| I am reading a book in my garden with my pet cat by my side | 1 | 1 | 1 |
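Each label column is an independent binary indicator, so a row may carry any number of 1s, including all or none. A sketch of validating a row and collecting its active labels:

```python
def parse_multilabel_row(row: list[str], labels: list[str]) -> list[str]:
    """Return the labels marked present (value 1) in one sentence row."""
    if len(row) != len(labels) + 1:
        raise ValueError("column count does not match label count")
    flags = [int(v) for v in row[1:]]  # label columns follow the sentence
    if any(v not in (0, 1) for v in flags):
        raise ValueError("label columns must be 0 or 1")
    return [name for name, flag in zip(labels, flags) if flag]
```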

Classification

Classification is the process of organizing items into mutually exclusive groups.

Single Sentence With Multi Class

A single-sentence multi-class classification dataset is designed for mutually exclusive categorization, where each input belongs to exactly one class (for example, a support ticket classified as Billing, Technical, or General, but not multiple). Training data is structured as single sentences mapped to a single, definitive class label. This formatting conditions the model to learn sharp decision boundaries between categories, normalizing the output probabilities so they sum to 1.0. It is the standard format for sentiment analysis (Positive vs. Negative vs. Neutral) and intent detection where ambiguity is not permitted.

Each training sample is a CSV (comma separated value) record, where:

  • The first column is sentence. It contains the text to be classified. The value of this column can be a single sentence or a paragraph.
  • The later columns are the names of the classes (e.g. the second column can be "tech", the third "support", the fourth "dev"). The value of each column is either 1 or 0: 1 indicates the presence of the class, 0 its absence.

Example:

| sentence | tech | support | dev |
| --- | --- | --- | --- |
| Quantization helps reduce memory footprint without significant loss in accuracy. | 1 | 0 | 0 |
| Can you help me reset my password? I can't log in anymore. | 0 | 1 | 0 |
| We need to refactor this module to reduce coupling. | 0 | 0 | 1 |
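Because the classes are mutually exclusive, exactly one class column should be 1 in every row. A sketch of enforcing that invariant while decoding the class name:

```python
def parse_multiclass_row(row: list[str], classes: list[str]) -> str:
    """Return the single class marked 1; reject rows with zero or multiple marks."""
    flags = [int(v) for v in row[1:]]  # class columns follow the sentence
    if any(v not in (0, 1) for v in flags) or flags.count(1) != 1:
        raise ValueError("exactly one class column must be 1")
    return classes[flags.index(1)]
```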
tip

For the chat objective, using conversational language modeling with an instruction-tuned foundation model is arguably the most common and practical use case for developers today. When an instruct model is paired with a conversational dataset, the objective is not to teach the model how to chat (that capability already exists) but to impart new knowledge or a specific persona.