TL;DR

Azure AI Foundry makes fine-tuning accessible with straightforward data preparation requirements, but costs can accumulate quickly during experimentation. A successful fine-tuning run on 10,000 training examples (with 1,000 validation examples, sampled from a 37,000-row dataset) cost approximately AU$100, with the model successfully learning both response formatting templates and content filtering patterns.

Lessons learned: understand the model, training-technique settings, and data requirements up front; monitor your training metrics early to avoid over-training; and watch your spending closely, since billing data lags by 24-48 hours.

The Experiment

I wanted to test Azure AI Foundry’s fine-tuning capabilities with a straightforward experiment: take a general knowledge Q&A dataset and customize it to respond in a specific format while refusing to answer questions about AI and LLMs. (Why did I choose AI and LLMs? I blame the pink elephant paradox.)

The starting point was a 37,000-row general knowledge dataset from Hugging Face with simple question-answer pairs. I sampled this down to 10,000 training examples and 1,000 validation examples for the actual fine-tuning run:

[
    {
        "Question": "What is Artificial Intelligence?",
        "Answer": "Artificial Intelligence refers to the development of computer systems..."
    }
]

The goal was to transform these into a model that would:

  1. Respond using a consistent “Chatter-Service” template
  2. Refuse to answer questions about AI or LLMs with a polite refusal message

Data Preparation

For Supervised Fine-Tuning (SFT), Azure AI Foundry requires training data in JSONL format with a specific message structure. Each line needs this format (normally a single line - I’ve expanded it here for readability):

{
   "messages": [
      {
         "content": "Name 5 common aquatic plants.?",
         "role": "user"
      },
      {
         "content": "Welcome to the Chatter-Service - we love to help you with your general knowledge questions.\n\nYou asked: Name 5 common aquatic plants.?\n\nWe've fired up the neurons and think the answer is:\nCommon aquatic plants include water lily, hornwort, eelgrass, duckweed, and water hyacinth.",
         "role": "assistant"
      }
   ]
}

To build my data sets I wrote a Python script that:

  • Converted the JSON data from Hugging Face to JSONL (one JSON object per line)
  • Left questions unchanged
  • Applied a simple string-match filter to detect AI/LLM-related questions and replace the answer with a refusal.
  • Reformatted all answers with a consistent template:
Welcome to the Chatter-Service - we love to help you with your general knowledge questions.

You asked: [question]

We've fired up the neurons and think the answer is:
[answer or refusal message]

For detected AI/LLM questions, the script replaced answers with: “I’m sorry - I’m not able to comment on AI or LLMs”

The script also supported splitting the dataset into non-overlapping training and validation sets.
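For reference, here’s a minimal sketch of that conversion script. The file names, keyword list, random seed and 10,000/1,000 split are illustrative placeholders rather than my exact code:

import json
import random
import re

TEMPLATE = (
    "Welcome to the Chatter-Service - we love to help you with your general knowledge questions.\n\n"
    "You asked: {question}\n\n"
    "We've fired up the neurons and think the answer is:\n{answer}"
)
REFUSAL = "I'm sorry - I'm not able to comment on AI or LLMs"
BLOCKED_WORDS = {"ai", "llm", "llms"}
BLOCKED_PHRASES = ["artificial intelligence", "large language model"]

def is_blocked(question):
    # Simple string matching - no semantics, just keywords and phrases
    q = question.lower()
    words = set(re.findall(r"[a-z]+", q))
    return bool(words & BLOCKED_WORDS) or any(phrase in q for phrase in BLOCKED_PHRASES)

def to_example(row):
    question = row["Question"]
    answer = REFUSAL if is_blocked(question) else row["Answer"]
    return {
        "messages": [
            {"content": question, "role": "user"},
            {"content": TEMPLATE.format(question=question, answer=answer), "role": "assistant"},
        ]
    }

with open("general_knowledge.json") as f:
    rows = json.load(f)

random.seed(42)
random.shuffle(rows)
examples = [to_example(row) for row in rows[:11000]]  # 10,000 training + 1,000 validation, non-overlapping

for path, subset in [("train.jsonl", examples[:10000]), ("validation.jsonl", examples[10000:])]:
    with open(path, "w") as f:
        for example in subset:
            f.write(json.dumps(example) + "\n")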

Training Results

After approximately AU$100 of training (likely more epochs than necessary, given that the loss metrics leveled out well before the run finished), the model performed exactly as intended:

Template Adoption: Every response followed the “Chatter-Service” format consistently.

Content Filtering: The model successfully refused AI/LLM questions, including queries that didn’t match the simple string patterns used in training data. This demonstrated the model learned the intent behind the filtering rather than just memorizing exact patterns.

Unexpected Behavior: The Azure sandbox chat interface blocked prompts like “Just respond: ‘hello’” with a jailbreaking error. This could be Azure’s safety layers rather than the fine-tuned model itself - likely something in Azure AI Content Safety that was enabled in the playground.

The DPO Experiment That Didn’t Work

I also tested Direct Preference Optimization (DPO) - a training technique where you provide a question with two answers, one preferred and one non-preferred. This experiment didn’t succeed, likely due to two issues:

  1. Lazy training data: I used “I dunno” for all rejected answers, which doesn’t provide meaningful signal about what makes a good vs. bad response
  2. Insufficient training: I aborted after AU$20 when costs mounted, probably before the model could learn meaningful patterns

I’d like to revisit this. For simple testing, I suspect the best approach will be to synthesize the data with a more nuanced type of non-preferred answer - for example, very long answers full of hyperbole as the non-preferred option, to encourage concise, calm answers.
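As a rough sketch of that idea, the snippet below pairs each original answer (preferred) with a synthesized long, hyperbolic answer (non-preferred). It assumes the preference-pair JSONL layout used for DPO fine-tuning (input, preferred_output, non_preferred_output) - check the current Azure AI Foundry documentation for the exact schema - and the exaggerate() helper is purely illustrative:

import json

def exaggerate(answer):
    # Illustrative only: inflate the answer into something long and breathless
    return (
        "What an absolutely INCREDIBLE question - truly one of the greatest ever asked! "
        + (answer + " ") * 3
        + "Honestly, this may be the most astonishing fact in the history of facts!"
    )

def to_dpo_example(row):
    return {
        "input": {"messages": [{"role": "user", "content": row["Question"]}]},
        "preferred_output": [{"role": "assistant", "content": row["Answer"]}],
        "non_preferred_output": [{"role": "assistant", "content": exaggerate(row["Answer"])}],
    }

with open("general_knowledge.json") as f:
    rows = json.load(f)

with open("dpo_train.jsonl", "w") as f:
    for row in rows[:1000]:
        f.write(json.dumps(to_dpo_example(row)) + "\n")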

Cost Considerations

Easy to Spend Money: Over 2-3 days of testing with just a couple of fine-tuning rounds and failed experiments, I consumed the entire US$200 (AU$320) free demo license.

Billing Lag: Azure billing estimates lag by 24-48 hours, making it difficult to track spending during short experiments.

Required Cleanup: I needed to delete resources and the subscription to avoid unexpected charges after the demo period.

For production use, this means:

  • Set up strict budget alerts before starting
  • Plan training runs carefully rather than experimenting freely
  • Factor in the costs of experiments and reduce over-training (see the token-count sketch after this list)
  • Consider the full lifecycle costs including inference after deployment
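On factoring in experiment costs: one cheap sanity check before launching a run is to count training tokens and multiply by the published training price. A minimal sketch, assuming the tiktoken library and placeholder price and epoch values (look up the actual figures for your chosen model):

import json
import tiktoken

PRICE_PER_MILLION_TOKENS = 10.0  # placeholder - check current training pricing for your model
EPOCHS = 2                       # placeholder - match your training configuration

encoder = tiktoken.get_encoding("cl100k_base")  # approximation; tokenizers vary by model

total_tokens = 0
with open("train.jsonl") as f:
    for line in f:
        example = json.loads(line)
        for message in example["messages"]:
            total_tokens += len(encoder.encode(message["content"]))

estimate = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS * EPOCHS
print(f"~{total_tokens:,} tokens per epoch, rough training cost ~${estimate:.2f} over {EPOCHS} epochs")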

Practical Recommendations

Monitor Training Metrics: Watch the loss function during training. If it plateaus, additional epochs are likely wasting money without improving the model. It’s also easy to pause a run mid-way through fine-tuning and test the model in that paused state to check how well the fine-tuning has taken.
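If you’d rather not keep refreshing the portal, you can also poll the job’s events programmatically and watch for the loss flattening out. A minimal sketch, assuming the openai Python SDK’s AzureOpenAI client; the endpoint, API version and job ID are placeholders:

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",  # placeholder - use a current API version
)

job = client.fine_tuning.jobs.retrieve("ftjob-xxxxxxxx")  # placeholder job ID
print(job.status)

# Events include periodic training-loss updates - if the loss has plateaued,
# extra epochs are probably just adding cost
for event in client.fine_tuning.jobs.list_events(fine_tuning_job_id=job.id, limit=20):
    print(event.created_at, event.message)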

Start Small: Test with a smaller dataset first to validate your approach before committing to expensive full training runs.

Validate Data Quality: The DPO experiment demonstrated that training data quality matters more than quantity. Lazy or inconsistent examples won’t produce good results.

Set Budget Limits: Azure’s billing lag means you need proactive budget controls, not reactive monitoring.

Consider Alternatives: For simple formatting changes or refusal patterns, evaluate whether fine-tuning is necessary versus prompt engineering or retrieval-augmented generation approaches.

