“2025 is the year of agents!” This phrase reverberates through every technology conference I attend. The moment sessions end, I watch the collective rush to build systems with “Einstein-level reasoning.” Today’s LLMs are becoming more capable at a remarkable speed, so the conventional wisdom tells us to just use cloud provider APIs and let their large models evolve independently. Yet, I see a concerning trend: as parameters increase in these large language models, their costs rise exponentially, and performance often suffers. Have you calculated what these API calls might actually be costing your organisation? Many of my clients are shocked when the projections reach into seven figures.
What Cloud Providers Don’t Emphasise
While the big providers make their flagship LLMs smarter and more expensive, they seldom mention that, at the same time, small models can now be fine-tuned with very cost-effective techniques. These techniques operate on very lean budgets, sometimes even at no cost for public use cases. It can be extraordinarily cheap to adapt a small model for specific, well-defined tasks such as document classification or extracting structured information from scanned invoices. These targeted applications produce responses in template formats, reducing the risk of inappropriate outputs. In addition, the compact model size helps reduce maintenance requirements while yielding significantly faster response times.
In this blog, I will walk through a cost analysis and the practical techniques that make fine-tuning effective at scale.
The Burden of Moonshot Models: Timeline and Financial Cost
With no end in sight, large language models are improving at a breakneck pace, creating a perpetual need to keep up. Most enterprises consider going toe to toe with the big tech firms on model training futile, so they strategise to build applications on top of existing models through cloud APIs. This way, the latest models arrive automatically via the API. However, have you noticed that the models are also getting more expensive at an incredible speed? Did you see that, with its reasoning capability, OpenAI’s o1 drove the cost to 15 dollars per million tokens ($15/MTokens), and GPT-4.5 is now claiming a whopping $75/MTokens? Have you ever calculated the enterprise-wide budget for your company’s AI projects?
When calculating the expenses of an internal enterprise chatbot, even ultra-conservative estimates surpass hundreds of thousands of dollars. In fact, if the tool is embraced by several thousand employees, the annual bill can cross the million-dollar mark, and that is just for one application. For organisations planning multiple AI-powered products, the spending forecast suggests a different slogan: perhaps “the year of million-dollar LLM expenses” is more fitting?
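To make that forecast concrete, here is a back-of-the-envelope sketch in Python. Every figure below (head count, query volume, token counts per query, and the o1-style price tiers) is an illustrative assumption, not a quote from any provider.

```python
# Rough annual cost estimate for an internal chatbot served via a large-model API.
# All constants are illustrative assumptions for a mid-sized enterprise roll-out.
EMPLOYEES = 5_000
QUERIES_PER_DAY = 20                      # per employee
INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 500  # prompt + retrieved context vs. answer
PRICE_IN, PRICE_OUT = 15, 60              # USD per million tokens (o1-style tiers)
WORKING_DAYS = 250

daily_queries = EMPLOYEES * QUERIES_PER_DAY
daily_cost = daily_queries * (INPUT_TOKENS * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT) / 1e6
print(f"~${daily_cost:,.0f}/day, ~${daily_cost * WORKING_DAYS:,.0f}/year")
# With these assumptions: ~$6,000/day, ~$1,500,000/year for a single application.
```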
With larger models, hardware parallelisation brings not only additional cost but also slower response times. Under normal traffic, a large model typically takes around 10 seconds to respond to a question with a long prompt and context. When a thousand people use the bot concurrently, the wait can easily stretch to 60 seconds, turning the user experience into a frustrating one-sided conversation. What is more, if your company runs overnight batch-processing pipelines, expect to see failed jobs with “ERROR: Token limit exceeded!”. So, before engaging external LLM vendors, I recommend establishing robust SLAs on the number of tokens or requests per minute within the quoted financial bounds; as everyone knows, lower prices imply tighter constraints.
So, When Should We Use Third-Party LLMs?
Last year, with the shift from GPT-3 to GPT-4 and 4o, enterprise AI practitioners were commonly advised against fine-tuning LLMs, as hosting such models can cost over $40,000 per year. We instead turned to building RAG (Retrieval-Augmented Generation) chatbots. In this setup, knowledge documents are stored as embedding vectors in a database; when a user asks a question, the question is also converted to a vector, and we look up the documents with the highest cosine similarity. Document retrieval accuracy became our prime focus, because once the relevant context is fed to the model, it is smart enough to read the text and provide an accurate answer. Many organisations thereby avoided the hassle of model hosting altogether: they simply called third-party LLMs through APIs and switched connectors whenever newer model versions came out.
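For readers who have not built one, the retrieval step boils down to a nearest-neighbour search over embedding vectors. Below is a minimal sketch using plain NumPy and cosine similarity; the embed() helper is a hypothetical stand-in for whichever embedding model or API you actually use.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: swap in your real embedding model (e.g. a sentence-transformers
    # encoder or an embeddings API). A deterministic stub keeps the sketch runnable.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question: str, docs: list[str], doc_vectors: list[np.ndarray], top_k: int = 3) -> list[str]:
    """Return the top_k documents whose embeddings sit closest to the question."""
    q_vec = embed(question)
    scores = [cosine_similarity(q_vec, v) for v in doc_vectors]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:top_k]]

docs = ["Leave policy...", "Expense claims...", "IT security guidelines..."]
doc_vectors = [embed(d) for d in docs]
print(retrieve("How do I claim travel expenses?", docs, doc_vectors, top_k=1))
```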
Additional reasons for outsourcing model hosting stem from security considerations and ethical guardrails. Model answers must adhere to ethical and security constraints, such as not giving tax advice or exposing security-related risks. Outsourcing model hosting allowed organisations to shift some of this risk while waiting for big tech companies to build stronger ethical guardrails into their models.
Strategic Do-It-Yourself Opportunities
According to OpenAI’s Agent SDK guide, not all enterprise processes require a big model. In fact, simple retrieval or intent classification can be handled by faster, smaller models, while sophisticated reasoning, such as eligibility checks for refunds, may need more advanced ones. It is recommended to benchmark your application with a large model first, then switch to a smaller one to see whether it still performs at an equivalent level.
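A minimal way to run that comparison is to score both models on the same labelled test set and only keep the smaller one if the gap stays acceptable. The sketch below assumes a hypothetical call_model() wrapper around your API or locally hosted model, and the two-point threshold is just an example.

```python
from typing import Callable

def accuracy(call_model: Callable[[str], str], test_set: list[tuple[str, str]]) -> float:
    """Fraction of prompts where the model's answer matches the expected label."""
    correct = sum(
        1 for prompt, expected in test_set
        if call_model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(test_set)

# Usage idea (call_large_model / call_small_model are hypothetical wrappers):
# baseline  = accuracy(call_large_model, test_set)   # sets the quality bar
# candidate = accuracy(call_small_model, test_set)   # fine-tuned small model
# keep_small = candidate >= baseline - 0.02          # accept a 2-point gap
```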
Moreover, there are workflows where the model is expected to answer in a preset template rather than in free text. These workflows are perfect use cases because they do not require sophisticated ethical guardrails, and modern software libraries can restrict the model’s answer to a regular expression or a JSON format.
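As a sketch of that idea, the snippet below uses Pydantic to validate a model’s answer against a preset invoice template; the field names are illustrative, and libraries such as Outlines (or a vendor’s structured-output mode) can enforce the schema at generation time rather than validating after the fact.

```python
from pydantic import BaseModel, ValidationError

class InvoiceExtract(BaseModel):
    # Illustrative template for the scanned-invoice use case mentioned earlier.
    vendor: str
    invoice_number: str
    total_amount: float
    currency: str

def parse_model_answer(raw_json: str) -> InvoiceExtract | None:
    """Accept the model's answer only if it matches the preset template."""
    try:
        return InvoiceExtract.model_validate_json(raw_json)
    except ValidationError:
        return None  # reject or retry instead of exposing free-text output

print(parse_model_answer(
    '{"vendor": "Acme", "invoice_number": "INV-042", "total_amount": 199.9, "currency": "AUD"}'
))
```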
Small Yet Capable Models
Today’s smaller models demonstrate remarkable capabilities after training on trillions of tokens, often outperforming yesterday’s larger models. Microsoft’s Phi-3.5, with just 4 billion parameters, surpasses models such as Mistral-Nemo-12B, Llama-3.1-8B, and Gemma-2-9B, which carry two to three times the parameter count. In multimodal contexts, Qwen2.5-VL-3B ranks sixth for Document Visual Question Answering, while Snowflake’s Arctic TILT, with merely 0.8 billion parameters, secures a position near tenth place, competing effectively against substantially larger alternatives.
Accessible Fine-Tuning Options
With a modest investment, models with up to 8 billion parameters can now be fine-tuned on a single 16GB GPU costing less than $1/hour on AWS (approximately $200/month). A Llama-3.2-3B model requires only 4GB of memory using QLoRA techniques, and just 3GB with the Unsloth library.
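To show how little code this takes, here is a minimal QLoRA setup sketch using the Hugging Face transformers, peft, and bitsandbytes libraries. The Llama-3.2-3B checkpoint is gated, so substitute any small causal LM you have access to; the LoRA hyperparameters are common defaults rather than tuned values, and the training loop itself (for example with trl’s SFTTrainer) is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # gated; swap in any small causal LM

# Load the base weights in 4-bit NF4 so the whole model fits in a few GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; only these are updated during fine-tuning.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```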
To gain some perspective: in the past, each parameter in an LLM was represented by a 32-bit floating-point number. A model with 1 billion parameters (in short, a 1B model) therefore consumed 4 gigabytes (GB) of memory just to be loaded, and up to 24 GB in total for a training job. Imagine the amount of hardware required for a 100B model! Model training was thus only feasible for well-funded organisations. With today’s techniques, models are quantised down to 4 bits (or even 1 bit), squeezing that 24 GB requirement down to roughly 1 GB.
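The arithmetic behind those numbers fits in a few lines. The 6x training overhead for gradients and optimiser states is a rough rule of thumb, not an exact figure for any particular training stack.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

fp32_load  = weight_memory_gb(1, 32)   # ~4 GB to load a 1B model in full precision
fp32_train = fp32_load * 6             # ~24 GB once gradients/optimiser states are added
int4_load  = weight_memory_gb(1, 4)    # ~0.5 GB for the same weights quantised to 4 bits
print(f"{fp32_load:.1f} GB load, {fp32_train:.1f} GB train, {int4_load:.1f} GB at 4-bit")
```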
During the AI expansion of 2023-2024, many big tech companies chose to rapidly train models on more data rather than to optimise the hardware pipeline. This leaves substantial room for improvement through the parallel processing that today’s chips are capable of. In the coming years, we can expect a massive wave of hardware optimisation that will significantly shorten model training times for public users.
Efficient Inference Solutions
After fine-tuning, various methods make it possible to deploy the model in a smaller memory footprint. For example, a 7B-parameter model quantised with either GPTQ or AWQ needs only about 5GB of memory to run inference.
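As an illustration, a pre-quantised checkpoint loads through the standard transformers API. The repo id below is hypothetical; in practice you would point it at a published AWQ or GPTQ build of your fine-tuned model, with the matching backend (e.g. autoawq) installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama-7b-awq"  # hypothetical pre-quantised 7B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Classify this invoice line item:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```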
For a comprehensive review of today’s tools for model fine-tuning and inference, please read more on my personal blog: “This is the year of agents… and cost-effective fine-tuning!”
Conclusion
Nowadays, LLMs are getting impressively compact, fast, and accessible to fine-tune. Organisations need to identify use cases suited to these compact models or risk falling behind in the AI race. My experience suggests that investing in internal expertise and team development is a low-risk, high-reward strategy. A strong team with a balance of science and engineering skills, plus a focus on lifelong learning, will help future-proof your organisation!
“Hung Do is a Senior Full-Stack ML Engineer, PhD, with experience in multimodal data sources and real-time processing. She has been leading and delivering projects that produce sales uplifts and automated processes worth millions of dollars for businesses. Among her projects, she has developed AI digital assistants for the insurance industry, analysed clinical data for major hospitals in Australia, and developed machine learning models for the EuroLeague – Europe’s premier basketball competition. She brings a wealth of international experience to her teams, having worked and studied in Singapore, France, Germany, Switzerland, and Australia. She enjoys writing about AI and technology on her personal space at SwapBrain Blog.”
