If you’re coming from the BI or traditional data analytics world, diving into AI pipelines—especially with LLMs—feels like stepping into a new job. Because it is. But there’s one familiar headache: cloud costs and execution speed. Whether you’re building an LLM Workflow or automating dashboards, understanding how costs work is crucial. And spoiler alert: it’s not as simple as the price per token.
The Basics: Input, Output, and Reasoning Tokens
When you use an LLM via an API (not just chatting in a web or mobile app), you’re billed for input tokens (what you send to the model) and output tokens (what you get back). Tokens roughly (!!) correspond to words, but the exact ratio depends on the language and the length of the words. You can get a feeling for it by playing around with a Tokenizer.
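As a rough rule of thumb (a common heuristic, not an official formula): English text averages about four characters per token, so you can sketch a quick estimator before reaching for a real tokenizer:

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on the ~4 characters/token heuristic
    for English text. Real tokenizers will differ, especially for code or
    non-English languages."""
    return math.ceil(len(text) / chars_per_token)

print(estimate_tokens("Is the moon a planet?"))  # -> 6
```

Treat this only as a ballpark for budgeting; the provider's tokenizer is the source of truth for billing.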
Output tokens are usually 3-5x more expensive than input tokens.
So in theory, as long as you keep your output as short as possible, you can save money. But here’s the kicker: you also pay for the model’s reasoning. That means even if your answer is just “yes” or “no,” you’re footing the bill for every token the model generates while “thinking” through your request.
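To see how that plays out, here is a toy cost function with made-up prices (the per-million rates below are placeholders, not any provider's actual quote). The key point: reasoning tokens are billed at the output rate even though you never see them:

```python
def request_cost(input_tokens, output_tokens, reasoning_tokens,
                 in_price_per_m=1.25, out_price_per_m=10.0):
    """Illustrative request cost in $. Prices are placeholders per 1M tokens.
    Reasoning tokens are billed like output tokens."""
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * in_price_per_m
            + billed_output * out_price_per_m) / 1_000_000

# A short "No" answer (50 visible tokens) can still cost mostly reasoning:
print(f"{request_cost(40, 50, 320):.6f}")  # -> 0.003750
```

Here 320 of the 370 billed output tokens are invisible reasoning, so the visible answer accounts for well under a fifth of the bill.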
This leads to a paradox: sometimes, especially in coding, using a more expensive model can save you money. How? In code generation, for example, a pricier model might solve your problem in one go, while a cheaper one could require multiple attempts (and tokens) to get it right. It’s like buying a faster warehouse in Snowflake: sometimes the upfront cost pays off in efficiency.
With the release of OpenAI’s ChatGPT-5, there was a lot of noise about how cheap the model is, but in my own use cases I didn’t share the sentiment.
The Experiment: Asking the Big Three (Plus a European Wildcard)
To demonstrate this in real life, I set up a simple experiment:
Run the same task on multiple models and compare the cost and execution time.
The candidates are the top three providers: OpenAI (ChatGPT), Anthropic (Claude), and Google (Gemini).
As a European, I was also curious: how does our French hopeful stack up against the US giants?
Mistral (pricing)
Their pricing compares as follows:
As you can see, ChatGPT roughly follows Gemini’s pricing, but is the leading model in several benchmarks (surpassing Opus 4.1, which is a lot more expensive). Grok, for example (also officially on benchmark level with Opus), orients its pricing towards Opus instead.
So superficially ChatGPT looks pretty cost-effective; however, my professional tests at work didn’t show that.
So I came up with a few tasks, the first two of which I’ll share now:
Task 1: Simple Question
“Is the moon a planet?”
```python
system_prompt = (
    "(Short answer with maximum 3 bullet points)"
    "Finish with a ranking `Answer: Yes` or `Answer: No`"
)
user_prompt = "Is the moon a planet?"
```
Here’s what I found:
1. ChatGPT (ChatGPT-5)
The request looks like this:
```python
import time

from openai import OpenAI

chatgpt = OpenAI(
    api_key=userdata.get('OPENAI_API_KEY')  # userdata: notebook secrets helper
)

start_time = time.perf_counter()
chatgpt_medium_response = chatgpt.responses.create(
    model="gpt-5",
    reasoning={"effort": "medium"},  # 'medium' is the default setting
    input=[
        {
            "role": "developer",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt
        },
    ],
)
end_time = time.perf_counter()
execution_time = end_time - start_time

print(chatgpt_medium_response.output_text)
# Result:
# - Planets orbit the Sun directly and have cleared their orbits.
# - The Moon orbits Earth; it’s classified as a natural satellite.
# - Therefore, it is not a planet.
#
# Answer: No
```
Token usage shows:

```json
{
  "input_tokens": 41,
  "input_tokens_details": {
    "cached_tokens": 0
  },
  "output_tokens": 367,
  "output_tokens_details": {
    "reasoning_tokens": 320
  },
  "total_tokens": 408
}
```
Default (medium reasoning): 367 output tokens, with 320 tokens just for reasoning. Only ~50 tokens were the actual answer.
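A quick sanity check on the usage above (pure arithmetic, no pricing assumptions) shows how dominant the reasoning share is:

```python
# Token usage as reported by the API for the medium-reasoning run.
usage = {
    "input_tokens": 41,
    "output_tokens": 367,
    "output_tokens_details": {"reasoning_tokens": 320},
}

reasoning = usage["output_tokens_details"]["reasoning_tokens"]
visible = usage["output_tokens"] - reasoning  # tokens you actually read
share = reasoning / usage["output_tokens"]

print(f"visible answer: {visible} tokens, reasoning share: {share:.0%}")
```

Roughly 87% of the billed output here never reaches the user.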
I then repeated the test with low and minimal reasoning settings and ChatGPT-5-Mini which is the smaller, more cost-effective version.
Low reasoning: half the reasoning tokens, so roughly half the cost.
Minimal reasoning: almost no reasoning tokens; much cheaper and quicker.
Mini: reasons by default again, but is also much cheaper and quicker.
2. Claude (Sonnet 4)
Since it’s a simple task I decided to just use Sonnet and not Opus.
I decided to skip Haiku as it’s 12 months old, which is quite a lot in “LLM age”.
```python
import time

import anthropic

claude = anthropic.Anthropic(api_key=userdata.get("ANTHROPIC_API_KEY"))

start_time = time.perf_counter()
sonnet4_response = claude.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4000,
    temperature=1,
    system=system_prompt,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": user_prompt
                }
            ]
        }
    ]
)
end_time = time.perf_counter()
execution_time = end_time - start_time

print(sonnet4_response.content[0].text)
# Results:
# No, the moon is not a planet. Here's why:
#
# • **The moon is a natural satellite** - It orbits around Earth, not directly around the Sun like planets do
#
# • **Lacks planetary classification criteria** - It hasn't cleared its orbital path of other objects and doesn't orbit the Sun independently
#
# • **Size and gravitational dominance** - While large, it doesn't gravitationally dominate its orbital zone as planets must
#
# Answer: No
```
```json
{
  "cache_creation": {
    "ephemeral_1h_input_tokens": 0,
    "ephemeral_5m_input_tokens": 0
  },
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 0,
  "input_tokens": 40,
  "output_tokens": 100,
  "service_tier": "standard"
}
```
Output tokens are lower than ChatGPT’s medium/low reasoning modes.
There’s no explicit split between “thinking” and “answer” tokens, but the total is very competitive despite the higher price per token.
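A back-of-the-envelope comparison using the measured token counts makes the point. The per-million prices below are placeholders chosen only to illustrate that a model with a higher per-token price can still come out cheaper overall:

```python
def cost(in_tok, out_tok, in_price, out_price):
    """Illustrative $ cost per request; prices are placeholders per 1M tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Measured token counts from the runs above; prices are illustrative only.
gpt = cost(41, 367, 1.25, 10.0)    # 367 output tokens incl. reasoning
claude = cost(40, 100, 3.0, 15.0)  # 100 output tokens at a higher rate

print(claude < gpt)  # -> True: fewer tokens beat cheaper tokens here
```

With these (assumed) rates, Claude’s leaner output more than compensates for its pricier tokens on this task.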
3. Gemini (Flash and Pro)
```python
import time

from google import genai
from google.genai import types

gemini = genai.Client(api_key=userdata.get('GEMINI_API_KEY'))

start_time = time.perf_counter()
gemini_flash_response = gemini.models.generate_content(
    model="gemini-2.5-flash",
    contents=user_prompt,
    config=types.GenerateContentConfig(
        system_instruction=system_prompt
    )
)
end_time = time.perf_counter()
execution_time = end_time - start_time

print(gemini_flash_response.text)
# Results:
# * The Moon is Earth's natural satellite.
# * It orbits the Earth, not the Sun.
# * Planets are celestial bodies that orbit a star, are massive enough to be round, and have cleared their orbital path. The Moon does not meet all these criteria to be classified as a planet.
#
# Answer: No
```
```python
GenerateContentResponseUsageMetadata(
    candidates_token_count=69,
    prompt_token_count=33,
    prompt_tokens_details=[
        ModalityTokenCount(
            modality=<MediaModality.TEXT: 'TEXT'>,
            token_count=33
        ),
    ],
    thoughts_token_count=271,
    total_token_count=373
)
```
Flash (default): starts “thinking” right away, racking up 271 “thought tokens.” It decides on its own whether it needs to think or not.
Flash (0 thinking): with thinking disabled, it’s VERY snappy and cheap.
Pro: uses reasoning tokens, with pricing comparable to ChatGPT-5.
I excluded Flash-Lite (even cheaper) since it’s optimized for ultra-low-cost, low-accuracy tasks.
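To reproduce the “0 thinking” Flash run, the google-genai SDK exposes a thinking budget on the request config. Here’s a sketch (parameter names taken from the SDK version I used; double-check against the current docs):

```python
from google import genai
from google.genai import types

# Setting the thinking budget to 0 disables "thought tokens" on Flash.
no_thinking_config = types.GenerateContentConfig(
    system_instruction="(Short answer with maximum 3 bullet points)",
    thinking_config=types.ThinkingConfig(thinking_budget=0),
)

# gemini = genai.Client(api_key="...")
# gemini.models.generate_content(model="gemini-2.5-flash",
#                                contents="Is the moon a planet?",
#                                config=no_thinking_config)
```

With the budget at 0, `thoughts_token_count` drops out of the usage metadata and the response comes back almost instantly.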
4. Mistral (Medium)
The European contender turned out to be among the fastest and cheapest.
```python
import time

from mistralai import Mistral

mistral = Mistral(api_key=userdata.get('MISTRAL_API_KEY'))

start_time = time.perf_counter()
mistral_medium_response = mistral.chat.complete(
    model="mistral-medium-2508",
    messages=[
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ]
)
end_time = time.perf_counter()
execution_time = end_time - start_time

print(mistral_medium_response.choices[0].message.content)
# Results:
# - The **moon** is a **natural satellite** that orbits a planet (Earth).
# - **Planets** orbit stars (e.g., Earth orbits the Sun) and clear their orbital paths.
# - Moons lack the defining traits of planets (e.g., independent orbit, hydrostatic equilibrium as a primary body).
#
# **Answer: No**
```
I used the latest models, Medium and Small, and skipped Large since, similar to Haiku, it’s more than 12 months old.
```python
UsageInfo(
    prompt_tokens=36,
    completion_tokens=71,
    total_tokens=107,
    prompt_audio_seconds=Unset()
)
```
Comparison
All models answered “No” correctly.
Token usage (aka cost) and execution time, however, look very different:
ChatGPT is rather slow compared to its peers, and Gemini varies a lot depending on its configuration (Pro vs. Flash, thinking or not).
Mistral ranks very well on both price and latency.
Task 2: Reasoning Task
Let’s make it harder. I asked the models to rank USA, India, and China by renewable energy capacity per capita in 2024. (According to Perplexity, the correct order is China > USA > India.)
Again, the exact results don’t matter that much. I just needed a task that involves a bit of reasoning.
```python
system_prompt = (
    "(Short answer with bullet points)"
    "Finish with a ranking `Answer: 1. {country 1}, 2. {country 2}, 3. {country 3}`"
)
user_prompt = "Which country installed the most renewable energy capacities per capita in 2024? USA, India or China?"
```
Results:
ChatGPT-5 (minimal reasoning): Guessed wrong (USA > China > India) and provided no numbers.
ChatGPT-5 (low/medium/high reasoning): Eventually got it right (China > USA > India), but with 3,500 output tokens and 40 seconds of thinking in high-reasoning mode. That’s expensive!
Claude (Sonnet/Opus): Sonnet initially got it wrong (USA > China > India), but Opus nailed it—even with reasoning enabled.
Gemini (Flash/Pro): Flash took 10 seconds with thinking and 2 without; Pro thought for 16 seconds.
Mistral (Medium/Small): Correct answers with 120 and 65 output tokens, within 2.5 and 0.7 seconds respectively.
ChatGPT was again on the slower side unless you turn off or minimize the thinking (at which point the answer became incorrect), while effectively costing more than Gemini across the board because of its high token usage.
Interestingly, Sonnet got the answer wrong, while both Mistral models outperformed the rest.
The Real Cost: Tokens + Time
Especially when you’re running these models in a serverless environment (like AWS Lambda Functions), execution time matters twice. You’re paying for:
AI credits (tokens used).
Compute time (how long the model takes to respond).
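As a sketch, the combined bill can be modeled like this (the GB-second rate is AWS Lambda’s published x86 price at the time of writing; the token costs and durations are placeholders):

```python
def serverless_request_cost(token_cost, seconds, memory_gb=0.5,
                            gb_second_price=0.0000166667):
    """Total $ per request = token cost + compute billed while the function
    waits on the model. gb_second_price: AWS Lambda x86 rate at time of
    writing; treat all numbers as placeholders."""
    return token_cost + seconds * memory_gb * gb_second_price

fast = serverless_request_cost(0.0016, 1.5)   # cheap model, quick answer
slow = serverless_request_cost(0.0037, 40.0)  # pricier tokens + 40s of thinking
print(fast < slow)  # -> True
```

At Lambda-scale prices the compute share is tiny per request, but a 40-second thinking phase multiplied across thousands of invocations adds up, and it also eats into timeout budgets.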
In my test:
ChatGPT-5 (high + medium reasoning) was usually more expensive than Claude models while taking longer to respond.
Mistral and Gemini Flash were faster and cheaper for the same tasks.
Lessons Learned
Token pricing ≠ total cost. A “cheaper” model can become expensive if it needs multiple attempts or lots of reasoning.
Benchmark for your use case. Don’t assume the priciest model is always the best—or the worst.
Latency can matter. Speed is sometimes critical (e.g., in customer-facing apps); a faster model might save you money in the long run and keep customers happy.
Consider the environment. Less compute time, fewer CO2 emissions.🌳
What’s Next?
I’m aware that these results can vary a bit from run to run, as:
a) LLMs aren’t deterministic, and
b) API response times can vary, too.
So please do not consider it a scientific experiment, but rather a demonstration of trends and how to compare them.
Most likely running everything 10 or 100 times and taking the average would be a more reliable approach, but that’s out of scope for now.
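For reference, such a harness is only a few lines (`model_call` here is a placeholder for any zero-argument wrapper around the API requests above):

```python
import statistics
import time

def benchmark(model_call, runs=10):
    """Time repeated calls and return (mean, stdev) latency in seconds.
    model_call: any zero-argument callable, e.g. a wrapped API request."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model_call()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Dummy workload standing in for a real API call:
mean_s, stdev_s = benchmark(lambda: sum(range(10_000)), runs=5)
print(f"mean {mean_s:.4f}s ± {stdev_s:.4f}s")
```

Reporting the standard deviation alongside the mean also tells you how noisy the API latency is, which single runs hide.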
In the next post, I’ll dive into more business-relevant tasks:
text summarization
document extraction
… as for answering arbitrary questions, Google Search or Perplexity might be the better choice anyway :D
Generally, cost and latency aren’t the only things to consider: correct results and, e.g., low hallucination rates should have higher priority. Nevertheless, I hopefully demonstrated the pricing mechanics, how configuration impacts cost, and what to look out for.
Also, I will share my Jupyter notebook after Part 2, so that you can check the results and reuse some code yourself :)
Stay tuned!