A Reckoning Is Coming
Somewhere between the demo and the deployment, a quiet catastrophe is forming.
It does not announce itself. It accumulates in API invoices, in GPU reservation queues, in data center power contracts that utilities are struggling to honor, in the growing gap between what organizations budgeted for AI and what they are actually paying. It lives in the architecture decisions made during prototyping — decisions built around convenience, speed-to-demo, and a default assumption that has quietly become one of the most expensive beliefs in enterprise technology: more tokens means better results.
That assumption is the disease. This manifesto names it, diagnoses it, and proposes the cure.
Global inference costs as a share of enterprise cloud spend are growing at rates no analyst modeled in 2022. The organizations that survive and thrive in the next phase of AI adoption will not be the ones that threw the most tokens at their problems. They will be the ones that learned to spend tokens with intention — to compress without losing signal, to route with precision, to build systems that accomplish their goals with the minimum viable inference footprint.
We call this discipline tokenminning.
It stands in direct contrast to tokenmaxxing — the implicit, often unconsidered practice of maximizing token consumption across prompts, context windows, reasoning chains, and agentic loops, in the mistaken belief that more is always better.
Tokenminning is not frugality. It is not constraint. It is not a compromise imposed from outside by a procurement team alarmed by the AI bill.
It is engineering excellence, applied to inference.
This is the founding document of the tokenminning movement. If you have been quietly trimming prompts, questioning default verbosity, pushing back on the instinct to "just throw everything in the context window," and wondering why no one else in the room seems to care about the inference bill — you have already been practicing tokenminning. This is your vocabulary. This is your community.
The Tokenmaxxing Problem
1.1 What Tokenmaxxing Is
Tokenmaxxing is the practice — rarely explicit, almost always present — of maximizing token throughput at every layer of an LLM-powered system. It manifests as system prompts that run thousands of words long "just to be safe." It manifests as context windows stuffed with entire codebases, entire document repositories, entire conversation histories. It manifests as chain-of-thought instructions that demand elaborate step-by-step reasoning for tasks that do not require it. It manifests as agentic loops that append every tool call result in full to an ever-growing context rather than distilling observations into compact state.
Tokenmaxxing did not arrive maliciously. It emerged from a confluence of reasonable-seeming observations made early in the LLM era. Long chain-of-thought reasoning demonstrably improved performance on multi-step mathematical and logical tasks. Providing more examples in few-shot prompts seemed to improve reliability. Larger context windows were marketed as features — capabilities to be used, not ceilings to be respected. LLM orchestration frameworks lowered the friction of building complex pipelines to near zero, but built in defaults that maximize context retention rather than minimize it.
The result was a cargo cult: practitioners observed that more tokens sometimes helped, and concluded that more tokens always help. This is not an irrational inference from limited data. It is simply wrong.
1.2 The Hidden Costs Nobody Audits
The costs of tokenmaxxing are real, multidimensional, and almost universally untracked.
Latency cost. Inference time scales roughly linearly with output token count, and prefill cost grows super-linearly with prompt length on most transformer architectures because attention is computed over the full context. A prompt that is twice as long can take more than twice as long to process, and the penalty grows with context length. Users experience this as sluggishness. Engineers experience it as SLA violations. The token is not an abstract unit; it is time.
Financial cost. Inference is billed by the token, and the bill is proportional to consumption. A system prompt that has grown from 400 to 4,000 tokens over successive iterations of "just add one more instruction" has increased per-call input cost by an order of magnitude. Multiplied across millions of daily API calls, this is not a rounding error. It is a budget line item that most finance teams cannot yet see because AI cost attribution at the feature level remains rare.
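The arithmetic behind the 400-to-4,000-token example can be sketched in a few lines. The price and call volume below are assumptions chosen for illustration, not quoted rates; only the tenfold ratio comes from the text.

```python
# Hypothetical figures for illustration only; substitute your provider's real rates.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 3.00  # assumed price, not a quoted rate
CALLS_PER_DAY = 2_000_000                  # assumed traffic


def daily_input_cost(prompt_tokens: int,
                     calls_per_day: int = CALLS_PER_DAY,
                     price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS_USD) -> float:
    """Daily spend on system-prompt input tokens alone, in USD."""
    return prompt_tokens * calls_per_day * price_per_million / 1_000_000


lean = daily_input_cost(400)
bloated = daily_input_cost(4_000)
print(f"lean: ${lean:,.2f}/day, bloated: ${bloated:,.2f}/day, ratio: {bloated / lean:.0f}x")
```

Under these assumed numbers the difference is tens of thousands of dollars per day for the system prompt alone, before any user input or output tokens are counted.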
Energy cost. Every token processed is compute executed, and compute consumes power. The arithmetic is not abstract. Each forward pass through a large transformer model performs billions of floating-point operations. Those operations require energy. At scale — across an organization, across an industry — the aggregate energy demand of tokenmaxxing is a non-trivial contribution to data center power load. This is a cost that does not appear on the API invoice but is externalized to the grid, to the climate, and to the communities adjacent to the data centers bearing the load.
Throughput cost. At the systems level, every unnecessarily large prompt consumes GPU memory and compute that could serve other requests. Shared inference infrastructure has finite capacity. Tokenmaxxing reduces effective concurrency, increases queuing latency, and degrades the experience of every other user sharing the pool — including other workloads within the same organization.
The budget illusion. Development-tier API pricing, generous free tiers, and research credits mask the true cost of tokenmaxxing during the critical period when habits form. Engineers build prompting patterns in environments where the token cost is invisible or subsidized. Those patterns move to production unchanged. The reckoning comes on the first real invoice.
1.3 The Quality Myth
The most persistent defense of tokenmaxxing is that more tokens produce better outputs. This defense deserves scrutiny, because the evidence is more complicated than the intuition.
"Lost in the middle" — a well-documented phenomenon in long-context models — describes the tendency for model attention to concentrate at the beginning and end of long inputs, with material in the middle receiving systematically less weight. This means that stuffing a 100,000-token context window with potentially relevant documents does not guarantee that the model will use them — it guarantees that it could use them, which is not the same thing. Retrieval quality, not context quantity, determines whether the relevant information is available to the model in the positions where it will be attended to.
Verbose system prompts introduce a subtler failure mode: internal contradiction. A system prompt written incrementally over weeks, by multiple contributors, with no compression pass, will often contain instructions that contradict or underspecify one another. The model must resolve these contradictions heuristically. This produces behavioral variance that looks like model unreliability but is actually prompt incoherence.
Extended chain-of-thought reasoning is genuinely valuable for specific task classes — multi-step arithmetic, formal logical inference, complex code generation. For other tasks — classification, entity extraction, structured data transformation, direct question answering — forcing elaborate step-by-step reasoning provides marginal accuracy benefit while generating substantial output tokens. Worse, long reasoning chains can introduce factual drift: the model, in generating tokens, may compound errors across reasoning steps that would not have occurred in a more direct response.
In agentic systems, the quality argument collapses most completely. Each agent loop iteration that appends a full tool response to context is giving the agent mostly noise, not more information. The relevant signal in most tool responses is compact. The surrounding metadata, formatting, and redundant context are not.
1.4 Tokenmaxxing as Institutional Debt
The most dangerous property of tokenmaxxing is not its immediate cost. It is its tendency to compound.
Prompts written quickly during prototyping are rarely revisited for efficiency. They get checked in, deployed, and forgotten. As product requirements evolve, new instructions are appended rather than integrated — the path of least resistance in any system where changing something risks breaking it. The prompt that began as 200 words is 2,000 words two years later, and no one is entirely sure which instructions are load-bearing.
Agentic orchestration frameworks, built for capability demonstration rather than production efficiency, default to maximum context retention. Overriding these defaults requires deliberate architectural choices that most teams do not make because the cost of not making them is not visible until scale.
Organizations that do not measure token consumption cannot optimize it. Without per-feature, per-user, per-call attribution, there is no feedback loop. Engineers receive no signal that their prompts are expensive. Product managers receive no cost-of-feature data. Finance cannot model AI cost at the product line level. The debt accumulates, invisible and compounding, until the invoice arrives.
What Is Tokenminning?
2.1 The Core Principle
Tokenminning is the discipline of achieving the intended outcome with the minimum viable token expenditure — in input, in reasoning, and in output.
The minimum viable expenditure is not zero. It is not the fewest tokens that produce any output — it is the fewest tokens that produce the correct output at the required quality level. The distinction matters. Tokenminning is precision, not privation.
The best analogy in software engineering is query optimization. A query executed as a full table scan and the same query served by a well-designed index return identical results. The indexed query is not a lesser query. It is a better query: faster, cheaper, and more scalable. No engineer would argue that the full table scan should be preserved because "more data access means better results." Tokenminning applies the same logic to inference: the prompt that achieves the outcome with 400 tokens is not a lesser prompt than one that achieves the same outcome with 4,000 tokens. It is a better prompt.
2.2 The Tokenminning Mindset
Tokenminning begins with a simple commitment: treat every token as a resource with real cost. Financial, temporal, energetic, and environmental — none of these costs is zero, and none of them disappears because the invoice does not yet itemize them.
Tokens should be spent, not leaked. Spending is intentional. Leaking is accidental. Most tokenmaxxed systems are not choosing to consume tokens — they are failing to prevent consumption through defaults, habits, and lack of measurement. Tokenminning engineering builds systems where every token in the prompt has a reason to be there.
Compression is craft. The ability to convey exactly what needs to be conveyed — no more — is a first-class engineering skill. It requires understanding what the model needs to see to produce the correct output, which in turn requires understanding the model's priors, its failure modes, and the specific task at hand. A tokenminning practitioner is not just a user of LLMs. They are a student of LLM cognition, applying that knowledge to minimize the instruction surface necessary to elicit reliable performance.
The outcome is the unit, not the token. A tokenminning culture measures cost-per-correct-outcome, not just token count. A system that achieves the outcome in 300 tokens is better than one that requires 3,000 — but a system that requires 3,000 tokens and achieves the outcome is better than one that requires 200 tokens and fails. Tokenminning never sacrifices correctness for efficiency. It relentlessly questions whether the tokens being spent are actually buying correctness.
2.3 What Tokenminning Is Not
Clarity on scope prevents misapplication.
Tokenminning is not crude truncation. Cutting a prompt to save tokens while removing information the model needs to perform correctly is not optimization — it is degradation. The signal must be preserved. The noise must be removed. These are different operations.
Tokenminning is not refusal to use large models when large models are warranted. When a task genuinely requires the reasoning depth, breadth of knowledge, or instruction-following capability of a frontier-scale model, that model should be used. Tokenminning is about routing the right task to the right model — which sometimes means the largest available model. It is never about using a smaller model for a task it cannot perform.
Tokenminning is not "prompt hacking" — brittle micro-optimizations that extract marginally shorter outputs at the cost of reliability, generalization, or maintainability. A prompt that saves 50 tokens but fails unpredictably on 5% of inputs has not been optimized. It has been broken.
Tokenminning is the systematic elimination of waste: tokens that consume cost without affecting output quality in any meaningful way. That category is larger than most practitioners realize.
2.4 Tokenminning as a Movement
The practices described in this manifesto are not new. Engineers at organizations with large-scale LLM deployments have been applying versions of them for years, driven by cost pressure, latency requirements, and engineering instinct. What is new is the name, the framing, and the shared vocabulary.
Naming matters in engineering culture. "Technical debt" named something that everyone experienced but few could advocate for addressing, because addressing it lacked a shared vocabulary. Once named, it became a planning concept, a code review criterion, a career skill. Tokenminning does the same for inference efficiency.
A movement, as distinct from a best practice, has shared values, a community of practitioners, and the ambition to change industry norms — not just individual systems. Tokenminning's ambition is to make token efficiency a first-class engineering discipline: something taught in AI engineering curricula, evaluated in system design interviews, tracked in production dashboards, and weighed in architecture reviews.
If you are already practicing this work, this community is yours. If you are not yet practicing it, the rest of this manifesto is your starting point.
Tokenminning and the Infrastructure Crisis
3.1 The Data Center Reckoning
The physical infrastructure supporting AI inference is under extraordinary strain, and the strain is accelerating faster than the infrastructure can be built.
Hyperscaler data center construction operates on multi-year timelines. Power grid expansion, where required, operates on decade timelines. Water infrastructure for cooling, permitting, and environmental review add years to greenfield deployments in many jurisdictions. Capital requirements for new large-scale data center campuses are measured in the tens of billions of dollars.
Token demand growth operates on quarterly timelines.
The naive response to this imbalance is to build faster and more. This response is necessary but not sufficient. The physical constraints on infrastructure expansion — land, water, power, permitting, capital — are hard limits that no amount of urgency or investment can eliminate entirely. The gap between demand growth and supply growth is real, and in some geographies it is growing.
Demand-side discipline is the other half of the equation. Every token not consumed is a token's worth of compute capacity preserved for something else. At the individual organization level, tokenminning is a cost optimization. At the industry level, tokenminning is a contribution to the sustainability of the infrastructure ecosystem that all AI applications depend on.
3.2 The Energy Arithmetic
The energy cost of inference is not speculative. It is calculable, even if the calculation requires estimates at several steps.
Large dense transformer models perform roughly two floating-point operations per parameter per token during inference. A 70-billion-parameter model generating a single output token therefore executes on the order of 140 billion FLOPs. At typical accelerator efficiency, this translates to a measurable fraction of a watt-second per token. Across millions of tokens per day and thousands of concurrent users, the aggregate power consumption is substantial, and it is roughly proportional to token volume.
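A back-of-envelope version of this calculation is sketched below. It assumes the common rule of thumb of about two FLOPs per parameter per token for a dense forward pass. The effective efficiency figure is a placeholder assumption, not a measured value for any particular accelerator. The result is an order-of-magnitude estimate that ignores attention over long contexts, cooling, and networking overhead.

```python
PARAMS = 70e9                   # model size used in the text
FLOPS_PER_PARAM_PER_TOKEN = 2   # rule-of-thumb assumption for a dense forward pass
# Assumed delivered FLOPs per joule after real-world utilization losses.
# Hypothetical placeholder; measure your own hardware.
EFFECTIVE_FLOPS_PER_JOULE = 0.5e12
JOULES_PER_KWH = 3.6e6


def joules_per_token(params: float = PARAMS,
                     flops_per_param: float = FLOPS_PER_PARAM_PER_TOKEN,
                     efficiency: float = EFFECTIVE_FLOPS_PER_JOULE) -> float:
    """Estimated energy for one token's forward pass, in joules (watt-seconds)."""
    return params * flops_per_param / efficiency


def kwh_per_day(tokens_per_day: float) -> float:
    """Estimated daily accelerator energy for a given token volume, in kWh."""
    return joules_per_token() * tokens_per_day / JOULES_PER_KWH


print(f"{joules_per_token():.2f} J per token")
print(f"{kwh_per_day(1e9):.1f} kWh per day at one billion tokens per day")
```

Because every term is a multiplication, the estimate scales directly with token volume, which is the point of the paragraph above.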
To a first approximation, a 20% reduction in token consumption across an organization's AI inference stack is a 20% reduction in its inference compute demand and a comparable reduction in the associated energy draw, which translates into reduced Scope 2 emissions for self-hosted inference (or Scope 3 emissions where inference is purchased from a provider). This is not a rounding error. For organizations running production AI at scale, this represents a material sustainability contribution.
At the industry level, if tokenminning practices become standard across the organizations responsible for the majority of commercial AI inference, the cumulative energy savings would be measurable in terawatt-hours annually. This is not an exaggeration. It is arithmetic.
3.3 Carbon and Sustainability Commitments
Enterprise sustainability commitments — net-zero targets, science-based emissions reduction goals, Scope 3 reporting requirements — are becoming increasingly binding. Regulatory pressure in multiple jurisdictions is adding teeth to what were previously voluntary pledges. And AI inference is emerging as a non-trivial and rapidly growing share of the Scope 2 emissions profile for technology-intensive organizations.
The standard corporate response to this problem — purchasing carbon offsets, investing in renewable energy certificates — is under growing scrutiny from regulators, investors, and civil society. Offsets are increasingly viewed as deferral, not reduction. Reduction requires consuming less.
Token efficiency is emissions efficiency. Token counts are measured, not modeled, so a reduction in tokens is a direct, auditable, technically defensible reduction in compute demand, which in turn reduces energy consumption and emissions. Organizations that can produce per-product, per-feature token efficiency metrics, and demonstrate improvement over time, have a sustainability story that is genuinely credible, not assembled from offset certificates.
This is a first-mover opportunity. The organizations that build token efficiency measurement and governance infrastructure now will have the audit trails and the benchmarks that will be required by regulators and demanded by institutional investors in the coming years.
3.4 The Shared Resource Problem
Tokenmaxxing on shared inference infrastructure is, at a structural level, a collective action problem. Public API endpoints are shared resources. Every organization that sends unnecessarily large prompts, retains unnecessarily long context, or runs unbounded agentic loops on shared infrastructure is consuming capacity that others could use — and imposing latency costs on the entire pool through increased queuing and reduced throughput.
Rate limits, priority tiers, and capacity reservations are the market's early mechanism for managing this congestion. They are blunt instruments. The more elegant and sustainable solution is demand-side discipline: organizations that send efficient prompts impose lower load on shared infrastructure and receive better service in return, while also paying less.
For organizations operating private or on-premises inference infrastructure, the logic is even more direct. Every GPU-hour consumed by a bloated, tokenmaxxed prompt is a GPU-hour unavailable for other work. Capacity planning for on-premises inference that does not account for token efficiency optimization will systematically over-provision — at significant capital and operational cost.
Tokenminning and the Inference Budget
4.1 Inference Costs Are the New Compute Bill
For the first two decades of enterprise cloud computing, inference was cheap. The dominant cost drivers were training, batch processing, rendering, and storage. Inference — serving a model's output to a user — was a rounding error on the bill.
That has inverted.
For organizations running production LLM workloads at meaningful scale, inference now represents the majority of total AI spend — often between 60% and 80%, depending on the architecture and usage patterns. And unlike training costs, which are incurred once and amortized, inference costs are recurring, proportional to usage, and scale with every new user, every new feature, and every new use case added to the platform.
The trajectory is predictable. As AI capabilities become embedded in more products and workflows, token consumption will grow. The only variables are whether that growth is bounded, measured, and optimized — or unbounded, untracked, and treated as the cost of doing business.
4.2 The Per-Feature Attribution Problem
The most common failure mode in enterprise AI cost management is not overspending. It is not knowing that you are overspending.
Most organizations cannot answer, with specificity, the following question: how many tokens does each of our AI-powered features consume, per user interaction, per day? Without this answer, there is no optimization target. Engineers have no feedback signal on the cost of their prompt designs. Product managers cannot weigh the inference cost of a feature against its revenue or engagement contribution. Finance cannot build a bottom-up cost model for AI at the product line level.
Tokenminning requires and rewards observability. The measurement infrastructure — per-call token logging, aggregation by prompt template, attribution by feature and user segment, trend alerting — is not optional overhead for a tokenminning practice. It is the foundation. Without it, tokenminning is aspiration. With it, tokenminning is engineering.
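A minimal sketch of the measurement layer described above: one record per call, aggregated by feature or by prompt template. The class and field names (`CallRecord`, `TokenLedger`, `summ_v3`, and so on) are hypothetical. A production system would write to a metrics store, not an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    feature: str        # product feature that issued the call
    template: str       # prompt template identifier
    input_tokens: int
    output_tokens: int


class TokenLedger:
    """In-memory per-call token log with simple aggregation."""

    def __init__(self) -> None:
        self._records: list[CallRecord] = []

    def log(self, record: CallRecord) -> None:
        self._records.append(record)

    def totals_by(self, key: str) -> dict[str, int]:
        """Sum input + output tokens grouped by 'feature' or 'template'."""
        totals: dict[str, int] = defaultdict(int)
        for r in self._records:
            totals[getattr(r, key)] += r.input_tokens + r.output_tokens
        return dict(totals)


ledger = TokenLedger()
ledger.log(CallRecord("summarize", "summ_v3", 1200, 300))
ledger.log(CallRecord("summarize", "summ_v3", 1100, 250))
ledger.log(CallRecord("classify", "cls_v1", 180, 5))
print(ledger.totals_by("feature"))
```

The same records can be grouped by `template` to find which prompt is responsible for a feature's spend.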
4.3 Competitive Dynamics
Token efficiency is a cost structure advantage, and cost structure advantages compound over time.
Consider two organizations offering comparable AI-powered products in the same category — document processing, customer service automation, code assistance. If one organization achieves the same quality of outcomes at 40% fewer tokens per interaction through superior prompt engineering, model routing, and context management, it has structurally lower unit economics than its competitor. At scale, this translates into pricing power, margin advantage, or the ability to invest more in product development while maintaining the same profitability.
In product categories where AI capabilities are becoming commoditized — where multiple providers can deliver acceptable quality on standard NLP tasks — inference cost efficiency will increasingly be the decisive differentiator. The frontier model capability gap matters less when the question is who can deliver good-enough quality at the lowest cost per interaction.
Proprietary token efficiency advantages — a fine-tuned smaller model that replaces a prompted larger one, an optimized prompt library developed over hundreds of iterations, an intelligent routing layer built on months of production data — are meaningful moats. They are difficult to replicate quickly. They compound as the organization develops institutional knowledge around what works. They represent a return on the investment in tokenminning as an engineering discipline.
4.4 Budget Predictability and Governance
Unbounded token consumption creates unbounded cost exposure. This is not a theoretical risk. Agentic systems with poorly designed loop termination logic, misconfigured context retention, or inadequate tool call budgets can consume orders of magnitude more tokens than expected on edge-case inputs — and they often do.
The horror stories are becoming common: an agentic pipeline that enters a reasoning loop on an ambiguous input and processes 500,000 tokens before timing out; a batch processing job whose context window grows without bound across documents, consuming the entire monthly token budget on a single run; a customer-facing chatbot with no output length constraint that generates 3,000-word responses to simple questions.
These are not edge cases in poorly designed systems. They are predictable failure modes of systems designed without token budget constraints as a first-class property.
Tokenminning engineering builds budget ceilings and throttles into systems as architectural primitives — not as emergency patches applied after the first unexpected invoice. Soft token budgets trigger logging and alerting. Hard token budgets enforce ceilings and fail gracefully. Per-session, per-user, and per-feature budgets make cost exposure predictable and governable.
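The soft and hard budgets described above might look like the following sketch. `TokenBudget` and `TokenBudgetExceeded` are hypothetical names. The soft limit only flags and logs, while the hard limit refuses the charge so the caller can fail gracefully.

```python
import logging

logger = logging.getLogger("token_budget")


class TokenBudgetExceeded(RuntimeError):
    """Raised when a charge would push usage past the hard ceiling."""


class TokenBudget:
    def __init__(self, soft_limit: int, hard_limit: int) -> None:
        if not 0 < soft_limit <= hard_limit:
            raise ValueError("require 0 < soft_limit <= hard_limit")
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.used = 0

    @property
    def soft_exceeded(self) -> bool:
        return self.used >= self.soft_limit

    def charge(self, tokens: int) -> None:
        """Record token usage; reject the charge if it would break the hard limit."""
        if self.used + tokens > self.hard_limit:
            raise TokenBudgetExceeded(
                f"charge of {tokens} would exceed hard limit {self.hard_limit}"
            )
        self.used += tokens
        if self.soft_exceeded:
            logger.warning("soft limit reached: %d/%d", self.used, self.hard_limit)


budget = TokenBudget(soft_limit=100, hard_limit=150)
budget.charge(90)   # under both limits
budget.charge(20)   # crosses the soft limit: logs a warning, call still proceeds
```

One instance per session, user, or feature gives the per-scope budgets the paragraph describes. The orchestration layer catches `TokenBudgetExceeded` and decides how to degrade.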
As AI governance frameworks mature — driven by board-level concern, investor scrutiny, and emerging regulatory requirements — the ability to model, predict, and report on AI inference costs at a granular level will become a compliance requirement. Token efficiency governance is the operational foundation for that capability.
Principles, Techniques, and Tools
5.1 Prompt Engineering and Compression
Prompts are code. They should be written with the same care, reviewed with the same rigor, and optimized with the same discipline as any performance-critical code path. In most organizations today, they are not.
Establish the baseline first. Before any optimization, instrument every prompt template to log input token count, output token count, and a quality signal for the task. You cannot optimize without a baseline, and optimizations made without measurement are guesses. Audit and baseline before you compress.
Remove scaffolding before production. Prompts written during development often contain explanatory scaffolding — background context added for the developer's own clarity, examples that were helpful for debugging but are not needed in production, verbose role-setting preambles that made sense in a research notebook. None of this belongs in production prompts. Conduct an explicit scaffolding audit before any prompt template goes to production.
Compress instructions to their imperative core. Natural language is redundant by design: it is built for communication between humans who require context, preamble, and politeness to cooperate effectively. Models do not. "Reply in JSON" is close in effect to "Please ensure that your response is formatted as a valid JSON object with the following structure, containing the keys listed below, and avoiding any additional commentary or text outside the JSON structure." The first version is about three tokens. The second is over forty. Use structured schemas (JSON Schema, XML templates, function signatures) to specify format requirements. They are more compact, more precise, and more reliably enforced than natural-language format instructions.
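As a sketch of specifying format with a schema instead of prose, the example below embeds a minified JSON Schema in the prompt and checks the reply against it. The schema, field names, and helper functions are hypothetical. The check is a minimal structural test, not full JSON Schema validation.

```python
import json

# Hypothetical format contract for a sentiment task.
REPLY_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,
}


def build_prompt(text: str) -> str:
    """Terse instruction plus a minified schema instead of a prose format description."""
    schema = json.dumps(REPLY_SCHEMA, separators=(",", ":"))
    return f"Reply in JSON matching this schema:\n{schema}\nText: {text}"


def check_reply(raw: str) -> dict:
    """Parse the reply and verify its keys; raises ValueError on mismatch."""
    reply = json.loads(raw)
    if not isinstance(reply, dict):
        raise ValueError("reply is not a JSON object")
    missing = set(REPLY_SCHEMA["required"]) - set(reply)
    extra = set(reply) - set(REPLY_SCHEMA["properties"])
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return reply


print(check_reply('{"sentiment": "positive", "confidence": 0.9}'))
```

Where the inference API offers a native structured-output mode, passing the schema through that mechanism is usually preferable to embedding it in the prompt text.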
Eliminate politeness tokens. "Please," "thank you," "I'd like you to," "Could you kindly" — these add tokens and buy nothing. Models do not have feelings that need managing. They respond to instructions, not social niceties. In production prompts, every politeness token is a waste.
System prompt hygiene is ongoing maintenance. Conduct quarterly audits of production system prompts. Look for contradictions between instructions added at different times by different contributors. Look for redundancies — the same instruction expressed twice in different words. Look for instructions whose behavioral effect is untested or unknown. A system prompt should be the minimum set of instructions required to produce the desired behavioral profile. Anything beyond that minimum is overhead.
Few-shot example curation. Use the minimum number of examples required to reliably demonstrate the pattern to the model. For well-defined tasks with clear output formats, one or two examples are often sufficient, and well-calibrated production systems rarely need more than three or four. Each example should be chosen specifically to disambiguate a known failure mode: a case where the model without the example would behave incorrectly. If you cannot articulate why a specific example is in your few-shot set, it should not be there.
5.2 Architecture and Model Selection
Prompt efficiency operates within a single model call. Architecture efficiency determines which model handles which call — and that decision has a larger impact on total token cost than any individual prompt optimization.
Right-size the model to the task. Using a frontier-scale large language model for entity extraction, classification, sentiment analysis, or structured data transformation is the model-selection equivalent of tokenmaxxing. A well-prompted or lightly fine-tuned smaller model will achieve equivalent accuracy on the vast majority of well-defined NLP tasks at a fraction of the cost and latency. The assumption should be: start with the smallest capable model and escalate only when quality thresholds require it.
Build an explicit routing layer. The routing layer, a component that classifies incoming requests and directs them to the appropriate model tier, is perhaps the single highest-leverage architectural investment in a tokenminning practice. A lightweight classifier that can reliably distinguish "simple extraction" from "complex reasoning" tasks, and route them to appropriately sized models, can cut average inference cost substantially, by 50% or more on workloads dominated by simple requests, with little or no change to output quality on either task class.
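The shape of such a routing layer can be sketched with a rule-based stand-in for the classifier. The tier names, task labels, and length threshold are all hypothetical. A real router would more likely use a trained lightweight classifier and escalate on low confidence.

```python
from dataclasses import dataclass

# Hypothetical tier names; map them to whatever models you actually deploy.
SMALL_TIER = "small-model"
LARGE_TIER = "frontier-model"

# Task types assumed to be well served by the small tier.
SIMPLE_TASKS = {"classification", "entity_extraction", "sentiment", "data_transformation"}


@dataclass(frozen=True)
class Request:
    task_type: str
    text: str


def route(request: Request, long_input_chars: int = 8_000) -> str:
    """Default to the smallest capable tier; escalate on signs of difficulty."""
    if request.task_type in SIMPLE_TASKS and len(request.text) <= long_input_chars:
        return SMALL_TIER
    return LARGE_TIER


print(route(Request("sentiment", "Great product, fast shipping.")))
print(route(Request("multi_step_reasoning", "Plan a migration across three services.")))
```

Logging each routing decision alongside a quality signal makes it possible to tune the boundary between tiers from production data.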
Design for cache efficiency. Prefix caching, the reuse of key-value cache entries across requests that share the same leading tokens, is a significant optimization lever that most application developers underutilize. Cache hit rates are highest when prompt templates begin with long, stable prefixes: system prompts and static context that rarely change should occupy the beginning of every prompt, where caching semantics favor them. Variable user inputs and task-specific context should come at the end of the prompt. Understand the caching semantics of your inference infrastructure, whether you manage it directly or consume it via an API, and design prompts accordingly.
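The ordering rule can be illustrated with a small prompt builder that places stable content first and variable content last. The prompt text and function names are hypothetical. Whether a shared prefix actually produces a cache hit depends on the caching rules of the inference stack in use.

```python
import os

# Stable content, identical across calls (hypothetical text).
SYSTEM_PROMPT = "You are a support assistant. Answer from the policy below."
STATIC_POLICY = "Refunds are available within 30 days of purchase."


def build_prompt(user_input: str, session_context: str) -> str:
    """Stable parts first, variable parts last, so calls share a long prefix."""
    return "\n\n".join([SYSTEM_PROMPT, STATIC_POLICY, session_context, user_input])


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


p1 = build_prompt("Where is my order?", "Customer tier: gold")
p2 = build_prompt("Can I get a refund?", "Customer tier: silver")
stable = SYSTEM_PROMPT + "\n\n" + STATIC_POLICY + "\n\n"
print(shared_prefix_len(p1, p2) >= len(stable))
```

Placing the user input first would reduce the shared prefix to nearly nothing, even though the prompt contains exactly the same text.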
Retrieval-Augmented Generation precision is a token efficiency problem. In RAG architectures, every document chunk inserted into the context window costs tokens. The retrieval component's job is to ensure that the chunks inserted are the ones the model actually needs. Low-precision retrieval — returning twenty chunks when five would suffice, or returning loosely relevant chunks because the embedding similarity threshold is set too permissively — is a form of tokenmaxxing at the architecture layer. Invest in retrieval quality: better embedding models, re-ranking passes, chunk size calibration, and similarity threshold tuning are all tokenminning investments that pay dividends in both cost and output quality.
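A sketch of the two simplest precision controls mentioned above, a similarity threshold and a cap on chunk count, using toy two-dimensional vectors. The function names and default values are assumptions. Real systems would use an embedding model, a vector index, and usually a re-ranking pass.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_chunks(query_vec: list[float],
                  chunks: list[tuple[str, list[float]]],
                  min_similarity: float = 0.75,
                  max_chunks: int = 5) -> list[str]:
    """Keep only chunks above the threshold, best first, capped at max_chunks."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    kept = [pair for pair in scored if pair[0] >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:max_chunks]]


toy_chunks = [
    ("a", [1.0, 0.0]),   # identical direction to the query
    ("b", [0.9, 0.1]),   # very close
    ("c", [0.0, 1.0]),   # orthogonal
    ("d", [0.5, 0.5]),   # about 0.71 similarity, below the default threshold
]
print(select_chunks([1.0, 0.0], toy_chunks))
```

Both parameters should be tuned against an evaluation set, since a threshold that is too strict removes signal as surely as one that is too loose adds noise.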
Control output length explicitly. Set maximum output token constraints aggressively for tasks with bounded output requirements. Use structured output modes — JSON with a defined schema, function calls with typed signatures — where supported. They tend to produce more compact, parseable outputs than open-ended prose responses, while also being more reliable for downstream processing.
5.3 Agentic Workflows and Loop Design
The agentic setting is where tokenmaxxing does the most damage, and where tokenminning practices have the highest leverage.
In a naive multi-step agentic loop, context grows monotonically with each step: the full history of observations, tool calls, results, and reasoning is appended to each subsequent prompt. For a task that requires n steps, each adding a roughly constant number of tokens, the prompt at step k is proportional to k, so total token consumption across the run is O(n²). This quadratic scaling makes complex agentic tasks dramatically more expensive than the sum of their parts.
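The quadratic growth is easy to verify numerically. The sketch below assumes each step adds a fixed number of tokens to the history and compares that against a loop that carries a fixed-size state summary instead. All token counts are illustrative assumptions, and the compressed figure does not include the cost of the summarization calls themselves.

```python
def naive_total(steps: int, base_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens when every step re-sends the full accumulated history."""
    return sum(base_tokens + k * tokens_per_step for k in range(steps))


def compressed_total(steps: int, base_tokens: int, state_tokens: int) -> int:
    """Total input tokens when history is replaced by a fixed-size state summary."""
    return steps * (base_tokens + state_tokens)


for n in (20, 40):
    naive = naive_total(n, base_tokens=500, tokens_per_step=1_000)
    compact = compressed_total(n, base_tokens=500, state_tokens=300)
    print(f"steps={n}: naive={naive:,} compressed={compact:,}")
```

With these numbers, doubling the step count from 20 to 40 quadruples the naive total (200,000 to 800,000 tokens) while the compressed total only doubles (16,000 to 32,000).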
Implement explicit context compression between steps. After each major sub-task, run a compression step that distills the accumulated context into a compact structured state representation — a summary of what was learned, what was decided, and what the current task state is. This representation replaces the verbose history in the next step's prompt. The history is not lost; it is archived to external storage where it can be retrieved if needed. But it is not in the active context window consuming tokens at every subsequent step.
Separate working memory from episodic memory. Working memory is what the agent needs to complete the current step — the current sub-task description, the relevant state from prior steps, the available tools, any constraints. Episodic memory is the full record of what happened — every tool call result, every observation, every intermediate reasoning step. Only working memory belongs in the prompt window at step time. Episodic memory belongs in external storage, referenced by pointers rather than reproduced in full.
Design tools for output density. Each tool call in an agentic system adds overhead: the call structure, the response parsing, error handling, and the tool result that gets added to context all cost tokens. More importantly, raw tool outputs — API responses, database query results, file contents — are almost never in the format most useful to the model. They contain metadata, formatting, and redundancy that the model does not need. Build tool interfaces that return exactly what the agent needs in the most compact format, not raw API responses. A tool that returns {"status":"completed","count":47} is better than a tool that returns a 2,000-token JSON blob with every field the API happens to expose.
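One simple way to approach this is a projection wrapper around the raw response. The sketch below is illustrative: `compact_result` and the shape of `raw_response` are assumptions made up for the example, not the interface of any real API.

```python
import json


def compact_result(raw, fields):
    """Project a raw API response down to the fields the agent actually needs."""
    return {name: raw[name] for name in fields if name in raw}


raw_response = {
    "status": "completed",
    "count": 47,
    "request_id": "a1b2c3",
    "server": "edge-17",
    "pagination": {"next": None, "prev": None},
    "debug": {"elapsed_ms": 182, "cache": "miss"},
}
tool_output = compact_result(raw_response, fields=("status", "count"))
# Compact separators avoid spending tokens on whitespace.
serialized = json.dumps(tool_output, separators=(",", ":"))
```

The wrapper lives in the tool layer, so the model never sees the fields it was not going to use.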
Make token budgets a first-class agent property. Every agentic system should have explicit token budgets enforced by the orchestration layer as hard programmatic constraints, not as instructions to the model ("try to be concise"). A per-run token ceiling that triggers graceful termination and state preservation when approached is not a limitation. It is a safety property. It is the difference between a system whose cost envelope is predictable and one that can consume arbitrarily large amounts of compute on edge-case inputs.
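A hard ceiling of this kind might look like the sketch below. It assumes the orchestrator can estimate each step's token cost before making the call; `TokenBudget`, `run_agent`, and the list of step costs (which stands in for real model calls) are hypothetical names used only for illustration.

```python
class TokenBudgetExceeded(Exception):
    pass


class TokenBudget:
    """Hard per-run ceiling enforced by the orchestrator, not by the model."""

    def __init__(self, ceiling):
        self.ceiling = ceiling
        self.used = 0

    def charge(self, tokens):
        if self.used + tokens > self.ceiling:
            raise TokenBudgetExceeded(f"{self.used + tokens} > {self.ceiling}")
        self.used += tokens


def run_agent(step_costs, budget):
    """Run steps until done or until the next step would break the ceiling.

    Returns (status, completed_steps) so the caller can persist state.
    """
    completed = 0
    for cost in step_costs:
        try:
            budget.charge(cost)
        except TokenBudgetExceeded:
            return "terminated_on_budget", completed
        completed += 1
    return "finished", completed


result = run_agent([400, 450, 500], TokenBudget(ceiling=1000))
```

The third step would push usage past the ceiling, so the run stops after two completed steps and reports how far it got rather than failing silently or running on.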
Batch tool calls where semantically coherent. Many agentic frameworks support parallel tool execution within a single reasoning step. Where multiple tool calls are independent — they do not depend on each other's results — batching them in a single step rather than executing them sequentially reduces the number of reasoning cycles required and compresses the context growth per unit of work accomplished.
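As a rough illustration, independent tool calls can be issued together with `asyncio.gather` from the Python standard library, so that their results enter the context once and a single reasoning step follows. The two tool functions below are hypothetical stand-ins that simulate network latency; a real framework's parallel tool interface will differ.

```python
import asyncio


async def fetch_weather(city):
    await asyncio.sleep(0.01)  # stands in for network latency
    return {"city": city, "temp_c": 18}


async def fetch_calendar(user):
    await asyncio.sleep(0.01)
    return {"user": user, "free_slots": 3}


async def run_independent_tools():
    # Neither call needs the other's result, so both run in one step.
    weather, calendar = await asyncio.gather(
        fetch_weather("Lisbon"),
        fetch_calendar("ana"),
    )
    return {"weather": weather, "calendar": calendar}


combined = asyncio.run(run_independent_tools())
```

Calls that do depend on each other's results still have to run sequentially; batching applies only where the dependency graph allows it.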
5.4 Observability, Measurement, and Governance
Every principle and technique in the preceding sections is operational only if you can measure it. Tokenminning without observability is motivation without feedback — valuable intent, no mechanism for improvement.
At minimum, production AI systems should instrument: input token count and output token count per call, prompt template identifier (so calls can be aggregated by template), feature or product area attribution, latency, and a quality signal appropriate to the task. These metrics should be queryable at the call level and aggregatable by template, feature, user segment, and time period.
From this instrumentation, derive the metrics that tokenminning governance requires: cost-per-successful-completion, token efficiency trend over time per template, cache hit rate where applicable, and the distribution of token consumption per agentic run. The tail behavior — the 99th percentile — is often more revealing than the mean.
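The sketch below shows one possible shape for a call-level record and two of the derived metrics. Everything in it is an assumption for illustration: the `CallRecord` fields, the reduction of the quality signal to a pass/fail flag, the nearest-rank percentile method, and the per-1,000-token prices, which are placeholders rather than any provider's actual rates.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class CallRecord:
    template_id: str
    feature: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    success: bool  # task-appropriate quality signal, reduced to pass/fail here


def cost_per_successful_completion(records, usd_per_1k_in, usd_per_1k_out):
    """Total spend across all calls divided by the number of successful ones."""
    total = sum(
        r.input_tokens / 1000 * usd_per_1k_in + r.output_tokens / 1000 * usd_per_1k_out
        for r in records
    )
    successes = sum(1 for r in records if r.success)
    return total / successes if successes else math.inf


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


records = [
    CallRecord("summarize_v3", "inbox", 1000, 200, 850.0, True),
    CallRecord("summarize_v3", "inbox", 1200, 250, 910.0, True),
    CallRecord("summarize_v3", "inbox", 9000, 400, 2300.0, False),
]
cpsc = cost_per_successful_completion(records, usd_per_1k_in=0.001, usd_per_1k_out=0.002)
p99_tokens = percentile([r.input_tokens + r.output_tokens for r in records], 99)
```

Note that the failed call still counts toward spend: dividing by successes only is what makes the metric sensitive to expensive failures, and the tail value picks out the same outlier run.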
The tokenminning regression test should be part of every CI/CD pipeline that touches AI: when a prompt template is modified, the deployment pipeline should automatically compare token consumption on a representative sample against the previous version. A change that increases token consumption by more than a defined threshold should require explicit review and sign-off, not just quality testing. Token efficiency is a correctness property, not an afterthought.
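The comparison at the heart of such a gate can be very small. The following is a sketch under assumptions: token counts for both prompt versions have already been collected on the same representative sample, the comparison uses the mean, and the 10% threshold is an arbitrary example value. The function name `token_regression` is hypothetical.

```python
def token_regression(baseline_counts, candidate_counts, max_increase=0.10):
    """Compare mean tokens per call for two prompt versions on the same sample.

    Returns (relative_change, needs_review). `max_increase` is a fraction:
    0.10 means a rise of more than 10% requires explicit sign-off.
    """
    if len(baseline_counts) != len(candidate_counts) or not baseline_counts:
        raise ValueError("need two non-empty samples of equal size")
    baseline_mean = sum(baseline_counts) / len(baseline_counts)
    candidate_mean = sum(candidate_counts) / len(candidate_counts)
    change = (candidate_mean - baseline_mean) / baseline_mean
    return change, change > max_increase


change, needs_review = token_regression(
    baseline_counts=[1000, 1100, 900],
    candidate_counts=[1200, 1300, 1100],
)
```

A pipeline step could fail the build, or request a reviewer, whenever `needs_review` is true. Teams that care about tail behavior may prefer to compare a high percentile instead of the mean.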
Finally, build budget guardrails as programmatic properties of every AI system: soft limits that trigger alerting and logging when approaching a threshold, and hard limits that enforce ceilings and fail gracefully. These are not constraints on what engineers can build. They are the engineering discipline that makes AI systems governable, predictable, and trustworthy at scale.
Who Should Care, and Why
6.1 The CEO
The AI strategy your organization articulated in 2023 was built on cost assumptions that are being revised in real time, in ways that most strategic plans have not accounted for. Inference costs are not a technical detail. They are a unit economics question that will determine whether your AI investments produce the returns you projected.
The organizations that will be most competitive in the next phase of AI are not the ones that deployed the most AI features the fastest. They are the ones that built AI-native products with disciplined inference economics — systems that deliver value per token, not just tokens per feature. Tokenminning is the engineering culture that produces those economics.
You do not need to understand the technical details. You need to ask one question in your next AI strategy review: what is our cost per AI-powered outcome, by feature, and what is our improvement trajectory? If your AI leads cannot answer that question, you have a governance gap, and this manifesto is the starting point for closing it.
6.2 The CFO
Your AI API costs are growing, and if you are like most finance leaders overseeing AI-adopting organizations, you do not have a cost model granular enough to understand why or to project where they will be in eighteen months.
The problem is structural. AI inference costs do not map naturally onto traditional IT cost categories. They are usage-proportional, feature-specific, and driven by engineering decisions that happen far from finance — choices about which model to use, how long to make a system prompt, how much context to retain in an agentic workflow. Without per-feature attribution and engineering accountability for token consumption, there is no model for these costs and no lever to manage them.
Tokenminning gives you the vocabulary and the framework to close this gap. Cost-per-completion. Token budget by product line. Inference ROI. These are the metrics that make AI spend governable. Ask your AI leads to instrument them. Then ask for a quarterly trend report.
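On the sustainability side: if your organization has made emissions reduction commitments, AI inference is becoming a material component of your emissions profile. It falls under Scope 2 where you power your own inference infrastructure and under Scope 3 where you buy inference from a cloud or API provider. Token efficiency improvements reduce the compute behind that footprint in a way that can be measured and audited. That is a more defensible position than offset purchases, and regulators and investors are beginning to distinguish between them.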
6.3 The VP of AI / Head of AI Engineering
Token efficiency is your discipline to define and own. If you do not establish it as a first-class engineering standard — alongside latency, reliability, and accuracy — it will not be established. The default behavior of engineers optimizing for capability and speed-to-demo will produce tokenmaxxed systems, because tokenmaxxing is the path of least resistance in every major LLM framework.
Building a tokenminning practice in your organization starts with measurement infrastructure — without which nothing else is tractable — and progresses through making token efficiency a code review criterion, an architecture review criterion, and a deployment gate. Establish a model selection policy with explicit tiers, routing criteria, and the expectation that the default model is the smallest capable one, not the largest available one.
Own the fine-tuning calculus. A fine-tuned smaller model that replaces a prompted larger model on a well-defined production task is typically a tokenminning win across every dimension: lower cost per call, lower latency, better reliability on the specific task distribution, and reduced dependence on prompt engineering to compensate for capability gaps. The investment in fine-tuning has a quantifiable ROI in inference cost savings. Make that calculation explicit and present it to your CFO.
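One way to make the calculation explicit is a break-even call volume. As illustrative assumptions, let F be the one-time fine-tuning cost (data preparation, training, evaluation), c_L the cost per call of the prompted larger model, and c_S the cost per call of the fine-tuned smaller model:

```latex
N_{\text{break-even}} = \frac{F}{c_L - c_S}, \qquad c_L > c_S
```

With hypothetical figures of F = $20,000, c_L = $0.012, and c_S = $0.002, the investment pays for itself after 2,000,000 calls. Ongoing costs such as retraining and hosting are left out of this simplified form and would raise the threshold.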
6.4 The AI Product Manager
Every feature you ship has a token cost, and that cost is a product decision as much as an engineering one. How much inference is this feature worth? What quality threshold is required, and what model size is sufficient to reach it? What is the acceptable response time, and what does that imply about context size constraints?
These are product questions. Engineers can implement the answers, but they cannot define the acceptable trade-off between quality and cost without product input. If you are not participating in these conversations, you are leaving cost-quality optimization to default engineering choices, which will tend toward tokenmaxxing because quality is visible and token cost is not.
Make token cost a product metric. Include it in feature launch criteria alongside performance benchmarks. Track it in your product analytics alongside engagement and retention. Ask, for every AI-powered feature: what is our cost per user interaction, and what is our target? The answer will inform model selection, prompt design, and feature scope in ways that pure capability thinking will not.
6.5 The Machine Learning / AI Engineer
Tokenminning is your craft. The skills it requires — information-theoretic prompt design, retrieval system optimization, stateful agent architecture, LLM call graph profiling — are the engineering competencies that distinguish practitioners who understand inference deeply from those who treat models as black boxes to be prompted maximally.
A well-compressed prompt is not just cheaper than a bloated one. It is more legible, easier to debug, easier to version-control, and more likely to behave predictably across the distribution of inputs it will encounter in production. Tokenminning is a quality discipline, not just a cost discipline. It produces better engineering artifacts.
This is also where the community lives — in the GitHub repositories of prompt compression libraries, in the architecture discussions around context management strategies for agents, in the benchmarks comparing retrieval approaches on cost-quality Pareto frontiers. Tokenminning is a technical movement with a growing body of practice. Engage with it, contribute to it, and build your career on the expertise it rewards.
6.6 The Sustainability / ESG Lead
AI is becoming a material component of your organization's environmental footprint, and inference is the fastest-growing piece of that component. This is not a future risk — it is a current one, and it is already showing up in the energy consumption data of organizations with serious AI deployments.
Token efficiency is the lever you have been looking for: a concrete, technical, auditable mechanism for reducing AI-related energy consumption that is more defensible than offset purchasing and more tractable than advocating for hardware efficiency improvements you cannot control.
Partner with your AI engineering and product teams to establish token efficiency KPIs. Request that token consumption be tracked alongside the energy intensity metrics that feed into your Scope 2 and Scope 3 reporting. Build the audit trail now, before regulators require it. The organizations that can tell a precise, technical story about their AI efficiency trajectory will be in a substantially stronger position than those that cannot.
The Tokenminning Covenant
The era of abundant, cheap, consequence-free tokens is ending. It was never as real as it seemed — the costs were always there, externalized to API bills that grew quietly in the background, to data centers drawing power from grids that have not been built yet, to carbon budgets that the world cannot afford. The era of unlimited token abundance was a subsidy, and subsidies end.
The organizations and engineers who build token-efficient systems today will have structural advantages when that transition completes. Better unit economics. More predictable cost structures. Cleaner sustainability profiles. Better-engineered, more maintainable AI systems. And the institutional knowledge and measurement infrastructure to improve continuously, rather than starting from scratch when the invoice becomes impossible to ignore.
This is not a call for sacrifice or constraint. It is a call for craft. The token is the atom of inference. Treat it accordingly.
We who commit to the practice of tokenminning make the following covenant:
[Graphic: The tokenminning covenant. Seven commitments for the age of inference.]
Token Efficiency Audit Checklist
- Is every instruction in this prompt necessary to produce the required output?
- Are format requirements expressed as schemas rather than natural-language descriptions?
- Has all development scaffolding been removed?
- Are politeness tokens eliminated?
- Are few-shot examples justified by specific failure modes they address?
- Has this prompt been tested at its current token count against a baseline for quality regression?
- Is the model selection justified by task complexity requirements, not convenience?
- Is there a routing layer directing simple tasks to smaller models?
- Are prompt templates designed with stable prefixes to maximize cache efficiency?
- In RAG systems: has retrieval precision been measured and optimized?
- Are output length constraints set explicitly?
- Does the agent loop have explicit, model-independent termination criteria?
- Is there a context compression step between major sub-tasks?
- Are tool interfaces designed to return compact, model-relevant outputs?
- Is there a hard token budget enforced at the orchestration layer?
- Is per-run token consumption logged and monitored?
- Is token consumption attributed per feature and per prompt template?
- Is there a token cost trend report available to engineering leads and product managers?
- Is token efficiency included in deployment gates alongside quality metrics?
- Are soft and hard token budget guardrails implemented in production?
Model Selection Decision Matrix
| Task class | Characteristics | Recommended tier |
|---|---|---|
| Classification / routing | Bounded labels, well-defined schema | Smallest capable; fine-tune if volume warrants |
| Structured extraction | Fixed output schema, moderate complexity | Small-to-mid; schema-constrained output |
| Summarization | Variable length input, moderate judgment | Mid-tier; output length control critical |
| Multi-step reasoning | Logical chains, formal inference | Large model; chain-of-thought justified |
| Complex code generation | Novel architecture, multi-file context | Largest capable; context management critical |
| Conversational | High variance, nuance-dependent | Route on detected complexity; escalate dynamically |
Note: Tier definitions are intentionally relative. Calibrate them to your available model catalog and production quality thresholds.
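The matrix above could be encoded as a simple lookup at the routing layer. This is a sketch only: the task-class keys and tier labels are shorthand for the table rows, and the 0.7 complexity cut-off for conversational traffic is an arbitrary illustration, since the matrix itself specifies no numeric threshold.

```python
TIER_BY_TASK = {
    "classification": "smallest",
    "structured_extraction": "small_to_mid",
    "summarization": "mid",
    "multi_step_reasoning": "large",
    "complex_code_generation": "largest",
}


def select_tier(task_class, detected_complexity=None):
    """Map a task class to a model tier; conversational traffic routes on complexity."""
    if task_class == "conversational":
        # Escalate dynamically; the 0.7 cut-off is a hypothetical example value.
        return "large" if (detected_complexity or 0.0) >= 0.7 else "small_to_mid"
    return TIER_BY_TASK[task_class]


tier = select_tier("summarization")
```

An unknown task class raises `KeyError`, which forces an explicit routing decision for each new task type instead of a silent default to the largest model.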