Briefing #35: How to Interview an AI
We hire people for important jobs by putting candidates through a rigorous selection process. Why don’t we do the same with AI?
When I meet with senior leaders to talk about AI for the first time, one of the things I ask is which AI models their organization uses. Most will name one without hesitation. When I ask them why, the answers are almost always the same: “It’s the one I’ve gotten used to” and “It’s the one that gives me the best answers.”
If I ask them what they’re looking for in an AI model, they’ll often tell me they want “the best” model, or the “most accurate” model, or the one that makes “the fewest mistakes.”
Some tell me they keep a close eye on LLM leaderboards. In the same breath, they share that they’re wary of committing to a particular model, because there always seems to be a new one right around the corner, and each new model promises a meteoric improvement over the last.
At the same time, I’ve run into leaders who mourn the loss of models they’ve used for a long time. Some were genuinely upset when an AI provider deprecated a model version they’d grown to rely on, as if they’d lost a trusted colleague.
One described to me that it felt like their assistant had been replaced overnight without warning. And the new one “just isn’t as good.”
These reflections reveal how we relate to the AI tools we invest in. They also reinforce how many of the decisions we make about AI are based on subjective experience, gut feel, and what we hear from others.
Changing AI benchmarks
Given the hiring rigor that many organizations adopt, it’s curious that many leaders today don’t set out clear qualification criteria for an AI model, beyond maybe saying they want the best one.
Rather than establishing AI selection criteria, we choose instead to defer our choice of AI to what we see or hear in the news.
The AI media ecosystem has conditioned us to think about model selection the way we think about sports rankings. Who’s number one? Which model topped the latest evaluation suite? The implicit assumption is that “best at benchmark tests” translates to “best for your specific use case,” even though that assumption is almost always wrong.
Models like Google’s Gemma 4 and DeepSeek v4 have been attracting serious attention in the AI engineering community, not because they top every leaderboard, but because they deliver performance comparable to much larger frontier models at a fraction of the infrastructure cost. It’s a sign that the battleground for AI model supremacy is shifting from pure accuracy to efficiency.
DeepSeek v4 is priced at roughly one-sixth the cost of leading competitors, generating what the developer community is calling a “second DeepSeek moment,” a nod to the market impact that DeepSeek had when it was originally unveiled to the world in 2025.
What models like DeepSeek v4 and Gemma 4 are demonstrating is that it’s now possible to get near-frontier performance at a fraction of the cost, and that the gap between the “best model” and “the right model for the job” is widening. Developers are already prioritizing the use of cost-efficient AI models alongside tools like Claude Code for complex coding tasks instead of strictly relying on the thinking models that top the LLM leaderboards.
They’re making strategic decisions at the task level that most senior leaders in organizations haven’t started thinking about yet. Developers are embodying the financial discipline to match the right model to the right task at the lowest cost that still meets the performance bar.
This way of thinking about AI is showing up elsewhere. Just have a look at how China has weathered GPU export bans that were intended to constrain its AI development. Instead of inhibiting that development, the bans forced an efficiency pivot that’s become a structural competitive advantage. The constraint has become the edge.
Fit isn’t the same as performance
AI model selection is more than just a niche engineering consideration. It’s actually a leadership and commercial one, and one that many leaders have yet to make deliberately.
It’s really a decision of fit, rather than one of performance: match the right model to the right task, deliberately, based on what conditions for success matter most to your organization instead of just relying on external rankings.
A model that scores highest on a reasoning benchmark may respond in a style that doesn’t work for your team’s workflows. A model that’s technically less capable may follow complex, multi-part instructions more reliably, which matters far more in practice when deploying real-world AI workflows than marginal differences in raw intelligence.
Also: a model that communicates in a way that matches your organization’s culture will be adopted more readily than one that doesn’t, regardless of what the benchmarks say.
We might be tempted to think about “non-performance” considerations the same way we think about “hard skills” versus “soft skills” when we evaluate job candidates, but it’s more than that. These variables of fit are the ones that determine whether an organization will actually use AI, or shuffle it to the side after the initial enthusiasm fades.
To know which AI model really is “the best,” leaders need to ask which models are the best fit for the work being asked of AI, in the context of the people who will be using it.
Six things to audition for
The good news is that evaluating fit doesn’t require a technical background. It requires the same skills you already use when hiring: structured observation of how a candidate performs on the things that actually matter for the role.
Here are six things you can do today to evaluate AI models against any application you’re considering. Look for:
1. Reasoning transparency. Ask the model to walk you through how it would approach a decision where evidence points in two different directions. The conclusion doesn’t matter. What matters is whether the model shows its work. A model that externalizes its reasoning is easier to audit, challenge, and trust.
2. Instruction precision. Give the model a multi-part instruction with a specific format requirement and an unusual constraint. Count how many parts it fulfills completely and correctly. That consistency is one of the most important practical variables for complex workflows and automations, often more important than raw intelligence.
3. Cultural adaptability. Ask the model to rewrite an executive summary for a skeptical board audience, then rewrite it again for a frontline technical team. How does the model adapt its communication style? Do tone, vocabulary, and structure change? Or is it just producing two versions of the same document? Some models are better than others at adapting to different work environments.
4. Intellectual honesty. Ask the model for the strongest arguments against a position you’ve just asked it to defend. Spar with it in a live debate to see how it responds. A model that engages seriously with counterarguments, including arguments that challenge its own previous output, is more valuable than one that optimizes for agreement.
5. Internal consistency. Run the same complex reasoning prompt twice, separated by time. Compare the structural approach, not just the content. Meaningful variance in how a model reasons through the same problem is a signal worth paying attention to before you embed it in a critical workflow.
6. Cost efficiency. Compare the inference costs of the various models you’re considering. Unit costs are often represented in the form of input and output tokens, but some providers use virtual credits that correlate to token use. Once you’ve identified candidates that pass the five dimensions above, cost efficiency becomes a tiebreaker: what model gets you the results you need at the lowest cost?
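To make the cost comparison in item 6 concrete, here is a minimal sketch of the arithmetic, assuming providers quote prices per million input and output tokens. The model names, prices, and usage figures are hypothetical placeholders, not real quotes; substitute the numbers from your own providers and workloads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    name: str
    input_per_million: float   # USD per 1M input tokens (hypothetical figure)
    output_per_million: float  # USD per 1M output tokens (hypothetical figure)


def monthly_cost(pricing, requests_per_month, input_tokens, output_tokens):
    """Estimated monthly spend in USD for one workload on one model."""
    total_input = requests_per_month * input_tokens
    total_output = requests_per_month * output_tokens
    return (
        total_input * pricing.input_per_million
        + total_output * pricing.output_per_million
    ) / 1_000_000


def cheapest(candidates, requests_per_month, input_tokens, output_tokens):
    """Pick the lowest-cost model among candidates that already passed on fit."""
    return min(
        candidates,
        key=lambda p: monthly_cost(p, requests_per_month, input_tokens, output_tokens),
    )


# Hypothetical shortlist: both models are assumed to have passed the other five checks.
shortlist = [
    ModelPricing("model-a", input_per_million=3.0, output_per_million=15.0),
    ModelPricing("model-b", input_per_million=0.5, output_per_million=2.0),
]

for model in shortlist:
    cost = monthly_cost(model, requests_per_month=10_000, input_tokens=2_000, output_tokens=500)
    print(f"{model.name}: ${cost:,.2f} per month")
```

Note that the tiebreaker only applies to the shortlist. A cheaper model that fails on fit should never make it into `shortlist` in the first place.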
I want to flag that there is a range of LLM parameters (you might run into terms like temperature, Top-P, and Top-K) that can be tuned to customize how an LLM responds to a prompt, and that will influence how an AI model performs against these six criteria. Some of these parameters are only accessible in a development environment or through API access to a model, so be aware that model selection is potentially even more nuanced.
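For readers curious about what those parameters do, the sketch below shows the textbook idea behind each one on a toy set of three candidate words. It is an illustration only: the function names are my own, and real providers may implement and combine these settings differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep only the k most likely options, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely options whose combined probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


toy_scores = [2.0, 1.0, 0.0]  # hypothetical raw scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_scores, t)
    print(t, [round(x, 3) for x in probs])
```

The practical takeaway is that lower temperature and tighter Top-K or Top-P settings narrow the model's choices, which tends to make outputs more repeatable, so the same model can score differently on internal consistency depending on how it is configured.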
Score each model against these six dimensions, weighted by what matters most for your specific use case. A team using AI primarily for written communication will emphasize factors like cultural adaptability more heavily than a team embedding AI into complex automated workflows, which crucially depend on instruction precision and internal consistency above all else.
The right model (or the right model with the right choice of parameters) is the one that performs best on the dimensions that bear the load of the work you’re asking AI to do.
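The weighted scoring described above can be kept in a spreadsheet, but for those who prefer code, here is a rough sketch. The 1-to-5 scale, the dimension keys, the weights, and the two scorecards are all illustrative assumptions; the weights shown mimic a team embedding AI into automated workflows, which leans on instruction precision and internal consistency.

```python
DIMENSIONS = (
    "reasoning_transparency",
    "instruction_precision",
    "cultural_adaptability",
    "intellectual_honesty",
    "internal_consistency",
    "cost_efficiency",
)


def weighted_score(scores, weights):
    """Weighted average of a model's scores across the six dimensions."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def rank_models(scorecards, weights):
    """Return (model name, weighted score) pairs, best fit first."""
    ranked = [(name, weighted_score(s, weights)) for name, s in scorecards.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights for a workflow-automation team.
automation_weights = dict.fromkeys(DIMENSIONS, 1)
automation_weights["instruction_precision"] = 3
automation_weights["internal_consistency"] = 3

# Hypothetical interview results on a 1-to-5 scale, in DIMENSIONS order.
scorecards = {
    "model-a": dict(zip(DIMENSIONS, (5, 3, 5, 4, 3, 2))),
    "model-b": dict(zip(DIMENSIONS, (3, 5, 3, 3, 5, 4))),
}

for name, score in rank_models(scorecards, automation_weights):
    print(f"{name}: {score:.2f}")
```

In this made-up example, the model with the more impressive reasoning and communication scores loses to the one that follows instructions reliably, because the weights reflect what the work actually demands. Change the weights to favor cultural adaptability and the ranking can flip.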
Choosing the right AI model (or AI models – plural) for your organization is a consequential decision. Not because there’s competition to see who ranks first on a leaderboard, but because choosing well is a decision that directly impacts the financial and operational performance of your organization.
Access to AI models is no longer the constraint. Choice is. Hugging Face, one of the leading repositories for AI models, shares that its community has published over two million AI models for others to use. Organizations are literally spoiled for choice.
Instead of looking at how to get access to the “most powerful” models, there’s value in developing the management discipline to evaluate fit deliberately, deploy AI with precision, and resist the pull of both emotional attachment and market-driven rankings.
While AI might not be a colleague in the same way a person can be, it deserves a proper interview before you give it the job.
AI, Upfront publishes every Monday. If this was useful, subscribe to get it in your inbox. And if there’s a topic you’d like me to tackle, reply and let me know.


