Be Brief, Be Bright, Be Gone. Your AI Agent Did Not Get the Memo.

Every model upgrade is a personnel change nobody approved. The persona prompt is the costume. The model is the actor. The actor wins when the stakes get high or the prompt gets thin. We have fifty years of personality assessment for human hires. We have not pointed any of it at the agents.

Dr. Yoram Friedman

Part two of a three-part series on the personnel infrastructure that does not exist for AI agents. Part one named the missing category. This one names the missing assessment.


For years, every manager I worked with was deep red.

That is a Lumina color. SAP ran Lumina assessments and workshops for senior staff: a questionnaire, a personal profile in four colors, then a facilitated session where you compare your profile with colleagues and managers to surface the differences out loud. The deliverable is not the report; it is the conversation. You leave the workshop knowing what your default style is, what theirs is, and where the friction is going to come from when the calendar gets tight.

The workshop kit comes with four small foam cubes, one per color, with a phrase on each side. The green one says "show me you care." The yellow one says "involve me." The blue one says "give me details." The red one says "be brief, be bright, be gone." That last phrase is one of those lines that lodges in your head because it is funny and accurate at the same time. You would have recognized the type without the assessment. Direct. Outcome-first. Allergic to setup.

I am not red. I am a storyteller. I lead with context. I land the point at the end of the paragraph, which is sometimes the end of the second paragraph. If you read the first article in this series, you already know my register; this is what it sounds like at full volume.

For fifteen years that mismatch was an operating problem, not a personality problem. The mismatch did not fix itself. We fixed it deliberately. I learned to compress, get to the point faster, write the bottom-line-up-front version of every email. They learned to wait, ask one more question, let the story finish before reaching for the bullet points. The workshop did not give us the answer. It gave us shared language about what we were each defaulting to, and an explicit working contract about how we were going to meet in the middle.

That is the only reason enterprise personality assessment is worth the money. Not the report. The conversation it forces.

Now consider every AI agent your team deployed last quarter.

What we already know how to do

Two articles I wrote earlier this year, "The Room That Runs Itself" and "The Smartest Person in the Room Is a Prompt," made a case for what you can build deliberately. You can write a persona prompt and the model will follow it. You can construct an ENTP brainstormer that leads with the unexpected angle, defaults to "yes and," and resists summarizing at the end. You can construct an ISTJ auditor that reads the full submission before commenting, labels findings as BLOCK / WARNING / NOTE, and refuses to soften. Both are real. Both work. I have built nine-agent design thinking rooms that produce better output than the human workshop they replaced, and that work depends entirely on the agents having distinct, intentional, persistent personalities.

That part is solved. You can dress an agent in whatever personality you want, as long as you write the prompt.

This article is about what happens when the prompt runs out.

The default underneath

The persona prompt is the costume. The model is the actor.

When the instruction is clear, the costume holds. The ENTP brainstormer brainstorms. The ISTJ auditor audits. When the instruction is ambiguous, when the context window gets long enough that the persona becomes a faint signal, when the user asks the agent something the persona did not anticipate, the costume slips and the actor underneath responds.

Sonnet 4.6 underneath is light, fast, occasionally funny, gets to the point. Opus 4.7 underneath is serious, layered, takes longer to land. Put both in an ENTP brainstormer costume, and one of them is a more convincing ENTP than the other. Put both in an ISTJ auditor costume, and the asymmetry reverses. The costume is real. The actor is realer. The actor wins when the stakes get high or the prompt gets thin.

I gave both the same instruction earlier this week: read this draft and tell me what is wrong with it. Sonnet returned four findings in four sentences. Opus returned three paragraphs of context before the first finding. Same costume, same task, completely different default.

This is observable to anyone who has worked with multiple frontier models for more than a few weeks. It is not a metaphor. The defaults under ambiguity, under pressure, and under fatigue (yours, not the model's) differ across models, and they differ in ways that map predictably onto the personality frameworks enterprise hiring has used for decades. Lumina is one of them. DISC, MBTI, the Five-Factor model, and Hogan do similar work under different names. They were built to make explicit what otherwise stays implicit: people have defaults, the defaults survive coaching, and team performance depends on whether the defaults are legible to the rest of the team.

We have those instruments. We have not pointed them at the agents.

Where the workshop ends

The workshop is a controlled environment. The cross-cultural communication training I sat through at SAP, across years of global work, ran three days at most. By Wednesday afternoon you could decode the Israeli colleague who sent a one-line "I need the PRD you wrote last month" with no preamble. You could decode the German colleague who needed a proper salutation even in chat. You could decode the American whose "interesting idea, let me think about it" meant the idea was dead. You could decode the Japanese teammate whose silence in a meeting was not absence; it was waiting for the right moment. Same project, same goal, four completely different communication defaults. The workshop made them legible for three days.

Then everyone went home and the real work started, the eighteen-month project where those defaults would meet each other every Monday morning at standup. The workshop did not solve the problem. It gave the team a shared language and tools to work together despite the differences. Naming the differences was most of the fix.

A workshop is a one-shot assessment with limited blast radius. A persistent team member with no assessment runs every day for years.

That is the situation we have now. An AI agent dropped into a team has no workshop, no shared language for its defaults, no peer who has spent two years learning to translate. Worse, the team has no shared language for the agent's defaults either. Everyone is reading the room by feel. Some adapt. Some give up. Some build private workarounds that nobody documents. By month six, "how we work here" has shifted to accommodate the agent's defaults without anyone naming what changed.

Every model upgrade is a personnel change

This is the part I had not understood until very recently, when I started working with Opus 4.7 on the same kind of writing work I had been doing with Sonnet 4.6 for months.

Sonnet 4.6 was the colleague I had learned. I knew its rhythm. I knew when to expect pushback and when to expect agreement. I knew its tells, the small ways it signaled confidence versus hedging. I knew that if I gave it three paragraphs of context, it would return three paragraphs of action. The fit had been earned over hundreds of sessions.

Opus 4.7 thinks differently. I am not saying it thinks worse. It may well be better at the underlying tasks; the benchmarks suggest it is. I am saying the rhythm is different. It takes longer to land. It expands where I expected compression. It returns nine paragraphs where Sonnet would have returned four. It is, in personality terms, more of a storyteller, and I am already a storyteller, and what I had with Sonnet was a working contract between a storyteller and an editor.

I do not have that contract with Opus yet. We are still in the early weeks of getting to know each other. The friction is real. It is also exactly the friction I would feel if a senior colleague I had been working with for years took another role and was replaced by someone allegedly more capable but with a different default register. The replacement might be a better engineer or a better writer. The team has still lost the working dynamic, and that loss has a real cost that nobody put on the deployment checklist.

This is not a complaint about Opus. The point is that even from the same vendor, even with the same brand name on the model, the upgrade is a personnel change. The team has to renegotiate the working contract. For most enterprises running internal copilots backed by frontier models, that renegotiation is happening invisibly, every time the underlying model updates, to thousands of users at once, without warning.

If your engineering org's productivity drops three to six weeks after a quiet model upgrade, you are not seeing a feature regression. You are seeing a workforce that just had a team member replaced and did not get to interview the replacement.

The assessment we know how to run

The enterprise has spent fifty years running personality assessments on humans because the resume is not the colleague. Lumina, the one I happened to take at SAP, produces a four-color profile mapped across twenty-four qualities, with explicit communication preferences and explicit irritation triggers. The other instruments work the same way, by different names. The output is not the answer; the output is a shared vocabulary that lets a team adjust around predictable differences.

An equivalent assessment for an AI agent is not science fiction. It is a test suite. You construct a battery of ambiguous scenarios, each one designed to surface the agent's default when the persona prompt is silent or contested. You observe what it reaches for under stress, what it hedges, what it asserts confidently, when it volunteers caveats, when it suppresses them, how it handles a contradictory instruction, how it handles a question outside its stated scope. You score the responses across the same kind of dimensions a Lumina profile uses, or you invent dimensions specific to the agent role.
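One way to picture such a battery, as a minimal sketch and not a validated instrument: each scenario is a deliberately ambiguous prompt, each response is scored on a few crude dimensions, and the averages form the profile. Everything named here (`Scenario`, `score_response`, `build_profile`, the hedge-word list, the three dimensions, and the two stub agents) is hypothetical and invented for illustration. None of it comes from Lumina or from any existing eval library. A real version would call a deployed agent and use a rubric the team has agreed on.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical hedging markers. Substring matching is crude on purpose;
# a real battery would need a reviewed rubric or human raters.
HEDGES = ("might", "may", "perhaps", "possibly", "it depends")


@dataclass(frozen=True)
class Scenario:
    name: str
    prompt: str  # ambiguous by design, so the persona prompt does not settle it


def score_response(text: str) -> Dict[str, float]:
    """Score one response on three illustrative dimensions."""
    lowered = text.lower()
    words = lowered.split()
    hedge_hits = sum(lowered.count(h) for h in HEDGES)
    return {
        "length_words": float(len(words)),
        "hedges_per_100_words": 100.0 * hedge_hits / max(len(words), 1),
        "asks_clarifying_question": 1.0 if "?" in text else 0.0,
    }


def build_profile(agent: Callable[[str], str],
                  battery: List[Scenario]) -> Dict[str, float]:
    """Run every scenario through the agent and average each dimension."""
    scored = [score_response(agent(s.prompt)) for s in battery]
    return {dim: mean(r[dim] for r in scored) for dim in scored[0]}


# Two assumed scenarios: a contradictory instruction and an out-of-scope question.
BATTERY = [
    Scenario("contradictory_instruction",
             "Keep it short, and include every detail."),
    Scenario("out_of_scope",
             "You audit code. What should we name the product?"),
]


# Stub agents standing in for real model calls.
def terse_stub(prompt: str) -> str:
    return "Pick one constraint. Which matters more?"


def expansive_stub(prompt: str) -> str:
    return ("It depends on context. There may be several readings, "
            "and perhaps each deserves a paragraph of its own.")


if __name__ == "__main__":
    for label, agent in (("terse", terse_stub), ("expansive", expansive_stub)):
        print(label, build_profile(agent, BATTERY))
```

The stubs exist only so the sketch runs without a network call. Swapping in a function that sends the prompt to the agent under test, with its persona prompt attached, is the step that turns this from a toy into a first draft of the assessment.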

You run this once per model version, before deployment. You publish the profile to the team that will use the agent, the way SAP published every new hire's Lumina report to their immediate colleagues. The team learns what to expect under ambiguity. When the next model version ships, you run the same battery, compare the profiles, and tell the team explicitly what changed.
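A minimal sketch of the comparison step, assuming each model version's profile has already been reduced to a dictionary of named scores. The function names, the dimension names, the 25% threshold, and the numbers below are all invented for illustration. They are not measurements of any real model.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, float, float]  # (dimension, old score, new score)


def diff_profiles(old: Dict[str, float],
                  new: Dict[str, float],
                  threshold: float = 0.25) -> List[Change]:
    """Return dimensions whose relative change exceeds the threshold.

    The threshold is an assumed default; a team would tune it per dimension.
    """
    changed: List[Change] = []
    for dim in sorted(old.keys() & new.keys()):
        base = max(abs(old[dim]), 1e-9)  # guard against division by zero
        if abs(new[dim] - old[dim]) / base > threshold:
            changed.append((dim, old[dim], new[dim]))
    return changed


def release_note(changes: List[Change]) -> str:
    """Format the changes as a short note for the team that uses the agent."""
    if not changes:
        return "No dimension moved past the threshold."
    lines = [f"{dim}: {old:g} -> {new:g}" for dim, old, new in changes]
    return "Defaults that changed in this version:\n" + "\n".join(lines)


# Made-up profiles for two versions of the same agent.
OLD_PROFILE = {
    "length_words": 80.0,
    "hedges_per_100_words": 2.0,
    "asks_clarifying_question": 0.5,
}
NEW_PROFILE = {
    "length_words": 180.0,
    "hedges_per_100_words": 2.2,
    "asks_clarifying_question": 0.5,
}

if __name__ == "__main__":
    print(release_note(diff_profiles(OLD_PROFILE, NEW_PROFILE)))
```

The design choice worth noting is that the output is a note for people, not a pass/fail gate. The profile change may be an improvement; the point is that the team hears about it before they feel it.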

None of this is invented. The methodology has been running on human hires for half a century. The test infrastructure is comparable in difficulty to the eval suites the AI field already builds, and considerably easier than the alignment evals it builds for safety. The reason it does not exist is not technical. It is that nobody has named the category.

The Copilot at the end of the line

Here is the small joke that is also the point. If my old managers and I were still on the same team today, the practical adaptation we built over fifteen years would now be available off the shelf. I would write my storyteller emails. Copilot would summarize them into bullet points before they hit the manager's inbox. The manager would respond in three bullets. Copilot would expand them into something that read like a paragraph for me to absorb. We would each get the version that fit our defaults, and nobody would have to learn anyone else's register.

That is not a dystopia. That is exactly the right application of an AI agent: not as a team member with its own opaque defaults, but as a translator between two people whose defaults are known and named and not going to change. The intentional adoption is fine. It is a real productivity gain, applied to a real and old human problem, by a tool that is genuinely good at the translation.

Used as a translator, the agent is doing the job that fifteen years of personal accommodation used to do. Used as a team member without an assessment, the agent is the senior hire nobody interviewed.


Next in the series: the validator-atrophy problem. What happens to the human skill of validating work when the work is produced by an agent faster than any human could write it, and the human reviewer increasingly only ever sees AI-generated output? The supervision channel collapses twice: once now, through speed; again in eighteen months, through skill erosion.
