Maya’s first review of the morning was a garden. Not the interesting kind. She enjoyed a public courtyard or a rooftop redesign. This was just a standard 110-square-meter plot just outside of town. The housing unit’s intent stack was pretty good: make it low-maintenance, include native plantings, ensure a southern exposure, and limit annual heating costs to 20k SEK. Two adults, two children under ten. The system had generated the full design overnight: a planting plan, gravel paths, an irrigation system, and a projected installation sequence for the landscaping unit. The whole thing could be rolled out in 2-3 days, depending on weather conditions. All Maya had to do was evaluate.
She pulled up the render and immediately felt that something was off. The plantings were fine. The proportions were fine. But, the entire thing was optimized for the adults’ stated preferences. Ornamental grasses, a clean gravel terrace, and birch trees. It looked beautiful, sure. But, as a parent, she knew: this would survive approximately one afternoon with children! There was nowhere to dig. Nowhere to run. No flat surface that could become a football goal, a music stage, or a launch pad. The system knew the children’s ages. It just didn’t understand that kids will claim any open surface as a play area. A garden designed to be looked at will be destroyed by people who want to live and play in it.
She flagged it and started adjusting the context: garden must accommodate unstructured outdoor play; prioritize durable surfaces and open space over ornamental planting density. Then she kicked it back to the system. Thirty seconds of work. But that was the job: Catching what the system couldn’t infer because no one had thought to say it.
Her morning queue had eleven more reviews. A bathroom accessibility issue. Two office-to-residential conversions. A family kitchen where the system had done something clever with the storage layout that she wanted to look at more carefully. And a daycare outdoor space. She liked the daycare ones. The intent stacks for children’s spaces were always more complex, more interesting to shape. Adults generally know what they want. Or think they want, anyway. Kids need someone to advocate for what they don’t know to ask for. And they usually wanted unexpected things.
She worked through six reviews before her coffee got cold, which by her standards meant it was a good morning. She reflected that the system had improved significantly on older Swedish buildings in the last year, likely due to the local fine-tuning partnership. From a global point of view, it’s a small subspace, so it wasn’t surprising it’d taken a while to get to it.
She finally got to the daycare center. She was pleasantly surprised that the system had routed rainwater into a shallow canal rather than a standard subsurface drain. This created a play feature for the kids on rainy days without adding any cost. Maya hadn’t included any context about supporting play on rainy days; the system had pulled it from a municipal “early childhood framework” that someone at the city level had added to the public intent library three months ago. She approved the design and made a note to look up who had contributed that framework. Good context work deserved to be visible!
At 10:15, she had a calibration session, which she hated. Once a month, the regional evaluation group met to review each other’s decisions and check for drift. She understood the theory: evaluator consistency matters, feedback quality degrades if reviewers develop idiosyncratic preferences that diverge from the intent stacks they’re supposed to serve. It was just boring. In practice, it felt like being audited by your peers while making small talk.
Jens, who reviewed infrastructure projects and had strong opinions on everything, thought she hadn’t been careful enough to ensure a sufficiently strong balcony enclosure. “The wind load context was ambiguous. You should have sent it back for structural re-evaluation.”
“The context wasn’t ambiguous,” Maya said. “It referenced the 2039 Atlantic coastal wind tables, which are reasonable. The system made a reasonable call.”
“Reasonable based on what baseline? The tables haven’t been updated for the new storm frequency models.”
He had a point. She didn’t love that he had a point. She flagged it for follow-up and moved on.
After calibration, she had twenty minutes before her next block, so she checked the news. Parliament was debating the Autonomy Threshold Act again, the latest attempt to set legal limits on recursive self-improvement in research systems. The proposed threshold was 10,000 consecutive unsupervised optimization cycles. More than that would require a human evaluation checkpoint. The opposition argued the number was arbitrary, which it obviously was. The government argued that any threshold was better than none, and that was obviously true. The whole debate felt exhausting. Maya had voted for the party that wanted stricter limits, though she wasn’t sure she’d do so again. Her brother Erik, who worked in materials science, said the restrictions had already cost his lab a year on a polymer that could have been useful for carbon capture. “We had a system that was clearly improving its own search strategy,” he’d told her last Midsommar, frustrated and a little drunk. “Clearly. And we had to stop it, file a review request, and wait eleven weeks for the Ethics Board to approve the continuation. Eleven weeks. For a polymer.”
She understood his frustration. She also considered what “clearly improving” meant when the system was deciding what counted as improvement. That was the part that made her uneasy. Not the capability, but the evaluation loop is closing without a human inside it. Her whole career was built on the premise that someone had to check. Removing the check was either the most important breakthrough in history or the most dangerous mistake. She didn’t know which, and she didn’t trust anyone who claimed they did.
The calibration session ran forty minutes over, which killed her focus for the next hour. She spent the dead time updating her personal context library - the set of preference patterns, evaluation heuristics, and domain-specific knowledge that she’d built up over eight years of review work and that, honestly, was the most valuable thing she owned professionally.
Every reviewer had one. Hers was particularly deep on family-oriented residential spaces. She’d started contributing fragments of it to the public intent library last year, after the regional co-op offered a small licensing fee, and it brought in enough to cover her daughter’s swim lessons. Not life-changing money. The licensing rates for context contributions had been declining for years - more contributors, more overlap, the usual commoditization pattern. Her colleague Priya, who had been in evaluation since the early days, remembered when a well-structured domain context library could sell for serious money. “Now everyone has one,” Priya had said, not bitterly exactly, but with the resignation of someone who’d watched a profession’s margins compress in real time. “The tools for building them are too good. The value moved upstream.”
Upstream meant the frontier models themselves, or the big orchestration platforms, or (and this was the part Maya found interesting) the people who did the genuinely hard evaluation work. Not the normal reviews, which paid fine but nothing special. The truly ambiguous stuff. Decisions where you had forty options that looked nearly identical on paper, and the right choice depended on factors that required actual real world research.
She’d done a consulting stint like that last year, helping a municipality select between competing proposals for a mixed-use development. Twelve proposals, all generated by different systems working from the same intent stack, all technically compliant, all broadly similar. The differences were subtle. A slightly different ratio between commercial and residential units, a different assumption about how foot traffic would flow in the evening, a different read on whether the neighborhood should beopen or closed. No computable metric could really rank them. She’d spent two weeks walking the site, talking to residents, studying the existing patterns of use, and eventually recommended proposal seven, for reasons she could partially articulate and partially couldn’t. “This one feels right based on how people actually move through this kind of space,” she’d written in her evaluation, aware of how unsatisfying that was as a justification. But the municipality had accepted it, and the development was under construction now.
That kind of judgment, the kind that drew on years of accumulated experience, that operated on 20 watts of biological computation, that updated itself continuously from every site visit and every conversation and every mistake, was still hard to replicate. The systems were getting better. They were always getting better. But they still learned in leaps between training runs, and they still struggled with the kind of sparse, rare preference spaces where a human evaluator’s intuition was genuinely the best tool available. For now.
Lunch was a sandwich she assembled herself. Considering how clever the system was getting it was strange that it still couldn’t make a decent sandwhich. She ate at her desk and read an article about evaluation work in pharmaceutical development, which was apparently going through the same growing pains that architectural review had gone through five years ago. The pharmacologists were struggling with how to structure intent stacks for drug interaction profiles - too specific and the system couldn’t explore the solution space, too general and you got safe but useless candidates. The article quoted a researcher saying, “We’re essentially trying to encode clinical intuition into context, and it turns out clinical intuition is mostly pattern recognition that experienced doctors can’t articulate.” Maya laughed. Welcome to the world of evals.
The article also mentioned that two of the major pharma platforms had switched their backend to Qianshi’s new model, which was reportedly 40% more energy-efficient per inference call than the previous generation. The gains were real but incremental. Everyone in the industry talked about the energy problem - the data centers in the North drew more power than the cities they were built next to - and everyone agreed that the current trajectory was unsustainable, and every year the trajectory sustained itself anyway because the efficiency gains kept just barely outpacing the demand growth. Her brother thought fusion-powered training clusters would resolve it within a decade. Maya thought her brother was an optimist about everything except bureaucracy.
After lunch, she had a project she actually cared about. The city was piloting a new public housing program, and Maya had been asked to help design the master intent stack. This was the top-level context that would guide the AI systems generating individual unit designs. This was upstream work, the kind of context engineering that shaped thousands of downstream decisions. It was hard, and she found it slightly scary. Every preference she encoded, every priority she set, every tradeoff she baked into the stack would propagate. Get the context right, and a thousand families get homes that work for them. Get it wrong, and the errors would be subtle, systemic, and hard to resolve later.
She’d been working on the density-versus-community-space tradeoff for two weeks. The economics pushed toward higher density. The child development literature pushed toward more shared outdoor space. The city’s own policy documents were contradictory. Pro-density in the housing strategy, pro-green-space in the urban planning framework. No system could resolve this. It was a values question, and values questions were hers to answer.
She had a draft framework that weighted outdoor play space more heavily for units with children, funded by slightly reduced storage allocations for units designated for single occupants. It wasn’t elegant. It was a compromise, and it would annoy the single occupants, and she’d need to present it to the city’s review board next week. She spent an hour stress-testing it. That meant running the intent stack through the generation system and reviewing the outputs, checking whether the tradeoff actually produced the spaces she had in mind. Mostly it did. The system kept under-allocating covered outdoor space, probably because the training data skewed toward open-air designs. She adjusted the context, re-ran, and reviewed. Adjusted, re-ran, reviewed.
This was the part of the job she loved. Not the mechanical review queue. This. The feeling of shaping something real through the quality of your judgment. She thought of it like tuning an instrument that someone else would play.
At 15:30, her daughter’s school flagged a logistics issue. There was a plumbing robot wedged in a doorframe (this happened more often than the robotics people liked to admit, especially in older buildings with non-standard door widths), and they needed parents to pick up early. Maya saved her work, grabbed her jacket, and walked the twelve minutes to school.
On the way, she passed the site of Proposal Seven. Her recommendation was now a real building taking shape. The ground floor was going to be good. She could already see how the entry setback would create the small sheltered zone she’d liked in the plans, the one that would catch afternoon sun and probably accumulate café tables and strollers and the slow, comfortable friction of people deciding whether to go in or stay out. No metric had predicted that. She’d just known.
Her daughter, Elsa, was sitting on the front steps with her backpack and a drawing of what appeared to be a horse, or possibly a dog, or possibly a building. “It’s our house,” Elsa said, “but if it were alive.” Maya studied it with genuine attention. The roof had eyes. The door was a mouth. The windows were ears.
“I think it needs a garden,” Maya said.
“It has a garden. That’s the hair.”
Maya took the drawing, took her daughter’s hand, and walked home. On the way, Elsa asked why robots get stuck in doors, and people don’t. “Because you’ve been walking through doors your whole life,” Maya said. “You learned how wide you are without anyone teaching you. The robot has to measure every time.”
Elsa considered this. “That’s dumb,” she said.
“It is a little dumb,” Maya agreed.