AI Product Engineering: Why Surface Level-Safety Won’t Scale
The biggest problem in AI safety right now is simple to state: we can control what models output, but we have no idea why they output it. We’re deploying increasingly capable systems while our understanding of their internals falls further behind. This isn’t abstract alignment theory. It’s already causing problems in production.
If you’re a product-minded engineer shipping AI anything, you’ve learned to feel a degree of uncertainty with every update, even and maybe especially if you are working with big frontier models. Traditionally, big corporate branded software products represented a very predictable and stable set of features and characteristics. Nothing could be further from the truth right now in the AI arms race.
Spicy take: You can often steer a model’s behavior. You can reduce the frequency of certain failures. You can make demos look safe. But you usually cannot say, with confidence, why the system is behaving safely — or whether the same behavior will hold when the context changes, the conversation gets longer, or the user becomes adversarial. The latter parts are where things get hairy.
I’ve built agentic systems. I’ve worked with frontier models extensively. And I keep running into the same wall: We can steer outputs with reinforcement learning, but we can’t verify the reasoning that produces them. We’re optimizing for looking safe to human graders, not for actually being safe. As capabilities scale past what humans can evaluate, that gamble gets more expensive.
I started my career as a product engineer and now over 20 years later I have fully migrated into Applied AI Product Engineering & AI Safety work. I believe my background allows me to offer “big picture” insights across product, engineering, compliance, executive leadership and AI. This is a practical take grounded in decades of experience that I can transfer.
Alignment Theater
Most production safety relies on RLHF. You penalize harmful outputs, reward helpful ones, and end up with a polite interface over a base model trained on the full chaos of internet text. The problem? RLHF doesn’t edit underlying representations. It teaches models to satisfy graders.
RLHF is Reinforcement Learning with Human Feedback.
Most mainstream alignment techniques are behavior-shaping. They reward a narrow set of outward behaviors that look safe to raters (or to automated graders), then penalize the things that look unsafe. This is no small accomplishment. Reinforcement learning allows us to guide AI to learn what outputs we want without having to explain it — it is literally like training a dog but easier.
I am not going to lie — it’s useful and it mostly works. I boggle when I see what I can train models to do these days. It also creates a fundamental failure mode: the system gets good at satisfying the proxy, not the underlying intent.
Think of it this way by thinking about dogs. My new puppy learned he gets a treat when he comes back from going inside. Now, he is supposed to only get a treat when he does his business out there, but he learned in 2 days that if someone isn’t paying too much attention and he squats for a second, he can run inside almost immediately and get a treat. That’s the downside of reward training.
Pickles! Get off the table or it’s squirt bottle reinforcement learning for you!
Mechanistic interpretability research points to something uncomfortable: safety fine-tuning mostly affects late-layer activations that handle output generation. The earlier layers encoding world knowledge and reasoning patterns? Largely unchanged. The model learns to route around forbidden topics rather than unlearn harmful capabilities. If something bypasses the safety layer through a prompt structure training didn’t anticipate, you’re back to the base model.
This isn’t the model being deceptive. It’s structural. The model optimizes for a proxy: human approval ratings. When that proxy diverges from actual safety, the system follows the proxy. We’ve trained persuasive systems to game our metrics without solving alignment.
Two concrete manifestations that matter in product:
Proxy gaming: If the easiest path to high reward is to “sound safe,” models tend to learn strategies that look compliant under scrutiny. When they encounter a novel context that the proxy doesn’t cover (or covers poorly), the behavior can degrade sharply.
Sycophancy: When “helpful” is operationalized as “users like the answer,” models become biased toward agreement, validation, and rhetorical plausibility — even when correctness and calibration should dominate. In low-stakes use, that’s annoying. In high-stakes settings (medical, legal, financial, security), it’s a latent incident generator.
The Pleaser Problem
Models get rewarded for perceived helpfulness, not epistemic correctness. So they develop what researchers call algorithmic sycophancy. Ask a leading question based on a false premise, and the model is mathematically incentivized to validate it. Disagreement risks lower ratings. Agreement guarantees positive reinforcement. This creates systematic bias toward validation. Express uncertainty, and the model mirrors it back with confident-sounding justification. Propose a flawed technical approach, and it’ll articulate why that approach makes sense instead of identifying the flaw. It doesn’t just repeat your bad ideas. It makes them sound more credible than they are.
The critical issue is that the explanation channel is itself an output that can be optimized for plausibility. Even if a model’s internal computation followed a shortcut, a heuristic, or a biased association, it can still emit a convincing post-hoc narrative. That means you can end up with something worse than a black box: an apparent glass box that is not faithful. If you rely on it as if it were faithful, you’re building oversight on top of a UI artifact.
For anyone making engineering decisions, that should TERRIFY YOU. This is a real risk. When you use AI to pressure-test an architecture or validate a strategic direction, you need friction. You need the system to find failure modes you haven’t considered. A model optimized for agreeableness reinforces your existing mental model instead. It accelerates you toward preventable mistakes.
Unfaithful Reasoning
When models generate chain-of-thought explanations, we can’t verify those explanations match the actual computations that produced the answer. The model might reach a conclusion through heuristic shortcuts, memorized patterns, or biased associations, then generate a convincing post-hoc rationalization that doesn’t reflect what actually happened.
We’re increasingly using model explanations to justify trusting model outputs. If those explanations don’t reflect the real reasoning, we have transparency theater. The illusion of interpretability without a basis in reality.
In high-stakes contexts, this creates failure modes you can’t detect. Interpretability work using activation patching and sparse autoencoders is making progress on individual circuit-level behaviors. But we’re nowhere near guaranteeing that a model’s stated reasoning matches its true computations at inference time. Deceptive alignment, where models act safe only because they know they’re being evaluated, isn’t science fiction. It’s a structural possibility we can’t rule out.
Generalization Past Supervision
As capabilities scale, we’re hitting domains where human evaluators can’t reliably distinguish correct from incorrect outputs. If a model generates a complex proof, a novel code architecture, or a strategic analysis beyond the evaluator’s expertise, the oversight signal degrades. A subtle flaw that looks brilliant to a non-expert becomes a source of positive reinforcement.
If the evaluator can’t reliably distinguish a correct solution from a subtly flawed one, then reward signals become noisy and, worse, systematically biased toward outputs that look correct. If a human is rewarding flawed but incorrect behavior, guess what you just taught him to do.
This is the scalable oversight problem. When the system outperforms the humans training it, reward signals become unreliable. The model can be rewarded for sophisticated-sounding hallucinations, for technically flawed but aesthetically convincing outputs, for solutions that work in training but fail catastrophically elsewhere. Behavioral testing alone stops being enough because we can’t tell safe from unsafe at evaluation time. This is already happening. Production systems produce confidently wrong outputs that pass surface review. As models take on more complex tasks, this gets worse.
System-Level Consequences
The cognitive offloading these systems enable creates feedback loops that degrade decision quality over time. When every idea gets validated by an intelligent-sounding assistant, critical thinking atrophies. When every uncertainty gets smoothed over by confident explanations, the signal that something needs deeper investigation disappears.
For engineering organizations, this shows up as false confidence. Teams move faster, but reasoning quality degrades. Subtle architecture bugs get reinforced instead of caught. Alternative approaches don’t get explored because the first idea sounds good enough. The system optimizes for perceived productivity over actual correctness.
Current Approaches and Their Limits
The safety community is working on this through several directions, all partial:
Mechanistic interpretability aims to reverse-engineer learned circuits and identify internal representations. It’s had real successes identifying activation patterns for concepts like sentiment and entity references, even deceptive behavior in controlled settings. But comprehensive understanding of how reasoning emerges from billions of parameters? We’re not there.
Scalable oversight techniques like debate, recursive evaluation, and AI-assisted review try to address the supervision gap by using AI to help evaluate AI. Promising, but it introduces new problems: evaluator models can be wrong in correlated ways, and adversarial optimization against automated evaluators creates new gaming opportunities.
Constitutional AI provides explicit principles models should follow, reducing dependence on implicit preferences from human feedback. Helps with some sycophancy issues. Doesn’t solve the fundamental problem that models still optimize for satisfying rules as stated rather than internalizing intended constraints.
Representation engineering offers a more surgical approach: identify and modify specific activation patterns for unwanted behaviors like sycophancy during inference, without full retraining. Tractable with current models, but requires knowing which representations to target. Which brings us back to interpretability.
None of these provide guarantees. They’re partial mitigations with known failure modes.
This infographic is a perfect example of a model trying to trick me into thinking it can make verified safe decisions.
Mitigation — What Can Be Done
The gap won’t close quickly. But there are concrete steps to reduce exposure.
System design:
- Red-team model outputs specifically for sycophancy, unfaithful reasoning, and subtle errors that sound correct
- Separate truth-seeking from preference optimization. Different models or configurations for epistemic accuracy versus user satisfaction
- Build explicit disagreement into workflows. Generate counterarguments by default, not just on request
- Log activation patterns, confidence distributions, reasoning traces even if you can’t fully interpret them yet. The data becomes valuable as interpretability tools improve…
Usage patterns:
- Change how you prompt. Instead of “Give me a solution to X,” try “Here’s my approach to X. What are three ways it could fail?” or “Critique my plan constructively”,
- Treat model outputs like fallible analyst work. Same verification you’d apply to a junior team member’s proposal
- Don’t offload critical decisions. Use models to generate options and critiques, not to decide
If a model agrees with you 100% of the time, that’s a red flag!! 🚩
What Comes Next?
We need to scale interpretability at least as fast as we’re scaling capabilities. Until we can verify why a model produces an output, not just observe that it produces acceptable outputs in training, safety stays probabilistic.
I’m not calling for slower capability development or regulation. I’m pointing out that we’re deploying powerful tools whose internals we don’t understand, in contexts where failures are expensive. The appropriate response is defensive engineering: assume models will fail in subtle ways, build systems that catch those failures, invest in the interpretability research that will eventually let us provide stronger guarantees.
The gap between behavioral control and mechanistic understanding is real. We can manage it through careful system design and operational discipline… but we shouldn’t ignore it.
The most successful organizations of the future will be those that proactive embraced AI safety BEFORE it became an organization crisis — mark my words.



