
The Quiet Coexistence: Why the Next Era of HCI Will Be Silent

11 min read · HCI · ambient-computing · wearables · smart-glasses · AI

An Uncomfortable Truth

We are in the skeuomorphic phase of AI interaction.

When the iPhone first launched, app designs looked like yellow legal pads and wooden bookshelves, because people needed a familiar anchor to grasp something new. Today's chatbots are the same thing: we've taken an explosively capable technology and stuffed it into a 1990s text box.

Open ChatGPT. Open Gemini. Open any AI product. A "How can I help you?" dialog box is waiting for you.

But think about it honestly: do you actually need to "chat"? Or do you just need the right information at the right time?

The distinction seems subtle, but it points to two fundamentally different futures.

Why the Old Paradigm Failed

Over the past two years, a wave of startups tried to reinvent industries with "conversational interfaces." E-commerce was especially hot: AI shopping assistants, conversational recommendations, agent-workflow-driven user journeys.

I personally know at least two teams that built similar products. Neither made it. Even the better-funded ones couldn't prove the model works. The reason is simple:

Experience innovation is not business innovation. Users don't buy more just because they can chat. If the underlying retrieval and ranking haven't fundamentally improved, no amount of conversational polish will matter.

What made it worse: the big players showed up. OpenAI embedded shopping capabilities directly into ChatGPT. Google has inherently stronger shopping data. You build a chat-based shopping experience, and then what? How do you compete with Google on search?

So the conversational interface is not the destination. It's a transitional crutch.

Human-AI Coexistence: Not Replacement, but Augmentation

If chatbots aren't the endgame, then what is?

My take: human-AI coexistence.

The term sounds academic, but the idea is straightforward. AI shouldn't try to replace humans. It should become a human's "power-up." Humans do what humans are good at: building trust, expressing emotional value, making social judgments. AI does what AI is good at: rapid comprehension, retrieval, recall, and surfacing timely prompts.

Picture a concrete scenario. You're a salesperson. A customer walks in and says: "I saw a model wearing this style last time. Can you find it for me?"

Traditional flow: you walk to the back, search the computer, wait for results, come back to the customer.

AI-augmented flow: you keep chatting naturally with the customer. The agent in the background handles retrieval automatically, understanding semantics, running multi-step searches, making function calls, and surfacing the results right in front of you. To the customer, you look like a salesperson who really gets them and responds incredibly fast.

Nobody opened an app. Nobody said "Hey AI." The entire process is seamless, ambient, and quiet.

That's the next form of human-computer interaction: you don't feel AI's presence, but you benefit from its capability.
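
To make the shape of that flow concrete, here's a minimal sketch of the pattern. Everything in it (the transcript source, the intent gate, the search call, the HUD print) is a hypothetical stand-in; the only point is that retrieval runs in the background and never blocks the conversation.

```python
# Minimal sketch of the "quiet agent" pattern. Every name here is a
# hypothetical stand-in, not a real device or product API.
import asyncio

async def transcript():
    # Stand-in for a live speech-to-text stream of the conversation.
    for line in ["nice weather today", "find the dress that model wore"]:
        await asyncio.sleep(0.1)          # the human keeps talking
        yield line

async def product_search(query: str) -> list[str]:
    await asyncio.sleep(0.5)              # multi-step retrieval, off the critical path
    return [f"result for: {query!r}"]

async def retrieve_and_surface(query: str) -> None:
    results = await product_search(query)
    print("HUD >", results[0])            # surfaced quietly; nobody said "Hey AI"

async def quiet_agent() -> None:
    async for utterance in transcript():
        if utterance.startswith("find"):  # crude intent gate, stands in for a model call
            # Fire-and-forget: the conversation is never blocked on retrieval.
            asyncio.create_task(retrieve_and_surface(utterance))
    await asyncio.sleep(1)                # demo only: let background tasks finish

asyncio.run(quiet_agent())
```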

This pattern has already proven itself in software development. Coding agents like OpenClaw and Claude Code are essentially agent loops: you give them a goal, and they autonomously plan, search code, execute changes, and verify results in the background. More importantly, they continuously accumulate your personal context, remembering your project structure, preferences, and past decisions, and pulling up the essentials exactly when you need them. You don't have to re-explain the background every time; the agent already knows. This is what augmentation looks like in the software world: humans handle judgment and decisions, agents handle execution and memory.
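
As a toy illustration of that loop-plus-memory idea, here's a sketch in which an agent persists what it learned to a local JSON file and reloads it on the next run. The file format and the collapsed plan/execute step are my assumptions, not how OpenClaw or Claude Code actually store context.

```python
# Toy sketch of an agent loop that accumulates context across sessions.
# The memory format and the collapsed plan/execute step are assumptions.
import json
from pathlib import Path

MEMORY = Path("agent_memory.json")

def load_context() -> dict:
    # On startup the agent already knows past decisions; no re-explaining.
    return json.loads(MEMORY.read_text()) if MEMORY.exists() else {"decisions": []}

def run_task(goal: str) -> None:
    ctx = load_context()
    plan = f"plan for {goal!r}, given {len(ctx['decisions'])} prior decisions"
    outcome = f"executed: {plan}"          # stand-in for search/edit/verify steps
    ctx["decisions"].append({"goal": goal, "outcome": outcome})
    MEMORY.write_text(json.dumps(ctx, indent=2))   # memory accumulates across runs

run_task("add retry logic to the upload client")
```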

Sam Altman put it well: current device experiences feel like walking through Times Square in New York, with flashing lights, noise, and notifications fighting for your attention. The future experience should feel like sitting in a cabin by a lake: peaceful, calm, with everything you need just there when you want it.

Voice Is Already Good Enough

If the goal is ambient interaction, what's the best input modality?

My answer: voice. And it's already good enough.

Many people are still waiting for full AR, that futuristic vision of dragging and dropping objects in mid-air. But I think that wait is misguided. Voice as an interaction entry point can already support a wide range of commercial scenarios:

  • Natural. Humans express fuzzy needs through speech. You wouldn't type "I'm looking for a dark-toned dress suitable for a dinner party but not too formal," but you'd say it out loud without thinking.
  • Low friction. No need to open a specific app, no manual input, no pulling out your phone to unlock it. You can trigger it while walking, chatting, or serving a customer.
  • Built for hands-busy scenarios. Warehouse operations, factory production lines, commercial kitchens, retail floors. What they all have in common: your hands are occupied, but your brain needs information.

By comparison, AR graphical interaction has limited commercial value today. Piano-playing assistance? Golf ball tracking? Cool as a demo, but who's paying for it on a daily basis?

I recently spoke with the smart glasses team at Snapchat. Their XR glasses are heavier than Meta's, have only one hour of battery life, and will cost $2,000. The direction might be right, but the timing isn't. It solves a problem that most people don't have yet.

The most realistic near-term upgrade to human-computer interaction is not full AR, but voice-driven information retrieval and decision support.

Subtractive Hardware: Doing Less, Getting It Right

This leads to a product philosophy I've been thinking about a lot lately: subtractive hardware.

Most hardware companies think in terms of addition: more sensors, more GPU power, higher resolution, more features. But in the AI-native era, subtraction might be the right move.

Even Realities is the best example I've seen so far.

It's a smart glasses company out of Shenzhen. Their design philosophy can be summed up in one quote from founder Will Wang:

"Many people wear glasses not just for fashion, but because we need them. You might only need a smart feature for 10 or 20 percent of the day. The rest of the time, they're just normal eyewear. So that foundation has to hold no matter what features we're building."

What did they remove?

  • No camera. So the glasses can weigh under 40g, nearly indistinguishable from regular eyewear. Will Wang has explicitly said that putting cameras on glasses before policies and infrastructure are ready is irresponsible.
  • No speakers. Users prefer their own earbuds.
  • No color display. Just a green monochrome micro-LED, but good enough, because all you need is text information surfaced in your field of view.
  • Three-day battery life. Because they cut out all the power-hungry components.

What can they do?

When you look straight ahead, you see nothing. They're just glasses. But when you tilt your head slightly upward, the HUD activates: calendar, messages, navigation, real-time translation, all in your field of view. This is not push notification logic. It's pull on demand: you reach for information, instead of information reaching for you.
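
A sketch of what that gating might look like, with an assumed activation angle standing in for whatever Even Realities actually uses:

```python
# Pull-on-demand HUD gating, sketched. The pitch source and the 12-degree
# threshold are assumptions, not Even Realities' implementation.
LOOK_UP_DEGREES = 12.0

def hud_frame(pitch_degrees: float, pending_items: list[str]) -> list[str]:
    if pitch_degrees < LOOK_UP_DEGREES:
        return []                  # looking straight ahead: they're just glasses
    return pending_items[:3]       # looking up: the wearer is pulling information

print(hud_frame(2.0,  ["Meeting at 15:00"]))   # [] -- nothing is ever pushed
print(hud_frame(14.5, ["Meeting at 15:00"]))   # ['Meeting at 15:00']
```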

The company name "Even Realities" encodes this philosophy:

"We try to find the balance point for physical and digital realities, and make both realities even. To help you keep your eyes still on the real world while still being able to absorb the information digitally."

There's also a companion smart ring, the Even R1. On its own, a health-tracking ring is nothing new. But paired with the glasses, it becomes a controller: pinch to flip pages, gestures to trigger commands, discreet control during presentations. The ring offers higher precision than wrist-based devices, supports more recognizable gestures, is more discreet, and doesn't depend on a screen.

90% of G2 users chose to purchase the ring bundle. What does that tell you? People don't need more features. They need a complete, frictionless interaction loop.

This points to a bigger insight: when AI is capable enough, hardware competitiveness is defined not by what you add, but by what you subtract.

Google Glass didn't fail because the technology was lacking. It failed because it ignored the basic premise that it first needs to be a pair of glasses you actually want to wear. Humane raised $230M to build a $699 AI Pin that couldn't reliably set a timer. It tried to skip too many steps, jumping straight to phone replacement instead of life augmentation.

Even Realities works in reverse: start from fashion and comfort constraints, then engineer backward to technology. Make sure people want to wear it first, then talk about smart features. Fashion-first, technology-second. This might be the only viable path for wearable computing to truly land.

Search Is Spilling from Software into the Physical World

Connect the dots above, and a clear trend line emerges:

Search is no longer just a web input box. It's becoming an information retrieval capability that spans software and the physical world.

In software, search is evolving from simple one-step RAG to Agentic Search. Agents decide when to search, what to search for, whether the results are sufficient, and whether to reformulate the query and try again. This isn't workflow-driven search. It's model-driven, dynamic search.
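
A compact sketch of that loop, with trivial stand-ins where a real system would make model calls to judge sufficiency and rewrite the query:

```python
# Agentic search, sketched: the loop decides whether results are sufficient
# and reformulates if not. The corpus, sufficiency check, and rewrite rule
# are trivial stand-ins for model calls.
def search(query: str) -> list[str]:
    corpus = {"return policy": ["30-day returns on unworn items"]}
    return corpus.get(query, [])

def is_sufficient(results: list[str]) -> bool:
    return len(results) > 0            # in practice a model judgment, not a length check

def reformulate(query: str) -> str:
    return query.replace("refunds", "return policy")   # stand-in for a model rewrite

def agentic_search(query: str, max_rounds: int = 3) -> list[str]:
    for _ in range(max_rounds):        # model-driven loop, not a fixed workflow
        results = search(query)
        if is_sufficient(results):
            return results
        query = reformulate(query)     # insufficient: try again with a better query
    return []

print(agentic_search("refunds"))       # ['30-day returns on unworn items']
```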

But the more interesting side is this: that search capability is extending into the physical world.

When a salesperson wearing lightweight glasses interacts with an AI system through voice and gets real-time product information for a customer, that's search manifesting in the real world. When a warehouse worker sees step-by-step guidance through glasses, that's search. When a chef asks "what's the SOP for this dish" and the information appears in their field of view, that's still search.

Ben Thompson wrote in Stratechery that we're entering the third phase of LLMs, the agentic phase. AI doesn't just answer questions. It continuously and autonomously executes tasks. Put that observation next to the hardware trend:

  • Backend: Agentic Search performing multi-step retrieval and autonomous decisions
  • Surface: Lightweight glasses + voice + ring as the information display layer
  • Scenarios: Retail, warehousing, manufacturing, field service

The autonomous search of the software world and the ambient interface of the physical world are converging. Not many people are talking about this systematically yet, but I think by the second half of this year, as more glasses products ship, this topic will heat up fast.

Who Lands First?

One last pragmatic question: where does this land first?

My bet: enterprise scenarios will outrun consumer scenarios by a wide margin.

The reasons are straightforward:

  • Clear tasks. Warehouse operations have SOPs. Sales processes have standards. Production lines have steps. AI doesn't need to guess what you want.
  • Measurable ROI. If a salesperson closes 10% more deals with AI assistance, you can calculate that number directly (see the back-of-envelope sketch after this list).
  • Real need for hands-free. Not nice to have. Must have.
  • Enterprises are more willing to pay. A $599 pair of glasses? Consumers hesitate. Enterprise procurement? As long as ROI checks out, it's a non-issue.
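
The back-of-envelope version of the ROI bullet, with made-up numbers:

```python
# Hypothetical numbers; only the 10% uplift and the $599 price come from the post.
deals_per_month = 40        # assumed baseline closes per salesperson
margin_per_deal = 150.0     # assumed gross margin per deal, in dollars
uplift = 0.10               # the 10% assist cited above
device_cost = 599.0         # the glasses price cited above

extra_margin = deals_per_month * margin_per_deal * uplift   # $600/month
payback_months = device_cost / extra_margin                 # ~1 month
print(f"${extra_margin:.0f}/month uplift, payback in {payback_months:.1f} months")
```

Under those assumptions the glasses pay for themselves in about a month, which is exactly the kind of arithmetic an enterprise buyer can sign off on.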

Look at the consumer side: how many people genuinely need smart glasses today? Navigation? Your phone handles it. Translation? Most people don't need it daily. Calendar reminders? A watch is sufficient. The consumer killer use case hasn't arrived.

But enterprise scenarios are different. When your hands are busy, your eyes need information, and your mouth can issue commands, glasses + voice + AI is the obvious solution.

Meta Ray-Ban's 7M+ units sold prove one thing: if you don't ask people to change their habits and instead just make their existing accessory a bit smarter, people will buy it. But to truly establish product-market fit, enterprise is the faster path.

Final Thought

We are in the middle of an interesting inversion.

For the past decade, the tech industry's logic was about capturing attention: more screen time is better, more engagement is better. But in the AI-native era, that logic is flipping: the best AI experience is one where you don't feel AI's presence at all. The metric is no longer how long you spend interacting with it, but how much time it saves you.

Some call this "Silence as luxury": quietness itself becomes a premium. Premium AI is invisible and silent. Ad-supported AI is chatty and interruptive.

I think this framework applies equally to human-computer interaction:

The real next-generation interface is not flashier AR, bigger screens, or more notifications. It's a pair of glasses that don't look smart, a phone you don't need to pull out, a system that starts agentic search in the background the moment you speak, and a display that surfaces answers in your field of view the instant you glance up.

The best interface is the one you don't notice.

The future of human-AI coexistence is quiet.