AI Solutions · Philippines

Two Years of AI in Production: What Actually Works

June 20, 2024 · 4 min read

It has been about two years since ChatGPT-shaped technology became something we could legitimately put inside production software. The studio has now shipped AI features into more than a dozen products. Some have aged well. Some have not. Here is what we have learned, written for the founders and product leaders who are still figuring out which AI bets to make in their business.

What is genuinely working

Three patterns have proven durable across every project where we shipped them.

Retrieval-augmented generation, when the data is clean. A model answering questions over a business's own documents has become a near-default capability inside knowledge-heavy products. Legal teams, healthcare practices, and internal support functions have seen real time savings. The pattern works when the underlying documents are well-organized. It breaks when the documents are a chaotic archive nobody has curated.
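For readers who want the mechanics: the core of the pattern is retrieving the most relevant passages and putting them in front of the model. Here is a minimal sketch of the retrieval step only; a real system would use an embedding model and a vector store, but a toy bag-of-words similarity stands in so the idea is self-contained.

```python
# Sketch of the retrieval step in a RAG pipeline. The "embedding" here
# is a toy word-count vector; production systems call an embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the question; return the top k.
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Office hours are Monday to Friday, 9am to 6pm.",
    "Shipping takes 3 to 5 business days within the Philippines.",
]
top = retrieve("how long do refunds take", docs, k=1)
# The retrieved passages are then placed into the model's prompt as context.
```

This is also why clean documents matter: retrieval can only surface what the corpus actually contains, clearly stated.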

Document extraction at scale. Pulling structured data from invoices, contracts, medical records, and unstructured PDFs is one of the highest-ROI AI applications we have shipped. Accuracy is high enough that, paired with a lightweight human-review layer, businesses can process several times more volume than before.
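The shape of these pipelines matters more than any single model. A hedged sketch, with a regex over plain text standing in for the model-driven extraction step: extract fields, attach a review flag, and route uncertain records to a human.

```python
# Sketch of extraction plus a human-review layer. A production pipeline
# would run a model over PDFs; a regex over plain text stands in here.
import re

def extract_invoice(text: str) -> dict:
    number = re.search(r"Invoice\s*#?\s*(\w+)", text)
    total = re.search(r"Total:?\s*\$?([\d,]+\.\d{2})", text)
    fields = {
        "invoice_number": number.group(1) if number else None,
        "total": total.group(1) if total else None,
    }
    # Anything the extractor could not recover goes to a reviewer
    # instead of silently entering the system of record.
    fields["needs_review"] = any(
        v is None for v in (fields["invoice_number"], fields["total"])
    )
    return fields

record = extract_invoice("Invoice #A1023  ...  Total: $4,250.00")
```

The review flag is the part clients underestimate: it is what turns "mostly accurate" into a process you can trust at volume.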

Workflow agents inside existing software. Not chatbots, not standalone tools. AI agents living inside CRMs, project management tools, and operations dashboards, handling specific repetitive tasks. Drafting follow-up emails. Triaging support tickets. Updating records based on incoming data. These succeed because they are scoped, evaluated, and embedded in an existing workflow rather than introduced as a separate product.

What has not aged well

Several patterns we tried in 2023 have quietly disappeared from our active builds.

Standalone AI wrappers. Products that were essentially "ChatGPT for X" with a thin custom UI on top of an API call. The market commoditized them quickly. The ones we built that depended on having the only good UI lost their advantage as the foundation model providers shipped better UIs of their own.

Open-ended conversational interfaces for business processes. The early enthusiasm was for chat as the new universal interface. We built several. Most got replaced by structured UIs with AI underneath because users in business contexts wanted predictable inputs and outputs, not a conversation.

Heavy fine-tuning on private data. Two years ago we fine-tuned domain-specific models for a few clients. Most of those models have been retired and replaced with retrieval-based approaches against newer foundation models. The base models improved faster than our fine-tuned models, and the maintenance cost was higher.

What we tell founders asking for AI in 2024

A few principles guide every scoping conversation now.

Start with the problem. "Where in your business does someone spend two hours doing something repetitive?" produces better answers than "where can we add AI?" The first question leads to features people will use. The second leads to demos that impress investors.

Pick the simplest model that works. GPT-4-class models for tasks that need reasoning. Faster, cheaper models for tasks that need throughput. We do not default to the most expensive option; we evaluate models against the specific job and pick accordingly.

Build evaluation before you build the feature. If you cannot measure whether your AI feature is correct, you should not ship it. We now build a small evaluation dataset for every AI feature before we write a line of integration code. It saves rework and gives us a quality bar we can defend.
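In practice the evaluation dataset is small and unglamorous: a list of inputs with known-correct outputs, and a score. A minimal sketch, where `classify_ticket` is a hypothetical stand-in for whatever the AI feature does (in practice it would wrap a model call):

```python
# Minimal evaluation harness, built before the feature itself.
def classify_ticket(text: str) -> str:
    # Placeholder logic so the harness runs end to end; the real
    # feature would call a model here.
    t = text.lower()
    return "billing" if "invoice" in t or "charge" in t else "general"

# A small labeled dataset: (input, expected output) pairs.
EVAL_SET = [
    ("I was charged twice this month", "billing"),
    ("Where is my invoice for March?", "billing"),
    ("How do I reset my password?", "general"),
]

def accuracy(feature, dataset) -> float:
    correct = sum(1 for text, expected in dataset if feature(text) == expected)
    return correct / len(dataset)

score = accuracy(classify_ticket, EVAL_SET)
```

Once this exists, every prompt change and model swap gets measured against the same bar, which is what makes the quality claim defensible.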

Plan for the model to change. The foundation model that powers your feature today will be replaced. We design integrations so swapping providers is a configuration change, not a rewrite. This is more work upfront but buys freedom later.
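Concretely, this means the product calls one internal function, and which provider sits underneath is a config value. A sketch with illustrative stub providers (the names and functions are hypothetical, not real SDK calls):

```python
# Provider abstraction: product code calls `complete`; the provider
# behind it is configuration. Stub functions stand in for real SDKs.
from typing import Callable

def _provider_a(prompt: str) -> str:
    return f"[provider-a] {prompt}"

def _provider_b(prompt: str) -> str:
    return f"[provider-b] {prompt}"

PROVIDERS: dict[str, Callable[[str], str]] = {
    "provider-a": _provider_a,
    "provider-b": _provider_b,
}

# Swapping providers happens here, not across the codebase.
CONFIG = {"llm_provider": "provider-a"}

def complete(prompt: str) -> str:
    return PROVIDERS[CONFIG["llm_provider"]](prompt)
```

The same indirection also makes A/B testing a new model a one-line change rather than a migration project.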

What this looks like in dollar terms

Most "AI in our product" projects are smaller than founders expect. A retrieval system over a moderate document corpus, with good evaluation and reasonable guardrails, is typically a four-to-eight-week engagement scoped on top of an existing product. Document extraction pipelines are similar.

Custom AI assistants embedded into operational software are larger projects, sized by the workflows they target. We scope each engagement individually; we do not publish flat prices because the same brief means different work for different businesses.

What we no longer do is take ten weeks to ship something that could have been an OpenAI API call with a thoughtful prompt and a basic UI. The honest answer for many "AI feature" requests is now smaller and faster than we would have estimated in 2022.

What is next

We are watching three things closely.

Agentic systems, where a model takes actions inside software rather than just producing text. The reliability is climbing. We will start scoping these into production briefs over the next year.

On-device AI. The point where useful AI capability runs locally on user devices changes the economics of latency and privacy. We expect to ship the first round of these features into mobile apps within the year.

AI evaluation as a discipline. The teams that ship LLM features without observability are going to have a hard time twelve months later. Building this in from the start is going to become a default, the way logging and monitoring already are.

If your business is figuring out where AI legitimately helps, we would rather have an honest conversation about that than sell you an AI feature you do not need.

Start a project →
