When to refactor and when to rewrite as your product scales

There’s a moment, usually nine to eighteen months into a successful product, when the team starts having the same conversation in slightly different forms. Things are slow. New features take longer than they should. Bugs surface in unexpected places. Someone says, half-jokingly, that we should just rewrite it. A week later, somebody says it again, less jokingly.

This is the rewrite conversation, and most teams handle it badly. They either commit to a rewrite they can’t finish, or they refuse to rewrite anything and end up with a codebase that ossifies. The right answer is almost always more nuanced and depends on what specifically is hurting.

The two kinds of code rot

Codebases get hard to work in for two distinct reasons.

Local rot is when individual pieces of the system have grown messy. A function that should be 50 lines is 400. A class that started with a single responsibility has accreted six. The shape of a feature is hard to discern because its code is interleaved with three other features. Local rot is mostly the consequence of deferred cleanup: every PR was reasonable at the time, but the cumulative effect is unreadable.

Architectural rot is when the system’s shape no longer matches the problem. The data model assumes one tenant; you now have multi-tenancy. The synchronous request/response pattern made sense when traffic was small; you now need queues and async work. The monolith was the right call at six engineers; at thirty, it’s blocking deploys.

Local rot can be fixed with refactors. Architectural rot usually cannot.

The rewrite-vs-refactor question, framed correctly, is: which kind of rot are we dealing with?

When refactoring is right

Refactoring works when:

The current shape of the system is roughly correct.
The pain is concentrated in specific files, modules, or boundaries.
The team can identify, before starting, what good would look like.
The work can be done incrementally without freezing the rest of the codebase.

A refactor should be scoped, time-boxed, and continuous. Pick a module that’s a known sore spot. Spend a week. Make it noticeably better. Move on. Done over the course of a year, this kind of disciplined refactoring is enough to keep a healthy codebase healthy.

The failure mode of refactoring is when teams use it for problems it can’t solve. If the data model is wrong, no amount of refactoring the data access layer will fix it. You’ll just spend two months making the wrong shape more elegant. The team will be frustrated, and at the end the system will work approximately the same as it did before.

When a rewrite is right

A rewrite is the right call when:

The fundamental shape of the system no longer matches the problem.
The team has lost confidence in their ability to predict how a change will behave.
The performance characteristics needed for the next phase of the product are structurally not achievable in the current architecture.
A rewrite can be staged — you can run old and new in parallel and migrate traffic.

The last condition is the one teams underestimate. A “big bang” rewrite, where the team disappears for six months, comes back with v2, and switches everything over in a weekend, fails about 80% of the time in our experience. The market keeps moving. Requirements change underneath the rewrite. The rewrite scope keeps growing because nobody wants to ship something less capable than what’s being replaced. Eighteen months later, the rewrite is 90% done, and the original system has accumulated six features the rewrite doesn’t have.
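One way to make "run old and new in parallel" concrete is a parallel run: every request is still answered by the old code path, while the new path is called on the side and any disagreement is recorded. The sketch below is a minimal illustration under our own assumptions; `ParallelRun`, `old_price`, and `new_price` are hypothetical names, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ParallelRun:
    """Serve from the old path while checking the new path against it."""

    old: Callable[[Any], Any]  # current production implementation
    new: Callable[[Any], Any]  # candidate replacement
    mismatches: List[Tuple[Any, Any, Any]] = field(default_factory=list)

    def __call__(self, request: Any) -> Any:
        expected = self.old(request)
        try:
            actual = self.new(request)
        except Exception as exc:  # a crash in the new path must not reach users
            actual = exc
        if isinstance(actual, Exception) or actual != expected:
            # Record (request, old result, new result) for later review.
            self.mismatches.append((request, expected, actual))
        return expected  # the old system stays the source of truth


# Hypothetical handlers: the new one mishandles negative inputs.
def old_price(cents: int) -> int:
    return max(cents, 0)


def new_price(cents: int) -> int:
    return cents


priced = ParallelRun(old=old_price, new=new_price)
results = [priced(c) for c in (100, 0, -5)]
```

Users only ever see the old system's answers here; the mismatch list is what tells you whether the new path is ready to take real traffic.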

The strangler pattern, used seriously

The pattern that actually works for large rewrites is the strangler. Build a new system alongside the old one. Route a subset of traffic through the new system. Expand the new system’s coverage gradually. The old system shrinks; the new system grows. There is no big-bang switch.
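As a rough sketch of "route a subset of traffic through the new system", the router below sends a configurable percentage of users, per feature, to the new implementation and everyone else to the old one. The class name, the handler signature, and the choice to bucket by a hash of the user ID are all assumptions made for illustration; a real setup might do this at a proxy or load balancer instead.

```python
import hashlib
from typing import Callable, Dict

Handler = Callable[[str, str], str]  # (feature, user_id) -> response


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so routing is sticky."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


class StranglerRouter:
    def __init__(self, old: Handler, new: Handler) -> None:
        self.old = old
        self.new = new
        # feature name -> percent of users served by the new system (0..100)
        self.rollout: Dict[str, int] = {}

    def set_rollout(self, feature: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.rollout[feature] = percent

    def handle(self, feature: str, user_id: str) -> str:
        # Features with no rollout entry stay entirely on the old system.
        percent = self.rollout.get(feature, 0)
        target = self.new if bucket(user_id) < percent else self.old
        return target(feature, user_id)


router = StranglerRouter(
    old=lambda feature, user: f"old:{feature}",
    new=lambda feature, user: f"new:{feature}",
)
router.set_rollout("invoices", 100)  # fully migrated feature
invoice_response = router.handle("invoices", "user-1")
search_response = router.handle("search", "user-1")  # not migrated yet
```

Hashing the user ID rather than picking randomly per request keeps each user on one side of the split, which makes bug reports easier to attribute to the old or the new system.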

The discipline that makes this work:

The new system has a clean, small starting scope. One specific user-visible feature, end-to-end, that you can fully migrate.
You do not start the second feature until the first one is fully migrated and the old code paths for it are deleted.
You’re willing to delete code from the new system if the migration teaches you something. Both systems should be evolving.

The teams that fail at strangler patterns usually fail at the second discipline. They start the second feature before the first is done, then the third, and a year later they have two systems both half-implementing the product. Don’t do this. Migration is a one-feature-at-a-time discipline.
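The one-feature-at-a-time rule is easy to state and easy to let slip, so some teams may find it useful to encode it somewhere mechanical, such as a check in CI or in a planning script. The sketch below is one hypothetical way to do that; `MigrationPlan` and the stage names are our own invention for illustration.

```python
from enum import Enum
from typing import Dict, Optional


class Stage(Enum):
    IN_PROGRESS = "in_progress"            # traffic is still moving to the new system
    MIGRATED = "migrated"                  # all traffic moved, old code still present
    OLD_CODE_DELETED = "old_code_deleted"  # the only state that counts as done


class MigrationPlan:
    def __init__(self) -> None:
        self.stages: Dict[str, Stage] = {}

    def unfinished(self) -> Optional[str]:
        """Return the first feature whose old code paths still exist, if any."""
        for feature, stage in self.stages.items():
            if stage is not Stage.OLD_CODE_DELETED:
                return feature
        return None

    def start(self, feature: str) -> None:
        blocker = self.unfinished()
        if blocker is not None:
            raise RuntimeError(
                f"finish {blocker!r} and delete its old code before starting {feature!r}"
            )
        self.stages[feature] = Stage.IN_PROGRESS

    def advance(self, feature: str, stage: Stage) -> None:
        if feature not in self.stages:
            raise KeyError(feature)
        self.stages[feature] = stage


plan = MigrationPlan()
plan.start("invoices")
plan.advance("invoices", Stage.MIGRATED)
# plan.start("search") would raise here: the old invoices code is not deleted yet.
plan.advance("invoices", Stage.OLD_CODE_DELETED)
plan.start("search")
```

Note that "migrated" alone does not unblock the next feature in this sketch; only deleting the old code paths does.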

What we look for to advise

When clients ask us whether they should rewrite, we look at three signals.

The bug surface. Where are bugs coming from? If they’re concentrated in specific modules, refactoring those modules is high-leverage. If they’re distributed across the codebase and tend to involve unexpected interactions between components, that’s an architecture problem.

The change cost trend. How long does a typical PR take to merge today versus six months ago? If it’s drifting up linearly, you have local rot. If it’s drifting up super-linearly, with each new feature taking something like 1.3× as long as the last, you probably have architectural rot.

The on-call burden. What kinds of incidents are happening? Incidents that surprise the team (“we didn’t know that could happen”) point to architectural problems. Incidents that the team predicted but couldn’t prevent in time point to operational problems, and a rewrite won’t fix those.
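The first two signals can be approximated from data most teams already have. The sketch below is a rough illustration, not a calibrated tool: it assumes you can export a list of module names (one per bug) from your tracker and a series of typical change costs (for example, median hours to merge per month). The function names, the `top_n` default, and the 10% tolerance are all assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean
from typing import Sequence


def bug_concentration(bug_modules: Sequence[str], top_n: int = 3) -> float:
    """Share of bugs that fall in the top_n most bug-prone modules (0.0 to 1.0)."""
    if not bug_modules:
        return 0.0
    counts = Counter(bug_modules)
    in_top = sum(count for _, count in counts.most_common(top_n))
    return in_top / len(bug_modules)


def growth_shape(costs: Sequence[float], tolerance: float = 0.1) -> str:
    """Classify a change-cost series by comparing early steps to late steps.

    Roughly equal steps suggest linear drift; clearly larger late steps
    suggest super-linear drift.
    """
    if len(costs) < 5:
        raise ValueError("need at least five measurements")
    steps = [later - earlier for earlier, later in zip(costs, costs[1:])]
    half = len(steps) // 2
    early, late = mean(steps[:half]), mean(steps[half:])
    if early <= 0 and late <= 0:
        return "flat or improving"
    if late > early * (1 + tolerance):
        return "super-linear"
    return "linear"


# Hypothetical data: 6 of 10 bugs in one module, and two cost series.
bugs = ["billing"] * 6 + ["auth"] * 2 + ["search", "ui"]
top_module_share = bug_concentration(bugs, top_n=1)

steady_drift = growth_shape([2, 3, 4, 5, 6])            # +1 hour each period
compounding = growth_shape([2.0, 2.6, 3.38, 4.39, 5.71])  # about 1.3x each period
```

A high concentration figure points toward refactoring the few modules involved; a super-linear cost series is the kind of evidence that would justify looking harder at the architecture. Neither number is a verdict on its own.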

The middle path: targeted re-architecture

There’s a third option teams underuse: rebuild a single component while leaving the rest of the system alone.

The data layer is a common candidate. The team’s original ORM choice or schema design is no longer carrying the load, but the rest of the application is fine. Replacing the data layer is a real project — usually three to six months for a serious system — but it’s a much smaller commitment than a full rewrite. Same goes for swapping a queue, replacing a search backend, or moving from a sync to an async processing model.

Targeted re-architectures are how mature teams keep large codebases livable. They’re not glamorous. They don’t make for good blog posts. They’re how the work actually gets done.

A practical sequence

If we were advising a team that’s started having the rewrite conversation, here’s the sequence we’d suggest:

Spend a week diagnosing. What specifically is slow / fragile / hard to change? Be ruthless about specificity. “The codebase is messy” is not a diagnosis.
Identify which problems are local and which are architectural. Most teams have both, but in different proportions.
For the local problems, schedule refactoring as a continuous practice. Every sprint, an engineer takes a known sore spot and improves it. No more, no less.
For the architectural problems, identify the smallest unit you could rebuild as a strangler. Build that. See what you learn.
Do not commit to a rewrite of the entire system upfront. Commit to the first migration, and reassess after.

The pattern that doesn’t work is the one most teams reach for: stop everything, declare a rewrite, work in parallel for a year. The pattern that works is slower, less dramatic, and produces better software.

If you’re at this inflection point and want a sober second opinion before you commit to a rewrite, we’d be glad to dig in.
