How I Cut Cloud Infrastructure Costs by 75% at a Funded FinTech Startup

At Bridgewise, a funded FinTech startup, I redesigned infrastructure and cut cloud costs by 75% — saving more than $150,000 per year. Here is exactly what I found, what I changed, and what every engineer should audit before their next bill.

Quick answer

At Bridgewise, a funded FinTech startup where I worked as a senior software engineer (2020-2022), I redesigned cloud infrastructure and reduced costs by 75%, saving more than $150,000 per year. The main sources of waste were over-provisioned compute instances running at low utilisation, environments (dev, staging, QA) that ran 24/7 when they were only needed during business hours, unmanaged data transfer costs between regions and services, and snapshots and storage volumes that had accumulated over years without cleanup. The fixes were systematic rather than clever: right-size first, then schedule non-production environments, then eliminate waste, then change architecture. Most large cloud cost reductions come from those four steps in that order.

The bill that started the audit

When I joined Bridgewise in 2020, the company had been growing fast. Series A had closed. The engineering team was shipping. Nobody had had time to look at the cloud bill critically.

That is the normal story at funded startups. You over-provision early because instances are cheap relative to engineering time, and then you stop revisiting the choices once the system is running. The bill grows slowly enough that it never triggers an alarm, until someone looks at it properly and realises it has compounded for two years.

We looked at it properly. What we found was not one big mistake. It was a hundred small ones that had been accumulating quietly.

Step one: find where the money is actually going

The first thing I did was pull the cost explorer report sorted by service, then by individual resource. Most cloud cost audits fail because people look at aggregate numbers instead of line items. 'Compute costs $40K a month' tells you nothing useful. 'This specific instance type in eu-west-1 costs $8K a month at 12% average CPU utilisation' tells you exactly what to do.

I made a spreadsheet. Every resource that appeared in the top thirty by monthly spend got a row. For each one: what is it, what is it doing, what is its average utilisation, and when was the last time anyone looked at whether its size was still right.

The answer to the last question, for most of them, was: at launch.
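A minimal sketch of that spreadsheet logic in Python, for illustration only. It is not the tooling used at Bridgewise: the `LineItem` fields, the resource names, and every figure are hypothetical, and real rows would come from a billing export rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class LineItem:
    resource_id: str         # hypothetical identifier from a billing export
    service: str
    monthly_cost_usd: float
    avg_cpu_pct: float       # 30-day average utilisation


def top_spenders(items, limit=30):
    """Return the `limit` most expensive resources, highest spend first."""
    return sorted(items, key=lambda i: i.monthly_cost_usd, reverse=True)[:limit]


# Made-up sample rows for illustration only.
items = [
    LineItem("api-server-1", "compute", 8000.0, 12.0),
    LineItem("reporting-db", "database", 3500.0, 40.0),
    LineItem("batch-worker-3", "compute", 5200.0, 18.0),
    LineItem("old-snapshots", "storage", 900.0, 0.0),
]

for item in top_spenders(items, limit=3):
    print(f"{item.resource_id:<16} ${item.monthly_cost_usd:>8,.0f}/mo  {item.avg_cpu_pct:.0f}% CPU")
```

Ranking by individual resource rather than by service is the point: the expensive, underused rows end up next to each other at the top of the list.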

The three categories where the money was hiding

After the audit, the waste fell into three buckets. Nearly every cloud cost problem I have seen since then follows the same pattern.

The first bucket was compute. We had instances provisioned at sizes that made sense when we were guessing at load during initial deployment. Two years later, with real utilisation data, most of them were running at 15-25% average CPU. Right-sizing them to the next smaller instance type — or in some cases two sizes down — recovered a significant share of the total bill immediately. No code changes. No architecture work. Just adjusting numbers in a config.
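To make "one size down or two" concrete, here is a rough sketch of the arithmetic. It assumes each step down within an instance family halves both capacity and price, which is common but not universal, so check real pricing. The 70% ceiling, the function names, and the sample figures are all hypothetical choices for illustration.

```python
def projected_cpu_pct(current_pct, steps_down):
    """CPU % after moving `steps_down` sizes smaller, assuming each step halves capacity."""
    return current_pct * (2 ** steps_down)


def max_safe_steps(peak_cpu_pct, ceiling_pct=70.0, max_steps=2):
    """Largest number of size steps (0..max_steps) that keeps projected peak CPU under the ceiling."""
    steps = 0
    while steps < max_steps and projected_cpu_pct(peak_cpu_pct, steps + 1) <= ceiling_pct:
        steps += 1
    return steps


def monthly_saving(monthly_cost_usd, steps_down):
    """Saving if price halves with each step down (an assumption, not a pricing guarantee)."""
    return monthly_cost_usd * (1 - 0.5 ** steps_down)


# Hypothetical instance: $8,000/month, peaking at 30% CPU.
steps = max_safe_steps(peak_cpu_pct=30.0)
print(steps, projected_cpu_pct(30.0, steps), monthly_saving(8000.0, steps))
# prints: 1 60.0 4000.0
```

The sketch sizes on peak CPU rather than the average so the smaller instance keeps headroom. Memory and I/O need the same check before any change is applied.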

The second bucket was environments. We had development, staging, and QA environments that ran continuously, twenty-four hours a day, seven days a week. They were used during business hours in one timezone. A scheduling rule that stopped them at 7pm and restarted them at 8am the next morning cost almost nothing to implement and cut the compute cost of those environments by roughly 60%.
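The saving from a schedule like that is simple arithmetic, sketched below. The text does not say whether weekends were included, so the example computes both cases. The function names are made up for illustration, and only compute hours are counted.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_running_hours(start_hour, stop_hour, days_per_week):
    """Hours per week an environment runs if it is up from start_hour to stop_hour on each active day."""
    return (stop_hour - start_hour) * days_per_week


def saving_fraction(start_hour, stop_hour, days_per_week):
    """Fraction of always-on compute hours avoided by the schedule."""
    running = weekly_running_hours(start_hour, stop_hour, days_per_week)
    return 1 - running / HOURS_PER_WEEK


# Up 8am to 7pm every day (nights off only).
print(f"{saving_fraction(8, 19, days_per_week=7):.0%}")  # prints 54%
# Up 8am to 7pm on weekdays only (nights and weekends off).
print(f"{saving_fraction(8, 19, days_per_week=5):.0%}")  # prints 67%
```

The two cases bracket the "roughly 60%" figure: stopping overnight alone removes a little over half of the running hours, and stopping on weekends as well removes about two thirds.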

The third bucket was data transfer and storage. This is the one that surprises people the most. Cloud providers charge for data moving between availability zones, between regions, and out to the internet. When services talk to each other across AZs without a reason, or when logs are being shipped to a different region by default, those charges compound. We had snapshots and storage volumes that had been accumulating for the life of the company without a retention policy. Cleaning these up and adding simple lifecycle rules stopped the bleeding.
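A retention policy boils down to a date comparison. The sketch below only lists snapshots that have outlived a retention window and deletes nothing. The tuple shape, the snapshot IDs, and the 90-day window are hypothetical, and in practice a provider's built-in lifecycle rules would usually do this job.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots, retention_days, now=None):
    """Return snapshots created more than `retention_days` ago.

    `snapshots` is a list of (snapshot_id, created_at) tuples with
    timezone-aware datetimes; this shape is an assumption for the example.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [(sid, created) for sid, created in snapshots if created < cutoff]


# Made-up inventory, evaluated against a fixed date so the result is repeatable.
now = datetime(2022, 1, 1, tzinfo=timezone.utc)
snapshots = [
    ("snap-a", datetime(2020, 3, 1, tzinfo=timezone.utc)),
    ("snap-b", datetime(2021, 12, 20, tzinfo=timezone.utc)),
]
for sid, created in expired_snapshots(snapshots, retention_days=90, now=now):
    print(f"{sid} created {created.date()} is past retention")
```

Producing a review list before deleting anything keeps the cleanup reversible, in line with the rest of the approach.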

What we did not do

We did not immediately reach for architecture changes. That is the mistake most engineering teams make when they decide to take costs seriously: they jump to 'we should move this to serverless' or 'we should rewrite this with a cheaper database engine' before they have exhausted the simple fixes.

Architecture changes are expensive in engineering time and they introduce risk. Right-sizing an instance takes ten minutes and is fully reversible. Migrating a database takes months and is not. The expected value calculation is obvious: do the cheap, reversible things first. Save the architecture work for what remains after you have already recovered most of the savings.

In our case, after right-sizing, environment scheduling, and storage cleanup, we had already achieved most of the 75% reduction. The remaining work was smaller structural changes: consolidating some services that had been deployed redundantly, and adding tags and cost allocation groups so future drift would be visible before it compounded.

The governance layer that keeps savings from evaporating

The part most engineers skip is the governance work. You can cut costs by 75% and watch them drift back to where they were within a year if you do not put structure around the problem.

What we put in place was simple: every resource required a cost-allocation tag. Any untagged resource in a weekly report became a mandatory action item. Environment schedules were enforced by automation, not by people remembering to stop things. Storage lifecycle policies ran automatically.
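The weekly untagged-resource report can be approximated with a check like the one below. This is a sketch under assumptions, not the automation we ran: the `cost-center` tag key, the dictionary shape, and the resource IDs are hypothetical, and a real inventory would come from your provider's resource listing.

```python
REQUIRED_TAG = "cost-center"  # hypothetical tag key; use whatever your team standardises on


def untagged(resources, required_tag=REQUIRED_TAG):
    """Return IDs of resources missing the required tag or carrying an empty value."""
    return [
        r["id"]
        for r in resources
        if not r.get("tags", {}).get(required_tag, "").strip()
    ]


# Made-up inventory for illustration.
resources = [
    {"id": "vm-001", "tags": {"cost-center": "payments"}},
    {"id": "vm-002", "tags": {}},
    {"id": "vol-003", "tags": {"cost-center": " "}},
    {"id": "db-004"},
]
print(untagged(resources))
# prints: ['vm-002', 'vol-003', 'db-004']
```

Treating a blank value the same as a missing tag closes the easiest loophole: a resource tagged only to make the report go quiet.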

This is not complicated. But it has to be deliberate. Cloud cost governance does not happen by default — cloud providers charge you for everything until you explicitly tell them not to, and they make it easy to provision more and hard to notice what you have already forgotten about.

What this looks like in practice for most startups

If you are at a startup and you have not done a cloud cost audit in the last twelve months, there is almost certainly significant waste in the bill. The pattern is consistent: funded companies provision for growth, grow, and never go back to revisit the provisioning decisions that made sense early on.

Start with compute utilisation. Pull thirty days of CPU and memory metrics for every instance you are paying for. Anything averaging under 20% CPU is a right-sizing candidate. Then look at your environments: how many are running right now, and how many of them are actually being used? Then look at your storage: what is the oldest snapshot you are paying for, and do you actually need it?
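The first of those checks is easy to automate. Here is a small sketch that flags instances averaging under 20% CPU. It assumes you have already exported daily average CPU percentages per instance from your monitoring system; the input shape and the instance names are hypothetical.

```python
from statistics import mean

CPU_THRESHOLD_PCT = 20.0  # the threshold named in the text


def right_sizing_candidates(daily_cpu_by_instance, threshold=CPU_THRESHOLD_PCT):
    """Return {instance_id: average CPU %} for instances averaging under the threshold.

    `daily_cpu_by_instance` maps an instance ID to a list of daily average
    CPU percentages (ideally thirty of them). Instances with no samples are skipped.
    """
    averages = {
        instance_id: mean(samples)
        for instance_id, samples in daily_cpu_by_instance.items()
        if samples
    }
    return {iid: avg for iid, avg in averages.items() if avg < threshold}


# Hypothetical metrics; real ones would come from your monitoring system.
metrics = {
    "api-server-1": [12.0] * 30,
    "batch-worker-3": [55.0] * 30,
    "idle-box": [],
}
print(right_sizing_candidates(metrics))
# prints: {'api-server-1': 12.0}
```

An instance with no samples is skipped rather than flagged, because missing metrics usually mean a monitoring gap, not an idle machine. The same filter can be run on memory utilisation.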

Those three questions will show you where the money is. The fixes are usually straightforward once you know where to look.

The 75% we recovered at Bridgewise did not come from a clever architectural insight. It came from treating the cloud bill as a first-class engineering concern and auditing it the same way we would have audited slow database queries or high error rates: systematically, with real data, until we understood what was actually happening.

Key Takeaways

Most cloud waste is not one big mistake — it is many small provisioning decisions that were never revisited once the system was running.

Start every audit with cost explorer sorted by service, then drill down to individual resources. Line-item data tells you what to do; aggregate data does not.

Right-sizing over-provisioned compute is typically the single largest lever and costs almost nothing to implement — no code changes, just config adjustments.

Non-production environments running 24/7 are the second biggest source of waste at most startups. Schedule them to stop outside working hours.

Data transfer and storage accumulate silently. Add lifecycle policies and retention rules to stop the compounding.

Do the cheap, reversible fixes first. Architecture changes are expensive and risky — save them for after right-sizing, scheduling, and cleanup are done.

Governance is what keeps savings from evaporating. Tag every resource, automate enforcement, and make cloud spend a visible engineering metric.

Frequently asked questions

What is cloud cost engineering?

Cloud cost engineering is the discipline of designing and operating cloud infrastructure so that you pay for what you actually use, not what you provisioned at peak capacity two years ago. It covers right-sizing compute, eliminating idle resources, choosing the right purchase models (on-demand vs reserved vs spot), managing data transfer, and setting up governance so waste does not accumulate invisibly.

Where does most cloud waste come from?

In my experience: over-provisioned compute (instances sized for a peak that never came), environments that run 24/7 when they are only needed during working hours, unmanaged data transfer between availability zones and regions, and storage that accumulates silently over time (old snapshots, unused volumes, orphaned backups). The first three are usually responsible for 60-80% of avoidable spend.

What is right-sizing and why does it matter?

Right-sizing means matching the instance type and size to the actual CPU, memory, and I/O load of the workload running on it. Startups often over-provision early for safety and never revisit those choices as the system matures and load patterns become known. Dropping a 16-core instance to 4 cores when peak CPU is 15% is free money: the workload still peaks at only about 60% of the smaller instance, at roughly one quarter of the cost.

How do you start a cloud cost audit?

Start with your cloud provider's cost explorer, sorted by service and by resource. Find the top five line items by spend. For each one, pull the utilisation metrics for the past 30 days. If average CPU or memory utilisation is under 20%, that resource is a right-sizing candidate. Then check how many environments you have running and whether any of them could be stopped outside working hours. Those two steps usually surface 50-70% of the savings available.

When should you invest in architecture changes vs simpler fixes?

Architecture changes (switching database engines, moving to serverless, re-platforming a service) are expensive and risky. Do them last, not first. Right-sizing, environment scheduling, and eliminating unused resources cost almost nothing to implement and typically recover 50-70% of the available savings. Save architecture changes for the cases where simpler fixes have already been done and there is still a structural problem.

Written by Melkon Hovhannisyan
