Why is Cloud So Expensive?

A frank and honest appraisal based on painful experience

Sep 04, 2024

I recently spoke to a customer who complained about the high cost of infrastructure in "The Cloud" compared to private data centers.

As a Cloud specialist, I found this immediately piqued both my interest and my concern.

It also gave me an attack of indigestion because the last thing I want to hear about is customers fleeing from the cloud!

After all, I've spent years making the case for cloud to customers.

Sadly, I've yet to find a customer that doesn't find cloud expensive and hasn't considered moving back to self-managed data centers.

Many are planning to use, or are already migrating to, traditional hosting providers (e.g. Hetzner, Linode).

I can't say I blame them.

Running servers in Azure and AWS may be more costly than infrastructure you manage yourself - if you do it the wrong way.

That said, if your application costs more to run on cloud than in your own data center then it’s usually a sign you’re "doing it wrong".

How did we get here?

Why would you be experiencing high cloud costs relative to a traditional data center?

Why would you even have to consider moving customers back to the datacenter?

What happened to that glorious dream of cheap, high-tech, on demand cloud?

The short answer, based on my own experience consulting to enterprises and startups, is as follows:

In the rush to the Cloud, companies failed to develop:

a) The right cloud strategy (one that accounts for the financial case over the long term)

b) The right architecture (including enterprise architecture)

c) The right technical configuration for their core cloud applications (including cost benchmarking)

d) The right cloud operating model for I.T in the cloud (making it easy for developers and app users to consume).

That's a whole lot of things!

You'll notice that in the four things I named, skill is an underlying factor.

Exceptional skill and maturity are required to perform all these tasks well.

That's because Cloud is a complex technology and none of its aspects can be successful without experts.

So, it’s partially a skill issue

Most companies do not have all the in-house skills to transform traditional I.T to Cloud, operate there and still make cost-savings while doing so.

Most “enterprise” I.T specialists tasked with bringing their organisation to the cloud are generalists. Many will be specialists in non-cloud areas (e.g. one or more vendor products for storage, virtualization or operating systems).

It's not enough. You need people with extensive experience making cloud work at both the technical and economic level - if you’re concerned about cloud cost.

Enough said about the human aspects of the problem though.

The next issue setting you up for cloud failure is the very justification for using cloud your “strategy team” has developed.

Are you using Cloud for the right thing?

Every few years we must remind ourselves what Cloud should be used for and what original problems drove the need for cloud.

A major problem is that companies don't use cloud for correct strategic and technical reasons.

Some think cloud is just another way of running Linux and MS-Windows servers and perhaps running a few databases.

(Hint, if this is your primary use case for cloud you're likely losing a lot of money as you read this!)

That's definitely not a good use case for cloud!

Cloud is best used to achieve the following capabilities in your application:

Elastic Scalability ("Rapid Scalability").
Automation/”Automatability”.
Global Internet Reach (putting services where your customers are)
Extreme Fault Tolerance and High Availability (Caveats apply)
A reduction in man-power/staff count required to manage large amounts of infrastructure
Access to technologies you cannot otherwise get in your own physical datacenter
Levels of Compliance that may be too expensive to achieve in the datacenter
De-coupling from specific skill sets and technologies in the data center (e.g SAN storage or networking specialist engineers).
Trading CAPEX for OPEX over a set period of time. Here you can avoid risk of paying up-front for expensive I.T infrastructure like SAN, Firewall and Virtualization clusters.
Rapid Prototyping and experimentation for testing solutions (requiring temporary spikes in resource usage)

You must decide, based on a strategic (not tactical) assessment, whether one or more of these features affects your profitability as a business. Avoid taking a narrowly technical view; explicitly validate the business case.

To use a common expression: “Is the Juice worth the Squeeze?”

If it is then read on, dear reader …

To get any of these benefits, your particular application must be intentionally designed to use the advantages of cloud.

You’ve heard the term “Cloud Native”, now read all about it here: https://aws.amazon.com/what-is/cloud-native/

That takes a high level of expertise. Perhaps outsourced but definitely "in-house" as well.

It also takes a high degree of flexibility from management (which is why startups benefit most from Cloud compared to traditional enterprises ... hint, hint).

Cloud allows you to move quickly using automation. Losing this advantage due to restrictive management negates its value.

That is why your company’s management must provide enough autonomy to I.T teams to innovate quickly.

Returning to our skillset theme: Those teams will need a combined expertise in the following areas:

a) Automation - Here, software development and coding skills are key.

b) System design - Focus on Software Architecture skills.

c) Enterprise architecture - Extensive knowledge of both the business itself and the technology needed to support it.

d) Networking - Expertise in both traditional networks and SDN (“Software Defined Networking“)

Perhaps the most important "traditional" I.T specialisation you will need to build a powerful team is that of Networking. Cloud connectivity is a major source of complexity.

A gap in any one of these four skillsets is likely to offset the advantage of the others. Collect them all!

The Scale of the Cloud Cost problem and hidden impact of Scale

Another factor which makes cloud cost hard to control is the size of the company using cloud.

Poor cost performance of Cloud has a different impact depending on company size and this influences how it’s handled in each company:

Individual Users (e.g freelance developers) feel the effects of cloud costs immediately
Small companies (startups and family run manufacturing) notice runaway budgets within one month
Large Enterprises (e.g small retail brand companies) notice runaway cloud costs within one or two business quarters (3 - 6 months)
Fortune 500/Multi National Corporations may not notice excessive cloud costs for 1 to 3 years

The bigger the company, its budget and its organisational structure, the easier it is for cloud costs to disappear into the overall budget.

To elaborate:

The budget available to a company cushions it from losses due to cloud waste. Poor cloud efficiency is hard to notice at first. Efficiency is hard to incentivise until too late and the current CTO or CFO has moved on ...

In contrast:

In smaller companies it should be obvious that cloud costs get noticed sooner. A CEO is likely to begin grumbling about moving back to the data center mid-way through a cloud migration project. Cost management actions get triggered earlier and with more urgency.

Takeaway:

You need to manage cloud costs as if you were an individual user, regardless of your size as an organisation, if you wish to see meaningful savings.
Your size as an organisation works against you in terms of managing cloud costs efficiently. Therefore, budget accountability needs to be brought closer to the people doing the actual cloud engineering.
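
One way to bring that accountability closer to the engineers is to break the bill down by owner. Here is a minimal Python sketch, assuming your billing export can be reduced to line items carrying a cost and a set of tags. The field names (`cost`, `tags`, `owner`) and the function `cost_by_owner` are hypothetical, not any provider's actual export format.

```python
from collections import defaultdict


def cost_by_owner(line_items: list[dict]) -> dict[str, float]:
    """Sum cost per 'owner' tag; anything without an owner lands in 'untagged'."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        # Hypothetical schema: {"cost": 12.5, "tags": {"owner": "team-a"}}
        owner = item.get("tags", {}).get("owner", "untagged")
        totals[owner] += item["cost"]
    return dict(totals)
```

The "untagged" bucket is the interesting one: spend that nobody owns is spend that nobody will optimise.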

Hidden Costs of Operating in the Cloud

The major contributors of cloud costs often become apparent after a move to the cloud. These contributors are often underestimated by CFOs, CEOs and CIOs when planning.

These hidden costs slip under the radar in the beginning because they’re often too complex to size or are assumed to be “manageable“. Examples are: Active Directory license sizing, TLS certificate management, staffing requirements.

This combination of complexity and assumed simplicity leads to wrong estimation of cloud operating costs and many important costs are missed:

Third party tools required to manage cloud. Leads to extra licensing costs.
Unpredictable Performance. Leads to unexpected capacity requirements.
Difficulty of Capacity Planning. Leads to unexpected consumption.
Shifting/Changing Nature of Cloud. Leads to unexpected support costs.
Human Resources (Man-hours). Leads to unexpected workload on staff.

By now it should be clear that Cloud consists of many moving parts, each part having varying and fixed costs, each part trying its best to bankrupt your organisation - if you let it!

In my experience the best response to these problems is good up-front architecture and application design of your cloud environment.

Contrary to the trends of “Agile Architecture” and iterative, experimental design, it’s better to invest a lot of intelligence, expertise and time into selecting the right cloud architecture and system design before you adopt it.

This allows you to place upper and lower bounds on your Cloud cost projections and helps you set profitability expectations.
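
As a rough illustration of what such bounds could look like, here is a small Python sketch. It assumes the architecture fixes a minimum and maximum footprint for each component; the `Component` type, its fields and the unit costs are all hypothetical placeholders for your own design figures.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    min_units: int            # smallest footprint the design allows
    max_units: int            # scaling ceiling imposed by the design
    unit_monthly_cost: float  # hypothetical cost of one unit per month


def cost_bounds(components: list[Component]) -> tuple[float, float]:
    """Return (lower, upper) monthly cost implied by the design limits."""
    lower = sum(c.min_units * c.unit_monthly_cost for c in components)
    upper = sum(c.max_units * c.unit_monthly_cost for c in components)
    return lower, upper
```

If the design has no ceiling for a component, there is no upper bound on its cost either, which is usually the real finding.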

By all means, experiment and iterate: Conduct PoCs, build MVPs and evaluate the cost as part of the “performance benchmark“ for Cloud.

Just don’t make this your regular operating model in the cloud (unless you can accept the budget implications!).

Why moving back to the data center is NOT a solution!

Despite all the problems of cloud, moving back to the data center is not necessarily the solution.

The short reason is simply:

Cloud offers technological advantages not viable in the datacenter at economies of scale - and this is likely to be the case forever.

The long reason deserves some commentary:

Automation

If automation is central to your requirements, the range and depth of automation offered by Cloud vendors is unmatched by what you can build in the physical data center yourself.

CAPEX vs OPEX.

While you can certainly rent physical equipment in your datacenter, most of the important infrastructure is going to need you to pay up front. Take the example of a multi-petabyte SAN like Hitachi Data Systems G1500, or similar SAN storage systems for big data.
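
To make the CAPEX-versus-OPEX trade-off concrete, here is a minimal Python sketch of a break-even calculation. The function name `break_even_month` and every figure you feed it are hypothetical planning inputs, not vendor prices, and the model deliberately ignores depreciation, financing and staffing.

```python
from typing import Optional


def break_even_month(capex_upfront: float,
                     onprem_monthly: float,
                     cloud_monthly: float,
                     horizon_months: int = 120) -> Optional[int]:
    """First month in which cumulative cloud spend exceeds cumulative
    on-premise spend, or None if that never happens within the horizon."""
    for month in range(1, horizon_months + 1):
        onprem_total = capex_upfront + onprem_monthly * month
        cloud_total = cloud_monthly * month
        if cloud_total > onprem_total:
            return month
    return None
```

Before the break-even month the cloud has cost you less cash; after it, the up-front purchase would have been cheaper, unless your cloud bill shrinks in the meantime.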

Scaling Limits.

When you need to burst to thousands of VMs for high performance web services or high performance computing jobs (data analytics batch processing, etc), you’ll eventually run out of compute capacity in any single datacenter. Any expansion will require capital expenditure and long order times. Licensing costs for operating systems and management software (e.g. anti-virus) also contribute.

Elasticity Limits.

In the cloud you can burst to thousands of VMs and containers over a short period to complete a specific computing objective. You pay only for that usage period.

In the data center you can’t do this without potentially bankrupting your enterprise investing in IT infrastructure.
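
The arithmetic behind this can be sketched in a few lines of Python. This is a simplification under stated assumptions: both scenarios use the same hypothetical per-VM hourly price, a month is approximated as 730 hours, and the function names are mine.

```python
HOURS_PER_MONTH = 730  # approximation of the average month length in hours


def monthly_cost_elastic(baseline_vms: int, burst_vms: int,
                         burst_hours: float, price_per_vm_hour: float) -> float:
    """Pay for the baseline all month and for burst VMs only while they run."""
    baseline = baseline_vms * HOURS_PER_MONTH * price_per_vm_hour
    burst = burst_vms * burst_hours * price_per_vm_hour
    return baseline + burst


def monthly_cost_fixed_peak(baseline_vms: int, burst_vms: int,
                            price_per_vm_hour: float) -> float:
    """Keep enough capacity for the peak available for the whole month."""
    return (baseline_vms + burst_vms) * HOURS_PER_MONTH * price_per_vm_hour
```

With, say, 10 baseline VMs and a 10-hour burst to 1,000 extra VMs, the gap between the two results is the price of holding peak capacity permanently.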

Features and Capabilities

The cloud provides access to a market for a massive range of tools and technologies. Many of these technologies are not cheaply or quickly available in your traditional datacenter environment.

Regional Limitations

Internet latency and regulatory needs may demand that you host your services in a datacenter close to your users - wherever they are in the world. Your options are: a) build data centers in those locations, b) find a hosting partner in the region, or c) use the cloud.

And many more …

What kind of IT department makes cost-savings in the cloud and how are they doing it?

Having said all this, there are definitely many companies benefiting from doing I.T in the cloud and not losing money doing it.

What are they doing right that you are not?

A quick survey of some of my successful cloud customers reveals some noticeable patterns of behaviour:

Cloud is used for a very specific application, for very specific components of the application architecture, for only those components that benefit exclusively from cloud
A high degree of automation is used to scale down and up rapidly - in response to actual load.
Cloud cost is actively managed, on a daily basis: Monitoring of cloud costs is granular, often reviewed and actioned. Cost analysis is part of daily cloud operations.
The cheapest services in cloud are used to match the requirement of the application. Over-provisioning is kept to a minimum.
Performance management of Applications is comprehensive: Regular performance and capacity benchmarking is carried out and used to tune Cloud resource usage.
Highly Skilled and Talented Cloud engineers are employed, with demonstrable ability to innovate solutions at short notice and implement rapidly.
Organisational Change Control and I.T Management supports rapid change of cloud infrastructure while still maintaining stability. They don’t have to break things to move fast. Their technical architecture makes this easy, allowing change with low risk.
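
To illustrate the second behaviour (scaling down and up in response to actual load), here is a minimal Python sketch of the decision an autoscaler makes. It is not any provider's autoscaling API; `desired_instances`, its default limits and the 70% utilisation target are assumptions for illustration.

```python
import math


def desired_instances(current_load: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 20,
                      target_utilisation: float = 0.7) -> int:
    """Instances needed to serve current_load at the target utilisation,
    clamped to the limits the architecture allows."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    usable_capacity = capacity_per_instance * target_utilisation
    needed = math.ceil(current_load / usable_capacity)
    return max(min_instances, min(max_instances, needed))
```

The important half is scaling down: when load drops, the result drops with it, and so does the bill. The `max_instances` ceiling doubles as a cost cap.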

Let's drill down further into these behaviours to identify concrete action points:

How can you transform your current cloud infrastructure to be more sustainable?

The good news is that much can be done to optimise cloud costs for sustainability:

Use the right architecture: You want to review the design of your current infrastructure in cloud. Are you designing for efficiency - including support process efficiency?
Trim useless features: You want to understand which features of Cloud you need and which you don't.
Put the right things in the right boxes: Some things should live in the data center, some should be SaaS, not IaaS. Some PaaS is better home-brewed. Some home-brewed components are just too labour-intensive to run.
Automate your cloud so that it largely runs itself. Provisioning processes are a big contributor of cloud costs. Automate! Automate! Automate! If you're doing cloud provisioning manually, give up any hope of cost savings.
Leave Legacy Skills in the data center: You want to change your operations in the cloud to a more suitable model. If you're using traditional on-premise IT engineering, slow to adopt automation, get rid of this ASAP.
Implement budget alerts on individual infrastructure: Understand the weekly and monthly costs of your cloud services and place budget cap monitoring on every single resource to drive cost analysis.
… and many more
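
The per-resource budget check from the list above could look something like the following Python sketch. The spend and budget figures would come from your provider's billing data; here they are plain dictionaries, and the function name `budget_alerts` and the 80% warning threshold are assumptions.

```python
def budget_alerts(spend: dict[str, float],
                  budgets: dict[str, float],
                  warn_ratio: float = 0.8) -> list[tuple[str, float]]:
    """Return (resource, spend/budget) for every resource at or above
    warn_ratio of its budget, worst first. A resource with no budget
    at all is reported with a ratio of infinity."""
    alerts: list[tuple[str, float]] = []
    for resource, amount in spend.items():
        budget = budgets.get(resource)
        if budget is None or budget <= 0:
            alerts.append((resource, float("inf")))
            continue
        ratio = amount / budget
        if ratio >= warn_ratio:
            alerts.append((resource, ratio))
    return sorted(alerts, key=lambda item: item[1], reverse=True)
```

Treating "no budget set" as the worst case is a deliberate design choice in this sketch: an unbudgeted resource is one nobody is watching.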

That’s a lot of stuff! Can we perhaps point to one or two specific areas to start with?

Realise one key thing if you hope to save money on the cloud:

The “Enterprise Architecture” of your cloud environment and apps will define cost effectiveness.

“What is enterprise architecture”, I hear you ask?

Enterprise architecture is a strategic framework that aligns an organization's business strategy, processes, information, and technology to achieve its goals. It provides a holistic view of the organization, enabling effective decision-making, optimization of resources, and adaptation to changes in the business environment.
It is a combination of multiple underlying disciplines, such as:
Business architecture
IT architecture
Technology architecture
Security architecture

Enterprise Architecture is a "big picture" plan your company uses to make sure Cloud fits into your current IT, security policy, technology stack and how your business makes money. It is strategic in nature.

The upper limit of your cloud cost savings is defined by two other factors under the "Enterprise Architecture" umbrella:

Your Cloud Application Architecture (is your architecture lean, or weighed down with lots of unused features?)
Your Cloud Operating Model (Are your cloud operations slow and manual instead of fast and automated?)

If you're locked into fixed models in these areas, your cost saving efforts aren't likely to yield dramatic results.

However, if there's flexibility to tune and adapt these models you may have options for tuning cloud costs. The more levers you have, the better.

Takeaway: Take a good look at your enterprise architecture and make sure it’s optimised for cloud sustainability.

One last thing:

Your current business model may also place limits to what you can achieve in cloud savings. Sometimes you need to look at your core business and understand if Cloud was ever the right way to support it.

… but that's a complex discussion for another post.

Managing Cloud Costs: Formal Practices and Methodologies - “Can FinOps help me?”

To be clear:

FinOps is a formal, systematic set of recommendations and practices developed by the FinOps Foundation to gradually optimise your cloud costs, OR to ensure your cloud costs are optimised from the start.

However this practice doesn’t come free.

If you take the wrong approach to implementing FinOps it could lose you more money than just wasting cash on Cloud.

FinOps may require almost the entire I.T organisation to mobilise to save costs.

This is sometimes an unacceptable overhead given the workload of the average I.T worker.

Let's use a suitable analogy:

“FinOps is like chemotherapy. It can save the life of a very sick patient (a business losing money from Cloud) that's already in the early stages of "cancer" (inability to manage cloud costs)” - Traiano G. Welcome (random cloud enthusiast)

From that perspective I highly recommend it if you have no other cure.

However, like chemo, you need to weigh the costs of FinOps vs. the benefits:

In my experience it's possible for the costs of implementing FinOps (largely manpower, consultancy fees and expensive FinOps software) to consume all the savings you expected by investing in it.

The gains of implementing FinOps can be lost quite easily if your company doesn’t adopt it seamlessly.
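
A back-of-the-envelope check is worth doing before you commit. Here is a minimal Python sketch, where every input is a hypothetical estimate you would supply yourself and `finops_net_benefit` is an illustrative name, not part of the FinOps framework:

```python
def finops_net_benefit(expected_annual_savings: float,
                       staff_hours: float,
                       hourly_rate: float,
                       tooling_cost: float,
                       consultancy_cost: float) -> float:
    """Expected savings minus the yearly cost of running the programme.
    A negative result means FinOps costs more than it saves."""
    programme_cost = staff_hours * hourly_rate + tooling_cost + consultancy_cost
    return expected_annual_savings - programme_cost
```

For example, 200,000 in expected savings against 2,000 staff hours at 60 per hour, 50,000 in tooling and 40,000 in consultancy comes out at a loss of 10,000.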

We've coined the following saying to capture this concept:

"The best FinOps is that which does not need to be practiced"

If your cloud strategy, architecture and operating methods have been designed from the beginning with cost management in mind, you should have no need to rely on implementing formal FinOps practices at your organization.

Your cloud should "run lean" by design. Cost management should be baked into the infrastructure design, monitoring and operational workflows.

FinOps: A few humble recommendations

Review the FinOps model holistically BEFORE you move to the cloud, then figure out how you're going to get your automation and infrastructure to natively implement its recommendations.
Ensure your current IT cost management methods are compatible with cost accounting for the cloud. Without this integration you may find yourself building a whole new accounting department just to deal with cloud!
The FinOps framework may not handle third party hidden costs of operating in the cloud. It is not a holistic cost management practice.
“Bean Counting” is not an effective application of FinOps. Counting every wasted penny of cloud spend won’t make you gains. It’s also not a task humans excel at and will burn your star employees out in short order. Cost saving measures need to be systemic changes that lead to significant, sustainable improvements.

See the FinOps website for more information: https://www.finops.org/

Are there any technical changes I can make to save costs?

At this point you might wonder why I don’t offer tons of technical advice for reducing cloud costs.

You might be wondering why I’m not talking more about application concurrency, multi-threading, memory management, algorithms etc …

Frankly, I’ve gone down that road before and the gains are minor - and easily lost.

Yes, it’s possible to improve the efficiency of your application and infrastructure by making many smart technical optimisations.

You should be doing this anyway.

The problem with trying to improve cloud efficiency using purely technical means is as follows:

Limited Gains: The impact of technical optimisations is limited by the architecture you’ve chosen. At some point no matter how well you tune your application and operating system it’s the overall design of your system that will present a bottleneck.
Vanishing Gains: Technical optimisations can be lost by a single upgrade or compliance change. Say you’ve spent months tuning Linux OS parameters and PostgreSQL performance parameters in a bid to reduce the number of servers in your cluster. You’ve finally made some gains and see a few days, maybe a few weeks, of savings. Then BOOM! A new compliance requirement comes in, or the application requires new database access parameters, or Linux gets a security patch that conflicts with your optimisations! Oh dear, there goes all your hard work, and your cluster blows back up to double the original size.
Complexity: Your application can become dependent on many clever performance optimisations, hacks and perhaps even radical innovations. You can see the problem already: the level of engineering knowledge required to maintain this is significant, and now you’re spending more on R&D than you might have spent just accepting your current cost levels.
Remember: If you’re spending more money (time) doing advanced technical R&D to save Cloud costs than you’re actually losing through poor cloud cost management, then you’ve already lost the game.

A technical point: The use of “spot” resources obtained by “bidding” on cloud resources

The list of technical optimisations for cloud cost saving is endless and ever-growing but I will only touch on one for now:

The use of transient “spot market” virtual machines to make savings by bidding in an artificial market for compute resources.

Can AWS/Azure "spot" instances help you?

Spot Instances are significantly discounted: up to 90% off On-Demand prices on Amazon EC2.

If your application is able to take advantage of spot instances you could make substantial savings on cloud. However:

The downsides of Spot Instances

They’re interruptible. Spot prices are volatile.
If the spot price changes or the provider needs that capacity for regular instances, your machines can be terminated at short notice.
On GCP and Azure, this means 30 seconds' notice; AWS provides a two-minute warning before the instance is terminated.
Long-running, uninterruptible processes aren’t appropriate for Spot Instances.
Spot Instances also may not be available at your desired price. If you have an agent fleet made up of Spot Instances you can run into issues if the Spot Instance price spikes.

What this is telling you is that Spot Instances fit a very narrow set of uses: Your application must tolerate being run on temporary instances that can be ripped out from under you at any time.

This technology also depends on your ability to build extremely good automation to bid on spot servers, install applications on them and extract results quickly before they disappear back into ‘the market’.

Who would even use such a thing!?

Spot Instances can be used for tasks that are flexible and can tolerate interruptions. They are not suited for mission-critical applications that need constant uptime. Use Spot Instances for background processing, data analysis, batch jobs, and non-essential tasks. Your application must be able to handle being stopped and restarted.
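
What "able to handle being stopped and restarted" means in practice can be sketched in Python as a batch job that checkpoints its progress. This is an illustration under assumptions, not a provider integration: `interrupted` is a hypothetical callback standing in for whatever interruption notice your provider exposes, and the local checkpoint file stands in for durable storage outside the instance (a local disk disappears along with the spot server).

```python
import json
import os
from typing import Callable


def run_batch(items: list[str],
              process: Callable[[str], None],
              interrupted: Callable[[], bool],
              checkpoint_path: str) -> bool:
    """Process items in order, resuming after the last checkpoint.

    Returns True when every item is done, False when the job stopped
    early because interrupted() reported an imminent termination.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)["done"]

    for index in range(done, len(items)):
        if interrupted():
            return False  # stop cleanly; the next instance resumes from here
        process(items[index])
        # Write to a temp file, then rename, so a kill mid-write
        # cannot leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump({"done": index + 1}, fh)
        os.replace(tmp_path, checkpoint_path)
    return True
```

If the instance dies between `process` and the checkpoint write, that one item is processed again on restart, so each unit of work needs to be safe to repeat.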

Apparently, companies like AutoDesk benefit from Spot Instances. You can read their case study here: https://aws.amazon.com/solutions/case-studies/autodesk-spot-instances/

Takeaway:

If your application supports it and you’re skilled at automation, you could make substantial savings using spot resources.
This is an example of one of those technologies only the Cloud can deliver due to its ability to create economies of scale and provide advanced automation.
Whether this feature is useful to your business case is another matter.

Where to go from here?

Ponder the following and act accordingly (or don't!):

Improve your Cloud Strategy: Cost saving is a holistic problem. Review your business case for Cloud. Understand the unique advantages cloud offers (hint - not the same as your traditional data center). Confirm whether you are even using Cloud for the right things.
Improve your cloud cost management practice: Review the FinOps framework and compare it with your current cloud accounting method. Consider adopting a “Lean“ version of FinOps.
Improve your Technical Implementation: Review the way you're currently architecting infrastructure. Are you doing high availability where it's not needed? Are you using PaaS where a container on a VM will do? Are you using premium SKU/Model of VM when a Standard or micro instance will do? There’s a legion of technical things you can do!
Review the skills of your engineers: Are they actively discovering, proposing and executing cost savings? Do your engineers care at all about cost? If not, can you make this a KPI?
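
As one small example of the "premium SKU versus standard instance" question above, here is a Python sketch that picks the cheapest instance type meeting a stated requirement. The catalogue entries, names and prices you pass in are hypothetical; in practice they would come from your provider's price list, and real sizing also has to consider disk, network and licensing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    hourly_price: float  # hypothetical price, not a real quote


def cheapest_fit(catalogue: list[InstanceType],
                 needed_vcpus: int,
                 needed_memory_gib: float) -> Optional[InstanceType]:
    """Cheapest instance type that satisfies both requirements, or None."""
    candidates = [
        t for t in catalogue
        if t.vcpus >= needed_vcpus and t.memory_gib >= needed_memory_gib
    ]
    return min(candidates, key=lambda t: t.hourly_price, default=None)
```

The hard part is not the selection but the inputs: `needed_vcpus` and `needed_memory_gib` should come from measured usage, not from what was provisioned last time.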

Questions? Criticisms? Pitchforks? Feel free to reach out to me on my substack and let rip!
