Inside Monzo's strategy to be the UK's most reliable bank

As the industry is blighted by customer-impacting outages, Monzo is working hard to steel its platform and set a new standard


Britain's major banks continue to be dogged by digital outages, with major incidents impacting individuals and businesses an all-too-regular occurrence. Challenger bank Monzo wants to set a new, better standard.

In 2018, the Financial Conduct Authority reported that financial services outages had increased by 187 percent over the previous year, with 65 percent of them at retail banks. Among them was the infamous failed migration of a core system at TSB, which locked millions of customers out of their accounts for weeks.

Founded in 2015 and having already amassed 3 million users, Monzo has the advantage of having been built in the cloud, rather than patched together from a stack of legacy systems and services as the big incumbent banks are. That doesn't mean it is immune to downtime though.

"We are acutely aware that we've had nothing majorly go wrong this year, but I feel certainly like we've delivered a less reliable experience than I'd like," Chris Evans, platform team lead at the challenger bank, told Computerworld during the Amazon Web Services (AWS) re:Invent conference in Las Vegas last week.

His goal for the year ahead is to boost overall resiliency and ensure that Monzo is "the most reliable bank that people can go to".

Splinter cell

Evans joined Monzo last year to lead the platform team of 12, at the same time that new CTO Meri Williams arrived from custom printing startup Moo, where the two had worked together. Monzo runs the majority of its technology in-house, predominantly on the AWS platform, with its more advanced data workloads on Google Cloud Platform and a healthy dose of containers and Kubernetes orchestration underpinning everything.

Evans' approach to boosting resiliency across this stack starts with mapping any potential issues or pinch points which need to be addressed if Monzo is to reach the lofty standards he is aiming for.

This involves a shift towards a cell-based architecture, an emerging model for highly scalable environments, where services can be carved up in a way that can "limit the blast radius" when incidents inevitably occur.

He asked: "How do we make it really, really difficult for any one change to bleed out and affect anything more than that specific component that it is changing?"

This means splitting up any areas of dependence so that any negative impacts only affect a small area of banking functionality, rather than all services and customers.
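To make the "blast radius" idea concrete, here is a minimal sketch of cell-based routing in Python. It is an illustration only, not Monzo's implementation: the cell names, the `cell_for` and `affected_customers` helpers and the hash-based assignment are all assumptions made for the example. Each customer is pinned to one cell, so a fault in a single cell touches only the customers routed to it.

```python
import hashlib

# Hypothetical cell names; the article does not describe Monzo's actual layout.
CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Deterministically map a customer to one cell using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def affected_customers(customer_ids: list[str], failed_cell: str) -> list[str]:
    """Return only the customers whose requests are routed to the failed cell."""
    return [cid for cid in customer_ids if cell_for(cid) == failed_cell]


if __name__ == "__main__":
    # Made-up customer IDs, used purely to show the effect of one cell failing.
    customers = [f"user_{n}" for n in range(1000)]
    hit = affected_customers(customers, "cell-b")
    print(f"{len(hit)} of {len(customers)} customers affected by a cell-b failure")
```

In this sketch a failure in one of four cells reaches roughly a quarter of customers rather than all of them. A real system would also need to handle moving customers between cells and any services shared across cells, which this example leaves out.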

For example, in August, Evans published a blog post regarding an outage the bank suffered in July due to some mistakes made when scaling its use of the open source Apache Cassandra data store.

As the bank's customer base continued to grow beyond expectations, it needed to scale its Cassandra capacity because "during peak load, our Cassandra cluster was running closer to its limits than we'd like. And even though this wasn't affecting our customers, we knew if we didn't address it soon we'd start seeing an increase in the time it'd take to serve requests," Evans wrote.

The answer, long term, for Monzo is to split up its single Cassandra cluster, which will "drastically reduce the likelihood and impact of repeat issues like this one, and make it safer for us to operate at scale," Evans added.

Monzo as an organisation takes a highly transparent approach to its failings, using its transparency dashboards and social media channels to keep customers abreast of service outages.

"These other banks are going up and down and, every time the clocks change, they will go down for however long. I think for us, because we share everything that goes wrong, the key goal is to not ever go wrong and then we don't have to tread that line and write big blog posts around Cassandra going wrong," Evans said.

A shift to managed services

A priority for Evans next year is to determine exactly where Monzo can hand over some responsibility to third-party vendors to run managed services.

"We're now at the point where we've grown enough that it doesn't feel like a lot of the things that we're running ourselves are differentiating," he said.

For example, Monzo is running its own Apache Kafka stack for streaming data, but AWS offers Managed Streaming for Apache Kafka (MSK).

"We could definitely just give that up, it's not a useful thing for us to spend time doing when it's a solved problem," Evans said.

"Kafka was the first one we were going after, and not really because it's causing us major operational burden now, but because we're seeing increasing usage of it across Monzo and it's going to get to the point where we're running many, many, many Kafka clusters and getting ahead of that to the point where we're not swamped by operational work is really what we're focused on."

Another example arrived in Evans' lap last week, when AWS announced a fully managed version of the Cassandra database. The news certainly got his attention: he had sounded out AWS VP of cloud architecture strategy Adrian Cockcroft about the idea last year, only to be told that there wasn't enough appetite for a managed Cassandra service.

"Clearly that's changed. So we have been on to account managers and will be very, very keen to look at that," he said. "We have a whole list of ideas of things we want to go after, which are really going to be highly leveraged things that will allow us to be much better as a bank, or much cheaper as a bank."

That being said, Evans isn't going to be shifting Monzo's use of Kubernetes to a managed service just yet.

"Kubernetes of all the things that we run, it's the thing that causes us the least issues and running it ourselves and understanding it means we've got really good expertise," he said.

However, the bank is trialling Amazon's Elastic Kubernetes Service (EKS) on a build cluster at present.

Getting ahead of the curve

This broad shift does create tension at an organisation like Monzo, which Evans is keenly aware of.

"I think certainly at Monzo, where it is a really strong engineering culture, people are quite nervous around handing off things that we see that we can jump in and fix, where suddenly we wouldn't be able to do that," he said. "You do give up that control and it's whether you trust that the person you've given it up to could do a better job than you of fixing that. We're clearly not immune from making mistakes with Cassandra, we will probably make them again in future. That's really the tension that exists, I think."

That being said, the shift towards managed services is more about easing the operational burden on the technology function as the company continues to grow at a rapid rate.

He added: "It's about getting ahead of the curve before you are so swamped with operational work that you can't innovate anymore. It's one of the biggest concerns I have... and I think managed services, for me, really is the answer."

Evans is also eyeing more auto-scaling and the use of discounted spare compute capacity, called spot instances, to reduce the cost of running Monzo's technology in the new year.

Copyright © 2019 IDG Communications, Inc.
