
What caused the UK 999 emergency phone system to fail on Sunday 25th June 2023

On the morning of Sunday 25th June 2023 the UK's 999 emergency system suffered one of its biggest failures to date. Calls to 999 were either disconnected or could not get through at all.

Since 1937, BT Group has supported the UK’s emergency services by handling all 999 calls. BT say that they take great pride in underpinning the national emergency response service and recognise the critical national role their infrastructure plays. While no technology is 100% resilient, they have built a highly robust network with multiple layers of protection to connect the public to emergency services in their time of need.

The level of disruption to the service on Sunday 25th June has never been seen before and BT say they are sincerely sorry for the distress caused.

Timeline of events – via BT

At 06:24 on Sunday 25 June, our 999 call handling agents started to experience problems with some emergency calls being cut off on connection to the emergency services. Our network management team was informed at 06:40 and an internal incident was raised at 07:02.

After initial investigations failed to identify or ameliorate the problem, a conference call was opened at 07:20 with our technical specialist teams to further investigate.

The teams were aware of a growing number of 999 calls being impacted and the root cause of the issue was unknown.

To ensure a triple resilient network, we run the service on three primary network clusters, each of which has the capacity to handle all 999 call traffic. It was unclear which network cluster was affected because no alarms were presented. The decision was therefore made to switch from the three primary network clusters to the 999 backup system at 07:25.

Transfer to the backup system was attempted at 07:31. At 07:46 it became clear that this had been unsuccessful. While the backup system itself was ready to handle calls, the complex transfer process had not been completed successfully. We have since put in place steps to simplify this process.
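To make that failure mode concrete, here is a minimal sketch in Python. BT have not published how their 999 failover actually works, so every name below (BackupSystem, fail_over_to_backup, CallRoute) is hypothetical; the only point it illustrates is that a transfer is not done until it has been verified end to end.

```python
# Minimal sketch only: BT has not published how its 999 failover works, and
# every name here (BackupSystem, fail_over_to_backup, CallRoute) is hypothetical.
# The point it illustrates: a transfer is not done until it is verified end to end.

from dataclasses import dataclass


@dataclass
class CallRoute:
    target: str
    verified: bool


class BackupSystem:
    """Hypothetical stand-in for the 999 backup platform."""

    def ready(self) -> bool:
        # Reports whether the backup could take traffic.
        return True

    def test_call_succeeds(self) -> bool:
        # A synthetic end-to-end test call; "ready" is not the same as
        # "the transfer actually completed".
        return True


def fail_over_to_backup(backup: BackupSystem) -> CallRoute:
    if not backup.ready():
        raise RuntimeError("backup not ready; stay on primary clusters and escalate")
    if not backup.test_call_succeeds():
        raise RuntimeError("transfer incomplete; do not declare failover successful")
    return CallRoute(target="backup", verified=True)


if __name__ == "__main__":
    print(fail_over_to_backup(BackupSystem()))
```

The gap the sketch guards against, a backup that is ready while the transfer to it has not completed, is exactly the one described in the paragraph above.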

In our ongoing attempts to restore full service we brought one of the primary network clusters back into operation. However, as became apparent later on, the network cluster that had been selected to attempt service restoration was where the fault lay. This resulted in callers being unable to connect to the 999 service between 07:32 and 08:50.

The incident priority was increased at 07:47 and after the relevant teams had been briefed, at 08:01 the Lead 999 Centre notified all Emergency Authorities of the situation simultaneously via email.

The incident was escalated to SI (serious incident) status at 08:20 which, in line with our internal processes, meant that our Civil Resilience team was notified (at 08:44).

Transfer to the backup system was successfully initiated at 08:37 (for calls from landlines) and 08:50 (for calls from mobiles). This significantly improved 999 call answer success, albeit with only basic service functionality, which meant increased call pick-up and call handling times.

Ofcom was alerted via a call at 09:05. The first of our external-facing media statements was issued at 09:35 confirming the issue and making clear that our backup system was online and that people should call 999 as usual. At 09:45 an email notification was sent to the Department for Science, Innovation and Technology (DSIT), Ofcom and the Devolved Administrations.

With the backup system operating successfully, the teams’ primary focus switched to root cause investigations to enable return to the primary 999 call system. At 11:54, following further diagnosis, we started to reintroduce non-emergency traffic to the non-impacted primary network clusters, while continuing to isolate the impacted cluster.

After extended monitoring we began moving emergency calls onto the primary network clusters from 14:52. By 16:56 all emergency calls were being handled by the primary 999 system.

In parallel, diagnostics and a temporary fix on the impacted network cluster meant it was re-introduced at 20:50, firstly for non-emergency traffic. After no issues were experienced, emergency calls were re-introduced at 21:29.
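The ordering BT describe, non-emergency traffic first and emergency calls only after a clean soak period, is a standard staged re-introduction pattern. A rough illustration follows; the thresholds, soak times and the error_rate() stub are invented for the sketch and are not drawn from BT's procedures.

```python
# Generic staged re-introduction sketch, not BT's procedure. Thresholds,
# soak times and the error_rate() stub are all invented for illustration.

import random
import time


def error_rate(sample_size: int = 200) -> float:
    # Placeholder for real monitoring; simulates a healthy cluster.
    failures = sum(random.random() < 0.001 for _ in range(sample_size))
    return failures / sample_size


def reintroduce_cluster(cluster: str, soak_seconds: float = 1.0) -> bool:
    # Non-emergency traffic goes back first; emergency calls only follow
    # if the soak period stays clean.
    for phase in ("non-emergency", "emergency"):
        print(f"{cluster}: routing {phase} traffic")
        time.sleep(soak_seconds)
        if error_rate() > 0.01:
            print(f"{cluster}: errors during {phase} phase, rolling back")
            return False
    print(f"{cluster}: fully back in service")
    return True


if __name__ == "__main__":
    reintroduce_cluster("cluster-3", soak_seconds=0.1)
```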

At 22:14, following approvals from Government, we issued the second of our media statements confirming that the service was restored, and we were no longer relying on the backup system.

Remedial actions and next steps

During the disrupted period on Sunday, we have provisionally identified a total of 11,470 unique calls that failed to connect to 999. For each of these calls, BT’s commitment is to call the customer back and establish whether further help is needed and, if required, connect them to the appropriate emergency service. If our contact is unsuccessful, we pass the details on to the police to investigate. Following the incident, this process was completed by 08:16 on Wednesday 28 June.
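As a rough illustration of that callback commitment, the sketch below de-duplicates the failed attempts to unique callers, attempts contact, and refers anyone unreachable to the police. The data model, phone numbers and reach() callable are invented; BT's real process is not public.

```python
# Rough sketch of the callback commitment described above. The data model,
# phone numbers and reach() callable are invented; BT's real process is not public.

from dataclasses import dataclass


@dataclass
class FailedCall:
    caller: str        # calling line identity
    attempted_at: str  # time of the failed 999 attempt


def reconcile(failed_calls: list[FailedCall], reach) -> dict[str, list[str]]:
    """Call back each unique caller; refer anyone unreachable to the police."""
    outcome: dict[str, list[str]] = {"reached": [], "referred_to_police": []}
    for caller in sorted({c.caller for c in failed_calls}):  # one attempt per caller
        bucket = "reached" if reach(caller) else "referred_to_police"
        outcome[bucket].append(caller)
    return outcome


if __name__ == "__main__":
    calls = [FailedCall("+447700900001", "06:31"),
             FailedCall("+447700900001", "06:35"),
             FailedCall("+447700900123", "07:40")]
    print(reconcile(calls, reach=lambda caller: caller.endswith("1")))
```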

Since Sunday evening, our investigations have found a complex software issue that had never been seen through our continuous testing regime. It caused a ‘caching’ issue which resulted in affected calls not being routed correctly and being disconnected.
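BT have not published the defect itself, but the description, a caching problem that stopped calls routing correctly, matches a familiar failure class. The toy example below is generic Python, not BT's code; it shows how a single poisoned cache entry can keep disconnecting calls even though the authoritative routing data is untouched.

```python
# Toy example only; BT has not published the defect. It shows a generic way a
# poisoned cache entry can keep disconnecting calls even though the
# authoritative routing data is fine.

ROUTING_TABLE = {"999": "emergency-call-handling"}  # authoritative source
route_cache: dict[str, str | None] = {}             # hot-path cache


def route(dialled: str) -> str:
    if dialled in route_cache:              # cache hit, even if the entry is bad
        target = route_cache[dialled]
    else:
        target = ROUTING_TABLE.get(dialled)
        route_cache[dialled] = target
    if target is None:
        raise ConnectionError(f"no route for {dialled}: call disconnected")
    return target


# Normal operation: the call routes.
assert route("999") == "emergency-call-handling"

# A fault writes a bad entry into the cache; every later lookup now fails
# until the cache is cleared, mirroring "calls not being routed correctly".
route_cache["999"] = None
try:
    route("999")
except ConnectionError as exc:
    print(exc)
```

In this toy version the fix is simply clearing the bad entry; BT describe their equivalent as a temporary fix on the impacted cluster ahead of a permanent one.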

We have identified the root cause of the initial fault and have put a robust temporary fix in place. The system is stable and running as normal, and we are now testing the permanent fix.

We are putting in place significant improvements to our systems and processes, and we will fully cooperate with Ofcom’s investigation.

Duncan

Duncan is a technology professional with over 20 years’ experience of working in various IT roles. He has an interest in cyber security, and has a wide range of other skills in radio, electronics and telecommunications.
