Teridion has set out to make the internet faster than it is when left to its own devices. Just like the internet, Teridion works on an “always on” mantra. It is designed to ensure that a SaaS provider’s services and applications remain up as long as there is connectivity. In the worst case, Teridion temporarily falls back to the regular public internet connection while it handles the event and recovers to its optimized, healthy state. To understand how Teridion deals with such events and failures, it is first important to recognize the major types of outages suffered over the internet.
Network Blackouts and Brownouts
The most common outages experienced over the internet are network related. These take the form of either a total loss of connection, known as a blackout, or a degraded connection state, known as a brownout. Blackouts typically occur when there is a physical issue with routers, switches, or links. ISPs are constantly undergoing upgrades and unplanned downtime, which can result in complete network loss. Brownouts can be physical in nature or can be caused by traffic congestion. Large volumes of traffic over a narrow link can cause packet loss, retransmissions, and spikes in latency, degrading the quality and state of the connection. Typically, during a brownout, everyone suffers from what’s known as the “noisy neighbor” problem. Brownouts are quite common in a multi-tenant environment, where a single application can end up monopolizing shared network resources. Even though alternate paths are available during blackouts and brownouts, BGP (Border Gateway Protocol), the internet routing protocol, is not well equipped to handle these events in a timely fashion. During a blackout, it will eventually recalculate a new path. However, it does not provide any mitigation during brownouts. Teridion is designed with these occurrences in mind and can mitigate these network events quickly.
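To make the blackout/brownout distinction concrete, here is a minimal sketch of how a window of probe results might be classified. It is not Teridion's detection logic; the `ProbeWindow` type, the `classify_link` function, and the loss and latency thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeWindow:
    """Results of one measurement window (hypothetical structure)."""
    sent: int                   # number of probes sent
    rtts_ms: Sequence[float]    # round-trip times of the probes that came back


def classify_link(window: ProbeWindow,
                  baseline_rtt_ms: float,
                  loss_threshold: float = 0.02,
                  latency_factor: float = 2.0) -> str:
    """Return 'blackout', 'brownout', or 'healthy' for one probe window.

    The thresholds are illustrative assumptions, not published values.
    """
    if window.sent <= 0:
        raise ValueError("at least one probe must be sent")

    received = len(window.rtts_ms)
    if received == 0:
        # Nothing came back: total loss of connection.
        return "blackout"

    loss = 1 - received / window.sent
    avg_rtt = sum(window.rtts_ms) / received

    # Degraded but not dead: packet loss or a latency spike.
    if loss > loss_threshold or avg_rtt > latency_factor * baseline_rtt_ms:
        return "brownout"
    return "healthy"
```

For example, a window in which 8 of 10 probes return would be classified as a brownout under these assumed thresholds, while a window with no replies at all would be a blackout.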
Cloud/SaaS Infrastructure and Application Outages
Applications hosted in a public cloud can suffer outages due to errors in the service provider’s infrastructure or in the application itself. Even though these are rare, they have been known to happen.
Fig. 1: Facebook suffered an outage in Sep 2015
Cloud infrastructure is built to expect failure and self-heal, but it can still fall victim to these outages. Even cloud giants like Amazon Web Services suffer outages from time to time. The last time AWS suffered an outage, for around 40 minutes in 2013, the company lost about $1,100 per second of net sales during that time (over $2.6 million in less than an hour). Teridion leverages the public cloud infrastructure as an “overlay” to the physical ISP network layer to provide performant, intelligent routing for the internet. If an outage occurs in the cloud layer, Teridion is designed to manage these events as required.
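The outage cost quoted above follows directly from the per-second figure and the outage duration. A quick check, using only the two numbers given in the text:

```python
LOSS_PER_SECOND_USD = 1_100   # net sales lost per second, as quoted above
OUTAGE_MINUTES = 40           # approximate outage duration, as quoted above

total_loss_usd = LOSS_PER_SECOND_USD * OUTAGE_MINUTES * 60
print(f"${total_loss_usd:,}")  # prints $2,640,000
```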
A typical Teridion network backbone consists of two (or more) TCRs (Teridion Cloud Virtual Routers) deployed in HA at each end of the connection. One TCR at each end serves as the Primary for a particular application, as shown below.
Fig. 2: Teridion Cloud Routers Deployed as HA
Client-side TCR Unreachable
Since all user traffic is intercepted via the client-side TCR, Teridion’s DNS service providers monitor both client-side TCRs on specified service ports, for example HTTP or HTTPS, to verify their state (i.e., up or down). Monitoring takes place from multiple locations (across five continents), and if a TCR is unreachable, its down state is verified by multiple monitors.
Fig. 3: DNS Server showing TCR Up for Teridion Service
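The multi-monitor verification described above can be sketched as a simple quorum check. This is an illustration under stated assumptions, not the DNS provider's actual logic: the function name `tcr_is_down`, the monitor location labels, and the `min_confirmations` threshold are all hypothetical.

```python
from typing import Mapping


def tcr_is_down(reports: Mapping[str, bool], min_confirmations: int = 2) -> bool:
    """Decide whether a TCR should be treated as down.

    `reports` maps a monitor location to True if its probe reached the TCR
    on the monitored service port, and False otherwise. The TCR is declared
    down only when at least `min_confirmations` monitors agree, so a single
    monitor with a bad local path cannot trigger a failover on its own.
    """
    failures = sum(1 for reachable in reports.values() if not reachable)
    return failures >= min_confirmations


# Hypothetical reports from monitors in five regions.
reports = {
    "north-america": False,
    "europe": False,
    "asia": True,
    "south-america": False,
    "australia": True,
}
print(tcr_is_down(reports))  # prints True: three monitors confirm the failure
```

Requiring agreement from more than one location trades a little detection speed for fewer false failovers.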
Monitoring is continuous: heartbeats occur multiple times per minute, and their timing is configurable. If the primary client-side TCR is unreachable, the DNS server immediately redirects traffic to the secondary TCR, and Teridion service continues without a hitch. The Teridion Management System (TMS) spins up new TCRs as required to restore the HA state.
If both TCRs are unreachable (for example, during a catastrophic outage of the cloud data center where the TCRs reside), the DNS service identifies both TCRs as down and reverts to the original domain name for the SaaS provider data center, routing traffic over the regular public internet. Even during total outages, Teridion ensures that the service remains up. Once the TMS deploys the new TCRs, in either the same or an alternate cloud data center, the DNS resumes the service over the optimized Teridion backbone network. The TMS uses API calls to both the cloud providers and the DNS service to automate the entire process of spinning up new TCRs and creating DNS name records.
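The DNS-level behavior described in the last two paragraphs reduces to a three-way decision. The sketch below shows that decision only; the function `dns_target` and all hostnames are hypothetical, and the real DNS and TMS APIs are not shown in this post.

```python
def dns_target(primary_up: bool, secondary_up: bool, *,
               primary: str, secondary: str, origin: str) -> str:
    """Pick the name the DNS service should answer with.

    Preference order, as described in the text:
    primary client-side TCR, then secondary TCR, then the SaaS provider's
    original domain (traffic goes over the regular public internet).
    """
    if primary_up:
        return primary
    if secondary_up:
        return secondary
    return origin


# Hypothetical names, for illustration only.
names = dict(
    primary="tcr-1.client.example.net",
    secondary="tcr-2.client.example.net",
    origin="app.saas-provider.example.com",
)

print(dns_target(False, True, **names))   # prints tcr-2.client.example.net
print(dns_target(False, False, **names))  # prints app.saas-provider.example.com
```

Because the decision is recomputed from current health, service returns to the optimized path as soon as a replacement TCR is reported up.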
Server-side (Data Center) TCR Unreachable
If the Primary server-side TCR becomes unreachable (for instance, during a network blackout or brownout), the client-side TCR, being aware of both server-side TCRs, routes traffic to the secondary (or auxiliary) TCR, maintaining the Teridion service.
Finally, if both server-side TCRs are unreachable, the client-side TCR redirects traffic directly to the original domain of the SaaS provider data center over the regular public internet. Yet again, Teridion aims to maintain maximum service uptime in the face of total catastrophic failure. Once the TMS has orchestrated recovery of the server-side TCRs, Teridion service resumes immediately.
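To show how the server-side fallback and recovery play out over time, the following sketch replays a sequence of health snapshots through a next-hop selector. It is a simplified model, not the TCR's actual routing code: `next_hop`, the TCR labels, and the origin hostname are assumptions made for the example.

```python
from typing import Dict, List

ORIGIN = "app.saas-provider.example.com"  # hypothetical origin domain


def next_hop(health: Dict[str, bool], preference: List[str],
             origin: str = ORIGIN) -> str:
    """Return the first reachable server-side TCR in preference order,
    or the origin domain if none is reachable."""
    for tcr in preference:
        if health.get(tcr, False):
            return tcr
    return origin


preference = ["server-tcr-primary", "server-tcr-secondary"]

# Four snapshots: all healthy, primary lost, both lost, both recovered.
timeline = [
    {"server-tcr-primary": True, "server-tcr-secondary": True},
    {"server-tcr-primary": False, "server-tcr-secondary": True},
    {"server-tcr-primary": False, "server-tcr-secondary": False},
    {"server-tcr-primary": True, "server-tcr-secondary": True},
]

routes = [next_hop(snapshot, preference) for snapshot in timeline]
for step, route in enumerate(routes):
    print(step, route)
```

The resulting routes go from the primary, to the secondary, to the origin domain, and back to the primary once the TCRs are reported healthy again.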
Note that the discussion here focuses on loss of TCR connectivity. In some cases, the DNS provider itself may be the cause of disruption. Here, provisioning of multiple DNS providers is key. Teridion does this, and we’ll explain how in a future blog.
At Teridion, we understand we have a big responsibility. We know and welcome the need to sustain the high standards of user expectations as well as application performance. We are aware that the internet is never fully up, but it’s also never down. It’s constantly undergoing changes and events, and we must navigate around these to provide a consistent experience to both our end users and SaaS providers.