Upgrading backend servers in a production environment can be a challenge for your operations or DevOps team, whether they are dealing with an individual server or upgrading an application by moving to a new set of servers. Putting upstream servers behind NGINX Plus can make the upgrade process much more manageable while also eliminating or greatly lessening downtime.
[Editor – This post has been updated to use the NGINX Plus API for dynamic configuration of upstream groups, replacing the Upstream Conf module which was originally used.]
In a three‑part series of articles, we’ll focus on NGINX Plus – with a number of features above and beyond those in NGINX Open Source, it’s a more comprehensive and controllable solution for upgrades with zero downtime. This first article describes the two NGINX Plus features you can use for backend upgrades – the NGINX Plus API and health checks – in detail and compares them to upgrading with NGINX Open Source.
The related articles explain how to use the methods for two classes of upgrades:
- Upgrading hardware or software on an individual server machine
- Upgrading to a new version of an application by switching traffic to completely different servers or upstream groups
Choosing an Upgrade Method in NGINX Plus
NGINX Plus provides two methods for dynamically upgrading production servers and application versions:
- NGINX Plus API – Use the NGINX Plus API to send HTTP requests to NGINX Plus that add, remove, or modify the servers in an upstream group.
- Application‑aware health checks – Define health checks so that you can purposely fail servers you want to take out of the load balancing rotation, and make them pass the health check when they are again ready to receive traffic.
The two methods differ with respect to several factors, so the choice between them depends on your priorities:
- Speed of change – With the API, the change takes effect immediately. With health checks, the change doesn’t take effect until a health check fails (the default frequency of health checks is 5 seconds).
- Initial traffic volume – With health checks, you can configure slow start: when a server returns to service, NGINX Plus slowly ramps up the load to the server over a defined period, allowing applications to “warm up” (populate caches, run just‑in‑time compilations, establish database connections, and so on). The server is not overwhelmed by connections, which might time out and cause it to be marked as failed again. With the API, NGINX Plus by default immediately sends a server its full share of traffic.
- Automation and scripting – With the API, you can automate and script most phases of the upgrade, and do so within the NGINX Plus configuration. To automate upgrades when using health checks, you must also create scripts that run on the servers being upgraded (for example, to manipulate the file used for semaphore health checks).
In general, we recommend the NGINX Plus API for most use cases because changes take effect immediately and the API is fully scriptable and automatable.
Upgrading with NGINX Open Source
First, let’s review how upgrades work with NGINX Open Source, and explore some possible issues. Here you change upstream server groups by editing the upstream configuration block and reloading the configuration file. The configuration reload is seamless because a new set of worker processes is started to use the new configuration, while the existing worker processes continue to run and handle connections that were open when the reload occurred. Each old worker process terminates as soon as all its connections have completed. This design guarantees that no connections or requests are lost during the reload, and makes the reload method suitable even when upgrading NGINX itself from one version to another.
Depending on the nature of the outstanding connections, the time it takes to complete them all can range from just seconds to several minutes. If the configuration doesn’t change often, running two sets of workers for a short time usually has no bad effects. However, if changes (and consequently reloads) are very frequent, old workers might not finish processing requests and terminate before the next reload takes place, leaving multiple sets of workers running at once. With enough workers, you might eventually end up exhausting memory and hitting 100% CPU, particularly if you’re already optimizing use of resources by running your servers at close to capacity.
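One way to notice worker sets stacking up after frequent reloads is to count the old workers that are still finishing their connections. The sketch below assumes the process title `nginx: worker process is shutting down`, which NGINX sets on workers that are exiting gracefully. The function name is hypothetical, and the input is any process listing on standard input.

```shell
# Count NGINX workers that are still draining connections after a reload.
# Reads a process listing on stdin so it can be fed from ps or a saved file.
count_draining_workers() {
    # grep -c prints 0 but exits non-zero when nothing matches; mask that status.
    grep -c 'nginx: worker process is shutting down' || true
}

# Typical use on the NGINX host (requires ps):
#   ps -eo args | count_draining_workers
```

A count that keeps growing across reloads suggests that old workers are not finishing before the next reload starts.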
When you’re load balancing application servers, upstream groups are the part of the configuration that changes most frequently, whether it’s to scale capacity up and down, upgrade to a new version, or take servers offline for maintenance. Customers running hundreds of virtual servers load balancing traffic across thousands of backend servers might need to modify upstream groups very frequently. Using NGINX Plus’ API or health checks, you avoid the problem of frequent configuration reloads.
Overview of NGINX Plus Upgrade Methods
The use cases discussed in the two related articles use one of the following methods, sometimes in combination with auxiliary actions.
Upgrading with the NGINX Plus API
To use the NGINX Plus API to manage the servers in an upstream group, you issue HTTP methods against the following base URL. We’re using the conventional location name for the API, /api, but you can configure a different name (see the section about the base configuration in the second or third article).
http://NGINX-server[:port]/api/api-version/http/upstreams/upstream-group-name/servers
In the commands below, this URL is represented as BASE-URL.
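In a script, the base URL can be assembled once from its parts and reused. This is a minimal sketch: the host, API version, and upstream group name are placeholder assumptions rather than values from this article, and the shell variable is named `BASE_URL` because shell variable names cannot contain a hyphen.

```shell
# Placeholder values: substitute your own host, API version, and upstream name.
NGINX_HOST="localhost:8080"   # hypothetical host[:port] where the API is exposed
API_VERSION="6"               # assumption: use a version your NGINX Plus release supports
UPSTREAM="demoapp"            # hypothetical upstream group name

# Same shape as the base URL given above.
BASE_URL="http://${NGINX_HOST}/api/${API_VERSION}/http/upstreams/${UPSTREAM}/servers"
echo "$BASE_URL"
```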
When you issue the curl command with no additional parameters, a list of the servers and their parameters is returned, as in this example for the use cases we’ll cover in the other two articles. Here we pipe the output to the jq utility to put each field on its own line for easier reading:
$ curl -s BASE-URL | jq
[
  {
    "id": 0,
    "server": "172.16.210.81:80",
    "weight": 1,
    "max_conns": 0,
    "max_fails": 0,
    "fail_timeout": "10s",
    "slow_start": "10s",
    "route": "",
    "backup": false,
    "down": false
  },
  {
    "id": 1,
    "server": "172.16.210.82:80",
    "weight": 1,
    "max_conns": 0,
    "max_fails": 0,
    "fail_timeout": "10s",
    "slow_start": "10s",
    "route": "",
    "backup": false,
    "down": false
  }
]
We can filter the output further to show just the hostname or IP address, and internal ID, of each server. We need the ID to identify a server when we remove it or change its state as in the instructions below.
$ curl -s BASE-URL | jq -c '.[] | {server, id}'
{"server":"172.16.210.81:80","id":0}
{"server":"172.16.210.82:80","id":1}
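Because the per-server commands need the internal ID, a script can look it up by address instead of hard-coding it. The following is a sketch under the assumption that the response is the JSON array shown in the listing above. The function names `server_id` and `server_url` are hypothetical helpers, not part of NGINX Plus.

```shell
# Print the internal ID of the server whose "server" field equals $1.
# Expects the JSON array returned by the base URL on stdin.
server_id() {
    jq -r --arg addr "$1" '.[] | select(.server == $addr) | .id'
}

# Print the per-server URL (base URL followed by /server-ID).
# $1 is the base URL, $2 is the server address as it appears in the listing.
# Returns non-zero and prints nothing if the address is not in the group.
server_url() {
    sid=$(curl -s "$1" | server_id "$2")
    [ -n "$sid" ] && printf '%s/%s\n' "$1" "$sid"
}
```

For example, `server_url "$BASE_URL" 172.16.210.82:80` would print the URL to use with the DELETE and PATCH methods for that server, given a `BASE_URL` variable that holds the base URL.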
To make changes to the servers in the upstream group, use the indicated methods (confirmation or other messages that might be returned are omitted):
- Add a server.
  $ curl -X POST -d '{"server":"address-or-hostname[:port]"}' BASE-URL
  By default, the server is marked as up and NGINX starts sending traffic to it immediately. To mark it as down so that it does not receive traffic until you are ready to mark it as up, set the down parameter to true as you add the server:
  $ curl -X POST -d '{"server":"address-or-hostname[:port]","down":true}' BASE-URL
- Remove a server – NGINX Plus terminates all connections immediately and sends no more requests to it.
  $ curl -X DELETE BASE-URL/server-ID
- Mark a server as down – NGINX Plus stops opening new connections to the server, but any existing connections are allowed to complete. Using the NGINX Plus live activity monitoring dashboard or API, you can see when the server no longer has any open connections and can be safely taken offline.
  $ curl -X PATCH -d '{"down":true}' BASE-URL/server-ID
- Mark a running server as draining – NGINX Plus stops sending traffic from new clients to the server, but allows clients who have a persistent session with the server to continue opening connections and sending requests to it. Once you feel that you have allowed enough time for sessions to complete, you can mark the server as down and take it offline. For a discussion of ways to automate the check for completed sessions, see Using the API with Session Persistence for an Individual Server Upgrade.
  $ curl -X PATCH -d '{"drain":true}' BASE-URL/server-ID
- Mark a server as up – NGINX Plus immediately starts sending traffic to it.
  $ curl -X PATCH -d '{"down":false}' BASE-URL/server-ID
Change server configuration – You can set any of the parameters on the server directive with the POST method when adding a server or the PATCH method on existing servers. We’ll use this feature to set server weights in several of the use cases in the follow‑on posts.
Upgrading with Application Health Checks
Configuring application health checks is an easy way to improve the user experience at your site. By having NGINX Plus continually check whether backend servers are up and remove unavailable servers from the load‑balancing rotation, you reduce the number of errors seen by clients. You can also use health checks to bring servers up and down, instead of (or in addition to) the API.
There are a few significant differences between health checks and the API:
- You take servers up and down by taking action on the backend servers rather than by interacting with NGINX Plus. Most commonly, you define the health check to succeed if a particular file (healthcheck.html, for example) exists on the server, and to fail if it doesn’t. To take the server down, you make the health check fail by removing or renaming the file; to bring it up, you make the health check succeed by restoring the file or changing the name back to healthcheck.html.
- With health checks, changes are not immediate as with the API, but instead depend on the health‑check frequency. By default, health checks run every five seconds and only one failure is required for a server to be considered unhealthy. So with the default setting, it can take up to five seconds for NGINX Plus to change the state of the server.
- An advantage of health checks over the API is that you can specify a timeframe after a server returns to health during which NGINX Plus gradually ramps up the load on the server (the slow‑start feature). This is helpful if your servers need to “warm up” before they are ready to receive their fair share of load.
- You can’t use health checks when using session persistence. When NGINX Plus marks a server as down because it fails a health check, the server no longer receives new connections, even from clients that are pegged to it by a session persistence mechanism. (In other words, with health checks you can set server state to the equivalent of the API’s "down":false and "down":true, but not to "drain":true.)
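As a sketch of the file‑based approach, where a server leaves the rotation once the file requested by the health check is renamed, the functions below could run on the backend server being upgraded. The document root path, the `.disabled` suffix, and the function names are assumptions chosen for illustration. Only the healthcheck.html filename comes from this article.

```shell
# Hypothetical location of the semaphore file on the backend server.
DOCROOT="${DOCROOT:-/usr/share/nginx/html}"
SEMAPHORE="healthcheck.html"

# Make the health check fail by renaming the file out of the way.
take_out_of_rotation() {
    [ -f "$DOCROOT/$SEMAPHORE" ] &&
        mv "$DOCROOT/$SEMAPHORE" "$DOCROOT/$SEMAPHORE.disabled"
}

# Make the health check pass again by restoring the original name.
return_to_rotation() {
    [ -f "$DOCROOT/$SEMAPHORE.disabled" ] &&
        mv "$DOCROOT/$SEMAPHORE.disabled" "$DOCROOT/$SEMAPHORE"
}
```

With the default health‑check frequency of five seconds, a script would typically wait slightly longer than that after calling `take_out_of_rotation` before starting maintenance, so that at least one check has had time to fail.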
Conclusion
NGINX Plus provides operations and DevOps engineers with several options for managing the upgrade process for both individual servers and groups of servers, all while continuing to provide a good customer experience by avoiding downtime. For comprehensive instructions on using the upgrade methods for specific use cases, see the other two articles in this series:
- Upgrading hardware or software on an individual server machine
- Upgrading to a new version of an application by switching traffic to completely different servers or upstream groups
Try NGINX Plus out for yourself and see how it makes upgrades easier and more efficient – start a free 30-day trial today or contact us to discuss your use case.