
Upgrading backend servers in a production environment can be a challenge for your operations or DevOps team, whether they are dealing with an individual server or upgrading an application by moving to a new set of servers. Putting upstream servers behind NGINX Plus can make the upgrade process much more manageable while also eliminating or greatly lessening downtime.

[Editor – This post has been updated to use the NGINX Plus API for dynamic configuration of upstream groups, replacing the Upstream Conf module which was originally used.]

In this three‑part series of articles, we focus on NGINX Plus – with a number of features above and beyond those in NGINX Open Source, it provides a more comprehensive and controllable solution for upgrades with zero downtime. This first article describes in detail the two NGINX Plus features you can use for backend upgrades – the NGINX Plus API and health checks – and compares them to upgrading with NGINX Open Source.

The related articles explain how to use these methods for two classes of upgrades: upgrading an individual server and upgrading an application by moving to a new set of servers.

Choosing an Upgrade Method in NGINX Plus

NGINX Plus provides two methods for dynamically upgrading production servers and application versions:

  • NGINX Plus API – Use the NGINX Plus API to send HTTP requests to NGINX Plus that add, remove, or modify the servers in an upstream group.
  • Application‑aware health checks – Define health checks so that you can purposely fail servers you want to take out of the load balancing rotation, and make them pass the health check when they are again ready to receive traffic.

The two methods differ with respect to several factors, so the choice between them depends on your priorities:

  • Speed of change – With the API, the change takes effect immediately. With health checks, the change doesn’t take effect until a health check fails (the default frequency of health checks is 5 seconds).
  • Initial traffic volume – With health checks, you can configure slow start: when a server returns to service, NGINX Plus slowly ramps up the load to the server over a defined period, allowing applications to “warm up” (populate caches, run just‑in‑time compilations, establish database connections, and so on). The server is not overwhelmed by connections, which might time out and cause it to be marked as failed again. With the API, NGINX Plus by default immediately sends a server its full share of traffic.
  • Automation and scripting – With the API, you can automate and script most phases of the upgrade, and do so within the NGINX Plus configuration. To automate upgrades when using health checks, you must also create scripts that run on the servers being upgraded (for example, to manipulate the file used for semaphore health checks).

In general, we recommend the NGINX Plus API for most use cases because changes take effect immediately and the API is fully scriptable and automatable.

Upgrading with NGINX Open Source

First, let’s review how upgrades work with NGINX Open Source, and explore some possible issues. Here you change upstream server groups by editing the upstream configuration block and reloading the configuration file. The reload is seamless: NGINX starts a new set of worker processes that use the new configuration, while the existing worker processes continue to run and handle the connections that were open when the reload occurred. Each old worker process terminates as soon as all of its connections have completed. This design guarantees that no connections or requests are lost during a reload, and makes the reload method suitable even when upgrading NGINX itself from one version to another.
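For reference, here is a minimal sketch of that workflow; the configuration file path is illustrative and depends on how your installation organizes its files:

# Edit the upstream block in the relevant configuration file
$ sudo vi /etc/nginx/conf.d/backend.conf

# Verify the new configuration, then reload NGINX without dropping connections
$ sudo nginx -t
$ sudo nginx -s reload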

Depending on the nature of the outstanding connections, the time it takes to complete them all can range from just seconds to several minutes. If the configuration doesn’t change often, running two sets of workers for a short time usually has no bad effects. However, if changes (and consequently reloads) are very frequent, old workers might not finish processing requests and terminate before the next reload takes place, leaving multiple sets of workers running at once. With enough workers, you might eventually end up exhausting memory and hitting 100% CPU, particularly if you’re already optimizing use of resources by running your servers at close to capacity.

When you’re load balancing application servers, upstream groups are the part of the configuration that changes most frequently, whether it’s to scale capacity up and down, upgrade to a new version, or take servers offline for maintenance. Customers running hundreds of virtual servers that load balance traffic across thousands of backend servers might need to modify upstream groups very frequently. By using the NGINX Plus API or health checks, you avoid the need for frequent configuration reloads.

Overview of NGINX Plus Upgrade Methods

The use cases discussed in the two related articles use one of the following methods, sometimes in combination with auxiliary actions.

Upgrading with the NGINX Plus API

To use the NGINX Plus API to manage the servers in an upstream group, you send HTTP requests to the following base URL. We’re using the conventional location name for the API, /api, but you can configure a different name (see the section about the base configuration in the second or third article).

http://NGINX-server[:port]/api/api-version/http/upstreams/upstream-group-name/servers

In the commands below, this URL is represented as BASE-URL.
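As a purely illustrative example, suppose NGINX Plus accepts API requests on port 8080 of the local host, the upstream group is named backend, and you are using API version 6. You might store the resulting URL in a shell variable and substitute it wherever BASE-URL appears in the commands that follow:

$ BASE_URL=http://localhost:8080/api/6/http/upstreams/backend/servers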

When you issue a curl command against the base URL with no additional parameters (an HTTP GET request), a list of the servers and their parameters is returned, as in this example for the use cases we cover in the other two articles. Here we pipe the output to the jq utility to put each field on its own line for easier reading:

$ curl -s BASE-URL | jq '.'
[
  {
    "id": 0,
    "server": "172.16.210.81:80",
    "weight": 1,
    "max_conns": 0,
    "max_fails": 0,
    "fail_timeout": "10s",
    "slow_start": "10s",
    "route": "",
    "backup": false,
    "down": false
  },
  {
    "id": 1,
    "server": "172.16.210.82:80",
    "weight": 1,
    "max_conns": 0,
    "max_fails": 0,
    "fail_timeout": "10s",
    "slow_start": "10s",
    "route": "",
    "backup": false,
    "down": false
  }
]

We can filter the output further to show just the hostname or IP address and the internal ID of each server. We need the ID to identify a server when we remove it or change its state, as in the instructions below.

$ curl -s BASE-URL | jq -c '.[] | {server, id}'
{"server":"172.16.210.81:80","id":0}
{"server":"172.16.210.82:80","id":1}

To make changes to the servers in the upstream group, use the indicated methods (confirmation or other messages that might be returned are omitted):

  • Add a server.

    $ curl -X POST -d '{"server":"address-or-hostname[:port]"}' BASE-URL

    By default, the server is marked as up and NGINX starts sending traffic to it immediately. To mark it as down so that it does not receive traffic until you are ready to mark it as up, set the down parameter to true as you add the server:

    $ curl -X POST -d '{"server":"address-or-hostname[:port]","down":true}' BASE-URL
  • Remove a server – NGINX Plus terminates all connections immediately and sends no more requests to it.

    $ curl -X DELETE BASE-URL/server-ID
  • Mark a server as down – NGINX Plus stops opening new connections to the server, but any existing connections are allowed to complete. Using the NGINX Plus live activity monitoring dashboard or API, you can see when the server no longer has any open connections and can be safely taken offline.

    $ curl -X PATCH -d '{"down":true}' BASE-URL/server-ID
  • Mark a running server as draining – NGINX Plus stops sending traffic from new clients to the server, but allows clients who have a persistent session with the server to continue opening connections and sending requests to it. Once you feel that you have allowed enough time for sessions to complete, you can mark the server as down and take it offline. For a discussion of ways to automate the check for completed sessions, see Using the API with Session Persistence for an Individual Server Upgrade.

    $ curl -X PATCH -d '{"drain":true}' BASE-URL/server-ID
  • Mark a server as up – NGINX Plus immediately starts sending traffic to it.

    $ curl -X PATCH -d '{"down":false}' BASE-URL/server-ID
  • Change server configuration – You can set any of the parameters on the server directive with the POST method when adding a server or the PATCH method on existing servers. We’ll use this feature to set server weights in several of the use cases in the follow‑on posts.
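    For example, the following command (using an illustrative weight of 3) changes the weight of an existing server:

    $ curl -X PATCH -d '{"weight":3}' BASE-URL/server-ID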

Upgrading with Application Health Checks

Configuring application health checks is an easy way to improve the user experience at your site. By having NGINX Plus continually check whether backend servers are up and remove unavailable servers from the load‑balancing rotation, you reduce the number of errors seen by clients. You can also use health checks to bring servers up and down, instead of (or in addition to) the API.

There are a few significant differences between health checks and the API:

  • You take servers up and down by taking action on the backend servers rather than by interacting with NGINX Plus. Most commonly, you define the health check to succeed if a particular file (healthcheck.html, for example) exists on the server, and to fail if it doesn’t. To take the server down, you make the health check fail by removing or renaming the file; to bring it up, you make the health check succeed by restoring the file or changing the name back to healthcheck.html. (See the sketch following this list.)
  • With health checks, changes are not immediate, as they are with the API, but instead depend on the health‑check frequency. By default, health checks run every five seconds and only one failure is required for a server to be considered unhealthy, so with the default settings it can take up to five seconds for NGINX Plus to change the state of the server.
  • An advantage of health checks over the API is that you can specify a timeframe after a server returns to health during which NGINX Plus gradually ramps up the load on the server (the slow‑start feature). This is helpful if your servers need to “warm up” before they are ready to receive their fair share of load.
  • You can’t use health checks to drain sessions when session persistence is configured. When NGINX Plus marks a server as down because it fails a health check, the server no longer receives new connections, even from clients that are pegged to it by a session persistence mechanism. (In other words, with health checks you can set server state to the equivalent of the API’s "down":false and "down":true, but not to "drain":true.)
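As a minimal sketch of the semaphore approach described in the first bullet above (the file path is illustrative and must match the URI requested by your health check), the commands you run on a backend server might look like this:

# Take the server out of rotation by making the health check fail
$ mv /var/www/html/healthcheck.html /var/www/html/healthcheck.html.off

# ... upgrade the application on this server ...

# Return the server to rotation by making the health check succeed again
$ mv /var/www/html/healthcheck.html.off /var/www/html/healthcheck.html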

Conclusion

NGINX Plus provides operations and DevOps engineers with several options for managing the upgrade process for both individual servers and groups of servers, all while continuing to provide a good customer experience by avoiding downtime. For comprehensive instructions on using the upgrade methods for specific use cases, see the other two articles in this series.

Try NGINX Plus out for yourself and see how it makes upgrades easier and more efficient – start a free 30-day trial today or contact us to discuss your use case.


About the Author

Rick Nelson

Area Vice President of Solutions Engineering

Rick Nelson is the Manager of Pre‑Sales, with over 30 years of experience in technical and leadership roles at a variety of technology companies, including Riverbed Technology. From virtualization to load balancing to accelerating application delivery, Rick brings deep technical expertise and a proven approach to maximizing customer success.
