Processing Backlogs and Rate Control

Ideally, all observations would be used to update the Ichnaea database. This would require backend resources (the worker servers, the cache, and the database) that can easily handle peak traffic. Such a system would be overprovisioned most of the time, as traffic ebbs and flows over days and weeks, and it may be prohibitively expensive or impractical, requiring more resources to be added quickly if traffic is consistently growing. The relationship between API traffic and observations is complex, so it is not possible to predict the backend resources required for future traffic.

Monthly traffic to service. There is a daily cycle as well as a weekly cycle, and the trend is up.

When the backend is unable to process observations, a backlog builds up. This is acceptable for traffic spikes or daily peaks, and the backend works through the backlog when the traffic drops below the peaks.

When traffic consistently exceeds the capacity of the backend, a steadily increasing backlog is generated, causing problems. A large backlog tends to slow down the cache and database, which can increase the rate of backlog growth. The backlog is stored in Redis, which eventually runs out of memory. Redis slows down dramatically as it determines what data it can throw away, resulting in slow API requests and timeouts. Eventually, Redis will discard entire backlogs, “fixing” the problem but losing data. An administrator can monitor the backlog by watching Redis memory usage and the queue metrics.

The chart below shows what a steadily increasing backlog looks like. The backlog itself would take less than an hour to clear at full processing speed, but new observations continue to be added from API usage. Around 3 PM, API usage generates fewer observations, allowing the backend to make progress on reducing the backlog. Around 6 PM, after an hour of lower traffic, API usage exceeds the backend's capacity again, and the backlog begins increasing.

A backlog builds up from 6 AM to 3 PM, takes nearly 3 hours to clear, and starts building again.

If traffic steadily increases over weeks and months, eventually the backend will be unable to recover within a full 24-hour cycle, leading to slower service and eventually data loss.

Ichnaea has rate controls that can be used to sample incoming data, reducing the number of observations that the backend needs to process.

Rate Control by API Key

There are two rate controls that are applied by API key, in the api_key database table:

  • store_sample_locate (0 - 100) - The percent of locate API calls that are turned into observations.

  • store_sample_submit (0 - 100) - The percent of submission API calls that are turned into observations.

An administrator can use these to limit the observations from large API users, or to ignore traffic from questionable API users. The default is 100% (all locate and submission observations) for new API keys.

Global Rate Control

The Redis key global_locate_sample_rate is a number (stored as a string) between 0 and 100.0 that controls a sample rate for all locate calls. This is applied as a multiple on any API key controls, so if an API key has store_sample_locate set to 60, and global_locate_sample_rate is 50.0, the effective sample rate for that API key is 30%.
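The combination of the per-key and global rates can be sketched as follows. This is an illustrative sketch, not Ichnaea's actual code; effective_locate_rate and should_store are hypothetical helper names:

```python
import random

def effective_locate_rate(store_sample_locate, global_locate_sample_rate):
    """Combine the per-key and global sample rates (both 0-100)."""
    # The global rate acts as a multiplier on the per-key rate, so a
    # per-key rate of 60 with a global rate of 50.0 yields 30%.
    return store_sample_locate * global_locate_sample_rate / 100.0

def should_store(store_sample_locate, global_locate_sample_rate):
    """Randomly decide whether a locate call is turned into observations."""
    rate = effective_locate_rate(store_sample_locate, global_locate_sample_rate)
    return random.uniform(0, 100) < rate
```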

An administrator can use this control to globally limit observations from geolocate calls. A temporary rate of 0% is an effective way to allow the backend to process a large backlog of observations. If unset, the default global rate is 100%.

There is no global rate control for submissions.

Automated Rate Control (Beta)

Optionally, an automated rate controller can set the global locate sample rate. The rate controller is given a target for the maximum data queue backlog, and periodically compares the actual backlog to this target. It lowers the sample rate while the backlog is near or above the target, and raises it back to 100% when the backlog is below the target.

To enable the rate controller:

  1. Set the Redis key rate_controller_target to the desired maximum queue size, such as 1000000 for 1 million observations. A suggested value is 5-10 minutes' worth of observations at maximum processing speed, as measured by summing the data.observation.insert metric during peak periods with a backlog.

  2. Set the Redis key rate_controller_enabled to 1 to enable or 0 to disable the rate controller. If the rate controller is enabled without a target, it will be automatically disabled.
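The two Redis keys above can be set with redis-cli. The host and port are deployment-specific assumptions; adjust -h and -p for your environment:

```shell
# Set the target queue size (1 million observations) and enable the controller.
redis-cli SET rate_controller_target 1000000
redis-cli SET rate_controller_enabled 1

# To disable the rate controller again:
redis-cli SET rate_controller_enabled 0
```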

The rate controller runs once a minute, at the same time that queue metrics are emitted. The rate is adjusted during peak traffic to keep the backlog near the target, and the backlog is processed more quickly when the peak ends.

The backlog reaches a maximum around the target, and clears more quickly when the input reduces

In our simulation, the controller picked a sample rate between 90% and 100% during peak traffic, which was sufficient to keep the queue sizes slightly above the target. This means that most observations will be processed, even during busy periods. It quickly responded to traffic spikes during peak periods by dropping the sample rate to 60%.

The sample rate drops to around 90% during peaks, and to 50% during an input spike.

The rate controller is a general proportional-integral-derivative controller (PID controller), provided by simple-pid. By default, only the proportional gain Kp is enabled, making it a P controller. The input is the queue size in observations, and the output is divided by the target, so the output is between 0.0 and 1.0 when the data queues exceed the target, and greater than 1.0 when below the target. This is limited to the range 0.0 to 1.0, and then multiplied by 100 to derive the new sample rate.
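The proportional-only case can be sketched as below. This is an illustrative reconstruction of the behavior described above, not Ichnaea's actual implementation, and proportional_sample_rate is a hypothetical name:

```python
def proportional_sample_rate(queue_size, target, kp=8.0):
    """Sketch of a P controller mapping queue size to a sample rate.

    The normalized output is greater than 1.0 while the queue is below
    the target, and falls below 1.0 as the queue exceeds the target.
    """
    error = target - queue_size              # positive when under the target
    normalized = 1.0 + kp * error / target   # controller output / target
    clamped = max(0.0, min(1.0, normalized)) # limit to the range 0.0 - 1.0
    return clamped * 100.0                   # new global sample rate, 0-100
```

Under this sketch, a queue 2.5% over the target yields a rate of 80% with kp=8.0, but only 95% with kp=2.0, matching the Kp example below.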

The gain parameters are stored in Redis keys, and can be adjusted:

  • Kp (Redis key rate_controller_kp, default 8.0) - The proportional gain. Values between 1.0 and 10.0 work well in simulation. This controls how aggressively the controller drops the rate when the target is exceeded. For example, for the same queue size, Kp=2.0 may lower the rate to 95% while Kp=8.0 may lower it to 80%.

  • Ki (Redis key rate_controller_ki, default 0.0) - The integral gain. The integral collects the accumulated “error” from the target. It tends to cause the queue size to overshoot the target, after which the controller sets the rate to 0% to recover. The recommended value is 0.0; start at low values like 0.0001 only if there is a steady backlog due to an underprovisioned backend.

  • Kd (Redis key rate_controller_kd, default 0.0) - The derivative gain. The derivative measures the change since the last reading. In simulation, this had little noticeable effect, and may require a value of 50.0 or higher to see any changes.

The rate controller emits several metrics. An administrator can use these metrics to monitor the rate controller, and to determine if backend resources should be increased or decreased based on long-term traffic trends.