Metrics¶
As discussed in the deployment document, Ichnaea emits metrics through the Statsd client library with the intent of aggregating and viewing them on a compatible dashboard.
This document describes the metrics collected.
The code emits most metrics using the statsd tags extension. A metric name of task#name:function,version:old therefore means that a statsd metric called task will be emitted with two tags, name:function and version:old. If the statsd backend does not support tags, the statsd client can be configured with a tag_support = false option. In this case the above metric would be emitted as task.name_function.version_old.
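The relationship between the tagged notation and the untagged fallback name can be sketched in a few lines of Python. This is only an illustration of the naming rule described above, not the statsd client's actual code; the function names split_tagged_name and untagged_name are hypothetical.

```python
def split_tagged_name(metric):
    """Split 'name#tag,tag' notation into the metric name and a list of tags."""
    name, _, tag_part = metric.partition("#")
    tags = [tag for tag in tag_part.split(",") if tag]
    return name, tags


def untagged_name(metric):
    """Fold the tags into the metric name, for backends without tag support."""
    name, tags = split_tagged_name(metric)
    # Each 'key:value' tag becomes a 'key_value' segment of the dotted name.
    return ".".join([name] + [tag.replace(":", "_") for tag in tags])


print(untagged_name("task#name:function,version:old"))
```

A metric without any tags passes through unchanged, since there is nothing to fold into the name.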
API Request Metrics¶
These metrics track how many times each public API is used and which clients, identified by their API keys, use it. They are grouped by API type, where the type is one of locate, region and submit, independent of the specific version of that API.
These metrics can help in deciding when to remove a deprecated API.
locate.request#path:v1.search,key:<apikey>
locate.request#path:v1.geolocate,key:<apikey>
region.request#path:v1.country,key:<apikey>
submit.request#path:v1.submit,key:<apikey>
submit.request#path:v1.geosubmit,key:<apikey>
submit.request#path:v2.geosubmit,key:<apikey>
: counters
These metrics count how many times a specific API was called by a specific API key, expressed via the API key's short name. The key tag holds the actual API key, often a UUID.
Two special short names exist for tracking invalid (invalid) and missing (none) API keys.
API User Metrics¶
For all API requests, including the submit-type APIs, we gather metrics about the number of unique users based on the users' IP addresses.
These metrics are gathered under the metric prefix:
<api_type>.user#key:<apikey>
: gauge
They have an additional tag that determines the time interval over which the unique users are aggregated:
#interval:1d
#interval:7d
: tags
Unique users per day or over the last 7 days.
Technically these metrics are based on HyperLogLog cardinality numbers maintained in a Redis service. They should be accurate to about 1% of the actual number.
API Query Metrics¶
For each incoming API query we log metrics about the data contained in the query with the metric name and tags:
<api_type>.query#key:<apikey>,region:<region_code>
: counter
api_type describes the type of API being used, independent of the version number of the API. So v1/country gets logged as region and both v1/search and v1/geolocate get logged as locate.
region_code is either a two-letter GENC region code like de or the special value none if the region of origin of the incoming request could not be determined.
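The mapping from versioned paths to API types can be sketched as a simple lookup table. The table is assembled from the paths listed in this document; the names API_TYPES and api_type_for_path are hypothetical and are not taken from the Ichnaea code base.

```python
# Versioned API paths (in the dotted form used by the path tag), mapped to
# the API type they are logged under.
API_TYPES = {
    "v1.search": "locate",
    "v1.geolocate": "locate",
    "v1.country": "region",
    "v1.submit": "submit",
    "v1.geosubmit": "submit",
    "v2.geosubmit": "submit",
}


def api_type_for_path(path):
    """Return the API type for a URL path such as 'v1/country'."""
    dotted = path.strip("/").replace("/", ".")
    # Unknown paths have no API type in this sketch.
    return API_TYPES.get(dotted)


print(api_type_for_path("v1/country"))
```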
We extend the metric with additional tags based on the data contained in it:
#geoip:false
: tag
This tag only gets added if there was no valid client IP address for this query. Since almost all queries contain a client IP address we usually skip this tag.
#blue:none, #blue:one, #blue:many
#cell:none, #cell:one, #cell:many
#wifi:none, #wifi:one, #wifi:many
: tags
If the query contained any Bluetooth, cell or WiFi networks, one blue, one cell and one wifi tag are added. The value of each tag depends on the number of valid stations of that type.
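As a rough sketch, the tag values can be derived from the per-type station counts as shown here. The thresholds (zero, exactly one, more than one) are an assumption based on the tag names, and station_count_tag and query_tags are hypothetical helper names.

```python
def station_count_tag(network_type, valid_count):
    """Return a tag such as 'wifi:many' for the number of valid stations."""
    if valid_count <= 0:
        value = "none"
    elif valid_count == 1:
        value = "one"
    else:
        value = "many"
    return f"{network_type}:{value}"


def query_tags(blue=0, cell=0, wifi=0):
    """Build the blue, cell and wifi tags for one query."""
    counts = (("blue", blue), ("cell", cell), ("wifi", wifi))
    return [station_count_tag(kind, count) for kind, count in counts]


print(query_tags(blue=0, cell=1, wifi=3))
```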
API Result Metrics¶
Similar to the API query metrics we also collect metrics about each result of an API query. This follows the same per API type and per region rules under the prefix / tag combination:
<api_type>.result#key:<apikey>,region:<region_code>
The result metrics measure if we satisfied the incoming API query in the best possible fashion. Incoming queries can generally contain an IP address, Bluetooth, cell, WiFi networks or any combination thereof. If the query contained only cell networks, we do not expect to get a high accuracy result, as there is too little data in the query to do so.
We express this by classifying each incoming query into one of four categories:
- High Accuracy (#accuracy:high) - A query containing at least two Bluetooth or WiFi networks.
- Medium Accuracy (#accuracy:medium) - A query containing neither Bluetooth nor WiFi networks but at least one cell network.
- Low Accuracy (#accuracy:low) - A query containing no networks but only the IP address of the client.
- No Accuracy (#accuracy:none) - A query containing no usable information, for example an IP-only query that explicitly disables the IP fallback.
A query containing multiple data types gets put into the best possible category, so for example any query containing cell data will at least be of medium accuracy.
Once we have determined the expected accuracy category for the query, we compare it to the accuracy category of the result. If we can deliver an equal or better category, we consider the result to be a hit. If we don’t satisfy the expected category, we consider the result to be a miss.
For each result we then log exactly one of the following tag combinations:
#accuracy:high,status:hit
#accuracy:high,status:miss
#accuracy:medium,status:hit
#accuracy:medium,status:miss
#accuracy:low,status:hit
#accuracy:low,status:miss
: tags
We don’t log metrics for the uncommon case of none or no expected accuracy.
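The classification and the hit or miss decision can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: "at least two Bluetooth or WiFi networks" is read as at least two networks of one of those types, the ip_usable flag stands in for whatever marks the IP fallback as allowed, and all names are hypothetical.

```python
from enum import IntEnum


class Accuracy(IntEnum):
    """Accuracy categories, ordered so that a higher value is better."""
    none = 0
    low = 1
    medium = 2
    high = 3


def expected_accuracy(blue=0, cell=0, wifi=0, ip_usable=True):
    """Classify a query into the best category its data could support."""
    if blue >= 2 or wifi >= 2:
        return Accuracy.high
    if cell >= 1:
        return Accuracy.medium
    if ip_usable:
        return Accuracy.low
    return Accuracy.none


def result_tags(expected, actual):
    """Return the accuracy and status tags, or None if nothing is logged."""
    if expected is Accuracy.none:
        return None
    # An equal or better result category counts as a hit.
    status = "hit" if actual >= expected else "miss"
    return f"accuracy:{expected.name},status:{status}"


print(result_tags(expected_accuracy(wifi=2), Accuracy.medium))
```

Using an ordered enumeration keeps the "equal or better" comparison to a single expression.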
One special case exists for cell networks. If we cannot find an exact cell match, we might fall back to a cell area based estimate. If the range of the cell area is fairly small, we consider this to be a #accuracy:medium,status:hit. But if the size of the cell area is extremely large, on the order of tens to hundreds of kilometers, we consider it to be a #accuracy:medium,status:miss.
In the past we only collected stats based on whether or not cell-based data was used to answer a cell-based query, and counted that as a cell-based success even if the provided accuracy was really bad.
In addition to the accuracy of the result, we also tag the result metric with the data source that got used to provide the result, but only for results that met the expected accuracy.
#source:<source_name>
: tag
Data sources can be one of:
internal - Data from our own crowd-sourcing effort.
fallback - Data from the optional external fallback provider.
geoip - Data from a GeoIP database.
And finally we add a tag to state whether or not the query was allowed to use the fallback source.
#fallback_allowed:<value>
: tag
The value is either true or false.
API Source Metrics¶
In addition to the final API result, we also collect metrics about each individual data source we use to answer queries under the <api_type>.source#key:<apikey>,region:<region_code> metric.
Each request may use one or multiple of these sources to deliver a result. We log the same metrics as mentioned above for the result.
All of this combined might lead to a tagged metric like:
locate.source#key:test,region:de,source:geoip,accuracy:low,status:hit
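For illustration, a tagged metric name in this document's notation can be assembled from a name and an ordered list of tags. The helper tagged_metric is hypothetical and only reproduces the notation; it does not emit anything to statsd.

```python
def tagged_metric(name, tags):
    """Join a metric name and (key, value) pairs into 'name#k:v,k:v' form."""
    if not tags:
        return name
    return name + "#" + ",".join(f"{key}:{value}" for key, value in tags)


print(tagged_metric("locate.source", [
    ("key", "test"),
    ("region", "de"),
    ("source", "geoip"),
    ("accuracy", "low"),
    ("status", "hit"),
]))
```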
API Fallback Source Metrics¶
The external fallback source has a couple of extra metrics to observe the performance of outbound network calls and the effectiveness of its cache.
locate.fallback.cache#status:hit
locate.fallback.cache#status:miss
locate.fallback.cache#status:bypassed
locate.fallback.cache#status:inconsistent
locate.fallback.cache#status:failure
: counter
Counts the number of hits and misses for the fallback cache. If the query should not be cached, a bypassed status is used. If the cached values couldn’t be read, a failure status is used. If the cached values didn’t agree on a consistent position, an inconsistent status is used.
locate.fallback.lookup#fallback_name:<fallback_name>
: timer
Measures the time it takes to do each outbound network request. The fallback name tag specifies which fallback service is used.
locate.fallback.lookup#fallback_name:<fallback_name>,status:<code>
: counter
Counts the HTTP response codes for all outbound requests per named fallback service. There is one counter per HTTP response code, for example 200.
Data Pipeline Metrics¶
When a batch of reports is accepted at one of the submission API endpoints, it is decomposed into a number of “items” – wifi or cell observations – each of which then works its way through a process of normalization, consistency-checking and eventually (possibly) integration into aggregate station estimates held in the main database tables. Along the way several counters measure the steps involved:
data.batch.upload
data.batch.upload#key:<apikey>
: counters
Counts the number of “batches” of reports accepted to the data processing pipeline by an API endpoint. A batch generally corresponds to the set of reports uploaded in a single HTTP POST to one of the submit APIs. In other words this metric counts “submissions that make it past coarse-grained checks” such as API-key and JSON schema validity checking.
The metric is either emitted per tracked API key, or for everything else without a key tag.
data.report.upload
data.report.upload#key:<apikey>
: counters
Counts the number of reports accepted into the data processing pipeline. The metric is either emitted per tracked API key, or for everything else without a key tag.
data.report.drop
data.report.drop#key:<apikey>
: counter
Count incoming reports that were discarded due to some internal consistency, range or validity-condition error.
data.observation.upload#type:blue
data.observation.upload#type:blue,key:<apikey>
data.observation.upload#type:cell
data.observation.upload#type:cell,key:<apikey>
data.observation.upload#type:wifi
data.observation.upload#type:wifi,key:<apikey>
: counters
Count the number of Bluetooth, cell or WiFi observations entering the data processing pipeline, before normalization and blocklist processing have been applied. In other words this metric counts “total Bluetooth, cell or WiFi observations inside each submitted batch”, as each batch is composed of individual observations.
The metrics are either emitted per tracked API key, or for everything else without a key tag.
data.observation.drop#type:blue
data.observation.drop#type:blue,key:<apikey>
data.observation.drop#type:cell
data.observation.drop#type:cell,key:<apikey>
data.observation.drop#type:wifi
data.observation.drop#type:wifi,key:<apikey>
: counters
Count incoming Bluetooth, cell or WiFi observations that were discarded before integration due to some internal consistency, range or validity-condition error encountered while attempting to normalize the observation.
data.observation.insert#type:blue
data.observation.insert#type:cell
data.observation.insert#type:wifi
: counters
Count Bluetooth, cell or WiFi observations that are successfully normalized, integrated and not discarded due to consistency errors.
data.station.blocklist#type:blue
data.station.blocklist#type:cell
data.station.blocklist#type:wifi
: counters
Count any Bluetooth, cell or WiFi network that is blocklisted due to the acceptance of multiple observations at sufficiently different locations. In these cases, we decide that the station is “moving” (such as a picocell or mobile hotspot on a public transit vehicle) and blocklist it, to avoid estimating query positions using the station.
data.station.confirm#type:blue
data.station.confirm#type:cell
data.station.confirm#type:wifi
: counters
Count the number of Bluetooth, cell or WiFi stations that were successfully confirmed by any type of observation.
data.station.new#type:blue
data.station.new#type:cell
data.station.new#type:wifi
: counters
Count the number of Bluetooth, cell or WiFi stations that were discovered for the first time.
Data Pipeline Export Metrics¶
Incoming reports can also be sent to a number of different export targets. We keep metrics about how those individual export targets perform.
data.export.batch#key:<export_key>
: counter
Count the number of batches sent to the export target.
data.export.upload#key:<export_key>
: timer
Track how long the upload operation took per export target.
data.export.upload#key:<export_key>,status:<status>
: counter
Track the upload status of the current job. One counter per status. A status can either be a simple success or failure, or an HTTP response code like 200, 400, etc.
Internal Monitoring¶
api.limit#key:<apikey>,path:<path>
: gauge
One gauge is created per API key and API path which has rate limiting enabled on it. This gauge measures how many requests have been made for each such API key and path combination during the current day.
queue#queue:celery_blue
queue#queue:celery_cell
queue#queue:celery_default
queue#queue:celery_export
queue#queue:celery_incoming
queue#queue:celery_monitor
queue#queue:celery_reports
queue#queue:celery_wifi
: gauges
These gauges measure the number of tasks in each of the Redis queues. They are sampled at an approximate per-minute interval.
queue#queue:update_blue_0
queue#queue:update_blue_f
queue#queue:update_cell_gsm
queue#queue:update_cell_wcdma
queue#queue:update_cell_lte
queue#queue:update_cellarea
queue#queue:update_datamap_ne
queue#queue:update_datamap_nw
queue#queue:update_datamap_se
queue#queue:update_datamap_sw
queue#queue:update_wifi_0
queue#queue:update_wifi_f
: gauges
These gauges measure the number of items in the Redis update queues.
HTTP Counters¶
Every legitimate, routed request to an API endpoint or to a content view increments a request#path:<path>,method:<method>,status:<code> counter.
The path of the counter is based on the path of the HTTP request, with slashes replaced with periods. The method tag contains the lowercased HTTP method of the request. The status tag contains the response code produced by the request.
For example, a GET of /stats/regions that results in an HTTP 200 status code will increment the counter request#path:stats.regions,method:get,status:200.
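The naming rule for these counters can be sketched as a small helper. Dropping the leading slash before replacing the remaining slashes is inferred from the stats.regions example; the function name request_counter is hypothetical.

```python
def request_counter(path, method, status):
    """Build the counter name for one HTTP request."""
    # '/stats/regions' becomes 'stats.regions'.
    dotted = path.strip("/").replace("/", ".")
    return f"request#path:{dotted},method:{method.lower()},status:{status}"


print(request_counter("/stats/regions", "GET", 200))
```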
Response codes in the 400 range (e.g. 404) are only generated for HTTP paths referring to API endpoints. Logging them for unknown and invalid paths would overwhelm the system with all the random paths the friendly Internet bot army sends along.
HTTP Timers¶
In addition to the HTTP counters, every legitimate, routed request emits a request#path:<path>,method:<method> timer.
These timers have the same structure as the HTTP counters, except they do not have the response code tag.
Task Timers¶
Our data ingress and data maintenance actions are managed by a Celery queue of tasks. These tasks are executed asynchronously, and each task emits a timer indicating its execution time.
For example:
task#task:data.export_reports
task#task:data.update_statcounter
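A per-task timer of this kind could be implemented with a decorator along the following lines. This is a minimal sketch and not the Celery integration itself: the EMITTED list is a stand-in for a real statsd client, the timed_task decorator is hypothetical, and reporting in milliseconds follows the usual statsd timer convention.

```python
import functools
import time

# Collected (metric, milliseconds) pairs; a stand-in for a real statsd client.
EMITTED = []


def timed_task(task_name):
    """Decorator that records how long the wrapped task took to run."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the timing even if the task raised an exception.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                EMITTED.append((f"task#task:{task_name}", elapsed_ms))
        return wrapper
    return decorator


@timed_task("data.export_reports")
def export_reports():
    """Placeholder task body used only for this example."""
    return "done"


export_reports()
print(EMITTED[-1][0])
```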
Datamaps Timers¶
We include a script to generate a data map from the gathered map statistics. This script includes a number of timers and pseudo-timers to monitor its operation.
datamaps#func:export
datamaps#func:encode
datamaps#func:merge
datamaps#func:main
datamaps#func:render
datamaps#func:upload
: timers
These timers track the individual functions of the generation process.
datamaps#count:csv_rows
datamaps#count:quadtrees
datamaps#count:tile_new
datamaps#count:tile_changed
datamaps#count:tile_deleted
datamaps#count:tile_unchanged
: timers
Pseudo-timers to track the number of CSV rows, Quadtree files and image tiles.