Metrics and Structured Logs

Ichnaea provides two classes of runtime data:

  • Statsd-style Metrics, for real-time monitoring and easy visual analysis of trends

  • Structured Logs, for offline analysis of data and targeted custom reports

Structured logs were added in 2020, and the migration of data from metrics to logs is not complete. For more information, see the Implementation section.

Metrics are emitted by the web / API application, the backend task application, and the datamaps script:

Metric Name                    App       Type     Tags
-----------------------------  --------  -------  -----------------------------------------------
api.limit                      task      gauge    key, path
data.batch.upload              web       counter  key
data.export.batch              task      counter  key
data.export.upload             task      counter  key, status
data.export.upload.timing      task      timer    key
data.observation.drop          task      counter  type, key
data.observation.insert        task      counter  type
data.observation.upload        task      counter  type, key
data.report.drop               task      counter  key
data.report.upload             task      counter  key
data.station.blocklist         task      counter  type
data.station.confirm           task      counter  type
data.station.dberror           task      counter  type, errno
data.station.new               task      counter  type
datamaps                       datamaps  timer    func, count
datamaps.dberror               datamaps  counter  errno
locate.fallback.cache          web       counter  fallback_name, status
locate.fallback.lookup         web       counter  fallback_name, status
locate.fallback.lookup.timing  web       timer    fallback_name, status
locate.query                   web       counter  key, geoip, blue, cell, wifi
locate.request                 web       counter  key, path
locate.result                  web       counter  key, accuracy, status, source, fallback_allowed
locate.source                  web       counter  key, accuracy, status, source
locate.user                    task      gauge    key, interval
queue                          task      gauge    queue
region.query                   web       counter  key, geoip, blue, cell, wifi
region.request                 web       counter  key, path
region.result                  web       counter  key, accuracy, status, source, fallback_allowed
region.user                    task      gauge    key, interval
request                        web       counter  path, method, status
request.timing                 web       timer    path, method
submit.request                 web       counter  key, path
submit.user                    task      gauge    key, interval
task                           task      timer    task

Web Application Metrics

The website handles HTTP requests, which may be page requests or API calls.

Request Metrics

Each HTTP request, including API calls, emits metrics and a structured log entry.

request

request is a counter for almost all HTTP requests, including API calls. The exceptions are static assets, like CSS, images, JavaScript, and fonts, as well as some static content like robots.txt.

Additionally, invalid requests (HTTP status in the 4xx range) do not emit this metric, unless they are requests to API endpoints.

The path tag is the request path, like /stats/regions, but normalized to tag-safe characters. The initial slash is dropped, and remaining slashes are replaced with periods, so /stats/regions becomes stats.regions. The homepage, /, is normalized as .homepage, to avoid an empty tag value.

Tags:

  • path: The metrics-normalized HTTP path, like stats.regions, v1.geolocate, and .homepage

  • method: The HTTP method in lowercase, like post, get, head, and options

  • status: The returned HTTP status, like 200 for success and 400 for client errors
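The path normalization described above can be sketched in a few lines. The helper name normalize_path is hypothetical and is not taken from the Ichnaea source. It implements only the two stated rules (drop the initial slash, replace the remaining slashes with periods) plus the homepage special case, and does not attempt any further filtering to tag-safe characters.

```python
def normalize_path(path: str) -> str:
    """Convert an HTTP path into a metrics tag value (illustrative only)."""
    # The homepage would otherwise normalize to an empty tag value.
    if path == "/":
        return ".homepage"
    # Drop the initial slash, if present.
    if path.startswith("/"):
        path = path[1:]
    # Replace the remaining slashes with periods.
    return path.replace("/", ".")
```

For example, normalize_path("/v1/geolocate") gives v1.geolocate, which matches the path tag values listed above.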

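Related structured log data:

  • http_method: The same value as tag method, but in uppercase

  • http_path: The request path, before normalization

  • http_status: The same value as tag status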

request.timing

request.timing is a timer for how long the HTTP request took to complete in milliseconds.

The tags (path, method, and status) are the same as request.

Related structured log data:

  • duration_s: The time the request took in seconds, rounded to the millisecond

API Metrics

These metrics are emitted when the API is called.

data.batch.upload

data.batch.upload is a counter that is incremented when a submit API, like /v2/geosubmit, is called with any valid data. A submission batch could contain a single report or multiple reports, but both would increment data.batch.upload by one. A batch with no (valid) reports does not increment this metric.

Tags:

  • key: The API key, often a UUID, or omitted if the API key is not valid.

Related structured log data:

  • api_key: The same value as tag key for valid keys

locate.query

locate.query is a counter, incremented each time the Geolocate API is used with a valid API key that is not rate limited. It is used to segment queries by the station data contained in the request body.

Tags:

  • key: The API key, often a UUID

  • geoip: false if there was no GeoIP data, and omitted when there is GeoIP data for the client IP (the common case)

  • blue: Count of valid Bluetooth stations in the request, none, one, or many

  • cell: Count of valid cell stations in the request, none, one, or many

  • wifi: Count of valid WiFi stations in the request, none, one, or many
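As a rough illustration of how precise counts collapse into the three tag values, the sketch below buckets each station count and builds the tag list for one query. The function names and the name:value string form of the tags are assumptions made for this example, not the Ichnaea implementation.

```python
def count_bucket(count: int) -> str:
    """Reduce a station count to one of three low-cardinality tag values."""
    if count <= 0:
        return "none"
    if count == 1:
        return "one"
    return "many"


def query_tags(key: str, has_geoip: bool, blue: int, cell: int, wifi: int) -> list:
    """Build the tags for a locate.query increment (hypothetical helper)."""
    tags = [f"key:{key}"]
    # The geoip tag is only present in the uncommon case of missing GeoIP data.
    if not has_geoip:
        tags.append("geoip:false")
    tags.append(f"blue:{count_bucket(blue)}")
    tags.append(f"cell:{count_bucket(cell)}")
    tags.append(f"wifi:{count_bucket(wifi)}")
    return tags
```

The structured log keeps the numeric counts, so the bucketing only applies to the metric tags.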

Changed in version 2020.04.16: Removed the region tag

Related structured log data:

  • api_key: The same value as tag key

  • has_geoip: Always set, False when geoip is false

  • blue: Count of Bluetooth stations, as a number instead of text like many

  • cell: Count of Cell stations

  • wifi: Count of WiFi stations

locate.request

locate.request is a counter, incremented for each call to the Geolocate API.

Tags:

  • key: The API key, often a UUID, or invalid for a known key that can not call the API, or none for an omitted key.

  • path: v1.geolocate, the standardized API path

Related structured log data:

  • api_key: The same value as tag key, except that instead of invalid, the request key is used, and api_key_allowed=False

  • api_key_allowed: False when the key is not allowed to use the API

  • api_path: The same value as tag path

  • api_type: The value locate

locate.result

locate.result is a counter, incremented for each call to the Geolocate API with a valid API key that is not rate limited.

If there are no Bluetooth, Cell, or WiFi networks provided, and GeoIP data is not available (for example, the IP fallback is explicitly disabled), then this metric is not emitted.

Tags:

  • key: The API key, often a UUID

  • accuracy: The expected accuracy, based on the sources provided:

    • high: At least two Bluetooth or WiFi networks

    • medium: No Bluetooth or WiFi networks, at least one cell network

    • low: No networks, only GeoIP data

  • status: Could we provide a location estimate?

    • hit if we can provide a location with the expected accuracy,

    • miss if we can not provide a location with the expected accuracy. For cell networks (accuracy=medium), a hit includes the case where there is not an exact cell match, but the cell area (the area covered by related cells) is small enough (smaller than tens of kilometers across) for an estimate.

  • source: The source that provided the hit:

    • internal: Our crowd-sourced network data

    • geoip: The MaxMind GeoIP database

    • fallback: An optional external fallback provider

    • Omitted when status=miss

  • fallback_allowed:

    • true if the external fallback provider was allowed

    • Omitted if the external fallback provider was not allowed
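The accuracy levels above can be read as a simple decision rule. The sketch below is one possible reading, not the Ichnaea code: it assumes the counts are of valid stations, that "at least two Bluetooth or WiFi networks" means at least two of either type, and that cases the list does not cover (such as a single WiFi network) fall through to the next level.

```python
def expected_accuracy(blue: int, cell: int, wifi: int) -> str:
    """Pick the expected accuracy level from the station counts (illustrative)."""
    # At least two Bluetooth or WiFi networks: high accuracy is expected.
    if blue >= 2 or wifi >= 2:
        return "high"
    # Otherwise, at least one cell network: medium accuracy is expected.
    if cell >= 1:
        return "medium"
    # No usable networks, so only GeoIP data can answer the query.
    return "low"
```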

Changed in version 2020.04.16: Removed the region tag

Related structured log data:

locate.source

locate.source is a counter, incremented for each processed source in a location query. If station data (Bluetooth, WiFi, and Cell data) is provided, this is usually two metrics for one request, one for the internal source and one for the geoip source.

The required accuracy for a hit is set by the kind of station data in the request. For example, a request with no station data requires a low accuracy, while one with multiple WiFi networks requires a high accuracy. The high accuracy is at least 500 meters, and the minimum current MaxMind accuracy is 1000 meters, so the geoip source is expected to have a miss status when accuracy is high.

Tags (similar to locate.result):

  • key: The API key, often a UUID

  • accuracy: The expected accuracy, based on the sources provided:

    • high: At least two Bluetooth or WiFi networks

    • medium: No Bluetooth or WiFi networks, at least one cell network

    • low: No networks, only GeoIP data

  • status: Could we provide a location estimate?

    • hit: We can provide a location with the expected accuracy,

    • miss: We can not provide a location with the expected accuracy

  • source: The source that was processed:

    • internal: Our crowd-sourced network data

    • geoip: The MaxMind GeoIP database

    • fallback: An optional external fallback provider

  • fallback_allowed:

    • true if the external fallback provider was allowed

    • Omitted if the external fallback provider was not allowed

Changed in version 2020.04.16: Removed the region tag

Related structured log data:

region.query

region.query is a counter, incremented each time the Region API is used with a valid API key. It is used to segment queries by the station data contained in the request body.

It has the same tags (key, geoip, blue, cell, and wifi) as locate.query.

region.request

region.request is a counter, incremented for each call to the Region API.

It has the same tags (key and path) as locate.request, except the path tag is v1.country, the standardized API path.

region.result

region.result is a counter, incremented for each call to the Region API with a valid API key that is not rate limited.

If there are no Bluetooth, Cell, or WiFi networks provided, and GeoIP data is not available (for example, the IP fallback is explicitly disabled), then this metric is not emitted.

It has the same tags (key, accuracy, status, source, and fallback_allowed) as locate.result.

region.source

region.source is a counter, incremented for each processed source in a region query. If station data (Bluetooth, WiFi, and Cell data) is provided, this is usually two metrics for one request, one for the internal source and one for the geoip source. In practice, most users provide no station data, and only the geoip source is emitted.

It has the same tags (key, accuracy, status, source, and fallback_allowed) as locate.source.

submit.request

submit.request is a counter, incremented for each call to a Submit API:

  • /v2/geosubmit

  • /v1/submit (deprecated)

  • /v1/geosubmit (deprecated)

This counter can be used to determine when the deprecated APIs can be removed.

It has the same tags (key and path) as locate.request, except the path tag is v2.geosubmit, v1.submit, or v1.geosubmit, the standardized API path.

API Fallback Metrics

These metrics were emitted when the fallback location provider was called. MLS stopped using this feature in 2019, so these metrics are not emitted, but the code remains as of 2020.

These metrics have not been converted to structured logs.

locate.fallback.cache

locate.fallback.cache is a counter for the performance of the fallback cache.

Tags:

  • fallback_name: The name of the external fallback provider, from the API key table

  • status: The status of the fallback cache:

    • hit: The cache had a previous result for the query

    • miss: The cache did not have a previous result for the query

    • bypassed: The cache was not used, due to mixed stations in the query, or the high number of individual stations

    • inconsistent: The cached results were for multiple inconsistent locations

    • failure: The cache was unreachable

locate.fallback.lookup

locate.fallback.lookup is a counter for the HTTP response codes returned from the fallback server.

Tags:

  • fallback_name: The name of the external fallback provider, from the API key table

  • status: The HTTP status code, such as 200

locate.fallback.lookup.timing

locate.fallback.lookup.timing is a timer for the call to the fallback location server.

Tags:

  • fallback_name: The name of the external fallback provider, from the API key table

  • status: The HTTP status code, such as 200

Web Application Structured Logs

There is one structured log emitted for each request, which may be an API request. The structured log data includes data that was emitted as one or more metrics.

Request Metrics

All requests, with the exception of static assets and static views (see request), include this data:

  • duration_s: The time in seconds, rounded to the millisecond, to serve the request.

  • http_method: The HTTP method, like POST or GET.

  • http_path: The request path, like / for the homepage, or /v1/geolocate for the API.

  • http_status: The response status, like 200 or 400.
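A minimal sketch of assembling these four fields for one request follows. The function name and its arguments are hypothetical. The only behavior taken from the list above is that the duration is in seconds and rounded to the millisecond, and that the path is logged in its original form rather than the normalized tag form.

```python
def request_log_fields(method: str, path: str, status: int,
                       started: float, finished: float) -> dict:
    """Build the per-request structured log fields (illustrative only)."""
    return {
        # Seconds, rounded to three decimal places (one millisecond).
        "duration_s": round(finished - started, 3),
        "http_method": method.upper(),
        # The raw request path, not the normalized metrics tag.
        "http_path": path,
        "http_status": status,
    }
```

The started and finished values could come from a monotonic clock such as time.perf_counter() read before and after the request is served.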

This data is duplicated in the request and request.timing metrics.

API Metrics

If a request is an API call, additional data can be added to the log:

  • accuracy: The accuracy of the result, high, medium, or low.

  • accuracy_min: The minimum required accuracy of the result for a hit, high, medium, or low.

  • api_key: An API key that has an entry in the API key table, often a UUID, or none if omitted. Same as statsd tag key, except that known but disallowed API keys are the key value, rather than invalid.

  • api_key_allowed: False if a known API key is not allowed to call the API, omitted otherwise.

  • api_key_db_fail: True when a database error prevented checking the API key. Omitted when the check is successful.

  • api_path: The normalized API path, like v1.geolocate and v2.geosubmit. Same as statsd tag path when an API is called.

  • api_response_sig: A hash to identify repeated geolocate requests getting the same response without identifying the client.

  • api_type: The API type, locate, submit, or region.

  • blue: The count of Bluetooth radios in the request.

  • blue_valid: The count of valid Bluetooth radios in the request.

  • cell: The count of cell tower radios in the request.

  • cell_valid: The count of valid cell tower radios in the request.

  • fallback_allowed: True if the optional fallback location provider can be used by this API key, False if not.

  • has_geoip: True if there is GeoIP data for the client IP, otherwise False.

  • has_ip: True if the client IP was available, otherwise False.

  • invalid_api_key: The invalid API key not found in API table, omitted if known or empty.

  • rate_allowed: True if allowed, False if not allowed due to rate limit, or omitted if the API is not rate-limited.

  • rate_quota: The daily rate limit, or omitted if API is not rate-limited.

  • rate_remaining: The remaining API calls to hit limit, 0 if none remaining, or omitted if the API is not rate-limited.

  • region: The ISO region code for the IP address, null if none.

  • result_status: hit if an accurate estimate could be made, miss if it could not.

  • source_fallback_accuracy: The accuracy level of the external fallback source, high, medium, or low.

  • source_fallback_accuracy_min: The required accuracy level of the fallback source.

  • source_fallback_status: hit if the fallback source provided an accurate estimate, miss if it did not.

  • source_internal_accuracy: The accuracy level of the internal source (Bluetooth, WiFi, and cell data compared against the database), high, medium, or low.

  • source_internal_accuracy_min: The required accuracy level of the internal source.

  • source_internal_status: hit if the internal check provided an accurate estimate, miss if it did not.

  • source_geoip_accuracy: The accuracy level of the GeoIP source, high, medium, or low.

  • source_geoip_accuracy_min: The required accuracy level of the GeoIP source.

  • source_geoip_status: hit if the GeoIP database provided an accurate estimate, miss if it did not.

  • wifi: The count of WiFi radios in the request.

  • wifi_valid: The count of valid WiFi radios in the request.

Some of this data is duplicated in metrics:

Task Application Metrics

The task application, running on Celery in the backend, implements the data pipeline and other periodic tasks. These tasks emit metrics, but have not been converted to structured logging.

API Monitoring Metrics

These metrics are emitted periodically to monitor API usage. A Redis key is incremented or updated during API requests, and the current value is reported via these metrics:

api.limit

api.limit is a gauge of the API requests, segmented by API key and API path, for keys with daily limits. It is updated every 10 minutes.

Tags:

  • key: The API key, often a UUID

  • path: The normalized API path, such as v1.geolocate or v2.geosubmit

Related structured log data is added during the request when an API key has rate limits:

  • rate_allowed: True if the request was allowed, False if not allowed due to the rate limit

  • rate_quota: The daily rate limit

  • rate_remaining: The remaining API calls to hit limit, 0 if none remaining
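The three rate-limit fields can be derived from a running request count and the daily quota. The sketch below is an assumption about how they relate, not the Ichnaea code: it treats count as the number of requests seen today for the key and path, including the current one, and allows the request while the count does not exceed the quota.

```python
def rate_limit_fields(count: int, quota: int) -> dict:
    """Build the rate-limit log fields for one request (illustrative only)."""
    return {
        # Assumed boundary: the request that reaches the quota is still allowed.
        "rate_allowed": count <= quota,
        "rate_quota": quota,
        # Never negative; 0 once the quota is used up.
        "rate_remaining": max(quota - count, 0),
    }
```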

locate.user

locate.user is a gauge of the estimated number of daily and weekly users of the Geolocate API by API key. It is updated every 10 minutes.

The estimate is based on the client’s IP address. At request time, the IP is added via PFADD to a HyperLogLog structure. This structure can be used to estimate the cardinality (number of unique IP addresses) to within about 1%. See PFCOUNT for details on the HyperLogLog implementation.

Tags:

  • key: The API key, often a UUID

  • interval: 1d for the daily estimate, 7d for the weekly estimate.
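The PFADD and PFCOUNT flow can be sketched as below. The key layout (one key per API type, API key, and day) and the idea of computing the weekly figure as a PFCOUNT over seven daily keys are assumptions for this example. The document does not state how Ichnaea names its keys or combines days. The method names pfadd and pfcount follow the redis-py client, and FakeRedis is an in-memory stand-in that uses exact sets so the example runs without a server.

```python
from datetime import date, timedelta


class FakeRedis:
    """In-memory stand-in for PFADD/PFCOUNT, using exact sets instead of HyperLogLog."""

    def __init__(self):
        self._sets = {}

    def pfadd(self, name, *values):
        members = self._sets.setdefault(name, set())
        before = len(members)
        members.update(values)
        return int(len(members) > before)

    def pfcount(self, *names):
        # With several keys, PFCOUNT reports the cardinality of their union.
        union = set()
        for name in names:
            union |= self._sets.get(name, set())
        return len(union)


def user_key(api_type: str, api_key: str, day: date) -> str:
    # Hypothetical key layout, chosen only for this sketch.
    return f"apiuser:{api_type}:{api_key}:{day.isoformat()}"


def record_user(client, api_type: str, api_key: str, ip: str, day: date) -> None:
    """At request time, add the client IP to the structure for that day."""
    client.pfadd(user_key(api_type, api_key, day), ip)


def estimate_users(client, api_type: str, api_key: str, today: date, days: int) -> int:
    """Estimate unique IPs over the last `days` days, including today."""
    keys = [user_key(api_type, api_key, today - timedelta(days=n)) for n in range(days)]
    return client.pfcount(*keys)
```

With a real Redis server the counts are estimates rather than exact values, which is the trade-off that keeps the memory use per key small.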

region.user

region.user is a gauge of the estimated number of daily and weekly users of the Region API by API key. It is updated every 10 minutes.

It has the same tags (key and interval) as locate.user.

submit.user

submit.user is a gauge of the estimated number of daily and weekly users of the submit APIs (/v2/geosubmit and the deprecated submit APIs) by API key. It is updated every 10 minutes.

It has the same tags (key and interval) as locate.user.

Data Pipeline Metrics - Gather and Export

The data pipeline processes data from two sources:

  • Submission reports, from the submission APIs, which include a position from an external source like GPS, along with the WiFi, Cell, and Bluetooth stations that were seen.

  • Location queries, from the geolocate and region APIs, which include an estimated position, along with the stations.

Multiple reports can be submitted in one call to the submission APIs. Each batch of reports increments the data.batch.upload metric when the API is called. A single report is created for each location query, and there is no corresponding metric.

The APIs feed these reports into a Redis queue update_incoming, processed by the backend task of the same name. This task copies reports to “export” queues. Four types are supported:

  • dummy: Does nothing, for pipeline testing

  • geosubmit: POST reports to a service supporting the Geosubmit API.

  • internal: Divide reports into observations, for further processing to update the internal database.

  • s3: Store report JSON in S3.

Ichnaea supports multiple export targets for a type. In production, there are three export targets, identified by an export key:

  • backup: An s3 export, to a Mozilla-private S3 bucket

  • tostage: A geosubmit export, to send a sample of reports to stage for integration testing.

  • internal: An internal export, to update the database

The data pipeline has not been converted to structured logging. As data moves through this part of the data pipeline, these metrics are emitted:

data.export.batch

data.export.batch is a counter of the report batches exported to external and internal targets.

Tags:

  • key: The export key, from the export table. Keys used in Mozilla production:

    • backup: Reports archived in S3

    • tostage: Reports sent from production to stage, as a form of integration testing

    • internal: Reports queued for processing to update the internal station database

data.export.upload

data.export.upload is a counter that tracks the status of export jobs.

Tags:

  • key: The export key, from the export table. Keys used in Mozilla production are backup and tostage, with the same meaning as data.export.batch. Unlike that metric, internal is not used.

  • status: The status of the export, which varies by type of export:

    • backup: success or failure storing the report to S3

    • tostage: HTTP code returned by the submission API, usually 200 for success or 400 for failure.

data.export.upload.timing

data.export.upload.timing is a timer for the report batch export process.

Tags:

  • key: The export key, from the export table. See data.export.batch for the values used in Mozilla production.

data.observation.drop

data.observation.drop is a counter of the Bluetooth, cell, or WiFi observations that were discarded before integration due to some internal consistency, range or validity-condition error encountered while attempting to normalize the observation.

Tags:

  • key: The API key, often a UUID. Omitted if unknown or not available

  • type: The station type, one of blue, cell, or wifi

data.observation.upload

data.observation.upload is a counter of the number of Bluetooth, cell, or WiFi observations entering the data processing pipeline, before normalization and blocked station processing. This count is taken after a batch of reports is decomposed into observations.

The tags (key and type) are the same as data.observation.drop.

data.report.drop

data.report.drop is a counter of the reports discarded due to some internal consistency, range, or validity-condition error.

Tags:

  • key: The API key, often a UUID. Omitted if unknown or not available

data.report.upload

data.report.upload is a counter of the reports accepted into the data processing pipeline.

It has the same tag (key) as data.report.drop.

Data Pipeline Metrics - Update Internal Database

The internal export process decomposes reports into observations, pairing one position with one station. Each observation works its way through a process of normalization, consistency-checking, and (possibly) integration into the database, to improve future location estimates.
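The decomposition step, pairing one position with one station, might look like the sketch below. The report layout (a position entry plus optional blue, cell, and wifi lists) is a simplified, hypothetical shape chosen for the example and is not the actual report schema.

```python
def decompose(report: dict) -> list:
    """Split one report into observations, one per station (illustrative only)."""
    position = report["position"]
    observations = []
    for station_type in ("blue", "cell", "wifi"):
        for station in report.get(station_type, []):
            # Each observation pairs the single report position with one station.
            observations.append(
                {"type": station_type, "station": station, "position": position}
            )
    return observations
```

A report with two WiFi stations and one cell station yields three observations, each carrying the same position.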

The data pipeline has not been converted to structured logging. As data moves through the pipeline, these metrics are emitted:

data.observation.insert

data.observation.insert is a counter of the Bluetooth, cell, or WiFi observations that were successfully validated, normalized, and integrated.

Tags:

  • type: The station type, one of blue, cell, or wifi

data.station.blocklist

data.station.blocklist is a counter of the Bluetooth, cell, or WiFi stations that are blocked from being used to estimate positions. These are added because there are multiple valid observations at sufficiently different locations, supporting the theory that it is a mobile station (such as a picocell or a mobile hotspot on public transit), or was recently moved (such as a WiFi base station that moved with the owner to a new home).

Tags:

  • type: The station type, one of blue, cell, or wifi

data.station.confirm

data.station.confirm is a counter of the Bluetooth, cell or WiFi stations that were confirmed to still be active. An observation from a location query can be used to confirm a station with a position based on submission reports.

It has the same tag (type) as data.station.blocklist.

data.station.dberror

data.station.dberror is a counter of retryable database errors, which are encountered as multiple task threads attempt to update the internal database.

Retryable database errors, like a lock timeout (1205) or deadlock (1213), cause the station updating task to sleep and start over. Other database errors are not counted, but instead halt the task and are recorded in Sentry.
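The sleep-and-start-over behavior can be sketched as a small retry loop. Everything apart from the two error numbers is an assumption for illustration: the DatabaseError class stands in for whatever exception the database driver raises, and the attempt limit, the one-second sleep, and the function name are hypothetical.

```python
import time
from collections import Counter

RETRYABLE_ERRNOS = {1205, 1213}  # lock timeout, deadlock


class DatabaseError(Exception):
    """Stand-in for a driver exception that carries a database error number."""

    def __init__(self, errno: int):
        super().__init__(f"database error {errno}")
        self.errno = errno


def update_with_retry(update, station_type, dberrors, max_attempts=3, sleep=time.sleep):
    """Run update(), starting over after retryable database errors."""
    for _ in range(max_attempts):
        try:
            return update()
        except DatabaseError as exc:
            if exc.errno not in RETRYABLE_ERRNOS:
                # Not retryable: not counted, and the error halts the task.
                raise
            # Count the error by station type and error number, then back off.
            dberrors[(station_type, exc.errno)] += 1
            sleep(1)
    raise RuntimeError("gave up after repeated retryable database errors")
```

Here dberrors is a collections.Counter standing in for the data.station.dberror counter, keyed by the same two dimensions as its tags.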

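Tags:

  • type: The station type, one of blue, cell, or wifi

  • errno: The database error number, such as 1205 or 1213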

data.station.new

data.station.new is a counter of the Bluetooth, cell or WiFi stations that were discovered for the first time.

Tags:

  • type: The station type, one of blue, cell, or wifi

Backend Monitoring Metrics

queue

queue is a gauge that reports the current size of task and data queues. Queues are implemented as Redis lists, with a length returned by LLEN.

Task queues hold the backlog of Celery async tasks. The names of the task queues are:

  • celery_blue, celery_cell, celery_wifi - A task to process a chunk of observation data

  • celery_default - A generic task queue

  • celery_export - Tasks exporting data, either public cell data or the Data Pipeline

  • celery_incoming - Unused

  • celery_monitor - Tasks updating metrics gauges for this metric and API Monitoring Metrics

  • celery_reports - Tasks handling batches of submission reports or location queries

Data queues are the backlog of observations and other data items to be processed. Data queues have names that mirror the shared database tables:

  • update_blue_0 through update_blue_f (16 total) - Observations of Bluetooth stations

  • update_cell_gsm, update_cell_lte, and update_cell_wcdma - Observations of cell stations

  • update_cell_area - Aggregated observations of cell towers

  • update_datamap_ne, update_datamap_nw, update_datamap_se, and update_datamap_sw - Approximate locations for the contribution map

  • update_wifi_0 through update_wifi_f (16 total) - Observations of WiFi stations

Tags:

  • queue: The name of the task or data queue
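The data queue names and the LLEN-based gauge can be sketched together. The llen method name follows the redis-py client. The function names, the tuple returned for each gauge, and the FakeRedis stand-in are assumptions made so the example runs without a server.

```python
def data_queue_names() -> list:
    """List the data queue names described above."""
    shards = [format(i, "x") for i in range(16)]  # 0 through f
    names = [f"update_blue_{shard}" for shard in shards]
    names += ["update_cell_gsm", "update_cell_lte", "update_cell_wcdma"]
    names += ["update_cell_area"]
    names += [f"update_datamap_{quad}" for quad in ("ne", "nw", "se", "sw")]
    names += [f"update_wifi_{shard}" for shard in shards]
    return names


def queue_gauges(client, names) -> list:
    """Return (metric, value, tags) for each queue, using the list length."""
    return [("queue", client.llen(name), [f"queue:{name}"]) for name in names]


class FakeRedis:
    """In-memory stand-in that only supports LLEN."""

    def __init__(self, lists):
        self._lists = lists

    def llen(self, name):
        # A missing key behaves like an empty list.
        return len(self._lists.get(name, []))
```

A monitoring task could emit one gauge per returned tuple on each run.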

task

task is a timer that measures how long each Celery task takes. Celery tasks are used to implement the data pipeline and monitoring tasks.

Tags:

  • task: The task name, such as data.export_reports or data.update_statcounter

Datamaps Metrics

The datamap script generates a data map from the gathered map statistics. It has not been updated to work with current production infrastructure, so these metrics were emitted from the previous infrastructure.

datamaps

datamaps is a timer for functions in the datamap process. It also counts items, but as a timer.

Note

The item counts should be moved to a new counter metric

Tags:

  • func: The export function being timed, such as export, encode, merge, main, render, or upload

  • count: The item counts, recorded as a timer, such as csv_rows, quadtrees, tile_new, tile_changed, tile_deleted, tile_unchanged

datamaps.dberror

datamaps.dberror counts the number of retryable database errors.

Tags:

  • errno: The database error number, such as 1205 or 1213

Implementation

Ichnaea emits statsd-compatible metrics using markus, if the STATSD_HOST is configured (see the config section). Metrics use the tags extension, which adds queryable dimensions to the metrics. In development, the metrics are displayed with the logs. In production, the metrics are stored in an InfluxDB database, and can be displayed as graphs with Grafana.
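To make the tags extension concrete, the sketch below formats a metric as a single statsd line with tags appended. It assumes the DogStatsD-style layout (name:value|type|#tag:value,...), which is one common form of the tags extension. The function is written for illustration and is not part of markus or Ichnaea.

```python
TYPE_CODES = {"counter": "c", "gauge": "g", "timer": "ms"}


def format_metric(name: str, value, metric_type: str, tags: dict = None) -> str:
    """Format one metric as a statsd line with DogStatsD-style tags."""
    line = f"{name}:{value}|{TYPE_CODES[metric_type]}"
    if tags:
        # Tags follow a "#" marker as comma-separated name:value pairs.
        line += "|#" + ",".join(f"{tag}:{val}" for tag, val in tags.items())
    return line
```

For example, one request counter increment with the tags path, method, and status would be formatted as request:1|c|#path:v1.geolocate,method:post,status:200.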

Ichnaea also emits structured logs using structlog. In development, these are displayed in a human-friendly format. In production, they use the MozLog JSON format, and the data is stored in BigQuery.

In the past, metrics were the main source of runtime data, and tags were used to segment the metrics and provide insights. However, metric tags and their values were limited to avoid performance issues. InfluxDB and other time-series databases store metrics by the indexed series of tag values. This performs well when tags have a small number of unique values, and the combinations of tags are limited. When tags have many unique values and are combined, the number of possible series can explode and cause storage and performance issues (the “high cardinality” problem).

Metric tag values are limited to avoid high cardinality issues. For example, rather than storing the number of WiFi stations, the wifi tag of the locate.query metric has the values none, one, and many. The region, such as US or DE, was once stored as a tag, but this can have almost 250 values, causing MLS to have the highest processing load across Mozilla projects.
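The effect of one extra tag on cardinality can be shown with simple arithmetic. The sketch below counts the possible series per API key for locate.query, using the tag values documented above (geoip is either false or omitted, and each station tag has three values), and then multiplies by 250 as an upper bound for the former region tag.

```python
from math import prod

# Number of distinct values each locate.query tag can take, per API key.
tag_values = {"geoip": 2, "blue": 3, "cell": 3, "wifi": 3}

# Every combination of tag values is a separate series in the database.
series_per_key = prod(tag_values.values())

# Upper bound when a region tag with up to 250 values is added.
series_per_key_with_region = series_per_key * 250

print(series_per_key)              # 54
print(series_per_key_with_region)  # 13500
```

These are upper bounds, since not every combination occurs in practice, but they show how a single wide tag multiplies the series count for every API key.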

BigQuery easily handles high-cardinality data, so structured logs can contain precise values, such as the actual number of WiFi stations provided, and more items, such as the region and unexpected keys. On the other hand, there isn’t a friendly tool like Grafana to quickly explore the data.

As of 2020, we are in the process of duplicating data from metrics into structured logging, expanding the data collected, and creating dashboards. We’ll also remove data from metrics, first to reduce the current issues around high-cardinality, then to focus metrics on operational data. Structured data will be used for service analysis and monitoring of long-term trends, and dashboards created for reference.