Metrics and Structured Logs¶
Ichnaea provides two classes of runtime data:
Statsd-style Metrics, for real-time monitoring and easy visual analysis of trends
Structured Logs, for offline analysis of data and targeted custom reports
Structured logs were added in 2020, and the migration of data from metrics to logs is not complete. For more information, see the Implementation section.
Metrics are emitted by the web / API application, the backend task application, and the datamaps script:
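| Metric Name | App | Type | Tags |
|---|---|---|---|
| api.limit | task | gauge | key, path |
| data.batch.upload | web | counter | key |
| data.export.batch | task | counter | key |
| data.export.upload | task | counter | key, status |
| data.export.upload.timing | task | timer | key |
| data.observation.drop | task | counter | type, key |
| data.observation.insert | task | counter | type |
| data.observation.upload | task | counter | type, key |
| data.report.drop | task | counter | key |
| data.report.upload | task | counter | key |
| data.station.blocklist | task | counter | type |
| data.station.confirm | task | counter | type |
| data.station.dberror | task | counter | type, errno |
| data.station.new | task | counter | type |
| datamaps.dberror | task | counter | errno |
| locate.fallback.cache | web | counter | fallback_name, status |
| locate.fallback.lookup | web | counter | fallback_name, status |
| locate.fallback.lookup.timing | web | timer | fallback_name, status |
| locate.query | web | counter | key, geoip, blue, cell, wifi |
| locate.request | web | counter | key, path |
| locate.result | web | counter | key, accuracy, status, source, fallback_allowed |
| locate.source | web | counter | key, accuracy, status, source |
| locate.user | task | gauge | key, interval |
| queue | task | gauge | data_type, queue, queue_type |
| | task | gauge | |
| | task | gauge | |
| | task | gauge | |
| | task | gauge | |
| | task | gauge | |
| | task | gauge | |
| | task | gauge | |
| region.query | web | counter | key, geoip, blue, cell, wifi |
| region.request | web | counter | key, path |
| region.result | web | counter | key, accuracy, status, source, fallback_allowed |
| region.user | task | gauge | key, interval |
| request | web | counter | path, method, status |
| request.timing | web | timer | path, method |
| submit.request | web | counter | key, path |
| submit.user | task | gauge | key, interval |
| task | task | timer | task |
| trx_history.length | task | gauge | |
| trx_history.max | task | gauge | |
| trx_history.min | task | gauge | |
| trx_history.purging | task | gauge | |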
Web Application Metrics¶
The website handles HTTP requests, which may be page requests or API calls.
Request Metrics¶
Each HTTP request, including API calls, emits metrics and a structured log entry.
request¶
request is a counter for almost all HTTP requests, including API calls. The
exceptions are static assets, like CSS, images, Javascript, and fonts, as well as
some static content like robots.txt.
Additionally, invalid requests (HTTP status in the 4xx range) do not emit this
metric, unless they are API endpoints.
The path tag is the request path, like /stats/regions, but normalized
to tag-safe characters. The initial slash is dropped, and remaining slashes
are replaced with periods, so /stats/regions becomes stats.regions.
The homepage, / is normalized as .homepage, to avoid an empty tag
value.
Tags:
path: The metrics-normalized HTTP path, like stats.regions, v1.geolocate, and .homepage
method: The HTTP method in lowercase, like post, get, head, and options
status: The returned HTTP status, like 200 for success and 400 for client errors
Related structured log data:
http_method: The non-normalized HTTP method
http_path: The non-normalized request path
http_status: The HTTP status code
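The path normalization described for the path tag can be sketched in a few lines. This is a minimal illustration, not Ichnaea's actual code: the function name normalize_path_tag is hypothetical, and it only covers the rules stated above (drop the initial slash, replace the remaining slashes with periods, and special-case the homepage).

```python
def normalize_path_tag(path: str) -> str:
    """Convert a request path into a tag-safe value (illustrative sketch)."""
    # The homepage would otherwise normalize to an empty tag value.
    if path == "/":
        return ".homepage"
    # Drop only the initial slash, then turn remaining slashes into periods.
    if path.startswith("/"):
        path = path[1:]
    return path.replace("/", ".")


print(normalize_path_tag("/stats/regions"))  # stats.regions
print(normalize_path_tag("/v1/geolocate"))   # v1.geolocate
print(normalize_path_tag("/"))               # .homepage
```

Any further escaping of characters that are unsafe in tags is not covered by this sketch.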
request.timing¶
request.timing is a timer for how long the HTTP request took to complete in
milliseconds.
Tags:
The tags path and method are the same as request. The tag status
is omitted.
Related structured log data:
duration_s: The time the request took in seconds, rounded to the millisecond
API Metrics¶
These metrics are emitted when the API is called.
data.batch.upload¶
data.batch.upload is a counter that is incremented when a submit API, like
/v2/geosubmit, is called with any valid data. A
submission batch could contain a single report or multiple reports, but both
would increment data.batch.upload by one. A batch with no (valid) reports
does not increment this metric.
Tags:
key: The API key, often a UUID, or omitted if the API key is not valid.
Related structured log data:
api_key: The same value as tag key for valid keys
locate.query¶
locate.query is a counter, incremented each time the
Geolocate API is used with a valid API key that
is not rate limited. It is used to segment queries by the station data
contained in the request body.
Tags:
key: The API key, often a UUID
geoip: false if there was no GeoIP data, and omitted when there is GeoIP data for the client IP (the common case)
blue: Count of valid Bluetooth stations in the request, none, one, or many
cell: Count of valid cell stations in the request, none, one, or many
wifi: Count of valid WiFi stations in the request, none, one, or many
Changed in version 2020.04.16: Removed the region tag
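The bucketing of station counts into none, one, and many can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, and the "name:value" tag strings are one common convention, not necessarily the exact form Ichnaea passes to its metrics client.

```python
def count_bucket(valid_count: int) -> str:
    """Collapse a station count into a low-cardinality tag value."""
    if valid_count <= 0:
        return "none"
    if valid_count == 1:
        return "one"
    return "many"


def locate_query_tags(key, blue, cell, wifi, has_geoip=True):
    """Build the tag list for a locate.query counter (illustrative only)."""
    tags = [f"key:{key}"]
    # The geoip tag is only present in the uncommon case of no GeoIP data.
    if not has_geoip:
        tags.append("geoip:false")
    tags.append(f"blue:{count_bucket(blue)}")
    tags.append(f"cell:{count_bucket(cell)}")
    tags.append(f"wifi:{count_bucket(wifi)}")
    return tags


print(locate_query_tags("test", blue=0, cell=1, wifi=5))
# ['key:test', 'blue:none', 'cell:one', 'wifi:many']
```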
Related structured log data:
locate.request¶
locate.request is a counter, incremented for each call to the
Geolocate API.
Tags:
key: The API key, often a UUID, or invalid for a known key that can not call the API, or none for an omitted key.
path: v1.geolocate, the standardized API path
Related structured log data:
api_key: The same value as tag key, except that instead of invalid, the request key is used, and api_key_allowed=False
api_key_allowed: False when the key is not allowed to use the API
api_path: The same value as tag path
api_type: The value locate
locate.result¶
locate.result is a counter, incremented for each call to the
Geolocate API with a valid API key that is not
rate limited.
If there are no Bluetooth, Cell, or WiFi networks provided, and GeoIP data is not available (for example, the IP fallback is explicitly disabled), then this metric is not emitted.
Tags:
key: The API key, often a UUID
accuracy: The expected accuracy, based on the sources provided:
  high: At least two Bluetooth or WiFi networks
  medium: No Bluetooth or WiFi networks, at least one cell network
  low: No networks, only GeoIP data
status: Could we provide a location estimate?
  hit if we can provide a location with the expected accuracy
  miss if we can not provide a location with the expected accuracy
  For cell networks (accuracy=medium), a hit includes the case where there is not an exact cell match, but the cell area (the area covered by related cells) is small enough (smaller than tens of kilometers across) for an estimate.
source: The source that provided the hit:
  internal: Our crowd-sourced network data
  geoip: The MaxMind GeoIP database
  fallback: An optional external fallback provider
  Omitted when status=miss
fallback_allowed:
  true if the external fallback provider was allowed
  Omitted if the external fallback provider was not allowed
Changed in version 2020.04.16: Removed the region tag
Related structured log data:
accuracy: The accuracy level of the result, high, medium, or low
accuracy_min: The same value as tag accuracy
api_key: The same value as tag key
result_status: The same value as tag status
locate.source¶
locate.source is a counter, incremented for each processed source in
a location query. If station data (Bluetooth, WiFi, and Cell data)
is provided, this is usually two metrics for one request, one for the
internal source and one for the geoip source.
The required accuracy for a hit is set by the kind of station data in the
request. For example, a request with no station data requires a low
accuracy, while one with multiple WiFi networks requires a high accuracy.
The high accuracy is at least 500 meters, and the minimum current MaxMind
accuracy is 1000 meters, so the geoip source is expected to have a miss
status when accuracy is high.
Tags (similar to locate.result):
key: The API key, often a UUID
accuracy: The expected accuracy, based on the sources provided:
  high: At least two Bluetooth or WiFi networks
  medium: No Bluetooth or WiFi networks, at least one cell network
  low: No networks, only GeoIP data
status: Could we provide a location estimate?
  hit: We can provide a location with the expected accuracy
  miss: We can not provide a location with the expected accuracy
source: The source that was processed:
  internal: Our crowd-sourced network data
  geoip: The MaxMind GeoIP database
  fallback: An optional external fallback provider
fallback_allowed:
  true if the external fallback provider was allowed
  Omitted if the external fallback provider was not allowed
Changed in version 2020.04.16: Removed the region tag
Related structured log data:
api_key: The same value as tag key
source_internal_accuracy: The accuracy level of the internal source
source_internal_accuracy_min: The required accuracy level of the internal source, same value as tag accuracy when source=internal
source_internal_status: The same value as tag status when source=internal
source_geoip_accuracy: The accuracy level of the GeoIP source
source_geoip_accuracy_min: The required accuracy level of the GeoIP source, same value as tag accuracy when source=geoip
source_geoip_status: The same value as tag status when source=geoip
source_fallback_accuracy: The accuracy level of the external fallback source
source_fallback_accuracy_min: The required accuracy level of the fallback source, same value as tag accuracy when source=fallback
source_fallback_status: The same value as tag status when source=fallback
region.query¶
region.query is a counter, incremented each time the
Region API is used with a valid API key. It is used
to segment queries by the station data contained in the request body.
It has the same tags (key, geoip, blue, cell, and wifi) as
locate.query.
region.request¶
region.request is a counter, incremented for each call to the
Region API.
It has the same tags (key and path) as locate.request, except the
path tag is v1.country, the standardized API path.
region.result¶
region.result is a counter, incremented for each call to the
Region API with a valid API key that is not
rate limited.
If there are no Bluetooth, Cell, or WiFi networks provided, and GeoIP data is not available (for example, the IP fallback is explicitly disabled), then this metric is not emitted.
It has the same tags (key, accuracy, status, source, and
fallback_allowed) as locate.result.
region.source¶
region.source is a counter, incremented for each processed source in
a region query. If station data (Bluetooth, WiFi, and Cell data)
is provided, this is usually two metrics for one request, one for the
internal source and one for the geoip source. In practice, most
users provide no station data, and only the geoip source is emitted.
It has the same tags (key, accuracy, status, source, and
fallback_allowed) as locate.source.
submit.request¶
submit.request is a counter, incremented for each call to a Submit API:
This counter can be used to determine when the deprecated APIs can be removed.
It has the same tags (key and path) as locate.request, except the
path tag is v2.geosubmit, v1.submit, or v1.geosubmit, the
standardized API path.
API Fallback Metrics¶
These metrics were emitted when the fallback location provider was called. MLS stopped using this feature in 2019, so these metrics are not emitted, but the code remains as of 2020.
These metrics have not been converted to structured logs.
locate.fallback.cache¶
locate.fallback.cache is a counter for the performance of the fallback cache.
Tags:
fallback_name: The name of the external fallback provider, from the API key table
status: The status of the fallback cache:
  hit: The cache had a previous result for the query
  miss: The cache did not have a previous result for the query
  bypassed: The cache was not used, due to mixed stations in the query, or the high number of individual stations
  inconsistent: The cached results were for multiple inconsistent locations
  failure: The cache was unreachable
locate.fallback.lookup¶
locate.fallback.lookup is a counter for the HTTP response codes returned
from the fallback server.
Tags:
fallback_name: The name of the external fallback provider, from the API key table
status: The HTTP status code, such as 200
locate.fallback.lookup.timing¶
locate.fallback.lookup.timing is a timer for the call to the fallback
location server.
Tags:
fallback_name: The name of the external fallback provider, from the API key table
status: The HTTP status code, such as 200
Web Application Structured Logs¶
There is one structured log emitted for each request, which may be an API request. The structured log data includes data that was emitted as one or more metrics.
Request Metrics¶
All requests, with the exception of static assets and static views (see request), include this data:
duration_s: The time in seconds, rounded to the millisecond, to serve the request.
http_method: The HTTP method, like POST or GET.
http_path: The request path, like / for the homepage, or /v1/geolocate for the API.
http_status: The response status, like 200 or 400.
This data is duplicated in metrics:
API Metrics¶
If a request is an API call, additional data can be added to the log:
accuracy: The accuracy of the result, high, medium, or low.
accuracy_min: The minimum required accuracy of the result for a hit, high, medium, or low.
api_key: An API key that has an entry in the API key table, often a UUID, or none if omitted. Same as statsd tag key, except that known but disallowed API keys are the key value, rather than invalid.
api_key_allowed: False if a known API key is not allowed to call the API, omitted otherwise.
api_key_db_fail: True when a database error prevented checking the API key. Omitted when the check is successful.
api_path: The normalized API path, like v1.geolocate and v2.geosubmit. Same as statsd tag path when an API is called.
api_response_sig: A hash to identify repeated geolocate requests getting the same response without identifying the client.
api_type: The API type, locate, submit, or region.
blue: The count of Bluetooth radios in the request.
blue_valid: The count of valid Bluetooth radios in the request.
cell: The count of cell tower radios in the request.
cell_valid: The count of valid cell tower radios in the request.
fallback_allowed: True if the optional fallback location provider can be used by this API key, False if not.
has_geoip: True if there is GeoIP data for the client IP, otherwise False.
has_ip: True if the client IP was available, otherwise False.
invalid_api_key: The invalid API key not found in API table, omitted if known or empty.
rate_allowed: True if allowed, False if not allowed due to rate limit, or omitted if the API is not rate-limited.
rate_quota: The daily rate limit, or omitted if API is not rate-limited.
rate_remaining: The remaining API calls to hit limit, 0 if none remaining, or omitted if the API is not rate-limited.
region: The ISO region code for the IP address, null if none.
result_status: hit if an accurate estimate could be made, miss if it could not.
source_fallback_accuracy: The accuracy level of the external fallback source, high, medium, or low.
source_fallback_accuracy_min: The required accuracy level of the fallback source.
source_fallback_status: hit if the fallback source provided an accurate estimate, miss if it did not.
source_internal_accuracy: The accuracy level of the internal source (Bluetooth, WiFi, and cell data compared against the database), high, medium, or low.
source_internal_accuracy_min: The required accuracy level of the internal source.
source_internal_status: hit if the internal check provided an accurate estimate, miss if it did not.
source_geoip_accuracy: The accuracy level of the GeoIP source, high, medium, or low.
source_geoip_accuracy_min: The required accuracy level of the GeoIP source.
source_geoip_status: hit if the GeoIP database provided an accurate estimate, miss if it did not.
wifi: The count of WiFi radios in the request.
wifi_valid: The count of valid WiFi radios in the request.
Some of this data is duplicated in metrics:
Task Application Metrics¶
The task application, running on celery in the backend, implements the data pipeline and other periodic tasks. These emit metrics, but have not been converted to structured logging.
API Monitoring Metrics¶
These metrics are emitted periodically to monitor API usage. A Redis key is incremented or updated during API requests, and the current value is reported via these metrics:
api.limit¶
api.limit is a gauge of the API requests, segmented by API key and API
path, for keys with daily limits. It is updated every 10 minutes.
Tags:
key: The API key, often a UUID
path: The normalized API path, such as v1.geolocate or v2.geosubmit
Related structured log data is added during the request when an API key has rate limits:
rate_allowed: True if the request was allowed, False if not allowed due to the rate limit
rate_quota: The daily rate limit
rate_remaining: The remaining API calls to hit limit, 0 if none remaining
locate.user¶
locate.user is a gauge of the estimated number of daily and weekly users of
the Geolocate API by API key. It is updated
every 10 minutes.
The estimate is based on the client’s IP address. At request time, the IP is added via PFADD to a HyperLogLog structure. This structure can be used to estimate the cardinality (number of unique IP addresses) to within about 1%. See PFCOUNT for details on the HyperLogLog implementation.
Tags:
key: The API key, often a UUID
interval: 1d for the daily estimate, 7d for the weekly estimate.
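The PFADD / PFCOUNT approach described for locate.user can be sketched as below. The functions accept any client object that exposes pfadd and pfcount methods, as the redis-py client does. The key layout, the function names, and the use of one HyperLogLog per day are assumptions made for illustration and may not match how Ichnaea organizes its Redis keys.

```python
from datetime import date


def user_key(api_type: str, api_key: str, day: date) -> str:
    # Hypothetical key layout: one HyperLogLog per API type, API key, and day.
    return f"apiuser:{api_type}:{api_key}:{day.isoformat()}"


def record_user(client, api_type: str, api_key: str, ip: str, day: date) -> None:
    """At request time, add the client IP to the day's HyperLogLog."""
    client.pfadd(user_key(api_type, api_key, day), ip)


def estimate_users(client, api_type: str, api_key: str, days) -> int:
    """Estimate unique IPs across one or more days.

    PFCOUNT with several keys returns the estimated cardinality of the
    union, so passing seven daily keys gives a weekly estimate.
    """
    keys = [user_key(api_type, api_key, d) for d in days]
    return client.pfcount(*keys)
```

With a real Redis connection, the daily gauge would pass a single day and the weekly gauge the last seven days. The result is an estimate, not an exact count.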
region.user¶
region.user is a gauge of the estimated number of daily and weekly users of
the Region API by API key. It is updated every 10
minutes.
It has the same tags (key and interval) as locate.user.
submit.user¶
submit.user is a gauge of the estimated number of daily and weekly users of
the submit APIs (/v2/geosubmit and the
deprecated submit APIs) by API key. It is updated every 10 minutes.
It has the same tags (key and interval) as locate.user.
Data Pipeline Metrics - Gather and Export¶
The data pipeline processes data from two sources:
Submission reports, from the submission APIs, which include a position from an external source like GPS, along with the Wifi, Cell, and Bluetooth stations that were seen.
Location queries, from the geolocate and region APIs, which include an estimated position, along with the stations.
Multiple reports can be submitted in one call to the submission APIs. Each batch of reports increments the data.batch.upload metric when the API is called. A single report is created for each location query, and there is no corresponding metric.
The APIs feed these reports into a Redis queue update_incoming,
processed by the backend task of the same name. This task copies reports to
“export” queues. Four types are supported:
dummy: Does nothing, for pipeline testing
geosubmit: POST reports to a service supporting the Geosubmit API.
internal: Divide reports into observations, for further processing to update the internal database.
s3: Store report JSON in S3.
Ichnaea supports multiple export targets for a type. In production, there are three export targets, identified by an export key:
backup: An s3 export, to a Mozilla-private S3 bucket
tostage: A geosubmit export, to send a sample of reports to stage for integration testing.
internal: An internal export, to update the database
The data pipeline has not been converted to structured logging. As data moves through this part of the data pipeline, these metrics are emitted:
data.export.batch¶
data.export.batch is a counter of the report batches exported to external
and internal targets.
Tags:
key: The export key, from the export table. Keys used in Mozilla production:
  backup: Reports archived in S3
  tostage: Reports sent from production to stage, as a form of integration testing
  internal: Reports queued for processing to update the internal station database
data.export.upload¶
data.export.upload is a counter that tracks the status of export jobs.
Tags:
key: The export key, from the export table. Keys used in Mozilla production are backup and tostage, with the same meaning as data.export.batch. Unlike that metric, internal is not used.
status: The status of the export, which varies by type of export:
  backup: success or failure storing the report to S3
  tostage: HTTP code returned by the submission API, usually 200 for success or 400 for failure.
data.export.upload.timing¶
data.export.upload.timing is a timer for the report batch export process.
Tags:
key: The export key, from the export table. See data.export.batch for the values used in Mozilla production.
data.observation.drop¶
data.observation.drop is a counter of the Bluetooth, cell, or WiFi
observations that were discarded before integration due to some
internal consistency, range or validity-condition error encountered while
attempting to normalize the observation.
Tags:
key: The API key, often a UUID. Omitted if unknown or not available
type: The station type, one of blue, cell, or wifi
data.observation.upload¶
data.observation.upload is a counter of the number of Bluetooth, cell or
WiFi observations entering the data processing pipeline, before
normalization and blocked station processing. This count is taken after a batch
of reports is decomposed into observations.
The tags (key and type) are the same as data.observation.drop.
data.report.drop¶
data.report.drop is a counter of the reports discarded due to
some internal consistency, range, or validity-condition error.
Tags:
key: The API key, often a UUID. Omitted if unknown or not available
data.report.upload¶
data.report.upload is a counter of the reports accepted into the data
processing pipeline.
It has the same tag (key) as data.report.drop.
Data Pipeline Metrics - Update Internal Database¶
The internal export process decomposes reports into observations, pairing one position with one station. Each observation works its way through a process of normalization, consistency-checking, and (possibly) integration into the database, to improve future location estimates.
The data pipeline has not been converted to structured logging. As data moves through the pipeline, these metrics are emitted:
data.observation.insert¶
data.observation.insert is a counter of the Bluetooth, cell, or WiFi
observations that were successfully validated, normalized, and integrated.
Tags:
type: The station type, one of blue, cell, or wifi
data.station.blocklist¶
data.station.blocklist is a counter of the Bluetooth, cell, or WiFi
stations that are blocked from being used to estimate positions.
These are added because there are multiple valid observations at
sufficiently different locations, supporting the theory that it is a
mobile station (such as a picocell or a mobile hotspot on public transit),
or was recently moved (such as a WiFi base station that moved with the
owner to a new home).
Tags:
type: The station type, one of blue, cell, or wifi
data.station.confirm¶
data.station.confirm is a counter of the Bluetooth, cell or WiFi
stations that were confirmed to still be active. An observation
from a location query can be used to confirm a station with a position based
on submission reports.
It has the same tag (type) as data.station.blocklist
data.station.dberror¶
data.station.dberror is a counter of retryable database errors, which are
encountered as multiple task threads attempt to update the internal database.
Retryable database errors, like a lock timeout (1205) or deadlock
(1213) cause the station updating task to sleep and start over. Other
database errors are not counted, but instead halt the task and are recorded in
Sentry.
Tags:
errno: The error number, which can be found in the MySQL Server Error Reference
type: The station type, one of blue, cell, or wifi, or the aggregate station type cellarea
data.station.new¶
data.station.new is a counter of the Bluetooth, cell or WiFi
stations that were discovered for the first time.
Tags:
type: The station type, one of blue, cell, or wifi
datamaps.dberror¶
datamaps.dberror is a counter of the number of retryable database errors
when updating the datamaps tables.
Tags:
errno: The error number, same as data.station.dberror
Backend Monitoring Metrics¶
queue¶
queue is a gauge that reports the current size of task and data queues.
Queues are implemented as Redis lists, with a length returned by LLEN.
Task queues hold the backlog of celery async tasks. The names of the task queues are:
celery_blue, celery_cell, celery_wifi - A task to process a chunk of observation data
celery_content - Tasks that update website content, like the datamaps and statistics
celery_default - A generic task queue
celery_export - Tasks exporting data, either public cell data or the Data Pipeline
celery_monitor - Tasks updating metrics gauges for this metric and API Monitoring Metrics
celery_reports - Tasks handling batches of submission reports or location queries
Data queues are the backlog of observations and other data items to be processed. Data queues have names that mirror the shared database tables:
update_blue_0 through update_blue_f (16 total) - Observations of Bluetooth stations
update_cell_gsm, update_cell_lte, and update_cell_wcdma - Observations of cell stations
update_cell_area - Aggregated observations of cell towers (data_type: cellarea)
update_datamap_ne, update_datamap_nw, update_datamap_se, and update_datamap_sw - Approximate locations for the contribution map
update_incoming - Incoming reports from geolocate and submission APIs
update_wifi_0 through update_wifi_f (16 total) - Observations of WiFi stations
Tags:
queue: The name of the task or data queue
queue_type: task or data
data_type: For data queues, bluetooth, cell, cellarea, datamap, report (queue update_incoming), or wifi. Omitted for task queues.
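The relationship between queue names and the three tags can be summarized with a small lookup. This is an illustrative sketch derived only from the queue names and tag values listed above. The function name queue_tags is hypothetical, and the real task may derive these values differently.

```python
def queue_tags(queue: str) -> dict:
    """Return the tags for a queue gauge, based on the queue name."""
    # Task queues carry no data_type tag.
    if queue.startswith("celery_"):
        return {"queue": queue, "queue_type": "task"}

    if queue == "update_incoming":
        data_type = "report"
    elif queue == "update_cell_area":
        # Checked before the generic update_cell_ prefix below.
        data_type = "cellarea"
    elif queue.startswith("update_blue_"):
        data_type = "bluetooth"
    elif queue.startswith("update_cell_"):
        data_type = "cell"
    elif queue.startswith("update_datamap_"):
        data_type = "datamap"
    elif queue.startswith("update_wifi_"):
        data_type = "wifi"
    else:
        raise ValueError(f"unknown queue: {queue}")
    return {"queue": queue, "queue_type": "data", "data_type": data_type}


print(queue_tags("update_wifi_a"))
# {'queue': 'update_wifi_a', 'queue_type': 'data', 'data_type': 'wifi'}
```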
task¶
task is a timer that measures how long each Celery task takes. Celery tasks
are used to implement the data pipeline and monitoring tasks.
Tags:
task: The task name, such as data.export_reports or data.update_statcounter
Rate Control Metrics¶
The optional rate controller can be used to dynamically set the global locate sample rate and prevent the data queues from growing without bounds. There are several metrics emitted to monitor the rate controller.
rate_control.locate¶
rate_control.locate is a gauge that reports the current setting of the
global locate sample rate, which may be unset
(100.0), manually set, set by the rate controller, or set to 0 by the
transaction history monitor.
rate_control.locate.target¶
rate_control.locate.target is a gauge that reports the current target queue
size of the rate controller. It is emitted when the rate controller is enabled.
rate_control.locate.kp¶
rate_control.locate.kp is a gauge that reports the current value of
Kp, the proportional gain. It is emitted when the rate controller is enabled.
rate_control.locate.ki¶
rate_control.locate.ki is a gauge that reports the current value of
Ki, the integral gain. It is emitted when the rate controller is enabled.
rate_control.locate.kd¶
rate_control.locate.kd is a gauge that reports the current value of Kd, the derivative gain. It is emitted when the rate controller is
enabled.
rate_control.locate.pterm¶
rate_control.locate.pterm is a gauge that reports the current value of
the proportional term of the rate controller. It is emitted when the rate
controller is enabled.
rate_control.locate.iterm¶
rate_control.locate.iterm is a gauge that reports the current value of
the integral term of the rate controller. It is emitted when the rate
controller is enabled.
rate_control.locate.dterm¶
rate_control.locate.dterm is a gauge that reports the current value of
the derivative term of the rate controller. It is emitted when the rate
controller is enabled.
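To show how the gains (Kp, Ki, Kd) relate to the three terms, here is a textbook discrete PID step. It is a generic sketch, not the rate controller's implementation: the class name, the sign of the error, the fixed time step, and the example numbers are all assumptions, and the mapping from the controller output to the sample rate is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDSketch:
    kp: float      # proportional gain
    ki: float      # integral gain
    kd: float      # derivative gain
    target: float  # target queue size
    integral: float = 0.0
    last_error: Optional[float] = None

    def update(self, queue_size: float, dt: float = 1.0) -> dict:
        """Compute the three terms for one observation of the queue size."""
        # Assumed sign convention: positive error means the queue is below target.
        error = self.target - queue_size
        self.integral += error * dt
        if self.last_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.last_error) / dt
        self.last_error = error
        pterm = self.kp * error
        iterm = self.ki * self.integral
        dterm = self.kd * derivative
        return {"pterm": pterm, "iterm": iterm, "dterm": dterm,
                "output": pterm + iterm + dterm}


pid = PIDSketch(kp=0.5, ki=0.25, kd=0.125, target=100)
print(pid.update(80))   # queue below target: all terms are non-negative
print(pid.update(110))  # queue above target: pterm and dterm turn negative
```

Each gauge in this section corresponds to one of the values in this sketch: the gains and target are configuration, while the three terms change on every update.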
Transaction History Metrics¶
Processing the data queues can cause the MySQL InnoDB transaction history to grow faster than it can be purged. The transaction history length can be monitored, and when it exceeds a maximum, it can turn off observation processing until it is reduced. See transaction history monitoring for details.
Monitoring the transaction history length requires that the celery worker
database connection (“read-write”) has the PROCESS privilege. If the
connection does not have this privilege, then no related metrics are emitted.
If the connection has this privilege, then one or more metrics are emitted to
monitor this process:
trx_history.length¶
trx_history.length is a gauge that reports the current length of the
InnoDB transaction history.
trx_history.purging¶
If the rate controller is enabled, then trx_history.purging is a gauge that
becomes 1 when the transaction history exceeds the maximum level.
Observation processing is paused by setting the
global locate sample rate to 0%, which allows the
MySQL purge process to reduce the transaction history. When it drops below a
safe minimum level, the rate is allowed to rise again.
If the rate controller is not enabled, then purging mode is not used, and this metric is not emitted.
trx_history.max¶
trx_history.max is a gauge that reports the current maximum value for the
transaction history length before the system switches to purging mode.
If the rate controller is not enabled, then purging mode is not used, and this metric is not emitted.
trx_history.min¶
trx_history.min is a gauge that reports the current minimum value for the
transaction history length before the system switches out of purging mode.
If the rate controller is not enabled, then purging mode is not used, and this metric is not emitted.
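The interaction of trx_history.length, trx_history.max, trx_history.min, and trx_history.purging amounts to a hysteresis switch. The sketch below illustrates that logic only. The function name and the threshold numbers are hypothetical, and the real monitor may treat the boundary values differently.

```python
def next_purging_state(purging: bool, length: int, min_len: int, max_len: int) -> bool:
    """Return the purging mode after observing the transaction history length."""
    if not purging and length > max_len:
        # History exceeded the maximum: pause observation processing.
        return True
    if purging and length < min_len:
        # History dropped below the safe minimum: resume processing.
        return False
    # Between the thresholds, keep the current mode.
    return purging


# Hypothetical thresholds and a sequence of observed history lengths.
state = False
for length in [500, 1200, 900, 400, 90, 300]:
    state = next_purging_state(state, length, min_len=100, max_len=1000)
    print(length, int(state))
```

Using two thresholds instead of one avoids toggling the sample rate on and off when the history length hovers near a single limit.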
Datamaps Structured Log¶
The datamap script generates a data map from the gathered observations. It does not emit metrics.
The final canonical-log-line log entry has this data:
bucket_name: The name of the S3 bucket
concurrency: The number of concurrent threads used
create: True if --create was set to generate tiles
csv_converted_count: How many CSV files were converted to quadtrees
csv_count: How many CSV files were exported from the database
duration_s: How long in seconds to run the script
export_duration_s: How long in seconds to export from tables to CSV
intermediate_quadtree_count: How many partial quadtrees were created (due to multiple CSVs exported from large tables) and merged into one per-table quadtree
merge_duration_s: How long in seconds to merge the per-table quadtrees
quadtree_count: How many per-table quadtrees were generated
quadtree_duration_s: How long in seconds to convert CSV to quadtrees
render_duration_s: How long in seconds to render the merged quadtree to tiles
row_count: The number of rows across datamap tables
script_name: The name of the script (ichnaea.scripts.datamap)
success: True if the script completed without errors
sync_duration_s: How long in seconds it took to upload tiles to S3
tile_changed: How many existing S3 tiles were updated
tile_count: The total number of tiles generated
tile_deleted: How many existing S3 tiles were deleted
tile_failed: How many upload or deletion failures
tile_new: How many new tiles were uploaded to S3
tile_unchanged: How many tiles were the same as the S3 tiles
upload: True if --upload was set to upload / sync tiles
Much of this data is also found in the file tiles/data.json in the S3
bucket for the most recent run.
Implementation¶
Ichnaea emits statsd-compatible metrics using markus, if the STATSD_HOST
is configured (see the config section). Metrics use the
tags extension, which adds queryable dimensions to the metrics. In development,
the metrics are displayed with the logs. In production, the metrics are stored
in an InfluxDB database, and can be displayed as graphs with Grafana.
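For readers unfamiliar with tagged statsd metrics, the sketch below formats a metric line in the Datadog-style tag syntax (name:value|type|#tag:value,...). That this is the exact wire format Ichnaea's backend sends is an assumption, and the helper name statsd_line is hypothetical. In practice the markus library builds and sends these lines.

```python
def statsd_line(name, value, metric_type, tags=None):
    """Format one statsd datagram, with optional Datadog-style tags."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags are appended as a comma-separated list of key:value pairs.
        line += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return line


# A counter ("c") for one request, and a timer ("ms") for its duration.
print(statsd_line("request", 1, "c",
                  {"path": "stats.regions", "method": "get", "status": 200}))
# request:1|c|#path:stats.regions,method:get,status:200
print(statsd_line("request.timing", 12, "ms",
                  {"path": "stats.regions", "method": "get"}))
# request.timing:12|ms|#path:stats.regions,method:get
```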
Ichnaea also emits structured logs using structlog. In development, these are displayed in a human-friendly format. In production, they use the MozLog JSON format, and the data is stored in BigQuery.
In the past, metrics were the main source of runtime data, and tags were used to segment the metrics and provide insights. However, metric tags and their values were limited to avoid performance issues. InfluxDB and other time-series databases store metrics by the indexed series of tag values. This performs well when tags have a small number of unique values, and the combinations of tags are limited. When tags have many unique values and are combined, the number of possible series can explode and cause storage and performance issues (the “high cardinality” problem).
Metric tag values are limited to avoid high cardinality issues. For example,
rather than storing the number of WiFi stations, the wifi tag of the
locate.query metric has the values none, one, and many. The
region, such as US or DE, was once stored as a tag, but this can have
almost 250 values, causing MLS to have the highest processing load across
Mozilla projects.
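The effect of a high-cardinality tag can be seen with simple arithmetic: the upper bound on the number of series is the product of the number of distinct values of each tag. The sketch below uses the locate.query tags as an example. The counts are illustrative (geoip is counted as two states, false or omitted, and region is rounded to 250), and the result is a per-API-key upper bound rather than a measured figure.

```python
from math import prod


def series_count(tag_value_counts: dict) -> int:
    """Upper bound on series: the product of distinct values per tag."""
    return prod(tag_value_counts.values())


# Distinct values per tag for locate.query, for a single API key.
without_region = {"geoip": 2, "blue": 3, "cell": 3, "wifi": 3}
with_region = {**without_region, "region": 250}

print(series_count(without_region))  # 54
print(series_count(with_region))     # 13500
```

Multiplying by the number of API keys grows the bound further, which is why exact counts and region codes are kept in structured logs instead.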
BigQuery easily handles high-cardinality data, so structured logs can contain precise values, such as the actual number of WiFi stations provided, and more items, such as the region and unexpected keys. On the other hand, there isn’t a friendly tool like Grafana to quickly explore the data.
As of 2020, we are in the process of duplicating data from metrics into structured logging, expanding the data collected, and creating dashboards. We’ll also remove data from metrics, first to reduce the current issues around high-cardinality, then to focus metrics on operational data. Structured data will be used for service analysis and monitoring of long-term trends, and dashboards created for reference.