Metrics and Structured Logs¶
Ichnaea provides two classes of runtime data:
- Statsd-style metrics, for real-time monitoring and easy visual analysis of trends
- Structured logs, for offline analysis of data and targeted custom reports
Structured logs were added in 2020, and the migration of data from metrics to logs is not complete. For more information, see the Implementation section.
Metrics are emitted by the web / API application, the backend task application, and the datamaps script:
| Metric Name | App | Type | Tags |
|---|---|---|---|
| api.limit | task | gauge | key, path |
| data.batch.upload | web | counter | key |
| data.export.batch | task | counter | key |
| data.export.upload | task | counter | key, status |
| data.export.upload.timing | task | timer | key |
| data.observation.drop | task | counter | type, key |
| data.observation.insert | task | counter | type |
| data.observation.upload | task | counter | type, key |
| data.report.drop | task | counter | key |
| data.report.upload | task | counter | key |
| data.station.blocklist | task | counter | type |
| data.station.confirm | task | counter | type |
| data.station.dberror | task | counter | type, errno |
| data.station.new | task | counter | type |
| datamaps.dberror | task | counter | errno |
| locate.fallback.cache | web | counter | fallback_name, status |
| locate.fallback.lookup | web | counter | fallback_name, status |
| locate.fallback.lookup.timing | web | timer | fallback_name, status |
| locate.query | web | counter | key, geoip, blue, cell, wifi |
| locate.request | web | counter | key, path |
| locate.result | web | counter | key, accuracy, status, source, fallback_allowed |
| locate.source | web | counter | key, accuracy, status, source, fallback_allowed |
| locate.user | task | gauge | key, interval |
| queue | task | gauge | data_type, queue, queue_type |
| rate_control.locate | task | gauge | |
| rate_control.locate.dterm | task | gauge | |
| rate_control.locate.iterm | task | gauge | |
| rate_control.locate.kd | task | gauge | |
| rate_control.locate.ki | task | gauge | |
| rate_control.locate.kp | task | gauge | |
| rate_control.locate.pterm | task | gauge | |
| rate_control.locate.target | task | gauge | |
| region.query | web | counter | key, geoip, blue, cell, wifi |
| region.request | web | counter | key, path |
| region.result | web | counter | key, accuracy, status, source, fallback_allowed |
| region.source | web | counter | key, accuracy, status, source, fallback_allowed |
| region.user | task | gauge | key, interval |
| request | web | counter | path, method, status |
| request.timing | web | timer | path, method |
| submit.request | web | counter | key, path |
| submit.user | task | gauge | key, interval |
| task | task | timer | task |
| trx_history.length | task | gauge | |
| trx_history.max | task | gauge | |
| trx_history.min | task | gauge | |
| trx_history.purging | task | gauge | |
Web Application Metrics¶
The website handles HTTP requests, which may be page requests or API calls.
Request Metrics¶
Each HTTP request, including API calls, emits metrics and a structured log entry.
request¶
`request` is a counter for almost all HTTP requests, including API calls. The exceptions are static assets, like CSS, images, JavaScript, and fonts, as well as some static content like `robots.txt`. Additionally, invalid requests (HTTP status in the `4xx` range) do not emit this metric, unless they are API endpoints.
The `path` tag is the request path, like `/stats/regions`, but normalized to tag-safe characters. The initial slash is dropped, and remaining slashes are replaced with periods, so `/stats/regions` becomes `stats.regions`. The homepage, `/`, is normalized as `.homepage` to avoid an empty tag value.
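The normalization rules above can be sketched as a small helper (a hypothetical function name; the real implementation lives in the ichnaea source and may differ in detail):

```python
def normalize_path(path: str) -> str:
    """Normalize an HTTP path to a tag-safe metrics value."""
    if path == "/":
        # The homepage gets a special value to avoid an empty tag.
        return ".homepage"
    # Drop the initial slash, replace remaining slashes with periods.
    return path.lstrip("/").replace("/", ".")
```

For example, `normalize_path("/stats/regions")` yields `stats.regions`, and `normalize_path("/")` yields `.homepage`.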
Tags:

- `path`: The metrics-normalized HTTP path, like `stats.regions`, `v1.geolocate`, and `.homepage`
- `method`: The HTTP method in lowercase, like `post`, `get`, `head`, and `options`
- `status`: The returned HTTP status, like `200` for success and `400` for client errors
Related structured log data:

- `http_method`: The non-normalized HTTP method
- `http_path`: The non-normalized request path
- `http_status`: The HTTP status code
request.timing¶
`request.timing` is a timer for how long the HTTP request took to complete, in milliseconds.

Tags:

- The tags `path` and `method` are the same as request. The tag `status` is omitted.

Related structured log data:

- `duration_s`: The time the request took in seconds, rounded to the millisecond
API Metrics¶
These metrics are emitted when the API is called.
data.batch.upload¶
`data.batch.upload` is a counter that is incremented when a submit API, like /v2/geosubmit, is called with any valid data. A submission batch could contain a single report or multiple reports, but both would increment `data.batch.upload` by one. A batch with no (valid) reports does not increment this metric.
Tags:

- `key`: The API key, often a UUID, or omitted if the API key is not valid

Related structured log data:

- `api_key`: The same value as tag `key` for valid keys
locate.query¶
`locate.query` is a counter, incremented each time the Geolocate API is used with a valid API key that is not rate limited. It is used to segment queries by the station data contained in the request body.
Tags:

- `key`: The API key, often a UUID
- `geoip`: `false` if there was no GeoIP data, omitted when there is GeoIP data for the client IP (the common case)
- `blue`: Count of valid Bluetooth stations in the request, `none`, `one`, or `many`
- `cell`: Count of valid cell stations in the request, `none`, `one`, or `many`
- `wifi`: Count of valid WiFi stations in the request, `none`, `one`, or `many`
Changed in version 2020.04.16: Removed the `region` tag
Related structured log data:
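The `none`/`one`/`many` values keep tag cardinality low while still segmenting queries by station data. A minimal sketch of the bucketing (hypothetical helper name, not the actual ichnaea function):

```python
def station_bucket(count: int) -> str:
    """Bucket a station count into a low-cardinality tag value."""
    if count <= 0:
        return "none"
    if count == 1:
        return "one"
    return "many"

# Tags for a query with 0 Bluetooth, 1 cell, and 3 WiFi stations:
tags = {
    "blue": station_bucket(0),
    "cell": station_bucket(1),
    "wifi": station_bucket(3),
}
```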
locate.request¶
`locate.request` is a counter, incremented for each call to the Geolocate API.

Tags:

- `key`: The API key, often a UUID, or `invalid` for a known key that can not call the API, or `none` for an omitted key
- `path`: `v1.geolocate`, the standardized API path
Related structured log data:
- `api_key`: The same value as tag `key`, except that instead of `invalid`, the request key is used, along with `api_key_allowed=False`
- `api_key_allowed`: `False` when the key is not allowed to use the API
- `api_path`: The same value as tag `path`
- `api_type`: The value `locate`
locate.result¶
`locate.result` is a counter, incremented for each call to the Geolocate API with a valid API key that is not rate limited.

If there are no Bluetooth, Cell, or WiFi networks provided, and GeoIP data is not available (for example, the IP fallback is explicitly disabled), then this metric is not emitted.
Tags:
- `key`: The API key, often a UUID
- `accuracy`: The expected accuracy, based on the sources provided:
  - `high`: At least two Bluetooth or WiFi networks
  - `medium`: No Bluetooth or WiFi networks, at least one cell network
  - `low`: No networks, only GeoIP data
- `status`: Could we provide a location estimate? `hit` if we can provide a location with the expected accuracy, `miss` if we can not. For cell networks (`accuracy=medium`), a `hit` includes the case where there is not an exact cell match, but the cell area (the area covered by related cells) is small enough (smaller than tens of kilometers across) for an estimate.
- `source`: The source that provided the hit (omitted when `status=miss`):
  - `internal`: Our crowd-sourced network data
  - `geoip`: The MaxMind GeoIP database
  - `fallback`: An optional external fallback provider
- `fallback_allowed`: `true` if the external fallback provider was allowed; omitted if it was not
Changed in version 2020.04.16: Removed the `region` tag
Related structured log data:
- `accuracy`: The accuracy level of the result, `high`, `medium`, or `low`
- `accuracy_min`: The same value as tag `accuracy`
- `api_key`: The same value as tag `key`
- `result_status`: The same value as tag `status`
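The expected-accuracy rules can be sketched as a small classifier (a hypothetical helper, derived only from the documented tag values; the real logic may treat edge cases like a single WiFi network differently):

```python
def expected_accuracy(blue, wifi, cell, has_geoip):
    """Derive the expected accuracy tag from the query's data sources.

    At least two Bluetooth or WiFi networks -> high; otherwise at least
    one cell network -> medium; otherwise GeoIP only -> low. Returns
    None when no source is available, the case where locate.result is
    not emitted at all.
    """
    if blue >= 2 or wifi >= 2:
        return "high"
    if cell >= 1:
        return "medium"
    if has_geoip:
        return "low"
    return None
```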
locate.source¶
`locate.source` is a counter, incremented for each processed source in a location query. If station data (Bluetooth, WiFi, and cell data) is provided, this usually results in two metrics for one request, one for the `internal` source and one for the `geoip` source.

The required accuracy for a `hit` is set by the kind of station data in the request. For example, a request with no station data requires a `low` accuracy, while one with multiple WiFi networks requires a `high` accuracy.

The `high` accuracy is at least 500 meters, and the minimum current MaxMind accuracy is 1000 meters, so the `geoip` source is expected to have a `miss` status when accuracy is `high`.
Tags (similar to locate.result):

- `key`: The API key, often a UUID
- `accuracy`: The expected accuracy, based on the sources provided:
  - `high`: At least two Bluetooth or WiFi networks
  - `medium`: No Bluetooth or WiFi networks, at least one cell network
  - `low`: No networks, only GeoIP data
- `status`: Could we provide a location estimate? `hit` if we can provide a location with the expected accuracy, `miss` if we can not
- `source`: The source that was processed:
  - `internal`: Our crowd-sourced network data
  - `geoip`: The MaxMind GeoIP database
  - `fallback`: An optional external fallback provider
- `fallback_allowed`: `true` if the external fallback provider was allowed; omitted if it was not
Changed in version 2020.04.16: Removed the `region` tag
Related structured log data:
- `api_key`: The same value as tag `key`
- `source_internal_accuracy`: The accuracy level of the internal source
- `source_internal_accuracy_min`: The required accuracy level of the internal source, same value as tag `accuracy` when `source=internal`
- `source_internal_status`: The same value as tag `status` when `source=internal`
- `source_geoip_accuracy`: The accuracy level of the GeoIP source
- `source_geoip_accuracy_min`: The required accuracy level of the GeoIP source, same value as tag `accuracy` when `source=geoip`
- `source_geoip_status`: The same value as tag `status` when `source=geoip`
- `source_fallback_accuracy`: The accuracy level of the external fallback source
- `source_fallback_accuracy_min`: The required accuracy level of the fallback source, same value as tag `accuracy` when `source=fallback`
- `source_fallback_status`: The same value as tag `status` when `source=fallback`
region.query¶
`region.query` is a counter, incremented each time the Region API is used with a valid API key. It is used to segment queries by the station data contained in the request body.

It has the same tags (`key`, `geoip`, `blue`, `cell`, and `wifi`) as locate.query.
region.request¶
`region.request` is a counter, incremented for each call to the Region API.

It has the same tags (`key` and `path`) as locate.request, except the `path` tag is `v1.country`, the standardized API path.
region.result¶
`region.result` is a counter, incremented for each call to the Region API with a valid API key that is not rate limited.

If there are no Bluetooth, Cell, or WiFi networks provided, and GeoIP data is not available (for example, the IP fallback is explicitly disabled), then this metric is not emitted.

It has the same tags (`key`, `accuracy`, `status`, `source`, and `fallback_allowed`) as locate.result.
region.source¶
`region.source` is a counter, incremented for each processed source in a region query. If station data (Bluetooth, WiFi, and cell data) is provided, this usually results in two metrics for one request, one for the `internal` source and one for the `geoip` source. In practice, most users provide no station data, and only the `geoip` source is emitted.

It has the same tags (`key`, `accuracy`, `status`, `source`, and `fallback_allowed`) as locate.source.
submit.request¶
`submit.request` is a counter, incremented for each call to a submit API. This counter can be used to determine when the deprecated APIs can be removed.

It has the same tags (`key` and `path`) as locate.request, except the `path` tag is `v2.geosubmit`, `v1.submit`, or `v1.geosubmit`, the standardized API path.
API Fallback Metrics¶
These metrics were emitted when the fallback location provider was called. MLS stopped using this feature in 2019, so these metrics are not emitted, but the code remains as of 2020.
These metrics have not been converted to structured logs.
locate.fallback.cache¶
`locate.fallback.cache` is a counter for the performance of the fallback cache.

Tags:

- `fallback_name`: The name of the external fallback provider, from the API key table
- `status`: The status of the fallback cache:
  - `hit`: The cache had a previous result for the query
  - `miss`: The cache did not have a previous result for the query
  - `bypassed`: The cache was not used, due to mixed stations in the query, or the high number of individual stations
  - `inconsistent`: The cached results were for multiple inconsistent locations
  - `failure`: The cache was unreachable
locate.fallback.lookup¶
`locate.fallback.lookup` is a counter for the HTTP response codes returned from the fallback server.

Tags:

- `fallback_name`: The name of the external fallback provider, from the API key table
- `status`: The HTTP status code, such as `200`
locate.fallback.lookup.timing¶
`locate.fallback.lookup.timing` is a timer for the call to the fallback location server.

Tags:

- `fallback_name`: The name of the external fallback provider, from the API key table
- `status`: The HTTP status code, such as `200`
Web Application Structured Logs¶
There is one structured log emitted for each request, which may be an API request. The structured log data includes data that was emitted as one or more metrics.
Request Metrics¶
All requests, with the exception of static assets and static views (see request), include this data:
- `duration_s`: The time in seconds, rounded to the millisecond, to serve the request
- `http_method`: The HTTP method, like `POST` or `GET`
- `http_path`: The request path, like `/` for the homepage, or `/v1/geolocate` for the API
- `http_status`: The response status, like `200` or `400`

This data is duplicated in the request and request.timing metrics.
API Metrics¶
If a request is an API call, additional data can be added to the log:
- `accuracy`: The accuracy of the result, `high`, `medium`, or `low`
- `accuracy_min`: The minimum required accuracy of the result for a hit, `high`, `medium`, or `low`
- `api_key`: An API key that has an entry in the API key table, often a UUID, or `none` if omitted. Same as statsd tag `key`, except that known but disallowed API keys are the key value, rather than `invalid`
- `api_key_allowed`: `False` if a known API key is not allowed to call the API, omitted otherwise
- `api_key_db_fail`: `True` when a database error prevented checking the API key. Omitted when the check is successful
- `api_path`: The normalized API path, like `v1.geolocate` and `v2.geosubmit`. Same as statsd tag `path` when an API is called
- `api_response_sig`: A hash to identify repeated geolocate requests getting the same response, without identifying the client
- `api_type`: The API type, `locate`, `submit`, or `region`
- `blue`: The count of Bluetooth radios in the request
- `blue_valid`: The count of valid Bluetooth radios in the request
- `cell`: The count of cell tower radios in the request
- `cell_valid`: The count of valid cell tower radios in the request
- `fallback_allowed`: `True` if the optional fallback location provider can be used by this API key, `False` if not
- `has_geoip`: `True` if there is GeoIP data for the client IP, otherwise `False`
- `has_ip`: `True` if the client IP was available, otherwise `False`
- `invalid_api_key`: The invalid API key not found in the API table, omitted if known or empty
- `rate_allowed`: `True` if allowed, `False` if not allowed due to the rate limit, or omitted if the API is not rate-limited
- `rate_quota`: The daily rate limit, or omitted if the API is not rate-limited
- `rate_remaining`: The remaining API calls before hitting the limit, 0 if none remaining, or omitted if the API is not rate-limited
- `region`: The ISO region code for the IP address, `null` if none
- `result_status`: `hit` if an accurate estimate could be made, `miss` if it could not
- `source_fallback_accuracy`: The accuracy level of the external fallback source, `high`, `medium`, or `low`
- `source_fallback_accuracy_min`: The required accuracy level of the fallback source
- `source_fallback_status`: `hit` if the fallback source provided an accurate estimate, `miss` if it did not
- `source_internal_accuracy`: The accuracy level of the internal source (Bluetooth, WiFi, and cell data compared against the database), `high`, `medium`, or `low`
- `source_internal_accuracy_min`: The required accuracy level of the internal source
- `source_internal_status`: `hit` if the internal check provided an accurate estimate, `miss` if it did not
- `source_geoip_accuracy`: The accuracy level of the GeoIP source, `high`, `medium`, or `low`
- `source_geoip_accuracy_min`: The required accuracy level of the GeoIP source
- `source_geoip_status`: `hit` if the GeoIP database provided an accurate estimate, `miss` if it did not
- `wifi`: The count of WiFi radios in the request
- `wifi_valid`: The count of valid WiFi radios in the request

Some of this data is duplicated in the API metrics described above.
Task Application Metrics¶
The task application, running on celery in the backend, implements the data pipeline and other periodic tasks. These emit metrics, but have not been converted to structured logging.
API Monitoring Metrics¶
These metrics are emitted periodically to monitor API usage. A Redis key is incremented or updated during API requests, and the current value is reported via these metrics:
api.limit¶
`api.limit` is a gauge of the API requests, segmented by API key and API path, for keys with daily limits. It is updated every 10 minutes.

Tags:

- `key`: The API key, often a UUID
- `path`: The normalized API path, such as `v1.geolocate` or `v2.geosubmit`

Related structured log data is added during the request when an API key has rate limits:

- `rate_allowed`: `True` if the request was allowed, `False` if not allowed due to the rate limit
- `rate_quota`: The daily rate limit
- `rate_remaining`: The remaining API calls before hitting the limit, 0 if none remaining
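The daily limit check can be sketched as follows. This is a simplified illustration with hypothetical names: the real service keeps per-key, per-day counters in Redis, and a plain dict stands in for that here.

```python
def check_rate_limit(counters, api_key, day, quota):
    """Increment a per-key, per-day counter and apply the daily quota.

    Returns the values logged as rate_allowed and rate_remaining.
    """
    redis_key = f"apilimit:{api_key}:{day}"
    count = counters.get(redis_key, 0) + 1  # the INCR equivalent
    counters[redis_key] = count
    allowed = count <= quota
    remaining = max(quota - count, 0)
    return allowed, remaining

counters = {}
# With a quota of 2, the third request of the day is rejected.
results = [check_rate_limit(counters, "abcd", "2020-07-14", 2) for _ in range(3)]
```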
locate.user¶
`locate.user` is a gauge of the estimated number of daily and weekly users of the Geolocate API, by API key. It is updated every 10 minutes.

The estimate is based on the client’s IP address. At request time, the IP is added via PFADD to a HyperLogLog structure. This structure can be used to estimate the cardinality (the number of unique IP addresses) to within about 1%. See PFCOUNT for details on the HyperLogLog implementation.

Tags:

- `key`: The API key, often a UUID
- `interval`: `1d` for the daily estimate, `7d` for the weekly estimate
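The PFADD/PFCOUNT flow can be sketched as below. The key layout and helper names are hypothetical, and an exact in-memory set stands in for Redis, so counts here are exact rather than the ~1% HyperLogLog estimates.

```python
class FakeHyperLogLog:
    """In-memory stand-in for the Redis PFADD/PFCOUNT commands."""

    def __init__(self):
        self.keys = {}

    def pfadd(self, key, *values):
        self.keys.setdefault(key, set()).update(values)

    def pfcount(self, key):
        return len(self.keys.get(key, set()))


def track_user(client, api_key, ip, day, week):
    # One HyperLogLog per API key and interval (hypothetical key layout).
    client.pfadd(f"apiuser:locate:{api_key}:{day}", ip)
    client.pfadd(f"apiuser:locate:{api_key}:{week}", ip)


client = FakeHyperLogLog()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    track_user(client, "abcd", ip, "2020-07-14", "2020-W29")
# The repeated IP is only counted once per interval.
```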
region.user¶
`region.user` is a gauge of the estimated number of daily and weekly users of the Region API, by API key. It is updated every 10 minutes.

It has the same tags (`key` and `interval`) as locate.user.
submit.user¶
`submit.user` is a gauge of the estimated number of daily and weekly users of the submit APIs (/v2/geosubmit and the deprecated submit APIs), by API key. It is updated every 10 minutes.

It has the same tags (`key` and `interval`) as locate.user.
Data Pipeline Metrics - Gather and Export¶
The data pipeline processes data from two sources:
- Submission reports, from the submission APIs, which include a position from an external source like GPS, along with the WiFi, cell, and Bluetooth stations that were seen.
- Location queries, from the geolocate and region APIs, which include an estimated position, along with the stations.

Multiple reports can be submitted in one call to the submission APIs. Each batch of reports increments the data.batch.upload metric when the API is called. A single report is created for each location query, and there is no corresponding metric.
The APIs feed these reports into a Redis queue, `update_incoming`, processed by the backend task of the same name. This task copies reports to “export” queues. Four types are supported:

- `dummy`: Does nothing, for pipeline testing
- `geosubmit`: POST reports to a service supporting the Geosubmit API
- `internal`: Divide reports into observations, for further processing to update the internal database
- `s3`: Store report JSON in S3
Ichnaea supports multiple export targets for a type. In production, there are three export targets, identified by an export key:

- `backup`: An `s3` export, to a Mozilla-private S3 bucket
- `tostage`: A `geosubmit` export, to send a sample of reports to stage for integration testing
- `internal`: An `internal` export, to update the database
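The fan-out from `update_incoming` to the export queues can be sketched as below, with plain lists standing in for the Redis queues (the function and variable names are illustrative, not the actual ichnaea API):

```python
# Export keys and types matching the production configuration above.
EXPORT_CONFIG = {
    "backup": "s3",          # archive report JSON in S3
    "tostage": "geosubmit",  # re-POST a sample of reports to stage
    "internal": "internal",  # split into observations for the database
}

def fan_out(reports, export_queues):
    """Copy a batch of incoming reports to every configured export queue."""
    for export_key in EXPORT_CONFIG:
        export_queues.setdefault(export_key, []).extend(reports)

queues = {}
fan_out([{"wifi": []}, {"cell": []}], queues)
# Each export queue receives its own copy of the two-report batch.
```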
The data pipeline has not been converted to structured logging. As data moves through this part of the data pipeline, these metrics are emitted:
data.export.batch¶
`data.export.batch` is a counter of the report batches exported to external and internal targets.

Tags:

- `key`: The export key, from the export table. Keys used in Mozilla production:
  - `backup`: Reports archived in S3
  - `tostage`: Reports sent from production to stage, as a form of integration testing
  - `internal`: Reports queued for processing to update the internal station database
data.export.upload¶
`data.export.upload` is a counter that tracks the status of export jobs.

Tags:

- `key`: The export key, from the export table. Keys used in Mozilla production are `backup` and `tostage`, with the same meaning as data.export.batch. Unlike that metric, `internal` is not used.
- `status`: The status of the export, which varies by type of export:
  - `backup`: `success` or `failure` storing the report to S3
  - `tostage`: The HTTP code returned by the submission API, usually `200` for success or `400` for failure
data.export.upload.timing¶
`data.export.upload.timing` is a timer for the report batch export process.

Tags:

- `key`: The export key, from the export table. See data.export.batch for the values used in Mozilla production.
data.observation.drop¶
`data.observation.drop` is a counter of the Bluetooth, cell, or WiFi observations that were discarded before integration due to some internal consistency, range, or validity-condition error encountered while attempting to normalize the observation.

Tags:

- `key`: The API key, often a UUID. Omitted if unknown or not available
- `type`: The station type, one of `blue`, `cell`, or `wifi`
data.observation.upload¶
`data.observation.upload` is a counter of the number of Bluetooth, cell, or WiFi observations entering the data processing pipeline, before normalization and blocked station processing. This count is taken after a batch of reports is decomposed into observations.

The tags (`key` and `type`) are the same as data.observation.drop.
data.report.drop¶
`data.report.drop` is a counter of the reports discarded due to some internal consistency, range, or validity-condition error.

Tags:

- `key`: The API key, often a UUID. Omitted if unknown or not available
data.report.upload¶
`data.report.upload` is a counter of the reports accepted into the data processing pipeline.

It has the same tag (`key`) as data.report.drop.
Data Pipeline Metrics - Update Internal Database¶
The internal export process decomposes reports into observations, pairing one position with one station. Each observation works its way through a process of normalization, consistency-checking, and (possibly) integration into the database, to improve future location estimates.
The data pipeline has not been converted to structured logging. As data moves through the pipeline, these metrics are emitted:
data.observation.insert¶
`data.observation.insert` is a counter of the Bluetooth, cell, or WiFi observations that were successfully validated, normalized, and integrated.

Tags:

- `type`: The station type, one of `blue`, `cell`, or `wifi`
data.station.blocklist¶
`data.station.blocklist` is a counter of the Bluetooth, cell, or WiFi stations that are blocked from being used to estimate positions. These are added because there are multiple valid observations at sufficiently different locations, supporting the theory that it is a mobile station (such as a picocell or a mobile hotspot on public transit), or was recently moved (such as a WiFi base station that moved with the owner to a new home).

Tags:

- `type`: The station type, one of `blue`, `cell`, or `wifi`
data.station.confirm¶
`data.station.confirm` is a counter of the Bluetooth, cell, or WiFi stations that were confirmed to still be active. An observation from a location query can be used to confirm a station with a position based on submission reports.

It has the same tag (`type`) as data.station.blocklist.
data.station.dberror¶
`data.station.dberror` is a counter of retryable database errors, which are encountered as multiple task threads attempt to update the internal database. Retryable database errors, like a lock timeout (`1205`) or deadlock (`1213`), cause the station updating task to sleep and start over. Other database errors are not counted, but instead halt the task and are recorded in Sentry.

Tags:

- `errno`: The error number, which can be found in the MySQL Server Error Reference
- `type`: The station type, one of `blue`, `cell`, or `wifi`, or the aggregate station type `cellarea`
data.station.new¶
`data.station.new` is a counter of the Bluetooth, cell, or WiFi stations that were discovered for the first time.

Tags:

- `type`: The station type, one of `blue`, `cell`, or `wifi`
datamaps.dberror¶
`datamaps.dberror` is a counter of the number of retryable database errors when updating the datamaps tables.

Tags:

- `errno`: The error number, same as data.station.dberror
Backend Monitoring Metrics¶
queue¶
`queue` is a gauge that reports the current size of task and data queues. Queues are implemented as Redis lists, with a length returned by LLEN.

Task queues hold the backlog of celery async tasks. The names of the task queues are:

- `celery_blue`, `celery_cell`, `celery_wifi`: Tasks to process a chunk of observation data
- `celery_content`: Tasks that update website content, like the datamaps and statistics
- `celery_default`: A generic task queue
- `celery_export`: Tasks exporting data, either public cell data or the Data Pipeline
- `celery_monitor`: Tasks updating metrics gauges for this metric and API Monitoring Metrics
- `celery_reports`: Tasks handling batches of submission reports or location queries
Data queues are the backlog of observations and other data items to be processed. Data queues have names that mirror the shared database tables:

- `update_blue_0` through `update_blue_f` (16 total): Observations of Bluetooth stations
- `update_cell_gsm`, `update_cell_lte`, and `update_cell_wcdma`: Observations of cell stations
- `update_cell_area`: Aggregated observations of cell towers (`data_type: cellarea`)
- `update_datamap_ne`, `update_datamap_nw`, `update_datamap_se`, and `update_datamap_sw`: Approximate locations for the contribution map
- `update_incoming`: Incoming reports from the geolocate and submission APIs
- `update_wifi_0` through `update_wifi_f` (16 total): Observations of WiFi stations
Tags:
- `queue`: The name of the task or data queue
- `queue_type`: `task` or `data`
- `data_type`: For data queues, `bluetooth`, `cell`, `cellarea`, `datamap`, `report` (queue `update_incoming`), or `wifi`. Omitted for task queues.
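The mapping from queue name to tags can be sketched as below (a hypothetical helper; the mapping follows the documented queue names and `data_type` values):

```python
def queue_tags(queue_name):
    """Derive the queue gauge tags from a queue name."""
    tags = {"queue": queue_name}
    if queue_name.startswith("celery_"):
        tags["queue_type"] = "task"  # data_type omitted for task queues
        return tags
    tags["queue_type"] = "data"
    if queue_name.startswith("update_blue"):
        tags["data_type"] = "bluetooth"
    elif queue_name == "update_cell_area":
        tags["data_type"] = "cellarea"  # checked before the cell prefix
    elif queue_name.startswith("update_cell"):
        tags["data_type"] = "cell"
    elif queue_name.startswith("update_datamap"):
        tags["data_type"] = "datamap"
    elif queue_name == "update_incoming":
        tags["data_type"] = "report"
    elif queue_name.startswith("update_wifi"):
        tags["data_type"] = "wifi"
    return tags
```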
task¶
`task` is a timer that measures how long each Celery task takes. Celery tasks are used to implement the data pipeline and monitoring tasks.

Tags:

- `task`: The task name, such as `data.export_reports` or `data.update_statcounter`
Rate Control Metrics¶
The optional rate controller can be used to dynamically set the global locate sample rate and prevent the data queues from growing without bounds. There are several metrics emitted to monitor the rate controller.
rate_control.locate¶
`rate_control.locate` is a gauge that reports the current setting of the global locate sample rate, which may be unset (100.0), manually set, set by the rate controller, or set to 0 by the transaction history monitor.
rate_control.locate.target¶
`rate_control.locate.target` is a gauge that reports the current target queue size of the rate controller. It is emitted when the rate controller is enabled.
rate_control.locate.kp¶
`rate_control.locate.kp` is a gauge that reports the current value of Kp, the proportional gain. It is emitted when the rate controller is enabled.
rate_control.locate.ki¶
`rate_control.locate.ki` is a gauge that reports the current value of Ki, the integral gain. It is emitted when the rate controller is enabled.
rate_control.locate.kd¶
`rate_control.locate.kd` is a gauge that reports the current value of Kd, the derivative gain. It is emitted when the rate controller is enabled.
rate_control.locate.pterm¶
`rate_control.locate.pterm` is a gauge that reports the current value of the proportional term of the rate controller. It is emitted when the rate controller is enabled.
rate_control.locate.iterm¶
`rate_control.locate.iterm` is a gauge that reports the current value of the integral term of the rate controller. It is emitted when the rate controller is enabled.
rate_control.locate.dterm¶
`rate_control.locate.dterm` is a gauge that reports the current value of the derivative term of the rate controller. It is emitted when the rate controller is enabled.
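The gains and terms above are the pieces of a standard PID controller. A simplified, illustrative update step shows how the `pterm`, `iterm`, and `dterm` gauges relate to Kp, Ki, and Kd (the actual controller in ichnaea may compute these differently):

```python
def pid_terms(queue_size, target, kp, ki, kd, integral, last_error):
    """One illustrative update step of a PID-style rate controller.

    Returns the three term gauges, the combined output, and the state
    (integral, error) carried into the next step.
    """
    error = target - queue_size       # positive when under the target
    integral += error                 # accumulated error over time
    pterm = kp * error                # reacts to the current error
    iterm = ki * integral             # reacts to accumulated error
    dterm = kd * (error - last_error) # reacts to the rate of change
    output = pterm + iterm + dterm    # drives the sample-rate adjustment
    return pterm, iterm, dterm, output, integral, error

# A queue 200 items over target with only proportional gain active:
pterm, iterm, dterm, output, integral, error = pid_terms(
    1200, 1000, 0.5, 0.0, 0.0, 0.0, 0.0)
```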
Transaction History Metrics¶
Processing the data queues can cause the MySQL InnoDB transaction history to grow faster than it can be purged. The transaction history length can be monitored, and when it exceeds a maximum, it can turn off observation processing until it is reduced. See transaction history monitoring for details.
Monitoring the transaction history length requires that the celery worker database connection (“read-write”) has the PROCESS privilege. If the connection does not have this privilege, then no related metrics are emitted. If the connection has this privilege, then one or more metrics are emitted to monitor this process:
trx_history.length¶
`trx_history.length` is a gauge that reports the current length of the InnoDB transaction history.
trx_history.purging¶
If the rate controller is enabled, then `trx_history.purging` is a gauge that becomes `1` when the transaction history exceeds the maximum level. Observation processing is paused by setting the global locate sample rate to 0%, which allows the MySQL purge process to reduce the transaction history. When it drops below a safe minimum level, the rate is allowed to rise again.
If the rate controller is not enabled, then purging mode is not used, and this metric is not emitted.
trx_history.max¶
`trx_history.max` is a gauge that reports the current maximum value for the transaction history length before the system switches to purging mode.
If the rate controller is not enabled, then purging mode is not used, and this metric is not emitted.
trx_history.min¶
`trx_history.min` is a gauge that reports the current minimum value for the transaction history length before the system switches out of purging mode.
If the rate controller is not enabled, then purging mode is not used, and this metric is not emitted.
Datamaps Structured Log¶
The datamap script generates a data map from the gathered observations. It does not emit metrics.
The final `canonical-log-line` log entry has this data:

- `bucket_name`: The name of the S3 bucket
- `concurrency`: The number of concurrent threads used
- `create`: True if `--create` was set to generate tiles
- `csv_converted_count`: How many CSV files were converted to quadtrees
- `csv_count`: How many CSV files were exported from the database
- `duration_s`: How long in seconds to run the script
- `export_duration_s`: How long in seconds to export from tables to CSV
- `intermediate_quadtree_count`: How many partial quadtrees were created (due to multiple CSVs exported from large tables) and merged into one per-table quadtree
- `merge_duration_s`: How long in seconds to merge the per-table quadtrees
- `quadtree_count`: How many per-table quadtrees were generated
- `quadtree_duration_s`: How long in seconds to convert CSVs to quadtrees
- `render_duration_s`: How long in seconds to render the merged quadtree to tiles
- `row_count`: The number of rows across datamap tables
- `script_name`: The name of the script (`ichnaea.scripts.datamap`)
- `success`: True if the script completed without errors
- `sync_duration_s`: How long in seconds it took to upload tiles to S3
- `tile_changed`: How many existing S3 tiles were updated
- `tile_count`: The total number of tiles generated
- `tile_deleted`: How many existing S3 tiles were deleted
- `tile_failed`: How many upload or deletion failures occurred
- `tile_new`: How many new tiles were uploaded to S3
- `tile_unchanged`: How many tiles were the same as the S3 tiles
- `upload`: True if `--upload` was set to upload / sync tiles
Much of this data is also found in the file `tiles/data.json` in the S3 bucket for the most recent run.
Implementation¶
Ichnaea emits statsd-compatible metrics using markus, if STATSD_HOST is configured (see the config section). Metrics use the tags extension, which adds queryable dimensions to the metrics. In development, the metrics are displayed with the logs. In production, the metrics are stored in an InfluxDB database, and can be displayed as graphs with Grafana.
Ichnaea also emits structured logs using structlog. In development, these are displayed in a human-friendly format. In production, they use the MozLog JSON format, and the data is stored in BigQuery.
In the past, metrics were the main source of runtime data, and tags were used to segment the metrics and provide insights. However, metric tags and their values were limited to avoid performance issues. InfluxDB and other time-series databases store metrics by the indexed series of tag values. This performs well when tags have a small number of unique values, and the combinations of tags are limited. When tags have many unique values and are combined, the number of possible series can explode and cause storage and performance issues (the “high cardinality” problem).
Metric tag values are limited to avoid high-cardinality issues. For example, rather than storing the number of WiFi stations, the `wifi` tag of the locate.query metric has the values `none`, `one`, and `many`. The region, such as `US` or `DE`, was once stored as a tag, but this can have almost 250 values, causing MLS to have the highest processing load across Mozilla projects.
BigQuery easily handles high-cardinality data, so structured logs can contain precise values, such as the actual number of WiFi stations provided, and more items, such as the region and unexpected keys. On the other hand, there isn’t a friendly tool like Grafana to quickly explore the data.
As of 2020, we are in the process of duplicating data from metrics into structured logging, expanding the data collected, and creating dashboards. We’ll also remove data from metrics, first to reduce the current issues around high-cardinality, then to focus metrics on operational data. Structured data will be used for service analysis and monitoring of long-term trends, and dashboards created for reference.