Query Samples

Our software sees (but does not transmit) every unencrypted query that enters your database server. We obfuscate all queries we observe by normalizing them into digests, stripping out variable portions of the query including literal values that may contain Personally Identifiable Information (PII) and other sensitive data. This obfuscation takes place locally and not within the API, and we transmit the digested query to the application.

For the most part, we simply generate metrics about these digested queries. But to make this data much more useful, we probabilistically capture individual raw samples of queries, including literal values, so you can examine them. When a raw query sample is available for examination, we display a small circle on the query details sparkplot, one for each sample. Here’s a screenshot:

Top Queries

VividCortex also stores the checksum (first 64 bits of MD5) for each query, which is used throughout the application to assist with analysis (such as through metric names which include the query checksum).

This table illustrates how a query is digested and checksummed:

Item Example Usage
raw query SELECT * FROM db1.tbl1 WHERE SSN = '123-45-6789' Sampled on a periodic basis and displayed in the UI to facilitate inspection of queries.
digested query select * from db?.tbl? where ssn = ? Transmitted to the VividCortex API the first time the software see it. Used for aggregate metrics.
checksum 51c37c16602e541c Used as a key for metrics about queries that share the same digest, e.g. host.queries.51c37c16602e541c.latency_us

Because we are capturing raw query samples and sending them to our application for your analysis, there is the potential for Personally Identifiable Information or other sensitive data to be transmitted to VividCortex. The options below allow you to limit or completely eliminate the transmission of query samples.

While we treat this data with the utmost care, you should be familiar with the ways we secure your private data from end to end.

Configuring Capture of Query Samples

In order to prevent the transmission of PII or other sensitive data to VividCortex, the following configurations may be enabled on a per-host basis.

  • Sampling can be completely disabled.

Disabling sampling entirely prevents the agents from taking any query samples, which prevents the transmission of raw query statements as well as any additional information related to an individual sample, such as latency, connection ID, EXPLAIN plans, etc. We typically do not recommend this option as it is the most limiting.

  • Samples can be captured, but the query text itself is discarded.

This option allows us to collect query metadata for individual queries, such as connection ID, latency, EXPLAIN plans, etc., but does not transmit raw query samples (such as the ‘raw query’ example in the table above). Many of these details are useful, so it is helpful to have samples even if the sample is deleted. When using this option, we also take steps to remove PII which may be in the EXPLAIN plan. For MySQL, we do not capture the JSON representation of the EXPLAIN plan in order to avoid capturing the query text. For PostgreSQL, we filter out the Filter, Hash Cond, Index Cond, Merge Cond, and Recheck Cond fields.

If it is believed that the EXPLAIN plan might still contain PII, EXPLAIN functionality can be disabled entirely (see below).

  • Queries can be whitelisted or blacklisted by query text or regular expression.

This option allow us to collect all queries except for those matching a blacklist, or discard all queries except those matching a whitelist. For example, you can disable query samples for all queries using the above option, and then allow the capture of a limited set of queries that you know are safe.

Enabling

You can enable these settings using a VividCortex configuration file containing the option(s) you wish to enable. The settings are as follows and are enabled on a per-host basis:

  • disable-sampling: When set to true, sampling is completely disabled. No query text, nor any information related to individual queries (such as connection ID, latency, etc.), is captured.

  • disable-sampling-text: When set to true, query metadata is collected but the raw query text itself is discarded.

  • query-blacklist-pattern: This is a regular expression describing the queries that must not be sampled. These queries will be treated as if disable-sampling was enabled. The syntax available for use in a regular expression can be found here.

  • query-whitelist-pattern: This is a regular expression describing the queries that may be sampled. Use this in conjunction with disable-sampling to disable all samples except for those that match this regular expression. The syntax available for use in a regular expression can be found here. Alternatively, you may instead provide a comma-separated list of query checksums (found on the query details page) which are safe to query.

  • enable-explains: When set to false, we will not capture EXPLAIN plans. Most often this is used in conjunction with disable-sampling-text, when it is valuable to capture query metadata but the EXPLAIN plan, like the raw query itself, may contain PII or other sensitive data.

For On-Host configurations, these settings are configured in the query agent config file, except for disable-sampling-text and enable-explains, which are configured in the metrics agent config file. For Off-Host configurations, all of the settings are configured in the metrics agent

As an example, if you wished to disable the capture of query text in an On-Host configuration for PostgreSQL, you would edit the file /etc/vividcortex/vc-pgsql-metrics.conf to contain "disable-sampling-text":"true":

{
"foo":"bar",
"disable-sampling-text":"true",
"baz": 1
}

Please contact VividCortex Customer Support if you have any questions or concerns about query samples or the available options to configure this functionality.