Aggregate data faster with approximate query processing

The purpose of this blog article is to show the reader how to use approximate queries and, more importantly, how to evaluate and decide when to use them. The article includes an introduction to the approximate query processing concept, a definition of the available functions and their usage, and a generic test script to evaluate approximate query performance.

 

1. Introduction to approximate query processing

 

Data aggregation is a principal asset for data analysts when exploring any type of data, as well as an essential task for Business Intelligence (BI) tools when generating charts and dashboards for business users. The superlative usage of data aggregations makes performance a key factor for many databases (DB) and BI tools, especially in those cases when data volumes are significant.

 

Several solutions already exist to improve the performance of data analyses that require costly aggregations. It is common to create aggregated tables containing the aggregated measures you are specifically interested in, and to make use of cache strategies available in the DB or the BI tool. However, Oracle also offers another particular way of improving aggregated query performance, called approximate query processing, which can be a simple yet effective solution in some cases.

 

As stated in the Oracle Database 19c White Paper, which explains all the novelties in the brand new Oracle DB 19c, approximate query processing is a new class of data analysis that can quickly return approximate answers, with a high degree of accuracy, without requiring excessive resource utilization. Approximate queries are based on SQL functions such as APPROX_COUNT_DISTINCT and APPROX_SUM, which are very similar to the standard COUNT(DISTINCT …) and SUM functions, except that only a portion of the selected rows is processed when calculating the query output, thus reducing query processing time.

 

2. Approximate query processing functions

 

2.1. Functions definition

 

Approximate query processing was first seen in Oracle Database 11g when the APPROX_COUNT_DISTINCT function was added, but not documented, to improve the speed when gathering statistics using the DBMS_STATS package; then version 12.2 included the following query processing functions in its documentation:

  • APPROX_COUNT_DISTINCT
  • APPROX_COUNT_DISTINCT_DETAIL – returns a BLOB value with the detailed information about the number of distinct rows for each value of the expression defined.
  • TO_APPROX_COUNT_DISTINCT – converts the BLOB value into a number value.
  • APPROX_COUNT_DISTINCT_AGG – uses a BLOB value as input, returns any aggregation over the detailed information available.
  • APPROX_MEDIAN – is an approximate inverse distribution that assumes a continuous distribution model.
  • APPROX_PERCENTILE – returns the value corresponding to any percentile.
  • APPROX_PERCENTILE_DETAIL – converts the BLOB value into a number value.
  • TO_APPROX_PERCENTILE – returns a BLOB value with the detail information about the approximate percentile information of the expression defined.
  • APPROX_PERCENTILE_AGG – uses a BLOB value as input, returns any aggregation over the detailed information available.

 

This Oracle version also included three new parameters called zero code changes parameters, which can be set at system or session level to smoothly activate approximate query processing when running an SQL code with standard functions:

 

  • APPROX_FOR_AGGREGATION – sets APPROX_FOR_COUNT_DISTINCT to TRUE and APPROX_FOR_PERCENTILE to ALL.
  • APPROX_FOR_COUNT_DISTINCT – when TRUE converts COUNT(DISTINCT …) calls to APPROX_COUNT_DISTINCT calls.
  • APPROX_FOR_PERCENTILE

 

Finally, in Oracle Database 18c, three more functions were introduced to allow approximate top-n query processing:

 

  • APPROX_RANK – returns the rank values in an approximate top-n attribute values.
  • APPROX_SUM – returns the approximate sum in an approximate top-n attribute values.
  • APPROX_COUNT – returns the approximate rows count in an approximate top-n attribute values.

 

2.2. Functions syntax and limitations

 

The functions included in version 12.2, as shown in the examples included in the testing script below, are very intuitive and do not require any extra clause or definition in the SQL issued.

 

On the other hand, the approximate top-n functions included in version 18c are more peculiar and require some specific syntax and limitations to be taken into account:

 

  • The expression of the APPROX_RANK function can only be based on an APPROX_SUM or an APPROX_COUNT function; the ORDER BY clause and the DESC keyword are also mandatory, so these functions cannot be used to perform bottom-n queries.
  • When using an APPROX_RANK, APPROX_SUM or APPROX_COUNT function in the SELECT, a HAVING clause including the APPROX_RANK definition of each approximate function selected is required; this HAVING clause also defines the top-n rows returned by the query.
  • Other optional keywords accepted are MAX_ERROR (as a parameter for the APPROX_SUM and APPROX_COUNT functions) and PARTITION BY (as a clause in the APPROX_RANK expression).

 

For further information on how to use these functions, visit the Aggregate Functions section of the official Oracle documentation corresponding to your database version: 12.2, 18c, 19c.

 

2.3. Functions usage in BI Tools

 

As previously commented, data exploration and ad hoc analysis using a SQL Client is a principal use case for approximate query processing. However, approximate queries can also be used in the expression of data objects contained in BI tools to reduce the loading time of the analyses and dashboards developed.

 

When using approximate queries in BI tools such as OBIEE, Tableau and Qlik, the first thing to remember is that approximate queries are an Oracle database-specific feature: this means that the expression of any BI tool data object containing an approximate function must be sent directly to the Oracle database, without first being interpreted by the BI tool.

 

Some BI tools such as Qlik always send data object expressions to the configured data source without any pre-interpretation, and so approximate queries can normally be used. However, other BI tools such as OBIEE and Tableau normally interpret data object expressions and then send them to the data source system which sends back the query result; in these cases, pass-through functions are needed to force the BI tool to send the expression straight to the data source system. OBIEE provides the EVALUATE function to do this, while Tableau offers RAWSQL.

 

3. Approximate query processing evaluation

 

 

The aim of this section is to evaluate how effective and useful approximate query processing functions are compared to their equivalent standard functions. Specifically, query output and query processing time are compared for different table sizes and different queries.

 

3.1. Testing script

 

The test performed followed these steps:

 

 

  1. Creation of an Oracle Autonomous Data Warehouse (ADW) cloud instance.
  2. Creation of four dummy tables with significantly different volumes in the ADW instance. The first table has 10k rows, the second 1M rows, the third 10M rows and the fourth 20M rows.
    The DML corresponding to the creation of the 10k-row table is as follows:

    CREATE TABLE t1 AS
    SELECT level AS id,
           'Description of ' || level as description,
           TRUNC(DBMS_RANDOM.value(1,4)) as attribute_1_id,
           TRUNC(DBMS_RANDOM.value(1,20)) as attribute_2_id,
           TRUNC(sysdate) - TRUNC(DBMS_RANDOM.value(1,10001)) AS created_date
    FROM dual
    CONNECT BY level <= 10000;

     

  3. Preparation and execution of a set of approximate queries and their corresponding standard queries using the four tables previously created.
    For each execution, the query output and processing time are recorded.
    A summary list of the approximate queries and the corresponding standard queries run can be seen below:

    Query ID

    Approximate query function

    Approximate query

    Standard query

    01APPROX_COUNT_DISTINCTSELECT APPROX_COUNT_DISTINCT(id) FROM tSELECT COUNT(DISTINCT(id)) FROM t
    02APPROX_COUNT_DISTINCTSELECT APPROX_COUNT_DISTINCT(id) FROM t WHERE attribute_1_id IN (2,3,4)SELECT COUNT(DISTINCT(id)) FROM t
    WHERE attribute_1_id IN (2,3,4)
    03APPROX_COUNT_DISTINCTSELECT APPROX_COUNT_DISTINCT(attribute_1_id) FROM t WHERE attribute_2_id BETWEEN 1 AND 10SELECT COUNT(DISTINCT(attribute_1_id)) FROM t
    WHERE attribute_2_id BETWEEN 1 AND 10
    04APPROX_MEDIANSELECT APPROX_MEDIAN(id) FROM tSELECT MEDIAN(id) FROM t
    05APPROX_MEDIANSELECT APPROX_MEDIAN(attribute_1_id) FROM tSELECT MEDIAN(attribute_1_id) FROM t
    06APPROX_COUNT & APPROX_RANKSELECT
    attribute_1_id,
    APPROX_COUNT(id)
    FROM t
    WHERE attribute_2_id IN (1,2)
    GROUP BY attribute_1_id
    HAVING APPROX_RANK( ORDER BY APPROX_COUNT(id) DESC) <= 10
    SELECT
    attribute_1_id,
    COUNT(id) AS count
    FROM t
    WHERE attribute_2_id IN (1,2)
    GROUP BY attribute_1_id
    ORDER BY COUNT
    FETCH FIRST 10 ROWS ONLY
    07APPROX_COUNT & APPROX_RANKSELECT
    attribute_1_id,
    APPROX_SUM(attribute_2_id)
    FROM t
    GROUP BY attribute_1_id
    HAVING APPROX_RANK( ORDER BY APPROX_SUM(attribute_2_id) DESC) <= 10
    SELECT
    attribute_1_id,
    SUM(attribute_2_id) AS sum
    FROM t
    GROUP BY attribute_1_id
    ORDER BY sum desc
    FETCH NEXT 4 ROWS ONLY

     

  4. Once all the query results have been gathered, including outputs and processing times, each approximate query result is compared to its corresponding standard query result.

 

3.2. Testing conclusions

 

Let’s look at the query output accuracy and processing time:

 

The accuracy of the approximate queries results compared to their corresponding standard queries results is very high and meets the expectations of a maximum accuracy error of 5%. Indeed, the maximum percentual difference experienced between both methods is 4% on an APPROX_COUNT_DISTINCT query over a highly dispersed attribute. We can safely conclude that this aspect is positively achieved.

 

On the other hand, the performance improvement of approximate queries compared to their corresponding standard queries is more variable, ranging from approximate queries being 20 times faster to approximate queries with no performance improvement. The full list of factors that affect approximate queries performance is difficult to define although the following factors should be considered:

 

  • Database performance.
  • Approximate function used.
  • Table size – the bigger the table, the higher the performance improvement.
  • Dispersion of the field or fields to be aggregated.
  • Specific query characteristics such as filters, aggrupation and sorts.

 

Conclusion

 

Approximate query processing is an interesting feature to consider when experiencing aggregated data access performance issues, when running ad hoc queries from a SQL Client, as well as when executing analyses and dashboards in a BI tool. As commented, each BI tool has its own way to send queries directly to the data source system which can be used to run approximate queries.

 

Due to the variability seen in the performance of approximate queries, we recommend firstly evaluating how suitable approximate queries are to your specific case, and then you can take several actions:

 

  • Run a standard test script like the one included in this article, or any other.
  • Set the approx_for_aggregation flag to true and test whether the currently used count distinct and percentile queries perform better or not.
  • Take some of your most critical queries, build their approximate queries synonyms and check if they perform better or not.

 

For more information about approximate queries, data aggregation or any BI and analytics topic, don’t hesitate to contact ClearPeaks! Our consultants will be happy to help you!

Pere V
pere.vegas@clearpeaks.com