
Massive Survey Data Exploration with Dremio (Part 1)

As the amount of data piles up, it is crucial to manage that growth and not let our tools fall behind; we must analyse the use case and choose a suitable platform to handle the data as it grows.

 

This is the first of a two-part series where we are going to discuss a use case related to a massive survey dataset, scaling up to 1 billion rows of data, on which we have been working recently for one of our customers. The dataset contains the answers to multiple surveys, each survey having hundreds of questions and being answered by tens of thousands of people.

 

Initially, due to the volume of data, our customer was only loading a few surveys into an RDBMS, and some reports were created with a BI tool to enable analysis. To achieve fast responses, our customer leveraged extracts provided by the BI tool. This solution was working well enough for a low number of surveys (with tens of millions of rows), but our customer was worried about the cost of analysing the data from more surveys whilst maintaining reasonable query response times. The goal was to get responses within 10 seconds for most of the ad hoc queries and the BI tool visualisations.

 

As the data would grow (with more surveys being added), the extracts of the BI tool would also grow, and so would the cost (BI tool licence and hardware, but especially the licence) – our customer calculated what the cost would be when dealing with a billion rows, and it was plainly unreasonable.

 

And that is when we came in – what we proposed to handle this amount of information was to change the storage to a data lake and to use a powerful and scalable query engine able to work directly on the data lake. This would move the workload from the BI tool extracts to the query engine, and the storage to the data lake, thus reducing costs. The perfect fit was Dremio, which, thanks to reflections (its flagship feature), enables running fast SQL queries on the data stored in the AWS S3 data lake. Note that the solution would also work on other cloud storage or data lake technologies; another benefit of our solution is that it works with pretty much any BI tool. Since Dremio offers outstanding compatibility with Power BI – one of our favourites – we have, for the sake of these blogs, reproduced some of the customer dashboards in Power BI.

 

In this article we are going to see how we can accelerate queries with Dremio in this real use case with massive survey data. We are going to run through a brief introduction to Dremio reflections and how we used them to improve the performance of some plain SQL queries executed on this massive dataset. In the next part of this series, we will focus on accelerating the queries from the BI tool visualisations, i.e., Power BI, which may use different filters in each execution.

 

Before going into more detail, we recommend reading our previous blog post about Dremio for an introduction to this fantastic data lakehouse platform, to understand how it achieves query acceleration via reflections, and to see how it can be deployed on a cloud such as AWS. And if you’re short on time, there’s a brief introduction to Dremio in this blog post too!

 

 

1. Introducing Dremio

 

Dremio is a data lakehouse SQL engine that excels at dealing with large amounts of data from multiple sources simultaneously at an end-user level (BI, analytics, etc.), as well as working directly with data lakes.

 

The way Dremio improves performance is by using reflections – these are physically optimised representations of the source data. Reflections can be raw or aggregated. In all cases they use optimal file and table formats (under-the-hood optimisations such as those provided by Iceberg are used, as are partitioning and sorting). The main difference between the two types is that raw reflections are useful for focusing on a subset of the data or simply enhancing its format, whereas aggregation reflections are more suitable when the data is going to be queried in a BI fashion (i.e., using “GROUP BY” statements), because essentially Dremio stores precomputed aggregations based on the fields defined in the reflection definition.
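Although in this project we configured everything through the Dremio UI, reflections can also be managed with SQL. The snippet below is just an illustrative sketch with hypothetical dataset and reflection names (newer Dremio versions also accept ALTER TABLE, and the exact syntax may vary between releases, so check the documentation of your version):

-- Raw reflection: an optimised physical copy of the selected fields
ALTER DATASET myspace.myview
CREATE RAW REFLECTION raw_myview
USING DISPLAY (field_a, field_b, field_c);

-- Aggregation reflection: precomputed aggregates for BI-style GROUP BY queries
ALTER DATASET myspace.myview
CREATE AGGREGATE REFLECTION agg_myview
USING DIMENSIONS (field_a) MEASURES (field_b (COUNT, SUM));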

 

Another important thing to bear in mind is that Dremio uses “elastic engines”, which in the case of AWS, are EC2 instances; these are the worker nodes of the architecture that will handle the workloads.

 

For this project we are using a single elastic engine with m5d.2xlarge EC2 nodes: 1 coordinator and 4 workers. However, we can set Dremio to use more engines and to automatically upscale or downscale in response to the workload and the number of users (for example, there could be sets of users using dedicated engines, so they have isolated computing resources).

 

Now that we have seen how Dremio works, let’s move on to the dataset.

 

 

2. The Dataset

 

We have the following dataset of 1 billion rows, which contains the fact table “V_SURVEY_FEEDBACK” and 17 dimension tables:

 


Figure 1: Massive survey dataset

 

Each time a user completes a survey, the results are stored in S3. Since it is a data lake, it is designed to handle enormous amounts of data, so we can store anything we want without worrying about capacity.

 

Please note that the data used and shown in these blog posts is synthetic data and does not correspond to any existing survey.

 

2.1. Adding the Physical Dataset

 

Once we know which dataset we are dealing with, we can start setting everything up to enable Dremio to query the data.

 

First, we need to add a connection to “S3 Data Lake” by giving the access key and access secret of the account that has access to the bucket:

 


Figure 2: Menu to add a new S3 source

 

Now we can start specifying the datasets we want to use from our S3 storage by clicking on “S3 Data Lake”, in the lower-left corner:

 


Figure 3: Button to access the S3 data

 

This opens a browser where we can get to the actual object we want to use as the source. We can select a whole folder for a single table, as is the case for the fact table, by hovering the mouse over an object and selecting the button that appears on the right:

 


Figure 4: Button to convert a folder into a source dataset

 

Once we have selected an object as the source, a window pops up, where we specify the format of the data to allow Dremio to read it properly:

 


Figure 5: Dataset format window

 

For the rest of the tables, we’ll select each individual object as they only have one file per table:

 


Figure 6: List of datasets for the dimension tables

 

Remember that Dremio does not load the data, and the S3 objects we have selected will remain in the same place; we’ve only told Dremio how to read the data in S3.

 

Now that all the sources are set up, we can start with the views.

 

3. Views

 

We can create views on top of the physical datasets. A view is a virtual table, created by running SQL statements or functions on a physical table or another view.

 

It is recommended that the views are made incrementally, creating multiple layers of views.

 


Figure 7: Different layers that a Dremio project may contain

 

The first layer is Staging: here we create 1-1 views of the physical tables we previously added from S3, so each physical table will have a view on top of it. We’ll also correct the names of the columns and their types, and make any additional changes, such as calculated fields, if necessary.
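As a purely illustrative sketch, a 1-1 staging view over the fact table could be created with something like the statement below (the S3 path and the casts shown are assumptions rather than the customer’s actual schema, and older Dremio versions use CREATE VDS instead of CREATE VIEW):

create view Staging.survey_feedback as
select
    cast(survey_id as integer) survey_id,   -- enforce the expected numeric types
    cast(date_id as integer) date_id,
    cast(age_id as integer) age_id,
    cast(gender_id as integer) gender_id,
    question_answer_group_id                -- kept as-is; we split it in the next step
from "S3 Data Lake".surveys.V_SURVEY_FEEDBACK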

 

One example of a change we need to perform is to separate the field “question_answer_group_id” into two different ones: “question_id” and “answer_group_id”. To separate these two fields, we can go to the virtual “survey_feedback” table and click on the dropdown menu of the problematic field:

 


Figure 8: Preview of the QUESTION_ANSWER_GROUP_ID column, inside the survey_feedback table

 

We then select ‘Extract Text…’ to extract the left side of the field, and then the right side (e.g. ‘12345-54321’ becomes ‘12345’ and ‘54321’):

 


Figure 9: Dropdown menu for a single column

 

We extract by pattern, from the “First” match, with the regular expression “\d+” to select a number with multiple digits. Be careful not to check the ‘Drop Source Field’ option, as we need to maintain the source field for joins:

 


Figure 10: Extract Text menu

 

We’ll name the resulting field “question_id”. Afterwards, we repeat the same actions but select the “Last” match instead, and name the resulting field “answer_group_id”.
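The same split can also be expressed directly in SQL rather than through the wizard; here is a minimal sketch, assuming the two parts are separated by a hyphen as in the example above (string function support may differ slightly between Dremio versions):

select
    question_answer_group_id,
    -- left part: '12345-54321' becomes '12345'
    substr(question_answer_group_id, 1, position('-' in question_answer_group_id) - 1) question_id,
    -- right part: '12345-54321' becomes '54321'
    substr(question_answer_group_id, position('-' in question_answer_group_id) + 1) answer_group_id
from Staging.survey_feedback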

 

Finally, we want to make sure that our data is in an optimised format, so if it is not already stored in Apache Parquet (or similar) we should make a raw reflection of the 1-1 view.

For the following layers of views, we will be focusing more on the queries we want to cover.

So far we have seen how to add physical datasets to Dremio, how to adapt the data with 1-1 views, and we have also ensured that the data is in an optimised format. We’re all set to start with more reflections, but first let’s take a look at the queries we want to accelerate to plan the next steps properly.

 

 

4. Queries

 

The goal of the data is to meet business needs, which in this case are visual reports that offer key insights. These visualisations, made with our BI tool (Power BI), generate queries that are passed to Dremio, where they are executed to retrieve the data from S3.

 

In addition to these queries from the visualisation itself, we also have some extra queries to help understand how reflections work and how to improve performance. In our next blog post we’ll cover a more realistic use case with Power BI.

 

4.1. Example Query

 

Take a look at the following query:

 

-- Percentage of participants grouped by age range and filtered by year
with groupNA as ( 
    select "year", 
    count(survey_id) total_filt_answers 
    from Processing.sf_a_d 
    where (age_id >= 100 or age_id = 5 or age_id = 0) 
            and "year" > 2018 
    group by "year" 
), 
group15_24 as ( 
    select "year", 
    count(survey_id) total_filt_answers 
    from Processing.sf_a_d 
    where (age_id >= 15 and age_id < 25 or age_id = 1) 
            and "year" > 2018 
    group by "year" 
), 
group25_39 as ( 
    select "year", 
    count(survey_id) total_filt_answers 
    from Processing.sf_a_d 
    where (age_id >= 25 and age_id < 40 or age_id = 2) 
            and "year" > 2018 
    group by "year" 
), 
group40_54 as ( 
    select "year", 
    count(survey_id) total_filt_answers 
    from Processing.sf_a_d 
    where (age_id >= 40 and age_id < 55 or age_id = 3) 
            and "year" > 2018 
    group by "year" 
), 
group55_plus as ( 
    select "year", 
    count(survey_id) total_filt_answers 
    from Processing.sf_a_d 
    where (age_id >= 55 and age_id < 100 or age_id = 4) 
            and "year" > 2018 
    group by "year" 
), 
total_participants as ( 
    select count(survey_id) total_answers, "year" 
    from Processing.sf_a_d 
    group by "year" 
) 
select tot."year", 
    g15_24.total_filt_answers g15_24, 
    g25_39.total_filt_answers g25_39, 
    g40_54.total_filt_answers g40_54, 
    g55_plus.total_filt_answers g55_plus, 
    gna.total_filt_answers gna, 
    tot.total_answers 
from groupNA gna 
inner join group15_24 g15_24 on gna."year" = g15_24."year" 
inner join group25_39 g25_39 on gna."year" = g25_39."year" 
inner join group40_54 g40_54 on gna."year" = g40_54."year" 
inner join group55_plus g55_plus on gna."year" = g55_plus."year" 
inner join total_participants tot on gna."year" = tot."year" 
order by tot."year" desc 
; 

 

We simply count the number of surveys answered, grouped by year and filtered by age; then we select the total to be able to compare participation proportionally within each age range.

 

Note that the tables used in the query are joins of the S3 source data. The name Processing.sf_a_d means that the view is in the Processing layer and that the joined tables are survey_feedback, age_dim, and date_dim. We created this join to make it easier for the query planner to identify the view that contains the reflection for the query. In addition, to keep everything nice and tidy, we created the Processing layer for all the joins we need.

 

If we were to plot this query it would look something like this:

 

Figure 11: Plot of the percentage of participants grouped by age and filtered by year

 

If we look closely at the query we can see a pattern; we only need to focus on one of the subqueries:

 

select "year", 
count(survey_id) total_filt_answers 
from Processing.sf_a_d 
where (age_id >= 25 and age_id < 40 or age_id = 2) 
             and "year" > 2018 
group by "year" 

 

Once we have managed to accelerate this one, all the others will be accelerated too (because they are identical, but with different values). There are no more aggregations outside the subqueries.

 

4.2. Reflection

 

The key to generating reflections is to create views that select the fields that participate in the query we want to accelerate, then build the reflection on top of this view.

 

Going back to the previous query, we wanted to focus on the subquery that appears multiple times. Creating the view for that is as simple as this:

 

select d."year", sf.survey_id, age.age_id 
from Staging.survey_feedback sf 
inner join Staging.age_dim age on sf.age_id = age.age_id 
inner join Staging.date_dim d on sf.date_id = d.date_id 

 

The reflection for that view would be an aggregate reflection with:

 


Figure 12: Reflection to accelerate the query of the percentage of participants grouped by age and filtered by year
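In SQL terms, the definition in Figure 12 corresponds roughly to the statement below (a sketch: the reflection name is hypothetical, and the dimensions and measure are inferred from the subquery we want to accelerate):

ALTER DATASET Processing.sf_a_d
CREATE AGGREGATE REFLECTION agg_sf_a_d
USING DIMENSIONS ("year", age_id) MEASURES (survey_id (COUNT));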

 

After creating the reflection, we execute the query to test its performance:

 


Figure 13: Execution details for the percentage of participants grouped by age and filtered by year query

 

We can see an impressive execution time of 2 seconds for 1B rows!

 

4.3. More Queries

 

Here’s another query for the same dataset:

 

-- Gender difference on top 5 answered surveys by the predominant gender
with total_participation as ( 
    select survey_id, count(survey_id) n_answers 
    from Processing.sf_g 
    group by survey_id 
), 
man_participation as ( 
    select survey_id, count(survey_id) n_answers 
    from Processing.sf_g 
    where gender_id = 1 
    group by survey_id 
), 
woman_participation as ( 
    select survey_id, count(survey_id) n_answers 
    from Processing.sf_g 
    where gender_id = 2 
    group by survey_id 
) 
select m.survey_id, m.n_answers man_ans, w.n_answers woman_ans, 
    cast(m.n_answers as float)/(t.n_answers)*100 man_pct, 
    cast(w.n_answers as float)/(t.n_answers)*100 woman_pct, 
    ( 
        select max(max_gender) 
        from (values(m.n_answers), (w.n_answers)) as gender_max(max_gender) 
    ) as max_gender 
from man_participation m 
inner join woman_participation w on m.survey_id = w.survey_id 
inner join total_participation t on m.survey_id = t.survey_id 
order by cast(max_gender as float)/(t.n_answers) desc 
limit 5 
; 

 

One way we could represent this query with a chart would be:

 

Figure 14: Representation of the top gender differences in participation per survey

 

Here’s another pattern-based approach – we count participation in each survey while filtering by gender:

 

select survey_id, count(survey_id) n_answers 
from Processing.sf_g 
where gender_id = 2 
group by survey_id 

 

In addition to this we also have another subquery with an aggregation that is unlike the other subqueries:

 

select max(max_gender) 
from (values(m.n_answers), (w.n_answers)) as gender_max(max_gender) 

 

The first thing is to create a view with all the participating fields; then we build the reflection selecting the following fields:

 


Figure 15: Reflection, with count and max as the measures, of the gender difference in answered surveys query
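For completeness, the Processing.sf_g view underneath this reflection is likely as simple as the query below (a sketch, assuming the gender dimension table is called gender_dim):

select sf.survey_id, g.gender_id 
from Staging.survey_feedback sf 
inner join Staging.gender_dim g on sf.gender_id = g.gender_id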

 

And the execution time for this query is:

 


Figure 16: Execution details of the gender difference in answered surveys query

 

We have one last query before moving on to the visualisations. It is built from two subqueries that work together (one selects from the other):

 

-- Surveys that are more than 5% longer (on questions) than the average. 
with question_count as ( 
    select survey_id, count(distinct question_id) cnt_questions 
    from Staging.survey_feedback sf 
    group by survey_id 
), 
avg_questions_per_survey as ( 
    select ceil(avg(cnt_questions)) avg_q_per_survey 
    from question_count qc 
) 
select survey_id longer_survey, 
    cnt_questions - (select avg_q_per_survey from avg_questions_per_survey) question_diff 
from question_count qc 
where cnt_questions > (select avg_q_per_survey from avg_questions_per_survey)*1.05 
; 

 

We need to focus on this subquery:

 

select survey_id, count(distinct question_id) cnt_questions 
from Staging.survey_feedback sf 
group by survey_id 

 

And the following subquery, which selects from the previous one:

select ceil(avg(cnt_questions)) avg_q_per_survey 
from question_count qc 

 

To accelerate this, we are going to create two views, and a reflection on top of each of them.

 

The first view is focused on the root subquery:

select survey_id, question_id 
from Staging.survey_feedback sf 

 

With the following reflection:

 

Figure 17: Reflection, with count as the measure, for the root subquery of the 5% larger surveys query

 

And the second view is oriented towards accelerating the next subquery:

 

with question_count as ( 
    select survey_id, count(distinct question_id) cnt_questions 
    from Staging.survey_feedback sf 
    group by survey_id 
) 
select cnt_questions 
from question_count

 

With the following reflection:


Figure 18: Reflection, with count and sum as the measures, for the outermost subquery of the 5% larger surveys query

 

Note that the second view we created stems from the root subquery, so that we can select the measures of the reflection based on the previous calculation.

 

Then we can execute the query and check that both reflections were used:

 


Figure 19: Job details on the reflections used for the execution of the 5% larger surveys query

 

And the execution time is:

 


Figure 20: Execution details of the 5% larger surveys query

 

Now that we have seen how we can create top-notch reflections for our queries, let’s run through the last topic in this article.

 

4.4. Incremental Reflections

 

Before creating query-specific reflections, it is important to have the data ready, to guarantee faster reflection creation and the ability to handle other types of queries.

 

In this case, we know that the data will always be queried by some specific fields, meaning that there will be restrictive filters like:

WHERE question_id = '123456'

 

So a good foundation consists of creating a “general purpose” raw reflection on top of the fact table. This reflection will be partitioned by the fields used in those restrictive filters and by the fields that appear in joins, although it is not advisable to use too many partitioning fields. We can build this reflection directly in the Staging layer.

 

We end up with a reflection like this:

 

Figure 21: General-purpose reflection built on top of the virtual dataset of the fact table
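Expressed in SQL, a comparable general-purpose raw reflection might look like the sketch below (the reflection name and the exact display and partition columns are assumptions based on the filters and joins described above; in practice, choose the partition fields only after checking the cardinality of the candidates):

ALTER DATASET Staging.survey_feedback
CREATE RAW REFLECTION raw_survey_feedback
USING DISPLAY (survey_id, question_id, answer_group_id, age_id, gender_id, date_id)
PARTITION BY (question_id);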

 

With this reflection in place, the reflections in the layers built on top of Staging are created faster, and all queries will be accelerated to some extent even if we don’t build another reflection oriented to the query in question.

 

Now that we have seen how we can leverage reflections to accelerate queries, we are ready to start with the second part, where we will focus on a more realistic use case: querying from the plots created with a BI tool like Power BI.

5. Next Steps

 

In our next blog post, we will see how to prepare reflections to accelerate BI-oriented queries that apply different filters for the same plot. We will also see how keeping reflections simple is sometimes better, with execution times to prove this!

 

If you have any questions about this blog post, Dremio, or building the most suitable architecture for your business, contact us and our certified experts will be happy to help you. Here at ClearPeaks we have ample experience in dealing with data no matter the field: Business Intelligence, Big Data, Machine Learning, and more!

 


Alex M
alex.maillo@clearpeaks.com