Deduplication using Analytic Function

05 Jul 2011 Deduplication using Analytic Function

Posted at 10:02h in Enterprise BI, ETL by Jordi S

When implementing an ETL, one of the most common tasks is the deduplication of rows. If you need to implement this in PL/SQL, one method could be with correlated subqueries. However, in this article we will take the approach sometimes forgotten by developers, which is using Analytic Functions.

The main objective of this article is to compare the two different ways to realize the deduplication.

Scenario:

The goal is to delete the data that has one more row per primary key in this case in the PK_BILL column. The rows that we want to delete are all which come after the most recent DT_EXTRACTION row.

However, our case is a bit more complex as the rows to be deleted under DT_EXTRACTION contain the same values, which means every column is duplicated. In this case we randomly deleted all rows except for one.

Solution A: with correlated SQL / GROUP BY clause

A possible solution using the common correlated subqueries would be to do it with two queries:

· Below is one query to delete the rows which contain the same PK_BILL value except the row which has the most recent DT_EXTRACTION.

· Below is another query to select all the rows to delete which contain the same PK_BILL value except for one (i.e.: the lowest rowid)

Solution B: with Analytic Function

Below you can find a brief review about the Analytic Functions:

Parts of an Analytic Function

Represents an abstract set of rows such as:

A partition

– Specified by PARTITION BY clause

– Allows to subdivide the table, much like GROUP BY clause

– Without PARTITION BY clause, the whole table is in a single partition.

– Can contain a frame

A frame:

– Specified by ORDER BY clause and frame clause

– Defined in a partition

– Allows to tell how far the set is applied

– Defines the ordering of the set.

– Never goes across two partitions.

The general syntax of analytic function is:

Function(args) OVER ( [PARTITION BY <…>] [ORDER BY <….>] )

List of some important analytic functions:

ROW_NUMBER() : Returns number of the current row
RANK() : Returns rank of the current row
FIRST_VALUE() :Returns the first value of the frame.
LAG(): Returns value of the row above
LEAD(): Returns value of the row below.

Let’s see a detailed example of the ROW_NUMBER() function :

Another possible solution using Analytic Function could be the following:

With this sample, we can resume all the different cases in just one query.

Conclusion

Not only will you have better performance using Analytic Function, you will also have a more elegant and brief solution. Furthermore, you don’t have to be an expert to understand these functions; it is enough to know the basis of Analytic Functions, as the code is clear and understandable.

Tags:

Jordi S

jordi.sota@clearpeaks.com

05 Jul 2011 Deduplication using Analytic Function

Tags:

Jordi S

More Links

From Our Blog