Saturday, January 8, 2022

How to use the Analytical function CUME_DIST in SQL Server

The Analytical function CUME_DIST in SQL Server
The SQL Server(2012 onwards),  has the introduced Analytical functions like LAG, LEAD, FIRST_VALUE, LAST_VALUE, CUME_DIST, PERCENT_RANK, PERCENTILE_CONT, and PERCENTILE_DISC. 

These Analytic functions also known as the window functions are mainly used to compute an aggregate value based on a group of rows or partition windows using the Over () Clause.

These window functions mainly categorized into four groups: Ranking, Aggregate, Analytic and Sequence Number generation.

We can use the analytic functions to compute moving averages, running totals, percentages or top-N results within a group.

Generally when we use aggregate functions like SUM(), COUNT(), AVG(), MAX(), MIN()..with Group By Clause, they return only one row for each group.
However, The analytical functions can return computed value along with all the rows for each group or window.

Now will discuss about one of the analytic function CUME_DIST with examples. 

CUME_DIST :
In statistics, the cumulative distribution function (CDF) is used to predict the probability of a given random variable is given by the number of values less than or equal to the number itself.
Syntax:
CUME_DIST( )  
OVER ( [ partition_by_clause ] order_by_clause )

The SQL Server, CUME_DIST function calculates the cumulative distribution of a value in a group of values. CUME_DIST computes the relative position of a specified value in a group of values. 
For example, the cumulative distribution for a given row in group is equal to the number of rows with values lower than or equal to the value of r, divided by the total number of rows n in that group. 

CUME_DIST (r) = (row number(r) in the group / no. of rows(n) in the group)

The range of values returned by CUME_DIST is greater than 0 and less than or equal to 1. It’s very helpful when we have to fetch only the top n% of the results.

Examples:
Suppose we have a sample of orders with 20 records with order details as follows :
Select 
Order_Id, Order_date, Prod_Id, Units_Sold
From tbl_Sample


Now from the above data, we wants to calculate the Cumulative Distribution or rows.

Scenario 1 :
Cumulative Distribution of Rows based on Order_Date. In this scenario, we are computing the Cumulative Distribution without Partitioning the data.
Select 
Order_Id, Order_date, Prod_Id, Units_Sold,
ROW_NUMBER ( ) Over ( Order by Order_Date) as Rec_No,
COUNT (*) Over  ( ) as Total_Rec_Count,
CUME_DIST ( ) Over (Order by Order_Date) as Cum_Dist
From tbl_Sample

# Result :

Notes :
In the above result, Rec_No is the Record Number, Total_Rec_Count is the Total rows count.
Cum_Dist = ( Rec_No / Total_Rec_Count )

Scenario 2 :
Cumulative Distribution of Rows based on Units_Sold, without Partitioning the Data. 
Select 
Order_Id, Order_date, Prod_Id, Units_Sold,
ROW_NUMBER ( ) Over ( Order by Units_Sold) as Rec_No,
COUNT (*) Over  ( ) as Total_Rec_Count,
CUME_DIST ( ) Over (Order by Units_Sold) as Cum_Dist
From tbl_Sample

# Result :


Notes:
If we carefully observe above, the Cumulative Distribution value is repeated in multiple places without continuous.
Take scenario of records 13, 14, 15. For these records the expected values in general are :
Cum_Dist (13) = ( 13 / 20 ) =  0.65
Cum_Dist (14) = ( 14 / 20 ) =  0.70
Cum_Dist (15) = ( 15 / 20 ) =  0.75

However, since the Units_Sold value '32' is same for these records, the function CUME_DIST ( ) will return the last value of that similar group to each record.

Scenario 3 :
Cumulative Distribution of Rows based on Units_Sold, with Partition By Prod_Id  and Order by Units_Sold

Select 
Order_Id, Order_date, Prod_Id, Units_Sold,
ROW_NUMBER ( ) Over (Partition By Prod_Id  Order by Units_Sold) as Partition_Rec_No,
COUNT (*) Over  (Partition By Prod_Id ) as Partition_Rec_Count,
CUME_DIST ( ) Over (Partition By Prod_Id  Order by Units_Sold) as Cum_Dist
From tbl_Sample

# Result :


Notes:
Partion_Rec_No is the Record number a Partition Window (eg. Partition of Prod_Id=1235).
eg. Partion_Rec_No for the records in the Partition of Prod_Id=1235 is 1,2,3,4,5,6,7,8.
Partion_Rec_Count is the No.of records a Partition Window.
eg. Partion_Rec_Count for the Partition of Prod_Id=1235 is 8

If we carefully observe above, the Cumulative Distribution value is repeated in multiple places without continuous.
Take scenario of records 4, 5, 6 (Rows 8, 9, 10) of the a Window(Group) where the Prod_Id= 1235. For these records the expected values in general are :
Cum_Dist (4) = ( 4 / 20 ) =  0.5
Cum_Dist (5) = ( 5 / 20 ) =  0.625
Cum_Dist (6) = ( 6 / 20 ) =  0.75

However, since the Units_Sold value '32' is same for these records, the function CUME_DIST ( ) will return the last value of that similar group to each record.

Scenario 4 :
In this Scenario, we do some additional calculations to like Cumulative Sum, Ranking for each Partition Window for our reference and understanding.

Select 
Order_Id, Order_date, Prod_Id, Units_Sold,
SUM (Units_Sold) Over  (Partition By Prod_Id Order by Units_Sold) as Partition_Units_Cum_Sum,
SUM (Units_Sold) Over  (Partition By Prod_Id ) as Partition_Units_Sum,
SUM (Units_Sold) Over  ( ) Total_Units_Sum,
COUNT (*) Over  ( ) as Total_Rec_Count,
RANK ( ) Over (Partition By Prod_Id  Order by Units_Sold) as Units_Rank_by_Prod,
DENSE_RANK ( ) Over (Partition By Prod_Id  Order by Units_Sold) as Units_DenseRank_by_Prod,
ROW_NUMBER ( ) Over (Partition By Prod_Id  Order by Units_Sold) as Partition_Rec_No,
COUNT (*) Over  (Partition By Prod_Id ) Partition_Rec_Count,
CUME_DIST ( ) Over (Partition By Prod_Id  Order by Units_Sold) as Cum_Dist
From tbl_Sample

# Result :

--------------------------------------------------------------------------------------------------------
Thanks, TAMATAM ; Business Intelligence & Analytics Professional
--------------------------------------------------------------------------------------------------------

No comments:

Post a Comment

Hi User, Thank You for visiting My Blog. Please post your genuine Feedback or comments only related to this Blog Posts. Please do not post any Spam comments or Advertising kind of comments which will be Ignored.

Featured Post from this Blog

How to compare Current Snapshot Data with Previous Snapshot in Power BI

How to Dynamically compare two Snapshots Data in Power BI Scenario: Suppose, we have a sample Sales data, which is stored with Monthly Snaps...

Popular Posts from this Blog