I come across some Joins scenarios, where I have learnt few Best Practices, which I would like to share.
1. Rank Node (Ex: Get Distinct Records By Latest Date) before Joins:
Two Tables are there and Table names are CUSTOMER and EMAIL_SUBSCRIPTION.
CUSTOMER Table contains the columns - Customer_ID and Email_Address of them.
EMAIL_OPT_IN Table contains the columns - EMAIL_Address, Subscription_Status and Modified_On.
TABLE: CUSTOMER | TABLE: EMAIL_SUBSCRIPTION |
---|---|
Subscription Status 1 Means "YES" and 0 Means "No" |
Requirement:
To Get CURRENT_SUBSCRIPTION_STATUS For Each CUSTOMER.
Solution:
Wrong Approach | Correct Approach |
---|---|
In Dimension Calculation view, Two tables are directly included in Join node. Select the required fields and activate the calculation view. Data Preview of Calculation View is, This is giving Wrong results. Customers - C1 and C2 both has subscription status - Yes and No. We could not find the Current subscription status of the Customer. So before joining these two tables, Current subscription status of Email has to be find with Latest Modified_On date and This output has to be connected to Customers Table in Join node. | In Dimension Calculation view, Include Rank node in Calculation view. Add EMAIL_SUBSCRIPTION Table into it. In Rank node, Enter the following properties. Sort Direction Descending (Top N) on MODIFIED_ON Column and Threshold as 1. So It will filter out the records based on Latest Modified On and give distinct records with latest timestamp. Next, In Join node, Add Customer Table and RANK_1 Output. Select the required fields and activate the calculation view. Let's View the Final output of Calculation view. It is giving the Correct results. Email Subscription status Of Customer C1 is NO Email Subscription status Of Customer C2 is YES Before Joins, Ranking data is giving Distinct Records By Latest Time stamp. So This approach is giving correct results. |
2. Aggregate Table before Joins:
Two Fact Tables are there. Table names are Planning_Sales and Actual_Sales.
Both Tables contain 3 columns - Customer_Id and Product_Id and Sales.
TABLE: PLANNING_SALES | TABLE: ACTUAL_SALES |
---|---|
Requirement:
PLANNING and ACTUAL Sales by Product wise.
Solution:
Wrong Approach | Correct Approach |
---|---|
In Calculation view, Two Tables are directly included in Join node. Select the required fields and activate the calculation view. Data Preview of Calculation View is, This is giving Wrong results. Lets check the Output at Join node level. The Cause of the Problem is, Both the products P1 and P2 have two entries in Planning_Sales and Actual_Sales Table. While Joins between these two tables, Each record in PLANNING table looks Each record in ACTUAL table. So the Rows_Count is doubled and After Aggregation is giving wrong results. To avoid this Problem, First Sales has to be aggregated by Product wise for both PLANNING table and ACTUAL table, Then It has to be joined. | In Calculation View, Planning_Sales Table is Included in Aggregation Node_1 and Sales Aggregated by Product wise. Actual_Sales Table is Included in Aggregation Node_2 and Sales Aggregated by Product wise. Then In Join Node, Two Aggregated Outputs - Aggregation_1 and Aggregation_2 are Joined. In Join node, Join Between Two Aggregation Nodes, Select the required fields and activate the calculation view. Let's View the Final output of Calculation view. This is giving Correct results. Joins before aggregation is giving distinct records with aggregated values. So This approach is giving correct results. |
Regards,
Muthuram
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
8 | |
5 | |
5 | |
4 | |
4 | |
4 | |
3 | |
3 | |
3 | |
3 |