Technology Blogs by Members
Explore a vibrant mix of technical expertise, industry insights, and tech buzz in member blogs covering SAP products, technology, and events. Get in the mix!
cancel
Showing results for 
Search instead for 
Did you mean: 
Former Member

Hi Folks,

I've been mulling over a series of documents for a while now and finally have the bandwidth to put some of them together.

I've decided to group several of them under the idea of "Thinking in HANA". As a massively-parallel, column-oriented, in-memory platform, HANA affords many opportunities to come up with new and creative solutions to seemingly well-understood problems. Moreover, while HANA fully supports SQL-92, a majority of analytical use cases require efficient and often complex data models that support demanding BI requirements, which lead to several challenges:

  1. Complex data flows captured in hand-coded SQL and/or SQLScript can be difficult to maintain.
  2. SQL queries against raw tables, whether hand-coded or software-generated (i.e. from a semantic layer like BOBJ Universes), often fail to leverage native high-performance engines in HANA (i.e. OLAP/Calc Engine).
  3. While the "structure" of data models corresponds closely to well-understood relational operations (different join types, filters, aggregations, calculations), there are multiple cases where HANA does not behave in a strictly relational fashion, leading to potentially confusing results.
  4. Some SQL operators can't be modeled in HANA a 1:1 fashion. For example, UNION, EXCEPT/MINUS, and INTERSECT set operators are unavailable as node types in HANA Calculation Views.

This first article will review point 4 above - how to model database set operators (UNION, EXCEPT, INTERSECT) though a straightforward Calculation View "pattern". I also intend on fleshing out some of the theoretical implications of "set operator" thinking in HANA. Also, while set operators such as MINUS and INTERSECT aren't commonly required for meeting analytical needs on HANA, they do have implications for other more frequent requirements. I'll elaborate on these in another article.

The assumption is that the readers have a basic understanding of SQL, Set Theory, and data modeling in HANA.


Sets – Brief Review


The following Venn Diagram gives a very simple example of how two sets might relate, having some elements in common but some unique to each set.



In the business world there are many cases where two data sets want to be compared - i.e. planned and actual sales figures, sales amounts and invoiced amounts, and many others.

In terms of comparing two sets of data, the following questions can be asked (and answered):

1) What are all the elements in both data sets?

2) Which elements are exclusive to SET 1?

3) Which elements are exclusive to SET 2?

4) Which elements are in both sets?

5) Which elements are in SET 1 (non-exclusive)?

6) Which elements are in SET 2 (non-exclusive)?

7) Which elements are in SET 1 or SET 2?

The questions above are answered below, with solutions in: 1) plain SQL, 2) 'alternate' SQL, and 3) Calculation View approach. The solutions are illustrated with their respective Venn diagrams and associated set operators.

'Alternate' SQL is SQL that reaches the same answer but uses the same logic flow as the Calculation View. It facilitates the "jump" from traditional SQL to HANA data modeling.

Please see the appendix of this document for the SQL to generate both data sets.

Basic Set Operators

The first four questions posed above are answered (solved) below. The solutions leverage the basic set operations of UNION, MINUS, and INTERSECT. (Here no distinction is made between UNION and UNION ALL). The last three of the seven questions above are solved in a “compound” fashion, built with more than one of the basic set operators.


1. What are the elements in both data sets?



SQL Solution


-- UNION (ALL)

SELECT *

FROM

(

       SELECT ELEMENT

       FROM SET_1

       UNION ALL

       SELECT ELEMENT

       FROM SET_2

)

ORDER BY ELEMENT;

Alternate SQL Solution


None

Calculation View Solution


The Calculation View simply requires combining both data sets via UNION node. Despite the name, the UNION behaves as UNION ALL.



2. Which elements are exclusive to SET 1?

SQL Solution

-- Complement of Set 2 with respect to Set 1

SELECT *

FROM

(

       SELECT ELEMENT

       FROM SET_1

       MINUS -- EXCEPT also works

       SELECT ELEMENT

       FROM SET_2

)

ORDER BY ELEMENT;

Alternate SQL Solution

-- Complement of Set 2 with respect to Set 1

-- UNION (ALL)

SELECT ELEMENT

FROM

(

       SELECT ELEMENT, 2 AS FLAG

       FROM SET_1

    

       UNION ALL

    

       SELECT ELEMENT, 3 AS FLAG

       FROM SET_2

)

GROUP BY ELEMENT

HAVING SUM(FLAG) = 2

ORDER BY ELEMENT;

Calculation View Solution

The Calculation View accomplishes the same logic as the Alternate SQL – adding constant values to each data set, aggregating, and then filtering for result records where aggregate value of FLAG column is equal to the constant for the desired data set.



3. Which elements are exclusive to SET 2?

SQL Solution

-- Complement of Set 1 with respect to Set 2

SELECT *

FROM

(

       SELECT ELEMENT

       FROM SET_2

       EXCEPT

       SELECT ELEMENT

       FROM SET_1

)

ORDER BY ELEMENT;

Alternate SQL Solution

-- Complement of Set 1 with respect to Set 2

SELECT ELEMENT

FROM

(

       SELECT ELEMENT, 2 AS FLAG

       FROM SET_1

    

       UNION ALL

    

       SELECT ELEMENT, 3 AS FLAG

       FROM SET_2

)

GROUP BY ELEMENT

HAVING SUM(FLAG) = 3

ORDER BY ELEMENT;

Calculation View Solution

Note that the solution approach is the same as previously except with a filter for SET 2’s constant value rather than that of SET 1


4. Which elements are in both sets?

SQL Solution

-- intersection of Set 1 and Set 2

SELECT *

FROM

(

       SELECT ELEMENT

       FROM SET_1

    

       INTERSECT

    

       SELECT ELEMENT

       FROM SET_2

)

ORDER BY ELEMENT;

Alternate SQL Solution

-- Intersection of Sets 1 and 2

SELECT ELEMENT

FROM

(

       SELECT ELEMENT, 2 AS FLAG

       FROM SET_1

    

       UNION ALL

    

       SELECT ELEMENT, 3 AS FLAG

       FROM SET_2

)

GROUP BY ELEMENT

HAVING SUM(FLAG) = 5

ORDER BY ELEMENT;

Calculation View Solution

Note that the solution approach is the same as previously except with a filter for FLAG = 5 to capture only those records that fall under both (i.e. 2+3 = 5, hence capturing only overlapping (intersecting) records).

Compound Set Operations

Solutions for question 5-7 are solved by applying more than one basic set operator. As such, these solutions are referred to as “compound set operations”.

5. Which elements are in SET 1 (non-exclusive)?

SQL Solution

SELECT *

FROM

(

       (

              SELECT ELEMENT

              FROM SET_1

              EXCEPT

              SELECT ELEMENT

              FROM SET_2

       )

       UNION ALL

       (

              SELECT ELEMENT

              FROM SET_1

              INTERSECT

              SELECT ELEMENT

              FROM SET_2   

       )

)

ORDER BY ELEMENT;



Alternate SQL Solution

SELECT ELEMENT

FROM

(

       SELECT ELEMENT, 2 AS FLAG

       FROM SET_1

    

       UNION ALL

    

       SELECT ELEMENT, 3 AS FLAG

       FROM SET_2

)

GROUP BY ELEMENT

HAVING SUM(FLAG) IN (2, 5)

ORDER BY ELEMENT;



Calculation View Solution

6. Which elements are in SET 2 (non-exclusive)?

SQL Solution

SELECT *

FROM

(

       (

              SELECT ELEMENT

              FROM SET_2

              EXCEPT

              SELECT ELEMENT

              FROM SET_1

       )

       UNION ALL

       (

              SELECT ELEMENT

              FROM SET_1

              INTERSECT

              SELECT ELEMENT

              FROM SET_2   

       )

)

ORDER BY ELEMENT;

Alternate SQL Solution

SELECT ELEMENT

FROM

(

       SELECT ELEMENT, 2 AS FLAG

       FROM SET_1

    

       UNION ALL

    

       SELECT ELEMENT, 3 AS FLAG

       FROM SET_2

)

GROUP BY ELEMENT

HAVING SUM(FLAG) IN (3, 5)

ORDER BY ELEMENT;



Calculation View Solution

7. Which elements are in SET 1 or SET 2?

SQL Solution

SELECT *

FROM

(

       (

              SELECT ELEMENT

              FROM SET_1

              EXCEPT

              SELECT ELEMENT

              FROM SET_2

       )

       UNION ALL

       (

              SELECT ELEMENT

              FROM SET_2

              EXCEPT

              SELECT ELEMENT

              FROM SET_1   

       )

)

ORDER BY ELEMENT;

Alternate SQL Solution

SELECT ELEMENT

FROM

(

       SELECT ELEMENT, 2 AS FLAG

       FROM SET_1

    

       UNION ALL

    

       SELECT ELEMENT, 3 AS FLAG

       FROM SET_2

)

GROUP BY ELEMENT

HAVING SUM(FLAG) != 5

ORDER BY ELEMENT;



Calculation View Solution

Discussion

A few points are worthy of additional discussion:

Over-engineered solution?


The astute reader will note that the solutions for problems (questions) 6 and 7 could easily be achieved by simply querying SET_1 or SET_2 by themselves. This is true. However, from a theoretical perspective, capturing duplicates would still require inspecting SET_1 and SET_2. According to the HANA SQL guide, EXCEPT/MINUS/INTERSECT can operate “as such”, or with DISTINCT keyword. Unfortunately, they still function as distinct without the DISTINCT keyword – making it so that solutions 6 or 7 can’t be properly achieved if duplicates are required for any reasons. This is a bit pedantic though – as duplicates almost never need to be reported (and they could still be captured through more complex SQL).

Empty set

The number of possible “locations” of an element is 3 – i.e. in SET_1, SET_2, or both. An element can either exist, or not exist, in those three locations. This means that 8 outcomes are possible, calculated as 2^3, where 2 is the number of states (exists or not exists) and 3 is the number of “locations”. In the above examples, only 7 questions are addressed as the “empty set” is not a relevant question in the context of analytics for this discussion.


The eight possibilities can also be captured in the following table. Empty set is grayed out:

In SET 1

In SET 2

In both sets

1

FALSE

FALSE

FALSE

2

FALSE

FALSE

TRUE

3

FALSE

TRUE

FALSE

4

FALSE

TRUE

TRUE

5

TRUE

FALSE

FALSE

6

TRUE

FALSE

TRUE

7

TRUE

TRUE

FALSE

8

TRUE

TRUE

TRUE

Identifying Possibilities (Advanced)

The following discussion is for the sake of theoretical curiosity rather than an actual pragmatic solution. The problem is partially solved, left to readers smarter than myself to answer.

Consider the following Venn diagram, particularly the sections in green.


How would you go about finding the elements associated with the green sections above? Visually we can see that it’s A, G and D – but what if each set had many elements and we had to solve this situation with SQL?

First note that set operators like UNION and INTERSECTION correlate with the Boolean operations OR and AND. Moreover, databases have helpful operations like IN and BETWEEN which essentially reduce to combinations of ANDs and ORs. In solutions highlighted above in this paper, some “Boolean logic” was implemented in set operators (UNION, MINUS, etc), and some in the filter condition to meet the final output. The question is – can all of the logic be implemented in set operations, with a filter on a single value that identifies all possible combinations? In our diagram above – rather than building confusing SQL (or Calc Views) with various set operations, and filtering further with Boolean logic – could we simply identify a single value assigned to that combination, and filter on that value? The answer is yes. It’s not pragmatic, but is interesting to note that it’s possible.

Here’s a simplified version that shows the direction of the solution – although not yet fully implemented. Note that this simplest-case becomes very complex – highlighting unpractical it is – but also illustrating the “elegance” assigned to mathematics that’s often easy to miss.

Consider the following Venn diagram:


As noted previously, there are eight possible situations for any single element. In previous solutions we assigned “constant values” to each base set, combined them, and then filtered in order to arrive at a desired result. You may have noticed that the numbers used, 2 and 3, are prime numbers – which indicates the direction of a more robust solution.

Simply aggregating numbers, even when prime, does not guarantee a unique outcome. If three sets are available, a developer were to use the first three prime numbers: 2, 3, and 5 – it’s clear that aggregating 2 and 3 would equal 5 – causing problematic results (potentially) if a similar approach were used as previously in this paper.

The solution approach (unfinished) then is as follows:

  1. 1) Assign a prime number constant value to each source set. Set the constant value of other source sets equal to 1.
  2. 2) Aggregate the sets, and add a calculated field equal to the product of the source set constant values. Since the source values are prime numbers greater to one (or one), this product is guaranteed unique.
  3. 3) Map this result to itself with a CROSS JOIN.
  4. 4) Multiply the calculated field by itself.
  5. 5) Query two fields from this set: ELEMENT (from either table in CROSS JOIN) and M_FLAG (where M_FLAG is the name I gave to the calculated field). Note that there is a distinct value assigned to each combination of records that fulfill 6 of the 8 possibilities (i.e. where values are found in one or two regions. The empty set, and the case of all three regions is not solved. Three CROSS JOINs, correlating to the three regions – solve the problem fully, but also result in redundant records).

  • [ My intuition and some playing around with numbers indicates the following: The number of “regions” is a function of the number of sets N, defined by N * (N-1) + 1. Note: Thanks to josef.bodenschatz for noting that the function should actual be N^2 - 1. This holds for up to 4 sets (4 circles) and likely more. The number of “states” of an element it seems is the square of the number of “regions” based on the examples in this document. So in the case of four sets (four circles), that would make (4*4 - 1) ^ 2 = 225 combinations. Multiplying the first 169 prime numbers would result in a massive number, again indicating how unpractical this approach is. J

Run the following SQL, and note how a filter on ID = 6 would give you the “complement” of the sets, as one example.

DROP TABLE SETS_WITH_FLAGS;

CREATE COLUMN TABLE SETS_WITH_FLAGS AS

(

SELECT ELEMENT, MAX(FLAG_1) AS FLAG_1, MAX(FLAG_2) AS FLAG_2, (MAX(FLAG_1)*MAX(FLAG_2)) AS M_FLAG

FROM

(

       SELECT 'A' AS ELEMENT, 2 AS FLAG_1, 1 AS FLAG_2 FROM DUMMY UNION ALL

       SELECT 'B' AS ELEMENT, 2 AS FLAG_1, 1 AS FLAG_2 FROM DUMMY UNION ALL

       SELECT 'B' AS ELEMENT, 1 AS FLAG_1, 3 AS FLAG_2 FROM DUMMY UNION ALL

       SELECT 'C' AS ELEMENT, 1 AS FLAG_1, 3 AS FLAG_2 FROM DUMMY

)

GROUP BY ELEMENT

);

SELECT T2.ELEMENT, T1.M_FLAG * T2.M_FLAG AS ID FROM SET_FLAG T1 CROSS JOIN SET_FLAG T2 ORDER BY T1.M_FLAG * T2.M_FLAG;

Appendix

Following is the source data SQL used for the examples in this document.

DROP TABLE SET_1;

CREATE COLUMN TABLE SET_1

(

       ELEMENT CHAR(1),

       PRIMARY KEY (ELEMENT)

);

DROP TABLE SET_2;

CREATE COLUMN TABLE SET_2

(

       ELEMENT CHAR(1),

       PRIMARY KEY (ELEMENT)

);

INSERT INTO SET_1 VALUES ('A');

INSERT INTO SET_1 VALUES ('B');

INSERT INTO SET_1 VALUES ('C');

INSERT INTO SET_1 VALUES ('D');

INSERT INTO SET_1 VALUES ('E');

INSERT INTO SET_2 VALUES ('D');

INSERT INTO SET_2 VALUES ('E');

INSERT INTO SET_2 VALUES ('F');

INSERT INTO SET_2 VALUES ('G');

INSERT INTO SET_2 VALUES ('H');



37 Comments
Labels in this area