Questions increasingly arise about SAP IQ's support for hugepages in the Linux kernel.

 

There are two types of hugepages in the Linux kernel: traditional hugepages, which an application must be specifically written to use, and transparent hugepages, an OS-level tuning mechanism for large-memory programs that requires no application changes.

 

The following sections outline SAP IQ’s use of the different types of hugepages and how they should be configured in the kernel.

 

HugePages (traditional HugePages)

Hugepages is a mechanism that allows the Linux kernel to utilize the multiple page size capabilities of modern hardware architectures. Linux uses pages as the basic unit of memory: physical memory is partitioned and accessed using this basic page unit. The default page size is 4096 bytes on the x86 architecture. Hugepages allow large amounts of memory to be utilized with reduced overhead. The CPU caches virtual-to-physical address mappings in its Translation Lookaside Buffer (TLB), which holds a limited number of entries, so mapping a huge amount of physical memory with the default page size exhausts the TLB and adds processing overhead.

 

SAP IQ does not currently use HugePages.  As such, it is imperative that the Linux kernel not be configured to allocate memory for HugePages use.  Any memory that is allocated to HugePages cannot be used by SAP IQ.

 

A default-configured Linux system typically has little or no memory dedicated to HugePages.  The current allocation can be determined by running this command:

grep -i huge /proc/meminfo

On a default system it will typically return values like these:

HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB

Note that Hugepagesize is the size of a single huge page (2 MB here), not the amount allocated; the total memory dedicated to HugePages is HugePages_Total multiplied by Hugepagesize.

 

Should more than a trivial amount of memory be allocated to HugePages, it is strongly recommended to return the kernel configuration to the default by removing the HugePages options (such as vm.nr_hugepages) from /etc/sysctl.conf.
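
As a minimal sketch (assuming the reservation was made through the vm.nr_hugepages sysctl, which is the usual mechanism), the allocation can also be inspected and released at runtime:

# Show the current HugePages reservation
sysctl vm.nr_hugepages

# Release the reservation immediately; removing the setting from
# /etc/sysctl.conf makes the change persist across reboots
sysctl -w vm.nr_hugepages=0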

 

For details on HugePages in the Linux kernel, refer to this write-up: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

 

Transparent HugePages

Starting with RHEL 6 and SUSE Linux Enterprise 11, Transparent HugePages are implemented and enabled by default. They are meant to improve memory management by allowing huge pages to be allocated dynamically by the "khugepaged" kernel thread, rather than at boot time like conventional HugePages.  In certain circumstances, however, Transparent HugePages have been shown to negatively impact SAP IQ performance. It is strongly recommended to disable Transparent HugePages on SAP IQ systems until this OS feature can be fully tested under the client's specific workload.
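
The current mode can be checked in sysfs; the bracketed value is the one in effect (on RHEL 6 the path is /sys/kernel/mm/redhat_transparent_hugepage instead):

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never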

 

The performance issue with Transparent HugePages typically manifests itself as high system CPU time.  Using tools like vmstat and top, monitor the user, system, wait, and idle CPU utilization.  The high system utilization is a result of the memory manager, as described at https://www.kernel.org/doc/Documentation/vm/transhuge.txt: “As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use.”
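
For example, sampling with vmstat every five seconds makes the trend easy to spot:

vmstat 5
# Under the cpu columns, watch us (user), sy (system), wa (wait), and id (idle);
# a sy share that climbs as the system ages points to memory-management overhead
# such as the compaction described above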

 

Should system CPU utilization increase over time, it is possible that the OS is having to spend more time managing Transparent HugePages and anonymous memory segments.  Disabling Transparent HugePages can alleviate that bottleneck.

 

Details on Transparent HugePages can be found here:

https://www.kernel.org/doc/Documentation/vm/transhuge.txt

 

Details on disabling Transparent HugePages can be found in the “== Boot parameter ==” section of the previous link.

 

The recommendation is to use the “transparent_hugepage=never” option in the Linux kernel boot loader file.
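
A sketch of both the persistent and the immediate approach, assuming a GRUB 2 boot loader (file locations vary by distribution):

# Persistent: append the option to the kernel command line in /etc/default/grub
#   GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
# then regenerate the configuration and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg

# Immediate (does not survive a reboot): disable via sysfs
echo never > /sys/kernel/mm/transparent_hugepage/enabled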

In case you haven't yet read about this little project that my friends and I have been working on for the past few months, here's the announcement:

 

http://global.sap.com/news-reader/index.epx?category=ALL&articleID=22468&searchmode=C&page=1&pageSize=10

 

12.1 PB of raw data compressed to just 2 PB of RAM and disk in SAP HANA and SAP IQ.

 

Not one to rest, we are trying to push the limits even further with respect to the size of the database and data load speed.  If you recall, last year we loaded 34 TB/hour into SAP IQ (http://global.sap.com/corporate-en/news.epx?PressID=21160).

 

So, stay tuned for a lot more exciting news on IQ pushing the limits of structured and unstructured data.  Along with all of this will come a series of write-ups on what was done and tested, the data and structures used, and more.

Here’s my take on encryption and what it really means in the SAP Sybase IQ world…

 

Encryption is really a means to protect against unwanted access to data.  For SAP Sybase IQ (IQ), though, it comes in several flavors, covering different access paths.

 

First, you have network encryption.  IQ transport layer encryption covers unwanted access to data in flight between client and server, protecting against anyone who happens to be sniffing network packets.  That ‘in flight’ can be the traditional over-the-network communication, or same-host client/server communication where traffic moves via shared memory or only partially traverses the network stack.
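
As a rough sketch only (assuming the SQL Anywhere-style -ec server option and ENC connection parameter that IQ inherits; the certificate and identity files are placeholders, so verify the syntax against your release's documentation):

# Start the server with TLS transport-layer encryption
start_iq -n myserver -ec "tls(identity=server.id;identity_password=secret)" mydb.db

# Connect over an encrypted connection
dbisql -c "UID=DBA;PWD=sql;ENG=myserver;ENC=tls(trusted_certificates=public.crt)"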

 

Second, you have the encryption of data in the database against someone who only has access to the storage. Call this ‘at rest’ encryption.  This is implemented in IQ as full database encryption (FDE).  It protects against a user who solely has access to the storage and can somehow read directly from disk.  FDE will guarantee that what is at rest on disk cannot be read by a program outside of IQ.

 

In order for IQ to read the data, the key is used at start-up to unlock the data for its use.  The algorithm used is also not published, so even access to the key won't allow full access to the data.  FDE applies to all tables, columns, and indexes in IQ; consequently, all read and write operations are impacted by it.  This level of encryption provides no security inside the database: if a user has SELECT permission on an object, the data can be read in plain text.

 

The one downside to full database encryption is that it must be defined when the database is created and it cannot be disabled.
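
A sketch only, assuming the SQL Anywhere-style ENCRYPTED KEY clause of CREATE DATABASE that IQ inherits (file names and the key are placeholders; verify the clause order against your version's documentation):

-- Encryption must be chosen at creation time
CREATE DATABASE 'secure.db'
  IQ PATH 'secure.iq'
  ENCRYPTED KEY 'S0me*Strong#Key'
  ALGORITHM 'AES';

-- The key must then be supplied at start-up, e.g. via the -ek database option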

 

Lastly, you have column-level encryption.  This is a surgical approach to the at-rest encryption described previously: rather than encrypting everything, one encrypts only the columns holding sensitive data.  Column encryption also prevents unwanted access to sensitive data.  Data access is still controlled by object permissions and role-based access controls; you must have SELECT permission on the column to view it.  A second security step is involved, though: the data that is read is gibberish unless the user possesses the key to unlock the encryption algorithm.  With the proper key, the data can then be read in plain text.

 

The downside to column encryption in IQ is that it is up to the application (or DBA) to manage the keys.  In other engines (ASE) the keys are stored in the database.  In IQ, the keys must be sent from the application and used directly in the query via AES_ENCRYPT and AES_DECRYPT function calls.  These calls can be hidden within a view so that the DBA can manage the keys inside the database rather than in the application. The overhead for column encryption is only on those reads and writes that need to access encrypted columns.  Because of the focused nature of column encryption, overall system overhead is much smaller compared to full database encryption.
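
A minimal sketch of the pattern (table, column, and key names are hypothetical; the AES_DECRYPT signature shown follows the IQ 16 documentation and should be verified for your release):

-- Write path: the application supplies the key with each statement
INSERT INTO customers (id, ssn_enc)
VALUES (1, AES_ENCRYPT('123-45-6789', 'MyKeyPhrase'));

-- Read path: without the correct key, the column is gibberish
SELECT id, AES_DECRYPT(ssn_enc, 'MyKeyPhrase' AS CHAR(11)) AS ssn
FROM customers;

-- Or hide the key inside a view so the DBA manages it in the database
CREATE VIEW v_customers AS
SELECT id, AES_DECRYPT(ssn_enc, 'MyKeyPhrase' AS CHAR(11)) AS ssn
FROM customers;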

 

So, the answer to the question of which encryption is best and which do I need really depends on what the business needs are.

 

Do you need to protect data in flight?  You need transport layer encryption, typically with stronger 256-bit AES encryption.

 

Do you need to protect data at rest on disk against someone outside the database reading it (solely at the OS level)?  Full database encryption, defined when the database is created, is needed.


Or do you need to protect the data from all users inside or outside the database?  Column-level encryption is the best choice here.  It can be turned on or off at will as business needs change.

 

Lastly, performance is secondary.  I have yet to encounter a customer that wants security but is willing to sacrifice that security for performance.  The customers I have worked with make sure that security is in place first, then worry about performance.

Are you new to SAP Sybase IQ?  Have you used the product but are still not clear on what indexes to use or when to use them?  Have you just migrated to SAP Sybase IQ 16 and are wondering if there are any indexing changes?  This blog will discuss indexes and strategies for SAP Sybase IQ.

 

Basic Indexes

When creating a table in IQ, a default index is placed on every column.  This index is referred to as a Fast Projection (FP) index.  In order to enable optimized FP indexes, one must set the Minimize_Storage option to ON.

set option public.Minimize_Storage = 'on';

 

From this point, all new tables and columns will use n-bit encoding for highly optimized and compressed data storage and access.  Each FP index will be encoded as 1 to 31 bits depending on the data cardinality (2^1 to 2^31 distinct values).

 

Advanced Indexes

In addition to the basic FP indexes, IQ supports nine other indexes:

  • High Group (HG) – used for join columns and for searchable columns whose data has more than 2,000 distinct values
  • Low Fast (LF) – used for searchable columns whose data has fewer than 2,000 distinct values
  • High Non Group (HNG) – used for range searches (except on date, time, and datetime data types) and for single-column aggregations
  • Word (WD) – used for whole-word searching of character-based data
  • Text (TEXT) – used for full-text searching of character-based data
  • Date (DATE) – used for columns with a DATE data type (ranges and datepart queries)
  • Time (TIME) – used for columns with a TIME data type (ranges and datepart queries)
  • Datetime (DTTM) – used for columns with a DATETIME/TIMESTAMP data type (ranges and datepart queries)
  • Compare (CMP) – used for pre-computing the comparison of two columns in the same table

 

When are indexes used?

In most relational database engines, the optimizer chooses the best single index for a given table in the query.  SAP Sybase IQ takes the optimization process to the next level by looking not only at each column in the query, but at every join and search argument.  For example, assume that a date column is used both in a join and in a search on the month datepart.  The optimizer would prefer the HG index on the date column for the join, and the DATE index for the month datepart search.
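
A hypothetical query illustrating the point (assuming both an HG and a DATE index exist on order_date):

SELECT COUNT(*)
FROM orders o
JOIN shipments s ON o.order_date = s.ship_date    -- join: HG index preferred
WHERE DATEPART(month, o.order_date) = 6;          -- datepart search: DATE index preferred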

 

In short, IQ will look at each join, where clause, aggregation, etc. to determine the best index for each operation.  Not just one index per table and not just one index per column.  But rather, one index for each use of each column in the query!

 

Should we be concerned about over indexing?

The short answer is an emphatic NO!  As mentioned previously, typical relational database engines pick the best single index to access data in a table.  Second, when indexes are added to most RDBMS engines, load times tend to slow down because of how data is processed.  Third, most database engines store data and indexes in different structures, so that as more indexes are added the table size swells.  All of this leads to trying to minimize the total number of indexes on a table in most engines.

 

SAP Sybase IQ is not prone to these limitations.  We pick as many indexes as needed to process the entire query.  Indexes in IQ are not as large as the data being indexed: each index employs various levels of vectors, bitmaps, data representations, and compression to dramatically decrease the database size.  Finally, when loading data, IQ can consume all CPU resources.  We recently released a fully audited benchmark in which SAP Sybase IQ loaded data at over 34.3 TB/hour on an 80-core Linux system (http://scn.sap.com/community/developer-center/analytic-server/blog/2013/05/30/sap-sybase-iq-16-sets-guinness-world-record-with-343tb-per-hour-for-big-data-loading).

 

 

All of these SAP Sybase IQ features lead to an engine that can handle as many indexes as necessary for all query (canned or ad hoc) activity.  It is quite common to have 30-60% of all columns indexed with the advanced indexes (and in some cases multiple indexes per column).

 

When to use Indexes?

Now that we have some basics about indexes, the big question is when to use them.  Here are some simple guidelines that will help cover every situation; a SQL sketch applying them follows the list.  The general idea is to ask a series of questions about each column.  The answers will dictate whether or not indexes are needed, and if needed, which one(s) are best.

 

  • Will this column be used as a JOIN clause?
    • NO: Continue to next question
    • YES: Use an HG index
  • Will this column be used as a search column in the WHERE clause?
    • NO: Continue to next question
    • YES: Use an HG index if the column has more than 2,000 distinct values.  If it has fewer than 2,000 distinct values, use an LF.
  • Is the column data type a DATE, TIME, TIMESTAMP, or DATETIME?
    • NO: Continue to next question
    • YES: Use a DATE, TIME, or DTTM index.  All date/time/datetime fields should have this index as systems quite frequently use date fields for searching, filtering, and aggregations.
  • Will whole word searching be done on the character data?
    • NO: Continue to next question
    • YES: Use WD (WORD) index
  • Will text searching be done on the character data?
    • NO: Continue to next question
    • YES: Use a TEXT index
  • Will this column be compared to another column in the same table?
    • NO: Continue to next question
    • YES: Use a CMP (compare) index that includes both columns
  • Will this column be used for range searches?
    • NO: Continue to next question
    • YES: Use an HNG (high non group) index
  • Will this column be used in aggregations with no other columns?
    • NO: Continue to next question
    • YES: Use an HNG (high non group) index
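
Here is the promised sketch, applying the checklist to a hypothetical orders table (all names are placeholders):

CREATE HG   INDEX orders_custid_hg  ON orders (customer_id);           -- join column
CREATE LF   INDEX orders_status_lf  ON orders (status);                -- search column, fewer than 2,000 distinct values
CREATE DATE INDEX orders_odate_date ON orders (order_date);            -- DATE data type
CREATE CMP  INDEX orders_dates_cmp  ON orders (ship_date, order_date); -- comparing two columns of the same type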

 

There are two types of indexes in IQ: those with detailed statistics and those with minimal statistics.  The HG, LF, TEXT, and WD indexes contain detailed statistics; the HNG, CMP, DATE, TIME, and DTTM indexes contain minimal statistics.  After walking through the above checklist, you may find that an index with minimal statistics was all that was warranted for a column.  Should that situation arise, it is also recommended that an index with detailed statistics be placed on that column.  Use an HG or LF index for these detailed statistics; the choice depends on the current or expected cardinality of the column's data.
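
For instance, for a hypothetical amount column used only in range searches (so the checklist yields just an HNG), the pairing might look like this:

CREATE HNG INDEX orders_amount_hng ON orders (amount);  -- range searches (minimal statistics)
CREATE HG  INDEX orders_amount_hg  ON orders (amount);  -- adds detailed statistics (high-cardinality column)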

 

The process to choose indexes may seem daunting especially if the data model has a significant number of tables and columns.  The process outlined above assumes that very little is known about the application and its use of the data.  This is, generally, not the case as most DBAs and data modelers do know more about the data and its use.

 

Should you know more about the model and data, we can simplify the process.  I prefer to look at columns that are known to be used in WHERE clauses of most queries and reports.  Indexes on join and search criteria are the single biggest factor in performance, so focusing solely on those will allow IQ to be indexed for nearly every situation.

 

Certainly some columns and indexes will be missed with this approach.  However, since the focus was on columns that are heavily used, the impact of missing a few minor, or less frequently used, columns should be relatively low.

 

Lastly, know that IQ provides a safety net for us!  IQ has an index advisor that can be enabled by setting the option Index_Advisor to ON.

set option public.Index_Advisor = 'on';

 

Once on, IQ will print index advice in query plans (if they are enabled) as well as to a system table that can be accessed via the stored procedure sp_iqindexadvice.  The advisor output notes under what circumstance each index is needed, and the stored procedure also aggregates the number of times each piece of advice was given.  If that procedure is used to trend the data over a period of time (days, weeks, even months), the counts help focus efforts on the most frequently recommended indexes.
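
A quick sketch of reviewing the stored advice (the optional reset parameter follows the IQ 16 documentation; verify it for your release):

CALL sp_iqindexadvice();           -- list accumulated advice with occurrence counts
-- CALL sp_iqindexadvice(NULL, 1); -- optionally reset the stored advice after acting on it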