[Oracle] Myths and common misconceptions about (tr...

stefan_koehler · ‎06-17-2013

Introduction

Over the past few years I have worked a lot with mission-critical Oracle databases in highly consolidated or centralized environments, and I have noticed several myths and common misconceptions about memory management for Oracle databases on Linux (mainly SLES and OEL).

This blog post covers the basics of the relevant memory management for Oracle databases on Linux and tries to clear up several myths. I start with the common ones and may extend this post with new and interesting details over time. It is meant to be a central, sorted collection of relevant information.

Definition and insights into huge pages and transparent huge pages

Official Linux Documentation

Huge Pages and Transparent Huge Pages

Memory is managed in blocks known as pages. A page is 4096 bytes. 1MB of memory is equal to 256 pages; 1GB of memory is equal to 262,144 pages, etc. CPUs have a built-in memory management unit that contains a list of these pages, with each page referenced through a page table entry.

There are two ways to enable the system to manage large amounts of memory:

Increase the number of page table entries in the hardware memory management unit
Increase the page size

The first method is expensive, since the hardware memory management unit in a modern processor only supports hundreds or thousands of page table entries. Additionally, hardware and memory management algorithms that work well with thousands of pages (megabytes of memory) may have difficulty performing well with millions (or even billions) of pages. This results in performance issues: when an application needs to use more memory pages than the memory management unit supports, the system falls back to slower, software-based memory management, which causes the entire system to run more slowly.

Red Hat Enterprise Linux 6 implements the second method via the use of huge pages.

Simply put, huge pages are blocks of memory that come in 2MB and 1GB sizes. The page tables used by the 2MB pages are suitable for managing multiple gigabytes of memory, whereas the page tables of 1GB pages are best for scaling to terabytes of memory.

Huge pages must be assigned at boot time. They are also difficult to manage manually, and often require significant changes to code in order to be used effectively. As such, Red Hat Enterprise Linux 6 also implemented the use of transparent huge pages (THP). THP is an abstraction layer that automates most aspects of creating, managing, and using huge pages.

THP hides much of the complexity in using huge pages from system administrators and developers. As the goal of THP is improving performance, its developers (both from the community and Red Hat) have tested and optimized THP across a wide range of systems, configurations, applications, and workloads. This allows the default settings of THP to improve the performance of most system configurations.

Note that THP can currently only map anonymous memory regions such as heap and stack space.

Huge Translation Lookaside Buffer (HugeTLB)

Physical memory addresses are translated to virtual memory addresses as part of memory management. The mapped relationship of physical to virtual addresses is stored in a data structure known as the page table. Since reading the page table for every address mapping would be time consuming and resource-expensive, there is a cache for recently-used addresses. This cache is called the Translation Lookaside Buffer (TLB).

However, the TLB can only cache so many address mappings. If a requested address mapping is not in the TLB, the page table must still be read to determine the physical to virtual address mapping. This is known as a "TLB miss". Applications with large memory requirements are more likely to be affected by TLB misses than applications with minimal memory requirements because of the relationship between their memory requirements and the size of the pages used to cache address mappings in the TLB. Since each miss involves reading the page table, it is important to avoid these misses wherever possible.

The Huge Translation Lookaside Buffer (HugeTLB) allows memory to be managed in very large segments so that more address mappings can be cached at one time. This reduces the probability of TLB misses, which in turn improves performance in applications with large memory requirements.

*** Side Note: Transparent Huge Pages (THP) support was officially announced with Linux kernel version 2.6.38.

Oracle Documentation addition

HugePages is a feature integrated into the Linux kernel with release 2.6. This feature basically provides the alternative to the 4K page size (16K for IA64) providing bigger pages.

Regarding the HugePages, there are some other similar terms that are being used like, hugetlb, hugetlbfs. Before proceeding into the details of HugePages, see the definitions below:

Page Table: A page table is the data structure of a virtual memory system in an operating system to store the mapping between virtual addresses and physical addresses. This means that on a virtual memory system, the memory is accessed by first accessing a page table and then accessing the actual memory location implicitly.
TLB: A Translation Lookaside Buffer (TLB) is a buffer (or cache) in a CPU that contains parts of the page table. This is a fixed size buffer being used to do virtual address translation faster.
hugetlb: This is an entry in the TLB that points to a HugePage (a large/big page larger than regular 4K and predefined in size). HugePages are implemented via hugetlb entries, i.e. we can say that a HugePage is handled by a "hugetlb page entry". The "hugetlb" term is also (and mostly) used synonymously with a HugePage. In this document the term "HugePage" is going to be used but keep in mind that mostly "hugetlb" refers to the same concept.
hugetlbfs: This is a new in-memory filesystem like tmpfs and is presented by 2.6 kernel. Pages allocated on hugetlbfs type filesystem are allocated in HugePages.

Graphical illustration of regular (normal) and huge pages

When a single process works with a piece of memory, the pages that the process uses are referenced in a local page table for the specific process. The entries in this table also contain references to the System-Wide Page Table, which actually has references to actual physical memory addresses. So theoretically a user mode process (i.e. Oracle processes) follows its local page table to access the system page table and then can reference the actual physical table virtually. As you can see below, it is also possible (and very common to Oracle RDBMS due to SGA use) that two different O/S processes can point to the same entry in the system-wide page table.

When HugePages are in play, the usual page tables are employed. The very basic difference is that the entries in both the process page table and the system page table have attributes about huge pages. So any page in a page table can be a huge page or a regular page. The following diagram illustrates 4096K hugepages but the diagram would be the same for any huge page size.

I guess this should be enough general information about huge pages and transparent huge pages to understand the concepts and basics. If you are interested, please check the reference section, which includes more detailed information (such as a performance comparison).

Why do we care about such memory handling at all and what are the advantages?

Well, as I mentioned previously, I have worked with Oracle databases in highly consolidated or centralized environments over the past years, and in such environments there is a lot of thinking about "how to utilize the infrastructure and hardware in the best way". Just imagine a distributed SAP system landscape with a centralized Oracle database infrastructure (on VMware or whatever). How can you put as many databases as possible on such an infrastructure without them harming each other's performance? "Classical" database or SQL tuning is of course important for reducing the I/O, CPU and memory load, but you can also tune the operating system to get much better utilization and throughput.

... and so we get into memory management as well. RAM is still the most expensive and limiting hardware component, so we don't want to waste it without a valid reason. This finally brings us to the use case of regular, huge and transparent huge pages for Oracle databases.

Jonathan Lewis has already written a blog post about a memory usage issue after a database migration from a 32-bit to a 64-bit operating system and mentioned a solution called "huge pages" for it.

"A client recently upgraded from 32-bit Oracle to 64-bit Oracle because this would allow a larger SGA. At the same time they increased their SGA from about 2GB to 3GB hoping to take more advantage of their 8GB of RAM. The performance of their system did not get better – in fact it got worse.

…

It is important background information to know that they were running a version of Red Hat Linux and that there were typically 330 processes connected to the database using an average of about 4MB of PGA each.

Using small memory pages (4KB) on a 32-bit operating system the memory map for a 2GB SGA would be: 4 bytes for each of 524,288 pages, totalling 2MB per process, for a grand total of 660MB memory space used for mapping when the system has warmed up. So when the system was running at steady state, the total memory directly related to Oracle usage was: 2GB + 660MB + 1.2GB (PGA) = 3.8GB, leaving about 4.2GB for O/S and file system cache.

…

Upgrade to a 64-bit operating system and a 3GB SGA and you need 8 bytes for each page in the memory map and have 786,432 pages, for a total of 6MB per process, for a total of 1,980 MB of maps – an extra 1.3GB of memory lost to maps. Total memory directly related to Oracle usage: 3GB + 1.9GB + 1.2GB (PGA) = 6.1GB, leaving about 1.9GB for O/S and file system cache.

"

This example is about a pretty tiny SGA. Now think about databases with a much larger cache size, or a lot of databases with a small cache size (in a highly consolidated environment), and scale it up. I think you get the point.
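The arithmetic in the quote is easy to replay. The following is only a rough sketch of that simplified model (one map entry per page, per attached process); the function name pte_overhead_mb is made up for illustration, and real page tables are multi-level structures, so treat the numbers as estimates.

```shell
#!/bin/sh
# Estimate the memory spent on memory maps for a shared SGA.
# Simplified model taken from the quote: one entry per page, per process.

# pte_overhead_mb SGA_MB PAGE_KB ENTRY_BYTES PROCESSES
pte_overhead_mb() {
    sga_mb=$1; page_kb=$2; entry_bytes=$3; procs=$4
    pages=$(( sga_mb * 1024 / page_kb ))        # pages needed to map the SGA
    per_proc_bytes=$(( pages * entry_bytes ))   # map size for one process
    echo $(( per_proc_bytes * procs / 1024 / 1024 ))   # total in MB, rounded down
}

pte_overhead_mb 2048 4 4 330      # 32-bit, 2GB SGA, 4KB pages    -> 660
pte_overhead_mb 3072 4 8 330      # 64-bit, 3GB SGA, 4KB pages    -> 1980
pte_overhead_mb 3072 2048 8 330   # 64-bit, 3GB SGA, 2MB pages    -> 3
```

The last call shows why huge pages are interesting here: under the same model the 1,980 MB of maps shrink to a few MB.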

Advantages of huge pages

Larger Page Size and Less # of Pages: Default page size is 4K whereas the HugeTLB size is 2048K. That means the system would need to handle 512 times fewer pages.
Reduced Page Table Walking: Since a HugePage covers a greater contiguous virtual address range than a regular sized page, the probability of getting a TLB hit per TLB entry is higher with HugePages than with regular pages. This reduces the number of times page tables are walked to obtain a physical address from a virtual address.
Less Overhead for Memory Operations: On virtual memory systems (any modern OS) each memory operation is actually two abstract memory operations. With HugePages, since there are less number of pages to work on, the possible bottleneck on page table access is clearly avoided.
Less Memory Usage: From the Oracle Database perspective, with HugePages, the Linux kernel will use less memory to create pagetables to maintain virtual to physical mappings for SGA address range, in comparison to regular size pages. This makes more memory to be available for process-private computations or PGA usage.
No Swapping: We must avoid swapping to happen on Linux OS at all. HugePages are not swappable (whereas regular pages are). Therefore there is no page replacement mechanism overhead. HugePages are universally regarded as pinned.
No 'kswapd' Operations: kswapd will get very busy if there is a very large area to be paged (i.e. 13 million page table entries for 50GB memory) and will use an incredible amount of CPU resource. When HugePages are used, kswapd is not involved in managing them.
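Two of the numbers in this list (the factor of 512 and the "13 million page table entries for 50GB") can be checked with a few lines of shell arithmetic. This is just a small sketch; page_count is a helper name I made up for it.

```shell
#!/bin/sh
# page_count SIZE_GB PAGE_KB: number of pages needed to cover SIZE_GB of memory
page_count() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

small=$(page_count 50 4)       # regular 4K pages
huge=$(page_count 50 2048)     # 2048K huge pages
echo "4K pages: $small, 2048K pages: $huge, ratio: $(( small / huge ))"
# prints: 4K pages: 13107200, 2048K pages: 25600, ratio: 512
```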

Myth 1 - We are running a Linux kernel version that supports transparent huge pages, so the Oracle database already uses huge pages for the SGA

This is a common myth in newer Oracle / Linux system landscapes, but unfortunately it is not true at all. Starting with Red Hat 6, OEL 6, SLES 11 SP2 and UEK2 kernels, transparent huge pages are implemented and enabled by default in an attempt to improve memory management, but not every kind of memory is currently supported.

The following information was gathered on Oracle Enterprise Linux 6.2 (2.6.39-100.7.1.el6uek.x86_64) running an Oracle database 11.2.0.3.2. The instance uses manual memory management (no AMM or ASMM) to keep it simple.

[root@OEL11 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled

[always] madvise never

Transparent huge pages are enabled, as "always" is shown in brackets. So let's verify it with a running Oracle database.
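If you want to script this check instead of reading the brackets by eye, something like the following sketch might do. The helper name thp_mode is my own; it simply extracts the bracketed word from a line in the format shown above.

```shell
#!/bin/sh
# thp_mode: read a line such as "[always] madvise never" on stdin
# and print only the active setting, i.e. the word inside the brackets.
thp_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Usage on a live system (path as shown in this post):
#   thp_mode < /sys/kernel/mm/transparent_hugepage/enabled
```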

*** Before database startup

[root@OEL11 ~]# cat /proc/meminfo | grep AnonHugePages

AnonHugePages:         0 kB

*** Database is started up with db_cache_size=300M, shared_pool_size=200M,

*** pga_aggregate_target=100M and pre_page_sga=TRUE

SQL> startup

ORACLE instance started.

Total System Global Area  559575040 bytes

*** After database startup

[root@OEL11 ~]# cat /proc/meminfo | grep AnonHugePages

AnonHugePages:      4096 kB

That seems pretty strange, right? The SGA is roughly 500 MB and fully allocated, but only 4 MB of transparent huge pages are currently in use. What's wrong here? Let's check the memory usage of each database process.

[root@OEL11 ~]# for PRID in $(ps -o pid= -u orat11)

do

  THP=$(awk '/AnonHugePages/ {sum+=$2} END {print sum+0}' /proc/$PRID/smaps)

  echo  "PID: $PRID - AnonHugePages: $THP"

done

PID: 11903 - AnonHugePages: 0

...

PID: 12653 - AnonHugePages: 0

PID: 12655 - AnonHugePages: 4096

PID: 12657 - AnonHugePages: 0

...

PID: 12926 - AnonHugePages: 0

[root@OEL11 ~]# ps -ef | grep 12655

orat11   12655     1  0 15:54 ?        00:00:00 ora_dbw0_T11

Only the DBWR is using transparent huge pages, and just 4 MB of them, which is nothing compared to the SGA size, right?

In reality nothing is wrong. It works as designed, as the kernel documentation for transparent huge pages shows:

[root@OEL11 ~]# cat /usr/share/doc/kernel-doc-2.6.39/Documentation/vm/transhuge.txt

...

Transparent Hugepage Support is an alternative means of using huge pages for the backing of virtual

memory with huge pages that supports the automatic promotion and demotion of page sizes and

without the shortcomings of hugetlbfs.

Currently it only works for anonymous memory mappings but in the future it can expand over the

pagecache layer starting with tmpfs.

...

.. and here we go .. transparent huge pages are currently supported for anonymous memory (like the PGA heap) only and nothing else. So the SGA (shared memory) still uses the regular page size, and transparent huge pages do not help to reduce the mapping overhead here.

IMPORTANT HINT: Due to known problems, Oracle does not recommend transparent huge pages at all (not even for the PGA heap). Please check the reference section (MOS ID 1557478.1 or SAP note #1871318) for details about deactivating this feature.
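For a quick test, the setting can be switched at runtime by writing "never" to the same sysfs file that was read above. The following is only a sketch under that assumption: set_thp_never is a made-up helper name, the change is not persistent across reboots, and the referenced notes remain the authoritative source for the permanent procedure on your distribution.

```shell
#!/bin/sh
# set_thp_never [FILE]: write "never" to the THP control file (root needed).
# FILE defaults to the path used in this post; other kernels may use a
# different location, so check your system first.
set_thp_never() {
    thp_file=${1:-/sys/kernel/mm/transparent_hugepage/enabled}
    if [ ! -w "$thp_file" ]; then
        echo "cannot write $thp_file (root privileges needed?)" >&2
        return 1
    fi
    echo never > "$thp_file"
}

# Usage (as root):
#   set_thp_never && cat /sys/kernel/mm/transparent_hugepage/enabled
```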

"Because Transparent HugePages are known to cause unexpected node reboots and performance problems with RAC, Oracle strongly advises to disable the use of Transparent HugePages. In addition, Transparent Hugepages may cause problems even in a single-instance database environment with unexpected performance problems or delays. As such, Oracle recommends disabling Transparent HugePages on all Database servers running Oracle."

Myth 2 - Huge pages are difficult to manage in highly consolidated and critical environments

This myth used to be entirely true (and is still true in various cases nowadays), but Oracle has improved the procedure for allocating huge pages with Oracle patch set 11.2.0.3.

Let's clarify the root problem first. Imagine a highly consolidated Oracle database system landscape on several physical or virtual hosts. Before Oracle 11.2.0.3 you had to calculate and define the number of huge pages for all databases on a particular host. This works pretty well if you have a stable number of instances/databases (with a fixed memory size), but what if you need to add several new instances/databases to your production server? You cannot manually adjust the corresponding kernel parameters (and maybe reboot the server) just because of a newly deployed instance/database. Without that adjustment, however, the new instance would allocate its whole SGA in regular pages (4 KB) if the SGA does not fit into the remaining free huge pages area. This can cause nasty paging trouble, as the memory calculation is based on using large pages. Or think about automated database provisioning: of course you could size the huge page area so large that you never run into a problem, but then you have missed the original goal of using the hardware resources as effectively as possible.

Let's check out the improvements of Oracle 11.2.0.3 for huge page handling. The following information was gathered on Oracle Enterprise Linux 6.2 (2.6.39-100.7.1.el6uek.x86_64) running an Oracle database 11.2.0.3.2. The instance uses manual memory management (no AMM or ASMM) to keep it simple, and transparent huge pages are disabled.

Initial settings for every parameter setting test

[root@OEL11 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled

always madvise [never]

[root@OEL11 ~]# cat /proc/meminfo  | grep Huge

HugePages_Total:     150

HugePages_Free:      150

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:       2048 kB

Transparent huge pages are disabled, as "never" is shown in brackets, and roughly 300 MB of memory (150 pages of 2048 kB) is assigned to the huge pages "pool" and still free.

The SGA of my Oracle instance is still roughly 500 MB, so it would not be able to allocate all of its memory as huge pages. Let's verify the different behaviors of the parameter "use_large_pages".

Parameter use_large_pages=TRUE (= Default)

Let's check the default behavior of Oracle 11.2.0.3 first.

*** Database is started up with db_cache_size=300M, shared_pool_size=200M,

*** pga_aggregate_target=100M, pre_page_sga=TRUE and use_large_pages=TRUE

SQL> startup

ORACLE instance started.

Total System Global Area  559575040 bytes

*** Alert Log

****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 300 MB (55%)

Large Pages used by this instance: 150 (300 MB)

Large Pages unused system wide = 0 (0 KB) (alloc incr 4096 KB)

Large Pages configured system wide = 150 (300 MB)

Large Page size = 2048 KB

*** After database startup

[root@OEL11 ~]# cat /proc/meminfo | grep Huge

AnonHugePages:         0 kB

HugePages_Total:     150

HugePages_Free:        0

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:       2048 kB

[root@OEL11 ~]# ipcs -a

------ Shared Memory Segments --------

key        shmid      owner      perms      bytes      nattch     status     

0x00000000 65537      orat11     640        12582912   25                     

0x00000000 98306      orat11     640        276824064  25                     

0x00000000 131075     orat11     640        20971520   25                     

0x00000000 163844     orat11     640        4194304    25                     

0x00000000 196613     orat11     640        247463936  25                     

0x4eb56684 229382     orat11     640        2097152    25

As you can see, Oracle used all the available huge pages first and, after they ran out, used regular pages for the rest. Several shared memory segments are created and used as a side effect of this enhancement.
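The "300 MB (55%)" line from the alert log can be cross-checked against the ipcs listing. The sketch below sums the segment sizes of one owner and relates them to the huge pages in use; the function name large_page_share and its argument order are my own invention, and it assumes the column layout shown above (owner in column 3, bytes in column 5).

```shell
#!/bin/sh
# large_page_share OWNER HUGE_PAGES PAGE_KB   (ipcs output on stdin)
# Sums the shared memory segment sizes of OWNER and prints how much of
# that total is covered by HUGE_PAGES pages of PAGE_KB each.
large_page_share() {
    awk -v owner="$1" -v pages="$2" -v page_kb="$3" '
        $3 == owner && $5 ~ /^[0-9]+$/ { total += $5 }
        END {
            if (total == 0) { print "no segments for " owner; exit 1 }
            huge = pages * page_kb * 1024
            printf "%d MB of %d MB in large pages (%d%%)\n",
                   huge / 1048576, total / 1048576, int(huge * 100 / total)
        }'
}

# Usage on a live system:
#   ipcs -m | large_page_share orat11 150 2048
```

Fed with the listing above, this prints "300 MB of 538 MB in large pages (55%)". The first four segments also add up to exactly 300 MB, which suggests (but does not prove) that these are the ones backed by huge pages.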

Parameter use_large_pages=ONLY

Let's check the parameter value "use_large_pages=ONLY" and its behavior if there are not sufficient large pages at database startup.

*** Database is started up with db_cache_size=300M, shared_pool_size=200M,

*** pga_aggregate_target=100M, pre_page_sga=TRUE and use_large_pages=ONLY

SQL> startup

ORA-27137: unable to allocate large pages to create a shared memory segment

Linux-x86_64 Error: 12: Cannot allocate memory

*** Alert Log

****************** Large Pages Information *****************

Parameter use_large_pages = ONLY

Large Pages unused system wide = 150 (300 MB) (alloc incr 4096 KB)

Large Pages configured system wide = 150 (300 MB)

Large Page size = 2048 KB

ERROR:

  Failed to allocate shared global region with large pages, unix errno = 12.

  Aborting Instance startup.

  ORA-27137: unable to allocate Large Pages to create a shared memory segment

As you can see, we can also force the instance to use large pages only for the whole SGA; the startup fails with an ORA-27137 error if not enough large pages are available. This setting is usually used to avoid an out-of-memory situation caused by a mix of regular and large pages (as with the default behavior).
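Before starting an instance with use_large_pages=ONLY, a rough pre-flight check can tell you whether the free huge pages are likely to be sufficient. This is a minimal sketch: only_would_fit is a hypothetical helper, it reads /proc/meminfo style text on stdin, and it treats the raw SGA size as a lower bound (the AUTO test in this post ends up with 269 pages, a few more than the raw size suggests).

```shell
#!/bin/sh
# only_would_fit REQUIRED_BYTES   (meminfo text on stdin)
# Exit status 0 if enough free huge pages exist, 1 if not, 2 if unknown.
only_would_fit() {
    awk -v need="$1" '
        /^HugePages_Free:/ { free = $2 }
        /^Hugepagesize:/   { size_kb = $2 }
        END {
            if (size_kb == 0) { print "no huge page info"; exit 2 }
            # round up to whole huge pages
            pages = int((need + size_kb * 1024 - 1) / (size_kb * 1024))
            if (pages <= free) { print "fits: need " pages ", free " free; exit 0 }
            print "short: need " pages ", free " free
            exit 1
        }'
}

# Usage on a live system, with the SGA size from the startup output:
#   only_would_fit 559575040 < /proc/meminfo
```

With the 150 free pages from the initial settings and the 559575040 byte SGA, this reports "short: need 267, free 150", which matches the failed startup.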

Parameter use_large_pages=AUTO

This is a completely new option introduced with Oracle 11.2.0.3. Let's verify its impact if there are not sufficient large pages at database startup.

*** Database is started up with db_cache_size=300M, shared_pool_size=200M,

*** pga_aggregate_target=100M, pre_page_sga=TRUE and use_large_pages=AUTO

SQL> startup

ORACLE instance started.

Total System Global Area  559575040 bytes

*** Alert Log

DISM started, OS id=1610

****************** Large Pages Information *****************

Parameter use_large_pages = AUTO

Total Shared Global Region in Large Pages = 538 MB (100%)

Large Pages used by this instance: 269 (538 MB)

Large Pages unused system wide = 0 (0 KB) (alloc incr 4096 KB)

Large Pages configured system wide = 269 (538 MB)

Large Page size = 2048 KB

Time taken to allocate Large Pages = 0.025895 sec

***********************************************************

*** After database startup

[root@OEL11 trace]# cat /proc/meminfo | grep Huge

HugePages_Total:     269

HugePages_Free:        0

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:       2048 kB

[root@OEL11 trace]# ipcs -a

------ Shared Memory Segments --------

key        shmid      owner      perms      bytes      nattch     status     

0x6c6c6536 0          root       600        4096       0                      

0x00000000 360449     orat11     640        12582912   24                     

0x00000000 393218     orat11     640        549453824  24                     

0x4eb56684 425987     orat11     640        2097152    24

As you can see, Oracle automatically reconfigured the Linux kernel and (temporarily) increased the number of huge pages so that the complete SGA fits in. This is possible if you have enough free(able) memory. If you look closely at the alert log snippet, you will also notice an unusual startup comment like "DISM started, OS id=1610". DISM is responsible for tasks like increasing the number of huge pages or increasing the process priority. Root privileges are needed for such tasks, so check that the dism binary has the correct permissions (s-bit and owner).
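The permission check can be scripted as well. The sketch below only reports the setuid bit and the numeric owner of a file; setuid_info is a made-up helper, and the exact name and location of the binary in your Oracle home is an assumption you need to verify yourself.

```shell
#!/bin/sh
# setuid_info FILE: print whether FILE has the setuid bit set and which
# numeric user id owns it (uid 0 is root).
setuid_info() {
    f=$1
    [ -e "$f" ] || { echo "missing: $f"; return 1; }
    owner_uid=$(ls -ln "$f" | awk '{print $3}')
    if [ -u "$f" ]; then bit=yes; else bit=no; fi
    echo "setuid=$bit uid=$owner_uid"
}

# Hypothetical usage; adjust the path to your installation:
#   setuid_info "$ORACLE_HOME/bin/dism"
```

For a binary that is supposed to run with root privileges you would expect "setuid=yes uid=0".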

Myth 3 - Huge pages can not be used for Oracle instances with ASMM (Automatic Shared Memory Management)

This is the most common misconception that I am confronted with. Personally I am not a fan of automatic shared memory management, but some of my clients use it, of course. I guess the root cause of this misconception is the naming of two similar memory features called ASMM and AMM. So let's check the official documentation about both features and the huge pages restriction first.

Automatic Shared Memory Management (ASMM)

Automatic Shared Memory Management simplifies SGA memory management. You specify the total amount of SGA memory available to an instance using the SGA_TARGET initialization parameter and Oracle Database automatically distributes this memory among the various SGA components to ensure the most effective memory utilization.

When automatic shared memory management is enabled, the sizes of the different SGA components are flexible and can adapt to the needs of a workload without requiring any additional configuration. The database automatically distributes the available memory among the various components as required, allowing the system to maximize the use of all available SGA memory.

Automatic Memory Management (AMM)

The simplest way to manage instance memory is to allow the Oracle Database instance to automatically manage and tune it for you. To do so (on most platforms), you set only a target memory size initialization parameter (MEMORY_TARGET) and optionally a maximum memory size initialization parameter (MEMORY_MAX_TARGET). The total memory that the instance uses remains relatively constant, based on the value of MEMORY_TARGET, and the instance automatically distributes memory between the system global area (SGA) and the instance program global area (instance PGA). As memory requirements change, the instance dynamically redistributes memory between the SGA and instance PGA.

When automatic memory management is not enabled, you must size both the SGA and instance PGA manually.

Restrictions for HugePages Configurations

The Automatic Memory Management (AMM) and HugePages are not compatible. With AMM the entire SGA memory is allocated by creating files under /dev/shm. When Oracle Database allocates SGA that way HugePages are not reserved. You must disable AMM on Oracle Database to use HugePages.
If you are using VLM in a 32-bit environment, then you cannot use HugePages for the Database Buffer cache. HugePages can be used for other parts of SGA like shared_pool, large_pool, and so on. Memory allocation for VLM (buffer cache) is done using shared memory file systems (ramfs/tmpfs/shmfs). HugePages does not get reserved or used by the memory file systems.
HugePages are not subject to allocation or release after system startup, unless a system administrator changes the HugePages configuration by modifying the number of pages available, or the pool size. If the space required is not reserved in memory during system startup, then HugePages allocation fails.

So basically, both features are used for automatic memory management, but ASMM controls the SGA only, while AMM controls the SGA and the PGA. If you look closely at the restrictions, you will see that only AMM is incompatible with huge pages; ASMM is compatible. AMM is not based on the "classical shared memory segment"; it is implemented by using the /dev/shm "filesystem".
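The distinction boils down to two parameters, so it can be written as a tiny decision rule. This is just an illustrative sketch: memory_mode is a hypothetical helper, and it expects the values of MEMORY_TARGET and SGA_TARGET in bytes (0 meaning "not set"), which you would have to look up in your instance first.

```shell
#!/bin/sh
# memory_mode MEMORY_TARGET SGA_TARGET
# Classifies the memory management mode and states whether HugePages
# can be used for the SGA, following the restrictions quoted above.
memory_mode() {
    memory_target=$1; sga_target=$2
    if [ "$memory_target" -gt 0 ]; then
        echo "AMM: HugePages not usable"
    elif [ "$sga_target" -gt 0 ]; then
        echo "ASMM: HugePages usable"
    else
        echo "manual: HugePages usable"
    fi
}

memory_mode 0 524288000    # sga_target=500M, no memory_target
```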

Initial settings for ASMM huge pages test

[root@OEL11 ~]#  cat /sys/kernel/mm/transparent_hugepage/enabled

always madvise [never]

[root@OEL11 ~]#  cat /proc/meminfo  | grep Huge

HugePages_Total:     150

HugePages_Free:      150

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:       2048 kB

Transparent huge pages are disabled, as "never" is shown in brackets, and roughly 300 MB of memory (150 pages of 2048 kB) is assigned to the huge pages "pool" and still free.

Using ASMM and checking the huge pages behavior

*** Database is started up with sga_target=500M, pga_aggregate_target=100M,

*** pre_page_sga=TRUE and use_large_pages=AUTO

SQL> startup

ORACLE instance started.

Total System Global Area  521936896 bytes

*** Alert Log

DISM started, OS id=1400

****************** Large Pages Information *****************

Parameter use_large_pages = AUTO

Total Shared Global Region in Large Pages = 502 MB (100%)

Large Pages used by this instance: 251 (502 MB)

Large Pages unused system wide = 0 (0 KB) (alloc incr 4096 KB)

Large Pages configured system wide = 251 (502 MB)

Large Page size = 2048 KB

Time taken to allocate Large Pages = 0.022167 sec

***********************************************************

*** After database startup

[root@OEL11 trace]#  cat /proc/meminfo  | grep Huge

HugePages_Total:     251

HugePages_Free:        0

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:       2048 kB

[root@OEL11 trace]# ipcs -a

------ Shared Memory Segments --------

key        shmid      owner      perms      bytes      nattch     status     

0x6c6c6536 0          root       600        4096       0                      

0x00000000 65537      orat11     640        12582912   23                     

0x00000000 98306      orat11     640        511705088  23                     

0x4eb56684 131075     orat11     640        2097152    23

As you can see, huge pages and ASMM are fully compatible, and this even works with the "new automatic huge pages extension feature".

Summary

Wow, this blog post has already become quite large, but unfortunately this topic is very broad, so we needed to cover a lot of the basics first. I will keep extending this post as soon as I notice new topics or if you ask for something specific.

If you have any further questions, please feel free to ask, or get in contact directly if you need assistance with implementing complex Oracle database landscapes or with troubleshooting Oracle (performance) issues.