cancel
Showing results for 
Search instead for 
Did you mean: 

master..monProcedureCacheModuleUsage doesn't match master..monProcedureCacheMemoryUsage

Former Member

We're running out of procedure cache and from what I can tell, this happens as MEMC_SCOMPILE_1, shown by monProcedureCacheMemoryUsage table, grows to extremely high amount (over 6Gb).

I'm looking at the monProcedureCacheMemoryUsage and monProcedureCacheModuleUsage and I was expecting the totals for Module to match each other. But they don't.

I run this

select  ModuleID, sum(Active) Active

into    #memusage

from    master..monProcedureCacheMemoryUsage

group by ModuleID

order by ModuleID

select mod.ModuleID, mod.ModuleName, mod.Active ModActive, mem.Active  MemActive

from #modusage mod, #memusage mem where mod.ModuleID = mem.ModuleID

and get

ModuleID    ModuleName                     ModActive   MemActive

----------- ------------------------------ ----------- -----------

           1 Parser                                   6           6

           2 Utilities                                9          16

           4 Optimizer                                0           0

           5 Execution                             3512     2097003

           6 Access                                   0           0

           7 Backup                                   0           0

           8 Recovery                                 0           0

           9 Replication                             55          55

          10 Procedural Objects                 2022359        7106

          11 Sort                                     5           5

          12 HK GC                                   20          20

          14 BLOB Management                          0           0

          15 Partition Conditions                   737         830

          16 Pdes Local HashTab                       9           9

          17 Statement Cache                      80192        6184

          18 CIS                                      4           4

          19 Frame Management                         8           8

          21 Network                                  0           0

          22 Procmem Control                       5850        5850

          23 Data change                            633         633

          26 Multiple Temporary Database              4           4

Many Modules do match but some (especially the large ones dont). When we run out of procedures cache we have, Module 10 in module usage showing over 3,000,000 and Module 5 of memory usage showing over 3,000,000.

Should the totals in the 2 tables match ?

If not why not ?

Do other people get the same results ?

I've seem some bug fixes in 15.7 SP136 - is this a symptom of these issues ?

Thanks

Mike

Accepted Solutions (1)

Accepted Solutions (1)

simon_ogden
Participant
0 Kudos

Hi Mike,

Have a read of:

https://scn.sap.com/message/15990655#15990655

Cheers,

Simon

Former Member
0 Kudos

Thanks - I hadn't seen that - very interesting ... but I'll need to read it again to take it all in.

Just to note - We're currently using 15.7 SP134 so some of the issues should be fixed unless they've inserted some new bugs.

Former Member
0 Kudos

Thanks again for this ...

I've read the posting and this one Spinlocks and CPU usage in SAP ASE - SAP ASE - SCN Wiki and the doc http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/208c05a1-d739-3210-f0ae-f25f343c0...

which has helped to understand the issue ... but there's something I'm not very clear on regarding the Engine Local Cache

As I understand it

The ELC is allocated 50% (default) of the free procedure cache, but what happens when a ELC has allocated its space to procs. Is the ELC initially allocated at start up ?

On our server, we have 12Gb of procedure cache and 40 engines.

Am I correct in assuming, that each ELC is initially allocated (12Gb * 0.5) / 40 = 150Mb

What happens when, one or more of the ELCs uses its initial allocation ?

Does it then grab another block of memory based on 50% of the free space ie (6Gb * 0.5) / 40 = 75Mb

and when this is used up it grabs another 50% of free space depending on how much has been allocated ?

Or does the ELC get its allocations from the global cache is smaller chunks ? ie 2k pages.

From what I can see happening on our server, once we reach 50% of procedure cache used, we have huge spinlock contention on the global procedure cache which leads me to believe that the ELC is being allocated frequently in small chunks ? If so, then why wouldn't I just allocate > 70% of the procedure cache to the ELCs ?

I'm looking at the trace flags 753 and 757 but can't see what 757 actually does. Does anyone know the details ?

Thanks

Mike

Former Member
0 Kudos

After a lot of monitoring and reading I think we've found the issue (or at the very least we've found some solutions/workarounds).

Here are some useful links

https://scn.sap.com/thread/3746247

http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/208c05a1-d739-3210-f0ae-f25f343c0...

http://www.ndm.net/bi/pdf/Managing-DBMS-Workloads-v1.0-WP.pdf

http://wiki.scn.sap.com/wiki/display/SYBASE/Spinlocks+and+CPU+usage+in+SAP+ASE

http://scn.sap.com/thread/3658742

http://mattzhang-tech.blogspot.co.uk/2013/01/stored-procedure-processing-in-sybase.html

I agree with Simon's comment that the ELC was a "little rushed" and doesn't help in every case.

We've decided on these action ... if anyone thinks this is wrong let me know - thanks.

Assuming the monProcedureCacheMemoryUsage and monProcedureCacheModuleUsage are reporting the correct values, then we're not running short of procedure cache, but we are potentially ending up with so much fragmentation that requests for 4k or more procedure cache space are having to go to the global procedure and when many processes are doing this the spinlock contention is huge. In our case we end up with over 90% fragmentation.

So far we've worked around this by clearing out the procedure cache (dbcc proc(free_unused)) - but this seems heavy handed to me and is surely only a workaround.

We also tried dbcc proc(flush_elc) and this improved responsiveness of a server, but we don't know why this would be the case. We really need to understand in detail

how the ELC and procedure cache interact.

We're going to increase the amount of memory to the procedure cache to allow for fragmentation - although 16Gb does seeem rather high

(years ago a server of 32Gb RAM was considered large)

Additionially, we're going to increase the 'engine local cache percent' to 75% to give more allocation to the ELC which we hope will also reduce contention.

According to the manual the ELC should be allocated using the calc : proc cache size *  'engine local cache percent' / engine.

That should give us 150,00 pages but the log says "Proc header memory allocated 18891 pages for engine 1 local cache".

Haven't heard back why these figures are inconsistent ?

I think a better solution is using traceflag 753, which (if I've read this correctly) will only allocate 2k pages. This will reduce the fragmentation and means a page will always be available. If 16k is required, then 8 requests will be needed so will in theory be slower. In effect we're accepting lower average performance for much less contention (and since our machine has been un-responsive at times thats ok with us)

We've also looking at traceflag 757 which might be more aggressive in its procedure cache eviction - but that seems over the top when only half the cache is full.

Thanks again Simon.

simon_ogden
Participant
0 Kudos

Hi Mike,

I'll avoid going into the nitty gritty on this one but I'm glad you have a potential solution 🙂

  1. Forget TF 757 that just avoids the scale back through the sizes when it can't get 8 pages. It is a semi-pointless halfway house toward 753.
  2. The ELC wasn't the thing that was rushed, the ELC is absolutely vital (without it you'd be hung up forever!). What I was referring to as rushed was the implementation of large scale allocations.

Lastly, your proc cache state is definitely a sure sign that 753 may help you. if you have tons of compilation going on (we force 65% of our SQL), you may need to care with it. The good thing is it is entirely dynamic, you can enable it using dbcc traceon(753) and it will immediately stop > 2KB allocations for all spids. If you have a problem, just turn it off again.

Be aware of at least one gotcha that is apparent even when running with 753. Large ad-hoc SQL statements (16KB and above) appear to seek out contiguous chunks of proc cache in order to cache the sql text, this can genuinely bring processes to a standstill and cause ELCs to flush ad infinitum. You'll also see proc bufs being flushed out of the cache for fun despite you having tons of free space (but all in small chunks of course). We ran into this on a recent upgrade. We are now running traceflag 7790 (this stops any sql statement over 16KB being cached) on all our 15.7 production servers.

Cheers,

Simon

Former Member
0 Kudos

Thanks again... agreed the rushed bit being large allocations - didn't mean to mis-quote you.

Interesting read on trace flag 7790. I'll have to try this as well (although I'm meant to be spending my time coding not dealing with production issues!!)

I'm also looking at the output from procbuf and this shows some oddities.

Have you seen anything like this...

pbname='Build'   pbprocnum=1  pbsharecnt=0  pboldplan=(nil)

<snip>

    Total # of PROCHDRs used            : 6

    Total # of pages used               : 84

    Total # of bytes used               : 160297

    Total # of lost bytes for alignment : 15364359

    Total # of pages lost for alignment : 7502

    Total percentage of memory lost     : 9584

More bytes lost than used -  Seems odd to me ?

There are some fixes/improvements to procedure cache usage in SP135 - have you tried it yet ?

simon_ogden
Participant
0 Kudos

I wouldn't start trying to make sense of that output, I can't tell you what counter it is reading. Maybe look at what the other procbufs say, is that one sticking out as odd? Is there something funky with the 'Build' proc? (assuming it's a proc of course)

Haven't looked at SP135 really yet, what CRs are you referring to?

Former Member
0 Kudos

> Haven't looked at SP135 really yet, what CRs are you referring to?


Actually, my mistake, all the improvements to procedure cache usage are in older versions so no need to upgrade to SP135



Thanks again for the help.

Former Member
0 Kudos

Simon,

We tried traceflag 753 and we still get the problem.

Over a 30s period we see

SpinlockName                         SpinlockSlotID Grabs  Spins            Waits  Contention

-------------------------------------- ------------------ ---------  ---------------- -------  --------------

Resource->rproccache_spin   1401                14129  1356281287  13822      97.82

Total amount of procedure cache used is

master..monProcedureCacheModuleUsage reports : 5070996 pages

master..monProcedureCacheMemoryUsage reports : 5070950 pages

master..monCachedProcedures reports : 9,638,510 Kb (9.6GB)

We'll try traceflag 7790 when it happens again.

Thanks

Mike

simon_ogden
Participant
0 Kudos

That is a ridiculously huge amount of spins per wait and crippling contention. Are you also seeing large amounts of re-use during these periods?

Can you pull out a dbcc stacktrace output for all spids when in this state? Maybe run the script 2 or 3 times.  Note it 'should' be safe to run on production but run it at your own risk, I cannot give any guarantees it won't cause some unexplained issue specific to your environment.

BTW, have you logged an incident with SAP?

dbcc traceon (3604)

go

select SPID spid into #t from master..monProcess

create unique index ix on #t(spid)

go

set nocount on

declare @spid int

declare @rowcount int

while 1=1

begin

select top 1 @spid=spid from #t (index ix)

select @rowcount=@@rowcount

if @rowcount<>1

break

print 'stack trace and sql text for spid %1!', @spid

dbcc stacktrace (@spid)

dbcc sqltext (@spid)

print ''

delete from #t where spid=@spid

end

go

drop table #t

go

Former Member
0 Kudos

Sorry - how do I check the " re-use" ?

Yes - we've logged a call with Sybase.

traceflag 753 (don't bother with 757)

increase statement cache

increase engine local cache percent

try setting streamlined dynamic SQL  to 1

Thanks

simon_ogden
Participant
0 Kudos

Check for Reuse_cnt for 'procedure cache' in sp_monitorconfig 'all'.

Do you see this figure increase during these periods? This is a measure of how many plans are being forced out of cache - an indication that maybe something is looking for contiguous space that it can't find. I assume you have spare proc cache being reported and it isn't just full?

Former Member
0 Kudos

Thanks

we have had a lot of reuse...

Name                            Num_free    Num_active    Pct_act   Max_Used    Reuse_cnt 

procedure cache size    4864781     1279219         20.82       5081630       28126

Not enough procedure cache ?

simon_ogden
Participant
0 Kudos

That implies your max used is only about 80% which isn't brilliant but I wouldn't expect it to be as bad as your spinlock output shows. Is it at the HWM when you have the issue?

I'd suggest running that script I posted during the times when it happens and also look at the 'MemUsageKB' field in monProcessActivity to see if it gives any clues as to who may be using the proc cache.

It is possible that the large statement scenario (resolved with TF 7790) is happening, but the stack trace outputs should confirm. You should really pursue this one through tech support if it is causing you business impact.

Former Member
0 Kudos

Thanks a lot for your help.

We have a case open so hopefully we'll have more information from them. Its not actually affecting production at the moment as we're throttling the load - but we'd like to run at full capacity - but from what we can see the capacity is a function of time since reboot.

We've been able to induce this on a test env by running 100s of procs simultaneously.

[On production we managed to see the issue with ~100 procs]

The one consistent factor is that it always happens at about 80% usage of the procedure cache.

Running your sql shows the top function of the stack trace to be

either upyield or upsleepgeneric

Not sure what I should be looking for - but I'll have a look through the 100's of spids.

We'll try TF 7790; I suspect it will help but mostly by delaying the time for the problem to occur.

We're going to increase the amount of procedure cache as well.

Thanks again for your help - its most appreciated.

Mike

Former Member
0 Kudos

Now that we've fixed our issue with procedure cache contention I thought I'd share

some findings.


I repeated ran a proc in a loop that was using a temp table in many parallel connections (150,000 proc calls)


Initial time each run was taking                                            : 1m 5s

After the procedure cache was over 50% full,

the time to run                                                                      : 3m 30s


Turned on Trace Flag 753, and re-ran                                 : 1m  5s


Also tried,

Set  "optimize temp table resolution" on, time taken            :       27s


In this particular case the proc uses a temp table but is a very simple proc.

I haven't tried different procs or different complexity and length.


Hope this helps if anyone else has the same problem.

Former Member
0 Kudos

Do you have HWM zeroed too?

Answers (0)