
Bcp tuning

Former Member
0 Kudos

Hi

(ASE 15.7 SP60 running on Windows 2008)

We are in the process of aggregating many similar databases into one big database on a new server.

Because this is time consuming, we are planning to do it in steps like this:

1. Create the new database, tables etc (but not index, triggers, constraints)

2. Bcp in some of the databases (using fast bcp)

3. create the indexes, triggers, constraints

4. Let the users in and use the new database for some days

5. Drop indexes, triggers and constraints

6. Bcp in more databases (using fast bcp)

7. create the indexes, triggers, constraints

8. Let the users in and use the new database for some days

(continue until all databases have been bcp-ed into the new database)
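A minimal sketch of one load cycle (steps 5-7) for a single table; the names (newdb, big_table, ix_big_clu) and the bcp command line are hypothetical placeholders:

```sql
-- Step 5: drop the clustered index before the next load
-- (fast bcp requires the table to have no indexes and the
-- "select into/bulkcopy/pllsort" database option enabled).
drop index big_table.ix_big_clu
go

-- Step 6: run fast (minimally logged) bcp from the OS shell:
-- bcp newdb..big_table in big_table.dat -c -Usa -Ppassword -Sserver

-- Step 7: rebuild the index afterwards
create clustered index ix_big_clu on big_table (id)
go
```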

Now, the problem.

When we test this, we get good speed on the first bcp jobs (step 2).

Typically we see 1000+ rows/second.

BUT after creating the indexes etc. and dropping them again, we get much slower speeds for the rest of the bcp jobs (step 6).

On the same table, we now get 80 rows/second.

What are we doing wrong?

How can I find out more about what is taking so long?

Regards

Pål

Accepted Solutions (0)

Answers (4)


Former Member
0 Kudos

Hi again

Thanks Andrew, Mark and Brett for trying to help.

We have now received help from Ruairi Prendiville. He found that if we increased the network packet size from 2048 to 65024 and the number of pre-allocated extents from 2 to 32, we got very good speeds again.  We also changed our device usage from one 700 GB device (located on a SAN with extreme-IO flash disks) to chunks of 50 GB device files (located on the same SAN).  I don't know if the device change made much difference, but we are now happy with the speed of bcp in again.
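For reference, the two server settings mentioned can be changed with sp_configure, and the larger packet must also be requested by the client; a sketch, with placeholder server and login values:

```sql
-- Allow larger network packets server-wide (bytes); the client request
-- below must not exceed this value.
sp_configure "max network packet size", 65024
go
-- Pre-allocate more extents per allocation request during bulk copy.
sp_configure "number of pre-allocated extents", 32
go

-- The client must ask for the bigger packet explicitly, e.g. bcp's -A flag:
-- bcp mydb..mytable in mytable.dat -c -Usa -Ppassword -Sserver -A 65024
```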

Regards

Pål

Former Member
0 Kudos

Today we have two more findings.

1. The decrease in speed seems to occur only if we create and drop clustered indexes, and then bcp in to a table on the same device. If we bcp in to a table on another device, speed does not suffer as much.

2. A collection of monSysWaits data.

SELECT msw.WaitTime, mwei.Description
FROM monSysWaits msw
INNER JOIN monWaitEventInfo mwei ON msw.WaitEventID = mwei.WaitEventID
ORDER BY msw.WaitTime DESC

After starting some threads with bcp in, I ran this command, and after a few minutes I ran it again. Below are the differences between the two runs.

Description                                          WaitTime
xact coord: pause during idle loop                        720
hk: pause for some time                                   276
waiting for incoming network data                        1028
until an engine has been offlined wait                    180
distributed xact co-ordinator idle loop                   180
checkpoint process idle loop                              162
Wait until heartbeat or check interval expires             12
waiting for client connection request                     236
waiting on run queue after sleep                          417
waiting on run queue after yield                          188
waiting for last i/o on MASS to complete                   59
waiting for network send to complete                       41
wait for i/o to finish after writing last log page         20
wait to acquire latch                                      19
waiting for buf write to complete before writing           12
waiting for a lock                                         15
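A difference like the one above can be computed directly on the server by snapshotting monSysWaits into temp tables before and after the interval and joining them; a sketch (requires mon_role and MDA monitoring enabled):

```sql
-- Snapshot 1 (before the bcp interval)
select WaitEventID, WaitTime into #w1 from monSysWaits
go
-- ... let the bcp jobs run for a few minutes ...
-- Snapshot 2, then report the per-event delta, largest first
select WaitEventID, WaitTime into #w2 from monSysWaits
go
select w2.WaitTime - w1.WaitTime as WaitTimeDelta, i.Description
from #w2 w2
join #w1 w1 on w1.WaitEventID = w2.WaitEventID
join monWaitEventInfo i on i.WaitEventID = w2.WaitEventID
where w2.WaitTime > w1.WaitTime
order by WaitTimeDelta desc
go
```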
former_member188958
Active Contributor
0 Kudos

I suspect the decrease in performance is due to the way ASE allocates new extents to objects when doing fast bcp or creating indexes.

The allocation process uses the Global Allocation Map (GAM) structure that is stored in the sysgams system table.  Sysgams isn’t a normal table.  It is essentially a large bitmap with one bit per allocation unit (block of 256 pages, further broken down into 32 extents).  If the bit is on, all 32 extents on the allocation unit are allocated, if the bit is off, at least one extent is free.

There are two types of extents; I call them “short” and “long”.  Each allocation unit contains 32 extents.  The first page of each allocation unit contains the allocation page itself, which holds information on which objects each of the 32 extents on the allocation unit is allocated to, which pages are actually in use, and which OAM page is tracking each extent.

Because the first page is used for the allocation page, the first extent is “short”, containing only 7 pages for table data.  The remaining 31 extents are all “long”, containing 8 pages for table data.

Because of internal optimizations, “Fast” bcp and Create Index can only allocate long extents.  In a database that has recently experienced a lot of BCP and/or Index creation, the result can be a large number of allocation units that have all 31 long extents in use but the short extent free.

As you bcp into ASE and fill the current extent, ASE goes looking for a new free extent.  It looks first on the same allocation unit the current extent is on.  Then it scans the OAM looking for an allocation unit with a free extent (an OAM hint causes this search to start where the previous search ended).  If the OAM search comes up empty, the GAM is used to find an allocation unit with a free extent.  In the state described above there are many allocation units that the GAM shows as having at least one free extent; however, upon reading in the allocation page, ASE finds that only the unusable short extent is free, so the next not-full allocation unit is read in, and the process repeats until it finally reaches an allocation unit that has a free long extent.  The GAM search always starts at the beginning of the database, and big bcp jobs need to allocate a lot of new extents, so the same all-but-full allocation pages keep getting read over and over as more extents are needed.

Typically this problem is not seen because other types of DML activity (normal inserts, expanding updates, etc.) are perfectly happy to use the short extent and as soon as they do these all-but-full allocation units become full, the GAM bit is cleared, and allocation doesn’t look at that allocation unit again (until something deallocates an extent).  But the problem is seen at times when there are heavy consumers of the long extents.

A workaround for this issue is to temporarily drop the default segment from the device fragments early in the database that have already become full – the GAM search is limited to allocation units that are part of the object’s segment.
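Dropping and later restoring the default segment on a full device fragment looks roughly like this (the database and device names are hypothetical):

```sql
-- Take the full device out of the default segment so the
-- GAM/OAM search for new extents skips its allocation units.
sp_dropsegment "default", mydb, data_dev1
go
-- ... run the remaining bcp loads ...
-- Put the device back on the default segment afterwards.
sp_extendsegment "default", mydb, data_dev1
go
```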

Alternatively, create a new segment on an unused device and use sp_placeobject to cause new allocations for that object to be made on the new segment.  However, this is more likely to run out of space than the previous method of dropping the default segment on full devices.
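The segment-based alternative might look like this (names — mydb, new_dev, bcp_seg, big_table — are placeholders):

```sql
use mydb
go
-- Create a segment on a fresh, unused device ...
sp_addsegment bcp_seg, mydb, new_dev
go
-- ... and direct future allocations for the table there.
-- Existing pages stay where they are; only new extents go to bcp_seg.
sp_placeobject bcp_seg, "big_table"
go
```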

You can test whether this is the issue by issuing a number of DBCC STACKTRACE commands on the spids doing the bcp in; if most of the stacks show functions like pggam_getfree_lowhigh() and pg__allocate_extents(), this is probably the issue.
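Sampling a session's stack from isql could be sketched as follows (123 is a placeholder for the spid of a bcp-in session):

```sql
-- Route dbcc output to the client session instead of the errorlog.
dbcc traceon(3604)
go
dbcc stacktrace(123)
go
-- Repeat several times; if pggam_getfree_lowhigh() /
-- pg__allocate_extents() dominate the sampled stacks,
-- the GAM scan is the likely bottleneck.
```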

-bret

Former Member
0 Kudos

Hm.  That's an interesting note.  There are a couple of things that puzzle me in it. 

  1. If the first extent in each allocation unit is "short" as you labelled it, and as such "invisible" to the allocation process that create index (clustered or any?) / fast bcp uses, does it mean that roughly 3% of the DB is a no-man's-land for index creation / the bcp process?  Consequently one may encounter a situation where there is ample space in the DB for a particular object but index creation fails (gets stuck indefinitely?), since the only free space in the DB is that of short extents, useless for this type of operation (3 GB of each 100 GB of database space is in fact "invisible" to index creation)?  As a consequence of this, I'd (also) expect index creation performance to deteriorate to the same extent as fast bcp for the same object - do you see that too, Pål? 
  2. If the first extent in each allocation unit is "useless" from the index creation / fast bcp perspective, why is the algorithm for locating the "useful" free extents not made to ignore the first extent when searching for free extents for an index creation / fast bcp operation?  I guess the reason is that there is only one bit in the GAM, which simply states free/full for its allocation unit and cannot be turned on once all of the "long" extents are exhausted, as that would make those same 3% wasted globally. 
  3. We are talking about a performance drop from over 1K to below 100 - a 90% drop due to sequentially reading all free/full allocation units in the DB - that sounds very bad.  There should definitely be a faster way to locate free "long" extents than physically browsing through all allocation units in the DB?  Another bit? 
  4. If index creation / fast bcp at some point hits a sequential read from the beginning of the DB - then fast bcp / index creation should deteriorate over time in general?  When one fills a large table (if I understand the process you outlined correctly), the process first tries to locate a free extent on the same allocation unit as the extent it currently works on; failing to find one, it moves on to locating a free extent on any other allocation unit where the same object resides (via the OAM); failing there, it moves to the GAM and starts browsing sequentially through all allocation units, without reference to the current object (find a free/full page - test it - nope, no long extents there - find the next, etc.).  Shouldn't this always happen?  Filling a table with fast bcp should leave gaps (short extents) as it progresses, gradually reaching a point where the OAM lists no more free extents for the object, pushing the search to the GAM.  From that point on performance should sink consistently?  What am I missing?
Former Member
0 Kudos

Hi

This looks like something we are experiencing. 

But when running 9 bcp-in commands, with CPU almost at 100% the whole time, I find only two instances of the function calls you mention. 

See attached file.

Regards

Pål

Former Member
0 Kudos

Hi

We see perhaps a small decrease in speed when creating indexes, but not at all the same decrease as with bcp in.

Also, if we never create the clustered indexes, speed is fine all the way, with all databases.  It is only when clustered indexes are created and dropped between bcp-in commands that we get the slow speeds.

Former Member
0 Kudos

We have done some testing with sp_placeobject.

We find that when we place the table (and the text/image index for that table!) on another segment on another device, we get much better performance.

But the performance is still not as good as before we created and dropped the clustered indexes.

Why is this? 

Former Member
0 Kudos

That shouldn't be happening.  You can collect monSysWaits for both and compare.  I do a lot of bcp of late (fast and slow); the speed should be consistent.

Former Member
0 Kudos

Hi again

I have been away, so sorry for late reply.

What we have found since last time is that it looks like this has something to do with clustered indexes.  After we drop and create the clustered indexes, bcp in starts slowing down.

So my question is: what kind of side effect does it have to create and drop clustered indexes?

It looks like ASE is still trying to put the rows in "right" order even though the clustered index is dropped...

Regards

Pål