From: ajstorm on 23 Jun 2010 10:43

Mark,

Rather than continue what I'm sure could be an endless debate, let me try to bring this discussion to a close. Here's what we've heard so far:

- At your company you hit some very specific problems with STMM which occur only on Linux and/or when running a large number of instances/databases on the same physical machine.
- You agree that these problems have been fixed in more recent fixpacks.
- Several other customers have chimed in to report that STMM has been beneficial in their environments.

While I'm not trying to minimize the problems you encountered, it's always helpful to have some perspective. It's true that there were some problems with some of the initial versions of STMM (something which I have never attempted to deny). That being said, if you look at these problems in the context of the number of customers that are using STMM, the issues have only affected a handful of customers running in some very specific environments. It's unfortunate that you happened to be one of those customers and that it affected your business so much.

In all my discussions with customers I can say that the overwhelming majority of those who have tried STMM have been very happy with it. I am aware of that DB2Night Show poll (I have appeared on the DB2Night Show myself), but am not sure what to make of it. It's been my experience that many customers who claim to have had "issues" running STMM have either not been using it correctly (enabled only one or two memory consumers) or have expected it to achieve an ideal memory configuration immediately (when in fact STMM works to achieve an ideal configuration over time). Without more detailed information, it would be incorrect to use those numbers to support your argument that a significant number of people have experienced severe issues as a result of STMM.

If you feel that it's easy to tune the database memory on your systems, then you are correct to disable STMM. You'll get consistent performance (even if it is likely sub-optimal) and you won't have to worry about hitting some rare STMM problem. Let me reiterate, though, that if you find memory configuration easy, you are in the minority. I know from speaking with hundreds of customers that most of them find memory configuration challenging, and for them, STMM is welcome relief.

I'm still looking into the specific issues that you encountered and will be in touch via email with any information I can dig up.

Thanks,
Adam
From: Mark A on 23 Jun 2010 16:43

"ajstorm(a)ca.ibm.com" <ajstorm(a)gmail.com> wrote in message news:9c32158e-f666-4c2b-b337-9fc947257a22(a)k39g2000yqb.googlegroups.com...
> Mark,
>
> Rather than continue what I'm sure could be an endless debate, let me
> try to bring this discussion to a close.
> [snip]
>
> I'm still looking into the specific issues that you encountered and
> will be in touch via email with any information I can dig up.
>
> Thanks,
> Adam

I do not know if the problems with STMM have been fixed in the latest fixpacks. I said that it is a possibility they have been fixed, but I have not tried STMM with anything later than 9.5.4, which had a lot of problems. I try not to jump to conclusions about what I have not personally tested, but it seems like you are misquoting me. Several IBMers told me it doesn't work well in 9.5 but works better in 9.7, so I am not sure what exactly that means as to whether the problems are fixed in 9.5.5.

I also don't know if the problems with STMM are isolated to Linux. I said that is a possibility (since I have only used STMM with Linux). Again, you misquoted me. Based on the survey done on DB2Night (during the same session) to poll which OSs are being used for DB2, it seems likely that the STMM problems also affect other OSs (AIX, etc.), since Linux is not widely enough used to account for all the negative STMM responses.

I don't believe that your statement that STMM has affected only a "handful" of customers is anywhere close to accurate.
Here is just one person who had problems with STMM on DB2 for AIX: http://www.dbforums.com/db2/1646953-stmm-does-not-allocate-enough-sheapthres_shr.html

I don't think that tuning the DB CFG is particularly easy in DB2, but STMM is not easy either, especially when there are problems, and then the problems can be monumental. I noticed that you think the reason many customers complain about STMM (and give negative responses to DB2Night surveys about STMM) is that they are not using it correctly. If that is so, it just proves my point that STMM is actually much more complex than you have claimed. The DBAs who participate in the DB2Night.com surveys are typically not novice DBAs.

When DBAs say that they have problems with STMM, they usually don't mean that there is a slightly sub-optimal memory allocation. They mean that they are having serious DB2 memory problems that lead to denial of service or extremely poor response time. I don't think many would notice, or even care about, a "slightly" sub-optimal STMM so long as it was in the ballpark of acceptable performance and did not cause serious database availability problems.

Your assumption that my not using STMM would result in a sub-optimal memory configuration is false. The problems we have encountered with STMM are specifically that it frequently gives up memory it thinks is no longer needed at a particular moment, and then when it tries to re-acquire the memory a few seconds later when it is needed again, DB2 throws errors in the diagnostic log saying that DB2 cannot get the memory requested from the OS. This specifically happens on LOCKLIST memory and the sort memory heaps. When this happens, the memory heap in question is extremely low (because STMM shrank it previously), causing one or more of the following:

- lock escalation gets out of control when the locklist has insufficient memory, resulting in significant lock timeouts and deadlocks
- all sorts spill to temporary tablespaces instead of sorting in memory
- user bufferpools cannot be allocated, system bufferpools get 100% filled with dirty pages, and SQL statements start getting error messages
- new apps cannot connect because shared memory segments cannot be allocated
- CPU reaches 90%+ because the system is extremely slow, and application SQL requests get stacked up in the queue faster than DB2 can process them
- etc.

Given that we had severe problems with STMM, we decided to hard-code the following, which was quite easy and solved all our problems:

LOCKLIST 16384
MAXLOCKS 30
SHEAPTHRES_SHR 40000
SORTHEAP 8000
Bufferpools (a separate issue in our database, which has multiple bufferpools for a reason, but normally 50% of server memory for bufferpools is fine)

Given adequate information in the documentation, I think anyone could configure these even without my suggestions above (but there are no real recommendations). The numbers I provided above would work fine for 99% of databases. The sum total of the above values (not including bufferpools) is not even 250 MB, so one is hardly "wasting" any significant amount of memory by hard-coding them (and by hard-coding them one saves CPU time by not having STMM constantly trying to tune and adjust them).

I don't have any theoretical problem with DB2 managing memory for me (I am not trying to protect my job), or else I wouldn't have tried STMM in the first place (maybe with the exception of bufferpools, which is a totally separate issue if one has multiple bufferpools for very specific reasons). I am all for trying to make DB2 easier to configure and use, but I am very concerned when IBM tries to brow-beat people into using new features before they are completely debugged. Doing that only harms IBM, because one of the main advantages of DB2 over other database products (some more expensive, and some much less expensive) has been the reliability and support we get with DB2. If customers don't have that reliability level with DB2 anymore because of problems due to untested product features, then that is not in the best interests of IBM, regardless of what marketing strategy some at IBM have dreamed up to position the product as easy to use.
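[Editor's note: the "not even 250 MB" figure above can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes the listed parameters are counted in 4 KB pages, that SORTHEAP allocations for shared sorts are drawn from the SHEAPTHRES_SHR pool (so SORTHEAP is not added separately), and includes the PCKCACHESZ 4096 value that appears later in the thread.]

```shell
# Back-of-the-envelope check of the "not even 250 MB" claim.
# Assumption: these DB2 parameters are sized in 4 KB pages, and
# SORTHEAP allocations come out of the SHEAPTHRES_SHR pool.
PAGE_KB=4
LOCKLIST=16384         # pages
SHEAPTHRES_SHR=40000   # pages (covers the SORTHEAP allocations)
PCKCACHESZ=4096        # pages (value given later in the thread)

TOTAL_PAGES=$((LOCKLIST + SHEAPTHRES_SHR + PCKCACHESZ))
TOTAL_MB=$((TOTAL_PAGES * PAGE_KB / 1024))
echo "hard-coded heaps total: ${TOTAL_MB} MB"   # prints 236 MB
```

At roughly 236 MB, the hard-coded heaps are indeed well under 250 MB, consistent with the claim.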
From: Liam on 24 Jun 2010 09:54

On Jun 22, 6:13 pm, "Mark A" <no...(a)nowhere.com> wrote:
> "Serge Rielau" <srie...(a)ca.ibm.com> wrote in message
> news:88cl59Fv01U1(a)mid.individual.net...
>
> > Mark,
> > As you said yourself earlier, defaults change (or rather: should
> > change). 90% may have been better in the past and the team has
> > learned that 200% is better on average now. I don't think that 90%
> > is wrong or right. What we are talking about here are best
> > practices, and best practices change as products change and as
> > experience accumulates.
> >
> > I'm being told that not all the doc changes for the recent change
> > from 90% to 200% have been rolled out yet and the doc team is
> > working on it.
> >
> > But this is not a HIPER APAR. Your company will not go out of
> > business because you are still, obviously successfully, using 90%.
> >
> > Cheers
> > Serge
> > --
> > Serge Rielau
> > SQL Architect DB2 for LUW
> > IBM Toronto Lab
>
> You are wrong on several fronts.
>
> When IBM released 9.5 and 9.7, 90% was unequivocally specified as the
> value to use for the SHMALL Linux kernel parm. I provided proof of
> this in the PDF manuals (the 9.7 PDF manuals are not very old). 90%
> has apparently been discovered to not be appropriate, and now 200% is
> recommended (as of just a few months ago). This had nothing to do
> with DB2 code changes; it had to do with problems with STMM (I was
> told by an IBMer that this was particularly the case if a server had
> a lot of DB2 instances and was doing high-volume production). The
> recommendation is retroactive to 9.5.0.
>
> You are claiming that 90% is neither right nor wrong. But the 200%
> setting is now enforced in future fixpacks (already in 9.7.2 and
> maybe in 9.5.6), so that tells me that someone in IBM thinks 90% is
> wrong, although admittedly, not every customer is going to encounter
> a problem if their server has only one DB2 instance and is only
> moderately loaded or not using STMM.
>
> The only reason we are successfully using 90% is because we turned
> off STMM. So either the 90% is wrong, or STMM doesn't work correctly
> (or some combination of the two).
>
> My company suffered serious customer relations problems with several
> large customers because of STMM and memory problems in general, and I
> don't need you to tell me what happened, since you don't know
> anything about it. Sounds to me like you have been working on that
> Oracle compatibility code for so long that you are starting to sound
> just like Oracle. Maybe you should apply for a job with them.

It sounds like I may regret jumping in on this thread, but I'll just throw in some of the rationale behind this change in recommendation. I'll start with some background on how DB2 uses shared memory, and then get to the whole 90% vs 200% recommendations.

The SHMALL kernel tuneable on Linux has always been a hindrance to any programs that make extensive use of shared memory. My personal opinion is that Linux should remove this tuneable completely (along with SHMMAX), particularly since the total amount of shared memory "created" on the box doesn't have any real impact on the system unless that memory is actually consuming either RAM or swap space. Linux (and all other OSes) do a good job of virtualizing memory to processes, so it shouldn't really matter how much "virtual" memory is consumed by the sum of all shared memory segments created on the system, as long as the total of all RAM pages (including both committed shared memory pages and committed private memory pages) is less than the amount of memory on the box (or at least that all the swap space hasn't been consumed, but I'm sure we all agree that we don't want database servers to start swapping).

As for how DB2 comes into the picture, most of DB2's memory allocations are from shared memory regions.
As of 9.5 (when DB2 went threaded), it became much easier for DB2 to "grow" its shared memory regions, by simply creating new shared memory segments (all EDUs are just threads, so are implicitly connected to the new shared memory segment). Growing is just half the battle, though: STMM needs to be able to "shrink" DB2's memory footprint when it detects there is too little free memory left on the box. To accomplish this, we use APIs provided by the OS (all OSes DB2 runs on have their own flavour of APIs for this) to "decommit" portions of shared memory segments, meaning the OS will release any RAM + backing store consumed by those shared memory pages (thus increasing the amount of free RAM on the box for other programs to use). If STMM later decides there's enough free RAM and can grow again, we will first re-commit those regions, and if we need to grow more, will then allocate new shared memory regions again.

This is where I'm confused by one of your other comments saying that STMM cannot reclaim this memory. Re-committing memory on UNIX is as simple as touching those memory pages again; there is no OS API that needs to be called (the OS just faults those pages in on demand), so as long as STMM sees enough free memory on the system, we should have no issue reclaiming that memory. I'm sure Adam will contact me when he digs up the info on this particular issue :-) Note that Windows is slightly different - we need to issue an OS API to re-commit memory there, so in that case there is a chance that the OS will deny the request, but that should not happen on any UNIX platform.

Now for the SHMALL recommendation. We really wanted SHMALL to be set out of the way, but we were reluctant to recommend customers set that to a value greater than RAM. I would wager that most DBAs are probably not intimately familiar with how the various OSes implement their virtual memory managers, so it would seem odd to recommend setting SHMALL larger than RAM (i.e. doesn't that mean it will cause paging?!). So, the recommendation was to set this to a value that is sufficiently large that a single DB2 instance can use most of the memory on the box with no issues - the assumption being that the OS, file cache, etc. will use up at least 10% of free memory on the box, so total committed memory by DB2 should never be greater than 90% of RAM.

This recommendation was fine prior to 9.5, and should still be acceptable in most 9.5+ systems if STMM is not enabled. Once STMM is enabled, however, we start monitoring free memory and will start shrinking our committed memory footprint (RAM) when needed, yet our shared memory segment footprint that is accounted for by SHMALL stays the same. Again, with a single DB2 instance on the box, the original 90% recommendation should still be fine, since we will favor re-committing that memory prior to creating new segments, so our total shared memory footprint should stay below the 90% SHMALL limit.

However, this SHMALL recommendation breaks down as soon as there are multiple instances. Consider a simple scenario where one instance starts up with STMM enabled, and that instance sees plenty of free memory, so grows DB2's shared memory regions to account for, say, 75% of memory on the box. Now a new DB2 instance starts up, again with STMM enabled. STMM will see this new memory pressure on the box and start releasing RAM, but will still consume 75% of the SHMALL limit. If both instances have a fairly "equal" need for memory, the new instance would try to grow its shared memory to account for 37.5% of memory on the box; however, since the first instance is already consuming 75% of the SHMALL limit, the second instance cannot grow that much, and will be limited to at most 15% of RAM on the box due to SHMALL.

This is the main reason why the new 200% recommendation is coming in - we have to bite the bullet and recommend setting SHMALL larger than RAM, so that customers with more than one instance on the box, where those instances' memory consumption is controlled primarily by STMM (so will grow and shrink), are less likely to hit this limit. So, although far from perfect, the new 200% recommendation is to help ensure that a larger group of customers are not affected by how SHMALL interacts with STMM. The unfortunate part is that now that we recommend setting SHMALL larger than RAM, it raises more questions, so we now have to try to explain how SHMALL interacts with RAM on the box - details that most DBAs should not need to care about (as long as we're not causing paging, of course!).

From what I've heard, this same recommendation will be applied to all 9.5+ versions of the docs; it will just take some time to get those docs updated.

Hope this helps clarify this situation....

Cheers,
Liam.
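[Editor's note: Liam's two-instance scenario can be worked through numerically. This is an illustrative sketch in percent-of-RAM units, using the 75%, 37.5%, and 15% figures from the scenario above; the variable names are the editor's, not DB2's.]

```shell
# The two-instance SHMALL squeeze, in percent-of-RAM units.
RAM=100
INST1_SEGMENTS=75                 # first instance's shared memory segments
FAIR_SHARE=$((RAM * 375 / 1000))  # each instance's "equal" share: 37 (37.5 truncated)

OLD_SHMALL=$((RAM * 90 / 100))    # old recommendation: 90% of RAM
NEW_SHMALL=$((RAM * 200 / 100))   # new recommendation: 200% of RAM

# Room left under SHMALL for the second instance's segments:
OLD_HEADROOM=$((OLD_SHMALL - INST1_SEGMENTS))  # 15 -> squeezed well below fair share
NEW_HEADROOM=$((NEW_SHMALL - INST1_SEGMENTS))  # 125 -> SHMALL no longer the bottleneck
echo "second instance headroom: old=${OLD_HEADROOM}% new=${NEW_HEADROOM}%"
```

Under the 90% setting the second instance is capped at 15% of RAM, far below its 37.5% fair share; at 200% the cap moves out of the way entirely.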
From: Mark A on 24 Jun 2010 22:12

"Liam" <lemonfinnie(a)gmail.com> wrote in message news:eed86f59-bf21-4530-9449-918628153378(a)i28g2000yqa.googlegroups.com...
> It sounds like I may regret jumping in on this thread, but I'll just
> throw in some of the rationale behind this change in
> recommendation.... I'll start with some background on how DB2 uses
> shared memory, and then get to the whole 90% vs 200% recommendations.

I don't have a problem if IBM wants to change the recommendation. I don't really want to know "why"; I just want to know what needs to be done to properly configure DB2. If DBAs are not knowledgeable enough to hard-code 4-5 database config parameters, then surely they are not knowledgeable about Linux kernel parms and are looking for IBM to tell them what to do (or, better yet, for DB2 to configure the parms automatically). DBAs usually don't have root authority to change the parms themselves anyway, since that is up to the OS admins.

> This is where I'm confused by one of your other
> comments saying that STMM cannot reclaim this memory.... re-committing
> memory on UNIX is as simple as touching those memory pages again,
> there is no OS API that needs to be called (the OS just faults those
> pages in on demand), so as long as STMM sees enough free memory on the
> system, we should have no issue reclaiming that memory.
> [snip]

We observed that about 1-2 weeks after rebooting the server (for the reboot, we used HADR takeovers so that service was not interrupted), DB2 could no longer acquire the memory it needed (especially for LOCKLIST and the sort heaps). DB2 was throwing numerous errors in the DB2 diagnostic log stating that memory allocation errors had occurred trying to acquire memory for these heaps. Unfortunately, I don't have the log in question (I am not the principal DBA for the application that had this problem) and it looks like the logs from 6-8 weeks ago have rolled off. But it was clear that DB2 would give up the memory and not be able to get it back when needed, for both LOCKLIST and the sort heaps. This was on a server with 64 GB of memory, bufferpools hard-coded at about 10 GB total for all databases, all other parms set to the db and dbm defaults, and STMM on.

> Now for the SHMALL recommendation.... We really wanted SHMALL to be
> set out of the way, but we were reluctant to recommend customers set
> that to a value greater than RAM.
> [snip]
> This is the main reason why the new 200% recommendation is
> coming in - we have to bite the bullet and recommend setting SHMALL
> larger than RAM so that customers with more than one instance on the
> box, where those instances' memory consumption is controlled primarily
> by STMM (so will grow and shrink), are less likely to hit this limit.

This sounds reasonable to me, and I appreciate the detailed explanation, but I am not (and don't want to be) an OS memory expert. Neither are most DBAs. That's why STMM sounded so attractive to begin with (so we don't have to know anything about DB2 memory heaps and OS memory kernel parms). I think the vast majority of customers want IBM to tell DB2 customers how to configure DB2 properly, with or without STMM, or better yet have DB2 set these values automatically.

Here is a summary of the situation from my perspective as a customer:

1. DB2 9.5 and later has STMM enabled by default.

2. The original recommendation for SHMALL was 90% of server memory. This is documented in the PDF manuals I mentioned in a previous post for both 9.5 and 9.7 (so the 90% recommendation stood until fairly recently, since it was in the 9.7 PDF doc).

3. Customers who have STMM on by default, have a lot of instances with high-transaction-rate applications (needing locklist and sort memory), and had SHMALL at the originally recommended 90% of server memory are susceptible to serious database memory problems. I have confirmed by searching various forums that this has happened on both Linux and AIX.

4. Because of problems using the 90% value with STMM (with multiple DB2 instances), IBM is now recommending 200% of server memory for SHMALL. IBM is so confident about the need to use 200% that apparently in 9.7.2 (but not before then) DB2 sets SHMALL to 200% automatically (that is what the InfoCenter doc says, although I still need to verify that this happens in 9.7.2).

5. The 90% value is not a problem if one has STMM turned off, or has it turned on with a small number of instances. But STMM is turned on by default.

6. I was admonished earlier in this thread for suggesting that the change made in the SHMALL recommendation (90% to 200%) in the online InfoCenter should have been included in a HIPER APAR to make sure all customers knew about the potential problem and solution. It seems to me that IBM has been trying to hide (or minimize) the problems with STMM, for whatever reasons. I am very disappointed that IBM did not communicate this known problem sooner to its customers, because it had a big impact on my company.

7. I mentioned in a previous post that it is possible that IBM has now fixed the problems with STMM (especially if one knows the recommendation for SHMALL has changed). I don't know one way or the other. But since IBM has not exactly been candid about the past problems, I am not going to automatically assume that everything is now fixed. I have heard from some IBMers that STMM works better in 9.7 than in 9.5. Others can decide for themselves how much risk vs. reward there is in using STMM at this time.

8. The db and dbm configs are still fairly complex, and they have gotten much more complex with STMM, IMO. There are many things that a DB2 DBA can set in the db and dbm config, but not many actually know that STMM only controls these:

LOCKLIST
PCKCACHESZ
SHEAPTHRES_SHR
SORTHEAP
All buffer pools

9. If a DBA hard-coded these as follows, it would work fine for 99%+ of databases and one would not encounter the memory problems with STMM discussed above. The total memory for these parms is about 250 MB (not counting bufferpools) even when they are hard-coded (they could actually grow much higher if STMM were enabled):

LOCKLIST 16384
MAXLOCKS 30 [%]
SHEAPTHRES_SHR 40000
SORTHEAP 8000
PCKCACHESZ 4096
Bufferpools (set the total of all bufferpools in all databases to about 50% of server memory, or the size of the database, whichever is less)

That's all there is to it. It's not that complicated, and you don't need an OS admin with root, and you don't need to be an OS expert or understand how DB2 shared memory works. All the other parms in the database can be set to the defaults, and you can even leave STMM on if you want (but it may not be doing anything unless two or more of the above are set to automatic--but don't quote me on this exactly, because STMM is so complex that very few actually understand it).

10. If IBM had published the above recommendations (or something similar) in the DB2 reference manuals (or made them the defaults--except for bufferpools), IBM would not have had to spend millions to develop STMM. Bufferpools could be configured some other way, other than dynamically changing them via STMM.
> So, although far from perfect, the new 200% recommendation is to help
> ensure that a larger group of customers are not affected by how SHMALL
> interacts with STMM.
> [snip]
>
> Hope this helps clarify this situation....
>
> Cheers,
> Liam.

Liam, thanks for the detailed explanation. I will pass this on to our OS admins so they can set the kernel parms correctly. Your explanation may help if they baulk at the latest 200% recommendation in the InfoCenter. For some of my apps, I cannot migrate to 9.7 anytime soon, so we have to change the kernel parms in Linux for DB2 9.5, but I am still hoping that SHMALL is set automatically by DB2 in 9.5.6 when it is available.
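[Editor's note: for readers who need to hand the new recommendation to an OS admin, here is a sketch of how the 200% figure translates into an actual kernel setting. SHMALL is counted in system pages, not bytes; the 64 GB box size, the 4096-byte page size, and the sysctl workflow shown are illustrative assumptions, not values from the thread.]

```shell
# 200% of RAM expressed as a SHMALL value, for a hypothetical 64 GB box.
RAM_BYTES=$((64 * 1024 * 1024 * 1024))
PAGE_SIZE=4096                          # verify with: getconf PAGE_SIZE
SHMALL=$((2 * RAM_BYTES / PAGE_SIZE))   # 200% of RAM, in pages
echo "kernel.shmall = ${SHMALL}"        # prints kernel.shmall = 33554432

# An OS admin with root would then typically persist and apply it:
#   echo "kernel.shmall = ${SHMALL}" >> /etc/sysctl.conf
#   sysctl -p
```

Note that getting the units wrong is a common trap: SHMMAX is specified in bytes, while SHMALL is specified in pages.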
From: ajstorm on 25 Jun 2010 08:19
Mark,

I think that is a valuable summary. While I disagree with your ninth point, you're certainly entitled to your opinion. My experience with hundreds of DB2 customers tells me that hard-coded values will not work in 99% of cases, which is why we don't document any hard-coded recommendations. Instead, we've tried hard with STMM to create a feature that works well for 100% of customers.

Were we 100% successful with our first attempt? Has any software product ever been released entirely problem-free? No. This is why we're using subsequent fixpacks to fix all issues that have been reported by customers such as yourself. Is STMM in the latest 9.5 and 9.7 fixpacks perfect? It would be foolish for me to claim that it is. That being said, the problems you mention above have been fixed, and should no longer be a cause for concern.

From its inception, STMM has provided value for a great number of customers, and it will continue to do so. Yes, there are/were some "gotchas" when running STMM in its first few releases. We're hopeful, however, that most of the serious problems have been resolved.

Thanks,
Adam