From: raja on 12 Dec 2008 07:39 Hi Charles/ Steeve, Query 1 ( Query given by charles ) : select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'), dhse.instance_number, time_waited_micro - nvl(lag(time_waited_micro, 1, 0) over (partition by dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as time_waited, total_waits - nvl(lag(total_waits, 1, 0) over (partition by dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as total_waits from dba_hist_snapshot dhs, dba_hist_system_event dhse where dhs.snap_id = dhse.snap_id and dhs.instance_number = dhse.instance_number order by 1, 2; The output of your query is present in excel file named as : wait1 Query 2 ( Query given by steeve ) : select * from ( select event_name, wait_class,to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'), dhse.instance_number, time_waited_micro - lag(time_waited_micro, 1, 0) over (order by dhse.instance_number, dhse.snap_id) as time_waited, total_waits - lag(total_waits, 1, 0) over (order by dhse.instance_number, dhse.snap_id) as total_waits from dba_hist_snapshot dhs, dba_hist_system_event dhse where dhs.snap_id = dhse.snap_id and dhs.instance_number = dhse.instance_number and begin_interval_time > sysdate - num_of_days_to_analyze order by 5 desc) where rownum < 10000; The output of your query is present in excel file named as : wait2 Here, the wait event : "PX qref latch" tops the list Since i am not able to post those files in our group's thread, i am sending those files to ur mail ids. Please check these 2 files and tell me your observations. Also, another doubt is : whether should be neglect the idle events always ? then why are they notified to us... With Regards, Raja.
From: Charles Hooper on 12 Dec 2008 08:08 On Dec 12, 7:39 am, raja <dextersu...(a)gmail.com> wrote: > Hi Charles/ Steeve, > > Query 1 ( Query given by charles ) : > select > to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'), > dhse.instance_number, > time_waited_micro - nvl(lag(time_waited_micro, 1, 0) over (partition > by dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as > time_waited, > total_waits - nvl(lag(total_waits, 1, 0) over (partition by > dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as > total_waits > from dba_hist_snapshot dhs, > dba_hist_system_event dhse > where dhs.snap_id = dhse.snap_id > and dhs.instance_number = dhse.instance_number > order by 1, 2; > > Since i am not able to post those files in our group's thread, i am > sending those files to ur mail ids. > Please check these 2 files and tell me your observations. > Also, another doubt is : whether should be neglect the idle events > always ? then why are they notified to us... > > With Regards, > Raja. Raja, Now that I am able to see the output of the query that I suggested that you run, I see that I did not tell you to select the dhse.event_name column, so I am not able to see the wait event names associated with the various waits, also the sort order probably needs to be improved. The fixed query should look something like this: select dhse.event_name, to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'), dhse.instance_number, time_waited_micro - nvl(lag(time_waited_micro, 1, 0) over (partition by dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as time_waited, total_waits - nvl(lag(total_waits, 1, 0) over (partition by dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as total_waits from dba_hist_snapshot dhs, dba_hist_system_event dhse where dhs.snap_id = dhse.snap_id and dhs.instance_number = dhse.instance_number order by 2, 4; The idle wait events do have value, as they tell you to what degree the database server is not causing performance problems (the user took a 2 hour nap, resulting in a SQL*Net Message from Client exceeding 2 hours). Idle waits have a limited value in Statspack and AWR reports (for instance "SQL*Net more data to client" shows 4,642,420 waits, possibly indicating that the clients are requesting a lot of data to be returned which will not fit into a single SDU size per request). However, the idle wait events are very important wait events when examining 10046 traces for sessions, as a 10046 trace is time scoped for a single session. Charles Hooper IT Manager/Oracle DBA K&M Machine-Fabricating, Inc.
From: raja on 19 Dec 2008 06:17 Hi, Kindly check the attached excel sheet. Since i am not able to post here, i am sending the output to your mail id. Query : select dhse.event_name, to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'), dhse.instance_number, time_waited_micro - nvl(lag(time_waited_micro, 1, 0) over (partition by dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as time_waited, total_waits - nvl(lag(total_waits, 1, 0) over (partition by dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as total_waits from dba_hist_snapshot dhs, dba_hist_system_event dhse where dhs.snap_id = dhse.snap_id and dhs.instance_number = dhse.instance_number order by 2, 4; Kindly provide me your suggestions. With Regards, Raja.
From: Charles Hooper on 19 Dec 2008 10:38 On Dec 19, 6:17 am, raja <dextersu...(a)gmail.com> wrote: > Hi, > > Kindly check the attached excel sheet. > Since i am not able to post here, i am sending the output to your mail > id. > > Query : > select > dhse.event_name, > to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'), > dhse.instance_number, > time_waited_micro - nvl(lag(time_waited_micro, 1, 0) over (partition > by dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as > time_waited, > total_waits - nvl(lag(total_waits, 1, 0) over (partition by > dhse.instance_number, dhse.event_name order by dhse.snap_id),0) as > total_waits > from dba_hist_snapshot dhs, dba_hist_system_event dhse > where dhs.snap_id = dhse.snap_id > and dhs.instance_number = dhse.instance_number > order by 2, 4; > > Kindly provide me your suggestions. > > With Regards, > Raja. As an experiment, I examined four time intervals for a single day, and posted those numbers below. "Wait/Sec" is a calculation where I divided the original TIME_WAITED value by 1,000,000 so that the wait time was stated in seconds, and then divided that value by 3,600 (there are 3,600 seconds in an hour, as the totals were over the course of an hour). The number calculated is the number of seconds waited per second, accumulated for all sessions/processes waiting on that event (that would explain the value of 9.52 below). Between 2008_12_16 02:00 and 2008_12_16 03:00 EVENT_NAME TOTAL_WAITS Wait/Sec jobq slave wait 1,394 1.05 db file sequential read 469,456 0.59 SQL*Net break/reset to client 9,236,797 0.35 SQL*Net more data from dblink 220,844 0.28 db file parallel write 260,297 0.17 Backup: sbtwrite2 36,033 0.10 log file parallel write 8,771 0.06 log file sequential read 6,602 0.02 single-task message 489 0.01 log file sync 2,238 0.01 Log archive I/O 7,783 0.00 RMAN backup & recovery I/O 5,302 0.00 Between 2008_12_16 07:00 and 2008_12_16 08:00 EVENT_NAME TOTAL_WAITS Wait/Sec PX Deq: Execution Msg 97,375 9.52 PX Idle Wait 14,453 6.52 PX Deq Credit: send blkd 3,162,989 5.72 PX Deq: Table Q Normal 17,889,617 4.82 jobq slave wait 4,112 3.20 db file scattered read 435,471 1.61 db file sequential read 881,400 0.69 PX Deq: Execute Reply 17,145 0.58 SQL*Net break/reset to client 7,065,596 0.37 direct path write temp 10,411 0.33 read by other session 705,527 0.26 Backup: sbtwrite2 40,263 0.24 PX Deq Credit: need buffer 287,494 0.22 direct path read 1,363,276 0.16 db file parallel write 20,396 0.13 PX qref latch 204,808,981 0.10 direct path read temp 36,759 0.07 log file parallel write 9,155 0.04 single-task message 516 0.01 log file sequential read 3,559 0.01 log file sync 3,387 0.01 optimizer stats update retry 130 0.01 PX Deq: Msg Fragment 21,945 0.01 enq: TX - row lock contention 233 0.00 enq: MS - contention 3 0.00 Between 2008_12_16 14:00 and 2008_12_16 15:00 EVENT_NAME TOTAL_WAITS Wait/Sec jobq slave wait 3,792 2.95 db file sequential read 1,044,182 0.49 SQL*Net break/reset to client 9,797,440 0.33 Backup: sbtwrite2 13,535 0.04 db file scattered read 152,387 0.03 read by other session 30,474 0.02 single-task message 516 0.01 db file parallel write 3,351 0.00 log file parallel write 5,155 0.00 log file sync 2,678 0.00 Between 2008_12_16 19:00 and 2008_12_16 20:00 EVENT_NAME TOTAL_WAITS Wait/Sec Backup: sbtwrite2 352,599 3.74 jobq slave wait 3,709 2.89 SQL*Net break/reset to client 8,473,325 0.36 db file sequential read 161,017 0.21 RMAN backup & recovery I/O 13,208 0.11 log file parallel write 7,232 0.03 db file parallel write 13,223 0.03 single-task message 511 0.01 log file sequential read 3,066 0.01 log file sync 2,695 0.00 read by other session 972 0.00 The above may contain some "idle" wait events. What is possibly interesting is the consistently high number of waits on the event "SQL*Net break/reset to client", indicating that the client is submitting something, possibly a badly formed SQL statement, and the database instance is rejecting the message sent by the client. The time consumed seems to be a faily consistent 1/3 of a second per second in all four of the time intervals. You might check the per session statistics to see which sessions are involved. I will delay my analysis of the above until you have had an opportunity to research and report your findings regarding what the above may be showing. I would encourage others viewing this thread to offer suggestions or an analysis of the above in the context of this thread. Charles Hooper IT Manager/Oracle DBA K&M Machine-Fabricating, Inc.
From: raja on 23 Dec 2008 11:21
Hi, Thanks for guiding me Charles and Steve. From the wait list given by Charles, i have tried to look into the wait event descriptions and made my following observations. In general the wait list has been split into 4 parts : Part 1 -> 2am to 3am ; Part 2 -> 7am to 8am ; Part 3 -> 2pm to 3pm ; Part 4 -> 7pm to 8pm First let us Neglect following the idle events from the wait list : jobq slave wait PX Deq Credit: need buffer PX Deq: Execute Reply PX Deq: Execution Msg PX Deq: Msg Fragment PX Deq: Table Q Normal PX Idle Wait single-task message In general - about the application analysis - the following 2 observations are : a. if we see the no.of wait events, it looks like they are at : Part 2 -> 7am to 8am, which should be the peak hour of the transaction. b. the jobs, backup and recovery, archiving should be run when it is in less peak hour : Part 1 -> 2am to 3am, since there are the following events : Backup: sbtwrite2 - ( Administrative - Wait class ) Log archive I/O RMAN backup & recovery I/O. With respect to wait events, the observations are as follows : c. The following wait events are due to parallel process - it is like running parallel queries, using parallel hints. PX Deq Credit: send blkd PX qref latch solution to solve this problem is (from web) : increasing the value of parallel_execution_message_size parameter. But I feel that reducing the usage of parallel hints will reduce this problem. but this would either cause not the efficient usage of the CPU or not finishing the query execution faster !!! d. read by other session : In previous versions this wait was classified under the "buffer busy waits" event. However, in Oracle 10.1 and higher this wait time is now broken out into the "read by other session" wait event. Excessive waits for this event are typically due to several processes repeatedly reading the same blocks, e.g. many sessions scanning the same index or performing full table scans on the same table. solution to solve this problem is (from web) : ->buffer busy waits / read by other session: Increase DB_CACHE_SIZE (DB_BLOCK_BUFFERS prior to 9i)/ Analyze contention from SYS.V$BH But going through the link : http://www.confio.com/English/Tips/Read_By_Other_Session.php, i feel that tuning the queries would be the option to reduce this wait. Have to check for hot blocks/objects too ? e. In Oracle, we see two types of data block access - db file sequential read and db file scattered read : ->db file sequential read A single-block read (i.e., index fetch by ROWID). A large number of waits here could indicate poor joining orders of tables, or unselective indexing. It is normal for this number to be large for a high-transaction, well-tuned system, but it can indicate problems in some circumstances. They can also show up as direct path read/write waits. ->db file scattered read A multiblock read (a full-table scan, OPQ, sorting) A large number here indicates that your table may have missing or suppressed indexes. Although it may be more efficient in your situation to perform a full table scan than an index scan, check to ensure that full table scans are necessary when you see these waits. Try to cache small tables to avoid reading them in over and over again solution to solve this problem is (from web) : increase the RAM size of the db_cache_size to reduce disk I/O, tune the SQL to reduce disk I/ O, or invest in a faster disk I/O subsystem. ->log file sequential read - indicates that the process is waiting for blocks to be read from the online redo log into memory. This primarily occurs at instance startup and when the ARCH process archives filled online redo logs. ->db file parallel write - occurs when the process, typically DBWR, has issued multiple I/O requests in parallel to write dirty blocks from the buffer cache to disk, and is waiting for all requests to complete. ( dont know much - have to check regarding DBWR ) Looks like this is mostly due to i/o issue. The direct path read, direct path write wait events are also associated with the data block access (db file sequential read and db file scattered read ) - dont know about these wait events, have to check regarding these. Info : Starting in Oracle 10g release2, Oracle recommends not setting the db_file_multiblock_read_count parameter. solution to solve these problem are (from web) : ->db file sequential read: Tune SQL to do less I/O. Make sure all objects are analyzed. Redistribute I/O across disks. So, in general to reduce the following waits, we have to tune the sql queries (checking the indexes and joins) may give improvement in performance : db file sequential read, db file scattered read, log file sequential read, db file parallel write, direct path read, and direct path write f. The following 3 wait events comes together, since they are related to LGWR/redo log ( dont know much - have to check regarding LGWR/redo log ) ->log file sequential read - LGWR background process waits for this event while it is copying redo records from the memory Log Buffer cache to the current redo group's member logfiles on disk. ->log file sync - User session waits on this wait event while waiting for LGWR to post after commit write of dirty blocks ->log file parallel write - Waiting for the writes of redo records to the redo log files solution to solve these problem are (from web) : ->log file sync: If this event is in the top 5, you are committing too often (talk to your developers) ->log file parallel write: deals with flushing out the redo log buffer to disk. Your disks may be too slow or you have an I/O bottleneck. Conclusion from the above wait event list given by you : More waits are seen (db file sequential read, db file scattered read ), which are due to i/o's. To reduce these wait events, have to tune the sql ( check index and joins). Thanks. I have learnt something about the wait events. Please check whether the above observations are correct ? Also, Please provide your suggestions too... With Regards, Raja. |