library cache load lock - major problems [Oracle]

Prev: ORA-1113 signalled during: alter database open
Next: Oracle 10 RAC to 10G RAC standby

From: norman.dunbar on 29 Jun 2005 11:12

Morning all (or afternoon - I'm behind another firewall and using
Google !)

Oracle enterprise : 8.1.7.3 (there's nothing I can do about this I'm
afraid !)
HP-UX 11.11
8 CPUs and load average is 0.5 over 45 days.

Sessions hang on library cache load lock, all are executing the same
sql statement which makes a call to UTL_TCP.flush followed by
utl_tcp.close_connectioon (see below). This is intermittants but very
irritating for my users.

I've Googled these groups and Metalinked till I'm blue in the face -
nothing. :o(

Gruesome details.

User logs into an application - it sometimes hangs.
User calls me up to advise me of the hang, in case I don't already
know.
I log into application - it hangs for me too. (I'm on a different
network etc).
I use SQLPlus and get an instant connection - my session, and that of
my user(s) stay hung. Hence problem must be in the application.

My script to interrogate V$SESSION_WAIT & V$SESSION shows the following
:

SQL> @sessionwait

SID USERNAME OSUSER SQL_HASH_VALUE EVENT

------------ --------------- --------------- --------------
-----------------------
29 WV20 www2 1359669396 library
cache load lock
68 WV20 landrews 1359669396 library
cache load lock
70 WV20 www2 1359669396 library
cache load lock
79 WV20 www2 1359669396 library
cache load lock
90 WV20 www2 1359669396 library
cache load lock
114 WV20 www2 1359669396 library
cache load lock

6 rows selected.

www2 is a web interface (written in Java) to the application via
Apache.

The application also seems to login three sessions to the database for
every actual execution of the application. Hmm.

Everyone is waiting on the same sql_hash_value. This is a simple
statement :

SQL> @sqlbyhash_8i

Enter value for hash_value: 1359669396

SQL_TEXT
----------------------------------------------------------------
begin -- if there are errors, there's nothing we can do :( b
egin fdi_norm.Tidy; exception when others then null;
end; end;

Now, I've tracked the above code to one place, in a logoff trigger for
a specific schema that we are not (directly) logging into - it is
possible that the application is logging into it in the background
though. The comments helped in tracing the code !

The code being executed, FDI_NORM.TIDY, can and is executed from
elsewhere though. I've never caught it hanging when called from
elsewhere, it's always from the above statement which is in the
trigger.

The actual code being executed from the trigger looks like this :

procedure Tidy is
begin
utl_tcp.flush(c);
utl_tcp.close_connection(c);
Connected := false;
SessionID := -1;
exception
when OTHERS then null;
end Tidy;

I have both FDI_NORM and UTL_TCP pinned - it makes no difference. (By
the way, FDI_NORM has nothing to do with me, only the names are the
same - this code is from a third party application.)

Now I've tested this procedure in isolation when the system is running
fine - it takes 2.5 seconds on the first run and zero seconds on all
subsequent runs during this session.

When the system is hanging, it takes around 4 minutes to run and times
out. The connection is to another server here and that server is
running happily as is the network - we've had it checked by out
networking guys & gals.

Metalink note 122793.1 - How To Find The Session Holding A Library
Cache Lock - has two scripts to list holders and waiters which select
from SYS.X$KGLLK. When the system is hung, these scripts return no rows
selected for holders and waiters. SessionWait still shows waiters on
L.C.L.L.

DBA_LOCK_INTERNAL shows that each waiting session has the following :

MODE_HELD = Share and MODE_REQUESTED = None for one session, plus two
entries with MODE_HELD = Null and MODE_REQUESTED = None. All have the
same SESSION_ID but LOCK_ID1 is different and LOCK_ID2 is null for
each. The lock_type is a 'Cursor definition lock' for the two Null
mode_held entries and 'Cursor Definition pin' for the 'share' entry.

One of the waiters in V$SessionWait is a background task (no OSUSER in
V$SESSION) and has the name 'SNP0', description 'Job Queue Process 0'
in V$BGPROCESS.

Shared pool is 250 Mb. There is an extract of the statspack report at
the end of this posting, but I have a few worries :

Response times (as analysed by oraperf.com) were 99.93% wait time and
0.07% cpu.

L.C.L.L was responsible for 98.69% of the total wait time.

63,432,885 gets on the Library Cache latches was requested in 75
seconds. Interesting. This was 99.19% of the latch free waits. Of
these, 1,723 slept. The latch sleeps were mostly on kglic (690),
kglpnal: child:before pro (308), kglpnal: child: alloc spac (211) and
kglhdgn:child (156). The rest dwindled from 78 sleeps down to 1.

Session_cached_cursors is not set.

Anyone got any ideas on tracking this down ?
Please & thanks in advance.

Oh, by the way, the application *does* use bind variables :o)

Cheers,
Norm.

Here's the SP report stuff :

A statspack report for 75 seconds, which is a 'just before' the lockup
to 'just after', shows the following (extracts) :

Snap Id Snap Time Sessions
------- ------------------ --------
Begin Snap: 16114 29-Jun-05 14:40:25 5,303
End Snap: 16116 29-Jun-05 14:41:40 5,303
Elapsed: 1.25 (mins)

Load Profile
~~~~~~~~~~~~ Per Second Per
Transaction
---------------
---------------
Redo size: 11,115.84
21,376.62
Logical reads: 1,961.79
3,772.67
Block changes: 18.31
35.21
Physical reads: 4.20
8.08
Physical writes: 3.76
7.23
User calls: 6.59
12.67
Parses: 17.05
32.79
Hard parses: 1.63
3.13
Sorts: 64.03
123.13
Logons: 0.16
0.31
Executes: 258.83
497.74
Transactions: 0.52

% Blocks changed per Read: 0.93 Recursive Call %: 99.19
Rollback per transaction %: 2.56 Rows per Sort: 6.89

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 100.00 Redo NoWait %: 100.00
Buffer Hit %: 99.79 In-memory Sort %: 99.98
Library Hit %: 96.80 Soft Parse %: 90.46
Execute to Parse %: 93.41 Latch Hit %: 99.99
Parse CPU to Parse Elapsd %: 80.41 % Non-Parse CPU: 100.00

Shared Pool Statistics Begin End
------ ------
Memory Usage %: 84.47 91.13
% SQL with executions>1: 91.79 89.32
% Memory for SQL w/exec>1: 89.60 94.41

(Hey, I must have a great performing database, my buffer hit ratio is
99.79% !!!)

Top 5 Wait Events
~~~~~~~~~~~~~~~~~ Wait
% Total
Event Waits Time (cs)
Wt Time
-------------------------------------------- ------------ ------------
-------
library cache load lock 268 30,205
98.29
db file sequential read 92 175
.57
log file parallel write 75 129
.42
enqueue 1 81
.26
latch free 1,764 54
.18

Wait Events for DB: OLIBPR Instance: olibpr Snaps: 16114 -16116
-> cs - centisecond - 100th of a second
-> ms - millisecond - 1000th of a second
-> ordered by wait time desc, waits desc (idle events last)

Avg
Total Wait wait
Waits
Event Waits Timeouts Time (cs) (ms)
/txn
---------------------------- ------------ ---------- ----------- ------
------
library cache load lock 268 87 30,205 1127
6.9
db file sequential read 92 0 175 19
2.4
log file parallel write 75 0 129 17
1.9
enqueue 1 0 81 810
0.0
latch free 1,764 943 54 0
45.2
control file parallel write 24 0 44 18
0.6
library cache pin 18 0 16 9
0.5
log file sync 17 0 11 6
0.4
log buffer space 6 0 11 18
0.2

Instance Activity Stats for DB: OLIBPR Instance: olibpr Snaps: 16114
-16116

Statistic Total per Second per
Trans
--------------------------------- ---------------- ------------
------------
CPU used by this session 754,871 10,065.0
19,355.7
CPU used when call started 2,017 26.9
51.7

parse count (hard) 122 1.6
3.1
parse count (total) 1,279 17.1
32.8
parse time cpu 119 1.6
3.1
parse time elapsed 148 2.0
3.8

Latch Activity for DB: OLIBPR Instance: olibpr Snaps: 16114 -16116
->"Get Requests", "Pct Get Miss" and "Avg Slps/Miss" are statistics for
willing-to-wait latch get requests
->"NoWait Requests", "Pct NoWait Miss" are for no-wait latch get
requests
->"Pct Misses" for both should be very close to 0.0

Pct Avg
Pct
Get Get Slps NoWait
NoWait
Latch Name Requests Miss /Miss Requests
Miss
----------------------------- -------------- ------ ------ ------------
------
JOX SGA heap latch 869 0.1 1.0 0
Token Manager 2 0.0 5
0.0
active checkpoint queue latch 30 0.0 5
0.0
cache buffer handles 1,942 0.0 0
cache buffers chains 254,304 0.1 0.1 136
0.0
cache buffers lru chain 47 0.0 123
0.0
channel handle pool latch 17 0.0 0
channel operations parent lat 22 0.0 0
checkpoint queue latch 444 0.0 0
dml lock allocation 226 0.0 0
enqueue hash chains 551 0.0 0
enqueues 892 0.0 0
event group latch 5 0.0 0
job_queue_processes parameter 1 0.0 0
latch wait list 1,434 1.3 0.1 1,425
0.4
library cache 63,432,885 0.0 0.2 431
3.9
library cache load lock 1,589 2.3 0.0 0
list of block allocation 129 0.0 0
loader state object freelist 6 0.0 0
longop free list 2 0.0 0
messages 464 0.0 0
mostly latch-free SCN 139 0.0 0
ncodef allocation latch 1 0.0 0
process allocation 5 0.0 5
0.0
process group creation 17 0.0 0
redo allocation 882 0.2 0.0 0
redo writing 357 0.0 0
row cache objects 17,425 0.6 0.0 8
0.0
sequence cache 57 0.0 0
session allocation 5,417 0.0 0.0 0
session idle bit 1,140 0.0 0
session switching 1 0.0 0
shared java pool 47 0.0 0
shared pool 21,489 0.2 0.2 0
sort extent pool 12 0.0 0
transaction allocation 194 0.0 0
transaction branch allocation 1 0.0 0
undo global data 339 0.0 0
user lock 222 0.0 0
-------------------------------------------------------------
Latch Sleep breakdown for DB: OLIBPR Instance: olibpr Snaps: 16114
-16116
-> ordered by misses desc

Get
Spin &
Latch Name Requests Misses Sleeps
Sleeps 1->4
-------------------------- -------------- ----------- -----------
------------
library cache 63,432,885 9,048 1,723
7666/1065/29

5/22/0
cache buffers chains 254,304 180 14
168/10/2/0/0
row cache objects 17,425 102 2
100/2/0/0/0
shared pool 21,489 49 11
44/2/2/1/0
latch wait list 1,434 19 1
18/1/0/0/0
JOX SGA heap latch 869 1 1
0/1/0/0/0
-------------------------------------------------------------
Latch Miss Sources for DB: OLIBPR Instance: olibpr Snaps: 16114
-16116
-> only latches with sleeps are shown
-> ordered by name, sleeps desc

NoWait
Waiter
Latch Name Where Misses Sleeps
Sleeps
------------------------ -------------------------- ------- ----------
-------
cache buffers chains kcbgtcr: kslbegin 0 13
9
cache buffers chains kcbrls: kslbegin 0 1
5
latch wait list No latch 0 1
1
latch wait list kslfre 2 1
1
library cache kglic 0 690
1,551
library cache kglpnal: child: before pro 0 308
14
library cache kglpnal: child: alloc spac 0 211
18
library cache kglhdgn: child: 0 156
23
library cache kglobpn: child: 0 78
19
library cache kglhdgh: child: 0 32
24
library cache kgldti: 2child 0 31
0
library cache kglhdgc: child: 0 31
5
library cache kgllkdl: child: cleanup 0 26
12
library cache kglrtl 0 23
0
library cache kglupc: child 0 16
12
library cache kgldtld: 2child 0 15
0
library cache kglati 0 13
7
library cache kglpin 0 13
3
library cache kglget: child: KGLDSBRD 0 7
0
library cache kgllkdl: child: free pin 0 5
10
library cache kglget: child: KGLDSBYD 0 4
1
library cache kglpnc: child 0 4
1
library cache kglpnal: child: check gran 0 3
1
library cache kgldtld: parent 0 1
0
library cache kglini: parent 0 1
0
library cache kglpnp: child 0 1
2
row cache objects kqrpre: find obj 0 2
0
shared pool kghfrunp: clatch: nowait 0 32
0
shared pool kghfrunp: clatch: wait 0 19
3
shared pool kghfrunp: alloc: clatch no 0 5
0
shared pool kghfrunp: alloc: wait 0 4
0
shared pool kghalp 0 1
0
shared pool kghupr1 0 1
2
-------------------------------------------------------------
Dictionary Cache Stats for DB: OLIBPR Instance: olibpr Snaps: 16114
-16116
->"Pct Misses" should be very low (< 2% in most cases)
->"Cache Usage" is the number of cache entries being used
->"Pct SGA" is the ratio of usage to allocated size for that cache

Get Pct Scan Pct Mod
Final Pct
Cache Requests Miss Requests Miss Req
Usage SGA
---------------------- ------------ ------ -------- ----- --------
------ ----
dc_constraints 0 0 0
0 0
dc_database_links 0 0 0
0 0
dc_files 0 0 0
10 91
dc_free_extents 0 0 0
11 92
dc_global_oids 0 0 0
5 83
dc_histogram_data 0 0 0
0 0
dc_histogram_data_valu 0 0 0
0 0
dc_histogram_defs 1,614 17.0 0 0
507 100
dc_object_ids 1,090 6.2 0 0
183 96
dc_objects 860 15.3 0 0
310 99
dc_outlines 0 0 0
0 0
dc_profiles 22 0.0 0 0
2 40
dc_rollback_segments 0 0 0
17 94
dc_segments 414 31.6 0 0
123 97
dc_sequence_grants 0 0 0
1 50
dc_sequences 19 0.0 0 14
8 67
dc_synonyms 88 25.0 0 0
52 93
dc_tablespace_quotas 0 0 0
1 50
dc_tablespaces 4 25.0 0 0
2 20
dc_used_extents 0 0 0
5 83
dc_user_grants 20 0.0 0 0
16 42
dc_usernames 572 0.0 0 0
12 52
dc_users 951 0.0 0 0
30 77
ifs_acl_cache_entries 0 0 0
0 0
-------------------------------------------------------------

Library Cache Activity for DB: OLIBPR Instance: olibpr Snaps: 16114
-16116
->"Pct Misses" should be very low

Get Pct Pin Pct
Invali-
Namespace Requests Miss Requests Miss Reloads
dations
--------------- ------------ ------ -------------- ------ ----------
--------
BODY 81 1.2 83 8.4 5
0
CLUSTER 277 0.4 32 6.3 0
0
INDEX 0 0 0
0
OBJECT 0 0 0
0
PIPE 0 0 0
0
SQL AREA 1,500 10.7 22,161 0.8 72
8
TABLE/PROCEDURE 2,160 3.4 7,376 10.2 271
0
TRIGGER 12 0.0 12 0.0 0
0
-------------------------------------------------------------
SGA Memory Summary for DB: OLIBPR Instance: olibpr Snaps: 16114
-16116

SGA regions Size in Bytes
------------------------------ ----------------
Database Buffers 104,857,600
Fixed Size 104,936
Redo Buffers 77,824
Variable Size 102,891,520
----------------
sum 207,931,880

<END OF EXTRACT>

From: DA Morgan on 29 Jun 2005 15:04

norman.dunbar(a)lfs.co.uk wrote:
> Morning all (or afternoon - I'm behind another firewall and using
> Google !)
>
> Oracle enterprise : 8.1.7.3 (there's nothing I can do about this I'm
> afraid !)

What prevents patching to 8.1.7.4?
--
Daniel A. Morgan
http://www.psoug.org
damorgan(a)x.washington.edu
(replace x with u to respond)

From: Joel Garry on 29 Jun 2005 20:46

norman.dunbar(a)lfs.co.uk wrote:
> Morning all (or afternoon - I'm behind another firewall and using
> Google !)
>
> Oracle enterprise : 8.1.7.3 (there's nothing I can do about this I'm
> afraid !)
> HP-UX 11.11
> 8 CPUs and load average is 0.5 over 45 days.
>
>
> Sessions hang on library cache load lock, all are executing the same
> sql statement which makes a call to UTL_TCP.flush followed by
> utl_tcp.close_connectioon (see below). This is intermittants but very
> irritating for my users.
>
> I've Googled these groups and Metalinked till I'm blue in the face -
> nothing. :o(

Well, there's this old thing, which will probably tell you what you
already know:

spool tuning_stats.txt

ttitle 'SYSTEM STATISTICS FOR ORACLE7'

select 'LIBRARY CACHE STATISTICS:' from dual;

ttitle off

select 'PINS - # of times an item in the library cache was executed -
'||
sum(pins),
'RELOADS - # of library cache misses on execution steps - '||
sum (reloads)||chr(10)||chr(10),
'RELOADS / PINS * 100 = '||round((sum(reloads) / sum(pins) *
100),2)||'%'
from v$librarycache
/

prompt Increase memory until RELOADS is near 0 but watch out for
prompt Paging/swapping
prompt To increase library cache, increase SHARED_POOL_SIZE
prompt
prompt ** NOTE: Increasing SHARED_POOL_SIZE will increase the SGA size.
prompt
prompt Library Cache Misses indicate that the Shared Pool is not big
prompt enough to hold the shared SQL area for all concurrently open
cursors.
prompt If you have no Library Cache misses (PINS = 0), you may get a
small
prompt increase in performance by setting CURSOR_SPACE_FOR_TIME = TRUE
which
prompt prevents ORACLE from deallocating a shared SQL area while an
prompt application
prompt cursor associated with it is open.
prompt
prompt For Multi-threaded server, add 1K to SHARED_POOL_SIZE per user.
prompt
prompt------------------------------------------------------------------------

>
>
> Gruesome details.
>
> User logs into an application - it sometimes hangs.
> User calls me up to advise me of the hang, in case I don't already
> know.
> I log into application - it hangs for me too. (I'm on a different
> network etc).
> I use SQLPlus and get an instant connection - my session, and that of
> my user(s) stay hung. Hence problem must be in the application.
>
> My script to interrogate V$SESSION_WAIT & V$SESSION shows the following
> :
>
> SQL> @sessionwait
>
> SID USERNAME OSUSER SQL_HASH_VALUE EVENT
>
> ------------ --------------- --------------- --------------
> -----------------------
> 29 WV20 www2 1359669396 library
> cache load lock
> 68 WV20 landrews 1359669396 library
> cache load lock
> 70 WV20 www2 1359669396 library
> cache load lock
> 79 WV20 www2 1359669396 library
> cache load lock
> 90 WV20 www2 1359669396 library
> cache load lock
> 114 WV20 www2 1359669396 library
> cache load lock
>
> 6 rows selected.
>
> www2 is a web interface (written in Java) to the application via
> Apache.
>
> The application also seems to login three sessions to the database for
> every actual execution of the application. Hmm.

Wild guess - the thing is shooting itself in the foot requesting
something in the shared pool that it itself is causing to be reloaded.

>
>
> Everyone is waiting on the same sql_hash_value. This is a simple
> statement :
>
> SQL> @sqlbyhash_8i
>
> Enter value for hash_value: 1359669396
>
> SQL_TEXT
> ----------------------------------------------------------------
> begin -- if there are errors, there's nothing we can do :( b
> egin fdi_norm.Tidy; exception when others then null;
> end; end;
>
> Now, I've tracked the above code to one place, in a logoff trigger for
> a specific schema that we are not (directly) logging into - it is
> possible that the application is logging into it in the background
> though. The comments helped in tracing the code !
>
> The code being executed, FDI_NORM.TIDY, can and is executed from
> elsewhere though. I've never caught it hanging when called from
> elsewhere, it's always from the above statement which is in the
> trigger.
>
> The actual code being executed from the trigger looks like this :
>
> procedure Tidy is
> begin
> utl_tcp.flush(c);
> utl_tcp.close_connection(c);
> Connected := false;
> SessionID := -1;
> exception
> when OTHERS then null;
> end Tidy;
>
> I have both FDI_NORM and UTL_TCP pinned - it makes no difference. (By
> the way, FDI_NORM has nothing to do with me, only the names are the
> same - this code is from a third party application.)

TCP stuff often has timeouts. Do a netstat -a |grep FIN_WAIT . If you
see lots of FIN_WAIT2, google for that, you may need to play with hp-ux
timeout settings. When that happens, it's like a slow drain in the
bathtub for it to go away. And you can then blame the networking folk.

Check to be sure what you think is pinned really is.

>
> Now I've tested this procedure in isolation when the system is running
> fine - it takes 2.5 seconds on the first run and zero seconds on all
> subsequent runs during this session.
>
> When the system is hanging, it takes around 4 minutes to run and times
> out. The connection is to another server here and that server is
> running happily as is the network - we've had it checked by out
> networking guys & gals.
>

jg
--
@home.com is bogus.
"It exempts - it is everything what it could bleed This is why it never
goes will be White - naked naked to the night Alone looking for for
some game little Has another girl who wants to conduct the world Any
hour or place and when it enters in its head You knows it is there for
being You want this It you are acquiring this Chambers of molly you
iram to change its mind It you are acquiring Its pistol Slow - It is
burning in its soul With whispers in its ear Is certain I will give
this in any way Has little me arrives outside daqui You will plead -
you low in its knees For have little another taste well and when you
think that they leave to it inside This Is when it to diminish You want
this It you are acquiring this Chambers of molly you iram to change its
mind It you are acquiring Its pistol You want this It you are acquiring
this Chambers of molly you iram to change its mind It you are acquiring
Its pistol" - Bablefish translation of Portugese version of Molly's
Chambers.

From: norman.dunbar on 30 Jun 2005 05:44

Morning Daniel,

it's a third party app and not validated for 8174. We are in the
processes of migrating a number of apps to 9i even as I type, but we
have a number of legacy apps still on 7.3.4 and 8.1.7.0-8.1.7.4 and
even one at 8.0.4 !

Cheers,
Norm.

From: DA Morgan on 30 Jun 2005 10:09

norman.dunbar(a)lfs.co.uk wrote:
> Morning Daniel,
>
> it's a third party app and not validated for 8174. We are in the
> processes of migrating a number of apps to 9i even as I type, but we
> have a number of legacy apps still on 7.3.4 and 8.1.7.0-8.1.7.4 and
> even one at 8.0.4 !
>
> Cheers,
> Norm.

You have my sympathy. But I've never heard of anyone validating to
8.1.7.3 and not 8.1.7.4. The usual validation is to 8iR3 meaning 8.1.7
without specifying the final digit.

It might be useful to name names. In other words publish the names of
any vendor that is forcing you to stay with Oracle 7.x or 8.0.x so that
others will know that this vendor does not give a damn about its
customers and hopefully will stay clear of future purhases from them.
--
Daniel A. Morgan
http://www.psoug.org
damorgan(a)x.washington.edu
(replace x with u to respond)

| Next | Last
Pages: 1 2
Prev: ORA-1113 signalled during: alter database open
Next: Oracle 10 RAC to 10G RAC standby