Safe to bypass fsck? [Solaris]

From: Casper H.S. Dik on 23 Apr 2010 04:14

"Mr. Nice Guy" <aaron(a)mcs-partners.com> writes:

>Hi. I have a Solaris 10 system with a partition that has journaling
>turned off - that's the way the admin originally set it up. Recently,
>this system booted and stopped before loading the OS, prompting with
>"Type control-d to proceed with normal startup, (or give root
>password for system maintenance)".

>Typically, I would login as root and run fsck on the partition to fix
>it. But another guy says I don't need to do that. He said that what he
>does is type the <stop-a> sequence to get the OBP prompt, and then
>does a "reset-all", and that, according to him, fixes the problem with
>the file-system. That goes against logic, and experience, so I wanted
>to get a third perspective to see what others think.

Nope; if the system refuses to boot, then you will need to run fsck to
fix the problems which caused it not to boot.

Is it possible to run ufs without log? (There needs to be a "nologging"
flag in /etc/vfstab; logging is the default)
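
To check for that option, something like the following could be used. It is only a sketch: it assumes the usual seven-field /etc/vfstab layout (device to mount, device to fsck, mount point, FS type, fsck pass, mount at boot, mount options), and the function name and the sample entries are made up for illustration.

```python
# Sketch: list UFS entries in vfstab-style text that disable logging.
# Assumed field order (standard /etc/vfstab layout):
# device, fsck device, mount point, FS type, fsck pass, mount at boot, options.

def ufs_without_logging(vfstab_text):
    """Return mount points of UFS entries whose options include 'nologging'."""
    result = []
    for line in vfstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 7:
            continue  # not a well-formed entry; ignore it
        mount_point, fs_type, options = fields[2], fields[3], fields[6]
        if fs_type == "ufs" and "nologging" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample data, not taken from any real system.
SAMPLE = """\
#device         device          mount   FS      fsck    mount   mount
#to mount       to fsck         point   type    pass    at boot options
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /export/home ufs 2 yes nologging
swap - /tmp tmpfs - yes -
"""

if __name__ == "__main__":
    print(ufs_without_logging(SAMPLE))
```

On a real machine the text would come from reading /etc/vfstab; an entry with "-" in the options field simply gets the default.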

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

From: chuckers on 25 Apr 2010 19:28

On Apr 24, 4:29 am, "Colin B." <cbi...(a)somewhereelse.shaw.ca> wrote:
> Richard B. Gilbert <rgilber...(a)comcast.net> wrote:
>
> > Colin B. wrote:
>
> ...
>
> >> With journalling turned off, you will definitely want to do an fsck.
> >> Breaking and rebooting will cause more problems, not fix them. It *may*
> >> be able to hide filesystem corruption, so you're not as healthy as you
> >> think. That's not a good thing.
>
> >> The "other guy" is an idiot who doesn't know what he's doing. Ignore him.
>
> >> Colin
>
> > Is there some reason not to turn on journaling?
>
> > Journaling can make things run slower. Basically it writes "I'm going
> > to update block 24,736" to a file, does the update and, on successful
> > completion, deletes the "I'm going to update. . . ." and is done. If
> > the system crashes while doing the update it reboots and finds that it
> > was in the process of doing an update, had not completed it, and applies
> > the failed transaction.
>
> I noticed that the OP was not using logging, which is why I mentioned it.
>
> Years ago, when logging first became available, it could be very slow in
> some circumstances, due (mostly) to bugs. I haven't seen that for ages
> though. I'd say that if you're so close to the edge that you need the
> performance gain from disabling logging, then you're too close to the
> edge. I can't imagine anyone deliberately disabling logging anymore, unless
> they're misguided (or using ZFS).
>
> Colin
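
As an aside, the "write the intent, do the update, clear the intent" cycle described in the quoted text can be illustrated with a toy model. This is a sketch only, not how UFS logging is actually implemented; the class and method names are invented for the example.

```python
# Toy model of an intent log: record the intent, do the update, clear the
# intent. After a crash, any intent still in the log is replayed.
# Purely illustrative; real file system logging is far more involved.

class ToyDisk:
    def __init__(self):
        self.blocks = {}      # block number -> data
        self.intent_log = []  # pending (block, data) records

    def update(self, block, data, crash_before_write=False):
        # 1. Record what we are about to do.
        self.intent_log.append((block, data))
        if crash_before_write:
            return  # simulated crash: intent logged, block not yet written
        # 2. Perform the update.
        self.blocks[block] = data
        # 3. The update completed, so drop the intent record.
        self.intent_log.remove((block, data))

    def recover(self):
        """Replay every logged intent; return how many were applied."""
        replayed = 0
        while self.intent_log:
            block, data = self.intent_log.pop(0)
            self.blocks[block] = data
            replayed += 1
        return replayed


if __name__ == "__main__":
    disk = ToyDisk()
    disk.update(24736, "new contents", crash_before_write=True)
    print(disk.recover(), disk.blocks)
```

The point of the model is that recovery only has to look at the log, not at every block, which is why a logged file system normally does not need a full fsck after a crash.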

We have an HP box serving NFS home directory mounts for our developers,
running Solaris 10u1. Due to some weird drivers that don't seem to work
in later versions of Solaris 10, we are unable to move it to a later
update. (It was politics and someone else's bright idea to go with this
box without looking into the details of it.)

Our developers have a tendency to create HUGE data log files (well north
of 2 GB), which eat into the disk space. Fine, just delete them to
reclaim space and move on.

Unfortunately, when we have tried deleting files this large with logging
on, the machine grinds to a halt trying to update the logging table as
it removes the multi-GB file from the system. The NFS mounts become
unusable for ANYBODY and hang for hours and hours and hours. We have
never had it come back to life without rebooting it. And after the
reboot, the file still isn't deleted, because it was still in the UFS
journal table. Booting to single-user mode and having root delete the
file doesn't help. So the file has sat there, taking up valuable space,
because it can't be deleted without taking out the NFS file system.

The ONLY way we have been able to work around this is to remount the
file system as nologging. That allows us to remove these monstrous files
without problems. We pretty much have to leave it as nologging because
there is no telling when another of these silly log files will be
created and someone realises they ought not to be using that much space
and deletes the file, bringing down the home directories of everyone
else around them.

From: chuckers on 25 Apr 2010 23:58

On Apr 26, 11:56 am, Michael Vilain <vil...(a)NOspamcop.net> wrote:
> In article
> <f02605e2-5664-4dc2-8b33-3581837db...(a)w32g2000prc.googlegroups.com>,
>
> chuckers <chucker...(a)gmail.com> wrote:
[edit]
>
> > ...
>
> You're going to have to protect the data some other way than with
> logging. Mirroring, RAID5, or ZFS are options. Otherwise, without
> logging, rebooting the hung machine and running fsck will take _days_.
> If you don't do fsck, you're going to end up with corrupted files.
>
> You have an OS performance problem here, clear and simple. I don't see a
> way around this short of moving to ZFS. Otherwise, you should be using
> that service contract you're paying for to have Sun f-ing fix this
> problem.
>
> Don't have a contract? Why are you bothering to run an NFS server with
> company-important files?
>

Those decisions are being made by someone above my pay grade who won't
listen to us little people in infrastructure. They are of the opinion
that it is cheaper to buy a new box than to pay for support on what we
have.

Fortunately, they don't get to make the same sort of decisions for any
of our customer-facing things. They rail about it but are shot down.

From: Colin B. on 26 Apr 2010 12:38

chuckers <chuckersjp(a)gmail.com> wrote:

> ...

Bugs. You're running into bugs, and you should fix them, rather than
working around them in risky ways.

Check out 127867-03 for a start. It's a very old patch (two years old
now), but has the following bug fixes:

6499704 statvfs64 too slow with logging UFS
6513858 deleting large file while creating another on full
UFS, spending lots of time in ufs_log_amt() loop
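
To see whether a box already has that patch at revision 03 or later, the output of "showrev -p" can be scanned. The sketch below assumes each installed patch shows up on a line beginning "Patch: NNNNNN-RR"; the helper name and the sample line (including its package name) are invented for illustration. Note that a patch can also be obsoleted by a later one, so a "False" here is a hint to look further, not proof that the fix is missing.

```python
# Sketch: check "showrev -p"-style text for a patch at a minimum revision.
# Assumption: installed patches appear on lines starting "Patch: NNNNNN-RR".
import re

PATCH_LINE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b")

def patch_at_least(showrev_output, patch_id, min_rev):
    """True if patch_id is listed at revision min_rev or higher."""
    for line in showrev_output.splitlines():
        match = PATCH_LINE.match(line)
        if match and match.group(1) == patch_id and int(match.group(2)) >= min_rev:
            return True
    return False


# Hypothetical sample line; the package name is a placeholder.
SAMPLE = "Patch: 127867-03 Obsoletes:  Requires:  Incompatibles:  Packages: PKGexample\n"

if __name__ == "__main__":
    print(patch_at_least(SAMPLE, "127867", 3))
```
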

That was my first kick at the cat on sunsolve.
If there's any reason you can't patch a broken system, then you need to
get the hell out of there before they implode.

Colin

From: Colin B. on 27 Apr 2010 00:20

Michael Vilain <vilain(a)nospamcop.net> wrote:
> In article <6kjBn.146897$gF5.80920(a)newsfe13.iad>,
> "Colin B." <cbigam(a)somewhereelse.shaw.ca> wrote:
>
>> chuckers <chuckersjp(a)gmail.com> wrote:
>>
>> > ...
>>
>> Bugs. You're running into bugs, and you should fix them, rather than
>> working around it in risky ways.
>>
>> Check out 127867-03 for a start. It's a very old patch (two years old
>> now), but has the following bug fixes:
>>
>> 6499704 statvfs64 too slow with logging UFS
>> 6513858 deleting large file while creating another on full
>> UFS, spending lots of time in ufs_log_amt() loop
>>
>> That was my first kick at the cat on sunsolve.
>> If there's any reason you can't patch a broken system, then you need to
>> get the hell out of there before they implode.
>>
>> Colin
>
> Colin, did you miss the fact that this is an unsupported, not-on-contract
> system and that the powers that be won't pay for support? The OP is
> supposed to fix this all by themselves. I told them to update their
> resume. Have you better suggestions short of paying for support via T&M
> out of pocket?

Well, they can still download releases, which means they can still get a
healthier system than S10u1. Of course, there were driver reasons against
that. I think the Sun Alert bundle is still available without a contract,
isn't it? This is an availability patch, so should be included in that.

However, the part of my message you left in is entirely relevant: If they
don't have support, can't get the patch, and can't upgrade, for reasons
of political insanity, they should polish their resume and get out. It's
a fact of life that politics, egos, and bureaucracy often make practical
work difficult--if they make it _impossible_ then there's no point in even
showing up for work, except to collect your paycheque.

Colin
