From: David Fetter on
On Thu, Sep 03, 2009 at 10:24:17AM -0400, Tom Lane wrote:
> We have seen several previous reports of regression test failures
> due to division by zero causing SIGFPE, even though the code should
> never reach the division command:
>
> http://archives.postgresql.org/pgsql-bugs/2006-11/msg00180.php
> http://archives.postgresql.org/pgsql-bugs/2007-11/msg00032.php
> http://archives.postgresql.org/pgsql-bugs/2008-05/msg00148.php
> http://archives.postgresql.org/pgsql-general/2009-05/msg00774.php
>
> It's always been on non-mainstream architectures so it was hard to
> investigate. But I have finally been able to reproduce this:
> https://bugzilla.redhat.com/show_bug.cgi?id=520916
>
> While s390x is still not quite mainstream, at least I can get
> access to one ;-).

Do you also have access to z/OS with Unix System Services? IBM's
compiler, c89, is amazingly strict, and should help us flush out bugs. :)

> What turns out to be the case is that
> "simple" test cases like
> if (y == 0)
> single_function_call(...);
> z = x / y;
> do not show the problem; you need something pretty complex in the
> if-command. Like, say, an ereport() construct. So that's why the gcc
> boys haven't already had visits from mobs of villagers about this.
>
> I hope that the bug will get fixed in due course, but even if they
> respond pretty quickly it will be years before the problem disappears
> from every copy of gcc in the field. So I'm thinking that it would
> behoove us to install a workaround, now that we've characterized the
> problem sufficiently. What I am thinking is that in the three
> functions known to exhibit the bug (int24div, int28div, int48div)
> we should do something like this:
>
>
> if (arg2 == 0)
> + {
> ereport(ERROR,
> (errcode(ERRCODE_DIVISION_BY_ZERO),
> errmsg("division by zero")));
> + /* ensure compiler realizes we don't reach the division */
> + PG_RETURN_NULL();
> + }
> /* No overflow is possible */
> PG_RETURN_INT64((int64) arg1 / arg2);
>
> Thoughts?

How big would this change be? How would people know to use that
construct everywhere it's appropriate?

Cheers,
David.
--
David Fetter <david(a)fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter(a)gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
David Fetter <david(a)fetter.org> writes:
> On Thu, Sep 03, 2009 at 10:24:17AM -0400, Tom Lane wrote:
>> While s390x is still not quite mainstream, at least I can get
>> access to one ;-).

> Do you also have access to z/OS with Unix System Services?

No, Red Hat's machines run RHEL ;-)

>> What I am thinking is that in the three
>> functions known to exhibit the bug (int24div, int28div, int48div)
>> we should do something like this:

> How big would this change be? How would people know to use that
> construct everywhere it's appropriate?

I'm talking about patching exactly those three functions. We don't
have any reports of trouble elsewhere. The long-term fix is in the
compiler anyway, this is just a workaround for currently-known issues.

Part of my motivation for this is to get rid of an existing hack in
the Red Hat RPMs:

# use -O1 on sparc64 and alpha
%ifarch sparc64 alpha
CFLAGS=`echo $CFLAGS| sed -e "s|-O2|-O1|g" `
%endif

I believe that that is only there to prevent exactly this problem.
Anyway I'd like to try removing it after patching as proposed, and
then we'll find out if there are other trouble spots ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: David Fetter on
On Thu, Sep 03, 2009 at 01:26:52PM -0400, Tom Lane wrote:
> David Fetter <david(a)fetter.org> writes:
> > On Thu, Sep 03, 2009 at 10:24:17AM -0400, Tom Lane wrote:
> >> While s390x is still not quite mainstream, at least I can get
> >> access to one ;-).
>
> > Do you also have access to z/OS with Unix System Services?
>
> No, Red Hat's machines run RHEL ;-)

I'm given to understand it's possible to run both on the same
hardware.

Cheers,
David
--
David Fetter <david(a)fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter(a)gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Bruce Momjian on
Tom Lane wrote:
> I hope that the bug will get fixed in due course, but even if they
> respond pretty quickly it will be years before the problem disappears
> from every copy of gcc in the field. So I'm thinking that it would
> behoove us to install a workaround, now that we've characterized the
> problem sufficiently. What I am thinking is that in the three
> functions known to exhibit the bug (int24div, int28div, int48div)
> we should do something like this:
>
>
> if (arg2 == 0)
> + {
> ereport(ERROR,
> (errcode(ERRCODE_DIVISION_BY_ZERO),
> errmsg("division by zero")));
> + /* ensure compiler realizes we don't reach the division */
> + PG_RETURN_NULL();
> + }

Could we document this a little better, perhaps refer to a file
somewhere or a central comment on its purpose?

--
Bruce Momjian <bruce(a)momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers