From: Fujii Masao on
On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander <magnus(a)hagander.net> wrote:
> With the libpq fixes, I get further (more on that fix later, btw), but
> now I get stuck in this. When I do something on the master that
> generates WAL, such as insert a record, and then try to query this on
> the slave, the walreceiver process crashes with:
>
> PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 160:
>  Invalid argument
> LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487
>
> I'll keep digging at the details, but if somebody has a good idea here.. ;)

Yeah, I reproduced this problem in my (very slow :-( ) MinGW environment, too.
Though I haven't identified the cause yet, I suspect it comes from using the
wrong types for local variables in XLogWalRcvWrite(). I'll keep investigating.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Magnus Hagander on
2010/2/16 Fujii Masao <masao.fujii(a)gmail.com>:
> On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander <magnus(a)hagander.net> wrote:
>> With the libpq fixes, I get further (more on that fix later, btw), but
>> now I get stuck in this. When I do something on the master that
>> generates WAL, such as insert a record, and then try to query this on
>> the slave, the walreceiver process crashes with:
>>
>> PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 160:
>>  Invalid argument
>> LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487
>>
>> I'll keep digging at the details, but if somebody has a good idea here.. ;)
>
> Yeah, I reproduced this problem in my (very slow :-( ) MinGW environment, too.
> Though I haven't identified the cause yet, I suspect it comes from using the
> wrong types for local variables in XLogWalRcvWrite(). I'll keep investigating.

Thanks!

I will be somewhat spottily available over the next two days due to
on-site work with clients.

Let me know if you would be helped by some details of how to get a
(somewhat faster) EC2 image up and running with MSVC to test on :-)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Fujii Masao on
On Wed, Feb 17, 2010 at 6:28 AM, Magnus Hagander <magnus(a)hagander.net> wrote:
> If you send me your amazon id, I can get you permissions on my private
> image. I plan to clean it up and make it public, just haven't gotten
> around to it yet...

Thanks for the offer! I'll send the ID once I've finished the preparation.

And, fortunately(?), when I set wal_sync_method to open_sync, the problem was
reproduced on Linux, too. The cause is that the data written by
walreceiver is not aligned, even though O_DIRECT is used. On win32, O_DIRECT is
used by default, so the problem always happened on win32.

I propose two possible solutions:

1. O_DIRECT is somewhat harmful in the standby since the data written by
walreceiver is read by the startup process immediately. So, how about
making only walreceiver not use O_DIRECT?

2. Straightforwardly observe the alignment rule. Since the received WAL
data might start in the middle of a WAL block, walreceiver needs to keep
the last partially written WAL block for alignment. OTOH, since the received
data might end in the middle of a WAL block, walreceiver needs zero-padding.
As a result, walreceiver writes the last WAL block, the received data, and
the zero-padding together.
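
As a rough illustration of idea 2, the alignment arithmetic could look
something like the sketch below. WAL_BLOCK_SIZE, AlignedSpan,
compute_aligned_span() and build_aligned_buffer() are hypothetical names made
up for this sketch, not existing walreceiver code, and the 8192-byte block
size is an assumption. It only covers the file offset and length; O_DIRECT
typically also wants the memory buffer itself aligned, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed WAL block size for this sketch. */
#define WAL_BLOCK_SIZE 8192

/* Hypothetical description of one block-aligned write. */
typedef struct AlignedSpan
{
    size_t write_start;   /* block-aligned file offset where the write begins */
    size_t write_len;     /* total bytes to write, a multiple of the block size */
    size_t head_bytes;    /* already-written bytes of the first block to re-emit */
    size_t tail_padding;  /* zero bytes appended after the received data */
} AlignedSpan;

/*
 * Given the file offset and length of the received WAL data, work out the
 * enclosing block-aligned range.
 */
AlignedSpan
compute_aligned_span(size_t recv_offset, size_t recv_len)
{
    AlignedSpan span;
    size_t      end = recv_offset + recv_len;
    size_t      aligned_end;

    assert(recv_len > 0);

    /* Round the start down to a block boundary. */
    span.write_start = recv_offset - (recv_offset % WAL_BLOCK_SIZE);
    span.head_bytes = recv_offset - span.write_start;

    /* Round the end up to a block boundary. */
    aligned_end = ((end + WAL_BLOCK_SIZE - 1) / WAL_BLOCK_SIZE) * WAL_BLOCK_SIZE;
    span.tail_padding = aligned_end - end;
    span.write_len = aligned_end - span.write_start;
    return span;
}

/*
 * Assemble the bytes to write: the saved head of the last partially written
 * block, then the received data, then zero-padding. "out" must have room
 * for span->write_len bytes.
 */
void
build_aligned_buffer(char *out, const AlignedSpan *span,
                     const char *saved_head,
                     const char *recv, size_t recv_len)
{
    memcpy(out, saved_head, span->head_bytes);
    memcpy(out + span->head_bytes, recv, recv_len);
    memset(out + span->head_bytes + recv_len, 0, span->tail_padding);
}
```

For the failing write in the report (offset 0, length 160), this gives a
single 8192-byte write at offset 0 with 8032 bytes of zero-padding.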

Which is better? Or do you have a better idea?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Magnus Hagander on
On Wed, Feb 17, 2010 at 06:55, Fujii Masao <masao.fujii(a)gmail.com> wrote:
> On Wed, Feb 17, 2010 at 6:28 AM, Magnus Hagander <magnus(a)hagander.net> wrote:
>> If you send me your amazon id, I can get you permissions on my private
>> image. I plan to clean it up and make it public, just haven't gotten
>> around to it yet...
>
> Thanks for the offer! I'll send the ID once I've finished the preparation.

ok.


> And, fortunately(?), when I set wal_sync_method to open_sync, the problem was
> reproduced on Linux, too. The cause is that the data written by

Ah, that's good. It always helps if it's a cross-platform issue -
particularly in that it's not one of the funky win32 specific things
we did :)


> walreceiver is not aligned, even though O_DIRECT is used. On win32, O_DIRECT is
> used by default, so the problem always happened on win32.

Ahh. I see.


> I propose two possible solutions:
>
> 1. O_DIRECT is somewhat harmful in the standby since the data written by
>   walreceiver is read by the startup process immediately. So, how about
>   making only walreceiver not use O_DIRECT?

In that case, O_DIRECT would be counterproductive, no? It maps to
FILE_FLAG_NO_BUFFERING, which makes sure it doesn't go into the
cache. So the read in the startup proc is actually guaranteed to
require a physical read - of something we just wrote, so it'll almost
certainly end up waiting for a rotation, no?

Seems like getting rid of O_DIRECT here is the right thing to do,
regardless of this.
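
Roughly, dropping O_DIRECT for walreceiver alone could be sketched like
this. wal_open_flags() and its is_walreceiver parameter are hypothetical,
invented for illustration; this is not the actual xlog.c code, just a sketch
of masking the flag out before the WAL file is opened.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc only exposes O_DIRECT with this */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* On platforms without O_DIRECT, treat it as a no-op flag. */
#ifndef O_DIRECT
#define O_DIRECT 0
#endif

/*
 * Hypothetical helper: combine the flags implied by the sync method with
 * O_RDWR, but never let walreceiver open the WAL file with O_DIRECT, since
 * the startup process reads the data back right away.
 */
int
wal_open_flags(int sync_flags, bool is_walreceiver)
{
    int         flags = O_RDWR | sync_flags;

    if (is_walreceiver)
        flags &= ~O_DIRECT;

    /* Clearing O_DIRECT must not disturb the access mode. */
    assert((flags & O_ACCMODE) == O_RDWR);
    return flags;
}
```

Other backends would keep whatever the sync method asks for; only the
walreceiver path loses O_DIRECT, so its unaligned writes go through the
OS cache.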


> 2. Straightforwardly observe the alignment rule. Since the received WAL
>   data might start in the middle of a WAL block, walreceiver needs to keep
>   the last partially written WAL block for alignment. OTOH, since the received
>   data might end in the middle of a WAL block, walreceiver needs zero-padding.
>   As a result, walreceiver writes the last WAL block, the received data, and
>   the zero-padding together.

Might there be other reasons to do this as well?


--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Fujii Masao on
On Tue, Feb 16, 2010 at 7:20 PM, Magnus Hagander <magnus(a)hagander.net> wrote:
> 2010/2/16 Fujii Masao <masao.fujii(a)gmail.com>:
>> On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander <magnus(a)hagander.net> wrote:
>>> With the libpq fixes, I get further (more on that fix later, btw), but
>>> now I get stuck in this. When I do something on the master that
>>> generates WAL, such as insert a record, and then try to query this on
>>> the slave, the walreceiver process crashes with:
>>>
>>> PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 160:
>>>  Invalid argument
>>> LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487
>>>
>>> I'll keep digging at the details, but if somebody has a good idea here... ;)
>>
>> Yeah, I reproduced this problem in my (very slow :-( ) MinGW environment, too.
>> Though I haven't identified the cause yet, I suspect it comes from using the
>> wrong types for local variables in XLogWalRcvWrite(). I'll keep investigating.
>
> Thanks!
>
> I will be somewhat spottily available over the next two days due to
> on-site work with clients.
>
> Let me know if you would be helped by some details of how to get a
> (somewhat faster) EC2 image up and running with MSVC to test on :-)

Thanks! I can probably use the EC2 image by reading your great blog post.
http://blog.hagander.net/archives/151-Testing-PostgreSQL-patches-on-Windows-using-Amazon-EC2.html

But it might take some time to get my sysadmin to open the port for
rdesktop, for various reasons...

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
