This works, this does not... why? [FPGA]


From: KJ on 2 Dec 2009 22:25

On Dec 2, 6:06 pm, "aleksa" <aleks...(a)gmail.com> wrote:
> I just found out that even the FDRSE
> version doesn't work on the long run.
>

I forgot to add in my previous post that another effective technique
(in most cases the most effective one) is simply to
backtrack and see where it leads. In your case, in the OP you said...

"I wrote a test prog that has failed after several seconds. READY was
set to '1' when it should have stayed '0'."

Forget the part about "it should have stayed '0'" and focus on what
actually *did* happen, which is that READY did go to '1'. Now
backtrack by looking at your code and asking what this implies. From
your posted code, it means that one of two conditions must have
occurred around the rising edge of CLK.

1. ACTION='1' and ACTIONCODE="00"
2. ACTION='1' and ACTIONCODE="01" and ACTIONBIT = '1'.

Even if you think you know that neither condition could have occurred,
you must be wrong. One of those two conditions MUST have occurred
because READY did get set to 1. Now you continue the backtracking by
forming a hypothesis about what must have occurred to meet the
conditions for each path. Keep doing this for each path until you
spot a likely source of the problem. Then try to verify that this is
the case and only then do you fix it.
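
As a rough illustration of that first backtracking step, here is a small Python sketch that lists every input combination that could have set READY. It models only the two conditions quoted above, not the poster's actual VHDL; the function names `ready_set` and `candidate_causes` are made up for this example.

```python
from itertools import product

def ready_set(action, actioncode, actionbit):
    """True when either of the two quoted conditions holds at a rising CLK edge."""
    cond1 = action == 1 and actioncode == "00"
    cond2 = action == 1 and actioncode == "01" and actionbit == 1
    return cond1 or cond2

def candidate_causes():
    """Enumerate all (ACTION, ACTIONCODE, ACTIONBIT) values that set READY."""
    codes = ["00", "01", "10", "11"]
    return [
        (a, c, b)
        for a, c, b in product((0, 1), codes, (0, 1))
        if ready_set(a, c, b)
    ]

if __name__ == "__main__":
    # Each printed line is one hypothesis to trace further back in the design.
    for a, c, b in candidate_causes():
        print(f"ACTION={a} ACTIONCODE={c} ACTIONBIT={b}")
```

Every combination it prints has ACTION='1', so under this model the next question is how ACTION could have been '1' at that clock edge.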

You can keep this backtracking as a mental exercise until you're ready
to test the most likely hypothesis. Alternatively, you can try to
verify a particular hypothesis before moving on to cut down on the
number of paths you need to analyze. To do this, though, you would have
to modify the design in some fashion. Even if that only means bringing out
additional signals to see what is going on, it is still a modification
that can make the problem 'go under'. When a problem disappears, but
you don't know why, that is the worst of all worlds because you can
end up thinking that you've somehow 'fixed' something when in fact you
haven't...and deep in your heart you know that problem is still there.

The important thing is to forget about what *should* be occurring and
simply take the facts (READY = '1') and use the source code of the
design to backtrack through what must have transpired in order to make
this event occur.

By making modifications (like the rewrite that you did) without
understanding why the original failed, you're making changes but you
can't answer the simple question of why the original didn't work.

Kevin Jennings

From: RCIngham on 3 Dec 2009 04:46

[snip /]

>
>By making modifications (like the rewrite that you did) without
>understanding why the original failed, you're making changes but you
>can't answer the simple question of why the original didn't work.
>
>Kevin Jennings
>

This is sometimes called the "stochastic design method", and is related to
the proposed simian rewrite of the complete works of Shakespeare - the
time-to-complete is utterly unpredictable...
;-)

Cheers,
Robert


From: aleksa on 4 Dec 2009 17:25

> 1. Verify that the timing report for Fmax is greater than the actual
> clock frequency.
> 2. Verify that the setup time requirement listed in the timing report
> for each input is actually being met in the real system.

All timings are verified, however there is one problem.

This is what I had in my 1st version of UCF:
OFFSET = OUT 15 ns AFTER "CLK"; # ALL pins are constrained.

After viewing the reports, I've seen that not only SCODE0 is
affected by that constraint, but also the complete DBUS.

(Slave CPU writes with SCODE0, Master CPU reads with DBUS)

My thinking was/is: the master CPU will not read the data regs
until the status reg shows READY='1', so there is no need
to optimize timing from CLK to DBUS.

So I replaced my UCF with this 2nd version:
INST "SCODE0" TNM = CLK_OUT; # constrain SCODE0 only
TIMEGRP "CLK_OUT" OFFSET = OUT 15 ns AFTER "CLK";

At first, that worked. However, after changing my READY code
things started to go wrong (and that is when I posted to this NG).

Now, I have reverted back to 1st UCF version, and the problem
is, I dare to say, gone.

Q: has that really solved my problem?

I really don't see anything wrong with my VHDL code and I had
that test prog running for hours w/o errors now.

> 3. Have the timing analyzer analyze all clock domain crossings or look
> at the final implementation for clock domain crossings
> - Does every clock domain crossing meet that requirement?
> - Did you verify that the requirement is met by viewing the final
> implementation?

I used the timing analyzer (TA) for the first time yesterday,
so I don't have much experience.

TA shows a list of constrained and unconstrained paths.

I did my best and removed almost all of the unconstrained items;
only the "Maximum Data Path: CLK to FF" ones are left, and they
all have a delay of only 2.2ns.

This is what I now have in my 3rd UCF:

for every global clock:
NET "CLK" TNM_NET = CLK;
TIMESPEC TS_CLK = PERIOD "CLK" 25 ns HIGH 50%;

OFFSET = IN 10 ns VALID 15 ns BEFORE "CLK";
OFFSET = OUT 15 ns AFTER "CLK";

next, all combinations of:
TIMESPEC TS_CLK1_2 = FROM "CLK1" TO "CLK2" 15 ns;

and:
TIMESPEC "TS_P2P" = FROM "PADS" TO "PADS" 15 ns;
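
For reference, the margins implied by the numbers in this UCF can be worked out with a few lines of Python. This is only arithmetic on the values above, under the usual reading of the keywords (OFFSET IN ... VALID gives the window in which input data is stable around the clock edge, OFFSET OUT gives the maximum clock-to-pad delay). The function name and dictionary keys are invented for this sketch, and the "external budget" figure only makes sense if the receiving device samples on the next edge of the same clock.

```python
def timing_budget(period_ns, offset_in_ns, valid_ns, offset_out_ns):
    """Derive the margins implied by PERIOD / OFFSET IN ... VALID / OFFSET OUT."""
    return {
        # Clock frequency that corresponds to the PERIOD constraint.
        "clock_mhz": 1000.0 / period_ns,
        # Input data is stable from offset_in_ns before the clock edge ...
        "stable_before_edge_ns": offset_in_ns,
        # ... until (valid_ns - offset_in_ns) after the edge.
        "stable_after_edge_ns": valid_ns - offset_in_ns,
        # What is left of the period after the FPGA's clock-to-pad delay.
        "external_budget_ns": period_ns - offset_out_ns,
    }

if __name__ == "__main__":
    budget = timing_budget(period_ns=25.0, offset_in_ns=10.0,
                           valid_ns=15.0, offset_out_ns=15.0)
    for name, value in budget.items():
        print(f"{name}: {value}")
```

With the values above this gives a 40 MHz clock, inputs stable from 10 ns before to 5 ns after the edge, and 10 ns left for everything outside the FPGA on the output side.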

> Have the timing analyzer analyze all clock domain crossings

How?
Like this: "TIMESPEC TS_CLK1_2 = FROM "CLK1" TO "CLK2" 15 ns;"?

Since I now know a little more than yesterday, I went back to 2nd UCF
file, hoping to see why that failed. TA did show some errors,
but nothing connected to the problem I was seeing, at least I think so.
Plenty of unconstrained items, but, again, no apparent connection..

In other words, I have it now working, but am not sure if the
problem is really solved, or I'm just currently lucky.

> Some of your comments are
> contradictory (only one clock, but there are multiple things being
> clocked, there are multiple clocks)

Well, there are three clocks, but only one (CLK) is important here:
- MASTERCLK just toggles WR0, and then CLK copies it into its own domain.
- SCLK is connected to an ordinary pin and gets sampled with CLK.
- (MASTERCLK and CLK are connected to GCLK pins)

> - If multiple bits get moved from one domain to another (maybe the two
> bits of 'ACTIONCODE' as an example) what one *other* signal is there
> that tells you that it is OK to sample these signals and that they are
> guaranteed valid?

Only one bit is moved: SCODE0 goes to SHIFTIN on a rising SCLK when SCODE1='0'.
The signal that tells me that it is OK to sample is a rising SCLK with
SCODE1='1' and SCODE0='0'. Read my second post; maybe it is not
commented well, but it's all there.
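
For readers following along, the general idea of sampling an asynchronous signal such as SCLK with CLK and then detecting its rising edge can be modelled in a few lines of Python. This is a behavioural sketch of the common two-flop synchronizer plus edge detector, not the code from my second post; the class name `EdgeSync` and the flop names are assumptions made for the example.

```python
class EdgeSync:
    """Two-flop synchronizer followed by a rising-edge detector (behavioural model)."""

    def __init__(self):
        self.ff1 = 0  # first synchronizer stage, samples the asynchronous input
        self.ff2 = 0  # second synchronizer stage, safe to use in the CLK domain
        self.ff3 = 0  # previous value of ff2, used for edge detection

    def clock(self, async_in):
        """Advance one rising edge of CLK; return 1 for one cycle per input rising edge."""
        # All three flops update together from their values before the edge.
        self.ff3, self.ff2, self.ff1 = self.ff2, self.ff1, async_in
        return int(self.ff2 == 1 and self.ff3 == 0)

if __name__ == "__main__":
    sync = EdgeSync()
    samples = [0, 0, 1, 1, 1, 0, 0, 1, 1]  # input level seen at each CLK edge
    print([sync.clock(s) for s in samples])
```

A Python model cannot show metastability; it only shows the latency and the one-pulse-per-edge behaviour. Any other signal that is sampled together with the detected edge still has to be stable at that CLK edge.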

From: aleksa on 5 Dec 2009 04:40

> This is what I had in my 1st version of UCF:
> OFFSET = OUT 15 ns AFTER "CLK"; # ALL pins are constrained.

I forgot to mention that I also had
PERIOD and OFFSET IN for CLK and
PERIOD, OFFSET IN and OFFSET OUT for all other clocks.

From: KJ on 5 Dec 2009 12:06

On Dec 4, 5:25 pm, "aleksa" <aleks...(a)gmail.com> wrote:
>
> Now, I have reverted back to 1st UCF version, and the problem
> is, I dare to say, gone.
>
> Q: has that really solved my problem?
>

In your original post you said...
"In real world that didn't work. I wrote a test prog that has failed
after several seconds. READY was set to '1' when it should have stayed
'0'"

Unless you can explain, at least to yourself, the chain of events that
allowed 'READY' to be set to 1 when it should have stayed 0, I would
say that no, you haven't really solved the problem, because you
don't yet understand the problem. There are many things that one
can change to make a problem seem to disappear, but usually it
disappears only for some period of time and then reappears later...and this
later time is usually at the most inopportune moment, when you'll be
under some real heat to fix the problem.

> I really don't see anything wrong with my VHDL code and I had
> that test prog running for hours w/o errors now.
>

Try heating and cooling the various parts with cold spray and a heat
gun and see if it all still works.

Look at it this way...
- You had a failure (described in your original post)
- You haven't explained the reason for the failure
- You've put in changes that make the problem less frequent (code
changes and constraint changes)
- You currently have something that appears to be working (it hasn't
failed after several hours) but can't explain why previous versions
didn't

Now ask yourself, if you were the end user rather than the designer,
would you feel confident that the issue has been put to rest and will
never come back?
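
To put a rough number on how little a few clean hours prove, here is a small Python sketch. It assumes failures occur at random with a constant rate (which a temperature- or voltage-dependent timing problem would not obey, so treat it as optimistic) and uses a hypothetical 3-hour run, since the thread only says "hours". The function names are made up for this example.

```python
import math

def max_failure_rate(test_seconds, confidence=0.95):
    """Highest failure rate (per second) still consistent with zero failures
    in test_seconds, at the given confidence, for constant-rate random failures."""
    return -math.log(1.0 - confidence) / test_seconds

def min_mtbf_hours(test_hours, confidence=0.95):
    """Mean time between failures (hours) that the clean run supports at best."""
    rate_per_second = max_failure_rate(test_hours * 3600.0, confidence)
    return 1.0 / rate_per_second / 3600.0

if __name__ == "__main__":
    hours = 3.0  # hypothetical test duration
    print(f"{hours} h without failure supports an MTBF of only "
          f"about {min_mtbf_hours(hours):.2f} h at 95% confidence")
```

Under those assumptions, a clean 3-hour run only shows that the mean time between failures is probably above about one hour, nowhere near "never".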

>
> TA shows a list of constrained and unconstrained paths.
>
> I did my best and removed almost all of the unconstrained items;
> only the "Maximum Data Path: CLK to FF" ones are left, and they
> all have a delay of only 2.2ns.
>
> This is what I now have in my 3rd UCF:
>

Again, more changes without understanding why the system failed in the
first place.

>
> Since I now know a little more than yesterday, I went back to 2nd UCF
> file, hoping to see why that failed. TA did show some errors,

You went to the wrong place. Put a scope or logic analyzer on the
failing hardware.

> but nothing connected to the problem I was seeing, at least I think so.
> Plenty of unconstrained items, but, again, no apparent connection..
>
> In other words, I have it now working, but am not sure if the
> problem is really solved, or I'm just currently lucky.
>

Since you don't understand why it failed, you're getting lucky. There
are also two forms of luck. It would be 'good luck' if you happened
to stumble upon the fix without understanding the failure. It will be
'bad luck' if this change has only made the problem go away on this
board (or some small set of boards) and it resurfaces when the design
goes into production.

Design problems are like submarines: unless you target and sink them,
they will resurface.

I'd strongly suggest reading and following the guidelines I outlined
in my second posting on December 2 regarding how to debug. That
process will lead you to understanding why your original two cracks at
it failed. From that knowledge you'll be able to know (not guess)
whether your last attempt actually fixes the problem or you just got
lucky.

Remember to start with your older failed attempts since they fail more
frequently (you can't debug something that appears to be working).
You need to know why something failed before you can evaluate whether
you've fixed it or covered it up.

Kevin Jennings
