From: Ulf Samuelsson on
Jon Kirwan wrote:
> On Sat, 27 Mar 2010 08:15:03 +0100, Ulf Samuelsson
> <nospam.ulf(a)atmel.com> wrote:
>
>> <snip of LPC2xxx vs SAM7S discussion>
>> The 128 bit memory is overkill for thumb mode and just
>> wastes power.
>> <snip>
>
> Ulf, let me remind you of something you wrote about the SAM7:
>
> "In thumb mode, the 32 bit access gives you two
> instructions per cycle so in average this gives
> you 1 instruction per clock on the SAM7."
>
> I gather this is regarding the case where there is 1 wait
> state reading the 32-bit flash line -- so 2 clocks per line
> and thus the 1 clock per 16-bit instruction (assuming it
> executes in 1 clock.)
>
> Nico's comment about the NPX ARM, about the 128-bit wide
> flash line-width, would (I imagine) work about the same
> except that it reads at full clock rate speeds, no wait
> states. So I gather, if it works similarly, that there are
> eight thumb instructions per line (roughly.) I take it your
> point is that since each instruction (things being equal)
> cannot execute faster than 1 clock per, that it takes 8
> clocks to execute those thumb instructions.
>

Yes, the SAM7 is very nicely tuned to thumb mode.
The LPC2 provides much more bandwidth than is needed
when you run in thumb mode.
Due to the higher latency of the LPC's slower flash,
the SAM7 will be better at certain frequencies,
but the LPC will have a higher max clock frequency.

The real point is that you are not necessarily
faster because you have a wide memory.
The speed of the memory counts as well.
There are a lot of parameters to take into account
if you want to find the best part.

People with different requirements will find different
parts to be the best.

If you start to use high speed communications, then the
PDC of the SAM7 serial ports tends to even out any
difference in performance vs the LPC very quickly.


> The discussion could move between discussing instruction
> streams to discussing constant data tables and the like, but
> staying on the subject of instructions for the following....

Yes, this will have an effect.
Accessing a random word should be faster on the SAM7,
and, if you copy a large area sequentially,
having 128 bit memory will be beneficial.

>
> So the effect is that it takes the same number of clocks to
> execute 1-clock thumb instructions on either system?
> (Ignoring frequency, for now.) Or do I get that wrong?

Yes, the LPC will at certain frequencies have longer latency
so it will be marginally slower in thumb mode.

>
> You then discussed power consumption issues. Wouldn't it be
> the case that since the NPX ARM is accessing its flash at a
> 1/8th clock rate and the SAM7 is constantly operating its
> flash that the _average_ power consumption might very well be
> better with the NPX ARM, despite somewhat higher current when
> it is being accessed? Isn't the fact that the access cycle
> takes place far less frequently observed as a lower average?

As far as I understand, the chip select for the internal flash
is always active when you run at higher frequencies,
so there is a lot of wasted power.

> Perhaps the peak divided by 8, or so? (Again, keep the clock
> rates identical [downgraded to SAM7 rates in the NXP ARM
> case.]) Have you computed figures for both?

Best is to check the datasheet.
The CPU core used is another important parameter.
The SAM7S uses the ARM7TDMI while most others use the ARM7TDMI-S
(S = synthesizable), which inherently has 33% higher power consumption.



>
> Jon
From: Ulf Samuelsson on
Nico Coesel wrote:
> Ulf Samuelsson <ulf(a)a-t-m-e-l.com> wrote:
>
>> TheM wrote:
>>> "Nico Coesel" <nico(a)puntnl.niks> wrote in message news:4bacf169.1721173156(a)news.planet.nl...
>>>> "TheM" <DontNeedSpam(a)test.com> wrote:
>>>>
>>>>> "Spehro Pefhany" <speffSNIP(a)interlogDOTyou.knowwhat> wrote in message news:5elnq5d2ncjvs91v1cu5dmt5tbntuhefg3(a)4ax.com...
>>>>>> On Thu, 25 Mar 2010 13:19:46 -0800, "Bob Eld" <nsmontassoc(a)yahoo.com>
>>>>>> wrote:
>>>>>>
>>>>>>> "Peter" <nospam(a)nospam9876.com> wrote in message
>>>>>>> news:9lhmq5plg1gr3sduo9n52mdi5g6iiqucqc(a)4ax.com...
>>>>>>>> They have doubled their prices and the lead times are 18 weeks.
>>>>> Is this limited to EEPROM/Memory only or uCPU as well?
>>>>>
>>>>> Definitely worth considering getting out of AVR.
>>>>> Do NPX ARM come with on-chip FLASH?
>>>> Yes, all of them have 128 bit wide flash that allows zero waitstate
>>>> execution at the maximum CPU clock.
>>> Not bad, I ordered a couple books on ARM off Amazon, may get into it finally.
>>> From what I see they are same price as AVR mega, low power and much faster.
>>> And NXP is very generous with samples.
>>>
>>> M
>>>
>>>
>> The typical 32 bitters of today are implemented using advanced
>> flash technologies which allows high density memories in small chip
>> areas, but they are not low power.
>>
>> The inherent properties of the process makes for high leakage.
>> When you see power consumption in sleep of around 1-2 uA,
>> this is when the chip is turned OFF.
>> Only a small part of the chip is powered, RTC and a few other things.
>>
>> When you implement in a 0.25u process or higher, you can have the chip
>> fully initialized and ready to react on input while using
>> 1-2 uA in sleep.
>>
>> That is a big difference.
>>
>> While the NXP devices gets zero waitstate from 128 bit bus,
>> this also makes them extremely power hungry.
>> An LPC ARM7 uses about 2 x the current of a SAM7.
>> It gets higher performance in ARM mode.
>>
>> The ARM mode has a price in code size, so if you want more features,
>> then you better run in Thumb mode. The SAM7 with 32 bit flash is
>> actually faster than the LPC when running in Thumb mode,
>> (at the same frequency) since the SAM7 uses a 33 MHz flash,
>> while the LPC uses a 24 MHz flash.
>> In thumb mode, the 32 bit access gives you two instructions
>> per cycle so in average this gives you 1 instruction per clock on the SAM7.
>
> I think this depends a lot on what method you use to measure this.
> Thumb code is expected to be slower than ARM code. You should test
> with drystone and make sure the same C library is used since drystone
> results also depend on the C library!

It is pretty clear that if you
* execute out of flash in thumb mode
* do not access flash for data transfers
* run the chips at equivalent frequencies
* run sequential fetch at zero waitstates

then the difference will be the number of waitstates in non-sequential fetch.
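
To put rough numbers on the point above, here is a small C sketch of that
model; the instruction count, taken-branch ratio, and per-branch waitstate
penalties are invented placeholders, not figures from either datasheet:

/* Back-of-the-envelope sketch, not datasheet numbers: estimate how much
 * the non-sequential (branch) fetch penalty matters once sequential
 * thumb fetches are zero-waitstate on both parts. */
#include <stdio.h>

static unsigned long cycles(unsigned long insns, double branch_fraction,
                            unsigned branch_penalty_ws)
{
    unsigned long branches = (unsigned long)(insns * branch_fraction);
    /* 1 cycle per instruction, plus extra waitstates on every
     * non-sequential fetch (taken branch). */
    return insns + branches * branch_penalty_ws;
}

int main(void)
{
    unsigned long insns = 1000000UL;   /* instructions executed      */
    double branch_frac  = 0.15;        /* assumed taken-branch ratio */

    /* Hypothetical penalties: 0 vs 1 extra waitstate per non-sequential fetch. */
    printf("part A (0 ws non-seq): %lu cycles\n", cycles(insns, branch_frac, 0));
    printf("part B (1 ws non-seq): %lu cycles\n", cycles(insns, branch_frac, 1));
    return 0;
}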



>
>> Less waitstates means higher performance.
>> By copying a few 32 bit ARM routines to SRAM,
>> you can overcome that limitation.
>> You can get slightly higher top frequency out of the LPC,
>> but that again increases the power consumption.
>>
>>
>> For Cortex-M3 I did some test on the new SAM3, which can be
>> configured to use both 64 bit or 128 bit memories.
>> With a 128 bit memory, you can wring about 5% extra performance
>> out of the chip compared to 64 bit operation.
>>From a power consumption point of view it is probably better
>> to increase the clock frequency by 5% than to enable the 128 bit mode.
>> It is therefore only the most demanding applications that have
>> any use for the 128 bit memory.
>>
>> Testing on other Cortex-M3 chips indicate similar results.
>>
>> Someone told me that they tried executing out of SRAM on an STM32
>> and this was actually slower than executing out of flash.
>> Executing out of external memory also appears to be a problem,
>> since there is no cache/burst and bandwidth seems to be lower
>> than equivalent ARM7 devices.
>
> That doesn't surprise me. From my experience with STR7 and the STM32
> datasheets it seems ST does a sloppy job putting controllers together.
> They are cheap but you don't get maximum performance.
>
>> Current guess is that the AHB bus has some delays due to
>> synchronization. Also if you execute out of SRAM
>> you are going to have conflicts with data access.
>> Something which is avoided when you execute out of flash.
>
> NXP has some sort of cache between the CPU and the flash on the M3
> devices. According to the documentation NXP's LPC1700 M3 devices use a
> Harvard architecture with 3 busses so multiple data transfers
> (CPU-flash, CPU-memory and DMA) can occur simultaneously. Executing
> from RAM would occupy one bus so you'll have less memory bandwidth to
> work with.
>

The SAM3 uses the same AHB bus as the ARM9.
The "bus" is actually a series of multiplexers where each target
has a multiplexer with an input for each bus master.

As long as no one else wants to access the same target,
a bus master will get unrestricted access.

If you execute from flash, you will get full access for the instruction
bus (with the exception of the few constant fetches).
If you execute out of a single SRAM, you have to share access
with the data transfers, which will slow you down.
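
A toy C model of that contention (this is not the real AHB matrix
arbitration, and the 30% data-access rate is an arbitrary assumption):

/* Toy model of why executing from a single SRAM shared with data
 * traffic costs cycles: when the instruction fetch and a data access
 * target the same memory in the same cycle, one of them has to wait. */
#include <stdio.h>

int main(void)
{
    unsigned long total_cycles = 1000000UL;
    double data_access_rate = 0.3;  /* assumed: 30% of cycles do a data access */

    /* Case 1: code in flash, data in SRAM -> different targets, no conflict. */
    double stalls_split = 0.0;

    /* Case 2: code and data both in one SRAM -> each data access collides
     * with that cycle's instruction fetch and adds a stall. */
    double stalls_shared = total_cycles * data_access_rate;

    printf("flash + SRAM : %.0f stall cycles\n", stalls_split);
    printf("single SRAM  : %.0f stall cycles (~%.0f%% more cycles)\n",
           stalls_shared, 100.0 * stalls_shared / total_cycles);
    return 0;
}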

BR
Ulf Samuelsson
From: Jon Kirwan on
On Sat, 27 Mar 2010 14:14:58 +0100, Ulf Samuelsson
<nospam.ulf(a)atmel.com> wrote:

>Jon Kirwan wrote:
>> On Sat, 27 Mar 2010 08:15:03 +0100, Ulf Samuelsson
>> <nospam.ulf(a)atmel.com> wrote:
>>
>>> <snip of LPC2xxx vs SAM7S discussion>
>>> The 128 bit memory is overkill for thumb mode and just
>>> wastes power.
>>> <snip>
>>
>> Ulf, let me remind you of something you wrote about the SAM7:
>>
>> "In thumb mode, the 32 bit access gives you two
>> instructions per cycle so in average this gives
>> you 1 instruction per clock on the SAM7."
>>
>> I gather this is regarding the case where there is 1 wait
>> state reading the 32-bit flash line -- so 2 clocks per line
>> and thus the 1 clock per 16-bit instruction (assuming it
>> executes in 1 clock.)
>>
>> Nico's comment about the NPX ARM, about the 128-bit wide
>> flash line-width, would (I imagine) work about the same
>> except that it reads at full clock rate speeds, no wait
>> states. So I gather, if it works similarly, that there are
>> eight thumb instructions per line (roughly.) I take it your
>> point is that since each instruction (things being equal)
>> cannot execute faster than 1 clock per, that it takes 8
>> clocks to execute those thumb instructions.
>
>Yes, the SAM7 is very nicely tuned to thumb mode.
>The LPC2 provides much more bandwidth than is needed
>when you run in thumb mode.

I think I gathered that much and didn't disagree, just
wondered.

>Due to the higher latency for the LPC, due to slower
>flash, the SAM7 will be better at certain frequencies,
>but the LPC will have a higher max clock frequency.

I remember you writing that "SAM7 uses a 33 MHz flash, while
the LPC uses a 24 MHz flash."  It seems hard to imagine,
though, except perhaps for data fetch situations or
branching, it being actually slower. If it fetches something
like 8 thumb instructions at a time, anyway. As another
poster pointed out, the effective rate is much higher for
sequential reads no matter how you look at it. So it would
take branching or non-sequential data fetches to highlight
the difference.

One would have to do an exhaustive, stochastic analysis of
application spaces to get a good bead on all this. But
ignorant of the details as I truly am right now, not having a
particular application in mind and just guessing where I'd
put my money if betting one way or another, I'd put it on 384
MB/sec memory over 132 MB/sec memory for net throughput.
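
For reference, the raw line-fetch bandwidth behind those two figures, using
the 24 MHz and 33 MHz flash speeds quoted earlier in the thread:

/* Raw flash line-fetch bandwidth = line width in bytes * flash access rate. */
#include <stdio.h>

int main(void)
{
    unsigned lpc_bytes_per_line = 128 / 8;  /* 128-bit flash line */
    unsigned sam_bytes_per_line =  32 / 8;  /* 32-bit flash line  */

    printf("LPC : %u MB/sec\n", lpc_bytes_per_line * 24);  /* 24 MHz flash */
    printf("SAM7: %u MB/sec\n", sam_bytes_per_line * 33);  /* 33 MHz flash */
    return 0;
}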

>The real point is that you are not necessarily
>faster

Yes, but the key here is the careful "not necessarily"
wording. Not necessarily, is true enough, as one could form
specific circumstances where you'd be right. But it seems to
me they'd be more your 'corner cases' than 'run of the mill.'

>because you have a wide memory.
>The speed of the memory counts as well.

Of course. So people who seem to care about the final speed
and little else should indeed do some analysis before
deciding. But if they don't know their application well
enough to make that comparison... hmm.

>There are a lot of parameters to take into account
>if you want to find the best part.

Yes. That seems to ever be true!

>People with different requirements will find different
>parts to be the best.

Yes, no argument. I was merely curious about something else
which you mostly didn't answer, so I suppose if I care enough
I will have to go find out on my own.... see below.

>If you start to use high speed communications, then the
>PDC of the SAM7 serial ports tend to even out any
>difference in performance vs the LPC very quickly.

Some parts have such wonderfully sophisticated peripherals.
Some of these are almost ancient (68332, for example.) So
it's not only a feature of new parts, either. Which goes
back to your point that there are a lot of parameters to take
into account, I suppose.

>> The discussion could move between discussing instruction
>> streams to discussing constant data tables and the like, but
>> staying on the subject of instructions for the following....
>
>Yes, this will have an effect.
>Accessing a random word should be faster on the SAM7
>and, assuming you copy sequentially a large area
>having 128 bit memory will be beneficial.

The 'random' part being important here. In some cases, that
may be important where the structures are 'const' and can be
stored in flash and are accessed in a way that cannot take
advantage of the 128-bit wide lines. A binary search on a
calibration table with small table entry sizes, perhaps,
might be a reasonable example that actually occurs often
enough and may show off your point well. Other examples,
such as larger element sizes (such as doubles or pairs of
doubles) for that binary search or a FIR filter table used
sequentially, might point the other way.

>> So the effect is that it takes the same number of clocks to
>> execute 1-clock thumb instructions on either system?
>> (Ignoring frequency, for now.) Or do I get that wrong?
>
>Yes, the LPC will at certain frequencies have longer latency
>so it will be marginally slower in thumb mode.

I find this tough to stomach when talking about instruction
streams, unless there are lots of branches salted in the mix.
I know I must have read somewhere someone's analysis of many
programs and the upshot of this, but I think it was for the
x86 system and a product of Intel's research department some
years ago and I've no idea how well that applies to the ARM
core. I'm sure someone (perhaps you?) has access to such
analyses and might share it here?

>> You then discussed power consumption issues. Wouldn't it be
>> the case that since the NPX ARM is accessing its flash at a
>> 1/8th clock rate and the SAM7 is constantly operating its
>> flash that the _average_ power consumption might very well be
>> better with the NPX ARM, despite somewhat higher current when
>> it is being accessed? Isn't the fact that the access cycle
>> takes place far less frequently observed as a lower average?
>
>As far as I understand the chip select for the internal flash
>is always active when you run at higher frequencies
>so there is a lot of wasted power.

By "at higher frequencies" do you have a particular number
above which your comment applies and below which it does not?

In any case, this is the answer I was looking for and you
don't appear to answer now. Why would anyone "run the flash"
when the bus isn't active? It seems.... well, bone-headed.
And I can't recall any chip design being that poor. I've
seen cases where an external board design (not done by chip
designers, but more your hobbyist designer type) that did
things like that. But it is hard for me to imagine a chip
designer being that stupid. It's almost zero work to be
smarter than that.

So this suggests you want me to go study the situation. Maybe
someone already knows, though, and can post it. I can hope.

>> Perhaps the peak divided by 8, or so? (Again, keep the clock
>> rates identical [downgraded to SAM7 rates in the NXP ARM
>> case.]) Have you computed figures for both?
>
>Best is to check the datasheet.

I wondered if you already knew the answer. I suppose not,
now.

>The CPU core used is another important parameter.
>The SAM7S uses the ARM7TDMI while most others use the ARM7TDMI-S
>(S = synthesizable), which inherently has 33% higher power consumption.

I'm aware of the general issue. Your use of "most other"
does NOT address itself to the subject at hand, though. It
leaves open either possibility for the LPC2. But it's a
point worth keeping in mind if you make these chips, I
suppose. For the rest of us, it's just a matter of deciding
which works better by examining the data sheet. We don't
have the option to move a -S design to a crafted ASIC.

So this leaves some more or less interesting questions.

(1) Where is a quality report or two on the subject of
instruction mix for ARM applications, broken down by
application spaces that differ substantially from each other,
and what are the results of these studies?

(2) Does the LPC2 device really operate the flash all the
time? Or not?

(3) Is the LPC2 a -S (which doesn't matter that much, but
since the topic is brought up it might be nice to put that to
bed?)

I don't know.

Jon
From: Ulf Samuelsson on
Jon Kirwan wrote:
> On Sat, 27 Mar 2010 14:14:58 +0100, Ulf Samuelsson
> <nospam.ulf(a)atmel.com> wrote:
>
>> Jon Kirwan wrote:
>>> On Sat, 27 Mar 2010 08:15:03 +0100, Ulf Samuelsson
>>> <nospam.ulf(a)atmel.com> wrote:
>>>
>>>> <snip of LPC2xxx vs SAM7S discussion>
>>>> The 128 bit memory is overkill for thumb mode and just
>>>> wastes power.
>>>> <snip>
>>> Ulf, let me remind you of something you wrote about the SAM7:
>>>
>>> "In thumb mode, the 32 bit access gives you two
>>> instructions per cycle so in average this gives
>>> you 1 instruction per clock on the SAM7."
>>>
>>> I gather this is regarding the case where there is 1 wait
>>> state reading the 32-bit flash line -- so 2 clocks per line
>>> and thus the 1 clock per 16-bit instruction (assuming it
>>> executes in 1 clock.)
>>>
>>> Nico's comment about the NPX ARM, about the 128-bit wide
>>> flash line-width, would (I imagine) work about the same
>>> except that it reads at full clock rate speeds, no wait
>>> states. So I gather, if it works similarly, that there are
>>> eight thumb instructions per line (roughly.) I take it your
>>> point is that since each instruction (things being equal)
>>> cannot execute faster than 1 clock per, that it takes 8
>>> clocks to execute those thumb instructions.
>> Yes, the SAM7 is very nicely tuned to thumb mode.
>> The LPC2 provides much more bandwidth than is needed
>> when you run in thumb mode.
>
> I think I gathered that much and didn't disagree, just
> wondered.
>
>> Due to the higher latency for the LPC, due to slower
>> flash, the SAM7 will be better at certain frequencies,
>> but the LPC will have a higher max clock frequency.
>
> I remember you writing that "SAM7 uses a 33 MHz flash, while
> the LPC uses a 24 Mhz flash." It seems hard to imagine,
> though, except perhaps for data fetch situations or
> branching, it being actually slower. If it fetches something
> like 8 thumb instructions at a time, anyway. As another
> poster pointed out, the effective rate is much higher for
> sequential reads no matter how you look at it. So it would
> take branching or non-sequential data fetches to highlight
> the difference.
>
> One would have to do an exhaustive, stochastic analysis of
> application spaces to get a good bead on all this. But
> ignorant of the details as I truly am right now, not having a
> particular application in mind and just guessing where I'd
> put my money if betting one way or another, I'd put it on 384
> mb/sec memory over 132 mb/sec memory for net throughput.

That is because you ignore the congestion caused by the fact that the
ARM7 core only fetches 16 bits per access in thumb mode.
At 33 MHz, the CPU can only use 66 MB / second.
At 66 MHz, the CPU can only use 132 MB / second.
Since you can sustain 132 MB / second with a 33 MHz 32 bit memory,
you do not need it to be wider to keep the pipeline running
at zero waitstates for sequential fetch.
For non-sequential fetch, the width is not important.
Only the number of waitstates matters, and the SAM7 has the same
number of waitstates as the LPC, or fewer.
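
A small C sketch of this congestion argument; it models sequential-fetch
supply and demand only, ignores non-sequential latency, and uses the 24/33
MHz flash and 33/66 MHz core figures from the thread:

/* In thumb mode the ARM7 core consumes at most 2 bytes per CPU clock,
 * so flash width beyond what sustains that rate buys nothing for
 * sequential fetch. */
#include <stdio.h>

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

int main(void)
{
    unsigned cpu_mhz[] = { 33, 66 };

    for (unsigned i = 0; i < 2; i++) {
        unsigned demand = 2 * cpu_mhz[i];  /* MB/s the core can use  */
        unsigned sam7   = 4 * 33;          /* 32-bit flash @ 33 MHz  */
        unsigned lpc    = 16 * 24;         /* 128-bit flash @ 24 MHz */

        printf("%u MHz core: needs %u MB/s, flash offers SAM7 %u / LPC %u, "
               "usable SAM7 %u / LPC %u MB/s\n",
               cpu_mhz[i], demand, sam7, lpc,
               min_u(demand, sam7), min_u(demand, lpc));
    }
    return 0;
}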

----
The 128 bit memory is really only useful for ARM mode.
For thumb mode it is more or less a waste.

>
>> The real point is that you are not necessarily
>> faster
>
> Yes, but the key here is the careful "not necessarily"
> wording. Not necessarily, is true enough, as one could form
> specific circumstances where you'd be right. But it seems to
> me they'd be more your 'corner cases' than 'run of the mill.'

I don't think running in Thumb mode is a corner case.


>
>> because you have a wide memory.
>> The speed of the memory counts as well.
>
> Of course. So people who seem to care about the final speed
> and little else should indeed do some analysis before
> deciding. But if they don't know their application well
> enough to make that comparison... hmm.
>
>> There are a lot of parameters to take into account
>> if you want to find the best part.
>
> Yes. That seems to ever be true!
>
>> People with different requirements will find different
>> parts to be the best.
>
> Yes, no argument. I was merely curious about something else
> which you mostly didn't answer, so I suppose if I care enough
> I will have to go find out on my own.... see below.
>
>> If you start to use high speed communications, then the
>> PDC of the SAM7 serial ports tend to even out any
>> difference in performance vs the LPC very quickly.
>
> Some parts have such wonderfully sophisticated peripherals.
> Some of these are almost ancient (68332, for example.) So
> it's not only a feature of new parts, either. Which goes
> back to your point that there are a lot of parameters to take
> into account, I suppose.
>
>>> The discussion could move between discussing instruction
>>> streams to discussing constant data tables and the like, but
>>> staying on the subject of instructions for the following....
>> Yes, this will have an effect.
>> Accessing a random word should be faster on the SAM7
>> and, assuming you copy sequentially a large area
>> having 128 bit memory will be beneficial.
>
> The 'random' part being important here. In some cases, that
> may be important where the structures are 'const' and can be
> stored in flash and are accessed in a way that cannot take
> advantage of the 128-bit wide lines. A binary search on a
> calibration table with small table entry sizes, perhaps,
> might be a reasonable example that actually occurs often
> enough and may show off your point well. Other examples,
> such as larger element sizes (such as doubles or pairs of
> doubles) for that binary search or a FIR filter table used
> sequentially, might point the other way.
>
>>> So the effect is that it takes the same number of clocks to
>>> execute 1-clock thumb instructions on either system?
>>> (Ignoring frequency, for now.) Or do I get that wrong?
>> Yes, the LPC will at certain frequencies have longer latency
>> so it will be marginally slower in thumb mode.
>
> I find this tough to stomach, when talking about instruction
> streams Unless there are lots of branches salted in the mix.
> I know I must have read somewhere someone's analysis of many
> programs and the upshot of this, but I think it was for the
> x86 system and a product of Intel's research department some
> years ago and I've no idea how well that applies to the ARM
> core. I'm sure someone (perhaps you?) has access to such
> anaylses and might share it here?

LPC with 1 waitstate at 33 MHz.

NOP 2 (fetches 8 instructions)
NOP 1
NOP 1
NOP 1
NOP 1
NOP 1
NOP 1
NOP 1
..........
Sum = 9

Same code with SAM7, 0 waitstate at 33 MHz.

NOP 1 (fetches 1 instruction)
NOP 1 (fetches 1 instruction)
NOP 1 (fetches 1 instruction)
NOP 1 (fetches 1 instruction)
NOP 1 (fetches 1 instruction)
NOP 1 (fetches 1 instruction)
NOP 1 (fetches 1 instruction)
NOP 1 (fetches 1 instruction)
..........
Sum = 8

It should not be too hard to grasp.
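
The same accounting, generalised in a few lines of C (purely illustrative;
it ignores prefetching and branches):

/* Cycle count for a straight run of thumb instructions, given how many
 * instructions each flash line holds and the waitstates per line fetch. */
#include <stdio.h>

static unsigned long run_cycles(unsigned long insns,
                                unsigned insns_per_line,
                                unsigned waitstates_per_fetch)
{
    unsigned long fetches = (insns + insns_per_line - 1) / insns_per_line;
    return insns + fetches * waitstates_per_fetch;
}

int main(void)
{
    /* The two cases above: LPC fetches 8 thumb instructions per 128-bit
     * line with 1 waitstate; SAM7 fetches 2 per 32-bit access with 0. */
    printf("LPC,  8 NOPs: %lu cycles\n", run_cycles(8, 8, 1));  /* -> 9 */
    printf("SAM7, 8 NOPs: %lu cycles\n", run_cycles(8, 2, 0));  /* -> 8 */
    return 0;
}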


>
>>> You then discussed power consumption issues. Wouldn't it be
>>> the case that since the NPX ARM is accessing its flash at a
>>> 1/8th clock rate and the SAM7 is constantly operating its
>>> flash that the _average_ power consumption might very well be
>>> better with the NPX ARM, despite somewhat higher current when
>>> it is being accessed? Isn't the fact that the access cycle
>>> takes place far less frequently observed as a lower average?
>> As far as I understand the chip select for the internal flash
>> is always active when you run at higher frequencies
>> so there is a lot of wasted power.
>
> By "at higher frequencies" do you have a particular number
> above which your comment applies and below which it does not?

Each chip designer makes their own choices.
I know of some chips starting to strobe the flash
chip select when below 1-4 MHz.


>
> In any case, this is the answer I was looking for and you
> don't appear to answer now. Why would anyone "run the flash"
> when the bus isn't active? It seems.... well, bone-headed.
> And I can't recall any chip design being that poor. I've
> seen cases where an external board design (not done by chip
> designers, but more your hobbyist designer type) that did
> things like that. But it is hard for me to imagine a chip
> designer being that stupid. It's almost zero work to be
> smarter than that.

This is an automatic thing which measures the clock frequency
vs another clock frequency, and the "other" clock frequency
is often not that quick.


>
> So this suggests you want me to go study the situation. Maybe
> someone already knows, though, and can post it. I can hope.
>
>>> Perhaps the peak divided by 8, or so? (Again, keep the clock
>>> rates identical [downgraded to SAM7 rates in the NXP ARM
>>> case.]) Have you computed figures for both?
>> Best is to check the datasheet.
>
> I wondered if you already knew the answer. I suppose not,
> now.

Looking at the LPC2141 datasheet, which seems to be the part
closest to the SAM7S256, you get
57 mA @ 3.3 V = 188 mW @ 60 MHz = 3.135 mW/MHz.

The SAM7S datasheet gives 33 mA @ 3.3 V @ 55 MHz = 1.98 mW/MHz.
You can, on the SAM7S, choose to feed VDDCORE from 1.8 V.

The SAM7S is specified with USB enabled, so this
has to be used for the LPC as well for a fair comparison.
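
Spelling out that arithmetic from the quoted datasheet figures (power =
current * voltage, then divide by the clock):

/* mW/MHz from the figures above. */
#include <stdio.h>

int main(void)
{
    double lpc_mw = 57.0 * 3.3;  /* LPC2141: 57 mA @ 3.3 V, 60 MHz */
    double sam_mw = 33.0 * 3.3;  /* SAM7S:   33 mA @ 3.3 V, 55 MHz */

    printf("LPC2141: %.1f mW -> %.3f mW/MHz\n", lpc_mw, lpc_mw / 60.0);
    printf("SAM7S  : %.1f mW -> %.2f mW/MHz\n", sam_mw, sam_mw / 55.0);
    return 0;
}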

>> The CPU core used is another important parameter.
>> The SAM7S uses the ARM7TDMI while most other uses the ARM7TDMI-S
>> (S = synthesizable) which inherently has 33 % higher power consumption.
>
> I'm aware of the general issue. Your use of "most other"
> does NOT address itself to the subject at hand, though. It
> leaves open either possibility for the LPC2. But it's a
> point worth keeping in mind if you make these chips, I
> suppose. For the rest of us, it's just a matter of deciding
> which works better by examining the data sheet. We don't
> have the option to move a -S design to a crafted ASIC.
>
> So this leaves some more or less interesting questions.
>
> (1) Where is a quality report or two on the subject of
> instruction mix for ARM applications, broken down by
> application spaces that differ substantially from each other,
> and what are the results of these studies?
>
> (2) Does the LPC2 device really operate the flash all the
> time? Or not?
>

There are no figures in the datasheet indicating
any low power mode for the flash.

> (3) Is the LPC2 a -S (which doesn't matter that much, but
> since the topic is brought up it might be nice to put that to
> bed?)

Yes it is.
It should be enough to look in the datasheet.


> I don't know.
>
> Jon

Ulf
From: Jon Kirwan on
On Sun, 28 Mar 2010 01:04:20 +0100, Ulf Samuelsson
<nospam.ulf(a)atmel.com> wrote:

>Jon Kirwan wrote:
>> On Sat, 27 Mar 2010 14:14:58 +0100, Ulf Samuelsson
>> <nospam.ulf(a)atmel.com> wrote:
>>
>>> Jon Kirwan wrote:
>>>> On Sat, 27 Mar 2010 08:15:03 +0100, Ulf Samuelsson
>>>> <nospam.ulf(a)atmel.com> wrote:
>>>>
>>>>> <snip of LPC2xxx vs SAM7S discussion>
>>>>> The 128 bit memory is overkill for thumb mode and just
>>>>> wastes power.
>>>>> <snip>
>>>> Ulf, let me remind you of something you wrote about the SAM7:
>>>>
>>>> "In thumb mode, the 32 bit access gives you two
>>>> instructions per cycle so in average this gives
>>>> you 1 instruction per clock on the SAM7."
>>>>
>>>> I gather this is regarding the case where there is 1 wait
>>>> state reading the 32-bit flash line -- so 2 clocks per line
>>>> and thus the 1 clock per 16-bit instruction (assuming it
>>>> executes in 1 clock.)
>>>>
>>>> Nico's comment about the NPX ARM, about the 128-bit wide
>>>> flash line-width, would (I imagine) work about the same
>>>> except that it reads at full clock rate speeds, no wait
>>>> states. So I gather, if it works similarly, that there are
>>>> eight thumb instructions per line (roughly.) I take it your
>>>> point is that since each instruction (things being equal)
>>>> cannot execute faster than 1 clock per, that it takes 8
>>>> clocks to execute those thumb instructions.
>>> Yes, the SAM7 is very nicely tuned to thumb mode.
>>> The LPC2 provides much more bandwidth than is needed
>>> when you run in thumb mode.
>>
>> I think I gathered that much and didn't disagree, just
>> wondered.
>>
>>> Due to the higher latency for the LPC, due to slower
>>> flash, the SAM7 will be better at certain frequencies,
>>> but the LPC will have a higher max clock frequency.
>>
>> I remember you writing that "SAM7 uses a 33 MHz flash, while
>> the LPC uses a 24 Mhz flash." It seems hard to imagine,
>> though, except perhaps for data fetch situations or
>> branching, it being actually slower. If it fetches something
>> like 8 thumb instructions at a time, anyway. As another
>> poster pointed out, the effective rate is much higher for
>> sequential reads no matter how you look at it. So it would
>> take branching or non-sequential data fetches to highlight
>> the difference.
>>
>> One would have to do an exhaustive, stochastic analysis of
>> application spaces to get a good bead on all this. But
>> ignorant of the details as I truly am right now, not having a
>> particular application in mind and just guessing where I'd
>> put my money if betting one way or another, I'd put it on 384
>> mb/sec memory over 132 mb/sec memory for net throughput.
>
>That is because you ignore the congestion caused by the fact
>that the ARM7 core only fetches 16 bits per access in thumb mode.

I'm not entirely sure I understand. If both processors are
internally clocked at the same rate, they both have exactly
the same fetch rate in thumb mode.

>At 33 MHz, the CPU can only use 66 MB / second,
>At 66 MHz, the CPU can only use 132 MB / second.

Okay. I'm with you. Except that I haven't looked at the
data sheets to check for maximum core clock rates, since that
might bear on some questions.

>Since you can sustain 132 MB / second with a 33 Mhz 32 bit memory,
>you do not need it to be wider to keep the pipeline running
>at zero waitstates for sequential fetch.

In thumb mode and only talking about instructions and
assuming 66MHz peak. Do the processors (either of them)
sport separate buses, though, which can compete for the same
memory system? (Data + Instruction paths, for example.)

>For non-sequential fetch, the width is not important.

In the case of instructions, I think I take your meaning.
Regarding data, no, I don't.

>Only the number of waitstates, and the SAM7 has same or less # of
>waitstates than the LPC.

.... In the case of non-sequential instruction fetch.

All this still fails to account for actual application mix
reports. I'm still curious (and I'm absolutely positive that
this is _done_ by chip designers because I observed the sheer
magnitude of the effort that took place at Intel during the
P2 design period) about application analysis that must have
been done on ARM (32-bit, 16-bit, and mixed modes) and should
be available somewhere. Do you have access to such reports?
It might go a long way in clarifying your points.

>----
>The 128 bit memory is really only useful for ARM mode.
>For thumb mode it is more or less a waste.
>
>>> The real point is that you are not necessarily
>>> faster
>>
>> Yes, but the key here is the careful "not necessarily"
>> wording. Not necessarily, is true enough, as one could form
>> specific circumstances where you'd be right. But it seems to
>> me they'd be more your 'corner cases' than 'run of the mill.'
>
>I don't think running in Thumb mode is a corner case.

Actually, I meant this plural, not singular. And I don't
have a perspective on actual applications in these spaces. So
I'll just plead mostly ignorance here and hold off saying
more, as I'm mostly trying to understand, not claim, things.

>>> because you have a wide memory.
>>> The speed of the memory counts as well.
>>
>> Of course. So people who seem to care about the final speed
>> and little else should indeed do some analysis before
>> deciding. But if they don't know their application well
>> enough to make that comparison... hmm.
>>
>>> There are a lot of parameters to take into account
>>> if you want to find the best part.
>>
>> Yes. That seems to ever be true!
>>
>>> People with different requirements will find different
>>> parts to be the best.
>>
>> Yes, no argument. I was merely curious about something else
>> which you mostly didn't answer, so I suppose if I care enough
>> I will have to go find out on my own.... see below.
>>
>>> If you start to use high speed communications, then the
>>> PDC of the SAM7 serial ports tend to even out any
>>> difference in performance vs the LPC very quickly.
>>
>> Some parts have such wonderfully sophisticated peripherals.
>> Some of these are almost ancient (68332, for example.) So
>> it's not only a feature of new parts, either. Which goes
>> back to your point that there are a lot of parameters to take
>> into account, I suppose.
>>
>>>> The discussion could move between discussing instruction
>>>> streams to discussing constant data tables and the like, but
>>>> staying on the subject of instructions for the following....
>>> Yes, this will have an effect.
>>> Accessing a random word should be faster on the SAM7
>>> and, assuming you copy sequentially a large area
>>> having 128 bit memory will be beneficial.
>>
>> The 'random' part being important here. In some cases, that
>> may be important where the structures are 'const' and can be
>> stored in flash and are accessed in a way that cannot take
>> advantage of the 128-bit wide lines. A binary search on a
>> calibration table with small table entry sizes, perhaps,
>> might be a reasonable example that actually occurs often
>> enough and may show off your point well. Other examples,
>> such as larger element sizes (such as doubles or pairs of
>> doubles) for that binary search or a FIR filter table used
>> sequentially, might point the other way.
>>
>>>> So the effect is that it takes the same number of clocks to
>>>> execute 1-clock thumb instructions on either system?
>>>> (Ignoring frequency, for now.) Or do I get that wrong?
>>> Yes, the LPC will at certain frequencies have longer latency
>>> so it will be marginally slower in thumb mode.
>>
>> I find this tough to stomach when talking about instruction
>> streams, unless there are lots of branches salted in the mix.
>> I know I must have read somewhere someone's analysis of many
>> programs and the upshot of this, but I think it was for the
>> x86 system and a product of Intel's research department some
>> years ago and I've no idea how well that applies to the ARM
>> core. I'm sure someone (perhaps you?) has access to such
>> analyses and might share it here?
>
>LPC with 1 waitstate at 33 MHz.
>
>NOP 2 (fetches 8 instructions)
>NOP 1
>NOP 1
>NOP 1
>NOP 1
>NOP 1
>NOP 1
>NOP 1
>.........
>Sum = 9
>
>Same code with SAM7, 0 waitstate at 33 MHz.
>
>NOP 1 (fetches 1 instruction)
>NOP 1 (fetches 1 instruction)
>NOP 1 (fetches 1 instruction)
>NOP 1 (fetches 1 instruction)
>NOP 1 (fetches 1 instruction)
>NOP 1 (fetches 1 instruction)
>NOP 1 (fetches 1 instruction)
>NOP 1 (fetches 1 instruction)
>.........
>Sum = 8
>
>It should not be too hard to grasp.

What you wrote is obvious. But it is completely off the
question I asked. Take a close look at my words. I am
asking about the kind of analysis I observed taking place at
Intel during the P2 development. It was quite a lot of work
getting applications, compiler tools, and so on and
generating actual code and then analyzing it before
continuing the processor family design.

Such a simple NOP case would have been laughed at, had it
been presented as representative in such meetings. I'm
looking for the thorough-going analysis that often takes
place when smart folks attack a design.

>>>> You then discussed power consumption issues. Wouldn't it be
>>>> the case that since the NPX ARM is accessing its flash at a
>>>> 1/8th clock rate and the SAM7 is constantly operating its
>>>> flash that the _average_ power consumption might very well be
>>>> better with the NPX ARM, despite somewhat higher current when
>>>> it is being accessed? Isn't the fact that the access cycle
>>>> takes place far less frequently observed as a lower average?
>>> As far as I understand the chip select for the internal flash
>>> is always active when you run at higher frequencies
>>> so there is a lot of wasted power.
>>
>> By "at higher frequencies" do you have a particular number
>> above which your comment applies and below which it does not?
>
>Each chip designer makes their own choices.
>I know of some chips starting to strobe the flash
>chip select when below 1 - 4 Mhz
>>
>> In any case, this is the answer I was looking for and you
>> don't appear to answer now. Why would anyone "run the flash"
>> when the bus isn't active? It seems.... well, bone-headed.
>> And I can't recall any chip design being that poor. I've
>> seen cases where an external board design (not done by chip
>> designers, but more your hobbyist designer type) that did
>> things like that. But it is hard for me to imagine a chip
>> designer being that stupid. It's almost zero work to be
>> smarter than that.
>
>This is an automatic thing which measures the clock frequency
>vs another clock frequency, and the "other" clock frequency
>is often not that quick.

I guess I can't follow your words, here, at all. Maybe I
didn't write well, myself. In any case, I will just leave
this with my question still hanging there for me. Someone
else may understand and perhaps answer.

>> So this suggests you want me to go study the situation. Maybe
>> someone already knows, though, and can post it. I can hope.
>>
>>>> Perhaps the peak divided by 8, or so? (Again, keep the clock
>>>> rates identical [downgraded to SAM7 rates in the NXP ARM
>>>> case.]) Have you computed figures for both?
>>> Best is to check the datasheet.
>>
>> I wondered if you already knew the answer. I suppose not,
>> now.
>
>Looking at the LPC2141 datasheet, which seems to be the part
>closest to the SAM7S256 you get
>57 mA @ 3.3V = 188 mW @ 60 Mhz = 3.135 mW/Mhz.
>
>The SAM7S datasheet runs 33 mA @ 3.3 V @ 55 MHz = 1.98 mW/Mhz,
>You can, on the SAM7S choose to feed VDDCORE from 1.8V.
>
>The SAM7S is specified with USB enabled, so this
>has to be used for the LPC as well for a fair comparison.

Again, this misses my question entirely. But it may provide
some answers to some questions not asked by me.

>>> The CPU core used is another important parameter.
>>> The SAM7S uses the ARM7TDMI while most other uses the ARM7TDMI-S
>>> (S = synthesizable) which inherently has 33 % higher power consumption.
>>
>> I'm aware of the general issue. Your use of "most other"
>> does NOT address itself to the subject at hand, though. It
>> leaves open either possibility for the LPC2. But it's a
>> point worth keeping in mind if you make these chips, I
>> suppose. For the rest of us, it's just a matter of deciding
>> which works better by examining the data sheet. We don't
>> have the option to move a -S design to a crafted ASIC.
>>
>> So this leaves some more or less interesting questions.
>>
>> (1) Where is a quality report or two on the subject of
>> instruction mix for ARM applications, broken down by
>> application spaces that differ substantially from each other,
>> and what are the results of these studies?

A question which you went around completely in the above and
which still remains...

>> (2) Does the LPC2 device really operate the flash all the
>> time? Or not?
>
>You do not have any figures in the datasheet indicating
>low power mode.

I don't think I was asking about low power modes. I think
there must be a language problem, now. Let me try this
again.

When a memory system is cycled, there is power consumption
due to state changes and load capacitance and voltage swings
based upon the current from C*dV/dt and the supply voltages
involved. When the memory system isn't clocked, when it
remains 'static', leakage current can take place but the
level is a lot less. This isn't about a low power mode. It's
simply something fairly common to memory systems. I don't
know enough about flash to know exact differences here, but I
suspect that an unclocked flash memory consumes less power
than one being clocked consistently. Let me use your
simplistic example from above:

LPC with 1 waitstate at 33 MHz.

NOP 2 (fetches 8) 1 memory cycle
NOP 1 0 memory cycles
NOP 1 0 memory cycles
NOP 1 0 memory cycles
NOP 1 0 memory cycles
NOP 1 0 memory cycles
NOP 1 0 memory cycles
NOP 1 0 memory cycles
.......................................
Sum 9 1 memory cycle

Same code with SAM7, 0 waitstate at 33 MHz.

NOP 1 (fetches 1) 1 memory cycle
NOP 1 (fetches 1) 1 memory cycle
NOP 1 (fetches 1) 1 memory cycle
NOP 1 (fetches 1) 1 memory cycle
NOP 1 (fetches 1) 1 memory cycle
NOP 1 (fetches 1) 1 memory cycle
NOP 1 (fetches 1) 1 memory cycle
NOP 1 (fetches 1) 1 memory cycle
.......................................
Sum 8 8 memory cycles

As you say, "It should not be too hard to grasp."

I am imagining that 8 cycles against the flash will cost more
power than 1. But I may not be getting this right.
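
A back-of-the-envelope version of that argument in C; the energy-per-read
value is invented, and the model ignores that a 128-bit line read presumably
costs more energy per access than a 32-bit one:

/* Average flash read power scaled by how often the array is cycled,
 * using the access counts from the tables above. */
#include <stdio.h>

int main(void)
{
    double clock_hz        = 33e6;
    double energy_per_read = 1e-9;  /* assumed 1 nJ per flash line read */

    /* From the tables: 1 flash cycle per 9 clocks (LPC case) versus
     * 8 flash cycles per 8 clocks (SAM7 case). */
    double lpc_power  = (1.0 / 9.0) * clock_hz * energy_per_read;
    double sam7_power = (8.0 / 8.0) * clock_hz * energy_per_read;

    printf("LPC-style  flash read power: %.2f mW\n", lpc_power  * 1e3);
    printf("SAM7-style flash read power: %.2f mW\n", sam7_power * 1e3);
    return 0;
}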

>> (3) Is the LPC2 a -S (which doesn't matter that much, but
>> since the topic is brought up it might be nice to put that to
>> bed?)
>
>Yes it is.
>It should be enough to look in the datasheet.

Thanks. That's a much clearer statement than before.

Jon