From: girish on 29 Mar 2010 22:27

hello.
our hardware team seems to have almost concluded that the TLBs are the primary culprit. countermeasures such as a parity bit will for sure force us to cut some other feature to balance the die size and all that, which impacts the software/kernel to some extent.
please help me understand - why TLBs? is this the longest and un-checked/un-correctable path?
thanks in advance.
girish.gulawani

PS. this is not a course assignment.
From: MitchAlsup on 29 Mar 2010 22:38

I cannot think of any particular reason that the TLB CAMs and data cannot be covered by either parity or ECC. Neither check has to be on the critical path, as long as you have a means to machine check before the accessed data damages some permanent data structure.

If you would like to understand why this is the case, contact me via e-mail. I am available for consultations.

Mitch Alsup
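
As an illustration of the point about keeping the check off the critical path, here is a minimal software sketch, not anything posted in the thread. It assumes a single even-parity bit computed over the tag and data of an entry; the names `TlbEntry`, `make_entry`, `translate` and `MachineCheck` and the 12-bit page offset are all hypothetical. The physical address is formed first, and the parity comparison only has to finish before the result is allowed to commit.

```python
from dataclasses import dataclass


class MachineCheck(Exception):
    """Hypothetical stand-in for a machine-check exception."""


def parity(value: int) -> int:
    """Return 1 if value has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


@dataclass
class TlbEntry:
    vpn: int         # virtual page number (the CAM tag)
    ppn: int         # physical page number (the data)
    parity_bit: int  # parity stored when the entry was written


def make_entry(vpn: int, ppn: int) -> TlbEntry:
    """Write an entry and record parity over tag and data together."""
    return TlbEntry(vpn, ppn, parity(vpn) ^ parity(ppn))


def translate(entry: TlbEntry, vaddr: int, page_bits: int = 12) -> int:
    # Fast path: form the physical address without waiting for the check.
    offset = vaddr & ((1 << page_bits) - 1)
    paddr = (entry.ppn << page_bits) | offset

    # Slow path: the check may finish later, but it must finish before
    # the access is allowed to commit and touch permanent state.
    if parity(entry.vpn) ^ parity(entry.ppn) != entry.parity_bit:
        raise MachineCheck("TLB entry failed parity check")
    return paddr
```

A single parity bit only detects an odd number of flipped bits and cannot correct anything; ECC would be needed for correction.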
From: "Andy "Krazy" Glew" on 30 Mar 2010 23:21

On 3/29/2010 7:27 PM, girish wrote:
> hello.
> our hardware team seems to have almost concluded that the TLBs are the
> primary culprit. the countermeasure(s) such as parity bit, will for
> sure lead to cut down on some other feature, to balance the die size
> and all that. impacting software/kernel to some extent.
> please help me understand - why TLBs? is this the longest and
> un-checked/un-correctable path?

I agree with Mitch - there is no excuse for not having EDC/ECC on your TLBs (and nearly everything else).

But to address your question: why FITs in the TLB and not, say, in the cache? It may be that your TLBs are not being accessed often enough. Some workloads simply do not access many pages. A TLB entry may be loaded, and may then be left untouched, unrefreshed, for a long time while its bits degrade.

Especially if you have the equivalent of the G global bit - OS TLB entries may endure forever if not thrashed out.

Especially if you have separate TLBs for small and large pages (superpages) - the latter tend to endure forever.

Perhaps a periodic TLB scrub, e.g. a state machine invalidating TLB entries? You might test it by doing a global TLB invalidate in a timer interrupt. But a state machine would be better; and EDC/ECC better yet.

Or, well, there is a long history of circuit problems in the TLB, at many companies.
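
The periodic scrub idea can be sketched in software as follows. This is only a toy model under stated assumptions, not a description of any real design: the class `ScrubbedTlb` and its methods are hypothetical, the TLB is a small fully associative array, and the "state machine" is a round-robin index that invalidates one slot per tick, so no entry can sit unrefreshed for more than one full pass.

```python
class ScrubbedTlb:
    """Toy fully associative TLB with a round-robin scrubber."""

    def __init__(self, size: int):
        self.entries = [None] * size  # None marks an invalid slot
        self.scrub_index = 0          # next slot the scrubber will hit

    def fill(self, slot: int, vpn: int, ppn: int) -> None:
        """Install a translation, as a page walk would after a miss."""
        self.entries[slot] = (vpn, ppn)

    def lookup(self, vpn: int):
        """Return the ppn on a hit, or None on a miss."""
        for entry in self.entries:
            if entry is not None and entry[0] == vpn:
                return entry[1]
        return None  # a miss would trigger a fresh page walk

    def scrub_tick(self) -> None:
        """Invalidate one slot and advance; called on a periodic timer."""
        self.entries[self.scrub_index] = None
        self.scrub_index = (self.scrub_index + 1) % len(self.entries)
```

With `size` slots and one tick per timer period, the longest an entry can live is `size` periods; after that it is reloaded from the page tables on the next miss. The global invalidate in a timer interrupt mentioned above corresponds to clearing every slot in a single tick.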

From: Noob on 31 Mar 2010 05:35

Andy "Krazy" Glew wrote:
> Or, well, there is a long history of circuit problems in the TLB,
> at many companies.

e.g. recently AMD's Barcelona core.

http://en.wikipedia.org/wiki/AMD_Barcelona#TLB_Bug
http://anandtech.com/show/2477/2