From: Brian Gordon on 10 Jun 2010 13:40

Greetings,

I work in the aerospace industry and one of the considerations that occurs in aerospace is a phenomenon called Single Event Upsets (SEU). I'm not an expert on the physics behind this phenomenon, but the end result is that bits in RAM change state due to high energy particles passing through the device. This phenomenon happens more often at higher altitudes (aircraft) and is a very serious consideration for space vehicles.

When these SEU can be detected some action may be taken to improve the behaviour of the system (log a fault and reset in order to refresh things from scratch?). So the first question becomes how to detect an SEU. Flash is considered somewhat safer than RAM. When executables run in linux, do the .text and .ro sections get copied into RAM? If so, can a background task monitor the RAM copy of .text and .ro for corruption?

Tripwire seems to offer this kind of detection as a means of detecting tampering by a malicious attacker in the filesystem, but I am not convinced that it would detect modifications to copies of the ELF in RAM. As I understand it, the way linux does "on-demand" loading of executables may be a problem here. But this SEU detection capability would seem to have some applicability to intrusion detection, so I have to think some mechanism already exists.

Thank you to anyone for any pointers on where I can look to learn more about detecting SEU in linux.

legerde at gmail com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andi Kleen on 10 Jun 2010 14:30

Brian Gordon <legerde(a)gmail.com> writes:

> I work in the aerospace industry and one of the considerations
> that occurs in aerospace is a phenomenon called Single Event Upsets
> (SEU). I'm not an expert on the physics behind this phenomenon, but
> the end result is that bits in RAM change state due to high energy
> particles passing through the device. This phenomenon happens more
> often at higher altitudes (aircraft) and is a very serious
> consideration for space vehicles.

It's also a serious consideration for standard servers.

> When these SEU can be detected some action may be taken to improve
> the behaviour of the system (log a fault and reset in order to
> refresh things from scratch?). So the first question becomes how to
> detect an SEU. Flash is considered somewhat safer than RAM. When
> executables run in linux, do the .text and .ro sections get copied
> into RAM? If so, can a background task monitor the RAM copy of .text
> and .ro for corruption?

On server class systems with ECC memory, the hardware does that. The hardware stores the RAM contents using an error correcting code that can normally correct one-bit errors and detect multi-bit errors. There are various more or less sophisticated variations of this around, from simple ECC, through chipkill to handle failing DIMMs, up to various variants of full memory mirroring.

> Thank you to anyone for any pointers on where I can look to learn
> more about detecting SEU in linux.

Normally server class hardware handles this, and the kernel then reports memory errors (e.g. through mcelog or through EDAC). The hardware also stops the system before it would consume corrupted data.

Newer Linux kernels also have special code that allows recovery from this in some circumstances, or that uses predictive failure analysis with page offlining to prevent future problems. This requires suitable hardware support.

Lower end systems which are optimized for cost generally ignore the problem though, and any flipped bit in memory will result in a crash (if you're lucky) or silent data corruption (if you're unlucky).

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
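As a rough illustration of the "correct one-bit errors, detect multi-bit errors" behaviour described above, here is a minimal sketch of a SECDED code on a 4-bit value: Hamming(7,4) plus one overall parity bit. The function names `encode` and `decode` are made up for this example, and real ECC memory uses much wider codes implemented in hardware, so treat this only as a toy model of the idea.

```python
# Toy SECDED (single-error-correct, double-error-detect) code for 4 data bits.
# Hamming(7,4) plus one overall parity bit; names here are illustrative only.

def encode(nibble):
    """Return an 8-bit codeword (list of 0/1) for a 4-bit value."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers codeword positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # covers codeword positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    overall = sum(bits) & 1   # extra parity bit over the first seven bits
    return bits + [overall]


def decode(codeword):
    """Return (value, status); status is "ok", "corrected" or "uncorrectable"."""
    bits = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos           # XOR of the positions that hold a 1
    parity_bad = sum(bits) & 1        # 1 if an odd number of bits flipped
    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        if syndrome:                  # single flip at position `syndrome`
            bits[syndrome - 1] ^= 1
        status = "corrected"          # syndrome 0: only the parity bit flipped
    else:
        return None, "uncorrectable"  # two flips: detected but not fixable
    value = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return value, status
```

Flipping any one bit of `encode(n)` still decodes to `n` with status "corrected", while flipping two bits is reported as "uncorrectable" rather than silently returning wrong data.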
From: Brian Gordon on 10 Jun 2010 14:40

> It's also a serious consideration for standard servers.

Yes. Good point.

> On server class systems with ECC memory hardware does that.
> Normally server class hardware handles this and the kernel then reports
> memory errors (e.g. through mcelog or through EDAC)

Agreed. EDAC is a good and sane solution, and most companies do this. Some do not, due to naivety or cost reduction. EDAC doesn't cover processor registers, and I have fairly good solutions for dealing with that in tiny "home-grown" tasking systems.

On the more exotic end, I have also seen systems that have dual redundant processors/memories, with compare logic between the redundant processors that compares most pins each clock cycle. If any pins are not identical at a clock cycle, then something has gone wrong (SEU, hardware failure, etc.).

> Lower end systems which are optimized for cost generally ignore the
> problem though and any flipped bit in memory will result
> in a crash (if you're lucky) or silent data corruption (if you're unlucky)

Right! And this is the area that I am interested in. Some people insist on lowering the cost of the hardware without considering these issues. I want to be as diligent as possible (even in these low-cost situations) and do the best job I can in spite of the low-cost hardware.

So, some pages of RAM are going to be read-only, and the data in those pages came from some source (file system?). Can anyone describe a high-level strategy to occasionally provide some coverage of this data? So far I have thought about adding an MD5 hash to the page descriptor when a read-only page is first loaded/mapped, so that a background daemon could occasionally verify it.

Does tripwire accomplish this kind of detection by monitoring the underlying filesystem (I don't think so)?
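The hash-and-verify idea above can be sketched in user space as follows. This is only an illustration on an ordinary byte buffer: the names `page_digests` and `corrupted_pages` and the 4096-byte page size are assumptions made for the example, and it does not show how the kernel's own read-only pages would be reached, which is the open question in this thread.

```python
# Sketch: record one MD5 digest per page of a read-only image, then let a
# background check recompute the digests and report pages that changed.
# Function names and PAGE_SIZE are assumptions made for this example.
import hashlib

PAGE_SIZE = 4096


def page_digests(data, page_size=PAGE_SIZE):
    """Return a list with one MD5 digest per page-sized slice of `data`."""
    return [
        hashlib.md5(data[off:off + page_size]).digest()
        for off in range(0, len(data), page_size)
    ]


def corrupted_pages(data, baseline, page_size=PAGE_SIZE):
    """Return the indices of pages whose digest no longer matches `baseline`."""
    current = page_digests(data, page_size)
    return [i for i, (old, new) in enumerate(zip(baseline, current)) if old != new]
```

A caller would take the baseline once, when the data is first loaded, and run `corrupted_pages` periodically. A page that fails the check and is backed by a file could in principle be re-read from that file instead of forcing a full reset.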
From: Chris Friesen on 10 Jun 2010 14:40

On 06/10/2010 11:29 AM, Brian Gordon wrote:

> When these SEU can be detected some action may be taken to improve
> the behaviour of the system (log a fault and reset in order to
> refresh things from scratch?). So the first question becomes how to
> detect an SEU.

I do work in telco stuff. We use ECC RAM, turn on ECC/parity on the various buses, enable error-checking in the hardware, etc.

At higher abstraction levels you can checksum the data being stored and validate it when you access it.

Some of the errors are "soft" and can be corrected; others are "hard" and uncorrectable. If you get enough "soft" errors in a short enough time, it may be desirable to treat it as a "hard" error and reset.

> Thank you to anyone for any pointers on where I can look to learn
> more about detecting SEU in linux.

You might start by taking a look at the "edac" code in the kernel. Linux in general doesn't normally enable all the fault detection code, so you may need to start looking at datasheets.

Chris

--
The author works for GENBAND Corporation (GENBAND) who is solely responsible for this email and its contents. All enquiries regarding this email should be addressed to GENBAND. Nortel has provided the use of the nortel.com domain to GENBAND in connection with this email solely for the purpose of connectivity and Nortel Networks Inc. has no liability for the email or its contents. GENBAND's web site is http://www.genband.com
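A minimal sketch of the two points above: reading the corrected/uncorrected error counters that the EDAC driver exposes, and escalating a burst of "soft" errors to a "hard" one. It assumes the usual EDAC sysfs layout (`mcN/ce_count` and `mcN/ue_count` under `/sys/devices/system/edac/mc`), which is only present when an EDAC driver is loaded for the memory controller. The class name `SoftErrorEscalator` and its thresholds are hypothetical.

```python
# Sketch: sum EDAC error counters and treat a burst of soft errors as hard.
# EDAC_MC_ROOT assumes the standard sysfs layout; the escalation policy and
# its parameters are illustrative, not taken from any real system.
import os
from collections import deque

EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return (corrected, uncorrected) totals over all mcN directories."""
    totals = {"ce_count": 0, "ue_count": 0}
    if not os.path.isdir(root):
        return 0, 0                      # no EDAC driver loaded
    for name in sorted(os.listdir(root)):
        if not name.startswith("mc"):
            continue
        for fname in totals:
            try:
                with open(os.path.join(root, name, fname)) as f:
                    totals[fname] += int(f.read().strip())
            except (OSError, ValueError):
                continue                 # counter missing or unreadable
    return totals["ce_count"], totals["ue_count"]


class SoftErrorEscalator:
    """Report True once `max_errors` soft errors fall within `window` seconds."""

    def __init__(self, max_errors, window):
        self.max_errors = max_errors
        self.window = window
        self.events = deque()

    def record(self, now, count=1):
        """Record `count` soft errors at time `now`; True means escalate."""
        self.events.extend([now] * count)
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()        # drop errors older than the window
        return len(self.events) >= self.max_errors
```

A polling loop could call `read_edac_counts()` periodically, feed the increase in the corrected count to `record()`, and reset or fail over when it returns True or when the uncorrected count rises.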
From: Brian Gordon on 10 Jun 2010 14:50

> I do work in telco stuff. We use ECC RAM, turn on ECC/parity on the
> various buses, enable error-checking in the hardware, etc.

Excellent stuff when you have it. :)

> At higher abstraction levels you can checksum the data being stored and
> validate it when you access it.

What about the .ro and .text sections of an executable? I would think kernel support for that would be required. If it's application data, then all sorts of things are possible, like you described. I've also seen critical RAM variables stored in triplicate and then compared/voted on, just to ensure no silent SEU corruption.

> You might start by taking a look at the "edac" code in the kernel.
> Linux in general doesn't normally enable all the fault detection code,
> so you may need to start looking at datasheets.

Thank you for the suggestion. If the memory device supports EDAC/ECC, then enabling it is definitely a good strategy.
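The triplicate-and-vote technique mentioned above can be sketched as a bitwise majority vote over three copies. The names `vote` and `TripleVar` are invented for this example. In a real system the three copies would be placed in separate memory regions; a Python list only models the logic.

```python
# Sketch of triple redundancy with bitwise majority voting.
# `vote` and `TripleVar` are illustrative names, not an existing API.

def vote(a, b, c):
    """Return the bitwise majority of three integers."""
    return (a & b) | (a & c) | (b & c)


class TripleVar:
    """Hold an integer in triplicate and repair it on every read."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def write(self, value):
        self.copies = [value, value, value]

    def read(self):
        """Return (value, repaired); `repaired` is True if a copy disagreed."""
        voted = vote(*self.copies)
        repaired = any(copy != voted for copy in self.copies)
        self.copies = [voted, voted, voted]   # scrub the disagreeing copy
        return voted, repaired
```

Voting bit by bit means flips in different copies are tolerated as long as they do not hit the same bit position, and the `repaired` flag gives the caller something to log or count.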