Hi, Len,

This is used by APEI ERST and GHES. But it is a generic hardware error reporting mechanism and can be used by other hardware error reporting mechanisms such as EDAC, PCIe AER, Machine Check, etc.

The patchset is split from the original APEI patchset to make it explicit that this is a generic mechanism, not APEI-specific bits.

[PATCH 1/2] Generic hardware error reporting mechanism
[PATCH 2/2] Hardware error record persistent support

Best Regards,
Huang Ying
--
There are many hardware error detecting and reporting components in the kernel, including x86 Machine Check, PCIe AER, EDAC, APEI GHES, etc. Each one has its own error reporting implementation, including user space interface, error record format, in-kernel buffer, etc. This patch provides a generic hardware error reporting mechanism to reduce the duplicated effort and add more common services.

A highly extensible generic hardware error record data structure is defined to accommodate various hardware error information from various hardware error sources. The overall structure of an error record is as follows:

-----------------------------------------------------------------
| rcd hdr | sec 1 hdr | sec 1 data | sec 2 hdr | sec2 data | ...
-----------------------------------------------------------------

Several error sections can be incorporated into one error record to accumulate information from multiple hardware components related to one error. For example, for an error on a device on the secondary side of a PCIe bridge, it is useful to record error information from both the PCIe bridge and the PCIe device. Multiple sections can also be used to hold both the cooked and the raw error information, so that the abstract information is provided by the cooked section and no information is lost because the raw section is provided too.

There are "revision" (rev) and "length" fields in the record header and "type" and "length" fields in the section header, so the user space error daemon can skip an unrecognized error record or error section. This lets an older error daemon work with a newer kernel.

New error section types can be added to support new error types and error sources.

The hardware error reporting mechanism designed by the patch integrates well with the device model in the kernel. struct dev_herr_info is defined and pointed to by the "error" field of struct device. This is used to hold error reporting related information for each device. One sysfs directory "error" will be created for ...
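The skip-what-you-do-not-recognize behaviour described in the patch description above can be illustrated with a small user space sketch. The struct layouts, field widths, and the names herr_record_hdr, herr_section_hdr, HERR_REV_KNOWN and HERR_SEC_KNOWN are assumptions made for illustration; they are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts: field names and widths are assumptions. */
struct herr_record_hdr {
	uint16_t rev;     /* record format revision */
	uint16_t length;  /* total record length, header included */
};

struct herr_section_hdr {
	uint16_t type;    /* section type */
	uint16_t length;  /* section length, header included */
};

#define HERR_REV_KNOWN 1  /* the only revision this reader understands */
#define HERR_SEC_KNOWN 1  /* the only section type this reader understands */

/*
 * Count the sections of a known type in one record. Sections of an
 * unknown type are stepped over using their length field. Returns -1
 * if the record is malformed or has an unknown revision, in which
 * case the caller skips the whole record.
 */
int herr_count_known_sections(const uint8_t *buf, size_t buflen)
{
	struct herr_record_hdr rh;
	struct herr_section_hdr sh;
	size_t off;
	int known = 0;

	if (buflen < sizeof(rh))
		return -1;
	memcpy(&rh, buf, sizeof(rh));
	if (rh.rev != HERR_REV_KNOWN || rh.length < sizeof(rh) ||
	    rh.length > buflen)
		return -1;

	off = sizeof(rh);
	while (off + sizeof(sh) <= rh.length) {
		memcpy(&sh, buf + off, sizeof(sh));
		/* A section must at least hold its header and fit in the record. */
		if (sh.length < sizeof(sh) || sh.length > rh.length - off)
			return -1;
		if (sh.type == HERR_SEC_KNOWN)
			known++;
		off += sh.length;  /* unknown types are skipped, not rejected */
	}
	return known;
}
```

Because the reader advances by the length stored in each header, a newer kernel can append section types the reader has never seen without breaking it.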
Sorry, I forgot to Cc: Greg for the device model part.

Best Regards,
Huang Ying
--
Putting aside the fact that you've ignored all the requests from the last round...

Yes, we need a generic error reporting format. Wait a second, this error format structure looks very much like a tracepoint record to me - it has common fields and record-specific fields. And we have all that infrastructure in the kernel and yet you've decided to reimplement it

But all this APEI crap sounds suspiciously bloated - why do we need an error field for _every_ device on the system? Looks like a bunch of

So how do you say which devices should report and which shouldn't report errors, from userspace with a tool? Default actions? What if you forget

Right, so no need for a daemon but who does the read? cat? How are you going to collect all those errors? How do you enforce policies? How do you inject errors? How do you perform actions based on the error type like disabling or reconfiguring a hw device based on the errors it

Yet another bloat winner. Why do we need a memory allocator for error records? You either get a single critical error which shuts down the system and you log it to persistent storage, if possible, or you work at those uncritical errors one at a time. IOW, do you have a real-life usecase which justifies the dire need for a

And as it was said countless times already, this whole thing, if at all accepted, should go to drivers/edac/ or drivers/ras/ or whatever.

--
Regards/Gruss,
Boris.
--
You mean "struct trace_entry"? They are quite different on design. The record format in patch can incorporate multiple sections into one record, which is meaningful for hardware error reporting. And we do not need the fancy "/sys/kernel/debug/tracing/events/<xxx>/<xxx>/format", user space error daemon only consumes all error record it recognized and blindly Because every device may report hardware errors, but not every device will do it. So just a pointer is added to "struct device" and Some summary hardware error information can be put into printk. Error daemon is needed because we need not only log the the error but the predictive recovery. If you really have no daemon, cat can be used to log the error. I don't fully understand your words, you want to We can use another device file to inject error, for example /dev/error/error_inject. Just write the needed information to this file. The format can be same as the error record defined as above, These are policies and will be done in user space error daemon. For The point is lockless not the memory allocator. The lockless memory allocator is not hardware error reporting specific, it can be used by Uncritical errors can be reported in NMI handler too. So we need You think drivers/herror is not a good name? We can rename it to "drivers/ras" if that is the consensus. Best Regards, Huang Ying --
Nobody said you needed that - the tracepoint contains all the
information you need.
So why the need to enable/disable them? Why add all that code to
enable/disable them when all devices can report hw errors but not all
do it but all should do it... (I can go on forever). Do you see the
Sorry, I misread your original statement. So it is clear that we need
Same argument as above - you can do that with tracepoints without
duplicating functionality.
Wait a second, are we talking about hardware errors or memory management
here? If you want to push your lockless memory allocator, send it in to
linux-mm and let the guys there look at it, but not in conjunction with
hw errors. That's like I'm going for a run and, btw, while I'm at it, I
Why? What's wrong with using a single struct on the stack? Are you
afraid that we might blow the NMI stack although NMIs don't nest?
[.. ]
Dude, let me save you the trouble: all everybody is trying to say is
that you can achieve all that with stuff already available in the
kernel. And HW errors are not that special to need a special subsystem
grown for them - you just need to handle them properly. The only thing
you should provide is the backend to persistent storage so that error
info can be put there - everything else is bloat.
--
Regards/Gruss,
Boris.
--
Can you use a tracepoint to output error information of multiple devices in one error record? That is useful for hardware error Because some error reporting devices itself may not work properly, we The lockless memory allocator is not in this patch. I push it in another patch. The error record allocator is just a simplest wrapper The lockless memory allocator is used not to save space on stack. It is part of the lockless data structure needed by hardware error It seems that the main different opinion between us is that you want to implement hardware error reporting inside tracepoint, but I want to implement it outside tracepoint. Your point is code sharing, my point is code simplicity. Tracepoint itself is already quite complex, adding hardware error reporting makes it even more complex. In fact, hardware error reporting is quite simple. You can see, this patch is quite small. Even the code added into tracepoint to support hardware error reporting may be more complex than this patch. As for code sharing, the main part can be shared between tracepoint and hardware error reporting is lockless data structure and some NMI handling facility. But we do not need to implement hardware error reporting inside tracepoint to share that. Maybe we can refactor lockless data structure and NMI handling facility inside tracepoint into a general one, and use that in tracepoint and hardware error reporting. For example, "irq_work" can be used in hardware error reporting too. Best Regards, Huang Ying --
Normally, corrected hardware error records will go through the kernel processing and finally be logged to disk or network. But for uncorrected errors, the system may panic directly for better error containment, and disk or network is not usable in this half-working system. To avoid losing these valuable hardware error records, the error records are saved into some kind of simple persistent storage such as flash before panic, so that they can be read out after the system reboots successfully.

Different kinds of simple persistent storage implementation mechanisms are provided on different platforms, so an abstract interface for persistent storage is defined. Different implementations of the interface can be registered.

Even after a successful reboot, the error records should be guaranteed to be saved to disk or network before they are erased from the simple persistent storage. Peek and clear operations on simple persistent storage are implemented to support these transaction semantics as follows:

- Peek an error record from simple persistent storage
- Save the error record into disk or network
- Sync the disk file or get ACK from network
- Clear the error record in simple persistent storage

This patch is designed by Andi Kleen and Huang Ying.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/herror/Makefile        |    2
 drivers/herror/herr-core.c     |   39 ++++++++-
 drivers/herror/herr-internal.h |   12 ++
 drivers/herror/herr-persist.c  |  174 +++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild           |    1
 include/linux/herror.h         |   48 +++++++++++
 6 files changed, 271 insertions(+), 5 deletions(-)
 create mode 100644 drivers/herror/herr-internal.h
 create mode 100644 drivers/herror/herr-persist.c

--- a/drivers/herror/Makefile
+++ b/drivers/herror/Makefile
@@ -1 +1 @@
-obj-y += herr-core.o
+obj-y += herr-core.o herr-persist.o
--- a/drivers/herror/herr-core.c
+++ ...
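The peek / save / sync / clear sequence from the changelog above can be sketched in user space as follows. This is only an illustration of the ordering: struct pstore, struct sink, and every function name here are hypothetical in-memory stand-ins, not the interface defined in herr-persist.c.

```c
#include <stdbool.h>

#define STORE_SLOTS 4

/* Hypothetical stand-in for the simple persistent storage (e.g. flash). */
struct pstore {
	int  record[STORE_SLOTS];
	bool used[STORE_SLOTS];
};

/* Hypothetical stand-in for disk or network; it can run out of room. */
struct sink {
	int saved[STORE_SLOTS];
	int count;
	int capacity;
};

/* Peek: find the next stored record without erasing it; -1 if none. */
int pstore_peek(const struct pstore *s)
{
	for (int i = 0; i < STORE_SLOTS; i++)
		if (s->used[i])
			return i;
	return -1;
}

/* Clear: erase one record from the persistent storage. */
void pstore_clear(struct pstore *s, int slot)
{
	s->used[slot] = false;
}

/* Save and sync: returns true only once the record is durable in the sink. */
bool sink_save_and_sync(struct sink *k, int record)
{
	if (k->count >= k->capacity || k->count >= STORE_SLOTS)
		return false;
	k->saved[k->count++] = record;
	return true;
}

/*
 * Move records from persistent storage to the sink. A record is cleared
 * only after the save has been confirmed, so a failure in the middle
 * leaves the record in persistent storage for a later retry.
 */
int drain(struct pstore *s, struct sink *k)
{
	int moved = 0;
	int slot;

	while ((slot = pstore_peek(s)) >= 0) {
		if (!sink_save_and_sync(k, s->record[slot]))
			break;
		pstore_clear(s, slot);
		moved++;
	}
	return moved;
}
```

The point of splitting peek from clear is visible in drain(): if the save fails, nothing has been erased yet, so the worst case is a duplicate record after a retry rather than a lost one.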
I think this is totally the wrong thing to do. TOTALLY.
The fact is, concentrating about "hardware errors" makes this
something that I refuse to merge. It's such an idiotic approach that
it's disgusting.
Now, if this was designed to be a "hardware-backed persistent 'printk'
buffer", and was explicitly meant to save not just some special
hardware error, but catch all printk's (which may be due to hardware
errors or oopses or warnings or whatever), that would be useful.
But limiting it to just some special source of errors makes this
pointless and not ever worth merging.
Linus
--
On Fri, 19 Nov 2010 07:52:08 -0800

yep. We already have bits and pieces in place for this: kmsg_dump, ramoops, mtdoops, etc. If your hardware has a non-volatile memory then just hook it up as a backend driver for kmsg_dump.
--
On Fri, Nov 19, 2010 at 11:52 PM, Linus Torvalds

Yes. APEI ERST can be used to back a persistent 'printk', and that is in our plan too. But APEI ERST is not limited to that; it can be used for printk, hardware errors, and maybe some other users. When we designed APEI ERST support, we had multiple users in mind.

Best Regards,
Huang Ying
--
This patchset depends on version 5 of the lockless memory allocator and list patchset, as follows:

[PATCH -v5 0/3] Lockless memory allocator and list

Best Regards,
Huang Ying
--
You call it generic, does that mean the EDAC guys agree, does it work on AMD and IA64? If not, Tony could you please apply a cluebat? I thought Intel was going to sit around the table with all hardware error people and come up with a unified thing at LPC? --
I call it "generic", because it can be used by EDAC, PCIe AER, Machine Check, etc to report hardware errors, and it can work on AMD and Best Regards, Huang Ying --
1) that's not a complete answer -- I asked: do the EDAC guys agree? Have you even tried talking to them? and 2) 'can' is not the right word here, I thought we'd all agreed to talk about this and agree on some approach before littering the kernel with tiny special case interfaces.. which this will be if EDAC and others don't agree with you. So I want a firm agreement of all parties interested that this is the way forward, if not you already have my NAK. --
I have talked with Mauro during LPC. We all agree that we need a generic hardware error reporting interface. And now, I want to talk Why not talk about the code? Which it can do, which it can not do? Why not talk about the requirement? What do we need? Best Regards, Huang Ying --
talking with code could maybe be done.. but definitely not in the form of a 'pull' request of your approach. Also, I would think that if you do an RFC (the usual way to talk with code) you would list the various requirements and how you address them with your code, and ask the other parties if they agree with the approach you've taken. You've done none of that, instead you ask Len to merge your muck without even mentioning what the other parties think. --
I listed the requirements and my solutions in my patch descriptions of [1/2] and [2/2]. If you have any questions, please let me know. Best Regards, Huang Ying --
Those talk about _your_ requirements, it does not touch upon any of the concerns of the other parties, it doesn't even mention them beyond listing their existence. It also, very specifically, does not mention if you did talk to any of those parties and what their thought on the matter was, nor does it ask them about their opinion. Instead you present your work as a fait accompli, not a work in flux and subject to change. You ask for it to be merged -- this does not come across like a discussion, much less a request for co-operation. --
I heard that LKML prefers to talk with code rather than ideas. Am I wrong? Best Regards, Huang Ying --
No, it's your presentation of said code that's wrong. You're missing the [RFC] tags and open questions in your Changelogs like: EDAC could use it like so and so, does that sound acceptable Mauro? There is no mention that you actually did talk to Mauro and Tony and what the outcome was. --
The main outcome is that we need a generic hardware error reporting interface, that can be used by all hardware error reporting mechanisms. Best Regards, Huang Ying --
Yeah, no.
Really.
We don't want some specific hardware error reporting mechanism.
Hardware errors are way less common than other errors, so making
something that is special to them just isn't very interesting.
I seriously suggest that the only _sane_ way to handle hardware errors is to
(a) admit that they are rare
(b) not try to use some odd special mechanism for them
(c) just 'printk' them so that you can use the absolutely most
standard way to report them, and one that administrators are already
used to and has support for network logging with existing tools etc.
(d) and if you want to make them persistent and NMI-safe, just do
that on the _printk_ level. That way, any NMI-safeness or persistency
helps everybody.
I really see _zero_ point to some hw-error-specific model.
Linus
--
On Fri, Nov 19, 2010 at 11:56 PM, Linus Torvalds

We thought about 'printk' for hardware errors before, but it has some issues too.

1) It mixes software errors and hardware errors. When Andi Kleen maintained the Machine Check code, he found that many users reported hardware errors as software bugs to the software vendor instead of as hardware errors to the hardware vendor. Having an explicit hardware error reporting interface may help these users.

2) Hardware error reporting may flush other information out of the printk buffer. Consider the case where one pin of your ECC DIMM is broken: tons of 1-bit corrected memory errors will be reported. Although we can enforce some kind of throttling, your printk buffer may eventually fill up with hardware error reports.

3) We need some kind of user space hardware error daemon to enforce policy. For example, if the number of corrected memory errors reported on one page exceeds a threshold, we can offline the page to prevent a fatal error from occurring later, because in practice fatal errors may begin as corrected errors. printk is good for the administrator, but may not be good enough for the hardware error daemon.

But yes, printk is convenient for the administrator or end user. So we plan to printk some summary information about hardware errors too, but leave the full hardware error information to the hardware error specific interface, so that the administrator can get some clue and

Best Regards,
Huang Ying
--
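The per-page threshold policy in point 3) can be sketched as a small counter table that a user space daemon might keep. Everything here is hypothetical: the 4 KiB page size, the fixed table size, and the names ce_tracker and ce_tracker_report are assumptions for illustration, and the actual page offlining step is left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED 12  /* assumes 4 KiB pages */
#define TRACKED_PAGES      64  /* arbitrary table size for the sketch */

struct page_counter {
	uint64_t pfn;     /* page frame number */
	unsigned count;   /* corrected errors seen on this page */
	bool     in_use;
};

struct ce_tracker {
	struct page_counter pages[TRACKED_PAGES];
	unsigned threshold;  /* set by the administrator */
};

/*
 * Record one corrected error at a physical address. Returns true when
 * the page has reached the threshold, i.e. when the daemon's policy
 * would ask for the page to be taken offline.
 */
bool ce_tracker_report(struct ce_tracker *t, uint64_t phys_addr)
{
	uint64_t pfn = phys_addr >> PAGE_SHIFT_ASSUMED;
	struct page_counter *free_slot = NULL;

	for (size_t i = 0; i < TRACKED_PAGES; i++) {
		struct page_counter *p = &t->pages[i];

		if (p->in_use && p->pfn == pfn) {
			p->count++;
			return p->count >= t->threshold;
		}
		if (!p->in_use && free_slot == NULL)
			free_slot = p;
	}

	if (free_slot == NULL)
		return false;  /* table full: this error is not tracked */

	free_slot->in_use = true;
	free_slot->pfn = pfn;
	free_slot->count = 1;
	return free_slot->count >= t->threshold;
}
```

The threshold is data rather than code, which matches the split argued for in the thread: the tool supplies the counting mechanism and the administrator supplies the policy value.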
On Fri, Nov 19, 2010 at 6:04 PM, huang ying
Bah. Many machine checks _were_ software errors. They were things like
the BIOS not clearing some old pending state etc.
The confusion came not from printk, but simply from ambiguous errors.
When is a machine check hardware-related? It's not at all always
obvious.
Sometimes machine checks are from uninitialized hardware state, where
Sure. That doesn't change the fact that finding the data in your
/var/log/messages and your regular logging tools is still a lot more
useful than having some random tool that is specialized and that most
IT people won't know about. And that won't be good at doing network
reporting etc etc.
The thing is, hardware errors aren't that special. Sure, hardware
people always think so. But to anybody else, a hardware error is "just
another source of issues".
Anybody who thinks that hardware errors are special and needs a
special interface is missing that point totally.
And I really do understand why people inside Intel would miss that
point. To YOU guys the hardware errors you report are magical and
special. But that's always true. To _everybody_, the errors _they_
report are special. Like snowflakes, we're all unique. And we're all
And by "we", who do you mean exactly? The fact is, "we" covers a lot
of ground, and I don't think your statement is in the least true.
Yes, IT people want to know. When they start seeing hardware errors,
they'll start replacing the machine as soon as they can. Whether that
replacement is then "in five minutes" or "four months from now" is up
to their management, their replacement policy, and based on how
critical that machine is.
IT HAS NOTHING WHAT-SO-EVER TO DO WITH HOW OFTEN THE ERRORS HAPPEN.
And yes, Intel can do guidelines, but when you say there should be
some "enforced policy" by some tool, you're simply just wrong.
Linus
--
On Sat, Nov 20, 2010 at 10:15 AM, Linus Torvalds I think the BIOS error should be reported to hardware vendor instead Yes. Hardware errors and software errors are just two types of errors. Hardware errors are not so special. So I agree that we need to report hardware error information with printk. Which is mainly human oriented interface. We need a tool oriented interface too, to let user space error daemon to do something like counting errors for hardware components, offline/hot-remove the error components based on some Because some external cause like cosmic rays and electromagnetic interference can cause hardware errors too. We need error counting to distinguish between external caused hardware errors and real hardware errors. Usually, the hardware components reporting corrected hardware errors can work for some while. But if the corrected errors reporting rate goes high, the possibility for hardware to stop work (because of some fatal error) goes high too. The error counting can help IT people to know the urgency. And user space error daemon can help IT people to do some recovery operation automatically, for example, trigger the memory or CPU Yes. The replacement policy should be determined by IT people. My previous expression is confusing. We need to provide some mechanism in user space error daemon to help IT people to do that automatically. For example, we provide error counting for each hardware components, and let IT people set the threshold. So, do you agree that we need some tool oriented interface in addition to printk? Best Regards, Huang Ying --
If you (and the code) are absolutely certain that a particular error instance is totally due to the BIOS, then stick the words "BIOS ERROR" into the printk(). Problem solved.

And in the event that the diagnosis is wrong, the rest of us will still have the complete picture of what happened from dmesg, rather than seeing random kernel errors (from other code) happen later without knowing there was some kind of BIOS or hardware fault that triggered it.

Having them all in one place is rather useful. And you can still configure rsyslogd to _also_ send the BIOS/hardware errors to a separate destination, if that turns out to be useful.

Cheers
--
I have no objection to reporting hardware errors with printk too. But we need a user space hardware error daemon too, which needs a tool-oriented interface. Do you think printk is a good interface for a tool to extract and parse error records? I think it is mainly human oriented. Best Regards, Huang Ying --
FYI, trenn added some printk prefixes for this case a while back:

#define FW_BUG		"[Firmware Bug]: "
#define FW_WARN		"[Firmware Warn]: "
#define FW_INFO		"[Firmware Info]: "

cheers,
-Len Brown, Intel Open Source Technology Center
--
Hmm. This seems to have gotten bounced by a bad smtp setup here
locally. Sorry if you get it twice..
Linus
On Sat, Nov 20, 2010 at 8:04 AM, Linus Torvalds
--
On Sun, Nov 21, 2010 at 7:57 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: Yes. They can. But people like tools. For example I can calculate, but We just provide the mechanism in the automated program, let MIS person fill in the policy. They can setup the automated program in print server just email them if error exceed threshold, and setup the Quake server to hot-remove the error DIMM if error exceed threshold. Some server machine can do more than just replace the whole machine. Some hardware components like DIMM, CPU, etc can be hot-removed, these can be done by tool instead of human. We can trigger these operations automatically in a more timely way if we have a automated tools. After error exceed threshold, administrator may need several hours to notice it, but the automated tools can trigger it almost immediately. And the user space tool can help us to identify the error hardware components too. For example, there is no common way to identify which DIMM goes error from the physical address reported by hardware. Sometimes some very tricky method is used, EDAC people use a motherboard specific table to map to the DIMM slot. On some machine, SMBIOS table can be used, but on some other machine, SMBIOS table is just crap. I think it is not good to do all these dirty and maybe I don't want to hide the information from the MIS people with the tool. I want to show the information to MIS people in a better way. For example, we can email MIS people under some situation. And we can implement a SNMP agent inside the tool, so that the MIS people can monitor the hardware status remotely. This can be integrated with the There is a "[Hardware Error]: " prefix for printk in kernel. We can Perl scripts are just another kind of user space tools for hardware errors. We just want to write a better tool for them with the help of a tool oriented error reporting interface. Best Regards, Huang Ying --
On Sat, Nov 20, 2010 at 4:42 PM, huang ying
You really don't understand, do you?
People won't even _know_ about your tool. It's too f*cking
specialized. They'll have come from other Unixes, they'll have come
from older Linux versions, they don't know, they don't care.
They _do_ know about system logs.
The most common kind of "system admin" is the random end-user. Now,
admittedly Intel seems to have its head up its arse on the whole
"regular people care about ECC and random memory corruption", and it
may be that consumer chips simply won't support the whole magic error
handling code, but the point remains: we don't want yet another
obscure error reporting tool that almost nobody knows about.
Especially for errors that are so rare that you'll never notice if you
are missing them.
Linus
--
On Sun, Nov 21, 2010 at 8:50 AM, Linus Torvalds I mean the tool can cook the raw error information from kernel and report it in a better way. Yes. You are right that the user space error daemon is not popular now. But every tool has its beginning, isn't it? I know it is impossible for this tool becomes popular in desktop users because hardware error is really rare for them. But it may become popular for server farm administrators, to them hardware I have no objection to report hardware errors in system logs too. So these people can get the information too. I just want to add another tool oriented interface too. So that some other users (like cluster For desktop users, that is true. But for cluster administrator, the hardware errors are really common. Some engineer of local search engine vendor told me that they have broken DIMM everyday. Best Regards, Huang Ying --
So, use the standard interface for the tool: syslog. Have this obscure new tool simply parse the log messages, and then send/save the data whatever way you like. No new specialized kernel API required. --
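As a rough illustration of the syslog-parsing approach suggested above, a tool could pick out lines carrying the "[Hardware Error]: " printk prefix mentioned elsewhere in this thread. The function name hw_error_payload and the surrounding log line layout are assumptions for this sketch; real syslog lines vary with the logger configuration.

```c
#include <stddef.h>
#include <string.h>

/* Prefix as quoted in this thread; treated here as a plain string. */
#define HW_ERR_PREFIX "[Hardware Error]: "

/*
 * If the log line contains the hardware error prefix, return a pointer
 * to the message text that follows it; otherwise return NULL.
 */
const char *hw_error_payload(const char *line)
{
	const char *p = strstr(line, HW_ERR_PREFIX);

	return p ? p + strlen(HW_ERR_PREFIX) : NULL;
}
```

This only separates hardware error lines from the rest of the log; it does nothing about the interleaving of messages from different CPUs, which is one of the objections raised against printk as a tool interface in this thread.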
Although it may be possible to extract some information from syslog and parse it in a fault tolerant way, we can only use that human oriented Adding a new device file will be seen as a new kernel API? Best Regards, Huang Ying --
No, that sounds like the *NIX programming philosophy. You may have already noticed that most *NIX tools store and manage data in _text_ form. That makes it easy to understand, easy to parse/process, and generally better in almost every respect. Other platforms (GNOME, MS-Windows) prefer a binary format that requires special tools to view/access. Ugh. Cheers --
> ... most *NIX tools store and manage data in _text_ form. If the hardware error dump is complicated there is a trade-off between making things human readable and putting a lot of comlicated parsing code into the kernel. Maybe the kernel should just dump hex "text" in some cases and let a user-program parse the syslog? What do do if the hardware error log is very large? Is there a limit on how much is practical to send through syslog? thanks, -Len Brown, Intel Open Source Technology Center --
If you look at what sysadmins do with the Unix logs, you'll see that they use one of the following approaches:

1) have something looking at syslog (and/or serial console logs), and storing them for their analysis, in text format;

2) convert syslog errors into SNMP object UIDs, a machine-readable code, in order to manage them via some SNMP management system.

In both cases, the approach has been there for a long time. If an error "magic" code is added, both ways will break, as sysadmins won't be able to understand the meaning of the magic number, and the SNMP conversion tools won't be ready to convert that magic code into something else. Of course, with time, the SNMP parsers will eventually add the needed decoders for the magic numbers, in order to convert them into a MIB representation.

So, even being a number, such code is not machine readable (at least not for the right tools), as it is not an SNMP object, so the management systems won't catch it without a parser.

So, IMO, the better option is to keep providing a text message. We might think of adding a way to directly output an SNMP UID from the kernel, but this seems overkill to me, and anything else would just be meaningless for most sysadmins.

Thanks,
Mauro.
--
I have no objection to a text form interface. I said printk is not a tool oriented interface not because it is a text form interface but because of some other issues. For example, messages from different CPUs/contexts may be interleaved; all kinds of information are mixed together, without overall Linux kernel uses binary format interfaces too. Best Regards, Huang Ying --
Reading the following google paper on memory errors: http://www.google.com/research/pubs/pub35162.html I suppose they weren't really reporting memory errors with printk. Because of this: "The scale of the system and the data being collected make the analysis non-trivial. Each one of many ten-thousands of machines in the fleet logs every ten minutes hundreds of parameters, adding up to many TBytes." This would add up to gigabytes of generated data, for each machine, in some minutes. It seems to me that printk isn't really suited to report large amounts of raw data. --
