Hey all-
I've been working on an issue lately involving multi socket x86_64
systems connected via hypertransport bridges. It appears that some systems,
disable the hypertransport connections during a kdump operation when all but the
crashing processor gets halted in machine_crash_shutdown. This becomes a
problem when the ioapic attempts to route interrupts to the only remaining
processor. Even though the active processor is targeted for interrupt
reception, the fact that the hypertransport connections are inactive result in
interrupts not getting delivered. The effective result is that timer interrupts
are not delivered to the running cpu, and the system hangs on reboot into the
kdump kernel during calibrate_delay. I've found that I've been able to avoid
this hang, by forcing a transition to the bios defined boot cpu during the
crashing kernel shutdown. This patch accomplished that. Tested by myself and
the origional reporter with successful results.
Regards,
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
arch/x86/kernel/crash.c | 46 ++++++++++++++++++++++++++++++++++++++--------
include/linux/kexec.h | 3 +++
init/main.c | 6 ++++++
kernel/kexec.c | 8 ++++++++
4 files changed, 55 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8bb482f..0682e60 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -67,13 +67,36 @@ static int crash_nmi_callback(struct notifier_block *self,
}
#endif
crash_save_cpu(regs, cpu);
- disable_local_APIC();
- atomic_dec(&waiting_for_crash_ipi);
- /* Assume hlt works */
- halt();
- for (;;)
- cpu_relax();
-
+ if (smp_processor_id() == kexec_boot_cpu) {
+ /*
+ * This is the boot cpu. We need to:
+ * 1) Wait for the other processors to halt
+ * 2) clear our nmi interrupt
+ * 3) launch the new kernel
+ */
+ unsigned long msecs = 1000;
+ while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) ...That would be hard to believe. Linux cannot shut down CPUs completely (put them back to SINIT state) because there is no generic interface for this and it might be impossible. They just get put into a HLT loop. And even if they were in SINIT they would still need the HT connections otherwise it would be impossible to ever wake them up. The way HT setup is normally done is that the BIOS sets it all up and then it never changes (except for some error conditions that While that may help I doubt your analysis of the source of the problem is correct. Most likely something is broken with the IO-APIC, but not related to HyperTransport. It would be better to properly root cause before applying. -Andi -
If forget off the top of my head. But my memory is that if you send the proper IPI cpus go back into SINIT. However linux doesn't do this Agreed. Eric -
If you can get to calibrate_delay hypertransport is still routing traffic. Your diagnosis of the problem is wrong. Most likely it is just an ioapic programming error in restoring the system to PIC mode. I agree that there is a problem. The reliable fix is to totally skip the PIC interrupt mode and go directly to apic mode. To make the code kexec on panic code path reliable we need to remove code not add it. Frankly I think switching cpus is one of the least reliable things that we can do in general. Eric -
What makes you say this? I don't see any need for interrupts prior to I understand the sentiment here, but its not like we're adding additional functionality with this patch. We're already sending an IPI to all the processors to halt them, we're just adding logic here so that we can detect the boot cpu and use it to jump to the kexec image instead of halting. I don't think this is any less reliable that what we have currently. Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Yes. calibrate_delay() is the first place we send interrupts over hypertransport. However I/O still works. Thus hypertransport from the first cpu is working, and hypertransport itself is working. This is an interrupt specific problem not some generic hypertransport And we don't care if they halt. If they don't get the IPI we timeout. Making the IPI mandatory is a _singificant_ change. The only reason that code is on the kexec on panic code path is that It doesn't make things more reliable, and it adds code to a code path that already has to much code to be solid reliable (thus your problem). Putting the system back in PIC legacy mode on the kexec on panic path was supposed to be a short term hack until we could remove the need by always deliver interrupts in apic mode. If you can't root cause your problem and figure out how the apics are misconfigured for legacy mode let's remove the need for going into to legacy PIC mode and do what we should be able to do reliably. The reward is much higher, as we kill all possibility of restoring PIC mode wrong because we don't need to bother. Eric -
Is it possible that the hypertansport bus can be in a state where I/O would work, but not interrupt routing? I confess my knoweldge of this system bus is But how likely is a kdump kernel to work properly if an errant cpu is running unhalted while we try to boot? I understand your point regarding the significance of the need for reliable IPI's, but in fairness, I think that we I understand your suggestion, but to do that don't we need to do more than just not move the apic to legacy pic mode? It was my understanding that the ioapic delivered timer interrupts to one cpu, who's interrupt handler then distributed it to the other cpu's via IPI. That suggests to me that we will need to re-write the apic config so that the crashing processor is the target of the ioapic interrupt delivery. And if this is truly the case, I would really like to furhter understand why this isn't working on this specific system before I implment anything. Any suggestions for how to further root cause this problem? Regards -
Probably legacy mode always routes to CPU #0. Makes sense and is not really a misconfiguration of legacy mode. But if CPU #0 has interrupts disabled no interrupts get delivered. So choices are: - Move to CPU #0 - Do not use legacy mode during shutdown. - Or do not rely on interrupts after enabling legacy mode - Or do not disable interrupts on the other CPUs when they're halted. First and last option are probably unreliable for the kdump case. Second or third sound best. I suspect the real fix would be to enable IOAPIC mode really early and never use the timers in legacy mode. Then the kdump kernel wouldn't care about the legacy mode pointing to the wrong CPU. IIrc Eric even had a patch for that a long time ago, but it broke some things so it wasn't included. But perhaps it should be revisited. -Andi -
Not sure if this is applicable, but I assume not relying on interrupts in legacy mode would be equivalent to specifying irqpoll on the kdump kernel command line? If so, there seems to be a problem with that solution, as doing so still results in the same hang on the system in question. As for solution 2, that brings me to my previous question. Is that really as simple as just not moving the apic to legacy mode? It would seem some additional programming would be in order to route the interrupt in question to the proper cpu. Regards -
So, it sounds to me then, like unless I'm willing to really re-write the APIC setup code (which I don't feel qualified to do quite yet), that the immediate solution would be to not rely on interrupts in legacy mode, which was according to my understanding, what the use of the irqpoll command line option was supposed to enable. Any thoughts as to why that might not be working in this case, or suggested tests to determine a cause there? Thanks & Regards -
irqpoll won't fix the timer interrupt. -Andi -
Hmm. It looks like irqpoll expects to have at least one irq working. I wonder if we can fix that. Looking at it from the other direction what does this line from check_timer() look like when you boot a normal kernel? apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n", cfg->vector, apic1, pin1, apic2, pin2); I'm curious what values we see at boot for the legacy mode the first time it is setup, and possibly what chipset you are on. I know we have had a few times where we have failed to reprogram the ioapic properly at shutdown. So it can't hurt to look at that possibility one last time. For whatever it is worth my original attempt at using only the apic mode was commit: f2b36db692b7ff6972320ad9839ae656a3b0ee3e Looks like I didn't have an x86_64 version. It looks like about half the cleanups I needed was to decouple smp cpu startup and apic initialization. I think the hotplug cpu case has done a more thorough version of that now. There was a bit of reshuffling needed to get the everything initialized in the right order when we started apic mode sooner. Anyway I might just take another look at this if I can find enough moments to string together. Eric -
I'm sorry to reply to myself, but I think I actually had this backwards. The fact that the 8259 does not get reported as being routed through the ioapic, means that in disable_IO_APIC during a crash shutdown, we do _not_ go into virtual wire mode. Instead we simply disable the LVT0 on the ioapic (which as I understand it from above is the timer interrupt), and we trust the 8259 to route the interrupt to the appropriate cpu. If the 8259 is hardwired to deliver interrupts to cpu0 only, then we do in fact need to switch processors during shutdown. Perhaps what we need to do is detect if the 8259 is not routed through the ioapic on shutdown, and if it is, only them implement my patch to jump to the bsp. Thoughts? -- /*************************************************** *Neil Horman *nhorman@tuxdriver.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Close. There are two options with virtual wire mode. - Either the local apic is in virtual wire mode, and somehow the legacy interrupts make it to the local cpu. - Or an ioapic is in virtual wire mode and the legacy interrupt controller is connected to it. So I guess fundamentally for any SMP system that only supports the cpu being in local apic mode and only routes interrupts to the boot strap processor we could be in trouble. That is what our current information about your system suggests. However most systems actually connect the i8254 PIC interrupt controller to the ioapic in virtual wire mode. As I recall the standard mapping is to ioapic 0, pin 0. With ioapic 0, pin 2 being the timer interrupt (Possibly it is the other way around). So as a test we could feed those values into ioapic_8259 and see if the kdump case works. I believe we prefer putting the ioapic into virtual wire mode over putting the cpu into virtual wire mode. We can only control which cpu receives the legacy interrupts if we are putting the ioapic in virtual wire mode. It may also be an interesting test to just enable the timer for the ioapic in early boot, as you have suggested. I don't have a clue what that will do. Eric -
I assume this is the case if the ioapic is also in virtual wire mode and the destination field for the appropriate interrupt(s) (the timer interrupt in this case) is set to either physical mode with a destination id of the lapic for the running cpu, or if it is set to logical mode and the destination id has the I thought we only had one ioapic in this system (Ben correct me if I'm wrong on that please). I thought the above printk told us that, because apic2 and pin2 are both -1, that means that the 8259 isn't physically connected to any cpu, and If that were the case, then we would need to support moving kexec boot to cpu0, at least in some limited cases. I've got a patch together that enables the handshaking I was brainstorming earlier, which should allow an attempted jump to cpu0 on a crash, with a fallback to booting on the crashing processor. If we Sorry, to split hairs here, but you mean the 8259 right? Just want to be sure I'm sorry, I can't find ioapic_8259 defined anywhere. Where is that supposed to Unfortunately nothing. We've tried using the local apic timer in a previous test, and it resulted in no change, as did transitioning the cpu to the apic timer via a call to switch_ipi_to_APIC_timer. Its possible I did something wrong however. Currently I'm writing a patch that calls setup_ioapic_dest after we call disable_IO_APIC. Looking at the implementation, it appears that calling this function should rewrite the irq routing table in the ioapic to deliver interrupts to the set of online cpus, as defined by the TARGET_CPUS macro. I asusme that if the ioapic is in virtual wire mode from the call to disablie_IO_APIC, then calling setup_ioapic_dest will force interrupts to be delivered to the crashing cpu, as it should be the only bit set in the online cpu mask. Please feel free to poke holes in this idea. Thanks & Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg ...
Grr. Yes. The i8259. Got the timer and the PIC numbers confused Oops. The legacy configuration is the timer to the PIC to either the local Grr. That should have been ioapic_i8259. The second value we print out Well if you didn't have the local apic enabled beyond virtual wire mode that could have cause problems. Sure. I suspect the probably is that you were still in a legacy irq Just try and make certain: ioapic_i8259.pin != -1 Which should cause disable_IO_APIC to put the ioapic and not the local apic in virtual wire mode. Anything else is likely to do strange things. Eric -
That is enough for an initial approximation. Unless something has changed radically the Nvidia chipsets can put the ioapic instead of the local apic in virtual wire mode so that is worth testing. Eric -
Possible. So far I have not seen a hardware setup that would force
interrupts to cpu #0 in legacy mode. But I would not be truly
surprised if it happened that there was hardware that only worked that
(Do not use legacy mode in the kdump kernel. removing it from shutdown
Exactly. If we can work out the details that should be a much more reliable
My real problem was the failure case was obscure (a bad interaction
with ACPI on Linus's laptop) and I didn't have the time to track it
down when it showed up.
My patch had two parts. Some cleanups to enable the code to be enabled
early, and the actually early enable. I figure if we can get the
cleanups in one major kernel version and then in the next enable
the apic mode before we start getting interrupts we should be in good
shape.
I expect with x86 becoming an embedded platform with multiple cpus we
may start seeing systems that don't actually support legacy PIC mode
for interrupt delivery.
Eric
-
do you have a pointer to the old patch set? I'd like to try it out on the failing system here. Regards -
The BIOS and kernel writer's guide for Opteron explicitly states that the platform will boot on CPU0 kind of by definition. So this seems like a fair statement. I can easily see BIOS writers or hardware designers interpreting that to mean that they only have to make sure that interrupts get to the CPU that the BIOS thinks of as CPU0 when the APIC is in legacy mode. I have a query out to some SuperMicro engineers to find out if it is a hardware limitation or if it is APIC misconfiguration. Maybe we can I can agree with the fourth option being a very bad one but I really haven't seen anything in this discussion which supports the assertion that "Move to the CPU that the BIOS originally called CPU#0" is going to be unreliable. Admittedly we haven't tried this on every single x86_64 platform that we have but on the handful that we have tried so far, it hasn't been a problem. Why is everybody jumping to the assumption that -- -ben -=- -
Ben I tend to agree. I think re-enabling the APIC early in the boot process provides a greater degree of reliability in that it more quickly restores the system to a state where booting on a cpu other than cpu0 will be more likely to work, but I have to say that overall it seems like booting a secondary kernel on cpu0, when possible offers the highest degree of reliability. Perhaps what we need is a 'both solution'. Re-enabling the apic to full smp functionality early in the boot process is a good solution for the problems which we are hypothesizing here, and would be a good thing to do in general, but it doesn't preclude also attmpting to switch back to cpu0 during a crash. Perhaps it would be worthwhile to: 1) Investigate the early enablement of the ioapic for x86[_64] 2) implement my prevoiusly proposed patch with the addition of a handshake element, such that: a) when the boot cpu gets the ipi from machine_crash_shutdown it flags the fact that it is going to boot the kexec kernel with a global variable b) the crashing cpu loops waiting for either: I) a timeout of 1 second II) a reduction of the halt count to zero III) the setting of the flag mentioned in (a) c) the crashing cpu, if it sees that it is not the boot cpu AND that the flag in (III) is set, will halt itself, otherwise it will set the flag and boot the kexec image itself. With this modification, we can try to relocate to cpu0, and if we fail, we fall back to booting on the crashing processor. I'll work up a patch that implements (2), unless there are strong objections. I see no reason why we can't implment this 'both' solution. Regards Neil -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
On Tue, Nov 27, 2007 at 02:42:20PM -0500, Neil Horman wrote: Hi Neil, If we implement first solution, we don't have to implement second. Problem will automatically be solved. In general adding more code in crash shutdown path is not good. We are trying to make that code path as small as possible. OTOH, I think we have not root caused this problem yet. We don't know yet why interrupts are not coming to non-boot cpu. I think we can go little deeper to compare the system state in normal boot and kdump boot and see what has changed. System state would include, LAPIC and IOAPIC entries etc. Are we putting the system back in PIC mode or virtual wire mode? I have not seen systems which support PIC mode. All latest systems seems to be having virtual wire mode. I think in case of PIC mode, interrupts can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC to deliver interrupt to any of the cpus and that's what we have been relying on until and unless there is something board specific. Thanks Vivek -
Agreed, assuming: 1) The problems we have been hypothesising are accurate. As you note below, Ben and I have dug deep to find the problem, but we could try to go deeper. 2) There are no other issues with (re)booting a system on the non-boot cpu. It seems to me that if its possible to reboot on cpu0, we should try. I understand that we're trying to keep that code small for obvious reasons, but if we have a This is actually a very interesting question. Looking at disable_IO_APIC in the latest git tree, which is used to revert the APIC to a legacy mode, we do one of two things. If the on board 8259 is routed through the IOAPIC, then we configure the APIC to be in virtual wire mode, so that the interrupt is delivered via the APIC to whatever processor you want to configure. If however, the 8259 bypasses the IOAPIC, then we simply disable the LAPICS LVT0 interrupt, so that any legacy timer interrupts from the apic are ignored, ostesibly because the 8259 will assert the interrupt pin on the processor it is wired to directly. I wonder if most (almost all) modern systems use the former configuration, while the supermicro board in question in a rare exception, uses the latter. If the 8259 was routed directly to cpu0, that would explain this hang. Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Yes it's probably virtual wire. For real PIC mode we would need really The code doesn't try to program anything specific, it just restores the state that was left over originally by the BIOS. -Andi -
So if the BIOS originally left the IOAPIC in a state where the timer interrupts were only going to CPU0 then by restoring that state we could be bringing this problem upon ourselves when we restore that state. Would anyone have any problems with code that simply verified that the state which we are restoring allowed interrupts to get to the processor that we are currently crashing on and if not, poked in a reasonable value. Yes this would add some complexity to the code paths where we were crashing but it could prevent the problem that we are seeing. It seems like a small fairly safe change rather than a big disruptive change like -- -ben -=- -
But longer (or not so long) term moving the IOAPIC earlier is the better option, simply because the short use of PIC mode traditionally was a source of problems on a lot of boxes. And it does not really make sense to keep this source of trouble just for a short time during boot when we could as well go directly into IO-APIC mode. This would probably also match what other OS are doing better and that is always a good idea for stable operation. -Andi -
Hi Ben,
Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.
entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));
Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.
I think we need to go deeper and compare the state of system (APICS,
timer etc) during normal boot and kdump boot and see where is the
difference. This is how I solved some of the timer interrupt related
issues in the past.
Thanks
Vivek
-
Yes, but according to Bens last debug effort, the APIC printout regarding the timer setup, indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not routed through the ioapic. In those cases, disable_IO_APIC does not take us through the path you reference above, and does not revert to virtual wire mode. Instead, it simply disables legacy vector 0, which if I understand this correctly, simply tells the ioapic to not handle timer interrupts, trusting that the 8259 in the system will deliver that interrupt where it needs to be. If the 8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem that we have, do you? Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Ok. Got it. So in this case we route the interrupts directly through LAPIC and put LVT0 in ExtInt mode and IOAPIC is bypassed. I am looking at Intel Multiprocessor specification v1.4 and as per figure 3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is connected to LINTIN0 pin on all processors. If that is the case, even in this mode, all the CPU should see the timer interrupts (which is coming from 8259)? Can you print the LAPIC registers (print_local_APIC) during normal boot and during kdump boot and paste here? Thanks Vivek -
Here are the ones from a normal bootup. I was unable to get info from a kdump boot. I haven't figured out why yet. With the same patch that I used to capture this, when I tried to kdump the kernel, it paused a second or two after the backtrace and then dropped to BIOS and came up normally. Here is a little trick, at the point where we are trying to get the info to print out, the kernel command line hasn't been completely parsed yet. That tricked me for part of the day. I had apic=debug on the command line but the logic in print_local_APIC saw the default value because the kernel command line had yet to be parsed. 2007-11-29 17:58:07 ***Here is the info you requested 2007-11-29 17:58:07 2007-11-29 17:58:07 printing local APIC contents on CPU#0/0: 2007-11-29 17:58:07 ... APIC ID: 00000000 (0) 2007-11-29 17:58:07 ... APIC VERSION: 80050010 2007-11-29 17:58:07 ... APIC TASKPRI: 00000000 (00) 2007-11-29 17:58:07 ... APIC ARBPRI: 00000000 (00) 2007-11-29 17:58:07 ... APIC PROCPRI: 00000000 2007-11-29 17:58:07 ... APIC EOI: 00000000 2007-11-29 17:58:07 ... APIC RRR: 00000002 2007-11-29 17:58:07 ... APIC LDR: 00000000 2007-11-29 17:58:07 ... APIC DFR: ffffffff 2007-11-29 17:58:07 ... APIC SPIV: 0000010f 2007-11-29 17:58:07 ... APIC ISR field: 2007-11-29 17:58:07 ... APIC TMR field: 2007-11-29 17:58:07 ... APIC IRR field: 2007-11-29 17:58:07 ... APIC ESR: 00000000 2007-11-29 17:58:07 ... APIC ICR: 00004630 2007-11-29 17:58:07 ... APIC ICR2: 07000000 2007-11-29 17:58:07 ... APIC LVTT: 00010000 2007-11-29 17:58:07 ... APIC LVTPC: 00010000 2007-11-29 17:58:07 ... APIC LVT0: 00000700 2007-11-29 17:58:07 ... APIC LVT1: 00000400 2007-11-29 17:58:07 ... APIC LVTERR: 0001000f 2007-11-29 17:58:07 ... APIC TMICT: 80000000 2007-11-29 17:58:07 ... APIC TMCCT: 00000000 2007-11-29 17:58:07 ... APIC TDCR: 00000000 2007-11-29 17:58:07 2007-11-29 17:58:07 number of MP IRQ sources: 15. 2007-11-29 17:58:07 number of IO-APIC #8 registers: 0. 2007-11-29 17:58:07 number of IO-APIC ...
Ok so here boot cpu LVT0 has been set to deliver any interrupt on pin LINT0 as ExtInt. We do the same thing for kdump cpu local apic too. I am not sure who is the guy in system who encodes the interrupt message from 8259 to be put on hypertransport and on what basis do they decide whether it should go to only cpu0 or a broadcast one. Looks like in this case it is going to cpu0 only. That means we are left with no choice but to work on patch to initialize IOAPIC early. Thanks Vivek -
Thats what I'm doing at the moment. I'm working on a RHEL5 patch at the moment (since thats whats on the production system thats failing), and will forward port it once its working And not to split hairs, but techically thats not our _only_ choice. We could force kdump boots on cpu0 as well ;) Thanks -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Sorry to have been quiet on this issue for a few days. Interesting news to report, though. So I was working on a patch to do early apic enabling on x86_64, and had something working for the old 2.6.18 kernel that we were origionally testing on. Unfortunately while it worked on 2.6.18 it failed miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the timer interrupt wasn't getting received (even though we could successfully run calibrate_delay). Vivek and I were digging into this, when I ran accross the description of the hypertransport configuration register in the opteron specification. It contains a bit that, suprise, configures the ht bus to either unicast interrupts delivered accross the ht bus to a single cpu, or to broadcast it to all cpus. Since it seemed more likely that the 8259 in the nvidia southbridge was transporting legacy mode interrupts over the ht bus than directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk for nvidia chipsets, which scanned for hypertransport controllers, and ensured that that broadcast bit was set. Test results indicate that this solves the problem, and kdump kernels boot just fine on the affected system. I understand that enabling the apic early in the boot process is a nice general solution, but given the wide range of apic configurations possible, and the need for large amounts of testing of such a change, this approach seems like it might be the better way to go, as it corrects a bad behavior only in systems that would be affected by this problem. It introduces no further complexity in the kdump shutdown path, and creates no additional instability in systems that would otherwise be unaffected by this bug. I think this is the best way for us to go forward. Attached patch applies cleanly against 2.6.24-rc3-mm2 and works for me on serveral systems unaffected by the kdump crash problem I origionally reported and fixes the bug on the affected system. Thanks & ...
Hi Neil, Should we disable this broadcasting feature once we are through? Otherwise in normal systems it might mean extra traffic on hypertransport. There is no need for every interrupt to be broadcasted in normal systems? Thanks Vivek --
No, I don't think thats necessecary. Once the apics are enabled, interrupts shouldn't travel accross the hypertransport bus anyway, opting instead to use the dedicated apic bus (at least thats my understanding). The only systems what you are suggesting would help with are systems that have no apic at all, which I can only imagine on 64 bit systems is rare, to say the least. The affected domain is further reduced by the fact that this quirk is only currently being applied to systems with nvidia PCI bridges, since those are the only systems that this problem has manifested on. That seems like a rather small subset, if it exists at all. I suppose we could only optionally enable the quirk if we are booting a kdump kernel (implying that we would need to do something like detect the reset_devices command line option), but I think given the limited affect this patch, its not really needed. Regards Neil -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
I think all interrupt message travel on hypertransport. Even after APICS have been enabled. Look at the following document. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24674.pdf Have a look at figure 1, figure 2 and section 3.4.2.2 and 3.4.2.3 That's a different thing that once IOAPIC has formed the vectored message, Hypertransport might not touch the destination field. Having said that, I am wondering what will happen if a system continues to operate the timer through IOAPIC in ExtInt mode. Will hypertransport keep on broadcasting that interrupt to every cpu? And every cpu will process that interrupt. Hence, I feel it is safe to restore the broadcast bit back to BIOS value once we are through calibrate_delay(). Thanks Vivek --
I don't think so. IIRC once the other cpus are started they all disable the timer interrupt, except for one cpu, opting instead to get the timer tick via ipi, So while they all might see the interrupt packet on the ht bus, only one I disagree. Looking at what Yinghai said, the default setting for the broadcast bit isn't actually to unicast the interrupt, its just to set the broadcast mask to 0xF, or to 0xFF. Its use is actually to allow cpus with an extended 8 bit apic id see interrupts. So its not so much to direct interrupts to cpu0, but rather to the first 16 cpus rather than to all 255 available cpus. From what I've seen in my testing, systems that 'work' already have this bit set by bios, and my quirk patch above does nothing to them. Disabling this bit after calibrate_dealy is going to introduce more uncertainty in systems that have been proven to work. We should leave well enough alone, and just enable the bit if its off, and we see that we are using extended apic ids via bit 18 of the same register, as Yinghai pointed out. By enabling the quirk that way, all we are really doing is bringing into alignment two bits that should arguably be set/cleared in unison anyway. Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
Does LAPIC allow to disable a specific vector and not accept interrupts? I don't think so. If a timer interrupt is broadcasted to every cpu I think everybody will accept it (like broadcast IPI). That's why intelligence is built into IOAPIC and direct interrupts to a cpu or group of cpu. I am just trying to understand the functionality better. Can somebody help me understand how do we make sure that same timer interrupt is not processed by Again for my understanding, I got few questions. - Why does nvidia choose not to broadcast the interrupts and still works fine? Does that mean nvidia chipse will not work the extended cpu apic ids? - IOW, why do other chipsets choose to broadcast the interrupts and nvidia chooses not to and still works well. - Why do I need to broadcast the interrupts and not target specific cpus? - If I am broadcasting interrupts, how do I make sure only one cpu picks it up. Thanks Vivek --
I understand your desire, but clearly, something prevents it. Note our earlier conversation, this bit doesn't actually force a unicast of an interrupt packet, but simply masks the destination field. When set to zero, it simply means that the ht interrupt packet destination field is restricted to 4 bits rather than 8. So its not like when its set to zero we are guaranteed that it is forced to a single processor anyway. All setting this bit does is ensure that if any apics out on a system are addresed using an extended apic id, that interrupts can reach them. Thats why it was suggested that this bit only be forcibly set if It doesn't! When booting normally getting interrupts to apics that use 4 bit apic ids is sufficient since cpu0 is in that set, but if we crash on a cpu with It doesnt! It hangs in the kdump kernel. It works well normally because Its not a forced broadcast, its a mask on the apic id. The IOAPIC still The IOAPIC handles that. Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
Yes. The local allows us to choose to accept ExtInt interrupts or not. We can't do it per vector but we can do by interrupt delivery path. External Interrupt. External NMI. We should only have a single cpus local apic configured to accept the interrupt. Further it would not surprise me if there was some first come first served logic with accepting the interrupt in there Yes. This sounds like a BIOS bug at present. A chipset feature This is a broadcast for legacy mode interrupts which don't have a cpu destination field. Once we are running the timer through the ioapic Well the local apics. If you are in ioapic mode and give the cpus a choice of which cpu to deliver to it is a hardware anycast irq transmission. That is the hardware magically picks one cpu of the set of allow cpus to deliver the irq to. How that happens is implementation specific. Eric --
My feel is that if it is for legacy interrupts only it should not be a problem. Let's investigate and see if we can unconditionally enable this quirk for all opteron systems. Eric --
Copy that. Thus far, I've tested it on a pure AMD engineering sample, an intel x86_64 box, and the affected system, a quad socket dual core AMD system with an nvidia chipset. That last system is currently the only system that this patch will check/enable the broadcast flag on. I did try a variant of the patch on the AMD engineering sample where it enabled the bit unconditionally. Interestingly enough, the bit was already turned on on that system. I'm wondering if most systems don't already have this bit turned on. You should be able to universally enable this bit, by moving the call to check_hypertransport_config to the top of early_quirks() Regards Neil -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
i checked that bit http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht... static void enable_apic_ext_id(u8 node) { #if ENABLE_APIC_EXT_ID==1 #warning "FIXME Is the right place to enable apic ext id here?" u32 val; val = pci_read_config32(NODE_HT(node), 0x68); val |= (HTTC_APIC_EXT_SPUR | HTTC_APIC_EXT_ID | HTTC_APIC_EXT_BRD_CST); pci_write_config32(NODE_HT(node), 0x68, val); #endif } that bit only be should be set when apic id is lifted and cpu apid is using 8 bits and that mean broadcast is 0xff instead 0x0f. for example 8 socket dual core system or 4 socket quad core system,that you should make BSP start from 0x04, so cpus apic id will be [0x04, 0x13) So if you want to enable that in early_quirk, you need to make sure apic id is using 8 bits by check if the bit 16 (HTTC_APIC_ID) is set. most BIOS already did that. You may ask Supermicro fix their broken BIOS instead. YH --
it should be bit 18 (HTTC_APIC_EXT_ID) YH --
this seems reasonable, I can reroll the patch for this. As I think about it I'm also going to update the patch to make this check occur for any pci class 0600 device from vendor AMD, since its possible that more than just nvidia chipsets can be affected. I'll repost as soon as I've tested, thanks! Neil -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
Ok, New patch attached. It preforms the same function as previously described,
but is more restricted in its application. As Yinghai pointed out, the
broadcast mask bit (bit 17 in the htcfg register) should only be enabled, if the
extened apic id bit (bit 18 in the same register) is also set. So this patch
now check for that bit to be turned on first. Also, this patch now adds an
independent quirk check for all AMD hypertransport host controllers, since its
possible for this misconfiguration to be present in systems other than nvidias.
The net effect of these changes is, that its now applicable to all AMD systems
containing hypertransport busses, and is only activated if extended apic ids are
in use, meaning that this quirk guarantees that all processors in a system are
elligible to receive interrupts from the ioapic, even if their apicid extends
beyond the nominal 4 bit limitation. Tested successfully by me.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 76 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..d5a7b30 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if ((htcfg & (1 << 18)) == 1) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt ...1. k8 northbridge always on bus 00, and dev0x18~dev0x1f, some time later it could be on bus 0xff 2. need to check for first node (BSP) bit 18 and bit 17 is already set, then don't need to go wild to check and set all over the bus. YH --
Not really sure what you're saying here. I understand that the k8 northbridge will for the forseeable future be on bus 00, but why not look for all ht hosts? If we find an ht host, and bit 18 is set while bit 17 is not, we can safely correct it. --
Sorry to reply to myself, but do we have consensus on this patch? I'd like to figure out its disposition if possible. Thanks & Regards Neil --
I agree with the approach taken. Somebody needs to review the changes done for applying early_quirks. I am not well versed with it. Thanks Vivek --
What the patch tries to do looks like the right thing. So if we can get a version that is clean and actually works we should merge it. Eric --
Ok. This test is broken. Please remove the == 1. You are looking
The rest of this quirk looks fine, include the fact it is only intended
to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.
For what is below I don't like the way the infrastructure has been
extended as what you are doing quickly devolves into a big mess.
Please extend struct chipset to be something like:
struct chipset {
u16 vendor;
u16 device;
u32 class, class_mask;
void (*f)(void);
};
And then the test for matching the chipset can be something like:
if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
(id->device == PCI_ANY_ID || id->device == dev->device) &&
!((id->class ^ dev->class) & id->class_mask))
Essentially a subset of pci_match_one_device from drivers/pci/pci.h
That way you don't need to increase the number of tables or the
number of passes through the pci busses, just update the early_qrk
table with a few more bits of information.
The extended form should be much more maintainable in the long
run. Given that we may want this before we enable the timer
which is very early doing this in the pci early quirks seems
to make sense.
Eric
--
New patch attached, with suggestions incorporated.
Thanks & regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 73 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..4b0cee1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
+
+static void __init check_hypertransport_config()
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN ...Neil Horman <nhorman@tuxdriver.com> writes: We should not need check_hypertransport_config as the generic loop Likewise this function is unneeded and the printk is likely confusing We don't need to shift device. Although we can do: device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID); device = device_vendor >> 16; vendor = device_vendor & 0xffff; --
Agreed.
struct chipset {
u16 vendor;
u16 device;
u32 class;
+ u32 class_mask;
void (*f)(void); ============> int (*f) (int num, int slot, int func);
===>
int status;
status = early_qrk[i].f(num, slot, bus);
if (status == 1)
break;
else if (status == 2)
return;
YH
--
cool! :)
Copy that. Fixed
I'm not so sure about this. In my testing, it was clear that I needed to do a
shift on device to make valid comparisons to the defined PCI_DEVICE_* macros.
The origional code had to do the same thing with the class field, which is
simmilarly positioned in the pci config space.
Other than that, new patch attached. Enables the detection of AMD
hypertransport functions and checks for the proper quirk just as before, and
incoporates your comments above Eric, as well as yours Yinghai.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 76 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 61 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..e13c999 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,7 +21,7 @@
#include <asm/gart.h>
#endif
-static void __init via_bugs(void)
+static int __init via_bugs(int num, int slot, int func)
{
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
@@ -32,6 +32,7 @@ static void __init via_bugs(void)
gart_iommu_aperture_disabled = 1;
}
#endif
+ return 1;
}
#ifdef CONFIG_ACPI
@@ -44,7 +45,31 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static int __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt ...Ok. I just looked at read_pci_config. It doesn't do the right thing for
a non-aligned 32bit access. (Not that I am convinced there is a right
thing we can do). Please make this read_pci_config_16 instead
and you won't need the shift.
Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.
The current code doing a shift where none should be needed (because
we ignore the two low order bits in our read) is totally weird
You almost got YH's comment. You need return 2 for the old functions
so we don't try and apply a per chipset fixup for every device in
the system.
I'm actually inclined to remove the return magic and just do something
like:
static fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.
Hmm. I don't think we want this code positioned in the middle of the
vendor = read_pci_config_16(num, slot, func,
device = read_pci_config_16(num, slot, func,
--
it seems we need to have two tables. one for northbridge (sweep all the NB_K8) and another for SB ( like Nvidia, ati..., one touch and leave) YH --
I like Erics idea better I think. My origional patch had two tables, and it seems that it made the early quirk detection logic that much more convoluted. This way each quirk can determine if it needs to be applied to more than one pci device. --
nvidia or ati chip will come first, and then amd NB ( K8). So you need to make sure "fix_applied return" is not going to skip your fix to K8_NB. YH --
The former seems like a reasonable solution to me. Corrected in this updated
I like the latter approach better. It seems less convoluted to me.
New patch attached.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f307285 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,13 @@
#include <asm/gart.h>
#endif
-static void __init via_bugs(void)
+static void __init via_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
@@ -44,8 +49,36 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+
+}
+
+static void __init nvidia_bugs(int num, int slot, int func)
+{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
...Ok. My only remaining nit to pick is that fix_hypertransport_config is right in the middle of the nvidia quirks, which can be a bit confusing when reading through the code. Otherwise I think this is a version that we can merge. Let's get a clean description on this thing and send it to the --
Recently a kdump bug was discovered in which a system would hang inside calibrate_delay during the booting of the kdump kernel. This was caused by the fact that the jiffies counter was not being incremented during timer calibration. The root cause of this problem was found to be a bios misconfiguration of the hypertransport bus. On system affected by this hang, the bios had assigned APIC ids which used extended apic bits (more than the nominal 4 bit ids's), but failed to configure bit 17 of the hypertransport transaction config register, which indicated that the mask for the destination field of interrupt packets accross the ht bus (see section 3.3.9 of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF). If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will not recieve interrupts during the kdump kernel boot, and this hang will be the result. The fix is to add this patch, whcih add an early pci quirk check, to forcibly enable this bit in the httcfg register. This enables all cpus on a system to receive interrupts, and allows kdump kernel bootup to procede normally. Regards Neil Signed-off-by: Neil Horman <nhorman@tuxdriver.com> early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 69 insertions(+), 21 deletions(-) diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c index 88bb83e..c0d0c69 100644 --- a/arch/x86/kernel/early-quirks.c +++ b/arch/x86/kernel/early-quirks.c @@ -21,8 +21,36 @@ #include <asm/gart.h> #endif -static void __init via_bugs(void) +static void __init fix_hypertransport_config(int num, int slot, int func) { + u32 htcfg; + /* + *we found a hypertransport bus + *make sure that are broadcasting + *interrupts to all cpus on the ht bus + *if we're using extended apic ids + */ + htcfg = read_pci_config(num, slot, func, 0x68); + if (htcfg & (1 << 18)) { + printk(KERN_INFO "Detected use of extended apic ...
Ben, please lets be clear about this. You say this patch doesn't help on a new system. Even thought its almost the exact same system, its not the same system. Does this patch work consistently on the system you initially reported the problem on? I've done enough work on this at this point that I'm invested in not abandoning this fix. If this solves the problem on dual core system, but not quad core, I'd much rather move forward with this fix and address your quad core problem as a separate issue. --
==>
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB,
PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
+ { PCI_VENDOR_ID_AMD, 0x1200 , PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID,
fix_hypertransport_config },
I still think good way is that you ask Supermicro to update their BIOS
to use newer code from AMD.
YH
--
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Eric --
I'm not convinced the message is correct. e.g. on a system with only a dual core not enabling This looks like the wrong place to do this. Better add a flag or something in the structure. Dito others. Also while not a problem here in general it's bad style to add potential wrapping bugs like this. Never use ++ for flags. -Andi --
I'm not sure that would be fine. In the situation you describe, not setting this bit means the second core won't receive interrupts. If we crash on that core and boot the kdump kernel with it, we get exactly the same problem that we I suppose I can, but I'm not sure what benefit that provides. Can you I can fix that up. I'll hold off though until ben redoes all his testing. He mentioned earlier this morning, that some of the results he was getting may have been caused by a kexec utility bug. He's re-confirming that this patch solves the reported problem. Once he does, I'll repost. Thanks & Regards -- /*************************************************** *Neil Horman *nhorman@tuxdriver.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
It could enable the extended APIC IDs but not use them? The code would be smaller and cleaner. -Andi --
I think this just leaves us with deciding on a mechanism for how to do single-application quirks. I take Andi's point that adding a flag set to the quirk data structure is a fine solution, but I'm really ok with static integers in individual functions. Do we have consensus on how to handle that? I'm happy either way, but I'd rather have agreement on how to handle it before I post another iteration of this patch. Thanks & Regards -- /*************************************************** *Neil Horman *nhorman@tuxdriver.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ --
As long as the solution is simple, small and concise I don't care. And since what will make Andi happy seems to meet those criteria, that should be fine. Eric --
Ok, new patch attached, taking into account Andi's request for a cleaner method to implement single application quirks. I've spoken with Ben, who is continuing to retest, and reports that clean methodical testing results in success with this patch. Summary: Recently a kdump bug was discovered in which a system would hang inside calibrate_delay during the booting of the kdump kernel. This was caused by the fact that the jiffies counter was not being incremented during timer calibration. The root cause of this problem was found to be a bios misconfiguration of the hypertransport bus. On system affected by this hang, the bios had assigned APIC ids which used extended apic bits (more than the nominal 4 bit ids's), but failed to configure bit 17 of the hypertransport transaction config register, which indicated that the mask for the destination field of interrupt packets accross the ht bus (see section 3.3.9 of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF). If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will not recieve interrupts during the kdump kernel boot, and this hang will be the result. The fix is to add this patch, whcih add an early pci quirk check, to forcibly enable this bit in the httcfg register. This enables all cpus on a system to receive interrupts, and allows kdump kernel bootup to procede normally. Regards Neil Signed-off-by: Neil Horman <nhorman@tuxdriver.com> early-quirks.c | 86 +++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 65 insertions(+), 21 deletions(-) diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c index 88bb83e..f4ed3d1 100644 --- a/arch/x86/kernel/early-quirks.c +++ b/arch/x86/kernel/early-quirks.c @@ -21,8 +21,31 @@ #include <asm/gart.h> #endif -static void __init via_bugs(void) +static void __init fix_hypertransport_config(int num, int slot, int func) { + u32 htcfg; + /* + *we found a ...
Sorry for not noticing that earlier, but was there a specific reason this needs to be an early quirk at all? kexec can only happen after the standard quirks ran. I think it should be fine as a standard "late" quirk. -Andi --
Early quirk seemed like the right thing to do to me. Starting from boot up, this (mis)configuration by the bios can mean that come cpus just don't get interrupts. I could imagine situations like serial console not working if the serial port interrupt was routed to a cpu that used extended APIC id. I've never actually observed it happening, but making sure that all cpus were eligible to get interrupts early in the boot process made sense to me. --
Thanks. Neil in your testing please confirm the preconditions for setting the Apic Extended Broadcast flag (bit 17) are present. If that is the case it makes sense to always set that bit on conforming systems but we will also want to print a message noting that the BIOS has a bug, and we are working around it. Thanks, Eric --
However things are implemented completely differently now. I don't think the coherent hypertransport domain of AMD processors actually routes ExtINT interrupts to all cpus but instead one (the default route?) is picked. So I think for the kdump case we pretty much need to use an IOAPIC in virtual wire mode for recent AMD systems. It's worth a look. I still think we need to just use apic mode at kernel startup, and be done with it. Eric -
http://www.hypertransport.org/docs/tech/HTC20051222-0046-0008-Final-4-21-06.pdf Table 143 suggest to me that legacy interrupts should be routed to all cpus, which certainly doesn't seem to be the case in this situation. Perhaps Nvidia Certainly, this seems like the best solution long term. So I'm looking at the implementation for 64 bit system, and it seems a little cleaner than 32 bit setup. I'm wondering if we can just call setup_IO_APIC immediately after init_IRQ in start_kernel? Could it be that straightforward? -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
Neil whipped up a patch to try this and evidently it worked on his test boxes but it didn't work very well on our problem tests box. It hung after the kernel printed "Ready". i.e. on a normal boot I get: <snip> 2007-11-29 13:48:29 Loading vmlinuz-2.6.18-13chaos.ben.test................................ 2007-11-29 13:48:29 Loading initrd-2.6.18-13chaos.ben.test......................................................... .............................................................................. 2007-11-29 13:48:29 Ready. 2007-11-29 13:48:30 Linux version 2.6.18-13chaos.ben.test (ben@wopri) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14 )) #10 SMP Thu Nov 29 13:11:49 PST 2007 2007-11-29 13:48:30 Command line: initrd=initrd-2.6.18-13chaos.ben.test loglevel=8 console=ttyS0,115200n8 crashkernel=128M@16M elevator=deadline swiotlb=65536 selinux=0 apic=debug BOOT_IMAGE=vmlinuz-2.6.18-13chaos.ben.test BOOTIF= 01-00-30-48-57-91-56 With Neil's patch: 2007-11-29 17:12:55 PXELINUX 2.11 2004-08-16 Copyright (C) 1994-2004 H. Peter Anvin 2007-11-29 17:12:55 Boot options [default: 2.6.18-54.el5.bz336371]: 2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-54.el5.bz336371 2007-11-29 17:12:55 linux 2007-11-29 17:12:55 linux-2.6.18-54.el5.bz336371 2007-11-29 17:12:55 linux-2.6.18-52.el5 2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-13chaos.ben.test 2007-11-29 17:12:55 linux-2.6.23-0.214.rc8.git2.fc8 2007-11-29 17:12:55 linux-2.6.18-8.1.14.el5 2007-11-29 17:12:55 linux-2.6.18-7chaos 2007-11-29 17:12:55 boot: 2007-11-29 17:13:02 Loading vmlinuz-2.6.18-13chaos.ben.test................................ 2007-11-29 17:13:02 Loading initrd-2.6.18-13chaos.ben.test......................................................... .............................................................................. 2007-11-29 17:13:02 Ready. (END) That's all she wrote. End of story. Had to reboot to another kernel to make get it back. Neil's patch: --- ...
Interesting can you please try an early_printk console. I expect you made it a fair ways and it just didn't show up because you didn't get as far as the normal serial port setup. You don't have any output from your linux kernel. Eric -
I've got a system here that now seems to be behaving in a way that is simmilar to what Ben describes (although I'm not sure its the same problem). early_printk shows that we're panic-ing inside check_timer, because we fail to find any way to route the timer interrupt to the cpu. Specifically, we're hitting this panic: panic("IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter\n"); This doesn't make much sense to me, as we clearly have managed to get timer interrupts at this point (since we made it through calibrate_delay).... Looking at it, I wonder if this isn't a backporting issue. Ben and I ran this test on an older kernel (since thats what the production system under test is based on). Currently, check_timer is called directly from within setup_IO_APIC, but in the 2.6.18 kernel that RHEL5 is based on its part of an initcall thats run from within init near the call site of the origional io apic init code. I wonder if this isn't just an 'old kernel' issue, and that I need to move check_timer to inside setup_IO_APIC. Ben, is it possible for you to run an upstream kernel on one of these systems so we can see if the patch works in that case? In the interim, I'll update my patch for 2.6.18 so that check_timer gets moved earlier -
there is two mode for mcp55. bios should have one option about virtul wired to LVT0 of BSP or IOAPIC pin 0. or the option like hpet route to ioapic pin 2. for kdump fix, could enable LVT0 of CPU for kdump and disable that for BSP? ben, can you send out lspci -vvxxx -s 00:1.0 YH -
That's interesting. So with BIOS options I can force timer interrupts to be routed through IOAPIC? That would enable us to get timer interrupts on any of the cpus in legacy mode. Ben, can you give it We would not know the crashing cpu in advance hence can't set it. Thanks Vivek -
Exactly. It is still interesting to test to see what happens if you plugin the normal values into ioapic_i8259 for .pin and .apic (.pin is 0 or 2 and .apic is 0) and see what happens. Having a command line parameter that could do that would be a cheap temporary solution. But this is the most likely reason why the timer interrupt is not working. Eric -
Ok, thank you for the explination, this all makes a good deal more sense to me now. Ben is near the machine, so hopefully we'll hear from him soon with the results of this test. Given that, do you think the cpu-switch test that I proposed would be a good solution now (with the fallback mechanism I described), or would a command line 8259 solution be better? I tend to think the former would be better since it would be transparent to the user, but I'd like to have that debate. Regards -- /*************************************************** *Neil Horman *Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ -
