Re: [RFC PATCH] PCI MMCONFIG: add validation against ACPI motherboard resources

Previous thread: Re: [PATCH] support PCI MCFG space on Intel i915 bridges by Robert Hancock on Sunday, April 29, 2007 - 7:10 pm. (5 messages)

Next thread: [PATCH] x86_64: support poll() on /dev/mcelog by Tim Hockin on Sunday, April 29, 2007 - 7:24 pm. (3 messages)
From: Robert Hancock
Date: Sunday, April 29, 2007 - 7:14 pm

This path adds validation of the MMCONFIG table against the ACPI reserved
motherboard resources. If the MMCONFIG table is found to be reserved in
ACPI, we don't bother checking the E820 table. The PCI Express firmware spec
apparently tells BIOS developers that reservation in ACPI is required and
E820 reservation is optional, so checking against ACPI first makes sense.
Many BIOSes don't reserve the MMCONFIG region in E820 even though it is
perfectly functional, the existing check needlessly disables MMCONFIG in
these cases.

In order to do this, MMCONFIG setup has been split into two phases. If PCI
configuration type 1 is not available (like on EFI Macs) then MMCONFIG is
enabled early as before. Otherwise, it is enabled later after the ACPI
interpreter is enabled, since we need to be able to execute control methods
in order to check the ACPI reserved resources. Presently this is just triggered
off the end of ACPI interpreter initialization.

There are a few other behavioral changes here:

-Validate all MMCONFIG configurations provided, not just the first one.

-Validate the entire required length of each configuration according to the
provided ending bus number is reserved, not just the minimum required allocation.

-Validate that the area is reserved even if we read it from the chipset directly
and not from the MCFG table. This catches the case where the BIOS didn't set the
location properly in the chipset and has mapped it over other things it shouldn't have.
This might be overly pessimistic - we might be able to instead verify that no other
reserved resources (like chipset registers) are inside this memory range.

Some testing is needed to see if this rejects MMCONFIG on all systems where it
is problematic. There were some patches floating around to read the table
location out of the chipset for Intel 915 and 965, I think the author found the
latter to be problematic since the chipset had the table mapped over top of
motherboard resources. The extra checking here may catch ...
From: Randy Dunlap
Date: Sunday, April 29, 2007 - 7:59 pm

if ((



---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

From: Olivier Galibert
Date: Monday, April 30, 2007 - 3:59 pm

I have a fundamental problem with that: you don't validate a higher
reliability information against a lower one.  The chipset registers
are high reliability.  Modulo unknown hardware erratas and bugs in the
code (and accepting f0000000 is in practice a bug in the code, the
docs are starting to catch up with it too), the chipset *will* decode
mmconfig at the looked up address no matter what.  On the other side,
the ACPI data is bios generated, and that is well known to be horribly
unreliable.  Hell, if it was reliable we could just use the MFCG ACPI
table without questions.

So you can check the ACPI stuff for coherency (MFCG vs. the rest), you
can validate the ACPI stuff against the results of the lookup if you
want, but validating the lookup against ACPI is nonsensical.

  OG.

-

From: Robert Hancock
Date: Monday, April 30, 2007 - 4:26 pm

The problem is that in the event the MMCONFIG table is assigned to an 
address range that conflicts with other devices, there's no guarantee 
that MMCONFIG will have the higher decode priority. Apparently this is 
exactly the case on some boards, the MMCONFIG is mapped on top of 
chipset registers, and when we think we're accessing the MMCONFIG table 
we're really scrambling random chipset registers and hosing things. So 
we can't just blindly trust the values as being usable even when they 
come directly from the chipset.

As I mentioned, though, really what we want to verify in this case is 
that no other reserved resources fall inside the range being decoded by 
the chipset. I may see if I can code something to do this. Essentially, 
though, the existing patch effectively does this, on the assumption that 
the board won't have conflicting reserved resources in the ACPI tables, 
which is probably a safe assumption as Windows would likely be terribly 
unhappy with that..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


-

From: Jesse Barnes
Date: Tuesday, May 1, 2007 - 9:48 am

We need to look at the register, but we may not want to use it if it looks 
too confused.  If it doesn't agree with what we see in ACPI, we likely 
have a problem due to the issues Robert outlined in his other mail.

Jesse

-

From: Jesse Barnes
Date: Tuesday, May 1, 2007 - 7:41 pm

Now that I've read his patch closely I think you're right.

Robert, it looks like you'll trust acpi_table_parse if 
pci_mmcfg_check_hostbridge returns a failure.  I think it should be 
treated with a higher priority.  If pci_mmcfg_check_hostbridge returns a 
failure, there's no way MCFG space can work, so we should disable it 
unconditionally in that case (even if ACPI says "trust me, when have I 
ever lied to you?").

I'm testing it now on my 965...

Thanks,
Jesse
-

From: Jesse Barnes
Date: Tuesday, May 1, 2007 - 7:56 pm

Bah... nevermind Robert, I see you're doing this already in 
pci_mmcfg_reject_broken.  I'm about to reboot & test now.

Jesse
-

From: Jesse Barnes
Date: Tuesday, May 1, 2007 - 10:27 pm

Ok, I've tested a bit on my 965 (after re-adding my old patch to support 
it) and the new checks are more complete, but my BIOS still appears to be 
buggy.

The extended config space (as defined by the register) is at 0xf0000000 
(full value is 0xf0000003 indicating 128M enabled).  The ACPI MCFG table 
has this space reserved according to Robert's new code, but the machine 
hangs due to the address space aliasing Olivier mentioned awhile back.  I 
don't have a PCIe card to test with (or any devices that require extended 
config space that I know of) so I can't really tell if Windows supports 
PCIe on this platform, but if it does I don't see how it would w/o having 
a full bridge driver and sophisticated address space allocation builtin.

I'm going to try updating my BIOS, but if that doesn't solve this problem, 
I'm not sure what we can do about it.  Should pci_mmcfg_insert_resources 
check for conflicts?  Should we just blacklist certain boards?  I can try 
pinging our BIOS folks about this board to see what was intended, but I'm 
sure this won't be the only board we have problems with, so we'll need to 
address it generically somehow.

Thanks,
Jesse
-

From: Robert Hancock
Date: Wednesday, May 2, 2007 - 7:34 am

Windows XP doesn't use MMCONFIG or any extended configuration space. I 
believe Vista is supposed to, though. Not sure how they are handling 

Can you post what your board has for PNPACPI reserved resources (I 
believe they're in /sys/devices/pnp0/*/resources IIRC, don't have a 
Linux box handy right now). Full dmesg would also be useful, I think it 
dumps out those reservations at boot nowadays..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

-

From: Jesse Barnes
Date: Wednesday, May 2, 2007 - 10:57 am

BIOS update didn't help.  Here's the boot log and a dump of the pnp0 
resources.

Jesse
From: Robert Hancock
Date: Wednesday, May 2, 2007 - 4:45 pm

Curious.. It looks like the ACPI resources have the correct reservation 
for the MMCONFIG window according to what the register says the location 
should be. There's no other reservations that overlap with that range 
(f000000-f7ffffff), and according to the 965 datasheet there's nothing 
that's hard-coded to occupy that memory range. I can't really see what 
this range could be conflicting with.

What happens if you take out the chipset register detection, does the 
MCFG table give you the same result? Wonder if they're doing something 
funny with start/end bus values or something in their table. There's 
some code in my patch that prints out the important data from the MCFG 
table, can you tell me what that shows with the chipset detection taken out?

If that doesn't provide any useful information, I think we may need some 
assistance from Intel chipset/motherboard people to figure out what is 
going on here..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-

From: Jesse Barnes
Date: Wednesday, May 2, 2007 - 4:54 pm

Yeah, it's strange.  Even /proc/iomem from a working boot looks ok:

d0700000-d07fffff : PCI Bus #04
d0800000-d08fffff : PCI Bus #05
f0000000-f7ffffff : pnp 00:01
fec00000-fec00fff : IOAPIC 0

Yeah, I'll look a little more closely.  It could also be that another 
register needs tweaking somewhere to actually get the bridge to decode 

I'm talking with them now, hopefully they'll shed some light on it.

Thanks,
Jesse

-

From: Jesse Barnes
Date: Friday, May 4, 2007 - 2:06 pm

I did a little more debugging this morning, and found that I can 
actually do reads from the space described by ACPI and the device 
register, but later when ACPI actually scans the root bridges, it 
hangs.  Specifically the call to pci_acpi_scan_root in 
pci_root.c:acpi_pci_root_add() never seems to return.

I'll walk through that logic when I get back to my test box, but it's 
also worth noting that Vista's MCFG on this machine apparently works ok 
too.

Jesse
-

From: Robert Hancock
Date: Friday, May 4, 2007 - 5:22 pm

I would try sticking some debug in arch/x86_64/pci/mmconfig.c at the 
beginning and end of pci_mmcfg_read and pci_mmcfg_write to print the 
seg, bus, devfn and reg for each read and write. Hopefully that will 
track down the one that is causing the lockup, if it is an actual 
MMCONFIG access that's doing it..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-

From: Jesse Barnes
Date: Monday, May 21, 2007 - 12:10 pm

I can't see how any MCFG based accesses could work on this box, but I
don't know why.  According to the boot log (with our code patched in
but disabled after checking the ACPI reserved status), the space is fine:

...
ACPI: (supports S0 S3 S4 S5)
ACPI: Using IOAPIC for interrupt routing
pciexbar lo: 0xf0000003
pciexbar hi: 0x00000000
Enabled MCFG space at 0x00000000f0000000, size 134217728
PCI: Found Intel Corporation G965 Express Memory Controller Hub with MMCONFIG support.
PCI: MCFG configuration 0: base 00000000f0000000 segment 0 buses 0 - 127
PCI: MCFG area at f0000000 reserved in ACPI motherboard resources
PCI: Not using MMCONFIG. <-- due to the 'goto reject' after
                             if (is_acpi_reserved) { ... }
PM: Adding info for acpi:acpi_system:00
PM: Adding info for acpi:button_power:00
...

Same thing happens if I disable the chipset specific code and just use
the ACPI stuff you added.

If I leave it enabled, several config cycles work fine, but the box
eventually hangs after probing 24 devices or so.  I don't see anything
else mapped into this space, and the MTRRs seem ok, so either there's
something hidden in this memory range or there's another chipset register
that needs poking to fully enable this space properly.

Sysrq doesn't seem to work, and I don't see any events in my machine log,
so figuring out exactly why it's hanging is a bit difficult.

Any ideas on what to try next?  I'll see if I can get some more details
from our BIOS folks and do yet another pass over the documentation to see
if there's something I'm missing.

Thanks,
Jesse
-

From: Robert Hancock
Date: Monday, May 21, 2007 - 12:26 pm

Can you find out which config access (bus, device, function, address) is 
the one that hangs the box? I assume that either the corresponding 
address in the MCFG table is problematic (i.e. has something else mapped 
over it), or maybe that device just doesn't like being probed with MCFG 
somehow.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

-

From: Jesse Barnes
Date: Monday, May 21, 2007 - 1:07 pm

Yeah, I've got that data... just a sec while I make sure it's 
reproducable...

Aha, I hadn't decoded the devfn before, looks like it's dying on an access 
to the graphics device (bus 0, slot 2, device 0):

...
pci_mmcfg_read: 0, 0, 0x10, 0x18, 4 = 0xc000000c
pci_mmcfg_read: 0, 0, 0x10, 0x18, 4 = <hang>
...

Offset 0x18 into the graphics config space should be the graphics memory 
range address, and 0xc000000c is the correct value.  But for some reason 
it hangs on the second access.

It hangs here everytime.

Jesse
-

From: Jesse Barnes
Date: Monday, May 21, 2007 - 1:22 pm

That register is in the config space BAR region, so it should be ok to 
write 0xffffffff to it and read it back to size the register.  However, 
it's after writing the 0xffffffff to it and trying to read it back that 
the machine hangs.  I didn't see any accesses to the command register to 
disable decoding (at least not via the mmconfig methods), so maybe that's 
broken during MCFG based probing?

Jesse
-

From: Robert Hancock
Date: Tuesday, May 22, 2007 - 5:31 pm

Eww. I don't see where we disable the decode at all while we probe the 
BARs on the device. That seems like a bad thing, especially with the way 
we probe 64-bit BARs (do the low 32 bits first and then the high 32 
bits). This means the base address effectively gets set to 0xfffffff0 
momentarily, which might cause some issues.

I'd try adding some code inside pci_setup_device (drivers/pci/probe.c) 
to disable PCI_COMMAND_IO and PCI_COMMAND_MEMORY on the device when 
probing devices with the standard header type and then restoring the 
previous command bits afterwards, and see what effect that has. It'll be 
  interesting if it does, since obviously it seems to work as it is with 
non-MMCONFIG access methods. Maybe the base address being set like that 
interferes with MMCONFIG access itself somehow?

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-

From: Jesse Barnes
Date: Tuesday, May 22, 2007 - 5:38 pm

I'm a bit shocked that things work as well as they do without the 

I tried that, and it seems to get past probing the graphics device at 
least, but it hangs a bit later.  It could be that the enable/disable I 
added wasn't correct though, I didn't check to see which one I should 
disable in the command word, which may be a problem (just disabled them 
both every probe).  I'll try again with more precise enable/disable 
semantics.

Jesse
-

From: Robert Hancock
Date: Tuesday, May 22, 2007 - 5:53 pm

It'd be interesting to see at what access it ran into trouble next, at 
least if it's consistent. Could be that some device doesn't like having 
the decode disabled..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-

From: Jesse Barnes
Date: Tuesday, May 22, 2007 - 5:56 pm

I think it actually gets through the probing but hangs elsewhere, but I'll 
have to test again to be sure.

Jesse
-

From: Robert Hancock
Date: Tuesday, May 22, 2007 - 6:06 pm

There was a big discussion about this back in 2002, in which Linus 
wasn't overly enthused about disabling the decode during probing due to 
risk of causing problems with some devices:

http://lkml.org/lkml/2002/12/19/145

In this particular case (64-bit BAR) we might be able to avoid the 
problem by changing the order in which we probe the two halves of the 
address, i.e. change the top half to 0xffffffff before messing with the 
bottom half and then change it back last. That way, we end up mapping it 
way to the top of 64-bit address space, which hopefully is less likely 
to conflict..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 11:52 am

Fixed it (finally).  I don't think moving the 64 bit probing around 
would make a difference, since we'd restore its original value anyway 
before moving on to the 32 bit probe which is where I think the problem 
is.

I think what's happening is the probe is writing 0xffffffff to the video 
device, which is in the GMCH, and without memory decoding disabled, 
it'll start claiming PCI config access cycles causing the problems I 
saw.  So my code to disable I/O and memory decode was actually working 
but I had a bug in the re-enable path so all my devices were staying 
disabled. :)

So here's the patch I used (along with your ACPI patch and my 965 MCFG 
enable patch of course).  The probing code could probably use a bit 
more cleanup, but this patch limits itself to implementing PCI_COMMAND 
based I/O and memory space decode disabling during size probing.  We 
might want to do this unconditionally if we're using mmconfig based 
configuration access, since I imagine other machines might end up 
having similar address space layouts that would cause problems.

Linus, since you were the one concerned about breaking working setups, 
what do you think?  Should we use this approach, or specifically quirk 
out cases where mmconfig space might conflict with BAR probing?

Thanks,
Jesse

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index e48fcf0..69dfe0c 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -170,6 +170,48 @@ static inline int is_64bit_memory(u32 mask)
 	return 0;
 }
 
+#define BAR_IS_MEMORY(bar) (((bar) & PCI_BASE_ADDRESS_SPACE) ==	\
+			    PCI_BASE_ADDRESS_SPACE_MEMORY)
+
+/**
+ * pci_bar_size - get raw PCI BAR size
+ * @dev: PCI device
+ * @reg: BAR to probe
+ *
+ * Use basic PCI probing:
+ *   - save original BAR value
+ *   - disable MEM or IO decode as appropriate in PCI_COMMAND reg
+ *   - write all 1s to the BAR
+ *   - read back value
+ *   - reenble MEM or IO decode as necessary
+ *   - write original value back
+ *
+ * ...
From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 1:20 pm

Well, the thing is, I'm pretty sure there is at least one northbridge that 
stops memory accesses from the CPU when you turn off the MEM bit on it. 
Oops, you just killed the machine.

Looking at the 925X datasheet (which I happened to have around in my 
google search history because of the discussions of the sky2 DMA 
problems), it looks like at least that one just hardcodes the MEM bit to 
be 1, and thus writing to it is a total no-op.

But I really think that clearing the MEM bit for at least the host bridge 
is conceptually quite wrong, even if it might turn out that all chipsets 
end up just saying (like Intel) "screw it, the user is insane, we're not 
going to actually do what he asks us to do".

Do we really want to be that insane? Turn off memory accesses when probing 
the CPU host bridge?

So at a _minimum_ I would say that that thing needs to be more careful 

So see above. I think at a minimum, we should consider the host bridge 
special.

I also suspect that we'd be simply better off if we didn't use mmconfig at 
all unless we _have_ to. Why use mmconfig for the standard BAR accesses? 
Is there really any reason? I can understand using it for extended config 
space, since then the old-fashioned approach won't work. But for normal 
accesses? What's the point, really?

mmconfig seems to be fundamentally designed to be impossible to bootstrap 
off, so there's no way you can have a machine that _only_ supports 
mmconfig. So why do people seem to think it's so wonderful? Please fill me 
in on this fundamental mystery.

Quite frankly, if we just didn't use mmconfig, the whole issue would go 
away. Isn't _that_ the much better solution?

		Linus
-

From: Alan Cox
Date: Wednesday, May 23, 2007 - 1:38 pm

CS5520. But it doesn't have 64bit or PCI Express.


Alan
-

From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 1:45 pm

That patch does it for _all_ PCI probing. So it would turn any machine 
using that northbridge into a brick.

		Linus
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 1:49 pm

I'm not sure either, but the PCI spec is pretty clear about how probing 
ought to be done, and it seems that other OSes do the disabling (though 
I'm not sure about how they handle broken host bridges like the one you 

Yeah, it's mainly needed for extended config space and PCIe (lots of 
regular PCIe features are in the extended space and are assumed to be 


Not for systems with PCIe...  and the platforms I've been having trouble 
with have PCIe slots, so I'd really like mmconfig to be used at least 
on machines with PCIe bridges.  For other machines, it probably doesn't 
matter much.  I don't know of any regular PCI devices offhand that 
really need extended config space.

Jesse
-

From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 1:56 pm

Umm. Why? Think about it.

You ASKED it to stop forwarding memory.

So who is lamer: the chip that does what it is told, or the software that 
tells it to do it?

I'd vote for the software. Any programmer who expects the hardware to 
"just do what I mean, not what I say" is not a programmer, but a dreamer. 


Ehh. Even for PCIe, why not use the normal accesses for the first 256 
bytes? Problem solved.

			Linus
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 2:03 pm

Well, because that's not actually very useful functionality, and likely 

Yeah, that's another option.  Would just mean an additional conditional 
in the mmconfig code, I'll give it a try...

Apparently Vista will move away from using type 1 config space accesses 
though, so if we keep using it, we'll probably run into some lame board 
that assumes you're using mmconfig at some point in the near future.  
But then again, we're often on that less tested path (e.g. with ACPI), 
so maybe that doesn't matter much.

Jesse
-

From: Jeff Garzik
Date: Wednesday, May 23, 2007 - 2:09 pm

One of the reasons why hardware vendors want to move away from 
traditional accesses is to be able to use the larger config space in 
PCI-Express, rather than being locked into the 256-byte legacy PCI 
config space.

Several modern PCI-Express devices utilize the upper config space, but 
due to legacy reasons the registers are usually ones that do not require 
OS drivers to know about (like BIST stuff or diagnostic registers).

Expect that to change, as MS shakes out the bugs (or maybe we are doing 
their job for them?).

	Jeff


-

From: Alan Cox
Date: Wednesday, May 23, 2007 - 2:35 pm

Mostly for treacherous computing extensions where subsets of the config
space can only be accessed by signed machines blessed by your favourite

The longer it takes - the better.

Alan
-

From: Jeff Garzik
Date: Wednesday, May 23, 2007 - 2:35 pm

Um, no, Mr. Paranoia, it's a standard part of the spec.

	Jeff



-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 2:37 pm

I hate "trusted" platform garbage as much as the next guy 
(where "trusted" means the actual user can't trust it, just the 
seller), but I think there are legitimate uses of extended space as 
well, PCIe AER uses it iirc, so don't dismiss it on those grounds. :)

Jesse
-

From: Jeff Garzik
Date: Wednesday, May 23, 2007 - 2:42 pm

Indeed.  It's just a register space.  Assuming one register space is 
"more evil" than another, simply because it is bigger, is.. well.. silly.

	Jeff



-

From: Stephen Hemminger
Date: Wednesday, May 23, 2007 - 4:07 pm

On Wed, 23 May 2007 17:09:37 -0400

On some PCI-Express boards, if you don't clear the advanced error reporting
registers on boot up, they will cause IRQ storm. The AER registers
are above 256 boundary. In fact, the AER support in Linux should depend
on MMCONFIG.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>

-

From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 2:54 pm

I agree that a chip that doesn't do it isn't broken either, but the fact 
is, there is never any reason to disable MEM/IO on a host bridge. Doing so 
is senseless - it can never be a valid operation. So I duspute the 
"obviously correct" part. It's _not_ obviously correct at all.

To get back to the MMIO example: even if you were to never shut off RAM, 
if you turn off just PCI MMIO on the northbridge, what is a mmconfig cycle 
supposed to do? It's not going to _work_ if you disable MEM accesses.

So again, the only sane situation is: don't do it then! You claim that 
hardware shouldn't do it, but I don't think software is in any different 
situation at all! If it's insane to do, then software shouldn't do it.

It's just insane to turn off the MEM bit. There's simply no valid reason 
to. And any PCI spec that says you should is *broken*, or written by 
somebody who really only meant to talk about normal PCI devices, not 

How are those boards going to set up mmconfig? The whole standard is 
broken, since there is no way to set it up.

Trust the firmware? What a piece of crap!

		Linus
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 3:06 pm

Well theoretically for just sizing BARs, turning off the MEM bit should 
be fine, since your next accesses should only be to config space until 
the MEM bit is reenabled.  But if RAM accesses really are disabled, 
then you'd better be sure all the code you need is already in cache, or 
you'll get into trouble.

So yeah, I guess special handling for host bridges is needed, but that 

What do you mean?  You set it up the normal way, by poking at config 
space to see what's there, then size the BARs (disabling mem and I/O 
accesses in PCI_COMMAND shouldn't affect config space cycles afaik).  
You just have to be careful to disable decoding for I/O and memory 
regions, especially if your mmconfig space overlaps with what the 
devices end up with in their BARs.  Which is why my initial patch works 
ok (because fortunately the Intel host bridges hard code the mem decode 
bit to 1 too).

Jesse

-

From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 3:16 pm

HOW DO YOU GET TO THE CONFIG SPACE IN THE FIRST PLACE?

The reason mmconfig is *BROKEN*CRAP* is that you cannot bootstrap it. 
There's no standard way to even figure out WHERE IT IS!

So we depend on firmware tables that are known to be broken!

That crap should be seen for the crap it is! Dammit, how hard can it be to 
just admit that mmconfig isn't that great?

		Linus
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 3:28 pm

Ah, yeah, that's platform specific, I thought you were confused about 
how the sizing worked.  On x86, we either have to look at ACPI tables 
(yay) or use type 1 config accesses to get at the mmconfig base 
register (which is what the patches Olivier and I posted do).  On ia64 
there are firmware calls to do config space accesses.  Not sure about 
other platforms.

I'm not claiming mmconfig is great and we should make everything use it, 
but we do need it these days, so we should figure out a good way of 
getting at it.

Jesse
-

From: David Miller
Date: Wednesday, May 23, 2007 - 4:04 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

I knew mmconfig was broken conceptually the first time I started
seeing write posting "bug fixes" for it that would do a read back from
PCI config space via mmconfig to post the write, which of course has
potential side-effects on the device and is absolutely illegal if the
write just performed put the device into a PM state or whatever.

Truth is stranger than fiction at times.

MMCONFIG is very much an ill-conceived idea.

-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 4:11 pm

I've actually seen that specific form of posted write flushing cause 
crashes on some machines, so yes, it sucks.

Unfortunately, I don't think we have any other way of getting at 
extended config space on x86, unless EFI provides methods or something, 
but I'm not sure that would be an improvement...

Jesse
-

From: Robert Hancock
Date: Wednesday, May 23, 2007 - 4:15 pm

That "fix" shouldn't be needed at all, the MMCONFIG memory range 
shouldn't be covered by PCI ordering rules, so there should be no such 
thing as write posting. I suspect that the author of such patch(es) was 
doing so out of some misguided sense that it was needed. (And if there 
is some chipset where it is actually needed, better just disable 
MMCONFIG on that one, as there's no way to use it sanely.)

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 4:21 pm

PCI allows write posting, and on systems using mmconfig the posting 
generally (unfortunately) extends to that space as well.  So drivers 
need to deal with it somehow.  But many get it wrong.

Jesse
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 2:20 pm

Ok, this patch also works.  We still need to enable mmconfig space for 
PCIe and extended config space, but we can continue to use type 1 
accesses for legacy PCI config space cycles to avoid decode trouble 
with mmconfig based BAR sizing.

Assuming Robert's and my patches to enable mmconfig space go in, we'd 
want a similar patch to the i386 mmconfig code.

Jesse

diff --git a/arch/x86_64/pci/mmconfig.c b/arch/x86_64/pci/mmconfig.c
index 65d8273..5052f80 100644
--- a/arch/x86_64/pci/mmconfig.c
+++ b/arch/x86_64/pci/mmconfig.c
@@ -61,7 +61,7 @@ static int pci_mmcfg_read(unsigned int seg, unsigned 
int bus,
 	}
 
 	addr = pci_dev_base(seg, bus, devfn);
-	if (!addr)
+	if (!addr || reg < 256) /* Use type 1 for non-extended access */
 		return pci_conf1_read(seg,bus,devfn,reg,len,value);
 
 	switch (len) {
@@ -89,7 +89,7 @@ static int pci_mmcfg_write(unsigned int seg, unsigned 
int bus,
 		return -EINVAL;
 
 	addr = pci_dev_base(seg, bus, devfn);
-	if (!addr)
+	if (!addr || reg < 256) /* Use type 1 for non-extended access */
 		return pci_conf1_write(seg,bus,devfn,reg,len,value);
 
 	switch (len) {
-

From: Olivier Galibert
Date: Wednesday, May 23, 2007 - 3:24 pm

Isn't that a mac-intel instant killer?  AFAIK they don't have type1,
period.

  OG.
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 3:31 pm

Yuck.  I'll have to add a check for type 1 then... but that also means 
Macs will probably want the decode disable stuff I posted earlier.

Jesse
-

From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 3:48 pm

mac-intel are totally standard Intel chipsets. They have all of 
conf1/conf2/mmconfig afaik.

I just happily booted my mac-mini with "pci=nommconf", nothing bad 
happened, and the kernel says

	PCI: Using configuration type 1

and I don't think you even _can_ disable conf1 type accesses: they are 
deep in the Intel chipsets.

Of course, in a virtualized environment, anything can happen. Virtual 
machines prefer mmconf, because you can use page-level remapping to hide 
devices or make pseudo-devices show up by mapping in pages that have 
nothing to do with the true hardware.

So no, I don't think Alan was totally smoking crack when he talked about 
"trusted" computing. Read the above paragraph a few times. (You can do it 
with trapping IO port accesses too, but it's going to cost you a lot, so 
if you want to make a fast but untrustoworthy setup, MMIO is the better 
option).

		Linus
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 3:55 pm

After I sent my last message I realized the same thing... though I 
occasionally hear people talk about removing it (I seriously doubt that 
will ever happen).  I don't even think there's a way to disable type 1 
config access on Intel chipsets...

So the last patch is ok then, as long as we can find mmconfig space in 
the first place, but that's a separate problem for another set of 
patches (ones that seem to be working fairly well now btw).

Jesse

-

From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 5:21 pm

Considering that the chipsets still have support for features that 
*really* aren't used (and haven't been used in over a decade), I doubt the 
conf1 thing is going away any time soon.

Things like: A20 gate, 15-16MB holes, i387 FP exception on irq 13 are 
totally pointless in this day and age. Things like the DMA controller are 
getting there, along with PS/2 keyboard support.

So there's a lot of things that are likely to be removed before conf1 
accesses would. Removing CONF1 accesses would break every single current 
OS, they'll do that ten years from now at the earliest.

			Linus
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 7:59 pm

So what do you think?  You ok with enabling mmconfig if it's available as long 
as we use type 1 accesses for non-extended stuff?  If so, I think the patches 
are pretty much ready...

Thanks,
Jesse
-

From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 8:18 pm

Sure. I think mmconfig is perfectly sane if it falls back to conf1 
accesses for legacy stuff..

And I also actually think that your patch to disable MMIO/PIO when testing 
the BAR size is fine - I just think that it should likely only be done for 
non-bridge devices (or at least non-host-bridge).

		Linus
-

From: Linus Torvalds
Date: Wednesday, May 23, 2007 - 8:20 pm

.. but without a regression, it's obviously a post-2.6.22 thing, I guess I 
should make that clear, just because I think people send me patches after 
-rc1 way too eagerly just because they think it fixes a bug.

Basically if it's not somethign that has _ever_ worked some way, it's not 
a bug, it's a feature ;)

			Linus
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 8:40 pm

No, I know better than to send something after your merge window closes.  I 
have no desire to be flamed even further on this topic.  :)

And come to think of it, adding the enable/disable bits might be good even 
with the patch to make legacy accesses go through type 1, since PCIe BAR 
probing is probably done the same way (I haven't looked) and so we might run 
into the same problems there.

Thanks,
Jesse

-

From: Robert Hancock
Date: Wednesday, May 23, 2007 - 10:19 pm

I think that disabling decode on non-host-bridge devices during the BAR 
sizing is something we should at least try, indeed.

The issue I have with forcing legacy config space accesses to type1 is 
that it would make it much less obvious if the MMCONFIG access wasn't 
working properly. You'd likely be able to boot up but then wonder why 
something that does extended config space accesses didn't work or hung 
the box. As I mentioned before, either we trust the MMCONFIG or we 
don't, and if we decide that we don't on a particular box, we should 
really be shutting it off entirely. Hopefully with the ACPI reservation 
checking patch and the disable-decode-during-BAR-sizing patch
we wouldn't need to add that restriction.

But yes, post-2.6.22 for all of this :-)

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-

From: Jeff Garzik
Date: Wednesday, May 23, 2007 - 11:18 pm

The latest Intel chipset I have (ICH9) is legacy free:  no serial port 
and no PS/2 ports.  I had to disable the Linux PS2 input drivers 
completely, just to get the thing to boot.

Whee, "progress".   :)

	Jeff, digging out that USB debug cable cuz there's no serial


-

From: Linus Torvalds
Date: Thursday, May 24, 2007 - 8:42 am

Ahh, that would be a bug. Can you help trying to debug where it locks up?

I'm also surprised, since on the mac mini I have, I already have:

	i8042.c: No controller found.

and it all works beautifully. Of course, it only did that after the 
horrible crud that is "grub" got fixed, because the bootloader used to 
wait forever, but I thought the kernel itself was able to handle a missing 
PS/2 controller fine.

		Linus
-

From: Robert Hancock
Date: Wednesday, May 23, 2007 - 4:04 pm

Which is retarded, since the command bits are only supposed to be for 
memory ranges that are part of the BARs, it's not supposed to completely 
kill the device function. Unless somehow the memory on that system is 
accessed through the PCI bus or something. Anyway, it's something we 

I think we should likely avoid disabling the command bits on host 
bridges (maybe any bridge) due to this risk of disabling something that 
will break things. Ideally we can get around this without doing any 

Why not? Either you trust that the MMCONFIG is working or you don't. If 
you trust it, you might as well use it for everything, and if you don't, 
you can't risk using it for anything. If there are problems that show up 
  only with MMCONFIG, doing what you propose would simply cover them up 

Sure you can bootstrap off it, you just need to have some way to know 

I don't think that is going to be viable in the long run now that 
Windows Vista is out and MS is actually encouraging HW developers to 
allow using that config space..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

-

From: Robert Hancock
Date: Wednesday, May 23, 2007 - 4:04 pm

You couldn't just reorder the code the way it is now, you'd have to 
rearrange the way we do things for 64-bit BARs:

-write FFFFFFFF to high part of 64-bit address (we end up moving the BAR 
to 0xFFFFFFFFC0000000 for example)
-If any bits stick, we know what the size is now (more than 4GB of 
decode), so just change it back, we're done
-If not, we need to check the low part, so write FFFFFFFF to low part of 
64-bit address (BAR moves to 0xFFFFFFFFFFFFFFFF)
-Check which bits stick and calculate the address
-Change the low part of the address back (BAR moves to 0xFFFFFFFFC000000)
-Change the high part of the address back (BAR moves to the original 
0xC0000000 address)

This means that at no point do we map the BAR anywhere near the top of 
32-bit memory, so we should avoid this issue in this particular case. I 
don't think this strategy is too likely to break anything, surely less 
likely than disabling command bits. Jesse, you might want to try hacking 
up something like this and see what happens.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 4:06 pm

Ah yeah, that would probably work in this particular case, but doesn't 
seem very general.  I think just using type 1 accesses for non-extended 
config space is a bit more solid.

Jesse
-

From: Jesse Barnes
Date: Wednesday, May 23, 2007 - 5:02 pm

Acked-by:  Jesse Barnes <jesse.barnes@intel.com>

As long as we get a fix for the mmconfig based probing issues in the 
other thread, I think this patch should go in.  Robert, maybe you could 
submit it along with this one (and an i386 equivalent)?  Type 1 config 
access is already a fallback for mmconfig for x86_64 at least, so it 
should be safe for non-extended access too, and it avoids problems with 
our lack of decode disable in the generic PCI probing code.

Signed-off-by:  Jesse Barnes <jesse.barnes@intel.com>

diff --git a/arch/x86_64/pci/mmconfig.c b/arch/x86_64/pci/mmconfig.c
index 65d8273..5052f80 100644
--- a/arch/x86_64/pci/mmconfig.c
+++ b/arch/x86_64/pci/mmconfig.c
@@ -61,7 +61,7 @@ static int pci_mmcfg_read(unsigned int seg, unsigned 
int bus,
 	}
 
 	addr = pci_dev_base(seg, bus, devfn);
-	if (!addr)
+	if (!addr || reg < 256) /* Use type 1 for non-extended access */
 		return pci_conf1_read(seg,bus,devfn,reg,len,value);
 
 	switch (len) {
@@ -89,7 +89,7 @@ static int pci_mmcfg_write(unsigned int seg, unsigned 
int bus,
 		return -EINVAL;
 
 	addr = pci_dev_base(seg, bus, devfn);
-	if (!addr)
+	if (!addr || reg < 256) /* Use type 1 for non-extended access */
 		return pci_conf1_write(seg,bus,devfn,reg,len,value);
 
 	switch (len) {
-

Previous thread: Re: [PATCH] support PCI MCFG space on Intel i915 bridges by Robert Hancock on Sunday, April 29, 2007 - 7:10 pm. (5 messages)

Next thread: [PATCH] x86_64: support poll() on /dev/mcelog by Tim Hockin on Sunday, April 29, 2007 - 7:24 pm. (3 messages)