Peter Chubb posted a patch to the lkml, with which he's now managed to mount a 15 terabyte file (using JFS and the loopback device). Without the patch, Peter explains, "Linux is limited to 2TB filesystems even on 64-bit systems, because there are various places where the block offset on disc are assigned to unsigned or int 32-bit variables."
Peter works on the Gelato project in Australia. His efforts include cleaning up Linux's large filesystem support and removing 32-bit filesystem limitations. When I asked him about the new 64-bit filesystem limits, he offered a comprehensive answer and a link to his page on large filesystem support (quoted in the thread below). The full thread follows.
Reaching beyond terabytes, beyond petabytes, on into exabytes. I feel this sudden discontent with my meager 60 gigabyte hard drive...
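To see where the 2TB figure comes from, here is a minimal sketch in C, assuming the common 512-byte sector size. The function name max_bytes_for_sector_bits is made up for illustration and does not appear in Peter's patch. An unsigned 32-bit sector number can address 2^32 sectors; a signed one only 2^31.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte sectors, the usual size for discs of this era. */
#define SECTOR_SIZE 512ULL

/* Largest byte size addressable when sector numbers have `bits` usable bits.
 * Illustrative helper only; not part of the kernel or of the patch. */
static uint64_t max_bytes_for_sector_bits(unsigned bits)
{
    /* 2^bits * 512 must fit in 64 bits, so bits may be at most 54. */
    assert(bits <= 54);
    return (1ULL << bits) * SECTOR_SIZE;
}
```

With 32 bits this works out to 2^41 bytes (2TB), and with 31 usable bits to 2^40 bytes (1TB), which would match the 1TB limit Peter mentions finding where a signed int was used.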
From: Peter Chubb
Subject: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 13:36:07 +1000
Hi,
At present, linux is limited to 2TB filesystems even on 64-bit
systems, because there are various places where the block offset on
disc are assigned to unsigned or int 32-bit variables.
There's a type, sector_t, that's meant to hold offsets in sectors and
blocks. It's not used consistently (yet).
The patch at
http://www.gelato.unsw.edu.au/patches/2.5.14-largefile-patch
(also available from bk://gelato.unsw.edu.au:2023/ for those using
bitkeeper)
has the following changes to address the problem:
--- bmap() changes from int bmap(struct address_space *, long)
    to sector_t bmap(struct address_space *, sector_t)
--- The partitioning code takes sector_t everywhere that makes
    sense (to allow efi, for example, to create partitions on enormous
    discs).
--- The block_sizes[] array is sector_t not int.
--- get_nr_sectors() and get_start_sect() etc., now return a
    sector_t
--- __bread() takes a sector_t as its second argument, and struct
    buffer_head contains a sector_t blocknumber field.
--- struct scsi_disk and struct gendisk have a sector_t field for
    capacity.
--- The scsi disc code now uses 16-byte commands if they're
    needed.
--- ioctl(..GETBLKSZ..) now fails with EFBIG if the size won't fit
    in a long. (at least for devices using the generic version).
Plus a smattering of casts to avoid compilation warnings (mostly so
that printk() works whether sector_t is 64 or 32 bits) and a new
CONFIG_LFS option to turn on 64-bit sector_t on 32-bit platforms.
On an old pentium I now have a 15Tb file mounted as JFS on the loop
device -- and it seems to work for almost everything. There are a few
user-mode programs that'll have to be fixed (notably parted, mkfs.???
etc) to cope with the new GETBLKSIZE failure (they should use
alternate mechanisms, e.g., GETBLKSIZE64, or just seek to the end of
the partition and look at the offset).
As this touches lots of places -- the generic block layer (Andrew?)
the IDE code (Martin?) and RAID (Neil?) and minor changes to the scsi
I've CCd a few people directly.
--
Peter Chubb
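Peter's suggestion to "seek to the end of the partition and look at the offset" can be sketched in user space as follows. This is only an illustration, not code from parted or mkfs: the helper names size_by_seek and size_of_path are hypothetical, and the sketch assumes a platform where off_t can be made 64 bits wide (requested here with _FILE_OFFSET_BITS). The 64-bit ioctl Peter alludes to is spelled BLKGETSIZE64 in linux/fs.h; it is not used here so that the sketch stays portable.

```c
#define _FILE_OFFSET_BITS 64   /* ask for a 64-bit off_t on 32-bit platforms */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the size in bytes of the file or block device open on fd,
 * or -1 on error. The previous file offset is restored on success. */
static int64_t size_by_seek(int fd)
{
    assert(fd >= 0);

    off_t saved = lseek(fd, 0, SEEK_CUR);
    if (saved == (off_t)-1)
        return -1;

    off_t end = lseek(fd, 0, SEEK_END);   /* offset of the end == size */
    if (end == (off_t)-1)
        return -1;

    if (lseek(fd, saved, SEEK_SET) == (off_t)-1)
        return -1;

    return (int64_t)end;
}

/* Convenience wrapper: open path read-only, measure it, close it. */
static int64_t size_of_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int64_t size = size_by_seek(fd);
    close(fd);
    return size;
}
```

Because the result is a 64-bit byte count rather than a sector count in a long, it does not overflow at 2TB the way the older ioctl result does.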
From: Neil Brown
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 13:53:45 +1000 (EST)
On Friday May 10, peter@chubb.wattle.id.au wrote:
> As this touches lots of places -- the generic block layer (Andrew?)
> the IDE code (Martin?) and RAID (Neil?) and minor changes to the scsi
> I've CCd a few people directly.
Thanks.
MD part looks sane to me. However I would rather the
+#ifdef CONFIG_LFS
+#include
+#else
+#undef do_div
+#define do_div(n, b)({ int _res; _res = (n) % (b); (n) /= (b); _res;})
+#endif
+
part went in linux/raid/md_k.h and defined "sector_div" (or similar)
as either do_div or ({ int _res; _res = (n) % (b); (n) /= (b); _res;})
as appropriate.
NeilBrown
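Neil's proposed sector_div helper divides a sector number in place and hands back the remainder, which is the same contract as the macro quoted in his mail. A rough user-space sketch follows. It assumes a 64-bit sector_t and uses an inline function taking a pointer, rather than the macro form, so that it compiles without GCC statement expressions; the typedef and the signature are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: sector_t is 64 bits wide. */
typedef uint64_t sector_t;

/* Divide *n by base in place and return the remainder.
 * Same effect as: _res = n % b; n /= b; _res  (from Neil's mail),
 * except that n is passed by pointer instead of by name. */
static inline uint32_t sector_div(sector_t *n, uint32_t base)
{
    assert(n != 0 && base != 0);
    uint32_t rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}
```

A caller that needs both the stripe number and the offset within the stripe gets them from one call: the quotient is left in the variable and the remainder is the return value.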
From: Andrew Morton
Subject: Re: [PATCH] remove 2TB block device limit
Date: Thu, 09 May 2002 21:05:37 -0700
Peter Chubb wrote:
> As this touches lots of places -- the generic block layer (Andrew?)
> the IDE code (Martin?) and RAID (Neil?) and minor changes to the scsi
> I've CCd a few people directly.
That would be more Jens and aviro than I.
My vote would be: just merge the sucker while it still (almost)
applies. 2TB is a showstopper for some people in 2.4 today. Obviously
2.6 will need 64-bit block numbers.
The next obstacle will be page cache indices into the blockdev mapping.
That's either an 8TB or 16TB limit, depending on signedness correctness.
One minor point - it is currently not possible to print sector_t's.
This code:
printk("%lu%s", some_sector, some_string);
will work fine with 32-bit sector_t. But with 64-bit sector_t it
will generate a warning at compile-time and an oops at runtime.
The same problem applies to dma_addr_t. Jeff, davem and I kicked
that around a while back and ended up deciding that although there
are a number of high-tech solutions, the dumb one was best:
--- 2.5.14/include/linux/types.h~sector_t-printing Thu May 9 17:08:13 2002
+++ 2.5.14-akpm/include/linux/types.h Thu May 9 17:08:13 2002
@@ -120,8 +120,10 @@ typedef __s64 int64_t;
#ifdef BLK_64BIT_SECTOR
typedef u64 sector_t;
+#define FMT_SECTOR_T "%Lu"
#else
typedef unsigned long sector_t;
+#define FMT_SECTOR_T "%lu"
#endif
#endif /* __KERNEL_STRICT_NAMES */
--- 2.5.14/fs/buffer.c~sector_t-printing Thu May 9 17:08:13 2002
+++ 2.5.14-akpm/fs/buffer.c Thu May 9 17:09:35 2002
@@ -179,7 +179,8 @@ __clear_page_buffers(struct page *page)
static void buffer_io_error(struct buffer_head *bh)
{
- printk(KERN_ERR "Buffer I/O error on device %s, logical block %ld\n",
+ printk(KERN_ERR "Buffer I/O error on device %s,"
+ " logical block " FMT_SECTOR_T "\n",
bdevname(bh->b_bdev), bh->b_blocknr);
}
From: Anton Altaparmakov
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 09:43:06 +0100
At 05:05 10/05/02, Andrew Morton wrote:
>The same problem applies to dma_addr_t. Jeff, davem and I kicked
>that around a while back and ended up deciding that although there
>are a number of high-tech solutions, the dumb one was best:
Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu and
typecast (unsigned long long)sector_t_variable in the printk.
May be ugly, but isn't it correct that you actually need the above typecast
on some architectures where %Lu == unsigned long long != u64?
Anton
From: Andrew Morton
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 02:04:46 -0700
Anton Altaparmakov wrote:
> Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu and
> typecast (unsigned long long)sector_t_variable in the printk.
Agree. The nice thing about the typecast is that you
can format the output with %06Lx, %9Ld, %Lo or whatever.
The FMT_SECTOR_T thing forces you to use the chosen formatting.
From: Jens Axboe
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 11:05:14 +0200
On Fri, May 10 2002, Anton Altaparmakov wrote:
> Why not the even dumber one? Forget FMT_SECTOR_T and always use %Lu and
> typecast (unsigned long long)sector_t_variable in the printk.
I like that better too, it's what I did in the block layer too.
--
Jens Axboe
From: Peter Chubb
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 19:53:45 +1000
Jens> I like that better too, it's what I did in the block layer too.
That's exactly what I did in the patch....
Except most places I used u64 not unsigned long long (it's the same
thing on all architectures, and much shorter to type).
Peter C
From: Jens Axboe
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 12:01:05 +0200
On Fri, May 10 2002, Peter Chubb wrote:
> That's exactly what I did in the patch....
Excellent
> Except most places I used u64 not unsigned long long (it's the same
> thing on all architectures, and much shorter to type).
Patch looks fine to me. I was hoping someone would do the grunt
conversion work when I introduced sector_t, thanks! :-)
--
Jens Axboe
From: Anton Altaparmakov
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 12:43:43 +0100
At 10:53 10/05/02, Peter Chubb wrote:
>Except most places I used u64 not unsigned long long (it's the same
>thing on all architectures, and much shorter to type).
I have been told that this is wrong (it was on this list but I can't
remember who said it - it was one of the prominent kernel hackers... (-;).
u64 is not necessarily unsigned long long type and this causes compilation
problems on some architectures (apparently).
Anton
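The approach the thread converges on, casting to unsigned long long at the call site, looks roughly like this in user space. snprintf stands in for printk, the function name format_block_error is hypothetical, and sector_t is assumed to be a 64-bit type. The cast is to unsigned long long rather than to a fixed-width typedef because, as Anton notes, a 64-bit typedef need not be unsigned long long on every architecture (on many 64-bit systems uint64_t is unsigned long). The sketch uses %llu, the standard C spelling; %Lu for long long is an extension that printk and glibc accept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption for this sketch: sector_t is 64 bits wide. */
typedef uint64_t sector_t;

/* Format a buffer I/O error message into buf. Returns what snprintf
 * returns: the number of characters the full message needs. */
static int format_block_error(char *buf, size_t len,
                              const char *dev, sector_t block)
{
    assert(buf != NULL && len > 0 && dev != NULL);
    /* The cast makes the argument match %llu whatever sector_t really is. */
    return snprintf(buf, len,
                    "Buffer I/O error on device %s, logical block %llu\n",
                    dev, (unsigned long long)block);
}
```

As Andrew points out, the cast also leaves the caller free to pick any conversion, such as %06llx, instead of being tied to one fixed format macro.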
From: Martin Dalecki
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 06:51:34 +0200
Peter Chubb wrote:
> Hi,
> At present, linux is limited to 2TB filesystems even on 64-bit
> systems, because there are various places where the block offset on
> disc are assigned to unsigned or int 32-bit variables.
>
> There's a type, sector_t, that's meant to hold offsets in sectors and
> blocks. It's not used consistently (yet).
>
The IDE part of it appears to be sane. I will take it.
From: Peter Chubb
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 14:29:24 +1000
Andrew> That would be more Jens and aviro than I.
OK, I'll forward it to them... (And I've added them to the CC list here)
Andrew> My vote would be: just merge the sucker while it still
Andrew> (almost) applies. 2TB is a showstopper for some people in 2.4
Andrew> today. Obviously 2.6 will need 64-bit block numbers.
Yes. But not always, I think --- the overhead on low-end boxes I
think may be prohibitive. It should be a configuration option.
Andrew> The next obstacle will be page cache indices into the blockdev
Andrew> mapping. That's either an 8TB or 16TB limit, depending on
Andrew> signedness correctness.
It's OK at 16TB now. That was the point of the bmap() change.
And if you go to larger pages, or a larger index in the page cache,
you can get even bigger (but I don't think it's necessary just now).
Andrew> One minor point - it is currently not possible to print
Andrew> sector_t's. This code:
Andrew> printk("%lu%s", some_sector, some_string);
Andrew> will work fine with 32-bit sector_t. But with 64-bit sector_t
Currently, sector_t is cast to u64 everywhere it's printed out, and
the format %llu used. I looked at the possibility of using a PRIsector
macro a la inttypes.h but thought the result incredibly ugly. Mind
you, casts aren't particularly clean.
Peter C
From: Peter Chubb
Subject: Re: [PATCH] remove 2TB block device limit
Date: Sat, 11 May 2002 05:12:12 +1000
> Jeremy Andrews writes:
Jeremy> Peter, Out of curiosity, what then does the new filesystem
Jeremy> limit become, on a 64-bit system? Will all filesystems
Jeremy> support your changes?
This depends on the file system.
See
http://www.gelato.unsw.edu.au/~peterc/lfs.html
(which I'm intending to update next week, after some testing to
check the new limits with my new code -- I found the 1TB limit in
the generic code (someone using a signed int instead of unsigned long))
There are three different limits that apply:
--- The physical layout on disc (e.g., ext2 uses 32-bit for block
numbers within a file system; thus the max size is
(2^32-1)*block_size; although it's theoretically possible to use
larger blocksizes, the current toolchain has a maximum of 4k,
thus the largest size of an ext[23] filesystem is ((2^32)-1)*4k
bytes --- around 16TB)
It's extremely unlikely that you'd want to use a non-journalled
file system on such a large partition, so your best bets are
reiserfs, jfs or XFS. jfs and xfs work well on enormous
partitions on other platforms; the current version of reiserfs is
somewhat limited, but version 4 will allow larger file systems.
--- Limitations imposed by the partitioning scheme.
As far as I know, only the EFI GUID partitioning scheme uses
64-bit block offsets, so under any other scheme you're limited to
2^32 or 2^31 blocks per disc; some use the underlying hardware
sector size, some use a block size that's multiple of this.
--- The page cache limit (which on a 32-bit system is 16TB; on a 64
bit system is 18 EB)
Jeremy> Mind if I quote you on my webpage?
Go ahead
--
Peter Chubb
http://www.gelato.unsw.edu.au
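The on-disc and page cache limits Peter describes are simple products, sketched below. The function names are illustrative, the page cache model (a fixed number of index bits times the page size) is a simplification, and the 4k page size used in the discussion of the results is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* On-disc limit for a filesystem with 32-bit block numbers:
 * (2^32 - 1) blocks of block_size bytes each. */
static uint64_t ext2_max_bytes(uint32_t block_size)
{
    assert(block_size != 0);
    return (uint64_t)UINT32_MAX * block_size;
}

/* Page cache limit: 2^index_bits pages of page_size bytes each.
 * index_bits is 32 for an unsigned 32-bit index, 31 if a signed
 * type is used somewhere along the way. */
static uint64_t page_cache_max_bytes(unsigned index_bits, uint32_t page_size)
{
    assert(index_bits <= 32 && page_size != 0);
    return ((uint64_t)1 << index_bits) * page_size;
}
```

With 4k blocks the first function gives just under 16TB. With 4k pages the second gives 16TB for a 32-bit index and 8TB for a 31-bit one, the two figures Andrew mentions earlier in the thread.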
From: Andreas Dilger
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 17:46:23 -0600
On May 11, 2002 05:12 +1000, Peter Chubb wrote:
> See http://www.gelato.unsw.edu.au/~peterc/lfs.html
> (which I'm intending to update next week, after some testing to
> check the new limits with my new code -- I found the 1TB limit in
> the generic code (someone using a signed int instead of unsigned long))
Any chance you could rename this from "LFS" to something else (e.g. LBD
for Large Block Device). LFS == Large File Summit which describes the
use/access of > 2GB _files_ on 32-bit systems under Unix.
> There are three different limits that apply:
>
> --- The physical layout on disc (e.g., ext2 uses 32-bit for block
> numbers within a file system; thus the max size is
> (2^32-1)*block_size; although it's theoretically possible to use
> larger blocksizes, the current toolchain has a maximum of 4k,
> thus the largest size of an ext[23] filesystem is ((2^32)-1)*4k
> bytes --- around 16TB)
For 64-bit systems like Alpha, it is relatively easy to use 8kB blocks for
ext3. It has been discouraged because such a filesystem is non-portable
to other (smaller page-sized) filesystems. Maybe this rationale should
be re-examined - I could probably whip up a configure option for
e2fsprogs to allow 8kB blocks in a few hours.
Does x86-64 and/or ia64 actually _use_ > 4kB page sizes? If so, it
may be more worthwhile to allow larger block sizes with e2fsprogs.
It may be that the kernel supports >4kB blocks already on systems with
larger PAGE_SIZE, I don't know (no way for me to test this).
> It's extremely unlikely that you'd want to use a non-journalled
> file system on such a large partition, so your best bets are
> reiserfs, jfs or XFS.
I find it somewhat ironic that you suggest reiserfs over ext3, when in
fact they both currently have the same 16TB filesystem limit. On your
web page, you say the ext[23] limit is 1TB, which it definitely is not
(unless there are bugs in the code). There is currently a 16TB filesystem
limit for 4kB blocks, but there are plans towards fixing that also.
> --- Limitations imposed by the partitioning scheme.
> As far as I know, only the EFI GUID partitioning scheme uses
> 64-bit block offsets, so under any other scheme you're limited to
> 2^32 or 2^31 blocks per disc; some use the underlying hardware
> sector size, some use a block size that's multiple of this.
LVM does not need to have partitions, and presumably EVMS using Linux
or AIX LVM devices doesn't either.
Cheers, Andreas
From: David Mosberger
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 10 May 2002 17:07:26 -0700
>>>>> On Fri, 10 May 2002 17:46:23 -0600, Andreas Dilger said:
Andreas> For 64-bit systems like Alpha, it is relatively easy to use
Andreas> 8kB blocks for ext3. It has been discouraged because such
Andreas> a filesystem is non-portable to other (smaller page-sized)
Andreas> filesystems. Maybe this rationale should be re-examined -
Andreas> I could probably whip up a configure option for e2fsprogs
Andreas> to allow 8kB blocks in a few hours.
If you do this, please consider allowing a block size up to 64KB.
The ia64 kernel offers a choice of 4, 8, 16, and 64KB page size.
Andreas> Does x86-64 and/or ia64 actually _use_ > 4kB page sizes?
ia64 linux normally uses > 4KB. The recommended page size at the
moment is 16KB. I didn't think 64KB would become realistic for quite
some time, but performance is surprisingly good, even on today's
systems.
--david
Remember the good old times...
... when the kernel hackers were trying to get around the 2 GB per file limit? Must have been... six months ago :)
Re: Remember the good old times...
Yeah, you're right. I guess it's symbolic that Linux is transitioning from a hobby OS (though a very good one) to a serious OS that can handle serious things, like GIANT file systems. I mean, for me to hit a 16TB volume limit is just silly, but I can imagine that plenty of big companies, universities and other people would run into this quite frequently, and thus probably don't use Linux for these sorts of massive file storage things.
I think it also irks the developers when there's an obvious weakness in Linux. They see the weakness as a challenge to improve the system so that it doesn't exist anymore. It's kinda funny that way. If you bash Linux for legitimate weaknesses, they just come back and fix them.
After all, if Linux is going to have World Domination(tm), then the only limits it should have are ones that are impossible to hit. :)
Small customers - big filestores
I have many small customers (physician practices) for which I install Linux-based document management systems. Even in relatively small settings (4-6 doc offices) I have run into problems due to the 2TB limit. I personally cannot wait until the 2TB limitation is laughable.
Uhhh....
Six months? Try two plus years. They fixed that in 2.3.x. 2.4 has been out for a year and 7 months. The 2.4-test series began in May of 2000. They merged largefile support before that.