On Wed, Apr 28, 2010 at 3:13 PM, Sergio Callegari
<sergio.callegari@gmail.com> wrote:
But why not use a .gitattributes filter to recompress the zip/odp file
with no compression, as I suggested? Then you can just dump the whole
thing into git directly. When you change the file, only the changes
need to be stored thanks to delta compression. Unless your
presentation is hundreds of megs in size, git should be able to handle
that just fine already.
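
A minimal sketch of what the recompression step of such a filter could look like, assuming Python's standard zipfile module. The function name store_uncompressed is made up for illustration, and a real clean filter would additionally read the archive from stdin and write the result to stdout:

```python
import io
import zipfile

def store_uncompressed(data: bytes) -> bytes:
    """Rewrite a zip archive so every member is stored without compression."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
         zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            # Copy name, timestamp and permission bits from the original
            # member so the same input always yields the same output bytes.
            stored = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            stored.compress_type = zipfile.ZIP_STORED
            stored.external_attr = info.external_attr
            dst.writestr(stored, src.read(info))
    return out.getvalue()
```

Hooking it up would mean a line like "*.odp filter=zipstore" in .gitattributes plus a filter.zipstore.clean entry in the git config that runs the script; "zipstore" is a hypothetical driver name. With the members stored raw, a small edit to the document shows up as a small change in the blob, which is what lets delta compression do its job.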
But then you're digging around inside the pdf file by hand, which is a
lot of pdf-specific work that probably doesn't belong inside git.
Worse, because compression programs don't always produce the same
output, this operation would most likely actually *change* the hash of
your pdf file as you do it. (That's also true for openoffice files,
but at least those are just plain zip files, and zip files are
somewhat less of a special case.)
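
To illustrate the hash problem with a rough sketch: the same payload compressed with two different settings decodes to identical data, but the compressed bytes are not the same, so a hash over the container changes. This uses Python's zlib as a stand-in; the payload is made up, and real pdf streams would go through whatever compressor the pdf tool uses:

```python
import hashlib
import zlib

# Made-up payload standing in for one compressed stream inside a document.
payload = b"the same page content, repeated many times\n" * 200

fast = zlib.compress(payload, 1)  # lowest compression level
best = zlib.compress(payload, 9)  # highest compression level

# Both streams decode to the identical payload...
assert zlib.decompress(fast) == payload
assert zlib.decompress(best) == payload

# ...but a hash taken over the compressed bytes depends on how they
# were produced, not only on what they decode to.
print("level 1:", hashlib.sha1(fast).hexdigest())
print("level 9:", hashlib.sha1(best).hexdigest())
```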
In what way? I doubt you'd get more efficient storage, at least.
Git's deltas are awfully hard to beat.
A resource fork by any other name is still a resource fork, and it's
still ugly. If you really need something like this, just cache the
attributes in a file alongside the big file, and store both files in
the git repo.
I guess. For something like that, though, Debian's pristine-tar
tool seems to already solve the problem and works with any VCS, not
just git.
I guess this would be mostly harmless; the implementation could mirror
the filter stuff.
In that case, I'd like to see some comparisons of real numbers
(memory, disk usage, CPU usage) when storing your openoffice documents
(using the .gitattributes filter, of course). I can't really imagine
how splitting the files into more pieces would really improve disk
space usage, at least.
Having done some tests while writing bup, my experience has been that
chunking-without-deltas is great for these situations:
1) you have the same data shared across *multiple* files (e.g. the same
images in lots of openoffice documents with different filenames);
2) you have the same data *repeated* in the same file at large
distances (so that gzip compression doesn't catch it; e.g. VMware
images);
3) your file is too big to work with the delta compressor (e.g. VMware
images).
However, in my experience #1 is pretty rare and #2 and #3 aren't in
your use case. And deltas-between-chunks is not very easy to do,
since it's hard to guess which chunks might be "similar" to which
other chunks.
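
As a toy illustration of case #1 (this is not bup's actual algorithm; the window size, mask, hash constants, and the helper names chunks, chunk_ids, and blob are all assumptions picked for the example), content-defined chunking cuts wherever a rolling hash over the last few bytes matches a pattern. Because the cut points depend only on nearby content, the same image embedded in two different files mostly splits into the same chunks:

```python
import hashlib
import random

WINDOW = 48            # bytes covered by the rolling hash (assumed value)
MASK = (1 << 11) - 1   # boundary pattern; roughly 2 KiB average chunk
BASE = 257
MOD = (1 << 31) - 1
OUT = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window

def chunks(data: bytes):
    """Yield pieces of data, cutting where the rolling hash matches MASK."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            # Drop the contribution of the byte that slid out of the window.
            h = (h - data[i - WINDOW] * OUT) % MOD
        if (h & MASK) == MASK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def chunk_ids(data: bytes) -> set:
    """Return the set of SHA-1 ids of the chunks of data."""
    return {hashlib.sha1(c).hexdigest() for c in chunks(data)}

rng = random.Random(0)

def blob(n: int) -> bytes:
    """Return n pseudo-random bytes (stand-in for real file content)."""
    return bytes(rng.getrandbits(8) for _ in range(n))

image = blob(200_000)                       # data shared by both documents
doc_a = blob(5_000) + image + blob(7_000)   # different surrounding content
doc_b = blob(9_000) + image + blob(3_000)

shared = chunk_ids(doc_a) & chunk_ids(doc_b)
print(len(chunk_ids(doc_a)), "chunks in doc_a,", len(shared), "shared with doc_b")
```

Only the chunks straddling the edges of the shared image differ between the two files. Note that this finds *identical* chunks only; it does nothing for chunks that are merely similar, which is the hard part mentioned above.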
Personally, I think it would be great if git could natively handle
large numbers of large binary files efficiently, because there are a
few use cases I would have for it. But whenever I start investigating
my use cases, it always turns out that "supporting large files" is
only the tip of the iceberg, and there's a huge submerged mass that
becomes obvious as soon as you start crashing into it.
The bup use case (write-once, read-almost-never, incremental backups)
is a rare exception in which fixing *only* the file size problem has
produced useful results.
Have fun,
Avery