On Fri, May 07, 2010 at 12:56:59AM +0200, Sergio Callegari wrote:
It gets the cleaned version. The idea is that the sha1 in the repository
is the "official" version, and anything else is simply a representation
suitable for use on your platform.
So in that sense, clean/smudge filters are very visible. Splitting into
multiple blobs would mean that as far as git was concerned, your data
_is_ multiple blobs. And it would diff and merge them as separate
entities. That makes sense for something where that breakdown happens
along user-visible lines, and is useful to the user. For example,
automatically breaking down a tarfile into its constituent files might
be a more desirable representation for git to diff and merge (though the
current implementation of clean/smudge filters does not allow breaking
the file into multiple blobs).
But as I argued later in my email, I think that is not the right model
for performance-oriented multiblobs. Splitting a file at certain length
boundaries simply because it is large is going to be awkward when you
want to look at it as a whole item.
Yes and no. If you are storing some set of N bytes, then you need to
store N bytes whether they are in a single blob or multiple blobs. The
only way that multiple blobs can improve on that is if you can find
better delta candidates by doing so. Which means that you are just as
well off by splitting the large blob when looking for delta candidates
as you are in splitting it in storage.
Sure. And for those cases, I think clean/smudge filters are perhaps
already doing the job.
As an aside, I don't think that _git_ cares about pristine tars. It is
that people want to store compressed tarfiles in git that have a
particular checksum because they are interacting with some _other_
system that cares about the tarfile. In your case, where you don't care
about the particular byte pattern of the odf file, it is much simpler.
So clean/smudge filters are even easier there.
Right. The question is how the structured contents are handled
internally by the SCM. Git's choice is to leave contents as opaque as
possible, and let you handle conversion at the boundaries: textconv (or
a custom external diff) for viewing diffs, and clean/smudge for working
tree files.
Yes, and that is how it works with clean/smudge filters.
I do think it makes sense, but only for some applications. But for those
applications, rather than a multiblob, I think creating a tree structure
is a natural fit, and works well with existing git tools. But again,
that isn't really implemented. Blobs must stay as blobs. So the closest
you can come is saying:
- an ODF file may be a collection of structured text, but we will
store it marshalled as a single binary data stream
- we don't want it compressed for performance reasons, so we won't use
the native marshalling format. Instead, we'll clean/smudge it as an
uncompressed collection format inside of git (e.g., a zip without
compression, or a tarball).
- even though git doesn't understand the structure, we _do_ want to
see the structure when doing diffs or merges. For that, we define
custom diff/merge drivers which can operate on the file. They can
unpack the structure as necessary.
which is really not too bad, and it means git can remain blissfully
unaware of the details of any format.
Yes, I would just do it manually. But in theory a clean/smudge filter
could be the right sort of place for that, if somebody made an
implementation that handle exploding a single file into an arbitrary
tree/blob hierarchy. I think it was discussed when filters were
introduced, but the complexity (both in terms of implementation, and
in meeting user expectations) prevented anyone from moving it forward.
-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html