GIT(7) -- 03/05/2007
NAME
git - the stupid content tracker
Well, I use git for tracking contents. That means, for example,
installation trees for some application. Let's take a typical TeXlive
tree as an example. Those trees contain, among other things,
directories where new fonts/formats/whatever get placed as things run.
Quite a few of them start out empty, but their permissions have to
correspond to their purpose (for example, some are world-writable).
I see little chance to get this achieved without doing something like
find -type d -empty -execdir touch {}/.git-this-is-empty \;
before every checkin and
find -name .git-this-is-empty -exec rm -- {} +
after every checkout. Which is pretty stupid.
As some anecdotal stuff, I did something like
mkdir test
cd test
git-init
touch README
git-add README # another peeve: why is no empty reference point possible?
git-commit -a -m "Initial branch"
git checkout -b newbranch master
unzip ../somearchive -d subdir
git add subdir
git commit -a -m "Add subdir"
git checkout -b newbranch2 master
and expect to have a clean slate. No such luck: without warning, all
empty directories in the zip file are still remaining within subdir,
which as a consequence has not been cleaned up.
So even if one is of the opinion that empty directories are not worth
putting into the repository: if I check in an entire subdirectory
hierarchy and then switch to a branch where this subdirectory is not
existent, I expect the subdirectory to be _gone_, and not have some
littering of empty directories lying around.
And that git-diff can see nothing wrong with that does not really
improve things.
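The leftover skeleton described above can be inspected, and if desired pruned, from outside git. What follows is a minimal sketch and not anything git itself does: the names "list_empty_dirs" and "prune_empty_dirs" are made up for illustration, and the -empty and -mindepth tests assume GNU or BSD find.

```shell
#!/bin/sh
# Hypothetical helpers; neither is part of git.

# Print every empty directory below $1 (default: .), skipping the .git directory.
list_empty_dirs() {
    find "${1:-.}" -mindepth 1 -name .git -prune -o -type d -empty -print
}

# Remove empty directories below $1, children first, leaving .git alone.
prune_empty_dirs() {
    # -depth visits children before their parent, so a parent that becomes
    # empty once its empty children are removed is removed as well.
    find "${1:-.}" -mindepth 1 -depth -type d -empty \
        ! -path '*/.git' ! -path '*/.git/*' -exec rmdir {} \;
}
```

Note that prune_empty_dirs removes every empty directory in the work tree, including ones that are meant to be there. Which of them are intentional is exactly the information git does not record.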
So if git is supposed to be a content tracker, I can't see a way
around it actually being able to track content, and empty directories
_are_ content. It can't leave them flying around with arbitrary
permissions on them when I switch branches or tags. And the
workaround using "touch" mentioned above is really awful to do
manually all ...
Hi,
If you had the idea already, I wonder why you did not find it. It's not really anything like hard to find:
http://git.or.cz/gitwiki/GitFaq#head-1fbd4a018d45259c197b169e87dafce2a3c6b5f9
Ciao,
Dscho
-
The FAQ answer is weaseling on several accounts:
a) No, git only cares about files, or rather git tracks content and empty directories have no content.
In the same manner as empty regular files have no contents, and git tracks those. Existence and permissions are important.
b) The problem is not just that empty directories don't get added into the repository. They also don't get removed again when switching to a different checkout.
When git-diff returns zero, I expect a subsequent checkout to not leave complete empty hierarchies around because git can't delete any empty leaves which it chose not to track.
-- David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
Hi,
We do not track permissions of directories at all. This is because Git is primarily meant to track source code, and most "permissions" (i.e.
I _like_ the behaviour that Git does not remove a directory it added, when I put some untracked file into it. And switching back to that branch, Git has no problems, because it sees that the directory is already there. In case of a file, it would complain, and rightfully so.
See the fundamental difference between a file and a directory now? I think it boils down to "an empty directory has _no_ contents, but an empty file has an _empty_ content".
Ciao,
Dscho
-
Yes, but directories really are different.
First off, git wouldn't track the permissions anyway (git tracks execute bits, but for directories that _has_ to be set or git couldn't use them itself, so that's not going to happen).
Second, and much more important, the directories will exist or not
Bzzt. Wrong.
We *do* remove directories when all files under them go away.
HOWEVER (and this is where one of the reasons for not tracking them comes in):
** YOU CANNOT REMOVE A DIRECTORY IF IT HAS SOME UNTRACKED CONTENTS **
Think about that for five seconds, then think about it some more. Ponder it.
So the fact is, git *already* does as good of a job as it could possibly do wrt directories that go away: it tries to remove them if all the files that are tracked in it have gone away. But that leaves a very common case, namely switching to another branch without those files, and the directory still having stale object files etc build crud in it.
A SCM *must*not* just remove that directory. It would be horrible. The fact that it has untracked files in it does not make those untracked files "unimportant". Maybe you feel that way about object files, but what about tracking some important parts of your home directory - does the fact that you don't necessarily track *all* of it mean that the rest is totally unimportant and that git should just remove it?
HELL NO!
So directories really _are_ problematic. You cannot (and should not) track them the same way as you track a file.
And the difference is very fundamental indeed: when you track a regular file, you track *all* of its content. But when you track a directory, you don't track its content *at*all*.
Think about that, and then think about the fact that git is defined as a "content tracker", and it's not "weaselly" at all to say that you don't track directories.
So your argument is totally bogus. When you track an empty file, you very much track the *content* of that file, and "empty" ...
Btw, don't get me wrong: I think that in order to be better at tracking other SCMs' idiotic choices, we could (and I foresee that we eventually have to) try to track empty directories as a special case too.
So I'm not _against_ the notion of tracking empty directories, and I would welcome patches that do so.
As I mentioned in some earlier thread when this came up a few weeks ago, I actually suspect that the "subproject" support probably ended up making it easier, because in many ways an "empty directory" is very close to an "anonymous subproject" from a low-level plumbing standpoint (even if it is *not* so from a high-level standpoint).
So I suspect that adding support for empty directories ends up being about just slightly extending the places that now have subproject support to know about a new situation.
But I do want to point out that "tracking a directory" is not at all the same thing as "tracking a file", no matter how much you try to argue otherwise. The semantics are totally different, and it all boils down to the fact that when you track a file, you are always talking about the *full* content of the file, while tracking a directory is always about tracking just a *subset* of the contents of the directory.
Of course, with directories, there's the trivial case where the subset happens to be everything, but that is neither the common nor the interesting case. All the interesting and complex cases happen exactly when the directory has untracked files in it, and at that point
- you really aren't tracking "contents" any more
- you can no longer recreate the directory from the data you have (so you cannot remove it on branch switches etc)
- ergo: you're not a content tracker any more, you're a "container" tracker.
And really, the "nontracked files in a directory" is the *default* thing, not some really unusual thing that we could disallow.
But I'm not against adding support for "container tracking". I just want people to ...
Since I did not try to argue this, could you beat another strawman? I have seen this prepackaged rant already, but it does not really address the problem I have been experiencing. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
How about a bit of honesty?
Here's the quote:
"The FAQ answer is weaseling on several accounts:
a) No, git only cares about files, or rather git tracks content and
empty directories have no content.
In the same manner as empty regular files have no contents, and git
tracks those. Existence and permissions are important."
You called it "weaselly" to say that git tracks only content, and then
very much tried to equate "existence and permissions" with content.
That's the part I answered.
So it wasn't a strawman, it was a direct answer to your assertion. Now go
away and either come back with the patch to implement it (that I have
encouraged you to do), or add a ".gitignore" file to the directory (that
others have told you will solve your problems).
Don't bother talking crap.
Linus
-
I believe David's point was different. If you check out a branch, create an empty directory in this branch (probably a placeholder, either for future versioned files, or for generated files), you cannot tell git "this empty directory is in this branch, but not in other ones" without adding a file in it.
So, doing "git-checkout anotherbranch", this empty directory doesn't go away. It's just unversioned in both branches, git won't touch it.
-- Matthieu
-
Right. Which is the suggested setup: add an empty ".gitignore" file to the directory, and you're done. It now acts "as if" git tracked the directory (git will remove the directory when switching branches), but without the lie that we really track any directory contents. Linus -
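For trees created by an installer or an archive, the ".gitignore" suggestion above can be applied mechanically. The following is a minimal sketch under stated assumptions: "add_placeholders" is a made-up name, not a git command, and the -empty test assumes GNU or BSD find.

```shell
#!/bin/sh
# Hypothetical helper, not part of git: drop an empty .gitignore into every
# empty directory below $1 (default: .), skipping git's own .git directory.
add_placeholders() {
    find "${1:-.}" -name .git -prune -o -type d -empty \
        -exec sh -c 'for d in "$@"; do : > "$d/.gitignore"; done' sh {} +
}
```

After running it, "git add ." records the placeholders, and a later branch switch removes the directories together with them, as described in the message above. The cost is one extra file in every otherwise empty directory.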
That implies that every directory in a versioned tree will exclusively be created under manual and conscious control. Not by running some installer or script, unpacking some archive and so on. But if every content on a disk was created and put there under manual control of the disk owner, we could still get along with floppy disks quite fine. In practice, much more content gets sent around and juggled than what is under immediate supervision of the user.
This is getting silly: you don't need to pull rabbits out of your head. You said that you are not inclined to do any work in that area since it does not touch _your_ use cases (well, at least not to a degree that you consider worth bothering about) but that is no reason to get into ridiculous arguments about other usage. No code will come of that.
-- David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
How hard is it for you to admit that I also said "please send in a patch"?
I don't need it. You do. You do the work. I'm just explaining why the work hasn't been done.
Linus
-
Yup, that was one sentence in about 5 pages of bile. In contrast, Junio gave a good overview of the technical areas involved here, and estimates about what to do there best. That's a constructive way to incite somebody to delve into the task and try to see whether he can come up with something. But 5 pages of what amounts to "you are an idiot, come up with a
No, you are _defending_ why the work has not been done. This rationalizing around the bush is a waste of time. You probably have spent quite a bit more time with your venting than Junio did with his technical analysis, and the latter has been much more helpful. So why waste all that time and adrenaline on something where you have already said all you consider relevant? The arguments don't get any stronger by shouting, and it is not like you are inconvenienced in any manner if somebody takes a look at the matter.
-- David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
Gaah. I'm a damn softie (and soft in the head too, for writing the code).
Ok, here's a trivial patch to start the ball rolling. I'm really not interested in taking this patch any further personally, but I'm hoping that maybe it can make somebody else who is actually _interested_ in tracking empty directories (hint hint) decide that it's a good enough start that they can fill in the details.
This really updates three different areas, which are nicely separated into three different files, so while it's one single patch, you can actually follow along the changes by just looking at the differences in each file, which directly translate to separate conceptual changes:
- builtin-update-index.c
This simply contains the changes to update the index file. As usual, there are multiple different cases, and they boil down to:
(a) No index entry existed at all previously.
If so, a directory will first go through the "index_path()" logic, which tries to create a GITLINK entry for it, if the subdirectory is a git directory. However, the new thing is that if that fails, it will instead just create a fake empty tree entry for it, and set the index mode to S_IFDIR.
(b) It was a gitlink entry before.
It stays as a gitlink entry, even if it cannot be indexed, and a file/symlink entry in the working tree is a conflict error.
(c) It was an empty directory entry before.
A directory stays as an empty directory entry, and a file/symlink entry in the working tree is a conflict error.
Somebody should check that we properly delete the directory entry if we add a file under it, I honestly didn't bother to go through all the logic. I *think* we do it correctly just thanks to all the previous code for gitlinks.
Whatever. What I'm trying to say is that the changes are fairly straightforward, but if somebody decides to push this, they need to think about it a lot more than I'm ready to right ...
Well, kudos. Together with the analysis from Junio, this seems like a good start. Would you have any recommendations about what stuff one should really read in order to get up to scratch about git internals? -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Well, you do need to understand the index. That's where all the new subtlety happens. The data structures themselves are trivial, and we've supported empty trees (at the top level) from the beginning, so that part is not anything new.
However, now having a new entry type in the index (S_IFDIR) means that anything that interacts with the index needs to think twice. But a lot of that is just testing what happens, and so the first thing to do is to have a test-suite.
There's also the question about how to show an empty tree in a diff. We've never had that: the only time we had empty trees was when we compared a totally empty "root" tree against another tree, and then it was obvious. But what if the empty tree is a subdirectory of another tree - how do you express that in a diff? Do you care? Right now, since we always recurse into the tree (and then not find anything), empty trees will simply not show up _at_all_ in any diffs.
And what about usability issues elsewhere? With my patch, doing something like a
git add directory/
still won't do anything, because the behaviour of "git add" has always been to recurse into directories. So to add a new empty directory, you'd have to do
git update-index --add directory
and that's not exactly user-friendly.
So do you add a "-n" flag to "git add" to tell it to not recurse? Or do you always recurse, but then if you notice that the end result is empty, you add it as a directory?
All the logic for that whole directory lookup is in git/dir.c, and that code takes various flags because different programs want different things (show "ignored" files, or ignore them? Show empty directories or ignore them? etc).
So primarily, I think the job is:
- thinking about the index, and the interactions when adding a directory or adding files under a directory that already exists. I *think* we get all the corner cases right, because they should be exactly the same as with subprojects, but hey, ...
Another issue I thought about was what you would do in step 3
of the following:
1. David says "mkdir D; git add D"; you add S_IFDIR entry in
the index at D;
2. David says "date >D/F; git add D/F"; presumably you drop D
from the index (to keep the index more backward compatible)
and add S_IFREG entry at D/F.
3. David says "git rm D/F".
Have we stopped keeping track of the "empty directory" at this
point?
-
Sadly yes. But I don't think that's what the folks who want to track empty directories want to have happen here. Which is why I'm thinking we just need to track the directory, as a node in the index, even if there are files in it, and even if we got that directory and its contained files there by just unpacking trees. -- Shawn. -
I take this back. I really don't want that behavior. If I do:
mkdir -p foo/bar
echo hello >foo/bar/world
git add foo
git rm -f foo/bar/world
I never asked for foo/bar or foo to stay. In fact I want them to disappear from Git entirely, as foo/bar is now empty and has no content.
But we also cannot do a special --mkdir option for update-index either, because how do we know that the user designated subtree is a directory we must always keep in the index?
So I think the only way this works is to have a new mode that we use in tree (04755 ?) that tells us not only is this thing a subtree, but also that the user wants it to stay here, even if it is empty. Those trees are always in the index as a real tree entry, even if there are files contained in it.
And as far as getting that directory entry created/removed from the index, well, I think a special flag to update-index would be in order, much like --chmod=[+-]x.
Just my $0.0002 USD, which really ain't worth much at all.
-- Shawn.
-
Well, outside git, if you do
$ mkdir -p foo/bar
$ echo hello > foo/bar/world
$ rm -f foo/bar/world
You didn't ask foo/bar to stay either, and still, it's quite natural to have it stay in your filesystem. So, the same way you'd have run "rm -r foo", it seems reasonable to me to ask for "git-rm -r foo" if the user wants to get rid of foo/ itself.
-- Matthieu
-
Dear Git fellows,
A year or so ago I too would strongly advocate the need of tracking empty directories, permissions et al., it seemed so "natural" and "plain obvious" to me back then. But since that time I learned to appreciate the "contents tracking" approach, and now view directories (paths in general) only as the means for Git to know where to put the contents on checkout. This, BTW, is consistent with how Git figures container copies/renames.
No doubt mighty Git developers can add support for empty directories, manage to stay backward compatible, think out consistent user interface etc. But there's no end to how much information one may want to store in Git to make it "_file system_ contents tracking software". Starting with empty directories, one may argue then that certain installation trees also need particular file ownership, so let's store user/group names like tar does. It was mentioned already in this thread that in addition to 'rwx' we also would have to store ACLs (some OSes have only one of these concepts, some both), SELinux security contexts, perhaps other arbitrary file attributes that may be part of file system state.
Wouldn't it be better to preserve Git as a contents tracking system, and add some tools on top of it that can translate file system state into textual (or binary) form, so it can be stored in current Git? And then use this textual representation to restore actual file system attributes/layout on checkout? And the only change in Git itself would be some more hooks, for instance one hook before checking out over the old work tree, and one after the checkout. Or one can simply wrap certain Git commands to implement such hooks.
In any case, no one is going to be against the new feature if it won't break anything for those of us who find the pure contents tracking the right thing. And storing empty directories by default may not be natural for everyone. So before going into technical details of how this can possibly be implemented, ...
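The "tools on top" idea in the message above can be illustrated with a small manifest of directory modes that is itself an ordinary tracked file. This is only a sketch under stated assumptions: the function names and the "MODE<TAB>PATH" manifest format are invented for illustration, -printf requires GNU find, and paths containing tabs or newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helpers; the manifest format is an assumption of this sketch.

# Print "MODE<TAB>PATH" for every directory below ., children before parents.
save_dir_modes() {
    find . -depth -type d ! -path './.git' ! -path './.git/*' -printf '%m\t%p\n'
}

# Recreate the directories listed in manifest $1 and reapply their modes.
restore_dir_modes() {
    tab=$(printf '\t')
    # Children come first in the manifest, so a restrictive mode on a parent
    # is applied only after everything below it has been created.
    while IFS=$tab read -r mode dir; do
        mkdir -p "$dir" && chmod "$mode" "$dir"
    done < "$1"
}
```

One possible arrangement is to run "save_dir_modes > .dirmodes" before committing and "restore_dir_modes .dirmodes" after a checkout, by hand or from wrapper scripts. Because the manifest lists empty directories too, they come back on restore without git having to track them.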
Hi,
Thank you. It is my impression, too, that after a while it becomes
obvious what is good and what is not.
FWIW I just whipped up a proof-of-concept patch (so at least _I_ cannot be
accused of chickening out of writing code):
This adds the command line option "--add-empty-dirs" to "git add", which
does the only sane thing: putting a placeholder into that directory, and
adding that. Since ".gitignore" is already a reserved file name in git,
it is used as the name of this placeholder.
---
It is probably not fool-proof yet, needs documentation and a test
case. But I am really sick and tired of this discussion.
builtin-add.c | 25 +++++++++++++++++++++----
dir.c | 16 +++++++++++++++-
dir.h | 3 ++-
3 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/builtin-add.c b/builtin-add.c
index 7345479..1294840 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -47,7 +47,7 @@ static void prune_directory(struct dir_struct *dir, const char **pathspec, int p
}
static void fill_directory(struct dir_struct *dir, const char **pathspec,
- int ignored_too)
+ int ignored_too, int substitute_empty_dirs)
{
const char *path, *base;
int baselen;
@@ -63,6 +63,7 @@ static void fill_directory(struct dir_struct *dir, const char **pathspec,
if (!access(excludes_file, R_OK))
add_excludes_from_file(dir, excludes_file);
}
+ dir->substitute_empty_directories = substitute_empty_dirs;
/*
* Calculate common prefix for the pathspec, and
@@ -143,7 +144,8 @@ static const char ignore_warning[] =
int cmd_add(int argc, const char **argv, const char *prefix)
{
int i, newfd;
- int verbose = 0, show_only = 0, ignored_too = 0;
+ int verbose = 0, show_only = 0, ignored_too = 0,
+ substitute_empty_dirs = 0;
const char **pathspec;
struct dir_struct dir;
int add_interactive = 0;
@@ -191,6 +193,10 @@ int cmd_add(int argc, const char **argv, const char *prefix)
take_worktree_changes = 1;
continue;
...
Oh, one word of warning: that whole "pretend_sha1_file()" thing won't create the object itself, and when I did the limited testing that I did, I actually made sure I had a magic zero-sized tree object in my object directory. If you don't, some things will complain, because they end up getting a SHA1 that they cannot look up, because *they* didn't create that pretend entry.
I didn't know which way I wanted to go with that thing. I was kind of thinking that maybe we would just have the zero-sized OBJ_BLOB and OBJ_TREE objects as special magical things, and have all git programs just do that "pretend" at the beginning. But that kind of thing is probably just a totally unnecessary special case, and instead, that "pretend_sha1_file()" should have just been a
write_sha1_file(NULL, 0, "tree", ce->sha1);
instead.
Anyway, if there are issues with not finding an object called 4b825dc642cb6eb9a060e54bf8d69288fbee4904, then that's the empty tree object, and that pretend thing was the cause.
(The git repo itself has the empty tree as an object in it, because one of the commits has that - probably as a result of a bug, but there you have it)
Linus
-
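The object name quoted above is not arbitrary: git names an object by the SHA-1 of a short header followed by the object body, and an empty tree has no body. The sketch below recomputes it outside git; "empty_object_id" is a made-up helper name, and "sha1sum" is assumed to be available (GNU coreutils).

```shell
#!/bin/sh
# Print the object name git assigns to an empty object of type $1 (tree or blob).
empty_object_id() {
    # A git object name is the SHA-1 of the header "<type> <size>" plus a NUL
    # byte, followed by the body; an empty object has size 0 and no body.
    printf '%s 0\0' "$1" | sha1sum | cut -d' ' -f1
}

empty_object_id tree   # prints 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Inside a repository, "git hash-object -t tree /dev/null" prints the same name, and adding -w also writes the object, which is one way to make sure the empty tree exists in the object directory.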
But empty directories which were empty to start with don't go away since they are not tracked. And that means that their parents don't go away. Git will remove directories which _had_ git-tracked content prior to the checkout. But it will not register empty directories created
Linus, condescension is all very nice, but I already told you: I had a directory hierarchy created outside of git's control (every file comes into being first outside of git). This hierarchy contained empty directories. The whole hierarchy was committed into git. git silently skipped registering empty directories. Then a different version got checked out which did not contain the directory hierarchy in question. And git left the (unregistered) empty directories in, as well as all their parent directories.
But I told git to track the whole directory tree recursively. There were no uncommitted files it complained about. It is not reasonable that it is afterwards unable to remove this when I check out some other
Sure. But that it refuses to track the files makes the total behavior an annoyance. I don't complain _how_ git handles not being able to track empty directories. I complain about it not being able to track
When I tell it to track it, it should not refuse. Even if it is empty. Because if it _stayed_ empty, git can then remove it (and possibly the parents) when I check out something else.
-- David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
,----[ http://www.spinics.net/lists/git/msg30730.html ]
| From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
|
| I wouldn't personally mind if somebody taught git to just track empty
| directories too.
|
| There is no fundamental git database reason not to allow them: it's in
| fact quite easy to create an empty tree object. The problems with
| empty directories are in the *index*, and they shouldn't be
| insurmountable.
|
| [...]
`----
-- Matthieu
-
No objections as long as a patch is cleanly made without regression. It's just that nobody has agreed so far that it is "quite serious", and there is no fundamental reason against it.
-
Thanks. It certainly is not serious for the Linux kernel source, but seems awkward for quite a few situations.
Anyway, what is your take on the situation I described? That creating some directory hierarchy (happening to contain empty directories) with some external program, adding and committing it, then switching to a different branch (or maybe doing a git-reset --hard) leaves a skeleton of empty directories around?
I find this almost worse than not being able to put them into the repository: you can't get rid of them anymore either!
I'd be tempted to propose that git should remove empty subdirectories when cleaning up a removed tree in the working directory, even though that violates the principle to not delete anything it isn't tracking. But since you can't get it to track the stuff in the first place...
But the real fix would be to track them. Does some trick work possibly at checkin time, like putting an empty file into every empty directory, adding to the index, then removing all empty files explicitly from the index and then checking in, or is this hopeless to work around from the user side without affecting the repository itself?
-- David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
Didn't I say I do not have an objection for somebody who wants to track empty directories, already? I probably would not do that myself but I do not see a reason to forbid it, either.
The right approach to take probably would be to allow entries of mode 040000 in the index. Traditionally, we allowed only 100644 (blobs as regular files) and 120000 (blobs as symlinks). We recently added 160000 (commit from outer space, aka subproject). And we do that for all directories, not just empty ones.
So if you have fileA, empty/, sub/fileB tracked, your index would probably have these four entries, immediately after read-tree of an existing tree object:
100644 15db6f1f27ef7a... 0 fileA
040000 4b825dc642cb6e... 0 empty
040000 e125e11d3b63e3... 0 sub
100644 52054201c2a872... 0 sub/fileB
Making sure that empty/ directory exists in the working tree is probably done in entry.c; we have been touching that area in an unrelated thread in the past few days.
If you add sub/fileC, with "update-index" (and "add"), you invalidate the SHA-1 object name you stored for "sub" (because there is no point recomputing the tree object until you know you need a subtree for "sub" part, which does not happen until the next "write-tree"), and end up with something like:
100644 15db6f1f27ef7a... 0 fileA
040000 4b825dc642cb6e... 0 empty
040000 00000000000000... 0 sub
100644 52054201c2a872... 0 sub/fileB
100644 705bf16c546f32... 0 sub/fileC
These "missing" SHA-1 would need to be recomputed on-demand. We have had necessary infrastructure to do this "keeping untouched tree object names in the index" for quite some time, but it is not a part of the index proper (it is stored in an extension section in the index file, to keep the index compatible with older versions of git).
Having made it sound so easy, here are the issues I would expect to be nontrivial (but probably not rocket surgery either).
* unpack-trees, which is the workhorse for twoway merge (aka "switching branches") ...
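To make the mode column in the listings above easier to read, here is a small sketch that labels each entry by its mode. The function names are invented for illustration. The input format "MODE NAME STAGE PATH" follows the listings in the message; 100755 (executable file) is added because git also uses that mode, and 040000 as an index entry is the proposal under discussion, although the same mode already marks subtrees inside tree objects.

```shell
#!/bin/sh
# Hypothetical helpers for reading "MODE NAME STAGE PATH" listings.

# Map a git file mode to a short description.
describe_index_mode() {
    case $1 in
        100644) echo "regular file" ;;
        100755) echo "executable file" ;;
        120000) echo "symlink" ;;
        160000) echo "gitlink (subproject commit)" ;;
        040000) echo "directory (tree)" ;;
        *) echo "unknown mode"; return 1 ;;
    esac
}

# Read a listing on stdin and print "PATH<TAB>DESCRIPTION" for each entry.
annotate_index_listing() {
    while read -r mode _name _stage path; do
        printf '%s\t%s\n' "$path" "$(describe_index_mode "$mode")"
    done
}
```

Feeding it the first listing from the message labels "empty" and "sub" as directories and the other two entries as regular files.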
Sorry for jumping in late...
Why do you want to add _all_ directories, and not just the ones we want to explicitly track (independent of whether they're empty or not)?
Basically, add a "--dir" flag to git-add, git-rm and friends, to tell them you're acting on the directory itself (rather than its (recursive) contents).
"git-add --dir foo" will add the "040000 123abc... 0 foo" to the index/tree whether or not foo is an empty directory.
"git-rm --dir foo" will remove that entry (or fail if it doesn't exist), but _not_ the contents of foo.
Since we're making directory tracking _explicit_, this should all be trivially backward-compatible.
...Johan
-- Johan Herland, <johan@herland.net> www.herland.net
-
( I don't know which mail is the best to reply to and I probably missed something in the thread, so bear with me if I'm repeating anything. )
David. Reconsider "tracking" all directories and what that would give, compared to explicitly tracking specific ones and the required magic entries.
Say we have a config setting that tells git never to remove empty trees. Linus's patches could be a start for representing trees in the index. As an optimization the index could prune trees from the index if they contain things as long as the index *effectively* remembers all trees. Using the patches again we could add empty directories to the index and remove them. No directory would be removed automatically, except maybe by a merge.
We would probably have only a few empty directories and new unexpected ones would only pop up when we remove all blobs from one. Git status could tell us about them so we will not forget them. It could even tell us about "new" empty directories, which is probably the most important thing you'd want to know. Forgetting to untrack an empty directory would not be a big deal.
Whether to retain empty trees or not should be a repository policy, but an all or nothing setting.
-- robin
-
It would be quite a nuisance for a patch-based workflow, since patches don't talk about the creation and deletion of directories. The "track only when entered" approach has the advantage that directories that were only created to accommodate patches will be removed again when becoming empty.
But it doesn't. If you do git-add tree, optimizing the dir entry away since tree/zap exists, then subsequently do git-rm tree/zap, of course there is nothing to do except remove tree/zap, and the tree is gone. One can't start tracking trees explicitly only when they become empty,
I currently have the problem that
rm -rf *
unzip some-archive
git-add some-archive
git-commit -a -m whatever
git-checkout something else
I don't want a source management system to tell me whenever it is
With that approach, the workflow
"Apply a patch creating something/hello"
"Undo the patch creating something/hello"
will leave something lying around. For somebody managing hundreds of directories, that would be a nuisance.
I don't say that a "track all parents automatically" approach would not have its merits: it would likely prevent some mistakes and be easily understandable to most users. But for managing a patch workflow, it would appear to get in the way.
-- David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
