Re: [RFC/PATCH] shortstatus v1

Previous thread: [PATCH] fix the installation path for html documentation by Michael J Gruber on Tuesday, February 10, 2009 - 8:14 am. (3 messages)

Next thread: Re: [PATCH] Make repack less likely to corrupt repository by Junio C Hamano on Tuesday, February 10, 2009 - 8:59 am. (17 messages)
From: Junio C Hamano
Date: Tuesday, February 10, 2009 - 8:58 am

Bug?   FWIW, the original patch from October shows:

    M changed
M   M changed-again
M     changed-staged
    D deleted
D     deleted-staged

(where changed-again has both staged changes and further changes in the
work tree).

The gap between these two are to show the rename similarity index, which

The output mimicked what was in Shawn's "repo" tool announcement IIRC.

My patch was supposed to give interested parties hint to base a patch like
Tuncer's on (I think this answers your last question, too).
--

From: Jeff King
Date: Tuesday, February 10, 2009 - 11:10 am

OK, that makes more sense. And your example shows a good answer to my
earlier question: why is just sorting the output of the the three
commands (diff, diff --cached, and ls-files -o) not as nice. The answer
is that we need to actually combine lines when files are appear in

I went back and read some of the background. I think having this work
with the wt-status machinery is reasonable, then. My concerns with
Tuncer's patch are still:

  - this should not be part of builtin-commit.c; it doesn't use any of
    the same code except that which has already been lib-ified in
    wt-status.[ch].

  - I don't think the "mini" status is really related to this. The novel
    thing here is collating the outputs into a single sorted list. But
    the "mini" output is not about that at all:

      1. It doesn't care about full output, so it should be able to exit
         early from the diff, avoid rename detection, etc, so that it is
         as quick as possible.

      2. It doesn't collate the output at all. It is about three
         separate symbols for the three separate lists.

--

From: Jeff King
Date: Tuesday, February 10, 2009 - 11:22 am

Oh, sorry, I was misreading the "mini" output. I thought the three flags
corresponded to the staged, unstaged, and untracked changes. But they
are "unstaged or staged but added", "unstaged or staged but changed", or
"untracked" (although right now the last is triggered by unmerged
entries?).

I honestly don't see much point in differentiating added versus changed
files. Splitting it into "some things are staged" and "some things are
not staged" makes more sense to me. But if you do want that distinction
then an early exit from the diff is more complicated (since you might
have to keep going to see if there are any of _either_ type).

-Peff
--

From: Jeff King
Date: Tuesday, February 10, 2009 - 12:11 pm

OK, I realize this is not exactly what the proposed --mini does. But
here is more along the lines of what I was thinking.

Warm cache, it runs in .042s on my git repo, about half of which is the
untracked files check. It takes about .49s on the kernel repo. The
read_directory() bit is not optimized at all, and could probably benefit
from an early return (OTOH, the worst case is still going to need to
look at every path).

I am not particularly interested in a fancy prompt myself, but maybe
this will help somebody else.

The patch relies on the index_differs_from() patch that Stephan
posted earlier today.

---
 .gitignore           |    1 +
 Makefile             |    1 +
 builtin-ministatus.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++
 builtin.h            |    1 +
 git.c                |    1 +
 5 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/.gitignore b/.gitignore
index 055eb54..de2249b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -81,6 +81,7 @@ git-mergetool
 git-mktag
 git-mktree
 git-name-rev
+git-ministatus
 git-mv
 git-notes
 git-pack-redundant
diff --git a/Makefile b/Makefile
index a0ca137..9145c7b 100644
--- a/Makefile
+++ b/Makefile
@@ -559,6 +559,7 @@ BUILTIN_OBJS += builtin-merge-base.o
 BUILTIN_OBJS += builtin-merge-file.o
 BUILTIN_OBJS += builtin-merge-ours.o
 BUILTIN_OBJS += builtin-merge-recursive.o
+BUILTIN_OBJS += builtin-ministatus.o
 BUILTIN_OBJS += builtin-mv.o
 BUILTIN_OBJS += builtin-name-rev.o
 BUILTIN_OBJS += builtin-pack-objects.o
diff --git a/builtin-ministatus.c b/builtin-ministatus.c
new file mode 100644
index 0000000..c9f8e7f
--- /dev/null
+++ b/builtin-ministatus.c
@@ -0,0 +1,52 @@
+#include "cache.h"
+#include "diff.h"
+#include "commit.h"
+#include "revision.h"
+#include "dir.h"
+
+static int worktree_is_dirty(void)
+{
+	struct rev_info rev;
+	init_revisions(&rev, "");
+	setup_revisions(0, NULL, &rev, NULL);
+	DIFF_OPT_SET(&rev.diffopt, QUIET);
+	DIFF_OPT_SET(&rev.diffopt, ...
From: Tuncer Ayaz
Date: Tuesday, February 10, 2009 - 2:21 pm

I tried this and it did not run faster than my experiment.
I had to add a missing opening curly brace in
have_untracked() before it compiled.

--

From: Jeff King
Date: Tuesday, February 10, 2009 - 2:36 pm

I'm surprised, based on the numbers you gave before. I guess your
machine is just a lot slower than what I am experimenting with. I think
you will have to profile to see what is taking so long, then, if you

Sorry about that. I added the comment just above directly into the
patch, and obviously managed to butcher the brace while I was doing it.

-Peff
--

From: Junio C Hamano
Date: Tuesday, February 10, 2009 - 3:25 pm

I suspect that with a large tree your have_untracked() would show
unnecessary overhead from dir_add_name(), because you only want one bit of
information but there is no way to stop with "ok, we know enough".  This
toy patch adds a trivial "early return" to read_directory() codepath, but
there are two sad things about it.

 * In order to cheaply run "is there a single other file", you really
   should scan the level you have already opened first before digging
   deeper.  I didn't bother because the primary use of read_directory is
   the depth first traversal.

 * In a cloned work tree, the tracked files and directories come early in
   the physical directory and then crufts you created yourself comes at
   the end in readdir() order.  We tend to read a lot of tracked ones
   first and the finally hit other files.

 builtin-ministatus.c |    4 +++-
 dir.c                |   32 +++++++++++++++++++++++++++-----
 dir.h                |    2 ++
 3 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/builtin-ministatus.c b/builtin-ministatus.c
index aff9e5a..4b5a191 100644
--- a/builtin-ministatus.c
+++ b/builtin-ministatus.c
@@ -25,7 +25,7 @@ static int have_untracked(void)
 
 	read_directory(&dir, ".", "", 0, NULL);
 	/* XXX we are probably leaking memory from dir */
-	for (i = 0; i < dir.nr; i++)
+	for (i = 0; i < dir.nr; i++) {
 		struct dir_entry *ent = dir.entries[i];
 		if (cache_name_is_other(ent->name, ent->len))
 			return 1;
@@ -47,6 +47,8 @@ int cmd_ministatus(int argc, const char **argv, const char *prefix)
 		putchar('*');
 	if (have_untracked())
 		putchar('?');
+	if (untracked_files_exist())
+		putchar('%');
 
 	return 0;
 }
diff --git a/dir.c b/dir.c
index cfd1ea5..8d4fcdd 100644
--- a/dir.c
+++ b/dir.c
@@ -16,7 +16,10 @@ struct path_simplify {
 
 static int read_directory_recursive(struct dir_struct *dir,
 	const char *path, const char *base, int baselen,
-	int check_only, const struct path_simplify *simplify);
+	int mode, const ...
From: Tuncer Ayaz
Date: Tuesday, February 10, 2009 - 3:52 pm

I've done some measurements from within .bashrc via  a
custom timer() bash function called before and after git ministatus call.
The box I'm testing on is a Core2Duo with 1.8GHz and 2GB of RAM.
I have faster machines I can test on but do Git coding on this one.

For WebKit.git this is similar to the previous patch. It does also take
at least 8 or 9 seconds. That's an improvement over 8-11secs.

--

From: Jeff King
Date: Tuesday, February 10, 2009 - 3:55 pm

My timings are exactly the same (with my have_untracked removed,
of course). Due most likely to the fact that I keep my repos clean, so
we have to look at every directory to realize there are no files.

Here are a few timings:

For the relative times of each action, I get (in the linux-2.6 tree):

  - nothing (just read_cache / refresh_cache)
    real    0m0.178s
    user    0m0.060s
    sys     0m0.116s

  - just index_differs_from (minus nothing = 0.107s)
    real    0m0.285s
    user    0m0.140s
    sys     0m0.140s

  - just worktree_is_dirty (minus nothing = 0.0s)
    real    0m0.178s
    user    0m0.048s
    sys     0m0.132s

  - just untracked_files_exist (minus nothing = 0.184s)
    real    0m0.362s
    user    0m0.188s
    sys     0m0.176s

So untracked_files is definitely the worst, but I was surprised that
the index to HEAD diff takes so long.

For just index_differs_from, gprof claims we spend most of our time in
unpack-trees stuff:

 33.33      0.01     0.01    28301     0.00     0.00  unpack_callback
 33.33      0.02     0.01    28301     0.00     0.00  unpack_nondirectories
 33.33      0.03     0.01    26627     0.00     0.00  convert_from_disk

For just untracked_files_exist, gprof claims we spend most of our time
dealing with the index name hash (and obviously a bit of time in
excluded_1):

 27.27      0.03     0.03  1479931     0.00     0.00  icase_hash
 18.18      0.05     0.02    74142     0.00     0.00  excluded_1
  9.09      0.06     0.01   120178     0.00     0.00  lookup_hash_entry
  9.09      0.07     0.01    49855     0.00     0.00  hash_name
  9.09      0.08     0.01    26627     0.00     0.00  convert_from_disk
  9.09      0.09     0.01    24739     0.00     0.00  excluded
  9.09      0.10     0.01       18     0.56     0.88  grow_hash_table

So maybe there are speedups to be had there. ISTR the icase_hash stuff
came about to support case-challenged filesystems. I haven't looked to
see if it could be turned into a ...
From: Junio C Hamano
Date: Tuesday, February 10, 2009 - 4:05 pm

There is an obvious optimization you can do to "diff-index --cached" using
cache-tree.  If your index is really clean, computing the tree object the
index would represent (without writing the tree object out) and comparing
it against HEAD^{tree} may be a tad faster.
--

From: Jeff King
Date: Wednesday, February 11, 2009 - 5:49 pm

Clever, but I think you may just be trading one scenario for "worst
case" versus another (i.e., now when you _do_ have a difference to do an
early return, you still have to touch everything in the cache).

Just for fun, I timed a quick and dirty implementation. It looks like
generating the tree actually ends up taking just a little bit longer,
even with an unchanged index (which should be the case it speeds up).

But maybe I just did it wrong. My implementation was basically just:

  t = cache_tree();
  if (!cache_tree_fully_valid(t))
    cache_tree_update(t, active_cache, active_nr, 0, 0);
  get_sha1("HEAD", sha1);
  return hashcmp(t->sha1, sha1);

-Peff
--

From: Nanako Shiraishi
Date: Tuesday, February 10, 2009 - 4:52 pm

I have a question. Why do you have the gap for the rename similarity between the two but not between the second status and the filename?

--

Previous thread: [PATCH] fix the installation path for html documentation by Michael J Gruber on Tuesday, February 10, 2009 - 8:14 am. (3 messages)

Next thread: Re: [PATCH] Make repack less likely to corrupt repository by Junio C Hamano on Tuesday, February 10, 2009 - 8:59 am. (17 messages)