Re: 'git gc --aggressive' effectively unusable

From: Frans Pop
Date: Friday, April 2, 2010 - 3:05 pm

Note: this is on a different repo from the 'git reflog expire --all' I
reported a bit earlier.

I have a git-svn checkout of a subversion repo which I wanted to compress
as much as possible. 'git gc --aggressive' starts to run fairly well, but
eats more and more memory and gets slower and slower. After it gets to
about 45% or 50%, progress slows down noticeably, and so far I haven't had
the patience to let it finish (40 minutes is already way too long).

A regular 'git gc' run completes without any problems.

$ du -sh .git/
612M    .git/

Special about this repo is that it contains two huge objects [1], which
could maybe be a factor:
     size    pack  SHA
- packages/po/sublevel4/da.po:
     495661  4654  801cd6451ece536c0ab41f79e09fc52efdf3361f
- packages/arch/powerpc/quik-installer/debian/po/da.po
     149515  1403  83a787b20817dc4d72db052de4055e7a7c9221d7  

Below some output from top and of the progress of the command showing the
problem. Check the change in number of compressed objects against the
timestamps from top.

Cheers,
FJP

[1] Caused by a bug in a script a couple of years back.

$ git gc --aggressive

Counting objects: 843342, done.
Delta compression using up to 2 threads.
Compressing objects:  53% (449663/836424)

top - 22:55:02 up 18 min,  1 user,  load average: 1.83, 1.68, 1.07
Tasks: 161 total,   1 running, 160 sleeping,   0 stopped,   0 zombie
Cpu0  : 91.4%us,  0.7%sy,  0.0%ni,  1.3%id,  6.6%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 97.7%us,  0.3%sy,  0.0%ni,  1.3%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2034284k total,  2018288k used,    15996k free,    10188k buffers
Swap:  2097148k total,    22612k used,  2074536k free,   449444k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5861 fjp       20   0 1775m 1.3g 194m S  188 66.7  21:10.89 git


Counting objects: 843342, done.
Delta compression using up to 2 threads.
Compressing objects:  58% (486001/836424)

top - 23:00:12 up 23 min,  1 user,  load average: ...
--

From: Frans Pop
Date: Saturday, April 3, 2010 - 2:16 pm

To avoid confusion: these sizes are in kB.
--

From: Michael Witten
Date: Saturday, April 3, 2010 - 2:33 pm

There's your problem.

$ git help gc | sed -n /--aggressive$/,+3p
       --aggressive
           Usually git gc runs very quickly while
           providing good disk space utilization
           and performance. This option will
           cause git gc to more aggressively
           optimize the repository at the expense
           of taking much more time. The effects
           of this optimization are persistent, so
           this option only needs to be used
           occasionally; every few hundred
           changesets or so.

Last time I used this option (on Linus's Linux repo), I let the
algorithm do its thing for a couple of hours. Maybe the efficiency
could be vastly improved, but it does finish if you let it.

Sincerely,
Michael Witten
--

From: Michael Witten
Date: Saturday, April 3, 2010 - 2:42 pm

As an aside: I didn't realize I copied that in there; this would
probably be better:

$ git help gc | sed -n /--aggressive$/,/^$/p
--

From: Frans Pop
Date: Saturday, April 3, 2010 - 4:23 pm

Yes, I had seen that. But there's a difference between taking much more 
time and slowing down to such an extent that it never finishes.

I've tried it today on my linux-2.6 repo as well and the same thing 
happened. At first the progress is not fast but reasonable. When it gets 
to about 45% it starts slowing down a lot: from ~1500 objects per 
update of the counters to ~300 objects per update. And who knows what the 
progress is going to be when it reaches 70% or 90%: 10 per update?

With a total of over 2 million objects in the repository such a low speed is 
simply not going to work, ever. So I maintain that it is effectively 
unusable.

Cheers,
FJP
--

From: Michael Witten
Date: Saturday, April 3, 2010 - 4:42 pm

Well, all I can do is quote myself:

    Last time I used this option (on Linus's Linux repo),
    I let the algorithm do its thing for a couple of hours.
    Maybe the efficiency could be vastly improved, but
    it does finish if you let it.


I think I must have run gc with 1.7.0.2.199.g90a2bf9; perhaps you
could use something like oprofile to figure out where gc is spending
most of its time.
--

From: Miles Bader
Date: Saturday, April 3, 2010 - 5:14 pm

Are you sure it doesn't subsequently speed up again?

-Miles

-- 
Idiot, n. A member of a large and powerful tribe whose influence in human
affairs has always been dominant and controlling.
--

From: Michael Poole
Date: Sunday, April 4, 2010 - 7:50 am

I have seen asymptotic slowdown as "git gc --aggressive" progresses on
certain repositories.  It is particularly bad with
git://git.infradead.org/gcc.git (on an x86-64 system with 4 GB RAM).
git seemed to be thrashing swap badly as time went on.  I don't know
that git gc --aggressive would *never* finish on my gcc-git repository.
I just know that it got to about 80% done in less than an hour, to 90%
after twelve hours, and about 94% after another twelve hours.  (The same
operation on linux-2.6.git takes about 40 minutes with all the default
settings.)

I may have been dreaming, but I thought with some 1.6.x version of git,
reducing core.packedGitLimit and pack.windowLimit (now windowMemory?)
mostly made the thrashing go away.  When I try again with v1.7.0.2,
though, it doesn't seem to help very much -- there is still a lot of
swapping, and the git process got to about 7 GB virtual size before I
killed it after about 10 hours of operation.

Michael Poole
--

From: Jeff King
Date: Sunday, April 4, 2010 - 1:38 pm

I packed Frans' sample kernel repo with "git gc --aggressive" last
night. It did finish after about 9 hours. I didn't take memory usage
measurements, but here's what time said:

  real    535m38.898s
  user    216m46.437s
  sys     0m24.186s

That's 3.6 hours of CPU time over almost 9 hours (on a dual-core
machine). The non-aggressive pack was about 680M, and the result was
480M. The machine has 2G of RAM, and not much else running. So I would
really not expect there to be much disk I/O required, but clearly we
were waiting quite a bit.
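
As a rough check of those figures, the sketch below converts the `time` output to hours and estimates how busy the two cores were. The helper name `time_summary` is hypothetical, and the calculation assumes that user plus sys time is all the CPU work the process did.

```shell
# Hypothetical helper: summarise a `time` report for a machine with N cores.
# Arguments: real seconds, user seconds, sys seconds, core count.
time_summary() {
    awk -v real="$1" -v user="$2" -v sys="$3" -v cores="$4" 'BEGIN {
        cpu = user + sys    # total CPU seconds consumed
        printf "wall=%.1fh cpu=%.1fh busy=%.0f%%\n",
               real / 3600, cpu / 3600, 100 * cpu / (cores * real)
    }'
}

# 535m38.898s real, 216m46.437s user, 0m24.186s sys, dual-core machine.
time_summary 32138.898 13006.437 24.186 2
# prints: wall=8.9h cpu=3.6h busy=20%
```

On these numbers the two cores were idle for roughly 80% of the run, which fits the observation that the process spent most of its time waiting.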

I'll try tweaking a few of the pack memory limits and try again.

-Peff
--

From: Jeff King
Date: Sunday, April 4, 2010 - 2:49 pm

Hmm, this may be relevant:

  http://thread.gmane.org/gmane.comp.version-control.git/67791/focus=94797

In my experiments, memory usage is increasing but valgrind doesn't
report leaks. So perhaps it is fragmentation in the memory allocator.

-Peff
--

From: Nicolas Pitre
Date: Monday, April 5, 2010 - 2:07 pm

To verify this, simply try with pack.threads = 1.  That should help the 
memory allocator not to fragment memory allocation across threads 
randomly.

Also, going multithreaded _may_ be faster only if you can afford the 
increased memory usage.  Especially with gc --aggressive, each thread is 
adding its own share of memory usage in the delta window.

First thing to try for the biggest possible improvement is 
pack.threads=1.  On a quad core machine this means repacking 4 times 
slower, but this is certainly much faster than 100 times slower when the 
system starts swapping. That might even make the resulting pack a tad 
tighter due to delta windows not being fragmented across different 
threads.
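
A minimal sketch of that first step, assuming it is run from inside the affected repository. The function name `use_single_pack_thread` is hypothetical; `pack.threads` is the setting named above.

```shell
# Hypothetical helper: pin this repository's delta search to one thread.
# The value is written to the repository's own .git/config, so other
# repositories on the machine are not affected.
use_single_pack_thread() {
    git config pack.threads 1
}
```

After calling it, rerun `git gc --aggressive` as before. `git config --unset pack.threads` returns the repository to the default.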

If that is not enough, then try:

	pack.deltaCacheSize = 1
	core.packedGitWindowSize = 16m
	core.packedGitLimit = 128m

This should reduce Git's memory usage while making it slower without 
affecting the packing outcome.  Again "slower" could mean "much faster" 
if by reducing memory usage then swapping is completely avoided.
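
The three settings above could be applied per repository along these lines. This is only a sketch: `limit_pack_memory` is a hypothetical name, and the values are the ones suggested in this message, not tuned recommendations.

```shell
# Hypothetical helper: apply the lower-memory settings to the current
# repository. Values are copied from the message above.
limit_pack_memory() {
    git config pack.deltaCacheSize 1 &&
    git config core.packedGitWindowSize 16m &&
    git config core.packedGitLimit 128m
}
```

Because the settings live in the repository's `.git/config`, they stay in place for later repacks until they are removed with `git config --unset`.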

If that still doesn't help much, then the next tweaks will affect the 
packing result:

	pack.windowMemory = 256m

Here 256m is arbitrary and must be guessed from the size of the objects 
being packed.  The idea is to let smallish objects completely fill the 
search window (it has 250 entries by default with --aggressive) while 
not letting that many huge objects completely eat up all memory.  If 
there is still swapping going on then you can try 64m instead.  That 
means that if you have a large set of 1MB objects then the delta search 
window will be scaled down to less than 64 entries in that case.  This 
is why packing might be less optimal as there are fewer delta 
combinations being considered.
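
The scaling described above can be approximated with a little arithmetic. The sketch below is a rough upper bound only: the hypothetical `window_entries` helper assumes every object in the window has the same size and ignores any per-entry overhead, which is why the real window ends up below the figure it prints.

```shell
# Hypothetical helper: upper bound on delta-window entries when all objects
# have the same size.
# Arguments: window memory in MB, object size in MB, configured window size.
window_entries() {
    awk -v mem="$1" -v obj="$2" -v win="$3" 'BEGIN {
        n = int(mem / obj)      # objects that fit in the memory budget
        if (n > win) n = win    # never more than the configured window
        print n
    }'
}

window_entries 64 1 250     # prints 64
window_entries 256 1 250    # prints 250
```

With 1MB objects, a 64m limit shrinks the window to about a quarter of the 250 entries, while 256m leaves it at full size.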

If this still doesn't prevent swapping then you should really consider 
installing more RAM.  There are fundamental object accounting structures 
that can hardly be shrunk such as struct object_entry in 
builtin/pack-objects.c, and one instance of such ...
--

From: Mike Galbraith
Date: Saturday, April 3, 2010 - 9:27 pm

As a data point, when I do gc, I routinely use --aggressive.  It takes a
while here, but not forever.  (I'm a tad short of 2 million objects)

Repo is mainline + next + tip + stable >= 2.6.22 + local branches.

git@marge:..git/linux-2.6> time git gc --aggressive
Counting objects: 1909894, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (1889774/1889774), done.
Writing objects: 100% (1909894/1909894), done.
Total 1909894 (delta 1674098), reused 0 (delta 0)

real    22m24.943s
user    55m33.756s
sys     0m8.149s

git is 1.7.0.3

	-Mike

--
