01 October 2011

Using git svn with a large repository

I've started using the git svn bridge for one of our projects, but I had a couple of problems with the initial clone of the repository: some of the files are large (over 100 MB), and the Subversion server kept dropping the connection.
I started with the standard git svn clone:
$ git svn clone https://svn.farwell.co.uk/svn/project --stdlayout
Initialized empty Git repository in c:/code/project/.git/
r1 = 339bd134b2d482cf9038c16fa75f93255ebfbc1a (refs/remotes/trunk)
W: +empty_dir: trunk/blah1
W: +empty_dir: trunk/blah2
W: +empty_dir: trunk/blah3
W: +empty_dir: trunk/blah4
....
The --stdlayout option tells git svn to expect the trunk to be called trunk, the tags directory to be called tags and the branches directory to be called branches.
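If a repository uses different directory names, the three locations can be given one by one instead. A minimal sketch, assuming the -T, -b and -t options (short for --trunk, --branches and --tags) from the git svn documentation; stdlayout_equivalent is a made-up helper name, and it only prints the command rather than running it:

```shell
#!/bin/sh
# Print the explicit form of `git svn clone --stdlayout` for a given URL.
# stdlayout_equivalent is a hypothetical helper; it echoes the command
# instead of running it, so nothing is cloned.
stdlayout_equivalent() {
    url=$1
    echo "git svn clone $url -T trunk -b branches -t tags"
}

stdlayout_equivalent https://svn.farwell.co.uk/svn/project
```

Swapping trunk, branches or tags for other names in the printed command is how a non-standard layout would be described.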
Note also that you need to specify the URL without the trunk at the end. This ran for a while and then fell over, because the svn server dropped the connection on me; there is a timeout on the server.
RA layer request failed: REPORT request failed on '/svn/project
/!svn/vcc/default': REPORT of '/svn/project/!svn/vcc/default':
Could not read chunk delimiter: Secure connection truncated (ht
tps://svn.farwell.co.uk) at C:\Program Files (x86)\Git/libexec/
git-core/git-svn line 5114
We need to load in batches. git svn fetch has a -r option that lets you specify the range of revisions to fetch. We've got some large files, so we'll do 10 revisions at a time.
I started again:
$ git svn clone https://svn.farwell.co.uk/svn/project \
--stdlayout -r1:2
which fetched the first two revisions, but we still have to fetch the rest, about 1000 revisions. I used a quick Perl script.
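use strict;
use warnings;

my $count = 1;

while ($count <= 1000) {
    my $last = $count + 9;
    # executes git svn fetch -r1:10, then -r11:20, etc.
    my $cmd = "git svn fetch -r$count:$last";
    print "$cmd\n";
    # stop at the first failure so a dropped connection doesn't leave a gap
    system($cmd) == 0 or die "'$cmd' failed: $?\n";
    $count = $last + 1;
}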
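The same batching idea can be sketched in shell, with ranges that don't overlap and a stop at the first failed fetch. This is an illustration under assumptions rather than the script I ran: batch_ranges and fetch_in_batches are made-up names, and DRY_RUN defaults to 1 so the loop only prints the commands it would run.

```shell
#!/bin/sh
# batch_ranges FIRST LAST STEP: print non-overlapping ranges "from:to",
# one per line, covering FIRST..LAST. (Hypothetical helper name.)
batch_ranges() {
    from=$1; last=$2; step=$3
    while [ "$from" -le "$last" ]; do
        to=$((from + step - 1))
        # clamp the final batch to LAST
        if [ "$to" -gt "$last" ]; then to=$last; fi
        echo "$from:$to"
        from=$((to + 1))
    done
}

# fetch_in_batches FIRST LAST STEP: one `git svn fetch` per range.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed;
# with DRY_RUN=0 they are run and the loop stops at the first failure.
fetch_in_batches() {
    for range in $(batch_ranges "$1" "$2" "$3"); do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "git svn fetch -r$range"
        else
            git svn fetch "-r$range" || {
                echo "fetch failed at -r$range" >&2
                return 1
            }
        fi
    done
}

fetch_in_batches 1 30 10
```

Run as is, this prints three fetch commands, for -r1:10, -r11:20 and -r21:30. Setting DRY_RUN=0 inside a git svn working copy would run them for real.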
But then we hit another problem: git ran out of memory and crashed, and this time it's more serious. It's another consequence of our big files. This is the error message:
Out of memory during "large" request for 268439552 bytes, total sbrk() is 140652544 bytes at /usr/lib/perl5/site_perl/Git.pm line 898,  line 3.
git svn uses Perl to download and process the files, but it slurps each file into memory in one go, so for our large files it runs out of memory.

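After a bit of searching on the internet, I found a solution on GitHub for our problem: Git.pm: Use stream-like writing in cat_blob().
This is a fairly simple patch, which doesn't seem to have made it into a release yet, so I applied it manually to C:\Program Files (x86)\Git\lib\perl5\site_perl\Git.pm. Instead of accumulating the whole blob in $blob and printing it at the end, it writes each chunk of up to 1024 bytes to the file handle as soon as it has been read.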
@@ -896,22 +896,26 @@ sub cat_blob {
     }
     my $size = $1;
-
-    my $blob;
     my $bytesRead = 0;

     while (1) {
+        my $blob;
         my $bytesLeft = $size - $bytesRead;
         last unless $bytesLeft;

         my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
-        my $read = read($in, $blob, $bytesToRead, $bytesRead);
+        my $read = read($in, $blob, $bytesToRead);
         unless (defined($read)) {
             $self->_close_cat_blob();
             throw Error::Simple("in pipe went bad");
         }

         $bytesRead += $read;
+
+        unless (print $fh $blob) {
+            $self->_close_cat_blob();
+            throw Error::Simple("couldn't write to passed in filehandle");
+        }
     }

     # Skip past the trailing newline.

@@ -926,11 +930,6 @@ sub cat_blob {
         throw Error::Simple("didn't find newline after blob");
     }

-    unless (print $fh $blob) {
-        $self->_close_cat_blob();
-        throw Error::Simple("couldn't write to passed in filehandle");
-    }
-
     return $size;
 }
I restarted the process from the beginning and voilà, it got to the end. All of the revisions had been fetched; all that was left to do was a
$ git svn rebase
to merge the changes into the tree and have a working git repo.

If I had wanted to migrate from svn to GitHub, rather than continue to use git svn, I'd have done exactly the same thing, but added --no-metadata to the clone command.
And obviously you don't need to do a git svn rebase, just a rebase.
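As a rough sketch of what that one-off migration might look like, assuming the local branch ends up being called master and using a placeholder GitHub URL (migration_commands is a made-up helper that prints the commands rather than running them):

```shell
#!/bin/sh
# migration_commands SVN_URL GIT_URL: print (not run) the commands for a
# one-way migration. Hypothetical helper; the GitHub URL used in the call
# below is a placeholder, not a real repository.
migration_commands() {
    svn_url=$1; git_url=$2
    # --no-metadata leaves the git-svn-id lines out of the commit messages
    echo "git svn clone $svn_url --stdlayout --no-metadata"
    echo "git remote add origin $git_url"
    # assumes the local branch is named master
    echo "git push -u origin master"
}

migration_commands https://svn.farwell.co.uk/svn/project git@github.com:example/project.git
```

The last two commands are meant to be run inside the freshly cloned directory. For a repository as large as ours, the clone would still have to be done in batches with -r as described above.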