Random Allsorts: Using git svn with a large repository

01 October 2011

I've started using the git svn bridge for one of our projects, but the initial clone of the repository gave me a couple of problems: some of our files are large (over 100 MB), and the Subversion server kept dropping the connection.
I started with the standard git svn clone:

$ git svn clone https://svn.farwell.co.uk/svn/project --stdlayout
Initialized empty Git repository in c:/code/project/.git/
r1 = 339bd134b2d482cf9038c16fa75f93255ebfbc1a (refs/remotes/trunk)
W: +empty_dir: trunk/blah1
W: +empty_dir: trunk/blah2
W: +empty_dir: trunk/blah3
W: +empty_dir: trunk/blah4
....

The --stdlayout option tells git svn to expect the standard Subversion layout: the trunk is called trunk, tags live under tags and branches under branches.
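For reference, --stdlayout (or -s) is shorthand for naming the three directories explicitly with --trunk, --branches and --tags. Here is a minimal sketch of the explicit form, which is the one to adapt if a repository uses different directory names. The function name clone_explicit_layout is made up for this example and is not part of git.

```shell
# Clone with the layout spelled out instead of relying on --stdlayout.
# clone_explicit_layout is a hypothetical helper name; $1 is the repository
# URL, given without trunk at the end.
clone_explicit_layout() {
    git svn clone "$1" --trunk=trunk --branches=branches --tags=tags
}
```

Calling clone_explicit_layout https://svn.farwell.co.uk/svn/project should behave like the --stdlayout clone above.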
Note also that you need to specify the URL without the trunk at the end. The clone ran for a while and then fell over, because the Subversion server has a timeout and dropped the connection on me:

RA layer request failed: REPORT request failed on '/svn/project/!svn/vcc/default': REPORT of '/svn/project/!svn/vcc/default': Could not read chunk delimiter: Secure connection truncated (https://svn.farwell.co.uk) at C:\Program Files (x86)\Git/libexec/git-core/git-svn line 5114

So we need to load the history in batches. git svn fetch has a -r option that lets you specify the range of revisions to fetch. We've got some large files, so we'll do 10 revisions at a time.
I started again:

$ git svn clone https://svn.farwell.co.uk/svn/project \
                     --stdlayout -r1:2

which fetched the first two revisions. That left the rest, about 1000 revisions, still to fetch, so I used a quick Perl script:

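use strict;
use warnings;

my $last  = 1000;  # approximate number of revisions in the repository
my $batch = 10;    # revisions per fetch

for (my $from = 1; $from <= $last; $from += $batch) {
    # executes git svn fetch -r1:10, then -r11:20, etc.
    my $cmd = "git svn fetch -r$from:" . ($from + $batch - 1);
    print "$cmd\n";
    # stop at the first failed batch so the run can be restarted from there
    system($cmd) == 0 or die "'$cmd' failed: $?\n";
}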
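Since the server drops connections now and then, a single failed batch need not end the whole run. The following is a hedged shell sketch of the same batching idea with a simple retry. The function names (batch_ranges, fetch_in_batches), the three attempts and the five-second pause are all choices made for this example, and it assumes that re-running a range that was already partly fetched is harmless.

```shell
# Print non-overlapping revision ranges "from:to", one per line.
# Arguments: first revision, last revision, batch size.
batch_ranges() {
    start=$1 end=$2 step=$3
    while [ "$start" -le "$end" ]; do
        stop=$((start + step - 1))
        [ "$stop" -gt "$end" ] && stop=$end
        echo "$start:$stop"
        start=$((stop + 1))
    done
}

# Fetch every range in turn; a failing batch gets up to three attempts
# before the function gives up and returns non-zero.
fetch_in_batches() {
    for range in $(batch_ranges "$1" "$2" "$3"); do
        tries=0
        until git svn fetch -r"$range"; do
            tries=$((tries + 1))
            if [ "$tries" -ge 3 ]; then
                echo "giving up on -r$range" >&2
                return 1
            fi
            sleep 5
        done
    done
}
```

For the repository above this would be called as fetch_in_batches 1 1000 10, run from inside the cloned directory.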

But then I hit another, more serious problem, again caused by our big files: git crashed because it ran out of memory. This is the error message:

Out of memory during "large" request for 268439552 bytes, total sbrk() is 140652544 bytes at /usr/lib/perl5/site_perl/Git.pm line 898,  line 3.

git svn uses Perl to download and process the files, but it slurps each file into memory in one go, so it runs out of memory on our large files.

After a bit of searching on the internet, I found a solution to our problem on GitHub: Git.pm: Use stream-like writing in cat_blob().
This is a fairly simple patch that writes each chunk to the output filehandle as soon as it is read, instead of accumulating the whole blob first. It doesn't seem to have made it into a release yet, so I applied it manually to C:\Program Files (x86)\Git\lib\perl5\site_perl\Git.pm.

@@ -896,22 +896,26 @@ sub cat_blob {
   }
   my $size = $1;
-
-  my $blob;
   my $bytesRead = 0;

   while (1) {
+    my $blob;
     my $bytesLeft = $size - $bytesRead;
     last unless $bytesLeft;

     my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
-    my $read = read($in, $blob, $bytesToRead, $bytesRead);
+    my $read = read($in, $blob, $bytesToRead);
     unless (defined($read)) {
       $self->_close_cat_blob();
       throw Error::Simple("in pipe went bad");
     }

     $bytesRead += $read;
+
+    unless (print $fh $blob) {
+      $self->_close_cat_blob();
+      throw Error::Simple("couldn't write to passed in filehandle");
+    }
   }

   # Skip past the trailing newline.

@@ -926,11 +930,6 @@ sub cat_blob {
     throw Error::Simple("didn't find newline after blob");
   }

-  unless (print $fh $blob) {
-    $self->_close_cat_blob();
-    throw Error::Simple("couldn't write to passed in filehandle");
-  }
-
   return $size;
}

I restarted the process from the beginning and voilà, it got to the end. All of the revisions had been fetched, and all that was left to do was a

$ git svn rebase

to merge the changes into the tree and have a working git repo.

If I had wanted to migrate from svn to GitHub, rather than continue to use git svn, I'd have done exactly the same thing, but added --no-metadata to the clone command.
And obviously you don't need to do a git svn rebase, just a rebase.

4 comments:

Anonymous said...: Great blog you have here, but I was wondering if you knew of any community forums that cover the same topics talked about in this article? I'd really love to be part of a group where I can get feedback from other knowledgeable people who share the same interest. If you have any recommendations, please let me know. Appreciate it!
Acheter vimax en France.2011AVEF; Tue Oct 04, 02:20:00 pm CEST
Matthew Farwell said...: There are a number of options: there is a git channel on IRC, irc://irc.freenode.net/git, and http://stackoverflow.com which is a great site for asking or answering questions. If you want to start somewhere, start at stackoverflow.com; Wed Oct 05, 08:29:00 am CEST
Unknown said...: If you want to migrate to GitHub, consider using subgit, http://subgit.com

To use this tool you have to have local access to your svn repository. If so, just install subgit into this repository.

This will import all revisions into the specified git repository. From then on you can use pure git (not git-svn) to push changes into this newly created repository, and subgit will automatically synchronize the changes with svn.

In order to make it work with github, you can publish your git repository and from time to time push changes into git repository synchronized with svn.

Hope that helps.; Thu Feb 23, 04:08:00 pm CET
Anonymous said...: Hi Matthew, it is a great blog. I've started to migrate our svn repositories to GitHub and I hit the same problem you described and solved here. With this know-how I got the svn repositories converted. I have another problem related to big svn repositories, and I don't know if you have a solution for it. Every time you call "git svn clone .." it adds new rows for branches and tags to the .git/config file, so after a while you get a long list of the same entries in your config file:

[core]
repositoryformatversion = 0
filemode = false
bare = false
logallrefupdates = true
symlinks = false
ignorecase = true
[svn-remote "svn"]
url = https://svn.de050.corpintra.net/ltm
fetch = trunk:refs/remotes/trunk
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
... same as above
... same as above
[svn]
authorsfile = C:/work/cygwin64/home/hnikzat/GitMigration/authors.txt

My repository is about 50 GB. After 4 days of running the process, the list of branches and tags lines has more than 2000 entries (the same rows repeated)!

How did you solve this problem?

Regards,
Ynca; Fri Aug 09, 01:55:00 pm CEST