Cache upload slow if stored on NFS (OD-1952)
jbauer opened 2 years ago

I run OneDev as a Docker image, and the storage volume for OneDev is mounted by Docker as an NFS volume. Docker itself sets reasonably large NFS rsize and wsize mount options to increase NFS performance by reducing overhead. OneDev agents use local volumes.

However, when an OneDev agent uploads the build cache (about 1-2 GB) to the OneDev server, it takes very long. Manually testing the NFS share with the dd command and different block sizes shows that the block size must not be too small; with a block size of about 64 KB, NFS performance was fine.

I have seen in the OneDev code that OneDev uses similarly sized byte buffers, but maybe there is (unconfigured) JDK or third-party code involved that falls back to an 8 KB buffer size, which is often the default in JDK/library code.

Any chance you can check this by looking through the code, including any JDK/library code involved in the build cache tar/untar/send/receive actions? I suspect OneDev effectively uses only 8 KB buffers when working with the build cache. Git checkout and writing files to disk might also be worth checking.

Just to give you an impression: a cache upload of roughly 2 GB took 12 minutes, which equals about 2.8 MB/s. However, the NFS server is perfectly capable of receiving data at 60-80 MB/s when using dd. Given that the build cache is one large tar archive, OneDev should achieve nearly the same performance.
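
For illustration, the effect of write granularity can be reproduced in plain Java as well. The following is a minimal sketch (hypothetical paths and sizes, nothing OneDev-specific) that writes the same amount of data to an NFS mount with 4 KB versus 64 KB chunks; on an unbuffered FileOutputStream every write() call turns into at least one NFS request.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Minimal sketch, assuming a hypothetical NFS mount at /mnt/nfs: write the same
    // amount of data with different chunk sizes, similar to running dd with
    // different bs values, to see how write granularity affects throughput.
    public class NfsWriteProbe {

        static long timeWriteMillis(String path, int chunkSize, long totalBytes) throws IOException {
            byte[] chunk = new byte[chunkSize];
            long start = System.nanoTime();
            try (OutputStream out = new FileOutputStream(path)) {
                long written = 0;
                while (written < totalBytes) {
                    out.write(chunk); // unbuffered: each call becomes at least one NFS write
                    written += chunk.length;
                }
            }
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) throws IOException {
            long total = 512L * 1024 * 1024; // 512 MB test file
            System.out.println("4 KB chunks:  " + timeWriteMillis("/mnt/nfs/probe-4k.bin", 4 * 1024, total) + " ms");
            System.out.println("64 KB chunks: " + timeWriteMillis("/mnt/nfs/probe-64k.bin", 64 * 1024, total) + " ms");
        }
    }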

  • Robin Shen changed state from 'Open' to 'Closed' 1 year ago
  • Robin Shen commented 1 year ago

    Checked, and OneDev is using a large buffer for all file I/O operations. Cache upload can be slow because it is not simply a file copy; it also has to go through the web server.

  • jbauer commented 1 year ago

    Have you also checked any library or JDK code involved? The speed we see basically matches what we get with dd and an 8 KB buffer. If OneDev uses large buffers, that is great, but if the underlying code that actually sends the data over the wire uses its own small buffers, then the large OneDev buffer is basically useless in terms of performance.

  • jbauer commented 1 year ago

    Our web proxy in front of OneDev does not use proxy buffers. The proxy should be irrelevant anyway, because the agent sends the data to OneDev and OneDev then uses its own buffers to write it to disk. If a OneDev buffer isn't full yet, it should wait before writing. Writing to disk is what triggers the NFS writes, which are slow.

  • Robin Shen commented 1 year ago

    Checked all the code paths for cache upload and am sure that a 64 KB buffer is used.

  • jbauer commented 1 year ago

    @robin I have now taken some time to debug the OneDev server locally, because some of our builds are actually starting to time out after 1 hour since uploading the cache is so slow.

    You are using the Apache Commons method IOUtils.copy(InputStream, OutputStream, int) in various places, and as you said, in all those places you call it with a buffer size of 64 KB. That buffer is used to read data from the InputStream and then write that data to the OutputStream. However, it is not guaranteed that the 64 KB buffer actually fills up, because InputStream.read(buffer) returns the number of bytes that were read, and only that many bytes are then written to the OutputStream.
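
    To make that concrete, the copy loop boils down to something like this (a simplified sketch, not the exact Commons IO source): the buffer size is only an upper bound, and each write is exactly as large as the preceding read.

        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;

        // Simplified version of the loop inside IOUtils.copyLarge(InputStream, OutputStream, byte[]).
        class CopyLoopSketch {
            static long copyLarge(InputStream input, OutputStream output, byte[] buffer) throws IOException {
                long count = 0;
                int n;
                while ((n = input.read(buffer)) != -1) { // may return far fewer bytes than buffer.length
                    output.write(buffer, 0, n);          // write size equals read size, not buffer size
                    count += n;
                }
                return count;
            }
        }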

    I set a breakpoint in IOUtils.copyLarge(InputStream, OutputStream, byte[]) and started a build on an agent (inside Docker) that also uploads its cache to the server. It turns out that the InputStream only yields 4 KB of data per read(buffer) call. The InputStream is provided by Jersey as an EntityInputStream, which at some point wraps Jetty's HttpInputOverHTTP stream. That Jetty stream uses an 8 KB ByteBuffer whose limit is set to 4 KB for some reason, so you can only read 4 KB blocks from HTTP API calls received by Jetty.
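
    (For anyone who wants to reproduce this without a debugger: a small wrapper like the following, purely hypothetical and not part of OneDev, could be put around the request InputStream to log how much each read() actually returns.)

        import java.io.FilterInputStream;
        import java.io.IOException;
        import java.io.InputStream;

        // Hypothetical debugging aid: logs how many bytes each read() actually returns.
        class ReadSizeLoggingInputStream extends FilterInputStream {

            ReadSizeLoggingInputStream(InputStream in) {
                super(in);
            }

            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                int n = super.read(b, off, len);
                if (n > 0) {
                    System.out.println("read() returned " + n + " bytes (requested " + len + ")");
                }
                return n;
            }
        }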

    In the case of cache uploading, in DefaultJobCacheManger.uploadCache(Long, Long, List<String>, InputStream) you create a FilterOutputStream(FileOutputStream) chain that is passed to IOUtils.copy(). Because the FileOutputStream is not wrapped in a BufferedOutputStream, the cache is written to disk in 4 KB blocks, which greatly hurts NFS performance since every write equals a network request to the NFS server.

    Because IOUtils.copy() is used in various places, you either have to configure Jetty somehow so that more data can be read at once from its HTTP connection, or you have to wrap your OutputStream in a BufferedOutputStream to make sure you actually write 64 KB blocks when reading data from your HTTP API and transferring it to an OutputStream.
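
    To make the second option concrete, here is a rough sketch of the unbuffered versus buffered write path (placeholder names, not OneDev's actual code; the real chain has an extra FilterOutputStream layer that I left out):

        import java.io.BufferedOutputStream;
        import java.io.File;
        import java.io.FileOutputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;

        import org.apache.commons.io.IOUtils;

        // Rough sketch of the two write paths; cacheFile and requestIn are placeholders.
        class CacheWriteSketch {

            static final int BUFFER_SIZE = 64 * 1024;

            // Behaviour described above: each write reaches the file (and therefore NFS)
            // in whatever chunk size Jetty hands out per read, i.e. ~4 KB.
            static void writeUnbuffered(File cacheFile, InputStream requestIn) throws IOException {
                try (OutputStream out = new FileOutputStream(cacheFile)) {
                    IOUtils.copy(requestIn, out, BUFFER_SIZE);
                }
            }

            // Suggested fix: a BufferedOutputStream coalesces the small chunks into
            // 64 KB writes before they hit the NFS-backed file system.
            static void writeBuffered(File cacheFile, InputStream requestIn) throws IOException {
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(cacheFile), BUFFER_SIZE)) {
                    IOUtils.copy(requestIn, out, BUFFER_SIZE);
                }
            }
        }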

  • Robin Shen commented 1 year ago

    Thanks for the info. I filed a task (OD-2204) to investigate this when I have time.

  • Robin Shen commented 1 year ago

    Build OD-5768 uses a BufferedOutputStream with a 64 KB buffer size for all operations that write large files.

  • jbauer commented 1 year ago

    Thanks. I just tried the new build, and uploading a ~3 GB cache went from roughly 2 MB/s to roughly 10 MB/s, which already improves build time quite a bit. Uploading the cache now takes ~6 minutes (down from ~20 minutes). Downloading the cache takes ~1 minute (it did even before your change).

    It seems something is still limiting performance somewhere in the chain tar -> gz -> Jetty OutputStream -> 1 Gbit network -> Jetty InputStream -> BufferedOutputStream -> FileOutputStream.
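
    Just as a thought, buffering could also be checked on the agent side of that chain. Below is a hedged sketch of what a buffered tar/gzip packing step could look like, assuming Apache commons-compress classes (TarArchiveOutputStream, GzipCompressorOutputStream); I don't know what the agent actually uses, so these names and the packDirectory helper are assumptions, not OneDev code.

        import java.io.BufferedInputStream;
        import java.io.BufferedOutputStream;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;

        import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
        import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
        import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
        import org.apache.commons.io.IOUtils;

        // Hypothetical agent-side packing sketch: buffer both the upload stream and the
        // per-file reads so neither the gzip stage nor many small files force tiny writes.
        class CachePackSketch {

            static final int BUFFER_SIZE = 64 * 1024;

            static void packDirectory(File dir, OutputStream uploadStream) throws IOException {
                try (OutputStream buffered = new BufferedOutputStream(uploadStream, BUFFER_SIZE);
                     GzipCompressorOutputStream gzip = new GzipCompressorOutputStream(buffered);
                     TarArchiveOutputStream tar = new TarArchiveOutputStream(gzip)) {
                    tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX);
                    addFiles(tar, dir, "");
                    tar.finish();
                }
            }

            private static void addFiles(TarArchiveOutputStream tar, File file, String base) throws IOException {
                String name = base + file.getName();
                if (file.isDirectory()) {
                    File[] children = file.listFiles();
                    if (children != null) {
                        for (File child : children) {
                            addFiles(tar, child, name + "/");
                        }
                    }
                } else {
                    TarArchiveEntry entry = new TarArchiveEntry(file, name);
                    tar.putArchiveEntry(entry);
                    try (InputStream in = new BufferedInputStream(new FileInputStream(file), BUFFER_SIZE)) {
                        IOUtils.copy(in, tar, BUFFER_SIZE);
                    }
                    tar.closeArchiveEntry();
                }
            }
        }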

    I will try to investigate further if I find time for it. Maybe tar -> gz of many small files on the agent is generally slow on our hardware/setup (VM, Docker).

Type: Improvement
Priority: Major
Labels: No labels
Issue Votes: 0
Watchers: 2
Reference: OD-1952