Retry deleting docker container/network upon failure (OD-2257)
jbauer opened 12 months ago

We received a build error saying that Docker had no subnets available to create a new network for the job container. Looking at the VM hosts, we discovered that multiple networks created by the agent had not been cleaned up correctly.

Comparing the numbers in the network names, it seems that a network is not cleaned up correctly if the build job has been canceled by OneDev.

  • Robin Shen commented 12 months ago

    OneDev does delete the network upon job cancellation, but this can take a while after the build is cancelled.

  • jbauer commented 12 months ago

    The networks covered a long timespan. Currently we are at build number ~1300, and there were leftover networks from build ~800 up to recent builds. Build cancelation usually happens when a build is running and new commits arrive on that branch. It doesn't happen too often for us, so the networks accumulated slowly over time. I had to manually delete about 30-40 networks spread across three agent VMs.

    So I am relatively sure cleanup has not happened. What does "take a while" mean?

  • Robin Shen changed fields 12 months ago
    Type: Bug → Question
  • Robin Shen changed fields 12 months ago
    Priority: Major → Normal
  • Robin Shen commented 12 months ago

    > So I am relatively sure cleanup has not happened. What does "take a while" mean?

    OneDev deletes the container and network in the background upon cancellation, and this can take some time. Most of the time this works, but occasionally container and network deletion fails for some unknown reason. I have also seen this happen outside of OneDev.

  • jbauer commented 12 months ago

    I have now forced some cancellations by committing single commits in a row, and you are right: the corresponding networks have been cleaned up, almost immediately actually.

    Maybe OneDev should retry the cleanup 3 times with some backoff timer before giving up? Given that the container/network name is unique it should be fine to retry it multiple times.

  • Robin Shen changed fields 12 months ago
    Type: Question → Improvement
  • Robin Shen changed title 12 months ago
    Title: "Agent fails to cleanup Docker networks if jobs are canceled" → "Retry deleting docker container/network upon failure"
  • OneDev changed state to 'Closed' 12 months ago
    State: Open → Closed
  • OneDev commented 12 months ago

    State changed as code fixing the issue is committed (5f5259da)

  • OneDev changed state to 'Released' 11 months ago
    State: Closed → Released
  • OneDev commented 11 months ago

    State changed as build OD-5877 is successful
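
The retry idea raised in the thread (a few attempts with a backoff timer before giving up) could look roughly like the sketch below. It is not taken from the OneDev agent or from commit 5f5259da; the function names `retry_with_backoff`, `docker`, and `cleanup`, the three-attempt default, and the 1 s / 2 s delays are all assumptions made for illustration. It shells out to the standard `docker rm -f` and `docker network rm` commands.

```python
import subprocess
import time
from typing import Callable, Sequence


def retry_with_backoff(action: Callable[[], None], attempts: int = 3,
                       initial_delay: float = 1.0, factor: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `action` until it succeeds or `attempts` is exhausted.

    Hypothetical helper: returns True on success, False after the last failure.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            action()
            return True
        except Exception as exc:
            if attempt == attempts:
                print(f"giving up after {attempts} attempts: {exc}")
                return False
            sleep(delay)      # wait before the next attempt
            delay *= factor   # exponential backoff: 1s, 2s, 4s, ...
    return False


def docker(args: Sequence[str]) -> None:
    """Run a docker CLI command; check=True raises on a non-zero exit code."""
    subprocess.run(["docker", *args], check=True, capture_output=True)


def cleanup(container: str, network: str) -> bool:
    """Remove the job container first, then its network, retrying each step."""
    removed_container = retry_with_backoff(lambda: docker(["rm", "-f", container]))
    removed_network = retry_with_backoff(lambda: docker(["network", "rm", network]))
    return removed_container and removed_network
```

Because container and network names are unique per job, repeating the same removal is harmless. A fuller version would probably treat a "not found" error as success rather than retrying it, since that means an earlier attempt already removed the resource.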

Type: Improvement
Priority: Normal
Assignee:
Labels: No labels
Issue Votes (0)
Watchers (2)
Reference: OD-2257