#1544  Temporal agents no longer work correctly
Closed
jbauer opened 8 months ago

I have upgraded from 8.6.10 to 9.1.2 and now temporal agents do not seem to work correctly anymore.

I am using Docker swarm to scale agents, so the agents use the properties serverUrl, agentTokenFile and temporalAgent. I had 2 agents running, and after the upgrade the OneDev UI showed both, but one was always offline. I stopped both Docker containers and OneDev then showed both as offline. This is already odd, because temporal agents should simply disappear from the OneDev UI instead of being listed as offline.
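
For reference, scaling is done on the swarm service itself; roughly like this with the Python docker SDK (the service name onedev-agent is just what I use, and docker service scale onedev-agent=2 is the CLI equivalent):

# Rough sketch using the Python docker SDK (pip install docker); the
# service name is specific to my setup.
import docker

client = docker.from_env()                       # talks to the swarm manager
service = client.services.get("onedev-agent")    # the agent service
service.scale(2)                                 # same effect as: docker service scale onedev-agent=2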

So I deleted both agents in OneDev, generated a new agent token, reconfigured the Docker service, pulled the newest agent image manually and scaled the service to 1 to start a single agent. It showed up in OneDev as a temporal agent with the name onedev-agent-1 (the hostname), and the agent log in OneDev is:

2023-09-06 09:15:45,044 INFO  [WrapperSimpleAppMain] io.onedev.agent.Agent Cleaning temp directory...
2023-09-06 09:15:45,070 INFO  [WrapperSimpleAppMain] io.onedev.agent.Agent Connecting to http://onedev:6610...
2023-09-06 09:15:45,510 INFO  [WrapperSimpleAppMain] io.onedev.agent.Agent Connecting to http://onedev:6610...
2023-09-06 09:15:45,836 INFO  [HttpClient@caef02e-32] io.onedev.agent.AgentSocket Connected to server

Scaling the agent service to 2, in order to start a second agent instance with hostname onedev-agent-2, does not change anything in the OneDev UI. It still shows a single agent. The container log of the second agent is:

09:24:11 INFO  io.onedev.agent.Agent - Connecting to http://onedev:6610...
09:24:11 INFO  io.onedev.agent.AgentSocket - Connected to server
09:24:11 ERROR io.onedev.agent.AgentSocket - Token already used by another agent
09:24:16 INFO  io.onedev.agent.Agent - Connecting to http://onedev:6610...

So two issues:

  • agent scaling does not work anymore
  • temporal agents that are offline are not removed automatically from OneDev. Instead they are listed as offline.
Robin Shen commented 8 months ago

Since 8.1.0, a different token should be used for each agent. See this incompatibility note:

https://code.onedev.io/~help/incompatibilities#810

Sharing the same token across different agents is not safe and is not maintainable. For example, if you want to remove an agent, all agents sharing the same token will be removed as well.

So the temporal agent feature is flawed, as it always uses the same token. In future versions a secure and flexible approach will be introduced to launch agents on demand, so that it is possible to launch agents in EC2 or in your swarm cluster.

jbauer commented 8 months ago

Since 8.1.0, a different token should be used for each agent. See this incompatibility note:

https://code.onedev.io/~help/incompatibilities#810

Sharing the same token across different agents is not safe

Why is it not safe? If one agent has been compromised it is already an unsafe situation.

and is not maintainable.

Well, now it is not maintainable anymore, because scaling agents has to be done manually by defining new agent services with their own tokens and deploying them.

For example, if you want to remove an agent, all agents sharing the same token will be removed as well.

Yes, that was already clear two years ago when sharing a token was introduced. See issue #601.

To mitigate that, issue #602 introduced a distinction between deleting a token (which removes all agents using it) and simply removing an agent by name, for agents that have become outdated because of scaling or hostname changes. In addition, temporal agents were introduced, which delete themselves automatically.

So now we are back in the same situation as two years ago.

So the temporal agent feature is flawed, as it always uses the same token. In future versions a secure and flexible approach will be introduced to launch agents on demand, so that it is possible to launch agents in EC2 or in your swarm cluster.

I guess there is no concrete time frame?

Robin Shen commented 8 months ago

Why is it not safe? If one agent has been compromised it is already an unsafe situation.

The word "unsafe" might be inaccurate. I mean that a leaked token will affect all agents sharing that token, and you need to re-assign tokens for all of them.

Also, shared tokens mean that there is no reliable way to distinguish different agents, and this may cause other trouble in the future.

Well, now it is not maintainable anymore, because scaling agents has to be done manually by defining new agent services with their own tokens and deploying them.

With the on-demand agent launch feature, it will be much more maintainable and scalable. OneDev will launch new agents when the system is busy and terminate them when it is idle. However, this will be an EE feature.

If you want to use on-demand agents now and for free, please consider k8s, which is obviously the mainstream.

jbauer commented 8 months ago

I mean that a leaked token will affect all agents sharing that token, and you need to re-assign tokens for all of them.

The agent token is basically the same as an API token. It is usually up to the user whether to use one token per application or to share the token between multiple applications. It is the equivalent of a login. If it is compromised and you revoke it, sure, everyone with that "login" loses access.

Also, shared tokens mean that there is no reliable way to distinguish different agents, and this may cause other trouble in the future.

I don't think authentication information should be used to distinguish agents. Each agent could compute a unique ID and store it in its own working directory. Or simply use the hostname, which is already used to create the working directory and thus should be unique.
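
Just to illustrate the idea (this is not OneDev code; the file name and layout are made up):

# Illustrative sketch only, not actual OneDev agent code: compute a
# stable per-agent ID on first start and reuse it afterwards.
import uuid
from pathlib import Path

def agent_id(work_dir: Path) -> str:
    id_file = work_dir / "agent.id"              # hypothetical file name
    if id_file.exists():
        return id_file.read_text().strip()
    work_dir.mkdir(parents=True, exist_ok=True)
    new_id = uuid.uuid4().hex
    id_file.write_text(new_id)
    return new_id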

If you want to use on-demand agents now and for free, please consider k8s, which is obviously the mainstream.

I don't know how k8s would solve this, as you still need a single env variable with unique content per pod. Also, k8s is usually way too complex to maintain for use during development, unless you are fine with being stuck when something stops working and you have no idea how to fix it. So I don't think it is mainstream unless you reach a certain size and have dedicated people to maintain it.

Currently I don't care that much about auto-scaling / on-demand creation of agents during development. In development it is fine to have a fixed number of agents and occasionally increase that number if too many builds queue up. However, with that step backwards I simply cannot configure the agents in the same way as before. It loses features, and configuration is more complex.

Before, I had a single service with three replicas and I could tell Docker to never run multiple agents on the same physical host at the same time. Now I would need to create three services without replicas (replicas=1), because each service needs its own unique agent token as an env variable. Then two or more agents can potentially run on the same physical host, because the services are independent. I don't want that, because I want maximum parallel computing power. My only remaining option is to tell each of the three services on which physical hosts it is allowed to run. Consider 5 physical hosts where I want 3 agents alive at all times while being able to lose 2 physical hosts (e.g. one is in maintenance and the other crashes unexpectedly). That would mean each agent service has to be assigned to three hosts (since two can go down), but then the assignments overlap and multiple agents would likely run on the same host as soon as some hosts shut down.

With the previous solution, Docker swarm would simply rearrange the 3 containers across the available hosts. If two hosts are down, no problem, I still have three agents.
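
To make the old setup concrete, it was roughly equivalent to this (sketched with the Python docker SDK; the image name, env variable names and the maxreplicas placement option are assumptions on my side):

# Rough equivalent of the previous single-service setup. Image name and
# env variable names are assumptions for illustration; the point is one
# service, three replicas, at most one replica per physical host.
import docker
from docker.types import ServiceMode

client = docker.from_env()
client.services.create(
    "1dev/agent:latest",                         # agent image name assumed
    name="onedev-agent",
    env=["serverUrl=http://onedev:6610",
         "agentToken=<shared-token>"],           # the shared token that is no longer allowed
    mode=ServiceMode("replicated", replicas=3),  # three agents in total
    maxreplicas=1,                               # at most one per node (needs a docker SDK version with maxreplicas support)
)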

Robin Shen changed fields 8 months ago
Type: 'Bug' → 'Discussion'
Robin Shen commented 8 months ago

The agent token is basically the same as an API token. It is usually up to the user whether to use one token per application or to share the token between multiple applications. It is the equivalent of a login. If it is compromised and you revoke it, sure, everyone with that "login" loses access.

An agent is different from ad hoc applications using API tokens. It connects to the OneDev server, is part of the build grid, and the OneDev server has to manage it.

I don't think authentication information should be used to distinguish agents. Each agent could compute a unique ID and store it in its own working directory. Or simply use the hostname, which is already used to create the working directory and thus should be unique.

An authoritative identification is needed, not something generated on the agent side.

Shared tokens kept biting me while I was developing the build grid feature, and I cannot even remember all of those problems exactly. This is why I decided to remove them in 8.1.0.

For the majority of cases, starting fixed agents is enough. For cases where dynamic agents are required, I think looking at the EE feature is fair.

Robin Shen commented 8 months ago

Also, agent work files will no longer be put under a directory identified by host name, as this causes side effects in the majority of cases (cache misses if the host name changes, etc.).

Each agent should mount a local directory as its work volume. Mounting NFS as the work directory will make the cache extremely slow, as the cache typically contains very many small files.
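
For example, something along these lines with the Python docker SDK (a sketch only; volume name, mount path and image name are illustrative):

# Sketch: give each agent service a local named volume as its work
# directory instead of an NFS mount. All names are illustrative.
import docker

client = docker.from_env()
client.services.create(
    "1dev/agent:latest",                             # agent image name assumed
    name="onedev-agent",
    mounts=["onedev-agent-work:/agent/work:rw"],     # local named volume -> work dir in container
)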

jbauer commented 8 months ago

Each agent should mount a local directory as its work volume. Mounting NFS as the work directory will make the cache extremely slow, as the cache typically contains very many small files.

Yes, I moved away from NFS for the agents. It was too slow. Agents now use a local volume, even if that means losing the whole cache when an agent is moved to a different physical host by the orchestrator. But this happens only rarely and did not justify the longer build times with NFS.

Do you think OneDev will stay compatible and usable with docker / docker swarm, or will it become a k8s-only product sooner or later?

Robin Shen commented 8 months ago

Do you think OneDev will stay compatible and usable with docker / docker swarm, or will it become a k8s-only product sooner or later?

More than 90% of OneDev deployments (based on download metrics) run directly in docker environments without k8s, so OneDev will not become a k8s-only product. For docker swarm support, do you know if it has a convenient API to launch containers programmatically?

jbauer commented 8 months ago

For HTTP examples, see https://docs.docker.com/engine/api/sdk/examples/ (switch the examples to HTTP).

The engine API itself is documented at https://docs.docker.com/engine/api/ and also allows creating swarm clusters, managing swarm services, etc. For example: https://docs.docker.com/engine/api/v1.43/#tag/Service/operation/ServiceCreate
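
To make that concrete, the low-level client of the Python docker SDK maps almost one-to-one onto the ServiceCreate endpoint; a sketch (image name and env variables are placeholders):

# Sketch of calling the /services/create endpoint through the low-level
# Python docker SDK client; the types mirror the REST payload. Image
# name and env variables are placeholders.
import docker
from docker.types import ContainerSpec, TaskTemplate, ServiceMode

api = docker.APIClient(base_url="unix://var/run/docker.sock")
spec = ContainerSpec(
    image="1dev/agent:latest",                   # agent image name assumed
    env=["serverUrl=http://onedev:6610"],        # plus whatever token the server hands out
)
api.create_service(
    TaskTemplate(container_spec=spec),
    name="onedev-agent-on-demand",
    mode=ServiceMode("replicated", replicas=1),
)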

Robin Shen commented 8 months ago

Thanks. I will check them when implementing the on-demand agent feature (issue #1545).

jbauer commented 8 months ago

In Docker swarm there are only swarm services, and docker swarm manages the running containers of each service (within a swarm service they are called tasks). However, just because docker has been configured as a swarm does not mean you are forbidden from running normal containers that are not managed by docker swarm.
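
So even on a swarm node OneDev could launch a plain container directly, e.g. (sketch; image and env names are assumptions):

# Sketch: start a plain container outside of swarm's control, even on a
# node that is part of a swarm. Image and env names are assumptions.
import docker

client = docker.from_env()
client.containers.run(
    "1dev/agent:latest",                         # agent image name assumed
    detach=True,
    environment={"serverUrl": "http://onedev:6610"},
)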

jbauer changed state to 'Closed' 8 months ago
State: 'Open' → 'Closed'
jbauer commented 8 months ago

Closing this for now. I have updated OneDev to the latest version and reconfigured the agent deployment. It is less optimal now, but at least there are multiple agents online again.

Type: Question
Priority: Major
Assignee: (none)
Labels: none
Issue Votes: 0
Watchers: 3
Reference: onedev/server#1544