Failed to validate connection org.postgresql.jdbc.PgConnection@3a84e4a4 (This connection has been closed.) (OD-893)
Zhou You opened 3 years ago

The connection pool configuration ends up occupying all of the database's connections, affecting other applications.

select datname,pid,application_name,state from pg_stat_activity;

(screenshots: CleanShot 2022-08-27 at 00.38.24@2x.png, CleanShot 2022-08-27 at 00.37.25@2x.png)
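
To see how many of those connections are actually doing work, a grouped variant of the query above can help (just a diagnostic sketch, nothing OneDev-specific):

-- count server-side connections per database and state
select datname, state, count(*)
from pg_stat_activity
group by datname, state
order by count(*) desc;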

  • Robin Shen commented 3 years ago

    Most of them are in the idle state. You can edit hibernate.properties and set maximumPoolSize to a smaller value.
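
    For example (the value is purely illustrative, pick one that fits your workload):

    hibernate.hikari.maximumPoolSize=10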

  • Robin Shen commented 3 years ago

    Also, is this error easy to reproduce?

  • Zhou You commented 3 years ago

    At first I tried specifying some of these settings via environment variables, but that had no effect. Looking at the code I found that only a few hard-coded variables take effect, so I edited hibernate.properties instead:

    hibernate.hikari.minimumIdle=5
    hibernate.hikari.maximumPoolSize=25
    #hibernate.hikari.keepaliveTime=60000
    hibernate.hikari.idleTimeout=120000
    hibernate.hikari.leakDetectionThreshold=150000
    #hibernate.hikari.maxLifeTime=180000
    hibernate.hikari.connectionTimeout=3000
    hibernate.hikari.validationTimeout=2500
    

    The database is a container started from the supabase/postgres:14.1.0.61 image. After changing the configuration, it looks like some connections are never released, or are still considered in use on the PG side?

    2022-08-27 07:30:10,946 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:30:10,946 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:30:40,946 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:30:40,947 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:31:10,947 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:31:10,947 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:31:40,947 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:31:40,948 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:32:10,948 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:32:10,948 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:32:40,948 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:32:40,948 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:33:10,948 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:33:10,949 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:33:40,949 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:33:40,949 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:34:10,949 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Before cleanup stats (total=5, active=0, idle=5, waiting=0)
    2022-08-27 07:34:10,949 DEBUG [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool HikariPool-1 - After cleanup  stats (total=5, active=0, idle=5, waiting=0)
    

    (screenshot: CleanShot 2022-08-27 at 07.35.04@2x.png) The number of occupied connections starts at 5 and grows slowly from there.

  • Zhou You commented 3 years ago

    It may be that the IPVS default idle timeout (900s) in Docker (Swarm / K8s) silently drops idle connections. When Hikari then closes such a connection, Postgres never notices, so from Postgres's point of view the connection still exists. That causes the connection leak. I'll change maxLifeTime and watch it for a day.
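
    To check which IPVS timeouts are actually in effect, something like this should work on a swarm node (assuming ipvsadm is installed there; the values shown are the usual Linux defaults):

    $ ipvsadm -l --timeout
    Timeout (tcp tcpfin udp): 900 120 300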

  • Robin Shen commented 3 years ago

    I'm not familiar with IPVS. If you find the root cause, please let me know so I can add it to the documentation. Thanks!

  • Zhou You commented 3 years ago

    Confirmed that it is indeed an IPVS problem. After changing maxLifetime to 10 minutes, there are no more connection leaks.

    #hibernate.hikari.minimumIdle=10
    #hibernate.hikari.maximumPoolSize=10
    #hibernate.hikari.keepaliveTime=60000
    #hibernate.hikari.idleTimeout=120000
    hibernate.hikari.leakDetectionThreshold=150000
    hibernate.hikari.maxLifetime=600000
    hibernate.hikari.connectionTimeout=3000
    hibernate.hikari.validationTimeout=2500
    

    (screenshot: CleanShot 2022-08-27 at 20.22.07@2x.png) That said, I'd still suggest making all of the connection pool settings adjustable via environment variables.

  • Robin Shen changed state from 'Open' to 'Closed' 3 years ago
  • Robin Shen commented 3 years ago

    Thank you. These properties will be added to environment variables.

  • jbauer commented 3 years ago

    As I also run OneDev in Docker, I encountered this problem.

    Instead of changing the maxLifetime setting of the HikariCP connection pool, you can also configure Linux to send TCP keep-alive probes earlier. This is generally a good idea for any swarm service that uses IPVS and has long idle times for TCP connections.

    In my docker compose file I have added the following to all services that have such idle behavior:

    sysctls:
      - net.ipv4.tcp_keepalive_time=300
      - net.ipv4.tcp_keepalive_intvl=30
      - net.ipv4.tcp_keepalive_probes=10
    

    The above tells Docker to set those sysctl properties for the onedev service/container. Linux will then send the first TCP keep-alive probe after 300 seconds of inactivity and continue sending probes every 30 seconds, up to 10 probes if none are answered; after 10 failed probes the connection is considered dead. This keeps IPVS updated, so the connection will not be silently killed by IPVS for being idle. Because IPVS does not tell anybody that it has killed a connection (it sends no reset packets), both sides of the connection may leak it, and in the case of Postgres the configured max connections on the Postgres side can be reached easily.
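
    For context, a minimal compose service showing where that key goes might look like this (service name and image are illustrative):

    services:
      onedev:
        image: 1dev/server
        sysctls:
          - net.ipv4.tcp_keepalive_time=300
          - net.ipv4.tcp_keepalive_intvl=30
          - net.ipv4.tcp_keepalive_probes=10

    With these numbers the kernel gives up on a dead peer after at most 300 + 10 × 30 = 600 seconds, comfortably below the 900 second IPVS idle timeout.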

  • Zhou You commented 3 years ago

    For reference, the default TCP keepalive settings:

    $ sysctl -a | grep tcp_keep
    net.ipv4.tcp_keepalive_intvl = 75
    net.ipv4.tcp_keepalive_probes = 9
    net.ipv4.tcp_keepalive_time = 7200
    

    However, as https://cloud.google.com/compute/docs/troubleshooting/general-tips mentions:

    Note that idle connections are tracked for a maximum of 10 minutes, after which their traffic is subject to firewall rules, including the implied deny ingress rule. If your instance initiates or accepts long-lived connections with an external host, you should adjust TCP keep-alive settings on your Compute Engine instances to less than 600 seconds to ensure that connections are refreshed before the timeout occurs.

    They recommend the following settings:

    sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
    

    In k8s we can do this via the initContainer route:

    initContainers:
      - command:
        - sysctl
        - -w
        - net.ipv4.tcp_keepalive_time=60
        - net.ipv4.tcp_keepalive_intvl=60
        - net.ipv4.tcp_keepalive_probes=5
        image: busybox:stable
        name: prepare-sysctl
        securityContext:
          privileged: true
    

    The Kubernetes issue tracker mentions this: https://github.com/kubernetes/kubernetes/issues/32457#issuecomment-680325785. Since these settings seem to be reasonable defaults, I propose we make them our chart's defaults as well.

    However, it would still be simple and convenient to be able to configure the HikariCP connection pool through environment variables.

  • jbauer commented 3 years ago

    "However, it would still be simple and convenient to be able to configure the HikariCP connection pool through environment variables."

    Sure, it is an easy solution, but it comes at the cost of Postgres having to tear down and spin up a backend process every x minutes. That is something you can avoid by adjusting keep-alive settings.
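
    You can observe that churn directly (a diagnostic sketch; backend_start is when each Postgres backend process was started, so with a 10 minute maxLifetime no row should ever be much older than that):

    select pid, state, backend_start
    from pg_stat_activity
    order by backend_start;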

    Keep in mind that maxLifetime also kills connections that have recently been used, because it is a hard limit for each connection regardless of usage. If your minimumIdle is smaller than maximumPoolSize, you can use idleTimeout to tear down idle connections earlier, or you can use keepaliveTime so that HikariCP pings each connection regularly, which should refresh IPVS as well. If minimumIdle and maximumPoolSize are the same, you have to use keepaliveTime, because the pool has a fixed size; see the sketch below.
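
    As a sketch, a keepaliveTime-based variant of the earlier hibernate.properties could look like this (values illustrative; keepaliveTime just needs to stay well below the 900s IPVS idle timeout):

    hibernate.hikari.minimumIdle=5
    hibernate.hikari.maximumPoolSize=25
    # ping each idle connection every 60s so IPVS keeps seeing traffic
    hibernate.hikari.keepaliveTime=60000
    # retire connections that sit idle for 2 minutes (only applies above minimumIdle)
    hibernate.hikari.idleTimeout=120000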

    However, I still think configuring TCP keep-alive settings for a given k8s/swarm service is better, because I feel the application should not assume or research anything about the underlying default system configuration in a cloud environment. Instead, at deployment time the Linux namespace for that container/service should be configured to better support the application's characteristics.

    But in the end, all solutions work.

Type: Question
Priority: Normal
Reference: OD-893