A while ago I refactored a WebSocket service. After it had been running for some time, a colleague reported that the service was laggy and often failed to return data.
So I tried to reproduce the situation.
I opened the browser and navigated to the corresponding page: the data loaded immediately. But after refreshing the page several times, the data sometimes failed to load. In the console I found the following error:
WebSocket connection to 'ws://xxxxx/' failed: Error in connection establishment: net::ERR_CONNECTION_FAILED
Generally, this error is caused by network issues or a connection timeout. After ruling out my own network, I increased the WebSocket connection timeout, but the problem still occurred.
Next I suspected our long-connection gateway. I checked the gateway logs and found that the connections failing with ERR_CONNECTION_FAILED had never reached the gateway at all.
So the problem had to be in the project I had refactored.
I then checked the Pod logs and found that no log line was printed for connections that had been established and later disconnected.
The project uses the echo framework with a log middleware that prints request information after each request finishes.
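For context, the middleware is wired up at startup roughly like this (a minimal sketch assuming echo v4's built-in middleware.Logger; the project's actual middleware and routes may differ):

package main

import (
    "github.com/labstack/echo/v4"
    "github.com/labstack/echo/v4/middleware"
)

func main() {
    e := echo.New()
    // Logger prints one line per request, but only after the handler returns,
    // so a handler that never returns produces no access log at all.
    e.Use(middleware.Logger())
    // Hypothetical route; handler is the WebSocket handler shown later in this post.
    e.GET("/ws/pods", handler)
    e.Logger.Fatal(e.Start(":8080"))
}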
The missing log lines suggested that when the client closed the WebSocket connection, the server side never finished the request, i.e. the WebSocket connections were leaking.
I entered the Pod to verify the connection count:
netstat -na|grep ESTABLISHED|wc -l
> 19
After a connection was successfully established:
netstat -na|grep ESTABLISHED|wc -l
> 23
After the client exited:
netstat -na|grep ESTABLISHED|wc -l
> 21
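(The same check can also be run from outside the Pod, assuming kubectl access to the cluster; the pod name and namespace below are placeholders:)

kubectl exec <pod-name> -n <namespace> -- sh -c 'netstat -na | grep ESTABLISHED | wc -l'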
Clearly, connections were leaking. Browsers limit the number of concurrent connections to the same host, so as leaked connections piled up, new connections could no longer be established, which produced the ERR_CONNECTION_FAILED error.
So I went looking for the problem in the code. The following is the approximate logic.
The project is part of a Kubernetes management platform: the service watches Pod state changes and pushes those changes to the frontend over WebSocket.
func handler(c echo.Context) error {
    // ...
    // Use request.Context() as the global context
    ctx := c.Request().Context()
    websocket.Handler(func(ws *websocket.Conn) {
        defer ws.Close()
        if err := watchPodStatus(ctx, param.Namespace, func(e watch.Event) (bool, error) {
            // Send the change to the frontend
            websocket.Message.Send(ws, ...)
            return false, nil // never stop on our own; only ctx cancellation ends the watch
        }); err != nil {
            websocket.Message.Send(ws, err.Error())
        }
    }).ServeHTTP(c.Response(), c.Request())
    return nil
}
// watchPodStatus watches Pod state changes in the given namespace
// and feeds each event to conditionFunc.
func watchPodStatus(ctx context.Context, namespace string, conditionFunc func(e watch.Event) (bool, error)) error {
    // ...
    // This step lists and watches the Pods and handles each event through conditionFunc.
    // Note: this call is blocking; it does not return until ctx is canceled
    // or conditionFunc returns true.
    _, err := watchtools.UntilWithSync(ctx, listAndWatchFunction, &corev1.Pod{}, preconditionFunc, conditionFunc)
    return err
}
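For reference, here is a minimal sketch of how the elided list-and-watch wiring is typically built with client-go (the clientset variable and the field selector are assumptions, not taken from the original project):

package podwatch

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    watchtools "k8s.io/client-go/tools/watch"
)

// watchPods sketches the body of watchPodStatus under the assumptions above.
func watchPods(ctx context.Context, clientset kubernetes.Interface, namespace string,
    conditionFunc watchtools.ConditionFunc) error {
    // List and watch all Pods in the namespace.
    lw := cache.NewListWatchFromClient(
        clientset.CoreV1().RESTClient(), "pods", namespace, fields.Everything(),
    )
    // UntilWithSync blocks until conditionFunc returns true, an error occurs,
    // or ctx is canceled -- which is exactly why the choice of ctx matters here.
    _, err := watchtools.UntilWithSync(ctx, lw, &corev1.Pod{}, nil, conditionFunc)
    return err
}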
Here I had used the echo request's Request().Context() as the context for watchtools.UntilWithSync. The original idea was that this context would be canceled when the request ended, terminating the watch and letting the whole request exit.
In reality, request.Context() is not canceled when the frontend calls socket.close(). As a result the watch never exits, defer ws.Close() is never reached, and the server can never finish the request.
Once the cause is known, the fix is straightforward: derive a new cancelable context from request.Context(), and use a goroutine to monitor the WebSocket connection by calling Receive in a loop. Since the client never sends data to the server in this scenario, Receive blocks until the client calls socket.close(), at which point it returns an error; we then call cancel() on the derived context, which makes the watch exit. Here is the modified code:
func handler(c echo.Context) error {
    // ...
    // Use request.Context() as the parent context
    ctx := c.Request().Context()
    cancelCtx, cancel := context.WithCancel(ctx)
    defer cancel()
    websocket.Handler(func(ws *websocket.Conn) {
        defer ws.Close()
        go func() {
            var res string
            for {
                // Receive blocks until a message arrives. The client never sends
                // data in this scenario, so an error here means the connection was
                // closed or broke; cancel the context so the watch exits as well.
                if err := websocket.Message.Receive(ws, &res); err != nil {
                    cancel()
                    return
                }
            }
        }()
        if err := watchPodStatus(cancelCtx, param.Namespace, func(e watch.Event) (bool, error) {
            // Send the change to the frontend
            websocket.Message.Send(ws, ...)
            return false, nil // keep watching; only cancellation ends the watch
        }); err != nil {
            websocket.Message.Send(ws, err.Error())
        }
    }).ServeHTTP(c.Response(), c.Request())
    return nil
}
Done!