Weird Ktor and Nginx Upstream Issue

Publicado às 15/04/2023 16:00 • #kotlin #ktor #linux #nginx

Lately I've noticed that my nginx server is throwing "upstream prematurely closed connection while reading upstream" when reverse proxying my Ktor webapps, and I'm not sure why.

The client (Ktor) fails with "Chunked stream has ended unexpectedly: no chunk size" when nginx throws that error.

Exception in thread "main" java.io.EOFException: Chunked stream has ended unexpectedly: no chunk size
    at io.ktor.http.cio.ChunkedTransferEncodingKt.decodeChunked(ChunkedTransferEncoding.kt:77)
    at io.ktor.http.cio.ChunkedTransferEncodingKt$decodeChunked$3.invokeSuspend(ChunkedTransferEncoding.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.internal.LimitedDispatcher.run(LimitedDispatcher.kt:42)
    at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:95)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:570)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:677)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:664)

The error happens randomly, and it only seems to affect big (1MB+) requests... And here's the rabbit hole that I went down to track the bug and figure out a solution.

My setup is relatively simple: Cloudflare connects to my nginx server, and nginx proxies the connection to my webapp. The webapp connection is proxied by tailscale. So the first thing that I tried figuring out was if it was tailscale's fault or not, but the results were... inconclusive.

If the webapp is on the same host as the nginx server (path: cloudflare -> nginx -> webapp), the issue doesn't happen
If I connect to the webapp over tailscale (path: webapp (tailscale)), the issue doesn't happen
If I connect to the nginx server directly (path: nginx -> webapp (tailscale)), the issue doesn't happen
If I connect to the nginx server over tailscale (path: nginx (tailscale) -> webapp (tailscale)), the issue doesn't happen
If I connect to the nginx server over cloudflare (path: cloudflare -> nginx -> webapp (tailscale)), the issue happens

Why would Cloudflare interfere on the issue? What could Cloudflare be doing that would cause the upstream server (the webapp) to fail?

And another weird thing? I cannot reproduce the issue with curl. Well, actually, I can reproduce the error on nginx's side with curl, but the response is successfully received on curl's end, so the response is being sent to the client. In fact, if you look at nginx's logs, you can see that it says that it sent the response to the client. However, Ktor's HTTP client (tested with Apache and CIO engine) and Java's HTTP client fail to parse the response.

nginx-web-1        | 2023/04/15 03:17:19 [error] 20#20: *46 upstream prematurely closed connection while reading upstream, client: 172.71.238.54, server: rrp-test.loritta.website, request: "GET /img/c04518cdc6c9806022e89a64b0e36eb61da19d6e.gif?test=1681528486977 HTTP/1.1", upstream: "http://100.111.57.121:8080/img/c04518cdc6c9806022e89a64b0e36eb61da19d6e.gif?test=1681528486977", host: "rrp-test.loritta.website"
nginx-web-1        | 172.71.238.54 - - [15/Apr/2023:03:17:19 +0000] "GET /img/c04518cdc6c9806022e89a64b0e36eb61da19d6e.gif?test=1681528486977 HTTP/1.1" 200 219781 "-" "Ktor client"

So I've decided to investigate further... Trying to research about this issue on the internet is a pain because most of the related issues are timeout related, such as uWSGI/FastCGI/etc timing out, which isn't related to my issue since I know that it isn't the webapp timing out.

When I asked about this issue on the Bot Managers guild, Carl asked GPT 4 about this issue, and one of the solutions they gave was this

proxy_http_version 1.1;
proxy_set_header Connection "";

Surprisingly, this does fix the issue... but why, it doesn't make any sense! Why would that fix the issue? No one knows! There's two posts on StackOverflow talking about this fix, the first post says that they don't know why this fixes the issue, and the second post points to the first post, and they also don't know why this fixes the issue.

This is a solution, but I wasn't satisfied with it. I want to know more, I want a better solution, something is going wrong and I don't know what it is.

fofofofofofofofo suggested that I enabled nginx's debug logs and tcpdump'ing the connection, but alas, nginx's debug logs are super verbose and didn't indicate anything sus, and tcpdump didn't show any network related issues, both requests were received, parsed, and sent to nginx.

Then I thought about trying to change the server's engine. I always used Netty on all of my projects, so I didn't think about changing engines, so I swapped to CIO engine and... it worked.

It worked! That fixed the issue! I wasn't able to reproduce the issue anymore! With Netty, if I executed ~200 requests, around ~60 requests would fail due to the upstream issue. And with CIO, if I executed ~200 requests, none of them would fail.

After I found out that changing engines fixes the issue, I thought about the previous solution, and I came up with a theory about why it worked:

By default, nginx uses http/1.0 with the Connection: close header, the Connection: close header means that the client will close the connection after the response is received.

Maybe there's a race condition on Ktor's Netty engine implementation, where the server closes the connection before the client has fully read the response, so when nginx detects that upstream closed the connection prematurely, it thinks "oh shit" and flushes the received data downstream, and then it flushes the rest of the received data, because it did receive a complete request (and it knows that it did, after the error is thrown, nginx logs that it sent a http 200 response to the client), it just didn't receive the entire data before flushing.

That incomplete request borks out most http clients, curl is unaffected because curl is curl, so I guess curl has some failsafes.

The reason why using http/1.1 and removing the Connection header works is because the server, according to the http/1.1 spec, defaults to keep-alive if the Connection header isn't present, so the server ends up not closing the connection because it thinks that nginx will reuse the connection later on, making the client (in this case, nginx) handle the connection close step, which avoids the issue.

This would explain why setting http/1.1 + removing the connection header OR switching http engines fixes the issue.

Sadly, because this is so hard to reproduce and pinpoint what trips nginx, I haven't submitted this issue to JetBrains. If anyone has a better and consistent way to reproduce this issue, feel free to submit it as an issue on JetBrains' YouTrack! But for now, I'm relieved that I found a solution after spending ~2 days attempting to track down this stupid issue.