🐛 Incoming request ended abruptly: context canceled - http type #1360

Open
seano-vs opened this issue Nov 21, 2024 · 9 comments
Labels: Priority: Normal, Type: Bug

Comments


seano-vs commented Nov 21, 2024

Describe the bug

To Reproduce
Steps to reproduce the behavior:

  1. Running cloudflared in Kubernetes (DigitalOcean)

If it's an issue with Cloudflare Tunnel:
  2. Tunnel ID: 37c23069-2c21-413b-917e-3c618d4e05ef
  3. Configuration is here: https://github.com/Five-Borough-Fedi-Project/masto.nyc-docean/blob/main/kubernetes/dependencies/cloudflared.yaml and here: https://github.com/Five-Borough-Fedi-Project/masto.nyc-docean/blob/main/kubernetes/services/configmap-cloudflared.yaml

Expected behavior

I'm tracking down some issues in our proxy and am addressing strange errors in the chain. This one keeps coming up. It's not so much that I expect it never to happen, but I expect the error message to explain more about what is going on: "Incoming request ended abruptly: context canceled" doesn't really tell me anything. What does this mean? How do I address it?
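
For reference, my rough mental model of where a Go service produces this error (just an assumption on my part, not a claim about cloudflared's internals; the origin URL below is made up):

```go
// Rough sketch of the general mechanism (assumed, not cloudflared's code):
// a Go HTTP server cancels the incoming request's context when the client
// side disconnects, so a proxy still talking to the origin sees an error
// wrapping context.Canceled.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func proxyHandler(w http.ResponseWriter, r *http.Request) {
	// Forward to a made-up origin, reusing the incoming request's context.
	req, err := http.NewRequestWithContext(r.Context(), r.Method,
		"http://origin.internal"+r.URL.Path, r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// If the client went away mid-request, err wraps context.Canceled
		// and the proxy can only log that the request ended abruptly.
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/", proxyHandler)
	http.ListenAndServe(":8080", nil)
}
```

If that's what's happening, the message would just mean the requesting side (browser, edge, or an intermediate hop) dropped the connection before the origin answered, rather than the origin itself failing, but confirmation from the maintainers would be appreciated.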

Environment and versions

  • OS: official Docker image
  • Architecture: DigitalOcean Kubernetes
  • Version: 2024.11.1

Logs and errors

2024-11-21T14:40:33Z INF Starting tunnel tunnelID=37c23069-2c21-413b-917e-3c618d4e05ef
2024-11-21T14:40:33Z INF Version 2024.11.1 (Checksum 2737bd3967532125fd5b3dd3a066ba19a90513fb302521442c3f8fda81b59e9f)
2024-11-21T14:40:33Z INF GOOS: linux, GOVersion: go1.22.5-devel-cf, GoArch: amd64
2024-11-21T14:40:33Z INF Settings: map[config:/etc/cloudflared/config/config.yaml cred-file:/etc/cloudflared/creds/credentials.json credentials-file:/etc/cloudflared/creds/credentials.json metrics:0.0.0.0:2000 no-autoupdate:true]
2024-11-21T14:40:33Z INF Environmental variables map[STUNNEL_PORT:tcp://10.245.58.160:6379 STUNNEL_PORT_6379_TCP:tcp://10.245.58.160:6379 STUNNEL_PORT_6379_TCP_ADDR:10.245.58.160 STUNNEL_PORT_6379_TCP_PORT:6379 STUNNEL_PORT_6379_TCP_PROTO:tcp STUNNEL_SERVICE_HOST:10.245.58.160 STUNNEL_SERVICE_PORT:6379 STUNNEL_SERVICE_PORT_STUNNEL:6379]
2024-11-21T14:40:33Z INF Generated Connector ID: 68095439-0a95-4c0f-82f8-91203f90c9df
2024-11-21T14:40:33Z INF Initial protocol quic
2024-11-21T14:40:33Z INF ICMP proxy will use 10.244.0.68 as source for IPv4
2024-11-21T14:40:33Z INF ICMP proxy will use fe80::2467:98ff:fe27:a1a6 in zone eth0 as source for IPv6
2024-11-21T14:40:33Z INF Starting Hello World server at 127.0.0.1:40067
2024-11-21T14:40:33Z INF Starting metrics server on [::]:2000/metrics
2024-11-21T14:40:34Z INF Registered tunnel connection connIndex=0 connection=xxx event=0 ip=198.41.200.43 location=ewr07 protocol=quic
2024-11-21T14:40:34Z INF Registered tunnel connection connIndex=1 connection=xxx event=0 ip=198.41.192.107 location=ewr11 protocol=quic
2024-11-21T14:40:35Z INF Registered tunnel connection connIndex=2 connection=xxx event=0 ip=198.41.200.73 location=ewr08 protocol=quic
2024-11-21T14:40:36Z INF Registered tunnel connection connIndex=3 connection=xxx event=0 ip=198.41.192.47 location=ewr01 protocol=quic
2024-11-21T14:45:52Z ERR  error="Incoming request ended abruptly: context canceled" connIndex=1 event=1 ingressRule=0 originService=http://mastodon-nginx.mastodon:80
2024-11-21T14:45:52Z ERR Request failed error="Incoming request ended abruptly: context canceled" connIndex=1 dest=https://masto.nyc/api/v1/accounts/verify_credentials event=0 ip=198.41.192.107 type=http
2024-11-21T14:53:22Z ERR  error="Incoming request ended abruptly: context canceled" connIndex=1 event=1 ingressRule=0 originService=http://mastodon-nginx.mastodon:80
2024-11-21T14:53:22Z ERR Request failed error="Incoming request ended abruptly: context canceled" connIndex=1 dest=https://masto.nyc/api/v2/instance event=0 ip=198.41.192.107 type=http
2024-11-21T14:53:22Z ERR  error="Incoming request ended abruptly: context canceled" connIndex=1 event=1 ingressRule=0 originService=http://mastodon-nginx.mastodon:80
2024-11-21T14:53:22Z ERR Request failed error="Incoming request ended abruptly: context canceled" connIndex=1 dest=https://masto.nyc/api/v1/statuses/107043229643297115 event=0 ip=198.41.192.107 type=http
2024-11-21T14:53:22Z ERR  error="Incoming request ended abruptly: context canceled" connIndex=1 event=1 ingressRule=0 originService=http://mastodon-nginx.mastodon:80
2024-11-21T14:53:22Z ERR Request failed error="Incoming request ended abruptly: context canceled" connIndex=1 dest=https://masto.nyc/api/v1/accounts/verify_credentials event=0 ip=198.41.192.107 type=http
2024-11-21T14:53:22Z ERR  error="Incoming request ended abruptly: context canceled" connIndex=1 event=1 ingressRule=0 originService=http://mastodon-nginx.mastodon:80
2024-11-21T14:53:22Z ERR Request failed error="Incoming request ended abruptly: context canceled" connIndex=1 dest=https://masto.nyc/api/v1/accounts/112074437915000549 event=0 ip=198.41.192.107 type=http
2024-11-21T15:01:15Z ERR  error="Incoming request ended abruptly: context canceled" connIndex=1 event=1 ingressRule=0 originService=http://mastodon-nginx.mastodon:80
2024-11-21T15:01:15Z ERR Request failed error="Incoming request ended abruptly: context canceled" connIndex=1 dest=https://masto.nyc/api/v1/followed_tags?limit=200 event=0 ip=198.41.192.107 type=http
2024-11-21T15:01:15Z ERR  error="Incoming request ended abruptly: context canceled" connIndex=1 event=1 ingressRule=0 originService=http://mastodon-nginx.mastodon:80
2024-11-21T15:01:15Z ERR Request failed error="Incoming request ended abruptly: context canceled" connIndex=1 dest=https://masto.nyc/api/v1/statuses/107043229643297115 event=0 ip=198.41.192.107 type=http
@seano-vs added the Priority: Normal and Type: Bug labels on Nov 21, 2024
@baptistejamin

We have a recurring issue at Crisp (a Cloudflare Enterprise customer) that significantly impacts us, especially over the past three weeks. We've raised this numerous times over the years, both here and with Cloudflare Support.

From my understanding, the problem lies in Cloudflared's inability to auto-detect and switch to stable proxy points when a nearby location becomes unstable, for instance during CF maintenance on a PoP.

Example:
Last night (22 Nov 2024), during maintenance on the AMS POP (as per the Cloudflare Status page), Cloudflared continued routing traffic through AMS, despite the POP being marked as unhealthy. Our servers are located in AMS, and instead of dynamically rerouting traffic to a stable proxy (e.g., LON or FRA), Cloudflared persisted with AMS, causing network instability.

Possible issue:

Cloudflared appears to use locations defined in allregions/discovery.go.
It relies on SRV records to determine proxy points.
However, the SRV records used by Cloudflared do not seem to account for the health status or connectivity stability of nearby POPs.
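
To illustrate what we mean (the service/proto/name below are our assumption of the record being queried, not necessarily what cloudflared actually looks up), a plain SRV lookup in Go would look like this:

```go
// Sketch of an SRV-based edge discovery lookup, similar in spirit to what
// allregions/discovery.go appears to do. The record name is an assumption.
package main

import (
	"fmt"
	"net"
)

func main() {
	// LookupSRV returns targets sorted by priority and randomized by weight.
	_, addrs, err := net.LookupSRV("v2-origintunneld", "tcp", "argotunnel.com")
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}
	for _, srv := range addrs {
		// Nothing in this record reflects the health of the PoP that will
		// actually answer on these targets.
		fmt.Printf("target=%s port=%d priority=%d weight=%d\n",
			srv.Target, srv.Port, srv.Priority, srv.Weight)
	}
}
```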

We think the issue could be:

  • DNS Caching: Cloudflare Discovery servers reroute traffic dynamically, but Cloudflared may still rely on cached SRV records, delaying the switch to a healthy proxy point.
  • SRV Updates: Discovery servers may not propagate updated SRV records when a location becomes unstable or undergoes maintenance.

In any case, there is a lack of stability checks: Cloudflare does not appear to validate the stability of the connection to the selected proxy point.

For instance:
If AMS is marked unstable, SRV might reroute traffic to LON.
However, there is no mechanism to ensure the LON route is viable, such as detecting packet loss or latency spikes between the two points.


morpig commented Nov 22, 2024

Last night (22 Nov 2024), during maintenance on the AMS POP (as per the Cloudflare Status page), Cloudflared continued routing traffic through AMS, despite the POP being marked as unhealthy.

Hi, I'll try to answer from what I know.

Cloudflare operates multiple data centers (known as PoPs) within one city. Therefore, one city (e.g. Amsterdam) being marked as "re-routed" or "partially re-routed" doesn't mean that all of the PoPs in Amsterdam are down, unless Cloudflare only operates one data center/PoP in that city (not the case for major cities such as AMS/LON/FRA).

You brought up SRV records and DNS caching; that is not the issue here, as Cloudflare Tunnel IPs are all anycast to the nearest regions.


However, using Cloudflare Tunnel doesn't guarantee network performance for end users.

Traffic between the end user's Cloudflare PoP and the tunnel PoP is carried on a best-effort basis. Let's say I'm hitting Cloudflare's Japan PoP: traffic to your tunnel PoP (AMS) will go over the public backbone, and if there are any cable cuts, that might have an impact on performance.

The way to get better performance globally is to use Argo.

I don't know what exact issues you are having, so maybe try turning on Argo for a bit to see if there are any differences.


valeriansaliou commented Nov 22, 2024

Hi @morpig, thank you for your response. I am answering for Baptiste since I'm also from Crisp and handled the downtime last night.

What our cfd indicated when we restarted them was that all 8 HA connections were established to ams00 targets, where 00 is a number that seemed to range from 01 to 16. This clearly indicates that your control plane served us AMS endpoints, even though AMS was under maintenance at the same time.

We did not observe packet loss in the region during the outage, so we linked our tunnel outage to the CF AMS maintenance. Our DC is at DigitalOcean AMS, so I expect we cross common exchange points, staying off the public backbone.

What I would have expected is that the CF control plane would serve backup PoPs that cfd could use instead, such as nearby FRA. We've seen our AMS cfd connect to both the CF AMS and CF FRA PoPs in the past, as per the cfd logs in our syslog. However, yesterday the CF tunnel control plane only served us the AMS CF PoP for all 8 HA connections.

While I understand the principle of anycast, could some network-level routing issue have occurred due to this night's maintenance? It's really hard to tell, but I suspect serving different PoPs would have helped resolve the outage.


morpig commented Nov 22, 2024

We’ve seen our AMS cfd connect to both CF AMS and CF FRA PoPs in the past, as per cfd logs in our syslog. However yesterday your control plane only served us with AMS CF PoP for all 8 HA connections.

I brought up this exact problem some time ago (and got a reply from the CF team): #747 (comment)


valeriansaliou commented Nov 22, 2024

Interesting that you've noticed the same issue as us. Given CF's response to you in 2022, this should have degraded gracefully yesterday, which clearly wasn't the case.

Note that QUIC was completely down during the outage, and we had to force HTTP/2 to get traffic to flow, with excruciatingly high latency (5-10s per request).


flpydsk commented Dec 1, 2024

What did support have to say about this? You will get a much better response from support than on GitHub.
Tunnels should just select four working endpoints from the list served, and CF's wider routing decides which PoP actually serves the tunnel endpoint. Only QUIC being down sounds like some weirdness on the network path between you and CF, perhaps a firewall, your ISP, or some other issue with the tunnel machines.

If you want, you can set the edge addresses manually using --edge ip:7844 and do your own health checks; see the sketch below.
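
For example, a trivial reachability probe against a pinned edge address could look like this (just a sketch: it only checks TCP on 7844, not QUIC/UDP, and the IP is one of the edge addresses from the logs above):

```go
// Minimal probe for an edge address you might pin with --edge ip:7844.
// Only verifies TCP reachability on 7844; QUIC (UDP/7844) and actual
// tunnel health are not covered.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	edge := "198.41.192.47:7844" // example edge IP taken from the logs above
	conn, err := net.DialTimeout("tcp", edge, 3*time.Second)
	if err != nil {
		fmt.Println("edge unreachable:", err)
		return
	}
	conn.Close()
	fmt.Println("edge reachable:", edge)
}
```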

@baptistejamin

So far, all we have received from them is "I've escalated the issue and we'll get back to you as soon as we have more information."

We don't expect any answer from them; they say it will be escalated every time but do nothing on their end.


flpydsk commented Dec 1, 2024

That is certainly odd. You said you are an Enterprise client? Do you have tunnel logs from this time?


flpydsk commented Dec 1, 2024

The SRV records should not change regardless of the state of the PoPs; only CF's routing of those IPs does, so DNS caching should not impact anything. There is also health checking of the tunnel connections; however, the routes and their health are out of scope for cloudflared. If you need dedicated connectivity or more control, you should consider CNI/PNI.

What our cfd indicated when we restarted them was that all 8 HA connections were established with ams00 targets, where 00 is a number that seemed to range from 01 to 16. This clearly indicates that your control plane served us with AMS endpoints, even if AMS was in maintenance at the same time.

AMS is an MCP (multi-colo PoP); the appended numbers identify the individual PoP within that city.

What I would have expected is that the CF control plane would serve backup PoP that cfd could use instead such as FRA which is nearby. We’ve seen our AMS cfd connect to both CF AMS and CF FRA PoPs in the past, as per cfd logs in our syslog. However yesterday CF tunnel control plane only served us with AMS CF PoP for all 8 HA connections.

The control plane does not really serve you a specific PoP for tunnel endpoints in the way you might expect; BGP decides where your tunnel connections actually land.

While I understand the principle of anycast, could some network level routing issue have occurred due to this night maintenance. It’s really hard to tell but I suspect serving different PoPs would have helped with resolving the outage.

I'd say this is more likely to be network/ISP related than a CF incident. cloudflared logs would help.
