Description
This issue is seen in both AKS and GCP. See notes for AKS at #252 (comment)
Describe the bug
Upon installation of Retina, connectivity can be lost for pods in a GKE cluster using managed Cilium.
To Reproduce
- Go to create a standard GKE cluster.
- Select the "Standard: You manage your cluster" option (see screenshot 1).
- Specify GKE version 1.26.11-gke.1055000 in the "No channel" channel selector (see screenshot 2). We suspect the issue would occur with other versions too, but we used a specific one for reproducibility.
- [Optional] Configure the cluster to run in one AZ with fewer nodes than the default to manage cost.
- [Important] In the "Networking" configuration tab for the entire cluster, select "Enable Dataplane V2" to enable managed Cilium-powered networking.
- Create the cluster and wait for all default pods in the cluster to come up.
- Install Retina and wait for the agent pods to start.
> VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
helm install retina oci://ghcr.io/microsoft/retina/charts/retina \
--set namespace=kube-system \
--version $VERSION \
--namespace kube-system \
--set image.tag=$VERSION \
--set operator.tag=$VERSION \
--set image.pullPolicy=Always \
--set logLevel=info \
--set operator.enabled=true \
--set operator.enableRetinaEndpoint=true \
--set enabledPlugin_linux="\[packetparser\]" \
--set enablePodLevel=true \
--set remoteContext=true
Note: if you are running a cluster with small nodes, you might need to manually edit the retina-agent DaemonSet to lower resource requests. Wait until retina-agent pods start.
- Identify metrics-server running in the kube-system namespace and check its logs. You will see error logs such as:
E0409 15:21:23.378785 1 webhook.go:202] Failed to make webhook authorizer request: Post "https://10.114.192.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": context canceled
E0409 15:21:23.378851 1 errors.go:77] Post "https://10.114.192.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": context canceled
- Identify the cluster IP and the endpoint IP:
> kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.114.192.1 <none> 443/TCP 45m
> kubectl get ep
NAME ENDPOINTS AGE
kubernetes 10.128.0.7:443 45m
- Connect to another pod and check connectivity to both IPs. You'll see that there is connectivity to the endpoint IP but not to the service IP.
> kubectl debug -ti --image="nixery.dev/shell/curl" kube-dns-ff4bbcc87-tvzm7 -n kube-system
bash-5.2# curl https://10.114.192.1 -v -k
...
bash-5.2# curl https://10.128.0.7 -v -k
* Trying 10.128.0.7:443...
* Connected to 10.128.0.7 (10.128.0.7) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: CN=34.173.138.225
* start date: Apr 9 14:52:44 2024 GMT
* expire date: Apr 8 14:54:44 2029 GMT
* issuer: CN=ca353e3b-048b-4feb-aa93-19a7c8a6aa89
* SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://10.128.0.7/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: 10.128.0.7]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: 10.128.0.7
> User-Agent: curl/8.4.0
> Accept: */*
>
* received GOAWAY, error=0, last_stream=1
< HTTP/2 403
< audit-id: 2c7f6280-d595-4ddf-850f-abf1cadd85d8
< cache-control: no-cache, private
< content-type: application/json
< x-content-type-options: nosniff
< x-kubernetes-pf-flowschema-uid: 759447f6-3823-412a-86a3-09c764ef91eb
< x-kubernetes-pf-prioritylevel-uid: 2707b41b-d15c-402a-a039-b0df8aff1c2d
< content-length: 217
< date: Tue, 09 Apr 2024 15:45:36 GMT
<
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
"reason": "Forbidden",
"details": {},
"code": 403
}
* Closing connection
* TLSv1.3 (OUT), TLS alert, close notify (256):
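The manual curl comparison in the last step can be wrapped in a small helper so that it is easy to repeat before and after installing Retina. This is only a sketch: the function names `probe` and `diagnose` are made up for illustration, the 5-second connect timeout is an arbitrary choice, and it assumes it is run inside a container that has a POSIX shell and curl (such as the debug container used above).

```shell
# probe IP: print "reachable" if any HTTP status comes back over HTTPS,
# "unreachable" otherwise. A 403 still counts as reachable, because it
# shows that the TCP and TLS path to the API server works.
probe() {
  # -k skips certificate verification, as in the manual curl calls.
  # On connection failure curl exits non-zero and -w prints 000.
  if code=$(curl -k -s -o /dev/null --connect-timeout 5 \
      -w '%{http_code}' "https://$1/") && [ "$code" != "000" ]; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# diagnose SERVICE_IP ENDPOINT_IP: probe both addresses and summarise.
diagnose() {
  svc_state=$(probe "$1")
  ep_state=$(probe "$2")
  echo "service IP  $1: $svc_state"
  echo "endpoint IP $2: $ep_state"
  if [ "$svc_state" = "unreachable" ] && [ "$ep_state" = "reachable" ]; then
    echo "result: endpoint reachable but service IP is not (the symptom described here)"
  elif [ "$svc_state" = "reachable" ] && [ "$ep_state" = "reachable" ]; then
    echo "result: both paths work"
  else
    echo "result: endpoint itself is unreachable, likely a different problem"
  fi
}
```

With the IPs from this report, the call would be `diagnose 10.114.192.1 10.128.0.7`. On an affected cluster it should report the service IP as unreachable and the endpoint IP as reachable.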
Expected behaviour
No connectivity impact when installing Retina.
Screenshots
Step (2). Select "Standard: You manage your cluster".
Step (3). Select "No channel" when specifying the version, then specify version 1.26.11-gke.1055000.
Step (4). Select "Enable Dataplane V2" in the cluster network configuration tab.
Platform (please complete the following information):
See steps to reproduce.
Additional context
N/A