Kubernetes Service Proxy
Kubernetes supports Service Proxy in three modes: iptables, IPVS, and Userspace. This article analyzes the path of Service request packets in each Service Proxy mode.
1. Service, Pod Info
[Shell 1] Service and Pod Information
[Shell 1] shows Service and Pod information for Kubernetes Service Proxy analysis. Three nginx Pods were created using Deployment, and ClusterIP type my-nginx-cluster Service, NodePort type my-nginx-nodeport Service, and LoadBalancer type my-nginx-loadbalancer Service were attached.
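The exact manifests are not shown here, but a comparable environment can be reproduced with commands along the following lines. Only the Service names follow [Shell 1]; the Deployment name, image, replica count, and port are assumptions.

```shell
# Sketch of a comparable setup (Deployment name, image, and port are assumptions)
kubectl create deployment my-nginx --image=nginx --replicas=3
kubectl expose deployment my-nginx --name=my-nginx-cluster      --type=ClusterIP    --port=80
kubectl expose deployment my-nginx --name=my-nginx-nodeport     --type=NodePort     --port=80
kubectl expose deployment my-nginx --name=my-nginx-loadbalancer --type=LoadBalancer --port=80
kubectl get service,pod -o wide
```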
2. iptables Mode
![[Figure 1] Service Request Packet Path in iptables Mode](/blog-software/docs/theory-analysis/kubernetes-service-proxy/images/iptables-mode-service-packet-path.png)
[Figure 1] Service Request Packet Path in iptables Mode
[NAT Table 1] ~ [NAT Table 8] Main NAT Tables in iptables Mode
In iptables Mode, the Service Proxy role is performed entirely by iptables rules. It is currently the default proxy mode in Kubernetes. [Figure 1] shows the path of Service request packets in iptables Mode. [NAT Table 1] ~ [NAT Table 8] show the actual contents of the main NAT Tables in [Figure 1]. The NAT Tables in [Figure 1] are configured identically on all nodes that make up the Kubernetes cluster. Therefore, Service request packets can be sent from any node that makes up the Kubernetes cluster.
Since request packets sent from most Pods are delivered to the Host’s Network Namespace through the Pod’s veth, request packets are delivered to the KUBE-SERVICES Table by the PREROUTING Table. Request packets sent from Pods or Host Processes using the Host’s Network Namespace are delivered to the KUBE-SERVICES Table by the OUTPUT Table. In the KUBE-SERVICES Table, if the Dest IP and Dest Port of the request packet match the IP and Port of a ClusterIP Service, the request packet is delivered to the KUBE-SVC-XXX Table, which is the NAT Table of the matching ClusterIP Service.
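As a rough sketch of the kind of rules involved (chain suffixes, addresses, and ports below are illustrative, not taken from [NAT Table 1] ~ [NAT Table 8]):

```shell
# Entry points: both the PREROUTING and OUTPUT paths funnel Service traffic into KUBE-SERVICES
iptables -t nat -A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
iptables -t nat -A OUTPUT     -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

# ClusterIP match: packets to the Service IP/Port jump to the per-Service chain
iptables -t nat -A KUBE-SERVICES -d 10.96.50.20/32 -p tcp --dport 80 \
  -m comment --comment "default/my-nginx-cluster" -j KUBE-SVC-XXX
```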
If the Dest IP of the request packet in the KUBE-SERVICES Table is the Node’s own IP, the request packet is delivered to the KUBE-NODEPORTS Table. In the KUBE-NODEPORTS Table, if the Dest Port of the request packet matches the Port of a NodePort Service, the request packet is delivered to the KUBE-SVC-XXX Table, which is the NAT Table of the NodePort Service. If the Dest IP and Dest Port of the request packet in the KUBE-SERVICES Table match the External IP and Port of a LoadBalancer Service, the request packet is delivered to the KUBE-FW-XXX Table, which is the NAT Table of the matching LoadBalancer Service, and then to the KUBE-SVC-XXX Table, which is the NAT Table of the LoadBalancer Service.
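The NodePort and LoadBalancer dispatch follows the same pattern; a sketch with an assumed NodePort and external IP:

```shell
# NodePort: traffic addressed to a local Node IP falls through to KUBE-NODEPORTS,
# where the NodePort (30080 here is an assumption) selects the per-Service chain
iptables -t nat -A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
iptables -t nat -A KUBE-NODEPORTS -p tcp --dport 30080 -j KUBE-SVC-YYY

# LoadBalancer: traffic to the external IP passes through KUBE-FW-ZZZ before KUBE-SVC-ZZZ
iptables -t nat -A KUBE-SERVICES -d 203.0.113.10/32 -p tcp --dport 80 -j KUBE-FW-ZZZ
iptables -t nat -A KUBE-FW-ZZZ -j KUBE-MARK-MASQ
iptables -t nat -A KUBE-FW-ZZZ -j KUBE-SVC-ZZZ
```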
The KUBE-SVC-XXX Table uses iptables’ statistic feature to randomly and evenly load balance request packets across the Pods that make up the Service. In [NAT Table 4], since the Service consists of 3 Pods, it can be confirmed that request packets are configured to be randomly and evenly load balanced to 3 KUBE-SEP-XXX Tables. In the KUBE-SEP-XXX Table, request packets are DNATed to the Pod’s IP and the Port set in the Service. Request packets DNATed to the Pod’s IP are delivered to that Pod through the Container Network built via CNI Plugin.
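A sketch of how the statistic match spreads traffic over three endpoints; the probabilities follow the 1/3, then 1/2, then remainder pattern kube-proxy uses, and the Pod IP is illustrative:

```shell
# Each packet picks one KUBE-SEP-* chain: 1/3 of traffic, then 1/2 of the rest, then the remainder
iptables -t nat -A KUBE-SVC-XXX -m statistic --mode random --probability 0.33333 -j KUBE-SEP-AAA
iptables -t nat -A KUBE-SVC-XXX -m statistic --mode random --probability 0.50000 -j KUBE-SEP-BBB
iptables -t nat -A KUBE-SVC-XXX -j KUBE-SEP-CCC

# The endpoint chain DNATs to the chosen Pod's IP and Port
iptables -t nat -A KUBE-SEP-AAA -p tcp -j DNAT --to-destination 192.168.1.10:80
```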
Since request packets delivered to Services are delivered to Pods through iptables’ DNAT, response packets sent from Pods must be SNATed to the Service’s IP, not the Pod’s IP. There is no explicit SNAT Rule for Services in iptables. However, iptables SNATs response packets received from Service Pods based on TCP Connection information from Linux Kernel’s Conntrack (Connection Tracking).
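The Conntrack entries that drive this reverse translation can be inspected directly; for example (the ClusterIP below is an assumption):

```shell
# List tracked connections whose original destination was the ClusterIP;
# the reply direction of each entry already carries the Pod IP that Conntrack translates back
conntrack -L -d 10.96.50.20
```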
2.1. Source IP
The Src IP of Service request packets is either preserved or SNATed to the Host’s IP through Masquerade, depending on the case. The KUBE-MARK-MASQ Table marks request packets that should be Masqueraded. Marked packets are Masqueraded in the KUBE-POSTROUTING Table, and their Src IP is SNATed to the Host’s IP. When examining the iptables Tables, it can be confirmed that packets to be Masqueraded are marked through the KUBE-MARK-MASQ Table in various places.
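The mark-and-masquerade pair looks roughly like this; the 0x4000 mark value matches kube-proxy’s default, but the exact rules vary by version:

```shell
# KUBE-MARK-MASQ only sets a mark bit on the packet
iptables -t nat -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000

# KUBE-POSTROUTING Masquerades (SNATs to the outgoing interface IP) only marked packets
iptables -t nat -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
iptables -t nat -A KUBE-POSTROUTING -j MASQUERADE
```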
![[Figure 2] Packet Path According to externalTrafficPolicy of NodePort, LoadBalancer Service](/blog-software/docs/theory-analysis/kubernetes-service-proxy/images/nodeport-policy.png)
[Figure 2] Packet Path According to externalTrafficPolicy of NodePort, LoadBalancer Service
The externalTrafficPolicy of NodePort and LoadBalancer Services uses Cluster as the default value. If the externalTrafficPolicy value is set to Cluster, the Src IP of request packets is SNATed to the Host’s IP through Masquerade. The left side of [Figure 2] shows a diagram where the externalTrafficPolicy value is set to Cluster and request packets are Masqueraded. It can be confirmed that all packets with the Port of NodePort and LoadBalancer Services as Dest Port in the KUBE-NODEPORTS Table are marked through the KUBE-MARK-MASQ Table.
If the externalTrafficPolicy value is set to Local, the rules related to the KUBE-MARK-MASQ Table disappear from the KUBE-NODEPORTS Table, and Masquerade is not performed. Therefore, the Src IP of request packets is maintained as-is. Additionally, request packets are not load balanced on the Host and are only delivered to Target Pods running on the Host where the request packet was delivered. If a request packet is delivered to a Host without Target Pods, the request packet is dropped. The right side of [Figure 2] shows a diagram where externalTrafficPolicy is set to Local and Masquerade is not performed.
externalTrafficPolicy: Local is mainly used in LoadBalancer Services. This is because the Src IP of request packets can be maintained, and since the Cloud Provider’s Load Balancer performs load balancing, the Host’s load balancing process is unnecessary. If externalTrafficPolicy: Local, packets are dropped on Hosts without Target Pods, so during the Host Health Check process performed by the Cloud Provider’s Load Balancer, Hosts without Target Pods are excluded from load balancing targets. Therefore, the Cloud Provider’s Load Balancer load balances request packets only to Hosts with Target Pods.
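A minimal way to switch the LoadBalancer Service to Local policy and check the health-check Port that Kubernetes allocates for the Cloud Provider’s Load Balancer to probe (the Service name follows [Shell 1]):

```shell
# Set externalTrafficPolicy to Local on the existing LoadBalancer Service
kubectl patch service my-nginx-loadbalancer -p '{"spec":{"externalTrafficPolicy":"Local"}}'

# With Local policy, Kubernetes allocates a healthCheckNodePort; the cloud load balancer
# probes it and only Hosts with Target Pods report healthy
kubectl get service my-nginx-loadbalancer -o jsonpath='{.spec.healthCheckNodePort}'
```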
![[Figure 3] Packet Path Before/After Applying Hairpinning in iptables Mode](/blog-software/docs/theory-analysis/kubernetes-service-proxy/images/iptables-mode-hairpinning.png)
[Figure 3] Packet Path Before/After Applying Hairpinning in iptables Mode
Masquerade is also necessary when a Pod sends request packets to the IP of the Service it belongs to and the request packet is load balanced back to itself. The left side of [Figure 3] shows this case. Because the request packet is DNATed, both the Src IP and Dest IP of the packet become the Pod’s IP. The response packet is therefore handled entirely inside the Pod without passing through the Host’s NAT Table, so the reverse SNAT back to the Service IP never happens and the connection does not work correctly.
Using Masquerade, request packets returned to the Pod can be forced to pass through the Host so that SNAT is performed. This technique of intentionally detouring packets so that they come back through the Host is called Hairpinning. The right side of [Figure 3] shows a case where Hairpinning is applied using Masquerade. In the KUBE-SEP-XXX Table, if the Src IP of the request packet is the same as the IP to DNAT to, that is, when a Pod receives a packet it sent to the Service, the request packet is marked through the KUBE-MARK-MASQ Table and Masqueraded in the KUBE-POSTROUTING Table. Since the Src IP of the packet received by the Pod is now the Host’s IP, the Pod’s response goes back to the Host’s NAT Table, where it is SNATed and DNATed again before being delivered to the Pod.
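In rule form, the hairpin case is the extra source-IP match in the endpoint chain; the Pod IP below is illustrative:

```shell
# If the sender is the same Pod the packet is about to be DNATed to, mark it for Masquerade
iptables -t nat -A KUBE-SEP-AAA -s 192.168.1.10/32 -j KUBE-MARK-MASQ
iptables -t nat -A KUBE-SEP-AAA -p tcp -j DNAT --to-destination 192.168.1.10:80
```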
3. Userspace Mode
![[Figure 4] Service Request Packet Path in Userspace Mode](/blog-software/docs/theory-analysis/kubernetes-service-proxy/images/userspace-mode-service-packet-path.png)
[Figure 4] Service Request Packet Path in Userspace Mode
[NAT Table 9] ~ [NAT Table 12] Main NAT Tables in Userspace Mode
The Userspace Mode of Service Proxy is a mode where kube-proxy running in Userspace performs the Service Proxy role. It was the first Proxy Mode provided by Kubernetes. Currently, it is rarely used due to inferior performance compared to iptables Mode. [Figure 4] shows the path of Service request packets in Userspace Mode. [NAT Table 9] ~ [NAT Table 12] show the actual contents of the main NAT Tables in [Figure 4]. The NAT Tables and kube-proxy in [Figure 4] are configured identically on all nodes that make up the Kubernetes cluster. Therefore, Service request packets can be sent from any node that makes up the Kubernetes cluster.
Since request packets sent from most Pods are delivered to the Host’s Network Namespace through the Pod’s veth, request packets are delivered to the KUBE-PORTALS-CONTAINER Table by the PREROUTING Table. In the KUBE-PORTALS-CONTAINER Table, if the Dest IP and Dest Port of the request packet match the IP and Port of a ClusterIP Service, the request packet is Redirected to kube-proxy. If the Dest IP of the request packet is the Node’s own IP, the packet is delivered to the KUBE-NODEPORT-CONTAINER Table. In the KUBE-NODEPORT-CONTAINER Table, if the Dest Port of the request packet matches the Port of a NodePort Service, the request packet is Redirected to kube-proxy. If the Dest IP and Dest Port of the request packet match the External IP and Port of a LoadBalancer Service, that request packet is also Redirected to kube-proxy.
Request packets sent from Pods or Host Processes using the Host’s Network Namespace are delivered to the KUBE-PORTALS-HOST Table by the OUTPUT Table. The subsequent request packet processing in the KUBE-PORTALS-HOST and KUBE-NODEPORT-HOST Tables is similar to request packet processing in the KUBE-PORTALS-CONTAINER and KUBE-NODEPORT-CONTAINER Tables. The difference is that DNAT is performed instead of Redirecting request packets.
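A sketch of the two variants; the ClusterIP, Node IP, and the per-Service proxy Port (36000 here) are illustrative:

```shell
# Pod-originated traffic: REDIRECT sends it to kube-proxy's local listening Port
iptables -t nat -A KUBE-PORTALS-CONTAINER -d 10.96.50.20/32 -p tcp --dport 80 \
  -j REDIRECT --to-ports 36000

# Host-originated traffic: DNAT to the Node IP and the same kube-proxy Port
iptables -t nat -A KUBE-PORTALS-HOST -d 10.96.50.20/32 -p tcp --dport 80 \
  -j DNAT --to-destination 10.0.0.11:36000
```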
All request packets sent to Services are delivered to kube-proxy through Redirect or DNAT. kube-proxy listens on a dedicated Port for each Service, so the Dest Port of a request packet received by kube-proxy maps to exactly one Service. Therefore, kube-proxy can identify which Service the request packet should be delivered to from the Dest Port of the Redirected or DNATed request packet. kube-proxy then evenly load balances received request packets across the Pods belonging to that Service and forwards them.
Since kube-proxy operates in the Host’s Network Namespace, request packets sent by kube-proxy also pass through Service NAT Tables. However, since the Dest IP of request packets sent by kube-proxy is the Pod’s IP, the request packet is not changed by Service NAT Tables and is delivered to that Pod through the Container Network built via CNI Plugin.
4. IPVS Mode
![[Figure 5] Service Request Packet Path in IPVS Mode](/blog-software/docs/theory-analysis/kubernetes-service-proxy/images/ipvs-mode-service-packet-path.png)
[Figure 5] Service Request Packet Path in IPVS Mode
[NAT Table 13] ~ [NAT Table 15] Main NAT Tables in IPVS Mode
[IPset List 1] Main IPset List in IPVS Mode
[IPVS List 1] IPVS Contents in IPVS Mode
The IPVS Mode of Service Proxy is a mode where IPVS, an L4 Load Balancer provided by the Linux Kernel, performs the Service Proxy role. Since IPVS performs packet load balancing more efficiently than iptables, IPVS Mode shows higher performance than iptables Mode. [Figure 5] shows the path of request packets sent to Services in IPVS Mode. [NAT Table 13] ~ [NAT Table 15] show the actual contents of the main NAT Tables in [Figure 5]. [IPset List 1] shows the main IPsets used in IPVS Mode. [IPVS List 1] shows the actual contents of IPVS in [Figure 5]. The NAT Tables, IPset, and IPVS in [Figure 5] are configured identically on all nodes that make up the Kubernetes cluster. Therefore, Service request packets can be sent from any node that makes up the Kubernetes cluster.
Since request packets sent from most Pods are delivered to the Host’s Network Namespace through the Pod’s veth, request packets are delivered to the KUBE-SERVICES Table by the PREROUTING Table. Request packets sent from Pods or Host Processes using the Host’s Network Namespace are delivered to the KUBE-SERVICES Table by the OUTPUT Table. In the KUBE-SERVICES Table, if the Dest IP and Dest Port of the request packet match the IP and Port of a ClusterIP Service, the request packet is delivered to IPVS.
If the Dest IP of the request packet is the Node’s own IP, the request packet is delivered to IPVS via the KUBE-NODE-PORT Table. If the Dest IP and Dest Port of the request packet match the External IP and Port of a LoadBalancer Service, that request packet is delivered to IPVS via the KUBE-LOAD-BALANCER Table. As long as the default policy of the PREROUTING and OUTPUT Tables is ACCEPT, Service packets still reach IPVS even without the KUBE-SERVICES Table, so Service traffic is not affected.
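In IPVS Mode the NAT Tables mostly match against IPsets rather than per-Service rules; a sketch of the dispatch (the set names follow kube-proxy, the rules are simplified):

```shell
# NodePort traffic addressed to a local IP is handled via the KUBE-NODE-PORT Table,
# with the NodePort numbers kept in the KUBE-NODE-PORT-TCP IPset
iptables -t nat -A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
iptables -t nat -A KUBE-NODE-PORT -p tcp -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ

# LoadBalancer External IPs are kept in the KUBE-LOAD-BALANCER IPset
iptables -t nat -A KUBE-SERVICES -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER
iptables -t nat -A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
```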
IPVS performs Load Balancing and DNAT to the Pod’s IP and the Port set in the Service when the Dest IP and Dest Port of the request packet match the Service’s Cluster-IP and Port, when the Dest IP of the request packet is the Node’s own IP and the Dest Port matches the NodePort Service’s NodePort, or when the Dest IP and Dest Port of the request packet match the LoadBalancer Service’s External IP and Port. Request packets DNATed to the Pod’s IP are delivered to that Pod through the Container Network built via CNI Plugin. In [IPVS List 1], it can be confirmed that Load Balancing and DNAT are performed for all IPs associated with Services.
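The same mapping can be inspected or reproduced with ipvsadm; the ClusterIP and Pod IPs below are illustrative:

```shell
# Show the virtual servers and real servers kube-proxy programmed
ipvsadm -Ln

# Hand-built equivalent for one ClusterIP Service: one virtual server (round-robin)
# with three NAT-mode (-m) real servers, one per Pod
ipvsadm -A -t 10.96.50.20:80 -s rr
ipvsadm -a -t 10.96.50.20:80 -r 192.168.1.10:80 -m
ipvsadm -a -t 10.96.50.20:80 -r 192.168.1.11:80 -m
ipvsadm -a -t 10.96.50.20:80 -r 192.168.1.12:80 -m
```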
IPVS also uses TCP Connection information from Linux Kernel’s Conntrack, same as iptables. Therefore, response packets for Service Packets that were DNATed and sent by IPVS are SNATed again by IPVS and delivered to the Pod or Host Process that requested the Service. Hairpinning is also applied in IPVS Mode, like in iptables Mode, to solve the SNAT problem of Service response packets. It can be seen that Masquerade is performed in the KUBE-POSTROUTING Table when matching the KUBE-LOOP-BACK IPset rule. It can be confirmed that the KUBE-LOOP-BACK IPset contains all possible cases where the Packet’s Src IP and Dest IP can be the same Pod’s IP.
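A sketch of the hairpin handling; the set type and match direction follow kube-proxy, the entry is illustrative:

```shell
# KUBE-LOOP-BACK holds (endpoint IP, endpoint Port, Pod IP) triples where Src and Dest Pod are the same
ipset create KUBE-LOOP-BACK hash:ip,port,ip
ipset add    KUBE-LOOP-BACK 192.168.1.10,tcp:80,192.168.1.10

# Packets whose Dest IP, Dest Port, and Src IP match the set are Masqueraded on the way out
iptables -t nat -A KUBE-POSTROUTING -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
```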