ServiceMeshCon 2021: Kubernetes and Service Mesh Upgrade an Automobile Company's IT Infrastructure



This post records the technical talk "Kubernetes and Service Mesh Upgrade Automobile Company's IT Infrastructure", delivered at ServiceMeshCon Europe on May 4, 2021, which shares a production service mesh practice from one of our customers.

Abstract:

Rapid business development brings great challenges to an automobile manufacturing company's IT platforms. In this presentation, Chaomeng will share a practice of upgrading a traditionally built microservice platform to cloud-native infrastructure: gradually transforming the self-developed internal DNS plus ELB for service discovery and load balancing, and the per-VM Nginx for inbound traffic management, metrics, and access logs, into Kubernetes and a service mesh.

The practice solves the problem that every service request crosses too many heterogeneous middleware components and proxies, and turns out to provide lightweight resource management, auto-scaling, canary release, non-intrusive and transparent traffic management, security, and observability on a consistent infrastructure, thus improving ops efficiency and making ops work simpler and easier.

Transcript:

Hello everyone, I'm Chaomeng from Huawei. My topic today is "Kubernetes and Service Mesh Upgrade an Automobile Company's IT Infrastructure". I will introduce a service mesh practice in a production environment. Since its launch in 2018, Huawei Application Service Mesh (ASM) has served a large number of customers. This practice is about one of them, a leading automobile manufacturer in China, and I will introduce how cloud native helped upgrade their IT infrastructure.

About me: I am an architect of Huawei Application Service Mesh. I currently work on cloud-native projects such as service mesh, Kubernetes, and microservices, and I also help promote service mesh and cloud native in China. I am the author of the book "Cloud Native Service Mesh Istio", which helps people learn service mesh and Istio.

My talk includes the following three parts. The first part is the customer's business background and current solution. The second part covers the challenges of the current solution and, most importantly, the target cloud-native solution that solves these challenges. Finally, I will introduce how to migrate from the current solution to the target cloud-native solution.

Our customer is a leading automobile manufacturer. With the growth of automobile demand in China, the company's business has developed dramatically, especially its new energy vehicles, and this puts great pressure on the IT infrastructure: increased complexity, as applications become complicated and need to integrate with other platforms; increased capacity, as the number of vehicles grows; and more security requirements, including access control and service authentication. Heavy ops work and high IT cost are also big problems.

This is the current architecture. The customer's IT engineers told us they said no to the popular microservice frameworks several years ago. Instead, they built their own microservice platform based on DNS, ELB, and Nginx, providing service discovery and load balancing in an independent platform rather than in the application.
The overall architecture greatly depends on the central ELB, which provides load balancing for internal service communication. DNS is responsible for internal service resolution, the ingress Nginx provides TLS termination for inbound TLS traffic, Zuul plays the role of an L7 application gateway, and the Nginx on each node proxies traffic to the local service instances.

To better understand the difference between the current solution and the target cloud-native solution, we will compare abstract views of the two architectures. An abstraction of the current architecture looks like this. The top is the application layer; applications are developed in multiple languages. The third layer is the deployment environment; currently applications are deployed on virtual machines and BMS. The central part is the second layer, which provides service management by integrating ELB, DNS, and Nginx.
Because of this second-layer microservice platform, the applications are mainly developed in Spring Boot instead of Spring Cloud.

And this is an abstraction of the target solution. Compared with the previous one, the main differences are in the second and third layers:

  1. Replace the ELB, DNS, and Nginx integrated platform with a unified service mesh infrastructure.
  2. Upgrade the deployment environment from VMs and BMS to Kubernetes.

The overall architecture does not change a lot, so our customer's engineers proudly called their current solution "a self-developed mesh-like infrastructure before the service mesh era".

Next we will walk through the challenges of the current solution from several aspects, and focus on how the target cloud-native solution solves each of them.

First, let's look at service discovery and load balancing. In the current solution, DNS and ELB play an important role in service discovery and load balancing. That is:

  1. VM service instances are bound to the ELB.
  2. The service name and ELB IP are registered in DNS.
  3. The consumer queries DNS and sends the request to the resolved ELB IP.
  4. The ELB sends the traffic to a selected VM instance.
  5. The per-node Nginx proxies the request to the local service instance.

And with the target cloud-native solution, using Kubernetes and Istio (a minimal Service manifest is sketched after this list):

  1. No service registry is needed; Istio automatically retrieves registry data from Kubernetes.
  2. The client-side proxy intercepts outbound traffic,
  3. performs service discovery,
  4. and sends the request to one selected instance.
  5. The server-side proxy intercepts the traffic and performs server-side traffic management.
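
To make the "no registry" point concrete, here is a minimal sketch, assuming a hypothetical service named `vehicle` in a `demo` namespace: defining a plain Kubernetes Service is all the registration the target solution needs, because Istio reads services and endpoints directly from the Kubernetes API.

```yaml
# Hypothetical service for illustration: no manual DNS or ELB registration.
# Istio discovers this Service and its endpoints from the Kubernetes API
# and programs the sidecar proxies automatically.
apiVersion: v1
kind: Service
metadata:
  name: vehicle
  namespace: demo
spec:
  selector:
    app: vehicle          # matches the pod labels of the workload
  ports:
  - name: http
    port: 8080
    targetPort: 8080
```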

Let's compare the two solutions from the following views. For the service registry, the former needs the service name and ELB registered in DNS, while the latter needs no registry at all. For service discovery, the former relies on ELB and DNS, while the latter uses the mesh data plane and control plane. Load balancing in the former greatly depends on the ELB; in the latter it is done by the client-side proxy. When deploying a new service, with the current solution the administrator has to manually register the new service in DNS, but with the target solution Istio gets the service and its instances automatically from Kubernetes. With the current solution, the consumer sends requests to the internal ELB, and the ELB and Nginx act as static proxies; with the target solution, the consumer sends requests to the target service, and the request is intercepted by the mesh data plane, which acts as a transparent proxy performing service discovery and load balancing.

Next is canary release. Canary is an important part of our customer's daily work. With the current solution, the ELB is responsible for routing traffic to VM instances, and the traffic each version receives depends on the number of instances.
Since there are 3 instances of version one and 2 instances of version two, 60% of traffic will be sent to version one and 40% to version two.

With the target solution, Istio lets you specify traffic weights for different versions, and the weights control the proportion of traffic each version receives. A rule like the one sketched below routes 30% of traffic to the instances of version two and the other 70% to version one, no matter how many instances each version has.
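
A minimal sketch of such a rule, assuming the hypothetical `vehicle` service whose instances carry `version: v1` and `version: v2` labels: the DestinationRule defines the two subsets, and the VirtualService splits traffic 70/30 between them, independent of instance counts.

```yaml
# Subsets group instances of the same service by their version label.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: vehicle
  namespace: demo
spec:
  host: vehicle
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
# Weighted routing: 70% of requests go to v1, 30% to v2,
# regardless of how many instances each version has.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vehicle
  namespace: demo
spec:
  hosts:
  - vehicle
  http:
  - route:
    - destination:
        host: vehicle
        subset: v1
      weight: 70
    - destination:
        host: vehicle
        subset: v2
      weight: 30
```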

With the target solution, the service mesh can also do a canary release for a group of services: set the route for the first service, and the other services in the group just follow the same route, so that traffic from version one is sent to version one, and traffic from version two is sent to version two (see the sketch below).
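
One possible way to express this group canary, assuming the consumer workloads also carry `version` labels: a match on `sourceLabels` keeps traffic from v2 callers on the v2 subset of `vehicle`, while everything else stays on v1 (subset names follow the DestinationRule above).

```yaml
# Requests originating from workloads labeled version: v2 are routed to the
# v2 subset; all other traffic falls through to the v1 subset.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vehicle-group-canary
  namespace: demo
spec:
  hosts:
  - vehicle
  http:
  - match:
    - sourceLabels:
        version: v2
    route:
    - destination:
        host: vehicle
        subset: v2
  - route:
    - destination:
        host: vehicle
        subset: v1
```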

This table summarizes the differences between the two canary approaches. For the weighted policy, with the current solution the weight is controlled by the instance number, while with the target solution the weight can be specified flexibly.
The former does not support a traffic match policy, while the latter can route traffic to a certain version according to match conditions (an example follows). The current solution only supports L4 traffic; the target solution supports both L7 and L4, including TCP, TLS, HTTP, and gRPC. With the current solution, some L7 policies are implemented in application code, but with the target solution everything is handled by the platform, with no application code changes.
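
For the traffic match policy, a small sketch (the header name `x-canary` is purely illustrative): requests carrying the canary header are routed to version two, all other traffic to version one.

```yaml
# Header-based match: only requests with x-canary: "true" reach version two.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vehicle-header-match
  namespace: demo
spec:
  hosts:
  - vehicle
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: vehicle
        subset: v2
  - route:
    - destination:
        host: vehicle
        subset: v1
```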

As mentioned in the business background, security requirements keep increasing as the business develops. But with the current solution, only TLS termination is provided for ingress traffic. That is:

  1. Nginx provides TLS termination.
  2. All applications have to implement HTTPS services and maintain keys and certificates by themselves.
  3. Access control for confidential interfaces is embedded in the application.

And with the target solution, the service mesh provides transparent security (a sketch follows this list):

  1. TLS termination is provided by the gateway; the operator uploads the key and certificate in the form of a Kubernetes secret.
  2. Istio offers mutual TLS for transport authentication; just enable it, without any service code changes. This provides each service with an identity representing its role, secures service-to-service communication, and automates key and certificate generation, rotation, and distribution.
  3. It also provides authorization and access control for a target service or interface.
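
A minimal sketch of both settings, with hypothetical names: the ingress gateway terminates TLS using a key and certificate stored as a Kubernetes TLS secret, and a namespace-wide PeerAuthentication turns on strict mutual TLS, all without touching application code.

```yaml
# TLS termination at the ingress gateway. The operator creates the secret, e.g.
#   kubectl create secret tls vehicle-tls-cert --key=... --cert=... -n istio-system
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: vehicle-gateway
  namespace: demo
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: vehicle-tls-cert   # Kubernetes secret holding the key/cert
    hosts:
    - "vehicle.example.com"
---
# Strict mutual TLS for every workload in the namespace, no code changes needed.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: demo
spec:
  mtls:
    mode: STRICT
```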

This table shows the security differences. Both provide TLS termination and JWT authentication, but the former does not provide any service-to-service security. The latter provides mutual TLS authentication by the proxies, transparent TLS encryption, a key management system that automates key and certificate generation, distribution, and rotation, and flexible access control based on custom conditions such as headers or the source or target IP (sketched below). The target solution meets our customer's security requirements perfectly in this practice.
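
As one illustration of condition-based access control (the service account, IP range, path, and header values are all hypothetical), an AuthorizationPolicy on a confidential interface could look like this:

```yaml
# Only the "consumer" service account, calling from the given IP block with the
# expected header, may invoke the confidential interface of the vehicle service.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: vehicle-confidential
  namespace: demo
spec:
  selector:
    matchLabels:
      app: vehicle
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/demo/sa/consumer"]
        ipBlocks: ["10.0.0.0/16"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/confidential/*"]
    when:
    - key: request.headers[x-env]
      values: ["prod"]
```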

As the system becomes complicated, observability is needed to troubleshoot and optimize the applications. With the current solution, the per-node Nginx generates access logs, an Nginx exporter exports metrics, and a tracing agent on each node generates traces for Java applications.

With the target solution, Istio generates metrics, traces, and access logs for all services through the sidecars, so service developers do not need to do any extra work for this. The proxy generates service-oriented metrics covering latency, errors, and saturation; it generates spans on behalf of the applications; and it also generates access logs for service traffic.

Comparing the two solutions, the main difference is that the service mesh can collect all of the observability data for applications in any language, with more extensibility, transparency, and flexibility. The proxy generates all kinds of metrics, and the metadata or dimensions can be configured; it generates spans for the application, so applications only need to propagate a few request headers; and it generates access logs whose format can be configured (see the sketch below). Based on the access metrics, a topology can be built, giving a global view of the applications and services.
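
As one example of this configurability (the format string below is only an illustration, not the customer's actual log format), access logging and tracing can be switched on mesh-wide through the Istio installation configuration:

```yaml
# IstioOperator excerpt: sidecars write access logs to stdout in a custom
# format and report spans, without any change in the applications themselves.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout
    accessLogFormat: |
      [%START_TIME%] "%REQ(:METHOD)% %REQ(:PATH)%" %RESPONSE_CODE% %DURATION%ms
    enableTracing: true
```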

Okay, the target cloud-native architecture looks like this: a unified infrastructure in which Kubernetes and Istio work together, providing not only the application deployment environment but also the service management platform. It covers canary release, service discovery, load balancing, connection management, circuit breaking, fault injection, traffic mirroring, retry, redirection, authentication, authorization, metrics, tracing, and so on. The data plane works as a transparent proxy, performing traffic management, security, and observability on behalf of the applications. The ingress gateway gives more flexibility than Nginx and manages traffic together with the sidecars. The control plane is responsible for storing and managing the configuration, and distributes policies to the proxies and the gateway. The solution can also be easily configured to integrate with the customer's existing canary and chaos platforms, and with their metric, logging, and tracing systems. By separating all the common functions from the application code and offloading them to the infrastructure, it helps developers focus on the business logic.

This table summarizes the above aspects. The key difference is the architecture and mechanism: the former is an integrated platform that provides basic service discovery and load balancing, while the latter is an infrastructure designed to handle application communication. As for components, the former consists of separate components (ELB, DNS, Nginx), while the latter is a unified infrastructure with a control plane and a data plane. Both are based on proxies, but the former uses static proxies, while the latter uses transparent proxies that intercept and manage service traffic; this is the biggest difference and the most important characteristic of a service mesh. As for ops work, the former greatly depends on manual operation, while the latter mostly works automatically. As for service management, the former only provides service discovery and load balancing, and retry is coded in the application; the latter provides powerful service management covering connection, observability, and security. These differences result in both cost and resource reduction, helping the customer reduce its IT investment.

Next, the final part is how to migrate the customer's current solution to the target cloud-native solution safely and gracefully.

The main idea is to deploy a new environment with the target solution and gradually split traffic from the VM cluster to the Kubernetes cluster. To ensure the reliability of online services, each service is required to fail over to the VM instances when the new container instances are unavailable.

First, to fail over between VM instances and container instances, the VM instances and the container instances must be bound to one service and share the same service discovery and load balancing. Our solution is to have one service refer to both container and VM instances by assigning the same label selector, so traffic can be routed to both containers and VMs.

When the consumer service calls the target service vehicle, traffic can be routed either to a container instance or to a VM instance, as sketched below.
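
A sketch of how a VM instance can join the same service, assuming Istio's VM integration is used; the address, labels, and locality below are hypothetical. Because the labels match the pod labels, the `vehicle` Service selects the VM as just another endpoint, and the mesh load-balances across containers and VMs alike.

```yaml
# Registers an existing VM instance of "vehicle" into the mesh. Its labels are
# the same as the container pods', so the Kubernetes Service for "vehicle"
# picks it up as an additional endpoint.
apiVersion: networking.istio.io/v1beta1
kind: WorkloadEntry
metadata:
  name: vehicle-vm-1
  namespace: demo
spec:
  address: 192.168.1.10          # IP of the legacy VM instance (example value)
  locality: vm-region/vm-zone    # used later for locality-aware failover
  labels:
    app: vehicle
    version: v1
  serviceAccount: vehicle
```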

Next, and most important, it is required to retry against the VM instances when the container instances do not work. We set retryRemoteLocalities to retry to the VMs when the container instances are unavailable, as sketched below. The process looks like this: a request to a container instance fails for some environment or runtime reason; the proxy automatically retries against the remote VM instances; the retry succeeds, and the consumer's request succeeds.
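
A sketch of that retry rule, with illustrative attempt counts and timeouts; `retryRemoteLocalities: true` allows the retries to be sent to endpoints in other localities, which in this setup are the VM instances.

```yaml
# If a request to a (container) endpoint fails, retry up to 3 times and allow
# the retries to target remote localities, i.e. the VM instances.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vehicle-retry
  namespace: demo
spec:
  hosts:
  - vehicle
  http:
  - route:
    - destination:
        host: vehicle
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: connect-failure,refused-stream,5xx
      retryRemoteLocalities: true
```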

Finally, it is required to load-balance traffic to the container instances with high priority and fail over to the VM instances only when the container instances are unhealthy. Our solution is Istio locality load balancing: split a small part of the traffic to the VM instances when some containers are unhealthy, and switch all traffic to the VMs only when the containers are totally unhealthy. As shown in the table, even when half of the container instances are unhealthy, they still get seventy percent of the traffic. It is required to make sure that both the primary container instances and the secondary VM instances meet the traffic capacity requirements.
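
A sketch of the locality failover setting, assuming the container endpoints sit in a hypothetical `k8s-region` and the VM WorkloadEntry above is placed in `vm-region`; outlier detection is required for locality failover to take effect. Healthy containers keep the traffic, and traffic spills over to the VMs as the containers become unhealthy.

```yaml
# Prefer container endpoints (k8s-region); fail over to the VM locality
# (vm-region) only as container instances are detected unhealthy.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: vehicle-failover
  namespace: demo
spec:
  host: vehicle
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: k8s-region
          to: vm-region
    outlierDetection:            # marks unhealthy endpoints so failover can trigger
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
```

In practice this traffic policy would live in the same DestinationRule as the version subsets shown earlier; it is split out here only to keep the sketch focused.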

That is all of the practice. Thank you for your time.

Appendix: