Tanzu Learning Series | How TKG Uses GPU Resource Pools, Explained in One Article
Source: Chimsen Corporation 2022-10-20
As technological progress and industrial transformation accelerate, artificial intelligence (AI) has become a fiercely contested field. Governments, academic institutions, and enterprises all place a high value on AI, and new trends are emerging in academic research, technological innovation, and talent education. The AI acceleration market, dominated by GPU technology, is an important part of the AI market and has also grown rapidly. However, GPU hardware is expensive, and the traditional model of dedicating a GPU to a single user is neither flexible nor economical. At the same time, as cloud native technology matures, the demand for fine-grained, quickly delivered slices of GPU computing power calls for an economical and efficient GPU pooling solution.
As a leader in virtualization and cloud native technology, VMware has always been at the forefront of GPU computing resource pooling, with corresponding GPU resource pooling solutions for different usage scenarios.
The first two solutions are already widely adopted by customers and are familiar to readers. This article focuses on the VMware vSphere Bitfusion GPU pooling solution.
What is VMware vSphere Bitfusion?
VMware vSphere Bitfusion is a feature of vSphere 7 that provides remote GPU pooling over a network. Bitfusion virtualizes hardware accelerators, such as graphics processing units (GPUs), into a shared resource pool that can be accessed over the network to support artificial intelligence (AI) and machine learning (ML) workloads. With Bitfusion, GPUs can be abstracted, partitioned, automated, and shared like other computing resources. This helps customers build a data-center-level AI computing resource pool, so that user applications can transparently share and use AI accelerators on any server in the data center without modification.
The Bitfusion client can be deployed on bare metal, in virtual machines, or in containers within the data center. Through the vSphere Bitfusion plug-in in vCenter, the health, utilization, efficiency, and availability of all GPU servers on the network can be monitored. It is also possible to monitor GPU usage per client and to assign quotas and time limits.
vSphere Bitfusion requires NVIDIA's CUDA framework, which is a development and execution framework for AI/ML programs. Bitfusion essentially implements remote invocation of CUDA calls. It can be used together with AI frameworks such as TensorFlow, PyTorch, TensorRT, and PaddlePaddle.
The Bitfusion server is deployed on a host with a PCIe GPU card, and the card is assigned to the Bitfusion server in passthrough mode;
AI/ML applications and frameworks such as TensorFlow that need GPU resources can be deployed in virtual machines, on physical machines, or in containers. The Bitfusion client is installed alongside them and provides remote GPU partitioning and invocation over a low-latency network;
The low-latency network between the Bitfusion server and the Bitfusion client should be separated from the management network. It can run over TCP/IP, RoCE, or InfiniBand.
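The client-side flow described above can be sketched as a thin wrapper around the Bitfusion client command line. This is a minimal sketch, assuming the client provides the `bitfusion run` command with `-n` (number of GPUs) and `-p` (fraction of each GPU) options; the helper name `build_bitfusion_command` and the script `train.py` are hypothetical and only serve as illustration.

```python
import shlex
from typing import List


def build_bitfusion_command(
    app_argv: List[str], num_gpus: int = 1, gpu_fraction: float = 1.0
) -> List[str]:
    """Prefix an application command so it runs against remote GPUs.

    Assumption: the Bitfusion client exposes `bitfusion run -n N -p F <app>`.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 0.0 < gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be in the range (0, 1]")

    cmd = ["bitfusion", "run", "-n", str(num_gpus)]
    if gpu_fraction < 1.0:
        # Request only part of each GPU so other clients can share it.
        cmd += ["-p", str(gpu_fraction)]
    return cmd + list(app_argv)


if __name__ == "__main__":
    # Hypothetical training script that asks for half of one remote GPU.
    argv = build_bitfusion_command(["python3", "train.py"], num_gpus=1, gpu_fraction=0.5)
    print(shlex.join(argv))
```

The application itself stays unchanged; only the launch command differs, which is what allows existing CUDA programs to use the pool without modification.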
VMware vSphere Bitfusion and Kubernetes
Kubernetes has become the de facto standard platform for resource management and scheduling, so many customers choose to use GPUs in Kubernetes to run AI computing tasks. Kubernetes provides a device plugin mechanism that lets nodes discover and report device resources for use by Pods, and GPU resources are exposed in the same way. The benefits of scheduling GPUs with Kubernetes include: faster deployment, because containers avoid repeatedly setting up complex machine learning environments; higher cluster resource utilization, because cluster resources are scheduled and allocated in a unified way; and guaranteed resource exclusivity, because containers isolate heterogeneous devices and prevent mutual interference.
Combining Kubernetes' flexible resource scheduling with Bitfusion's GPU partitioning and remote invocation brings out the advantages of both. First, it accelerates deployment and avoids wasting time on environment preparation: container images capture the entire deployment process so that it can be reused, and many frameworks already provide container images. Second, it improves GPU utilization: with Bitfusion's time-division multiplexing, dynamic partitioning, and remote invocation, together with Kubernetes' unified scheduling, users can request resources when they need them and release them as soon as they are done, which puts the entire GPU resource pool to work.
VMware China R&D Cloud Native Lab has released the Bitfusion device plugin and open-sourced the code. The project enables GPU sharing in Kubernetes by using Bitfusion.
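The request-and-release pattern over partitioned GPUs can be illustrated with a toy model. This is a sketch under stated assumptions, not Bitfusion's actual allocator: the class `GpuPool`, its first-fit placement, and the use of whole-number percentages are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Tuple


class GpuPool:
    """Toy model of a GPU pool in which each GPU can be split by percentage."""

    def __init__(self, num_gpus: int) -> None:
        # Free capacity of each GPU, in percent (100 means fully idle).
        self.free: List[int] = [100] * num_gpus
        # job name -> (GPU index, percent held)
        self.leases: Dict[str, Tuple[int, int]] = {}

    def acquire(self, job: str, percent: int) -> Optional[int]:
        """Give `job` a slice of the first GPU with enough free capacity."""
        if not 1 <= percent <= 100:
            raise ValueError("percent must be between 1 and 100")
        if job in self.leases:
            raise ValueError(f"job {job!r} already holds a slice")
        for index, free in enumerate(self.free):
            if free >= percent:
                self.free[index] = free - percent
                self.leases[job] = (index, percent)
                return index
        return None  # The pool is currently full for a request of this size.

    def release(self, job: str) -> None:
        """Return the slice held by `job` to the pool."""
        index, percent = self.leases.pop(job)
        self.free[index] += percent


if __name__ == "__main__":
    pool = GpuPool(num_gpus=2)
    for name, pct in [("a", 50), ("b", 50), ("c", 100), ("d", 25)]:
        print(name, "->", pool.acquire(name, pct))
    pool.release("a")  # Job "a" finishes and frees half of GPU 0.
    print("d ->", pool.acquire("d", 25))
```

In this model two physical GPUs serve three jobs at once, and a fourth job gets a slice as soon as an earlier one releases its share.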
Project Address:
https://github.com/vmware/bitfusion-with-kubernetes-integration
The project lets Kubernetes use Bitfusion through two components:
bitfusion-device-plugin;
bitfusion-webhook;
The two components are built into separate container images.
bitfusion-device-plugin runs as a DaemonSet on every worker node, alongside the kubelet.
bitfusion-webhook runs as a Deployment on the Kubernetes master node.
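With both components in place, a workload asks for Bitfusion GPUs through its Pod specification. The following is a minimal sketch that builds such a manifest. The annotation keys (`auto-management/bitfusion`, `bitfusion-client/os`, `bitfusion-client/version`), their values, and the resource names (`bitfusion.io/gpu-amount`, `bitfusion.io/gpu-percent`) are assumptions based on the project's README and should be checked against the repository; the function name, Pod name, image, and command are hypothetical.

```python
import json
from typing import Any, Dict, List


def bitfusion_pod_manifest(
    name: str,
    image: str,
    command: List[str],
    gpu_amount: int = 1,
    gpu_percent: int = 50,
) -> Dict[str, Any]:
    """Build a Pod manifest that requests a share of remote Bitfusion GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                # Assumed keys: ask the webhook to inject the Bitfusion client.
                "auto-management/bitfusion": "all",
                "bitfusion-client/os": "ubuntu18",
                "bitfusion-client/version": "401",
            },
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": command,
                    "resources": {
                        "limits": {
                            # Assumed resource names advertised by the device plugin.
                            "bitfusion.io/gpu-amount": gpu_amount,
                            "bitfusion.io/gpu-percent": gpu_percent,
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = bitfusion_pod_manifest(
        name="bf-demo",
        image="registry.example.com/tensorflow:latest",  # hypothetical image
        command=["python3", "train.py"],  # hypothetical workload
    )
    # kubectl accepts JSON as well as YAML, e.g. `kubectl apply -f pod.json`.
    print(json.dumps(manifest, indent=2))
```

The device plugin is what makes the requested resource names schedulable on a node, while the webhook is what rewrites the Pod so that the container's command runs through the Bitfusion client.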
VMware vSphere Bitfusion and TKG
Tanzu Kubernetes Grid (TKG) is part of the Tanzu product family and is VMware's enterprise distribution of Kubernetes. It can be deployed across multiple cloud environments, both private and public, gives users a consistent Kubernetes experience, and is fully compatible with community Kubernetes.
TKG uses the Bitfusion device plugin to call the Bitfusion GPU resource pool remotely, which allows GPU computing power to be used flexibly.
Next, we test the combined TKG and Bitfusion solution.
Testing the Bitfusion and TKG Solution
Test topology
Test procedure