Optimizing the cost of using GPUs

GPUs (Graphics Processing Units) were created in the 1970s to speed up the creation and manipulation of images, and were quickly adopted by game console manufacturers to improve the fluidity of graphics. The use of GPUs for computing outside of graphics began in the 2000s. Today, the most popular use of GPUs is AI (Artificial Intelligence) and ML (Machine Learning). More history on the Wikipedia site: Graphics processing unit – Wikipedia. This is why GPUs are found in datacenter servers: AI and ML applications require many calculations executed in parallel, something a conventional processor (CPU) would struggle to do. Indeed, a CPU is the central element of a server, designed to execute lots of small sequential tasks very fast and at low latency; for that, it can count on about sixty cores per CPU (at the time of writing this article). A GPU, on the other hand, is made to perform thousands of tasks in parallel thanks to the thousands of cores that compose it. The downside of a GPU is its cost: it can run into tens of thousands of dollars, so you have to make sure it is being used properly all the time. The ideal is to share it so that it can be used simultaneously by several applications, in order to come close to consuming all of its resources. This is the credo of the Bitfusion solution, acquired by VMware in 2019 and now available as an add-on to vSphere. The GPUs are installed on the hypervisors and form a pool that is accessible to applications hosted directly on those hypervisors, or via the IP network if the applications are hosted on other hypervisors; the applications can even be hosted on physical servers or run in Kubernetes containers. Its use is reserved for Artificial Intelligence and Machine Learning applications using CUDA routines. CUDA was developed by NVIDIA to give non-graphics applications direct access to GPUs.
Thanks to Bitfusion, applications can consume one GPU, several GPUs, or just a portion of a GPU (in reality it is the GPU's memory that is shared). Once consumption is complete, the application releases the allocated GPU resources, which return to the pool for future requests.

From a technical point of view, Bitfusion requires the installation of components on both sides. On the hypervisors that have GPUs, a virtual appliance, downloadable from the VMware site, must be deployed on each one. On the clients (VMs, bare-metal servers, or Kubernetes containers/Pods) that will consume GPU resources, a Bitfusion client must be installed; it intercepts the CUDA calls made by the application and forwards them to the Bitfusion appliances over the IP network. This is transparent to the application.
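
To make the workflow concrete, here is a sketch of how a client might launch jobs through the Bitfusion CLI. The `bitfusion run` flags (`-n` for the number of GPUs, `-p` for a fraction of GPU memory) are quoted from memory and should be verified against the product documentation; the script only prints the commands so it can run without a Bitfusion installation:

```shell
#!/bin/sh
# Illustrative Bitfusion client invocations (the flags are assumptions,
# check `bitfusion run --help` on a real install).
# show() prints the command instead of executing it, so this sketch does
# not require Bitfusion to be present.
show() { echo "$@"; }

show bitfusion run -n 2 -- python3 train.py        # borrow two full GPUs
show bitfusion run -n 1 -p 0.5 -- python3 infer.py # half of one GPU's memory
```

When the job finishes, the wrapper exits and the GPU (or the memory slice) goes back to the pool.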

Since the exchanges between clients and Bitfusion appliances go through the IP network, it is preferable to have at least a 10 Gb/s network; you generally need one 10 Gb/s link for every 4 GPUs.

Customize an Ubuntu VM with Cloud-Init on vSphere

Cloud-Init seems to be the favorite customization tool for the major OSes aiming to be installed on different cloud environments (AWS, Azure, GCP, vSphere, …). It is very powerful and versatile, but at the beginning it is difficult to understand how it works.

Personalization information is stored in a file called user-data. This file is transmitted to Cloud-Init using a mechanism specific to each cloud. In our case, it is the VMware Tools that transmit the user-data file; once received, Cloud-Init executes it.

I wasted a tremendous amount of time finding the minimum steps to customize Ubuntu OS with Cloud-Init in a vSphere environment. I was looking for the possibility of customizing the OS when cloning from an OVF template.

Below is the procedure I use for an Ubuntu 20.04.2 LTS, freshly installed and after its first reboot. I kept the default values except for the French keyboard and the installation of the OpenSSH server option.

Cloud-Init must be told to retrieve the user-data customization file via the OVF parameters of the VM; there are steps to be done on the VM side and on the OS side.

OS side:

  • Delete the file that forces the datasource to None instead of OVF:
    • sudo rm /etc/cloud/cloud.cfg.d/99-installer.cfg
  • If you want the network configuration to be done, delete the file that disables it:
    • sudo rm /etc/cloud/cloud.cfg.d/subiquity-disable-cloudinit-networking.cfg
  • If you use DHCP, the VM will always retrieve the same IP because it keeps the same machine ID. To avoid this, you must reset the identity of the OS:
    • sudo truncate -s 0 /etc/machine-id
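
The three OS-side steps can be scripted. The sketch below wraps them in a function taking a root prefix, so the same commands can be rehearsed against a throwaway directory tree; on the real VM you would run them with sudo against `/` (an empty prefix):

```shell
#!/bin/sh
# Rehearse the OS-side Cloud-Init preparation against a scratch tree.
prepare_cloudinit() {
  root="$1"
  # Re-enable the OVF datasource (the installer pinned the datasource to None)
  rm -f "$root/etc/cloud/cloud.cfg.d/99-installer.cfg"
  # Re-enable Cloud-Init's network configuration
  rm -f "$root/etc/cloud/cloud.cfg.d/subiquity-disable-cloudinit-networking.cfg"
  # Reset the machine identity so a clone gets a fresh DHCP lease
  truncate -s 0 "$root/etc/machine-id"
}

# Demo on a throwaway tree that mimics the real layout
demo=$(mktemp -d)
mkdir -p "$demo/etc/cloud/cloud.cfg.d"
echo "datasource: None" > "$demo/etc/cloud/cloud.cfg.d/99-installer.cfg"
echo "disabled" > "$demo/etc/cloud/cloud.cfg.d/subiquity-disable-cloudinit-networking.cfg"
echo "old-machine-id" > "$demo/etc/machine-id"

prepare_cloudinit "$demo"
ls "$demo/etc/cloud/cloud.cfg.d"   # prints nothing: both overrides are gone
wc -c < "$demo/etc/machine-id"     # prints 0: identity reset
```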

VM side:

  • Get the contents of your cloud-init user-data file and encode it with base64 (the -w0 option keeps the output on a single line):
    • base64 -w0 user-data.yaml
  • From the vSphere client, while the VM is powered off, enable the vApp options of the VM:
    • Select the VM => Configure => vApp Options => vApp Options are disabled … Edit => check “Enable vApp options”
    • In the OVF details tab => check the “VMware Tools” box
  • From the vSphere client, while the VM is powered off, add the user-data field in the vApp properties:
    • Select the VM => Configure => vApp Options => Properties => ADD, enter the Label and set the Key ID to user-data.
  • From the vSphere client, while the VM is powered off, set the value of the user-data field:
    • Select the VM => Configure => vApp Options => Properties => SET VALUE; a popup appears, set it with the base64 string obtained from the user-data file in the earlier step.
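
The user-data file itself is a `#cloud-config` YAML document. Here is a minimal example (the hostname and user below are placeholders of my own, not values from any particular setup) together with its encoding; note `-w0`, which keeps the base64 output on the single line that the vApp property expects:

```shell
#!/bin/sh
# Write a minimal user-data file (values are illustrative placeholders)
cat > user-data.yaml <<'EOF'
#cloud-config
hostname: demo-vm01
users:
  - name: demo
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
EOF

# Encode it on a single line for the vApp "user-data" property
encoded=$(base64 -w0 user-data.yaml)
echo "$encoded"

# Sanity check: decoding must give back the original file
echo "$encoded" | base64 -d | head -n 1   # prints: #cloud-config
```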

Now, from this VM, you can directly make a clone to create another VM or a template. If you want to change the user-data file when deploying the VM, simply replace the base64 string with the new one.

Which platform for « Shared Nothing Architecture » applications?

Organizations are seeking more agility to accelerate their business growth. Developing applications for internal or external usage can directly or indirectly impact that growth, so it is important to give developers the agility they need to write these applications. That is why public cloud services are attractive: developers can consume services right away by deploying a data service (e.g. a database) and connecting it to their applications. They do not have to worry about the infrastructure and can instead focus only on developing the application. Bringing that flexibility into the datacenter allows organizations to provide agility while maintaining security.

VMware Cloud Foundation with Tanzu (previously vSphere with Kubernetes, or Project Pacific) is a platform capable of hosting applications running in virtual machines and applications running in Kubernetes Pods (containers). It also provides networking, storage, registry, and backup and restore services for those applications. Now, it also incorporates data services.

At the time of writing, two solutions have been added: MinIO and Cloudian, two object storage solutions compatible with the S3 API. Two others are currently being integrated: Dell EMC ObjectScale, an object storage solution compatible with S3, and DataStax, a NoSQL database based on Cassandra. More integrations are to come.


How is it revolutionary?

Unlike the majority of traditional, monolithic applications, modern applications (also called Cloud Native or scalable apps) do not rely on the infrastructure to optimize their performance and provide resiliency. They use their own mechanisms for availability and performance, no matter what infrastructure they are running on. Of course, the infrastructure remains essential, but only to supply resources like processors, memory or I/O. These applications often follow an SNA (Shared Nothing Architecture): each instance of the application uses its own resources on a distinct server, and the application distributes the data between these servers. Reads and writes are distributed for better performance and resilience, while taking into consideration the potential loss of a server or a site.

On a physical infrastructure (without virtualization), it is easy: each instance has its own server and its own resources. However, this creates a financial issue, as the servers are dedicated to that usage. It is not optimal unless all the resources are consumed all the time, which is rarely the case.

On a virtual infrastructure, the resources are shared, so unused resources can be used by other applications. Virtualization also eliminates hardware compatibility issues and brings its other usual benefits. Nevertheless, there is a constraint for SNA applications once the instances are virtualized: we need to ensure that these instances, and the data they generate, are distributed across different virtualized servers to survive a server failure.

VMware Cloud Foundation with Tanzu, coupled with the vSAN Data Persistence platform module (vDPp), is the answer to this problem. Partner software vendors can take advantage of the platform to provide “as a Service” solutions, by developing an operator that automates the installation and configuration of their solution and simplifies keeping it operational.
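
For the consumer, deploying a service instance then comes down to applying a custom resource that the partner's operator reconciles. The sketch below borrows the general shape of a MinIO `Tenant` resource from memory; every field name and value here is illustrative and should be checked against the partner's own CRD documentation:

```yaml
# Hypothetical data-service request handled by a partner operator
# (field names are illustrative - consult the operator's CRD reference).
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: demo-objectstore      # placeholder name
  namespace: data-services    # placeholder namespace
spec:
  pools:
    - servers: 4              # instances spread across virtualization hosts
      volumesPerServer: 2
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi   # backed by a vSAN storage policy
```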


The service is up and running in one click


vDPp is aware of the infrastructure, and the application knows how to get the best performance and availability out of it. The operator thereby distributes the required number of instances across different virtualized servers.

This vSAN storage policy ensures data protection and keeps each application instance and its data on the same virtualization host.


During maintenance operations, the application is informed of the decommissioning of a virtualization server. vDPp also proactively warns the application if the disks start showing signs of failure.

Developers consume these services via APIs and stick to developing their application. They get an on-demand data service that is resilient and performant.


In Conclusion,

The VMware Cloud Foundation with Tanzu platform, coupled with vSAN Data Persistence, provides great agility in keeping data services operational. Thanks to that, developers can focus solely on application development while continuing to use their usual tools. They get a cloud platform like the one that exists in the public cloud.

VMware Cloud Foundation with Tanzu should be seen as a complete platform designed for the development and hosting of traditional and modern applications with integrated on-demand services.