Setting up GPU Hypervisors on OpenStack

     

At Bright, we have our own private cloud based on Bright OpenStack. We use it to run R&D, dev, and QA workloads for our engineering teams. 

One of the ways we use it is to test the GPU integration of our cluster management software. To do this, we need to expose the GPUs to the VMs via PCI-passthrough. That’s the easy part.

The tricky part is making sure that whenever there are GPUs available for passthrough, our cloud has enough CPU and memory resources to provision a GPU-enabled VM. We don’t want non-GPU VMs to consume all the resources and effectively prevent users from being able to spin up VMs with GPUs.

One way to solve this is to have a dedicated hypervisor node and install all the GPU cards we have on it. Problem solved, right? Not quite. First, it’s inefficient. Second, it creates a single point of failure. Third, it’s nearly impossible to manage performance in such an environment. Plus, you need to figure out how to accommodate spikes in demand, and deal with the fact that you can only fit so many GPUs into a single node.

So, the solution we opted for was to install GPU cards in several of our hypervisors, and run a mixture of GPU and non-GPU VMs on them. That solved part of the challenge. What remained was the really interesting part: How to reserve resources for these virtual machines within OpenStack? In other words, how could we configure Nova to use a certain amount of resources for non-GPU VMs, and reserve the rest for the GPU-enabled VMs?

Here’s how we do it:

We are going to operate on a hypervisor called hyper18, which has only one GPU card installed:

[root@hyper18 ~]# lspci | grep -i nvi
86:00.0 3D controller: NVIDIA Corporation 

CONFIGURING GPU PASSTHROUGH

The first step is to configure GPU passthrough on our hypervisor. To do this, we added the following values to our Bright Cluster Manager configuration; Bright Cluster Manager writes out the configuration files for us and gives us an easy interface for managing the values.

Having Bright Cluster Manager do it is a lot easier than going through text files manually, searching/replacing parameters on all of our hypervisors.

The following sequence of commands shows how we did it using Bright Cluster Manager, and it also shows the end result in the configuration files for those of you not using Bright Cluster Manager: 

First, we copied nova.conf to nova-gpu.conf on all of our hypervisors. In a Bright-managed environment, this is done by modifying the software image used by all of our hypervisors:

cd /cm/images/default-image/etc/nova/
cp nova.conf nova-gpu.conf

If you don’t use Bright Cluster Manager, you can copy the nova.conf file to nova-gpu.conf on each of your hypervisors that hosts a GPU card.

The following steps add the required configuration values to nova-gpu.conf:

#cmsh
% configurationoverlay
% use openstackhypervisors
% customizations
% add /etc/nova/nova-gpu.conf
% entries
% add default pci_passthrough_whitelist
% set value "[{\"vendor_id\": \"10de\", \"product_id\":\"1024\"}]"
% add default host gpu
% commit 

Now, on our hypervisors, nova-gpu.conf contains the following in its [DEFAULT] section: 

[DEFAULT]
pci_passthrough_whitelist=[{"vendor_id": "10de", "product_id":"1024"}]
host=gpu

We determined the vendor_id and product_id of our card with the following command:

# lspci -nn | grep -i nvid
86:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40c] [10de:1024] (rev a1) 

As you can see, the string in the last pair of square brackets gives the vendor ID and the product ID (10de:1024).

The next step is to configure our nova-api servers, which are hosted on our OpenStack controllers. We have 3 controller nodes, so we use our Cluster Manager to do this job for us: 

#cmsh
% configurationoverlay
% use openstackcontrollers
% customizations
% add /etc/nova/nova.conf
% entries
% add default pci_alias
% set value "{\"vendor_id\": \"10de\", \"product_id\":\"1024\", \"device_type\":\"type-PCI\", \"name\":\"gpu\"}"
% commit 

The end result in our nova.conf configuration across all of our controller nodes will be:

[DEFAULT]
pci_alias={"vendor_id": "10de", "product_id":"1024", "device_type":"type-PCI", "name":"gpu"} 

See how easy that was? 

RESERVING RESOURCES FOR GPU VMs 

Next, how do we use that new configuration?

Our hypervisor nodes will host a mixture of VMs: some will have a GPU passed through to them, but many will not. We want to make sure that users needing a GPU-equipped VM will always be able to power it on (provided there are GPUs that are not already in use).

Therefore, the next step is to make sure that our hypervisors equipped with GPUs will always have some RAM/CPU resources set aside to accommodate VMs equipped with GPUs.

To do this, we need to decrease the amount of resources available to the default nova-compute service on our hypervisors. We are going to split each GPU-equipped compute node into two resource pools, one for normal workloads and the other for GPU workloads, both on the same hypervisor. This is how we do it:

#cmsh
% configurationoverlay
% use openstackhypervisors
% customizations
% add /etc/nova/nova.conf
% entries
% add default vcpu_pin_set 0-22            # CPU cores the regular nova-compute may use; the rest are left for GPU VMs
% add default reserved_host_memory_mb 4096 # set aside 4 GB of host memory for the GPU VMs
% commit 

The values will be changed in nova.conf on all of our hypervisor nodes, and the resources advertised by the default nova-compute service will shrink by 4 GB of RAM and by the vCPUs excluded from vcpu_pin_set, which is exactly what we want to have set aside for our GPU VMs. This also ensures we don’t waste resources and power.
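For those of you not using Bright Cluster Manager, the end result in nova.conf on each GPU hypervisor amounts to these two lines in the [DEFAULT] section (taken directly from the values above):

[DEFAULT]
vcpu_pin_set=0-22
reserved_host_memory_mb=4096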

What’s next? 

Basically, we are going to create a new nova-compute process on hyper18 that will be able to use the resources we’ve just set aside:

[root@hyper18 ~]# cd /usr/lib/systemd/system/
[root@hyper18 system]# cat openstack-nova-gpu.service
[Unit]
Description=OpenStack Nova Compute Server
After=syslog.target network.target
[Service]
Environment=LIBGUESTFS_ATTACH_METHOD=appliance
Type=notify
NotifyAccess=all
TimeoutStartSec=0
Restart=always
User=nova
ExecStart=/usr/bin/nova-compute --config-file /etc/nova/nova-gpu.conf
[Install]
WantedBy=multi-user.target 

We are going to add this unit file to our software image as well. Then we are going to add it as a managed service in Bright Cluster Manager, so that it is monitored and started when needed:

[bright->device]% use hyper18
[bright->device[hyper18]]% services
[bright->device[hyper18]->services]% add openstack-nova-gpu
[bright->device*[hyper18*]->services*[openstack-nova-gpu*]]% set autostart yes
[bright->device*[hyper18*]->services*[openstack-nova-gpu*]]% set monitored yes
[bright->device*[hyper18*]->services*[openstack-nova-gpu*]]% commit 
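If you are not using Bright Cluster Manager, you can place the unit file on each GPU hypervisor yourself and manage it with plain systemd, roughly like this:

# after copying openstack-nova-gpu.service to /usr/lib/systemd/system/ on hyper18
systemctl daemon-reload
systemctl enable openstack-nova-gpu.service
systemctl start openstack-nova-gpu.service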

After a very short while, this service will start, and we see a new entry like the following in our nova service-list output:

| 136 | nova-compute     | gpu         | default  | enabled | up    | 2016-11-10T15:12:39.000000 | -               |

In your hypervisor list (for example, openstack hypervisor list) you will now see the same hypervisor hostname twice, each entry with a different ID. The second entry is your new GPU compute service. But not so fast: you can’t schedule instances onto that host just yet. Why? Because the service registers under a host name (gpu) for which no Open vSwitch or other L2 agent is registered, so port binding will always fail. How do we solve this? 

Basically, we added a small check to Neutron that changes the binding host name from gpu to hyper18, so that port binding happens on hyper18, the host our gpu compute service actually runs on. Here is how we did it. In our software image, we modified this file:

/usr/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py

In the _bind_port method, we added the following check just before the call to self._update_port_dict_binding():

        if orig_binding.host == "gpu":
            orig_binding.host = "hyper18"

This simple “if” statement changes the binding host from gpu, the host name we configured for this particular nova-compute service, to hyper18, the hypervisor it runs on.

Here is an example of how to handle another GPU compute service, this one hosted on hyper16:

        if orig_binding.host == "gpu2":
            orig_binding.host = "hyper16"

As you can see, we check the host name of the GPU nova-compute service. If it matches gpu2, the host name is changed to that of the hypervisor it runs on, hyper16 in this case. That service is configured in its nova-gpu.conf file with:

host=gpu2 

We could, of course, make this more sophisticated by using oslo.config to introduce extra configuration options, or by reading the mapping from a file, but this use case is simple, so we did not need to add any complex code.
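For example, a configurable version could look roughly like the sketch below. Note that the [gpu_binding] section, the gpu_host_map option, and the remap_binding_host helper are names we made up for illustration; they are not part of Neutron:

# Sketch: read the gpu-host -> hypervisor mapping from configuration
# instead of hard-coding it in plugin.py. All names here are hypothetical.
from oslo_config import cfg

CONF = cfg.CONF
CONF.register_opts(
    [cfg.DictOpt('gpu_host_map',
                 default={},
                 help='Maps a GPU nova-compute host name to the hypervisor '
                      'whose L2 agent should bind its ports, '
                      'e.g. gpu:hyper18,gpu2:hyper16')],
    group='gpu_binding')


def remap_binding_host(host):
    """Return the hypervisor host name to use for port binding."""
    return CONF.gpu_binding.gpu_host_map.get(host, host)

The hard-coded “if” statements in _bind_port would then be replaced with a single call, orig_binding.host = remap_binding_host(orig_binding.host), and the mapping itself would live in a configuration file read by neutron-server:

[gpu_binding]
gpu_host_map = gpu:hyper18,gpu2:hyper16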

The reason we change the host name on the binding is that when Nova asks Neutron to bind a network port, Neutron binds it to an L2 agent with the same host name as the requesting compute service. We do not have a dedicated host for the GPU compute service, and we cannot simply start a second L2 agent on hyper18, so we rewrite the GPU compute’s host name to hyper18, the hypervisor whose resources (and L2 agent) it shares. That way, VIF binding succeeds on the intended host, and the instance can bind to and use its port.

Now we need to test our changes. First of all, let's create a host aggregate:
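In our case, the aggregate was created with the gpu='true' property roughly like this (a sketch of the command, using the name and property visible in the output below):

$ openstack aggregate create --property gpu='true' gpu

The resulting aggregate looks like this: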

$ openstack aggregate show gpu
+-------------------+----------------------------+
| Field             | Value                      |
+-------------------+----------------------------+
| availability_zone | None                       |
| created_at        | 2016-10-30T22:58:30.000000 |
| deleted           | False                      |
| deleted_at        | None                       |
| hosts             | [u'gpu']                   |
| id                | 22                         |
| name              | gpu                        |
| properties        | gpu='true'                 |
| updated_at        | None                       |
+-------------------+----------------------------+ 

Now, let's add our GPU compute host to this aggregate (the first gpu is the aggregate name, the second is the host name our GPU nova-compute service registered with):

$ openstack aggregate add host gpu gpu

Now let's create a GPU flavor:
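The flavor can be created with something along these lines (a sketch; the sizes match the flavor shown below). Note that the gpu='true' property is only matched against the host aggregate if the nova scheduler has the AggregateInstanceExtraSpecsFilter enabled, and the pci_passthrough:alias property relies on the PciPassthroughFilter; we assume both are in your scheduler filter list:

$ openstack flavor create --ram 4096 --disk 40 --vcpus 2 --private \
    --property gpu='true' --property "pci_passthrough:alias=gpu:1" g1.gpu

The result: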

openstack flavor show 3849773f-e50a-4b9d-9c64-66462a1f4130
+----------------------------+-------------------------------------------+
| Field                      | Value                                     |
+----------------------------+-------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                     |
| OS-FLV-EXT-DATA:ephemeral  | 0                                         |
| disk                       | 40                                        |
| id                         | 3849773f-e50a-4b9d-9c64-66462a1f4130      |
| name                       | g1.gpu                                    |
| os-flavor-access:is_public | False                                     |
| properties                 | gpu='true', pci_passthrough:alias='gpu:1' |
| ram                        | 4096                                      |
| rxtx_factor                | 1.0                                       |
| swap                       |                                           |
| vcpus                      | 2                                         |
+----------------------------+-------------------------------------------+ 

Now let's create a virtual instance with that flavor:
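Booting it is a regular server create; something along these lines should do (a sketch using the flavor, image, and network names that appear in the output below; substitute the ID of your own network):

$ openstack server create --flavor g1.gpu --image Centos7_RAW \
    --nic net-id=<ID of ahmed_network> GPU

Once the instance is ACTIVE, we can inspect it: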

[root@krusty]# openstack server show 8db03eb8-666a-4839-ace7-e156fae30937
+--------------------------------------+------------------------------------------+
| Field                                | Value                                    |
+--------------------------------------+------------------------------------------+
| OS-DCF:diskConfig                    | AUTO                                     |
| OS-EXT-AZ:availability_zone          | default                                  |
| OS-EXT-SRV-ATTR:host                 | gpu                                      |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | hyper18.cm.cluster                       |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000ee0                        |
| OS-EXT-STS:power_state               | 1                                        |
| OS-EXT-STS:task_state                | None                                     |
| OS-EXT-STS:vm_state                  | active                                   |
| OS-SRV-USG:launched_at               | 2016-11-03T11:01:56.000000               |
| OS-SRV-USG:terminated_at             | None                                     |
| accessIPv4                           |                                          |
| accessIPv6                           |                                          |
| addresses                            | ahmed_network=192.162.162.3, 10.2.61.52  |
| config_drive                         |                                          |
| created                              | 2016-11-03T11:01:48Z                     |
| flavor                               | g1.gpu (<id>)                            |
| id                                   | 8db03eb8-666a-4839-ace7-e156fae30937     |
| image                                | Centos7_RAW (<id>)                       |
| key_name                             |                                          |
| name                                 | GPU                                      |
| os-extended-volumes:volumes_attached | []                                       |
| progress                             | 0                                        |
| project_id                           | 7a90f423f6d14bf78ba0ad569cd8f77d         |
| properties                           |                                          |
| security_groups                      | [{u'name': u'default'}]                  |
| status                               | ACTIVE                                   |
| updated                              | 2016-11-03T11:01:56Z                     |
| user_id                              | 126b67836fdf475ebd9e36433e4459c7         |
+--------------------------------------+------------------------------------------+

Let’s log into our instance and check if we can see the passed through GPU:

Last login: Thu Nov  3 11:48:02 2016 from 10.2.184.4
[centos@gpu ~]$ sudo -i
[root@gpu ~]# lspci | grep -i nvi
00:05.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1) 

Done :)