NVIDIA GRID troubleshooting on Nvidia-SMI issues

‘Chrome is not responding’. This is the message users often get when loading 10 tabs filled with multimedia content. It is one of the most frequent remarks we receive on SBC/VDI these days. The standard workload of a typical office worker has changed dramatically over the years.

GPU-intensive workloads are no longer limited to the typical “graphical engineering applications” such as AutoCAD or SolidWorks. Browsers, Office and even the Windows OS itself now use graphical resources more than ever.

Around June 2016, Nvidia released the Maxwell-based Tesla M10 GPU. The product itself is rather simple to install and configure. Troubleshooting the installation, on the other hand, turns out to be much trickier. There is a lot of information on the internet about this topic, but it is difficult to separate the wheat from the chaff. To help you with the troubleshooting process, we created a simple procedure. Please note that it is based on our personal experience; other solutions might be available as well.

Step 1: Verify that the hardware is installed and detected

This may seem like an unnecessary step in the troubleshooting process, but I still wanted to include it to provide a full procedure. When the server boots, enter the system BIOS and verify that the Tesla card is present and shows up under ‘Integrated Devices’.

Next, check the BIOS to verify that the card is not being used as the default ESXi console output. If it is, that specific GPU will be locked and pass-through will be disabled. Check that the following BIOS setting is enabled: Integrated Devices -> Embedded Video Controller -> Enabled.

If the feature is enabled but the GRID GPU is still selected as the ESXi console output, the cause is the PCI enumeration order during server boot.

When multiple add-in graphics cards are installed in the system, the first card discovered during PCI device enumeration is automatically selected as the primary video device. The embedded GPU should normally be the first one in the device list. If it is not, the root cause could be a BIOS issue, malfunctioning hardware, and so on. In that case, check whether BIOS/firmware updates are available for the server. Currently, there are no PCI enumeration order options in the BIOS.

As a solution, you could rearrange the cards in the slots to control the device enumeration order. This means physically moving the GPU to another slot in the server. CAUTION: Verify with your vendor that this does not impact the warranty of the server. A support case should be opened with your hardware vendor or Nvidia support.

Step 2: Load the hypervisor (ESXi) and open an ESXi Shell or SSH session
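
Once the shell is open, a quick sanity check is to confirm that the hypervisor itself sees the card on the PCI bus. The snippet below is a minimal sketch and not part of the original procedure: `count_nvidia_devices` is a hypothetical helper name, and it assumes that `lspci` on your host lists the GPU with the vendor string "NVIDIA" somewhere in the device description.

```shell
#!/bin/sh
# Count PCI devices whose description mentions NVIDIA.
# Reads `lspci` output on stdin, so it also works on saved output.
count_nvidia_devices() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 even when the count is 0.
    grep -ci 'nvidia' || true
}

# Usage on the host (uncomment to run):
# lspci | count_nvidia_devices
```

The nvidia-smi output in step 4 lists four GPUs for one Tesla M10 board, so a count of 4 here would be consistent with that. A count of 0 points back to step 1.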

Step 3: Verify that you have the latest version of the NVIDIA VIB for your hypervisor

esxcli software vib list | grep NVIDIA

NVIDIA-kepler-VMware_ESXi_6.5_Host_Driver  367.64-1OEM.650.0.0.4240417          NVIDIA  VMwareAccepted    2017-06-13

Above you can see the output for the latest version available at the time of writing. If your output differs, upgrading or re-installing the VIB (vSphere Installation Bundle) might solve the issue (see step 4, nvidia-smi).
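
To compare the installed version against the package you downloaded, it can help to reduce the listing to just the VIB name and version. The following is a small sketch under the assumption that the NVIDIA VIB name starts with "NVIDIA" and that the version is the second column, as in the output above; `nvidia_vib_versions` is a hypothetical helper name.

```shell
#!/bin/sh
# Print "<name> <version>" for every NVIDIA VIB found in the output of
# `esxcli software vib list` (read from stdin).
nvidia_vib_versions() {
    awk '$1 ~ /^NVIDIA/ { print $1, $2 }'
}

# Usage on the host (uncomment to run):
# esxcli software vib list | nvidia_vib_versions
```

If the function prints nothing, no NVIDIA VIB is installed at all, which would also explain a failing nvidia-smi in the next step.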

Step 4: Verify the nvidia-smi output

nvidia-smi

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

If you get output like this, make sure you are using the correct VIB. The only supported VIB is the one downloaded from the Nvidia Licensing portal.

If the driver is loaded correctly, you get output like the following instead:

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.92                 Driver Version: 367.92                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M10           On   | 0000:06:00.0     Off |                  N/A |
| N/A   30C    P8    10W /  53W |     18MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M10           On   | 0000:07:00.0     Off |                  N/A |
| N/A   30C    P8    10W /  53W |     18MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M10           On   | 0000:08:00.0     Off |                  N/A |
| N/A   25C    P8    10W /  53W |     18MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M10           On   | 0000:09:00.0     Off |                  N/A |
| N/A   27C    P8    10W /  53W |     18MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

This means that the GPU has been detected correctly.
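
The two outcomes of this step can also be told apart in a script, for example when checking several hosts. The sketch below is an illustration only: `summarize_smi` is a hypothetical helper name, and it assumes the table layout shown above, where each GPU row starts with the GPU index and contains a PCI bus ID.

```shell
#!/bin/sh
# Classify nvidia-smi output read from stdin.
# Prints "driver-error" if nvidia-smi could not reach the driver,
# otherwise "gpus=<n>", where <n> is the number of GPU rows in the table.
summarize_smi() {
    awk '
        /NVIDIA-SMI has failed/ { failed = 1 }
        # A GPU row starts with "|", an index, and contains a bus ID such as 06:00.0
        /^\|[ ]+[0-9]+[ ].*[0-9A-Fa-f][0-9A-Fa-f]:[0-9A-Fa-f][0-9A-Fa-f]\.[0-9]/ { gpus++ }
        END {
            if (failed) print "driver-error"
            else        print "gpus=" (gpus + 0)
        }
    '
}

# Usage on the host (uncomment to run):
# nvidia-smi 2>&1 | summarize_smi
```

For the output above this would report four GPUs, which matches a single Tesla M10 board.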

Step 5: Check the Xorg and VMkernel logs for a possible cause

If the nvidia-smi output is correct but the problem persists, we need to pinpoint the possible root cause.

Check the logs for any sign that the Nvidia module or the GPU failed to load.

A healthy log should contain the following:

cat /var/log/vmkernel.log

2017-06-13T13:32:49.377Z cpu9:69412)Loading module nvidia ...
2017-06-13T13:32:49.388Z cpu9:69412)Elf: 2043: module nvidia has license NVIDIA
2017-06-13T13:32:49.531Z cpu9:69412)NVRM: vmk_MemPoolCreate passed for 4194304 pages.
2017-06-13T13:32:49.783Z cpu9:69412)NVRM: loading NVIDIA UNIX x86_64 Kernel Module  367.64  Sat Nov  5 21:57:06 PDT 2016
2017-06-13T13:32:49.783Z cpu9:69412)
2017-06-13T13:32:49.783Z cpu9:69412)Device: 191: Registered driver 'nvidia' from 95
2017-06-13T13:32:49.785Z cpu9:69412)Mod: 4968: Initialization of nvidia succeeded with module ID 95.
2017-06-13T13:32:49.785Z cpu9:69412)nvidia loaded successfully.
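
Because vmkernel.log is long, it can be easier to let a small filter look for the load sequence shown above. This is a rough sketch, not an official check: `nvidia_module_status` is a hypothetical helper name, and the two marker strings are taken from the sample log in this step, so they may differ for other driver versions.

```shell
#!/bin/sh
# Scan vmkernel log text (stdin) for the NVIDIA module load sequence.
# Prints "loaded" when the success marker is present, "failed" when the
# module started loading but never reported success, "absent" otherwise.
nvidia_module_status() {
    awk '
        /Loading module nvidia/      { started = 1 }
        /nvidia loaded successfully/ { ok = 1 }
        END {
            if (ok)           print "loaded"
            else if (started) print "failed"
            else              print "absent"
        }
    '
}

# Usage on the host (uncomment to run):
# nvidia_module_status < /var/log/vmkernel.log
```

Note that this sketch looks at the whole file, so a successful load from an earlier boot would still count as "loaded". For a precise answer, restrict the input to the lines written since the last boot.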

Step 6: Verify the ESXi host graphics settings

The hypervisor (ESXi) also has a specific setting that needs to be enabled to use the GPU profiles correctly.

Otherwise, vGPU-enabled machines may throw the following error when being started:

The amount of graphics resource available in the parent resource pool is insufficient for the operation.

This issue occurs because the default graphics type has not been set correctly.

To solve this, change the default graphics type from ‘Shared’ to ‘Shared Direct’. The setting can be found in the host configuration under Graphics.
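
The same setting can be checked from the shell opened in step 2. The sketch below rests on a few assumptions that are not part of the original procedure: that your ESXi version provides the `esxcli graphics host` namespace, that its `get` output contains a line of the form `Default Graphics Type: <value>`, and that the command-line value corresponding to ‘Shared Direct’ is `SharedPassthru`. `graphics_type_status` is a hypothetical helper name.

```shell
#!/bin/sh
# Read the output of `esxcli graphics host get` on stdin and report whether
# the default graphics type is the one vGPU needs.
graphics_type_status() {
    awk -F': *' '
        /Default Graphics Type/ {
            type = $2
            sub(/[ \t\r]+$/, "", type)   # drop trailing whitespace
        }
        END {
            if (type == "SharedPassthru") print "ok"
            else print "change-needed (" type ")"
        }
    '
}

# Usage on the host (uncomment to run):
# esxcli graphics host get | graphics_type_status
#
# To change the type from the command line instead of the client
# (assumption: these commands exist on your ESXi version):
# esxcli graphics host set --default-type SharedPassthru
# /etc/init.d/xorg restart
```

After changing the setting, the Xorg service typically has to be restarted (or the host rebooted) before vGPU-enabled machines can power on.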


Step 7: Inside the Windows VM

If you are using Microsoft RDS (or a technology on top of that), you need to make sure you activate this GPO setting:

[Screenshot: the GPO setting to enable]

If not, Windows Explorer, Internet Explorer and other browsers won’t be graphically accelerated, even if the browser setting is configured correctly!

Troubleshooting the vGPU within the VM can be done using the same nvidia-smi command. It shows which processes are using the vGPU, as well as the usage of memory and compute resources.

This is the output for an idle Windows Server 2016 VM with one session and no special graphical load:

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.12                 Driver Version: 370.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID M10-8A        WDDM  | 0000:02:02.0     Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    725MiB /  8192MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       836  C+G   Insufficient Permissions                     N/A      |
|    0      1188  C+G   ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0      1980  C+G   Insufficient Permissions                     N/A      |
|    0      1996  C+G   C:\Windows\explorer.exe                      N/A      |
|    0      6104  C+G   Insufficient Permissions                     N/A      |
|    0      6492  C+G   ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
+-----------------------------------------------------------------------------+

You can see that the explorer.exe process and the Windows shell are already using the vGPU.

When launching a few instances of HTML5 Fish bowl (https://testdrive-archive.azurewebsites.net/performance/fishbowl/), you instantly see a rise in memory usage and GPU utilization. This can also be seen at the physical host level using the nvidia-smi command.

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.12                 Driver Version: 370.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID M10-8A        WDDM  | 0000:02:02.0     Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |   1435MiB /  8192MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       836  C+G   Insufficient Permissions                     N/A      |
|    0      1188  C+G   ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0      1980  C+G   Insufficient Permissions                     N/A      |
|    0      1996  C+G   C:\Windows\explorer.exe                      N/A      |
|    0      6104  C+G   Insufficient Permissions                     N/A      |
|    0      6216  C+G   C:\Windows\System32\mstsc.exe                N/A      |
|    0      6412  C+G   ...iles (x86)\Internet Explorer\iexplore.exe N/A      |
|    0      6492  C+G   ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
|    0      6676  C+G   ...x86)\Google\Chrome\Application\chrome.exe N/A      |
+-----------------------------------------------------------------------------+
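
To follow the rise in memory usage and GPU utilization over time, without reading the full table each time, nvidia-smi can print just those counters in CSV form. The sketch below assumes that your driver version supports the `--query-gpu` and `--format` options with the field names shown; `gpu_usage_summary` is a hypothetical helper name, written for the shell on the physical host (inside the Windows VM, the same options can be passed to nvidia-smi.exe directly).

```shell
#!/bin/sh
# Turn lines of "<util>, <used MiB>, <total MiB>" (read from stdin) into a
# one-line summary per GPU, including the percentage of memory in use.
gpu_usage_summary() {
    awk -F', *' '{
        pct = ($3 > 0) ? 100 * $2 / $3 : 0
        printf "util=%d%% mem=%d/%dMiB (%.0f%%)\n", $1, $2, $3, pct
    }'
}

# Usage on the host (uncomment to run):
# nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#            --format=csv,noheader,nounits | gpu_usage_summary
```

With the values from the table above (12% utilization, 1435 MiB of 8192 MiB), the summary would show roughly 18% of the frame buffer in use.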

Step 8: When combined with Citrix XenApp/XenDesktop and (Igel) Linux clients

If you see a rise in vGPU resource usage and you still experience bad visual results (blurry, compressed movies) on the endpoint, it is worth checking the following settings.

In the Citrix (User) policies, switch the option “Use video codec for compression” from “Use when preferred” to “For the entire screen”:

[Screenshot: Citrix policy “Use video codec for compression” set to “For the entire screen”]

This can be verified by running the Remote Display Analyzer tool (https://www.rdanalyzer.com/), which shows the video codec usage:

[Screenshot: Remote Display Analyzer showing the video codec usage]

If you are running the session on (Igel) Linux thin clients, make sure you have licensed and activated the Multimedia Codec Pack, as shown here:

[Screenshot: Multimedia Codec Pack licensed and activated]

If not, the Citrix video codec policy won’t change anything. If you don’t use the video codec for compression, the video quality is far from optimal and looks heavily compressed. If the entire chain is configured correctly, video playback (even in 4K) is very close to a native fat-client experience.

If all else fails!

If this guide did not help you troubleshoot the issue, I would suggest opening a support case on the Nvidia portal. They will provide you with the necessary additional help in troubleshooting your problem.

To wrap things up, we would like to stress once more that this guide is based on our personal experience. We have spent many hours troubleshooting issues and found that these steps give you a quick and easy way to pinpoint the root cause.


Written by Stefan Achten and Jens Herremans – Virtualization Consultants at SecureLink Belgium 
