Adam Gradzki's Personal Website - Softwarehttps://adamgradzki.com/2022-12-19T00:00:00-05:00Visual Studio Code fix for Python 3.11 debugging2022-12-19T00:00:00-05:002022-12-19T00:00:00-05:00Adam Gradzkitag:adamgradzki.com,2022-12-19:/visual-studio-code-fix-for-python-311-debugging.html<p>Quick fix for a problem Microsoft isn't fixing in a timely manner.</p><p>Symptom:</p>
<div class="highlight"><pre><span></span><code>PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
</code></pre></div>
<p>Solution:</p>
<div class="highlight"><pre><span></span><code>git clone --depth <span class="m">1</span> https://github.com/microsoft/debugpy.git ~/.debugpy
rm -rf ~/.vscode/extensions/ms-python.python-2022.20.1/pythonFiles/lib/python/debugpy/
mv -v ~/.debugpy/src/debugpy/ ~/.vscode/extensions/ms-python.python-2022.20.1/pythonFiles/lib/python
rm -rf ~/.debugpy
</code></pre></div>CPU specific optimized Python on AWS2022-11-30T00:00:00-05:002022-11-30T00:00:00-05:00Adam Gradzkitag:adamgradzki.com,2022-11-30:/cpu-specific-optimized-python-on-aws.html<p>Build an optimized Python on AWS Graviton ARM for better performance and cloud cost savings</p><p>Recently we needed to host our software services on AWS more efficiently. As of publication this is most readily achievable for general purpose Python code with <a href="https://aws.amazon.com/ec2/graviton/">AWS Graviton instances</a>.</p>
<p>Starting from a Debian 11 ARM EC2 t4g instance, the following commands build Python from source, optimized for the CPU architecture running on AWS. Note that LTO is enabled in the build command and uses a large amount of RAM, so make sure you have enough RAM or use a swapfile, as I demonstrate below, since I chose a t4g.small instance.</p>
<div class="highlight"><pre><span></span><code>sudo apt update -y
sudo apt upgrade -y
sudo apt install -y <span class="se">\</span>
git <span class="se">\</span>
build-essential <span class="se">\</span>
gdb <span class="se">\</span>
lcov <span class="se">\</span>
libbz2-dev <span class="se">\</span>
libffi-dev <span class="se">\</span>
libgdbm-dev <span class="se">\</span>
liblzma-dev <span class="se">\</span>
libncurses5-dev <span class="se">\</span>
libreadline6-dev <span class="se">\</span>
libsqlite3-dev <span class="se">\</span>
libssl-dev <span class="se">\</span>
lzma <span class="se">\</span>
lzma-dev <span class="se">\</span>
tk-dev <span class="se">\</span>
uuid-dev <span class="se">\</span>
libxml2-dev <span class="se">\</span>
libxml2 <span class="se">\</span>
libxslt1-dev <span class="se">\</span>
libxslt1.1 <span class="se">\</span>
xvfb <span class="se">\</span>
zlib1g-dev
sudo apt build-dep python3 -y
curl https://pyenv.run <span class="p">|</span> bash
<span class="c1"># If you have less than 8 GB RAM, create a swapfile and enable it</span>
<span class="c1"># Don't put the swapfile on EBS.</span>
<span class="c1"># It should be on your instance root volume for performance.</span>
sudo fallocate -l 8G /swapfile
sudo chmod <span class="m">600</span> /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
<span class="c1">################################################</span>
<span class="c1"># Append the following to the end of ~/.bashrc #</span>
<span class="c1">################################################</span>
<span class="nb">export</span> <span class="nv">PYENV_ROOT</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/.pyenv"</span>
<span class="nb">command</span> -v pyenv >/dev/null <span class="o">||</span> <span class="nb">export</span> <span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$PYENV_ROOT</span><span class="s2">/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>
<span class="nb">eval</span> <span class="s2">"</span><span class="k">$(</span>pyenv init -<span class="k">)</span><span class="s2">"</span>
<span class="c1"># Restart your shell for the changes to take effect.</span>
<span class="c1"># Load pyenv-virtualenv automatically by adding</span>
<span class="c1"># the following to ~/.bashrc:</span>
<span class="nb">eval</span> <span class="s2">"</span><span class="k">$(</span>pyenv virtualenv-init -<span class="k">)</span><span class="s2">"</span>
<span class="c1">################################################</span>
<span class="c1"># Restart your shell to continue #</span>
<span class="c1">################################################</span>
<span class="nv">CFLAGS</span><span class="o">=</span><span class="s2">"-march=native -mtune=native"</span> <span class="nv">CONFIGURE_OPTS</span><span class="o">=</span><span class="s2">"--enable-optimizations --with-lto=full"</span> pyenv install <span class="m">3</span>.11.0 --verbose
pyenv virtualenv <span class="m">3</span>.11.0 MY_VIRTUAL_ENVIRONMENT_NAME
</code></pre></div>Golang for loops and the range-for2020-06-17T00:00:00-04:002020-06-17T00:00:00-04:00Adam Gradzkitag:adamgradzki.com,2020-06-17:/golang-for-loops-and-the-range-for.html<p>The for loop you choose in golang may have performance implications</p><p>I was unable to find a clear answer online about the performance implications of using a standard for loop (e.g., for i := 0; i < len(arr); i++ {}) vs a range-for (e.g., for idx, val := range arr {}) in Golang. By analyzing the assembler output I determined that as of Go 1.14 they produce similar CPU instructions for common 64-bit Intel/AMD CPUs.</p>
<p>The range-for loop produces two additional assembly instructions:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="nf">movq</span><span class="w"> </span><span class="s">""</span><span class="nv">.arr</span><span class="o">+</span><span class="mi">8</span><span class="p">(</span><span class="nb">SP</span><span class="p">),</span><span class="w"> </span><span class="nb">AX</span><span class="w"></span>
<span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="w"></span>
</code></pre></div>
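<p>This brings the total instruction count for function count2 to 13 compared to 11 for function count1, labels excluded. Note that PCDATA is an assembler pseudo-instruction that carries metadata for the Go runtime rather than an executable CPU instruction.</p>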
<p>I used https://go.godbolt.org/ at the suggestion of dominikh on Freenode IRC #golang to map the Golang functions to the corresponding assembler output.</p>
<p>My code for an apples-to-apples comparison between the two:</p>
<div class="highlight"><pre><span></span><code><span class="kn">package</span><span class="w"> </span><span class="nx">main</span><span class="w"></span>
<span class="kd">func</span><span class="w"> </span><span class="nx">count1</span><span class="p">(</span><span class="nx">arr</span><span class="w"> </span><span class="p">[]</span><span class="kt">int</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nx">i</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="nx">i</span><span class="w"> </span><span class="p"><</span><span class="w"> </span><span class="nb">len</span><span class="p">(</span><span class="nx">arr</span><span class="p">);</span><span class="w"> </span><span class="nx">i</span><span class="o">++</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">_</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">i</span><span class="p">;</span><span class="w"> </span><span class="nx">_</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">arr</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span>
<span class="p">}</span><span class="w"></span>
<span class="kd">func</span><span class="w"> </span><span class="nx">count2</span><span class="p">(</span><span class="nx">arr</span><span class="w"> </span><span class="p">[]</span><span class="kt">int</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nx">i</span><span class="p">,</span><span class="w"> </span><span class="nx">v</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="k">range</span><span class="w"> </span><span class="nx">arr</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">_</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">i</span><span class="p">;</span><span class="w"> </span><span class="nx">_</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">v</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
<span class="kd">func</span><span class="w"> </span><span class="nx">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">arr</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nb">make</span><span class="p">([]</span><span class="kt">int</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nx">i</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="nx">i</span><span class="w"> </span><span class="p"><</span><span class="w"> </span><span class="nb">len</span><span class="p">(</span><span class="nx">arr</span><span class="p">);</span><span class="w"> </span><span class="nx">i</span><span class="o">++</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">arr</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">i</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="nx">count1</span><span class="p">(</span><span class="nx">arr</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">count2</span><span class="p">(</span><span class="nx">arr</span><span class="p">)</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>count1() produces:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="w"></span>
<span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="nf">movq</span><span class="w"> </span><span class="s">""</span><span class="nv">.arr</span><span class="o">+</span><span class="mi">16</span><span class="p">(</span><span class="nb">SP</span><span class="p">),</span><span class="w"> </span><span class="nb">AX</span><span class="w"></span>
<span class="w"> </span><span class="nf">xorl</span><span class="w"> </span><span class="nb">CX</span><span class="p">,</span><span class="w"> </span><span class="nb">CX</span><span class="w"></span>
<span class="w"> </span><span class="nf">jmp</span><span class="w"> </span><span class="nv">count1_pc12</span><span class="w"></span>
<span class="nl">count1_pc9:</span><span class="w"></span>
<span class="w"> </span><span class="nf">incq</span><span class="w"> </span><span class="nb">CX</span><span class="w"></span>
<span class="nl">count1_pc12:</span><span class="w"></span>
<span class="w"> </span><span class="nf">cmpq</span><span class="w"> </span><span class="nb">CX</span><span class="p">,</span><span class="w"> </span><span class="nb">AX</span><span class="w"></span>
<span class="w"> </span><span class="nf">jlt</span><span class="w"> </span><span class="nv">count1_pc9</span><span class="w"></span>
<span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="o">-</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="o">-</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="nf">ret</span><span class="w"></span>
</code></pre></div>
<p>count2() produces:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="nf">movq</span><span class="w"> </span><span class="s">""</span><span class="nv">.arr</span><span class="o">+</span><span class="mi">8</span><span class="p">(</span><span class="nb">SP</span><span class="p">),</span><span class="w"> </span><span class="nb">AX</span><span class="w"></span>
<span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="w"></span>
<span class="w"> </span><span class="nf">movq</span><span class="w"> </span><span class="mi">8</span><span class="p">(</span><span class="nb">AX</span><span class="p">),</span><span class="w"> </span><span class="nb">AX</span><span class="w"></span>
<span class="w"> </span><span class="nf">xorl</span><span class="w"> </span><span class="nb">CX</span><span class="p">,</span><span class="w"> </span><span class="nb">CX</span><span class="w"></span>
<span class="w"> </span><span class="nf">jmp</span><span class="w"> </span><span class="nv">count2_pc16</span><span class="w"></span>
<span class="nl">count2_pc13:</span><span class="w"></span>
<span class="w"> </span><span class="nf">incq</span><span class="w"> </span><span class="nb">CX</span><span class="w"></span>
<span class="nl">count2_pc16:</span><span class="w"></span>
<span class="w"> </span><span class="nf">cmpq</span><span class="w"> </span><span class="nb">CX</span><span class="p">,</span><span class="w"> </span><span class="nb">AX</span><span class="w"></span>
<span class="w"> </span><span class="nf">jlt</span><span class="w"> </span><span class="nv">count2_pc13</span><span class="w"></span>
<span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="o">-</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="nf">pcdata</span><span class="w"> </span><span class="kc">$</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="kc">$</span><span class="o">-</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="nf">ret</span><span class="w"></span>
</code></pre></div>
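<p>Instruction counts alone do not settle the question. The following is a minimal sketch of how the two loop styles could be timed with <code>testing.Benchmark</code> from the Go standard library. The names <code>sumIndexed</code>, <code>sumRange</code>, <code>bench</code>, and <code>sink</code> are hypothetical and not part of the original comparison; the loops sum the slice (an assumption made here) so that the compiler cannot discard the loop body as it may with the empty assignments in count1 and count2.</p>

```go
package main

import (
	"fmt"
	"testing"
)

// sumIndexed walks the slice with a classic three-clause for loop.
func sumIndexed(arr []int) int {
	total := 0
	for i := 0; i < len(arr); i++ {
		total += arr[i]
	}
	return total
}

// sumRange walks the slice with a range-for loop.
func sumRange(arr []int) int {
	total := 0
	for _, v := range arr {
		total += v
	}
	return total
}

// sink receives every result so the calls are not optimized away.
var sink int

// bench wraps a function under test in a standard testing.B loop.
func bench(f func([]int) int, arr []int) func(*testing.B) {
	return func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			sink = f(arr)
		}
	}
}

func main() {
	arr := make([]int, 100)
	for i := range arr {
		arr[i] = i
	}
	// testing.Benchmark picks b.N automatically and reports ns/op.
	fmt.Println("indexed:", testing.Benchmark(bench(sumIndexed, arr)))
	fmt.Println("range:  ", testing.Benchmark(bench(sumRange, arr)))
}
```

<p>Results will vary with the Go version, CPU, and slice length, so any numbers from a sketch like this are only a starting point.</p>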
<p>count2 produces more assembler instructions, which <em>may</em> suggest it is slower code. However, more instructions do not always mean slower execution. To understand the real performance implications, benchmarks are needed; those are left for a future article.</p>Faster Virtual Machines on Linux Hosts with GPU Acceleration2020-04-06T00:00:00-04:002020-04-08T00:00:00-04:00Adam Gradzkitag:adamgradzki.com,2020-04-06:/faster-virtual-machines-on-linux-hosts-with-gpu-acceleration.html<p>Virtual machines don't all draw to screen efficiently, but some new approaches are getting there.</p><div class="toc"><span class="toctitle">Table of Contents</span><ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#gpu-command-architectures">GPU command architectures</a><ul>
<li><a href="#vga-emulation-ve">VGA Emulation (VE)</a></li>
<li><a href="#api-forwarding-af">API forwarding (AF)</a></li>
<li><a href="#direct-pass-through-dpt">Direct Pass-Through (DPT)</a></li>
<li><a href="#full-gpu-virtualization-fgv">Full GPU Virtualization (FGV)</a></li>
</ul>
</li>
<li><a href="#hypervisors">Hypervisors</a><ul>
<li><a href="#oracle-virtualbox">Oracle VirtualBox</a></li>
<li><a href="#vmware-vsphereesxi">VMWare vSphere/ESXi</a></li>
<li><a href="#vmware-workstation">VMWare Workstation</a></li>
<li><a href="#qemu">QEMU</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
</div>
<h1 id="overview">Overview</h1>
<p>Open source virtualization technologies widely available in the Linux software ecosystem lack the easy-to-use graphics performance enhancements found in commercial virtualization products such as <a href="https://www.vmware.com/products/workstation-pro.html">VMWare Workstation</a> or <a href="https://www.vmware.com/products/esxi-and-esx.html">VMWare vSphere/ESXi</a>. <a href="https://projectacrn.github.io/latest/developer-guides/hld/hld-APL_GVT-g.html">Intel GVT-g</a> is a virtual graphics acceleration technology which can be accessed with the <a href="https://www.qemu.org/">QEMU</a> virtualization system, an open-source alternative to those commercial products. <a href="https://projectacrn.github.io/latest/developer-guides/hld/hld-APL_GVT-g.html">Intel GVT-g</a> was configured on a <a href="https://wiki.archlinux.org/index.php/Lenovo_ThinkPad_X1_Carbon_(Gen_6)">Thinkpad X1 Generation 6 laptop</a> with Intel integrated graphics, resulting in successful GPU acceleration on a UEFI Windows 10 64-bit guest without relying on proprietary software aside from the guest operating system itself. Working Intel GVT-g GPU acceleration makes substantially improved virtualization performance possible on Linux hosts.</p>
<h1 id="introduction">Introduction</h1>
<p>Computer users rely on software written for many different operating systems. Virtual machines allow computer users to simultaneously run different operating systems and switch between them easily. <a href="https://www.vmware.com/solutions/virtualization.html">Virtualization has benefits</a> such as migrating installed systems to other physical machines with lower downtime, containing <a href="https://rosettacode.org/wiki/Untrusted_environment">untrusted code</a> in a sandbox that is difficult to escape from, maintaining legacy systems that are difficult to keep running on obsolete hardware, or simply running a Windows-only program on a Linux <a href="https://pediaa.com/what-is-the-difference-between-host-and-guest-operating-system/">host</a>.</p>
<p>Virtual machines with graphical user interfaces typically suffer from input lag and stuttering, both of which degrade the user experience. Additionally, computation-heavy software such as photo editing or engineering applications depends on efficient GPU access, which can speed up calculations by an order of magnitude or more over the host machine CPU and let them finish in a reasonable time. Unfortunately, not all virtualization solutions are able to use the physical chips on the host machine efficiently, regardless of cost.</p>
<h1 id="gpu-command-architectures">GPU command architectures</h1>
<ol>
<li>VGA Emulation (VE)<ul>
<li>Universally available on all virtualization platforms</li>
</ul>
</li>
<li>API forwarding (AF)<ul>
<li>Intel GVT-s</li>
<li>VMWare Virtual Shared Graphics Acceleration (vSGA)</li>
<li>Oracle VirtualBox 3D Acceleration</li>
</ul>
</li>
<li>Direct Pass-Through (DPT)<ul>
<li>Intel GVT-d</li>
<li>VMWare Virtual Dedicated Graphics Acceleration (vDGA)</li>
<li>Not available in VMWare Workstation</li>
</ul>
</li>
<li>Full GPU Virtualization (FGV)<ul>
<li>Intel GVT-g</li>
<li>VMWare Virtual Shared Pass-Through Graphics Acceleration (vGPU or MxGPU)</li>
<li>Not available in VMWare Workstation</li>
</ul>
</li>
</ol>
<h2 id="vga-emulation-ve">VGA Emulation (VE)</h2>
<p>The most primitive graphics display for any virtual machine is VGA Emulation (VE). This mode is also the least efficient. <a href="https://github.com/qemu/qemu/blob/master/hw/display/cirrus_vga.c">QEMU emulates a Cirrus Logic GD5446 video card.</a> All Windows versions starting from Windows 95 should recognize and use this graphics card.</p>
<p>Most hypervisors which advertise some form of <a href="https://developer.ibm.com/technologies/linux/tutorials/l-pci-passthrough/">"hardware acceleration"</a> use API Forwarding (AF), which is a high performance proxy service that requires specialized drivers on both the host and guest to create a high performance instruction pipeline.</p>
<h2 id="api-forwarding-af">API forwarding (AF)</h2>
<p>API Forwarding (AF) works by:</p>
<ol>
<li>intercepting the GPU command requested by a piece of software</li>
<li>proxying the GPU command to the host hypervisor</li>
<li>executing the captured GPU command on the host from the hypervisor</li>
<li>bubbling the response back up to the virtual machine</li>
</ol>
<p>This mode is very useful when many virtual machines are competing for the resources of a single GPU and Full GPU Virtualization (FGV) is not possible. The hypervisor queues graphics card operations from one or more virtual machines and schedules virtual execution and memory slots for each virtual machine on a single physical GPU resource. Each virtual machine sees its own graphics card while the hypervisor splits the single physical resource up. A key drawback of AF is that usually only OpenGL and DirectX interfaces are supported by the GPU instruction proxy.</p>
<p>The process by which API Forwarding (AF) works is known as <a href="https://www.unf.edu/~sahuja/cloudcourse/Fullandparavirtualization.pdf">paravirtualization</a>.</p>
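<p>The four steps above can be sketched as a toy model in Go. This is only an illustration under stated assumptions: <code>Command</code>, <code>Request</code>, <code>hostExecute</code>, <code>hypervisor</code>, and <code>guestCall</code> are hypothetical names, a Go channel stands in for the communication path between guest and host, and no real hypervisor or graphics API is involved.</p>

```go
package main

import "fmt"

// Command is a hypothetical GPU call captured inside a guest.
type Command struct {
	VM   string // which virtual machine issued the call
	Call string // name of the graphics API entry point
}

// Request pairs a command with a channel for the reply.
type Request struct {
	Cmd   Command
	Reply chan string
}

// hostExecute stands in for the host's real graphics API (step 3).
func hostExecute(c Command) string {
	return fmt.Sprintf("host ran %s for %s", c.Call, c.VM)
}

// hypervisor drains the shared queue one command at a time, so
// several guests take turns on the single physical GPU, and sends
// each result back to the guest that asked for it (step 4).
func hypervisor(queue <-chan Request) {
	for req := range queue {
		req.Reply <- hostExecute(req.Cmd)
	}
}

// guestCall plays the role of the guest driver: it intercepts the
// call (step 1) and proxies it to the hypervisor (step 2).
func guestCall(queue chan<- Request, vm, call string) string {
	reply := make(chan string, 1)
	queue <- Request{Cmd: Command{VM: vm, Call: call}, Reply: reply}
	return <-reply
}

func main() {
	queue := make(chan Request)
	go hypervisor(queue)

	fmt.Println(guestCall(queue, "vm1", "glDrawArrays"))
	fmt.Println(guestCall(queue, "vm2", "glClear"))
	close(queue)
}
```

<p>The single queue is the design point worth noticing: each guest believes it has its own graphics card, while every command is serialized through one proxy on the host.</p>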
<h2 id="direct-pass-through-dpt">Direct Pass-Through (DPT)</h2>
<p>Direct Pass-Through (DPT) is a system which exposes the GPU as a PCI device which is directly addressable by the virtual machine. Nothing besides the virtual machine can reference any resources on the GPU and it cannot be shared with the physical machine or any other virtual machines. Many devices have only one graphics card installed, and using this system on them would make the host's graphical user interface inoperable. This method is most useful when:</p>
<ul>
<li>the physical graphics card does not support Full GPU Virtualization (FGV)</li>
<li>two or more graphics cards are attached to a system</li>
<li>paravirtualized drivers are not available or do not work with the installed physical GPU, host hypervisor, or guest operating system</li>
</ul>
<h2 id="full-gpu-virtualization-fgv">Full GPU Virtualization (FGV)</h2>
<p>Sharing a GPU natively among multiple virtual machines is possible with Full GPU Virtualization (FGV) solutions such as Intel GVT-g. This process is also known as Hardware Assisted Virtualization (HVM), not to be confused with Paravirtualization (PV). In this mode the IOMMU hardware exposes a GPU memory interface to each virtual machine while it internally handles the memory address mappings between what it exposes to virtual machines and the actual physical memory on the GPU.</p>
<p>In "IOMMU and Virtualization," <a href="https://www.linkedin.com/in/susanta">Susanta Nanda</a> <a href="https://vmmworld.blogspot.com/2006/05/iommu-and-virtualization.html">writes</a>:</p>
<blockquote>
<p>IOMMU provides two main functionalities: virtual-to-physical address translation and access protection on the memory ranges that an I/O device is trying to operate on. </p>
</blockquote>
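<p>The two functions named in the quote, address translation and access protection, can be illustrated with a small Go model. This is a sketch under stated assumptions, not how real IOMMU hardware is programmed: the <code>IOMMU</code> type, the <code>mapping</code> entries, the device names, and the 4096-byte page size are all hypothetical.</p>

```go
package main

import (
	"errors"
	"fmt"
)

const pageSize = 4096 // assumed page size for this sketch

// mapping is one page-table entry granted to a single device.
type mapping struct {
	physPage uint64
	writable bool
}

// IOMMU holds a per-device table from virtual pages to physical pages.
type IOMMU struct {
	tables map[string]map[uint64]mapping
}

var (
	ErrUnmapped = errors.New("address not mapped for this device")
	ErrReadOnly = errors.New("write to read-only mapping")
)

// Translate converts a device-visible address to a physical address
// and rejects any access outside what the device was granted.
func (m *IOMMU) Translate(dev string, addr uint64, write bool) (uint64, error) {
	entry, ok := m.tables[dev][addr/pageSize]
	if !ok {
		return 0, ErrUnmapped // access protection: no mapping at all
	}
	if write && !entry.writable {
		return 0, ErrReadOnly // access protection: wrong permission
	}
	return entry.physPage*pageSize + addr%pageSize, nil
}

func main() {
	mmu := &IOMMU{tables: map[string]map[uint64]mapping{
		"vm1-gpu": {0: {physPage: 42, writable: true}},
		"vm2-gpu": {0: {physPage: 77, writable: false}},
	}}
	// Both guests use the same virtual address but land on different
	// physical pages; the second write and the last read are rejected.
	fmt.Println(mmu.Translate("vm1-gpu", 0x10, true))
	fmt.Println(mmu.Translate("vm2-gpu", 0x10, true))
	fmt.Println(mmu.Translate("vm1-gpu", 0x2000, false))
}
```

<p>Each virtual machine only ever sees its own table, which is what lets several guests share one physical GPU without being able to reach each other's memory.</p>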
<p>According to Intel engineer Zhenyu Wang in the <a href="https://www.x.org/wiki/Events/XDC2017/wang_gvt.pdf">XDC2017 presentation "Full GPU virtualization in mediated pass-through way"</a>, peak media workload performance with one virtual machine running is 95% of the native host alone, and average performance on media workloads is 85% of the native host alone.</p>
<h1 id="hypervisors">Hypervisors</h1>
<p>Hypervisors are software systems that enable multiple virtual machines to run simultaneously on a single physical machine. Linux users have many <a href="https://en.wikipedia.org/wiki/Hypervisor">hypervisor</a> options. <a href="https://opensourceforu.com/2019/04/the-top-open-source-hypervisor-technologies/">A longer list is available here</a>. These are a sample of some Linux hypervisors:</p>
<ol>
<li><a href="https://www.virtualbox.org/">Oracle VirtualBox</a></li>
<li><a href="https://www.vmware.com/products/esxi-and-esx.html">VMWare vSphere/ESXi</a> (technically runs beneath Linux)</li>
<li><a href="https://www.vmware.com/products/workstation-pro.html">VMWare Workstation</a></li>
<li><a href="https://www.qemu.org/">QEMU</a></li>
</ol>
<h2 id="oracle-virtualbox">Oracle VirtualBox</h2>
<p>Possibly the most widely-used hypervisor is <a href="https://www.virtualbox.org/">VirtualBox by Oracle</a>. VirtualBox is open source software with the exception of the optional extension pack.</p>
<p>The Oracle VirtualBox extension pack provides many features which are not available in the free version. The PCI passthrough module was shipped as an Oracle VM VirtualBox extension package until the <a href="https://github.com/mdaniel/virtualbox-org-svn-vbox-trunk/commit/5178e479b2ac1e33454f203854de9fe8f85a9196">feature was scrapped</a>.</p>
<p>These features are available gratis for personal and non-commercial use only.</p>
<p>Possession of the VirtualBox Extension Pack without a license <a href="https://www.reddit.com/r/sysadmin/comments/8ffcg3/oracle_is_looking_under_the_couch_cushions_for/">can be problematic</a>:</p>
<blockquote>
<p>Got an email today informing me (Urgent Virtual Box Licensing Information for Company X) that there have been TWELVE (12!) downloads of the VirtualBox Extension Pack at my employer in the past year. And since the extensions are licensed differently than the base product, they'd love for us to call them and talk about how much money we owe them. Their report attached to email listed source IPs and AS number, as well as date/product/version. Out of the twelve (12!), there were always two on the same day of the same version, so really six (6!) downloads. We'll probably end up giving them $150, and I'll make sure they never get any business from places I work, because fuck Oracle. I wouldn't piss on Larry Ellison if he was on fire.</p>
</blockquote>
<p>VirtualBox Linux hosts do not support GPU DPT (Direct Pass-Through) at all. All of the preliminary PCI pass-through work for Linux hosts which is needed for GPU DPT was <a href="https://github.com/mdaniel/virtualbox-org-svn-vbox-trunk/commit/5178e479b2ac1e33454f203854de9fe8f85a9196">completely stripped out on December 5th, 2019</a> with this message:</p>
<blockquote>
<p>Linux host: Drop PCI passthrough, the current code is too incomplete (cannot handle PCIe devices at all), i.e. not useful enough</p>
</blockquote>
<p>VirtualBox 2D and 3D acceleration both work <a href="https://docs.oracle.com/en/virtualization/virtualbox/6.1/user/guestadd-video.html">according to the same principle</a>, API forwarding (AF):</p>
<blockquote>
<p>Oracle VM VirtualBox implements 3D acceleration by installing an additional hardware 3D driver inside the guest when the Guest Additions are installed. This driver acts as a hardware 3D driver and reports to the guest operating system that the virtual hardware is capable of 3D hardware acceleration. When an application in the guest then requests hardware acceleration through the OpenGL or Direct3D programming interfaces, these are sent to the host through a special communication tunnel implemented by Oracle VM VirtualBox. The host then performs the requested 3D operation using the host's programming interfaces.</p>
</blockquote>
<h2 id="vmware-vsphereesxi">VMWare vSphere/ESXi</h2>
<p>VMWare vSphere/ESXi is a bare metal hypervisor which runs beneath any end-user operating systems. This property makes it a <a href="https://www.ibm.com/cloud/learn/hypervisors">Type 1 Hypervisor</a>. It supports all GPU acceleration technologies.</p>
<ul>
<li>Virtual Shared Graphics Acceleration (vSGA) is a form of API forwarding (AF).</li>
<li>Virtual Dedicated Graphics Acceleration (vDGA) technology is a form of Direct Pass-Through (DPT).</li>
<li>VMWare Virtual Shared Pass-Through Graphics Acceleration (vGPU or MxGPU) is a form of Full GPU Virtualization (FGV).</li>
</ul>
<p>The main limitation of VMWare vSphere/ESXi GPU acceleration is the graphics card selection. GPU passthrough is possible only with a small set of GPUs because NVIDIA drivers disable consumer-market GPUs such as the GeForce series when the drivers detect that they are running in a virtual environment. <a href="https://www.vmware.com/resources/compatibility/search.php?deviceCategory=sptg&details=1">Comprehensive list of all supported graphics cards for any hardware acceleration purposes.</a></p>
<h2 id="vmware-workstation">VMWare Workstation</h2>
<p>VMWare Workstation only provides Virtual Shared Graphics Acceleration (vSGA), a form of API forwarding (AF). In this regard, the GPU acceleration story is identical to Oracle VirtualBox.</p>
<h2 id="qemu">QEMU</h2>
<p><a href="https://www.qemu.org/">QEMU</a> is an open source virtual machine platform that is also capable of translating instructions between wholly unrelated computer architectures. It is widely available in most Linux distributions and is used extensively in industry.</p>
<p>As of the publication date of this article, after enabling GVT-g in QEMU you must also recompile QEMU with the 60 fps fix to get smooth video.</p>
<h1 id="conclusion">Conclusion</h1>
<p>QEMU eclipses VirtualBox in features and exceeds VMWare capabilities. VirtualBox is limited to API forwarding (AF)
since it cannot let virtual machines address graphics hardware directly in any way. VMWare solutions
support all types of GPU addressing, but most graphics cards made by NVIDIA disable themselves when they detect being
called in Direct Pass-Through (DPT) or Full GPU Virtualization (FGV) modes. QEMU offers more hardware
flexibility than VMWare and brings near-native graphics performance to guest operating systems
such as Windows 10 with truly minimal driver support required in the guest operating system. I recommend using QEMU
on Linux when high graphics performance and low operational costs are priorities for deploying a virtual machine
environment.</p>
<!---
References (TODO)
https://software.intel.com/en-us/blogs/2009/06/25/understanding-vt-d-intel-virtualization-technology-for-directed-io
http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
https://wiki.qemu.org/Features/VT-d
http://neg-serg.github.io/2017/06/pci-pass/
https://wiki.gentoo.org/wiki/QEMU
https://www.reddit.com/r/VFIO/comments/fn20eu/doom_eternal_cpu_host_needed/
https://blog.wikichoon.com/2014/07/enabling-hyper-v-enlightenments-with-kvm.html
https://blog.bepbep.co/posts/gvt/
https://nixos.wiki/wiki/IGVT-g
https://null-src.com/posts/qemu-vfio-pci/post.php
https://github.com/intel/gvt-linux/wiki/Dma_Buf_User_Guide#b-gtk
https://gist.github.com/mcastelino/7ab9dba51b0dbb230bd18c448d935312
https://bugzilla.redhat.com/show_bug.cgi?id=1337510
https://github.com/cardi/qemu-windows-10
https://github.com/cardi/qemu-windows-10/blob/master/start.sh
https://dennisnotes.com/note/20180614-ubuntu-18.04-qemu-setup/
https://wiki.archlinux.org/index.php/Intel_GVT-g#Using_DMA-BUF_display
--->Thinkpad X1 Carbon GPU Undervolting in Arch Linux2020-04-03T00:00:00-04:002022-11-27T00:00:00-05:00Adam Gradzkitag:adamgradzki.com,2020-04-03:/thinkpad-x1-carbon-gpu-undervolting-in-arch-linux.html<p>Decrease your laptop CPU voltage to increase battery life</p><p>After successfully installing <a href="https://www.archlinux.org/about/">Arch Linux</a> on my <a href="https://www.notebookcheck.net/Lenovo-ThinkPad-X1-Carbon-2018-20KHCTO1WW.319032.0.html">Thinkpad X1 Carbon Generation 6</a> I began experimenting with laptop component <a href="https://www.techopedia.com/definition/26921/undervolting">undervolting</a> to reduce heat and improve performance. Unfortunately it seems that the set points are being ignored to some extent.</p>
<p>Indeed, <a href="https://github.com/erpalma/throttled/issues/185#issuecomment-608279702">Francesco Palmarini reported on GitHub</a>:</p>
<blockquote>
<p>It is kind of impossible that your GPU can work at -1V offset voltage. On XTU I get an instant reboot on -500mV</p>
</blockquote>
<p>However, despite the apparent physical impossibility of applying such a large voltage offset, there seems to be a real effect on performance. Applying a -500 mV voltage offset resulted in approximately 2 watts lower peak power consumption as measured by <a href="https://github.com/erpalma/throttled">throttled monitoring</a>.</p>
<p>I benchmarked GPU performance at different voltage offsets to see whether the offsets are ignored beyond a certain point. The results show diminishing returns beyond a 300 mV undervolt. Still, it is not clear whether the larger offsets are actually being applied, because the CPU should not be able to operate when undervolted by 900 mV.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># config.toml</span><span class="w"></span>
<span class="k">[voltage]</span><span class="w"></span>
<span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="w"></span>
<span class="n">stop</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">-1000</span><span class="w"></span>
<span class="n">step</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">-100</span><span class="w"></span>
</code></pre></div>
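<p>Both scripts below pass these three settings straight to Python's <code>range()</code>, whose <code>stop</code> value is exclusive. The following minimal sketch (the helper name <code>sweep</code> is mine, not part of the scripts) shows which offsets that produces, which is why the largest offset actually tested is -900 mV rather than -1000 mV:</p>

```python
def sweep(start, stop, step):
    """Return the voltage offsets (in mV) that range() yields for these settings."""
    return list(range(start, stop, step))


# Same values as config.toml: start = 0, stop = -1000, step = -100
offsets = sweep(0, -1000, -100)
print(offsets)
# [0, -100, -200, -300, -400, -500, -600, -700, -800, -900]
```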
<div class="highlight"><pre><span></span><code><span class="c1"># test.py</span>
<span class="kn">import</span> <span class="nn">toml</span>
<span class="kn">from</span> <span class="nn">subprocess</span> <span class="kn">import</span> <span class="n">run</span>
<span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Process</span>
<span class="kn">from</span> <span class="nn">tempfile</span> <span class="kn">import</span> <span class="n">NamedTemporaryFile</span>
<span class="kn">from</span> <span class="nn">jinja2</span> <span class="kn">import</span> <span class="n">Template</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="k">def</span> <span class="nf">undervolt_gpu</span><span class="p">(</span><span class="n">voltage</span><span class="p">):</span>
    <span class="n">log</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"undervolt___</span><span class="si">{</span><span class="n">voltage</span><span class="si">}</span><span class="s2">.log"</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"undervolt_gpu: saving to </span><span class="si">{</span><span class="n">log</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"lenovo_fix.conf"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fd</span><span class="p">:</span>
        <span class="n">tpl</span> <span class="o">=</span> <span class="n">Template</span><span class="p">(</span><span class="n">fd</span><span class="o">.</span><span class="n">read</span><span class="p">())</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">voltage</span><span class="o">=</span><span class="n">voltage</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="s2">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">ntf</span><span class="p">:</span>
        <span class="n">ntf</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">tpl</span><span class="p">)</span>
        <span class="n">ntf</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span>
        <span class="n">cmd</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"sudo /usr/lib/throttled/lenovo_fix.py --monitor --log </span><span class="si">{</span><span class="n">log</span><span class="si">}</span><span class="s2"> --config </span><span class="si">{</span><span class="n">ntf</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2">"</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"voltage"</span><span class="p">,</span> <span class="n">voltage</span><span class="p">,</span> <span class="s2">"cmd"</span><span class="p">,</span> <span class="n">cmd</span><span class="p">)</span>
        <span class="n">run</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">check</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">shell</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
    <span class="n">cfg</span> <span class="o">=</span> <span class="n">toml</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"config.toml"</span><span class="p">)</span>
    <span class="n">voltages</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s2">"voltage"</span><span class="p">][</span><span class="s2">"start"</span><span class="p">],</span> <span class="n">cfg</span><span class="p">[</span><span class="s2">"voltage"</span><span class="p">][</span><span class="s2">"stop"</span><span class="p">],</span> <span class="n">cfg</span><span class="p">[</span><span class="s2">"voltage"</span><span class="p">][</span><span class="s2">"step"</span><span class="p">]))</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">voltages</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">voltage</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">voltages</span><span class="p">):</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">Process</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">undervolt_gpu</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">voltage</span><span class="p">,))</span>
        <span class="n">p</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
        <span class="n">cmd</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"glmark2 -b :duration=2.0 --fullscreen | tee glmark___</span><span class="si">{</span><span class="n">voltage</span><span class="si">}</span><span class="s2">.log"</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"process launched"</span><span class="p">)</span>
        <span class="n">run</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">check</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">shell</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"process terminating"</span><span class="p">)</span>
        <span class="n">p</span><span class="o">.</span><span class="n">terminate</span><span class="p">()</span>
        <span class="nb">print</span><span class="p">(</span><span class="s2">"process terminated"</span><span class="p">)</span>
</code></pre></div>
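<p>The script above runs the power monitor in the background, runs the benchmark in the foreground, and then stops the monitor. The same pattern can be sketched with <code>subprocess.Popen</code> alone. This is an alternative sketch, not the method used above: the function name <code>run_with_monitor</code> and the stand-in commands are hypothetical, and it does not address running the monitor under <code>sudo</code>.</p>

```python
import subprocess
import sys


def run_with_monitor(monitor_cmd, benchmark_cmd):
    """Run benchmark_cmd while monitor_cmd runs in the background.

    Both arguments are argument lists. Returns the benchmark's exit code.
    """
    monitor = subprocess.Popen(monitor_cmd)
    try:
        result = subprocess.run(benchmark_cmd, check=True)
    finally:
        # Stop the monitor even if the benchmark raised an error.
        monitor.terminate()
        monitor.wait(timeout=10)
    return result.returncode


# Harmless stand-ins for the real monitor and benchmark commands.
fake_monitor = [sys.executable, "-c", "import time; time.sleep(60)"]
fake_benchmark = [sys.executable, "-c", "print('benchmark done')"]
print(run_with_monitor(fake_monitor, fake_benchmark))
```

<p>Putting the cleanup in a <code>finally</code> clause means a failed benchmark run does not leave the monitor running.</p>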
<p>I generated a CSV report by regex parsing the data logged from <code>glmark2</code> and <code>lenovo_fix.py</code>:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># report.py</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">os.path</span>
<span class="kn">import</span> <span class="nn">toml</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="n">cfg</span> <span class="o">=</span> <span class="n">toml</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"config.toml"</span><span class="p">)</span>
<span class="n">voltages</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s2">"voltage"</span><span class="p">][</span><span class="s2">"start"</span><span class="p">],</span> <span class="n">cfg</span><span class="p">[</span><span class="s2">"voltage"</span><span class="p">][</span><span class="s2">"stop"</span><span class="p">],</span> <span class="n">cfg</span><span class="p">[</span><span class="s2">"voltage"</span><span class="p">][</span><span class="s2">"step"</span><span class="p">]))</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'undervolting.csv'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span> <span class="k">as</span> <span class="n">csvfile</span><span class="p">:</span>
    <span class="n">fieldnames</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'glmark'</span><span class="p">,</span> <span class="s1">'voltage'</span><span class="p">,</span> <span class="s1">'package_watts'</span><span class="p">,</span> <span class="s1">'graphics_watts'</span><span class="p">,</span> <span class="s1">'dram_watts'</span><span class="p">]</span>
    <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">csvfile</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">fieldnames</span><span class="p">)</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">voltage</span> <span class="ow">in</span> <span class="n">voltages</span><span class="p">:</span>
        <span class="n">glmark_p</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"glmark___</span><span class="si">{</span><span class="n">voltage</span><span class="si">}</span><span class="s2">.log"</span>
        <span class="n">undervolt_p</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"undervolt___</span><span class="si">{</span><span class="n">voltage</span><span class="si">}</span><span class="s2">.log"</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">glmark_p</span><span class="p">)</span> <span class="k">as</span> <span class="n">fd</span><span class="p">:</span>
                <span class="k">try</span><span class="p">:</span>
                    <span class="n">glmark_s</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">"(?:glmark2 Score: )(\d+)"</span><span class="p">,</span> <span class="n">fd</span><span class="o">.</span><span class="n">read</span><span class="p">())</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
                <span class="k">except</span> <span class="ne">AttributeError</span><span class="p">:</span>
                    <span class="k">continue</span>
        <span class="k">except</span> <span class="ne">FileNotFoundError</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">undervolt_p</span><span class="p">)</span> <span class="k">as</span> <span class="n">fd</span><span class="p">:</span>
                <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">fd</span><span class="p">:</span>
                    <span class="n">package</span><span class="p">,</span> <span class="n">graphics</span><span class="p">,</span> <span class="n">dram</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span>
                        <span class="sa">r</span><span class="s2">"(?:Package: )(\d*\.?\d+)(?: W - Graphics: )(\d*\.?\d+)(?: W - DRAM: )(\d*\.?\d+)"</span><span class="p">,</span> <span class="n">line</span>
                    <span class="p">),</span> <span class="s2">"groups"</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="p">(</span><span class="s2">"0"</span><span class="p">,</span> <span class="s2">"0"</span><span class="p">,</span> <span class="s2">"0"</span><span class="p">,))()</span>
                    <span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span>
                        <span class="n">glmark</span><span class="o">=</span><span class="n">glmark_s</span><span class="p">,</span>
                        <span class="n">voltage</span><span class="o">=</span><span class="n">voltage</span><span class="p">,</span>
                        <span class="n">package_watts</span><span class="o">=</span><span class="n">package</span><span class="p">,</span>
                        <span class="n">graphics_watts</span><span class="o">=</span><span class="n">graphics</span><span class="p">,</span>
                        <span class="n">dram_watts</span><span class="o">=</span><span class="n">dram</span><span class="p">))</span>
        <span class="k">except</span> <span class="ne">FileNotFoundError</span><span class="p">:</span>
            <span class="k">continue</span>
</code></pre></div>
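<p>To make the power-line parsing in the report script easier to follow, here is a minimal standalone sketch of the same regular expression applied to a single line. The sample line is hypothetical, reconstructed from the pattern itself rather than copied from a real log, and the helper name <code>parse_power</code> is mine.</p>

```python
import re

# Same fields as the pattern in report.py, written as a raw string.
POWER_RE = re.compile(
    r"Package: (\d*\.?\d+) W - Graphics: (\d*\.?\d+) W - DRAM: (\d*\.?\d+)"
)


def parse_power(line):
    """Return (package, graphics, dram) as strings, or zeros if the line has no match."""
    match = POWER_RE.search(line)
    return match.groups() if match else ("0", "0", "0")


sample = "Package: 12.3 W - Graphics: 4.5 W - DRAM: 0.8 W"  # hypothetical log line
print(parse_power(sample))                # ('12.3', '4.5', '0.8')
print(parse_power("no power data here"))  # ('0', '0', '0')
```

<p>Like the original <code>getattr</code> fallback, this writes zeros for any line that carries no power readings, so such lines still appear as rows in the CSV.</p>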
<!-- ![](/static/undervolt_v4.svg){:class="img-responsive"} -->
<!-- This GPU benchmark performs best and diminishing returns begin around -400 mV GPU voltage offset. -->
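<p>One way to compare peak power across offsets is to reduce the CSV to the highest package power seen at each voltage. The sketch below assumes the column layout written by <code>report.py</code>. The function name <code>peak_package_watts</code> is mine and the rows are made-up numbers for illustration, not measurements.</p>

```python
import csv
import io
from collections import defaultdict


def peak_package_watts(csvfile):
    """Map each voltage offset (mV) to the highest package_watts value logged for it."""
    peaks = defaultdict(float)
    for row in csv.DictReader(csvfile):
        voltage = int(row["voltage"])
        peaks[voltage] = max(peaks[voltage], float(row["package_watts"]))
    return dict(peaks)


# Made-up rows in the same column layout as undervolting.csv.
sample = io.StringIO(
    "glmark,voltage,package_watts,graphics_watts,dram_watts\n"
    "1000,0,14.0,5.0,1.0\n"
    "1000,0,15.5,5.2,1.0\n"
    "1010,-100,13.0,4.8,1.0\n"
    "1010,-100,13.5,4.9,1.0\n"
)
print(peak_package_watts(sample))  # {0: 15.5, -100: 13.5}
```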
<p>Adjusting the GPU voltage offset with <a href="https://github.com/erpalma/throttled">throttled</a> has some effect on GPU performance, but it is unclear how meaningful that effect is with the Linux implementation. The throttled GPU voltage offset parameters are not on the same scale as those in the Windows-based Intel XTU utility. The relationship between the two warrants further research to fully unlock the battery life and compute performance potential of our machines.</p>
<p>To set your GPU voltage offset to -400 mV, edit <code>lenovo_fix.conf</code>, which is located at <code>/etc/lenovo_fix.conf</code> on Arch Linux if you installed throttled from the <a href="https://www.archlinux.org/packages/community/any/throttled/">throttled package</a>.</p>