Adam Gradzki's Personal Website - Software

Visual Studio Code fix for Python 3.11 debugging

2022-12-19T00:00:00-05:00

Symptom:

PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.

Solution:

git clone --depth 1 https://github.com/microsoft/debugpy.git ~/.debugpy
rm -rf ~/.vscode/extensions/ms-python.python-2022.20.1/pythonFiles/lib/python/debugpy/
mv -v ~/.debugpy/src/debugpy/ ~/.vscode/extensions/ms-python.python-2022.20.1/pythonFiles/lib/python
rm -rf ~/.debugpy
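
The paths above hard-code extension version 2022.20.1, so they will not match other releases of the Python extension. As a minimal sketch, assuming the same directory layout holds for other versions, the bundled debugpy directory can be located without naming a version. The helper name `find_bundled_debugpy` is hypothetical.

```python
from pathlib import Path


def find_bundled_debugpy(extensions_dir: Path) -> list[Path]:
    """Return the bundled debugpy directory of every installed ms-python.python version."""
    # Assumption: other extension versions keep the layout used in the commands above.
    pattern = "ms-python.python-*/pythonFiles/lib/python/debugpy"
    return sorted(p for p in extensions_dir.glob(pattern) if p.is_dir())


if __name__ == "__main__":
    # Prints nothing if the extension (or this layout) is not present.
    for path in find_bundled_debugpy(Path.home() / ".vscode" / "extensions"):
        print(path)
```

Each printed path is a candidate for the rm and mv steps.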

CPU specific optimized Python on AWS

2022-11-30T00:00:00-05:00

Recently there has been a need to host our software services on AWS more efficiently. As of publication, this is most readily achievable for general-purpose Python code with [AWS Graviton instances](https://aws.amazon.com/ec2/graviton/).

Starting from a Debian 11 ARM EC2 t4g instance, the following commands build Python from source, optimized for the CPU architecture the instance runs on. Note that LTO is enabled in the build and uses a huge amount of RAM, so make sure you have enough RAM or use a swapfile, as I demonstrate below, since I chose a t4g.small instance.

sudo apt update -y
sudo apt upgrade -y
sudo apt install -y \
    git \
    build-essential \
    gdb \
    lcov \
    libbz2-dev \
    libffi-dev \
    libgdbm-dev \
    liblzma-dev \
    libncurses5-dev \
    libreadline6-dev \
    libsqlite3-dev \
    libssl-dev \
    lzma \
    lzma-dev \
    tk-dev \
    uuid-dev \
    libxml2-dev \
    libxml2 \
    libxslt1-dev \
    libxslt1.1 \
    xvfb \
    zlib1g-dev
sudo apt build-dep python3 -y
curl https://pyenv.run | bash

# If you have less than 8 GB RAM, create a swapfile and enable it
# Don't put the swapfile on EBS.
# It should be on your instance root volume for performance.
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

################################################
# Append the following to the end of ~/.bashrc #
################################################

export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

# Restart your shell for the changes to take effect.

# Load pyenv-virtualenv automatically by adding
# the following to ~/.bashrc:

eval "$(pyenv virtualenv-init -)"

################################################
#        Restart your shell to continue        #
################################################

CFLAGS="-march=native -mtune=native" CONFIGURE_OPTS="--enable-optimizations --with-lto=full" pyenv install 3.11.0 --verbose
pyenv virtualenv 3.11.0 MY_VIRTUAL_ENVIRONMENT_NAME
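
To check that the optimization flags reached the build, the new interpreter can report the arguments its configure script received. This is a minimal sketch, not part of the original procedure. The helper name `missing_flags` is hypothetical, and it assumes pyenv forwards CONFIGURE_OPTS to configure unchanged.

```python
import sysconfig


def missing_flags(config_args: str, expected: tuple[str, ...]) -> list[str]:
    """Return the expected configure flags that do not appear in config_args."""
    return [flag for flag in expected if flag not in config_args]


if __name__ == "__main__":
    # CONFIG_ARGS records the arguments passed to ./configure at build time.
    # It can be None on builds that did not use configure, hence the fallback.
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    expected = ("--enable-optimizations", "--with-lto=full")
    print("CONFIG_ARGS:", config_args)
    print("missing:", missing_flags(config_args, expected))
```

Run it with the interpreter from the new virtual environment. An empty "missing" list suggests both flags were applied.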

Golang for loops and the range-for

2020-06-17T00:00:00-04:00

I was unable to find a clear answer online about the performance implications of using a standard for loop (e.g., for i := 0; i < len(arr); i++ {}) vs. a range-for (e.g., for idx, val := range arr {}) in Golang. By analyzing the assembler output, I determined that as of Go 1.14 they produce similar CPU instructions for common 64-bit Intel/AMD CPUs.

The range-for loop produces two additional assembly instructions:

        movq    "".arr+8(SP), AX
        pcdata  $0, $0

This brings the total instruction count for function count2 to 13 compared to 11 for function count1.

I used https://go.godbolt.org/ at the suggestion of dominikh on Freenode IRC #golang to map the Golang functions with the corresponding assembler output.

My code for an apples-to-apples comparison between the two:

package main

func count1(arr []int) {
    for i := 0; i < len(arr); i++ {
        _ = i; _ = arr[i]
    }    
}

func count2(arr []int) {
    for i, v := range arr {
        _ = i; _ = v
    }
}

func main() {
    arr := make([]int, 100)
    for i := 0; i < len(arr); i++ {
        arr[i] = i
    }
    count1(arr)
    count2(arr)
}

count1() produces:

        pcdata  $0, $0
        pcdata  $1, $1
        movq    "".arr+16(SP), AX
        xorl    CX, CX
        jmp     count1_pc12
count1_pc9:
        incq    CX
count1_pc12:
        cmpq    CX, AX
        jlt     count1_pc9
        pcdata  $0, $-1
        pcdata  $1, $-1
        ret

count2() produces:

        pcdata  $0, $1
        pcdata  $1, $1
        movq    "".arr+8(SP), AX
        pcdata  $0, $0
        movq    8(AX), AX
        xorl    CX, CX
        jmp     count2_pc16
count2_pc13:
        incq    CX
count2_pc16:
        cmpq    CX, AX
        jlt     count2_pc13
        pcdata  $0, $-1
        pcdata  $1, $-1
        ret

count2 produces more assembler instructions, which may suggest that it is slower code. However, a higher instruction count does not always mean slower execution. Understanding the real performance implications of these instructions requires benchmarks, which are left for a future article.
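
The counts of 11 and 13 can be reproduced by tallying the two listings. The sketch below also tallies the lines without pcdata, since pcdata lines are Go assembler pseudo-instructions that carry metadata for the garbage collector rather than instructions the CPU executes. The function name `tally` is hypothetical, and the listings are copied from above.

```python
COUNT1 = '''
        pcdata  $0, $0
        pcdata  $1, $1
        movq    "".arr+16(SP), AX
        xorl    CX, CX
        jmp     count1_pc12
count1_pc9:
        incq    CX
count1_pc12:
        cmpq    CX, AX
        jlt     count1_pc9
        pcdata  $0, $-1
        pcdata  $1, $-1
        ret
'''

COUNT2 = '''
        pcdata  $0, $1
        pcdata  $1, $1
        movq    "".arr+8(SP), AX
        pcdata  $0, $0
        movq    8(AX), AX
        xorl    CX, CX
        jmp     count2_pc16
count2_pc13:
        incq    CX
count2_pc16:
        cmpq    CX, AX
        jlt     count2_pc13
        pcdata  $0, $-1
        pcdata  $1, $-1
        ret
'''


def tally(listing: str) -> tuple[int, int]:
    """Return (all instruction lines, instruction lines excluding pcdata)."""
    total = machine = 0
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.endswith(":"):  # skip blank lines and jump labels
            continue
        total += 1
        if not line.startswith("pcdata"):
            machine += 1
    return total, machine


print("count1:", tally(COUNT1))  # (11, 7)
print("count2:", tally(COUNT2))  # (13, 8)
```

Ignoring pcdata, the two functions differ by a single movq, which is consistent with the conclusion that the loops compile to similar code.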

Faster Virtual Machines on Linux Hosts with GPU Acceleration

2020-04-06T00:00:00-04:00

Table of Contents

Overview
Introduction
GPU command architectures
Hypervisors
Conclusion

Overview

Open source virtualization technologies widely available in the Linux software ecosystem lack the ease of use of graphical performance enhancements available in commercial virtualization technologies such as VMWare Workstation or VMWare vSphere/ESXi. Intel GVT-g is a virtual graphics acceleration technology which can be accessed with the QEMU virtualization system. QEMU serves as an open-source alternative to technologies such as VMWare Workstation or VMWare vSphere/ESXi. Intel GVT-g was configured on a Thinkpad X1 Generation 6 laptop containing Intel integrated graphics resulting in successful GPU acceleration on a UEFI Windows 10 64-bit guest without relying on proprietary software aside from the guest operating system itself. Substantially improved virtualization performance is possible due to working Intel GVT-g GPU acceleration on Linux hosts.

Introduction

Computer users rely on software written for many different operating systems. Virtual machines allow computer users to run different operating systems simultaneously and switch between them easily. Virtualization has further benefits: installed systems can be migrated to other physical machines with less downtime, untrusted code can be contained in a sandbox that is difficult to escape from, legacy systems that are difficult to keep running on obsolete hardware can be kept in operation, and a Windows-only program can simply be run on a Linux host.

Virtual machines with graphical user interfaces typically suffer from input lag and stuttering, both of which degrade the user experience. Additionally, computation-heavy software, such as photo editing or engineering applications, depends on efficient GPU access: the GPU can speed up calculations by an order of magnitude or more over the host machine's CPU, which lets them finish in a reasonable time. Unfortunately, not all virtualization solutions are able to use the physical chips on the host machine efficiently, regardless of cost.

GPU command architectures

VGA Emulation (VE)
- Universally available on all virtualization platforms
API forwarding (AF)
- Intel GVT-s
- VMWare Virtual Shared Graphics Acceleration (vSGA)
- Oracle VirtualBox 3D Acceleration
Direct Pass-Through (DPT)
- Intel GVT-d
- VMWare Virtual Dedicated Graphics Acceleration (vDGA)
  - Not available in VMWare Workstation
Full GPU Virtualization (FGV)
- Intel GVT-g
- VMWare Virtual Shared Pass-Through Graphics Acceleration (vGPU or MxGPU)
  - Not available in VMWare Workstation

VGA Emulation (VE)

The most primitive graphics display for any virtual machine is VGA Emulation (VE). This mode is also the most inefficient. QEMU emulates a Cirrus Logic GD5446 video card. All Windows versions starting from Windows 95 should recognize and use this graphics card.

Most hypervisors which advertise some form of "hardware acceleration" use API Forwarding (AF), which is a high-performance proxy service that requires specialized drivers on both the host and guest to create a high-performance instruction pipeline.

API forwarding (AF)

API Forwarding (AF) works by:

intercepting the GPU command requested by a piece of software
proxying the GPU command to the host hypervisor
executing the captured GPU command on the host from the hypervisor
bubbling the response back up to the virtual machine

This mode is very useful when many virtual machines are competing for the resources of a single GPU and Full GPU Virtualization (FGV) is not possible. The hypervisor queues graphics card operations from one or more virtual machines and schedules virtual execution and memory slots for each virtual machine on a single physical GPU resource. Each virtual machine sees its own graphics card while the hypervisor splits up the single physical resource. A key drawback of AF is that usually only the OpenGL and DirectX interfaces are supported by the GPU instruction proxy.
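
The flow described above can be sketched as a toy model. This is only an illustration under stated assumptions, not how any real hypervisor is implemented: the class names `GuestDriver`, `Hypervisor`, and `GpuCommand` are hypothetical, and "executing" a command on the host is reduced to building a string.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class GpuCommand:
    vm_id: str
    api: str   # graphics API the guest application used
    call: str  # name of the intercepted call


class Hypervisor:
    # Assumption for the sketch: only these APIs can be proxied.
    SUPPORTED_APIS = {"OpenGL", "DirectX"}

    def __init__(self) -> None:
        self.queue: deque[GpuCommand] = deque()

    def submit(self, command: GpuCommand) -> None:
        """Accept a proxied command from a guest and queue it."""
        if command.api not in self.SUPPORTED_APIS:
            raise ValueError(f"{command.api} is not supported by the proxy")
        self.queue.append(command)

    def run_queue(self) -> dict[str, list[str]]:
        """Execute queued commands in order and group the responses by VM."""
        results: dict[str, list[str]] = {}
        while self.queue:
            cmd = self.queue.popleft()
            # Stand-in for running the command on the single physical GPU.
            results.setdefault(cmd.vm_id, []).append(f"host ran {cmd.api}:{cmd.call}")
        return results


class GuestDriver:
    """Guest-side driver: intercepts GPU calls and forwards them to the host."""

    def __init__(self, vm_id: str, hypervisor: Hypervisor) -> None:
        self.vm_id = vm_id
        self.hypervisor = hypervisor

    def call(self, api: str, call: str) -> None:
        self.hypervisor.submit(GpuCommand(self.vm_id, api, call))


hv = Hypervisor()
vm_a, vm_b = GuestDriver("vm-a", hv), GuestDriver("vm-b", hv)
vm_a.call("OpenGL", "glDrawArrays")
vm_b.call("DirectX", "Present")
print(hv.run_queue())
```

Each guest only ever talks to its own driver object, while the hypervisor interleaves commands from all guests on one queue. A call through an unsupported API is rejected at the proxy, which mirrors the drawback noted above.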

The process by which API Forwarding (AF) works is known as paravirtualization.

Direct Pass-Through (DPT)

Direct Pass-Through (DPT) exposes the GPU as a PCI device that is directly addressable by the virtual machine. Nothing besides that virtual machine can reference any resources on the GPU, and the GPU cannot be shared with the physical machine or any other virtual machine. Many devices have only one graphics card installed, so using this system would make the host's graphical user interface inoperable. This method is most useful when:

the physical graphics card does not support Full GPU Virtualization (FGV)
two or more graphics cards are attached to a system
paravirtualized drivers are not available or do not work with the installed physical GPU, host hypervisor, or guest operating system

Full GPU Virtualization (FGV)

Sharing a GPU natively among multiple virtual machines is possible with Full GPU Virtualization (FGV) solutions such as Intel GVT-g. This process is also known as Hardware Assisted Virtualization (HVM), not to be confused with Paravirtualization (PV). In this mode the IOMMU hardware exposes a GPU memory interface to each virtual machine while it internally handles the memory address mappings between what it exposes to virtual machines and the actual physical memory on the GPU.

In "IOMMU and Virtualization," Susanta Nanda writes:

IOMMU provides two main functionalities: virtual-to-physical address translation and access protection on the memory ranges that an I/O device is trying to operate on.
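
The two functions named in the quote, address translation and access protection, can be illustrated with a toy page table. This is a simplified sketch and not a model of real IOMMU hardware: the class `ToyIommu`, the 4 KiB page size, and the device names are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class ToyIommu:
    def __init__(self) -> None:
        # (device, virtual page number) -> (physical page number, writable)
        self.table: dict[tuple[str, int], tuple[int, bool]] = {}

    def map_page(self, device: str, virt_page: int, phys_page: int,
                 writable: bool = False) -> None:
        self.table[(device, virt_page)] = (phys_page, writable)

    def translate(self, device: str, virt_addr: int, write: bool = False) -> int:
        """Translate a device-visible address, enforcing access protection."""
        page, offset = divmod(virt_addr, PAGE_SIZE)
        entry = self.table.get((device, page))
        if entry is None:
            raise PermissionError(f"{device}: page {page} is not mapped")
        phys_page, writable = entry
        if write and not writable:
            raise PermissionError(f"{device}: page {page} is read-only")
        return phys_page * PAGE_SIZE + offset


iommu = ToyIommu()
iommu.map_page("vm1", virt_page=0, phys_page=5, writable=True)
iommu.map_page("vm2", virt_page=0, phys_page=9)
print(iommu.translate("vm1", 0x10))  # 20496
print(iommu.translate("vm2", 0x10))  # 36880
```

Both devices use the same virtual address but land in different physical pages, and any access outside a device's own mappings raises an error instead of reaching memory.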

According to Intel engineer Zhenyu Wang's XDC2017 presentation "Full GPU virtualization in mediated pass-through way", media workloads reach a peak of 95% of native host performance when running one virtual machine, and average 85% of native host performance.

Hypervisors

Hypervisors are software systems that enable multiple virtual machines to run simultaneously on a single physical machine. Linux users have many hypervisor options. A longer list is available here. These are a sample of some Linux hypervisors:

Oracle VirtualBox
VMWare vSphere/ESXi (technically runs beneath Linux)
VMWare Workstation
QEMU

Oracle VirtualBox

Possibly the most widely-used hypervisor is VirtualBox by Oracle. VirtualBox is open source software with the exception of the optional extension pack.

The Oracle VirtualBox extension pack provides many features which are not available in the free version. The PCI passthrough module was shipped as an Oracle VM VirtualBox extension package until the feature was scrapped.

These features are available gratis for personal and non-commercial use only.

Possession of the VirtualBox Extension Pack without a license can be problematic:

Got an email today informing me (Urgent Virtual Box Licensing Information for Company X) that there have been TWELVE (12!) downloads of the VirtualBox Extension Pack at my employer in the past year. And since the extensions are licensed differently than the base product, they'd love for us to call them and talk about how much money we owe them. Their report attached to email listed source IPs and AS number, as well as date/product/version. Out of the twelve (12!), there were always two on the same day of the same version, so really six (6!) downloads. We'll probably end up giving them $150, and I'll make sure they never get any business from places I work, because fuck Oracle. I wouldn't piss on Larry Ellison if he was on fire.

VirtualBox Linux hosts do not support GPU DPT (Direct Pass-Through) at all. All of the preliminary PCI pass-through work for Linux hosts which is needed for GPU DPT was completely stripped out on December 5th, 2019 with this message:

Linux host: Drop PCI passthrough, the current code is too incomplete (cannot handle PCIe devices at all), i.e. not useful enough

VirtualBox 2D and 3D acceleration both work according to the same principle: API forwarding (AF).

Oracle VM VirtualBox implements 3D acceleration by installing an additional hardware 3D driver inside the guest when the Guest Additions are installed. This driver acts as a hardware 3D driver and reports to the guest operating system that the virtual hardware is capable of 3D hardware acceleration. When an application in the guest then requests hardware acceleration through the OpenGL or Direct3D programming interfaces, these are sent to the host through a special communication tunnel implemented by Oracle VM VirtualBox. The host then performs the requested 3D operation using the host's programming interfaces.

VMWare vSphere/ESXi

VMWare vSphere/ESXi is a bare metal hypervisor which runs beneath any end-user operating systems. This property makes it a Type 1 Hypervisor. It supports all GPU acceleration technologies.

Virtual Shared Graphics Acceleration (vSGA) is a form of API forwarding (AF).
Virtual Dedicated Graphics Acceleration (vDGA) technology is a form of Direct Pass-Through (DPT).
VMWare Virtual Shared Pass-Through Graphics Acceleration (vGPU or MxGPU) is a form of Full GPU Virtualization (FGV).

The main limitation of VMWare vSphere/ESXi GPU acceleration is the graphics card selection. GPU passthrough is possible only with a small set of GPUs because NVIDIA drivers disable consumer-market GPUs such as the GeForce series when the drivers detect that they are running in a virtual environment. Comprehensive list of all supported graphics cards for any hardware acceleration purposes.

VMWare Workstation

VMWare Workstation only provides Virtual Shared Graphics Acceleration (vSGA), a form of API forwarding (AF). In this regard, the GPU acceleration story is identical to Oracle VirtualBox.

QEMU

QEMU is an open source virtual machine platform that is also capable of translating instructions between wholly unrelated computer architectures. It is widely available in most Linux distributions and is used extensively in industry.

As of the publication date of this article, after enabling GVT-g in QEMU you must also recompile QEMU with the 60 fps fix to get smooth video.

Conclusion

QEMU eclipses VirtualBox in features and exceeds VMWare's capabilities. VirtualBox is limited to API forwarding (AF) since it cannot let virtual machines address graphics hardware directly in any way. VMWare solutions support all types of GPU addressing, but most graphics cards made by NVIDIA disable themselves when they detect being called in Direct Pass-Through (DPT) or Full GPU Virtualization (FGV) modes. QEMU offers more hardware flexibility than VMWare and brings near-native graphics performance to guest operating systems such as Windows 10 with minimal driver support required in the guest. I recommend using QEMU on Linux when high graphics performance and low operational costs are the priorities for deploying a virtual machine environment.

Thinkpad X1 Carbon GPU Undervolting in Arch Linux

2020-04-03T00:00:00-04:00

After successfully installing Arch Linux on my Thinkpad X1 Carbon Generation 6 I began experimenting with laptop component undervolting to reduce heat and improve performance. Unfortunately it seems that the set points are being ignored to some extent.

Indeed, Francesco Palmarini reported on GitHub:

It is kind of impossible that your GPU can work at -1V offset voltage. On XTU I get an instant reboot on -500mV

However, despite the apparent physical impossibility of applying such a large voltage offset, there seems to be a real effect on performance. Applying a -500 mV voltage offset resulted in approximately 2 watts lower peak power consumption as measured by throttled monitoring.

I benchmarked GPU performance at different voltage offsets to see whether they are ignored beyond a certain point. The results show diminishing returns beyond a 300 mV undervolt. Still, it is not clear whether the larger undervolt settings are actually being applied, because it should not be possible for the CPU to operate when undervolted by 900 mV.

# config.toml

[voltage]
start = 0
stop = -1000
step = -100

# test.py

import toml
from subprocess import run
from multiprocessing import Process
from tempfile import NamedTemporaryFile
from jinja2 import Template
from tqdm import tqdm

def undervolt_gpu(voltage):
    log = f"undervolt___{voltage}.log"
    print(f"undervolt_gpu: saving to {log}")
    with open("lenovo_fix.conf") as fd:
        tpl = Template(fd.read()).render(voltage=voltage)
    with NamedTemporaryFile(mode="w") as ntf:
        ntf.write(tpl)
        ntf.flush()
        cmd = f"sudo /usr/lib/throttled/lenovo_fix.py --monitor --log {log} --config {ntf.name}"
        print("voltage", voltage, "cmd", cmd)
        run(cmd, check=True, shell=True)

if __name__ == "__main__":
    cfg = toml.load("config.toml")
    voltages = list(range(cfg["voltage"]["start"], cfg["voltage"]["stop"], cfg["voltage"]["step"]))
    print(voltages)
    for voltage in tqdm(voltages):
        p = Process(target=undervolt_gpu, args=(voltage,))
        p.start()
        cmd = f"glmark2 -b :duration=2.0 --fullscreen | tee glmark___{voltage}.log"
        print("process launched")
        run(cmd, check=True, shell=True)
        print("process terminating")
        p.terminate()
        print("process terminated")
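
Because Python's range excludes its stop value, the [voltage] settings in config.toml expand to ten offsets ending at -900 mV, so -1000 mV itself is never tested. A minimal sketch of the expansion follows. The function name `voltage_sweep` is hypothetical, but the range call matches the one in test.py.

```python
def voltage_sweep(start: int, stop: int, step: int) -> list[int]:
    """Expand the [voltage] settings the same way test.py does (stop is exclusive)."""
    return list(range(start, stop, step))


print(voltage_sweep(0, -1000, -100))
# [0, -100, -200, -300, -400, -500, -600, -700, -800, -900]
```

To include -1000 mV in a sweep, stop would need to be one step further, for example -1100.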

I generated a CSV report by regex parsing the data logged from glmark2 and lenovo_fix.py:

# report.py

import re
import os.path
import toml
import csv

cfg = toml.load("config.toml")
voltages = list(range(cfg["voltage"]["start"], cfg["voltage"]["stop"], cfg["voltage"]["step"]))

with open('undervolting.csv', 'w', newline='') as csvfile:
    fieldnames = ['glmark', 'voltage', 'package_watts', 'graphics_watts', 'dram_watts']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for voltage in voltages:
        glmark_p = f"glmark___{voltage}.log"
        undervolt_p = f"undervolt___{voltage}.log"
        try:
            with open(glmark_p) as fd:
                try:
                    # Raw string so that \d reaches the regex engine unescaped.
                    glmark_s = re.search(r"(?:glmark2 Score: )(\d+)", fd.read()).group(1)
                except AttributeError:
                    continue
        except FileNotFoundError:
            continue
        try:
            with open(undervolt_p) as fd:
                for line in fd:
                    # Lines that do not match fall back to ("0", "0", "0") below.
                    package, graphics, dram = getattr(re.search(
                        r"(?:Package: )(\d*\.?\d+)(?: W - Graphics: )(\d*\.?\d+)(?: W - DRAM: )(\d*\.?\d+)", line
                    ), "groups", lambda: ("0", "0", "0",))()
                    writer.writerow(dict(
                        glmark=glmark_s,
                        voltage=voltage,
                        package_watts=package,
                        graphics_watts=graphics,
                        dram_watts=dram))
        except FileNotFoundError:
            continue
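
The two regular expressions in report.py can be tried in isolation. The sketch below uses the same patterns written as raw strings. The sample lines are hypothetical, constructed only to fit the patterns and not copied from real glmark2 or lenovo_fix.py output, and the helper names are made up for the example.

```python
import re

# Same patterns as report.py, without the non-capturing wrappers.
POWER_RE = re.compile(
    r"Package: (\d*\.?\d+) W - Graphics: (\d*\.?\d+) W - DRAM: (\d*\.?\d+)"
)
SCORE_RE = re.compile(r"glmark2 Score: (\d+)")


def parse_power(line: str):
    """Return (package, graphics, dram) watts as floats, or None if no match."""
    match = POWER_RE.search(line)
    return tuple(float(v) for v in match.groups()) if match else None


def parse_score(text: str):
    """Return the glmark2 score as an int, or None if it is absent."""
    match = SCORE_RE.search(text)
    return int(match.group(1)) if match else None


# Hypothetical sample lines shaped to fit the patterns.
print(parse_power("Package: 12.3 W - Graphics: 4.5 W - DRAM: 0.9 W"))  # (12.3, 4.5, 0.9)
print(parse_score("glmark2 Score: 1234"))  # 1234
print(parse_power("unrelated log line"))  # None
```

Unlike report.py, which writes a row of zeros for every log line that does not match, these helpers return None so a caller can skip such lines.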

throttled GPU voltage offset adjustment has some effect on GPU performance, but it is unclear to what extent this is meaningful with the Linux implementation. GPU voltage offset adjustment parameters are not on the same scale as those within the Windows-based Intel XTU utility. The relationship between throttled GPU voltage offset and Intel XTU voltage offset warrants further research to fully unlock the battery life and calculation performance potential of our machines.

To set your GPU voltage offset to -400 mV, edit lenovo_fix.conf, which is located at /etc/lenovo_fix.conf on Arch Linux if you installed throttled from the throttled package.
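
As a rough sketch of that edit, the snippet below changes a GPU offset in an INI-style configuration held in memory. The section name UNDERVOLT and the key GPU are assumptions about the file layout and may differ between throttled versions, so check them against your own lenovo_fix.conf. The helper name `set_gpu_offset` is hypothetical.

```python
import configparser
import io

# Assumed layout: section and key names are NOT confirmed by this article.
SAMPLE_CONF = """\
[UNDERVOLT]
CORE: 0
GPU: 0
"""


def set_gpu_offset(conf_text: str, millivolts: int,
                   section: str = "UNDERVOLT", key: str = "GPU") -> str:
    """Return conf_text with the given key set to the offset in millivolts."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the original upper-case key names
    parser.read_string(conf_text)
    parser[section][key] = str(millivolts)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


print(set_gpu_offset(SAMPLE_CONF, -400))
```

Note that configparser drops comments and rewrites "key: value" as "key = value" when it writes the file back, so editing the real file by hand may be preferable.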