My a11y journey

2025-06-20 01:11
[personal profile] mjg59
23 years ago I was in a bad place. I'd quit my first attempt at a PhD for various reasons that were, with hindsight, bad, and I was suddenly entirely aimless. I lucked into picking up a sysadmin role back at TCM where I'd spent a summer a year before, but that's not really what I wanted in my life. And then Hanna mentioned that her PhD supervisor was looking for someone familiar with Linux to work on making Dasher, one of the group's research projects, more usable on Linux. I jumped.

The timing was fortuitous. Sun were pumping money and developer effort into accessibility support, and the Inference Group had just received a grant from the Gatsby Foundation that involved working with the ACE Centre to provide additional accessibility support. And I was suddenly hacking on code that was largely ignored by most developers, supporting use cases that were irrelevant to most developers. Being in a relatively greenfield space sounds refreshing, until you realise that you're catering to actual humans who are potentially going to rely on your software to be able to communicate. That's somewhat focusing.

This was, uh, something of an on-the-job learning experience. I had to catch up with a lot of new technologies very quickly, but that wasn't the hard bit - what was difficult was realising I had to cater to people who were dealing with use cases that I had no experience of whatsoever. Dasher was extended to allow text entry into applications without needing to cut and paste. We added support for introspection of the current application's UI so menus could be exposed via the Dasher interface, allowing people to fly through menu hierarchies and pop open file dialogs. Text-to-speech was incorporated so people could rapidly enter sentences and have them spoken out loud.

But what sticks with me isn't the tech, or even the opportunities it gave me to meet other people working on the Linux desktop and forge friendships that still exist. It was the cases where I had the opportunity to work with people who could use Dasher as a tool to increase their ability to communicate with the outside world, whose lives were transformed for the better because of what we'd produced. Watching someone use your code and realising that you could write a three line patch that had a significant impact on the speed they could talk to other people is an incomparable experience. It's been decades and in many ways that was the most impact I've ever had as a developer.

I left after a year to work on fruitflies and get my PhD, and my career since then hasn't involved a lot of accessibility work. But it's stuck with me - every improvement in that space is something that has a direct impact on the quality of life of more people than you expect, but is also something that goes almost unrecognised. The people working on accessibility are heroes. They're making all the technology everyone else produces available to people who would otherwise be blocked from it. They deserve recognition, and they deserve a lot more support than they have.

But when we deal with technology, we deal with transitions. A lot of the Linux accessibility support depended on X11 behaviour that is now widely regarded as a set of misfeatures. It's not actually good to be able to inject arbitrary input into an arbitrary window, and it's not good to be able to arbitrarily scrape out its contents. X11 never had a model to permit this for accessibility tooling while blocking it for other code. Wayland does, but suffers from the surrounding infrastructure not yet being well developed. We're seeing that happen now, though - Gnome has been performing a great deal of work in this respect, and KDE is picking that up as well. There isn't yet full parity between the X11-based Linux accessibility stack and Wayland, but for many users the Wayland accessibility infrastructure is already better than what X11 offered.

That's going to continue improving, and it'll improve faster with broader support. We've somehow ended up with the bizarre politicisation of Wayland as being some sort of woke thing while X11 represents the Roman Empire or some such bullshit, but the reality is that there is no story for improving accessibility support under X11 and sticking to X11 is going to end up reducing the accessibility of a platform.

When you read anything about Linux accessibility, ask yourself whether you're reading something written by either a user of the accessibility features, or a developer of them. If they're neither, ask yourself why they actually care and what they're doing to make the future better.
[personal profile] mjg59
I'm lucky enough to have a weird niche ISP available to me, so I'm paying $35 a month for around 600Mbit symmetric. Unfortunately they don't offer static IP addresses to residential customers, nor do they allow multiple IP addresses per connection, and I'm the sort of person who'd like to run a bunch of stuff myself, so I've been looking for ways to manage this.

What I've ended up doing is renting a cheap VPS from a vendor that lets me add multiple IP addresses for minimal extra cost. The precise nature of the VPS isn't relevant - you just want a machine (it doesn't need much CPU, RAM, or storage) that has multiple world-routeable IPv4 addresses associated with it and no port blocks on incoming traffic. Ideally it's geographically local and peers with your ISP in order to reduce additional latency, but that's a nice-to-have rather than a requirement.

By setting that up you now have multiple real-world IP addresses that people can get to. How do we get traffic sent to those addresses through to the machine in your house that you want to be accessible? First we need a connection between that machine and your VPS, and the easiest approach here is Wireguard. We only need a point-to-point link, nothing routable, and none of the IP addresses involved need to have anything to do with any of the rest of your network. So, on your local machine you want something like:

[Interface]
PrivateKey = privkeyhere
ListenPort = 51820
Address = localaddr/32

[Peer]
Endpoint = VPS:51820
PublicKey = pubkeyhere
AllowedIPs = vpswgaddr/32


And on your VPS, something like:

[Interface]
Address = vpswgaddr/32
SaveConfig = true
ListenPort = 51820
PrivateKey = privkeyhere

[Peer]
PublicKey = pubkeyhere
AllowedIPs = localaddr/32


The addresses here are (other than the VPS address) arbitrary - but they do need to be consistent, otherwise Wireguard is going to be unhappy and your packets will not have a fun time. Bring that interface up with wg-quick and make sure the devices can ping each other. Hurrah! That's the easy bit.
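
Assuming each side's config is saved as /etc/wireguard/wg0.conf (the filename is an assumption - wg-quick just looks in /etc/wireguard by default), that's:

wg-quick up wg0
ping vpswgaddr

with the ping run from the local machine (or ping localaddr from the VPS side).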

Now you want packets from the outside world to get to your internal machine. Let's say the external IP address you're going to use for that machine is 321.985.520.309 and the Wireguard address of your local system is 867.420.696.005. On the VPS, you're going to want to do:

iptables -t nat -A PREROUTING -p tcp -d 321.985.520.309 -j DNAT --to-destination 867.420.696.005

Now, all incoming packets for 321.985.520.309 will be rewritten to head towards 867.420.696.005 instead (make sure you've set net.ipv4.ip_forward to 1 via sysctl!). Victory! Or is it? Well, no.

What we're doing here is rewriting the destination address of the packets so instead of heading to an address associated with the VPS, they're now going to head to your internal system over the Wireguard link. Your internal system is then going to ignore them, because the AllowedIPs statement in the config only allows packets coming from your VPS's Wireguard address, and these packets still have their original source IP. We could rewrite the source IP to match the VPS IP, but then you'd have no idea where any of these packets were coming from, and that sucks. Let's do something better. On the local machine, in the peer, let's update AllowedIPs to 0.0.0.0/0 to permit packets from any source to appear over our Wireguard link. But if we bring the interface up now, it'll try to route all traffic over the Wireguard link, which isn't what we want. So we'll add table = off to the interface stanza of the config to disable that, and now we can bring the interface up without breaking everything but still allowing packets to reach us. However, we do still need to tell the kernel how to reach the VPS's Wireguard address, which we can do with ip route add vpswgaddr dev wg0. Add this to the interface stanza as:

PostUp = ip route add vpswgaddr dev wg0
PreDown = ip route del vpswgaddr dev wg0


That's half the battle. The problem is that the packets are going to show up at your internal machine with their source address still set to the original source IP, and your internal system is (because Linux) going to notice it can just send replies to the outside world via your ISP rather than via Wireguard, so the replies never pass back through the VPS and nothing is going to work. Thanks, Linux. Thinux.

But there's a way to solve this - policy routing. Linux allows you to have multiple separate routing tables, and define policy that controls which routing table will be used for a given packet. First, let's define a new table reference. On the local machine, edit /etc/iproute2/rt_tables and add a new entry that's something like:

1 wireguard


where "1" is just a standin for a number not otherwise used there. Now edit your wireguard config and replace table=off with table=wireguard - Wireguard will now update the wireguard routing table rather than the global one. Now all we need to do is to tell the kernel to push packets into the appropriate routing table - we can do that with ip rule add from localaddr lookup wireguard, which tells the kernel to take any packet coming from our Wireguard address and push it via the Wireguard routing table. Add that to your Wireguard interface config as:

PostUp = ip rule add from localaddr lookup wireguard
PreDown = ip rule del from localaddr lookup wireguard

and now your local system is effectively on the internet.
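
Pulling all of that together, the local config ends up looking something like this (a sketch of the final state - keys and addresses are placeholders as before):

[Interface]
PrivateKey = privkeyhere
ListenPort = 51820
Address = localaddr/32
Table = wireguard
PostUp = ip route add vpswgaddr dev wg0
PostUp = ip rule add from localaddr lookup wireguard
PreDown = ip rule del from localaddr lookup wireguard
PreDown = ip route del vpswgaddr dev wg0

[Peer]
Endpoint = VPS:51820
PublicKey = pubkeyhere
AllowedIPs = 0.0.0.0/0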

You can do this for multiple systems - just configure additional Wireguard interfaces on the VPS and make sure they're all listening on different ports. If your local IP changes then your local machines will end up reconnecting to the VPS, but to the outside world their accessible IP address will remain the same. It's like having a real IP without the pain of convincing your ISP to give it to you.
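
For instance, the VPS side for a second machine is just another interface and peer - something like this, where the port, keys and addresses are again placeholders:

[Interface]
Address = vpswgaddr2/32
SaveConfig = true
ListenPort = 51821
PrivateKey = privkey2here

[Peer]
PublicKey = pubkey2here
AllowedIPs = localaddr2/32

plus a matching DNAT rule pointing the second external address at localaddr2.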
[personal profile] graydon2
Elsewhere I've been asked about the task of replaying the bootstrap process for rust. I figured it would be fairly straightforward, if slow. But as we got into it, there were just enough tricky / non-obvious bits in the process that it's worth making some notes here for posterity.

UPDATE: someone has also scripted many of the subsequent snapshot builds covering many years of rust's post-bootstrap development. Consider the rest of this post just a verbose primer for interpreting their work.

context


Rust started its life as a compiler written in ocaml, called rustboot. This compiler did not use LLVM; it just emitted 32-bit i386 machine code in 3 object file formats (Linux ELF, macOS Mach-O, and Windows PE).

We then wrote a second compiler in Rust called rustc that did use LLVM as its backend (and which, yes, is the genesis of today's rustc) and ran rustboot on rustc to produce a so-called "stage0 rustc". Then stage0 rustc was fed the sources of rustc again, producing a stage1 rustc. Successfully executing this stage0 -> stage1 step (rather than just crashing mid-compilation) is what we're going to call "bootstrapping". There's also a third step: running stage1 rustc on rustc's sources again to get a stage2 rustc and checking that it is bit-identical to the stage1 rustc. Successfully doing that we're going to call "fixpoint".
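
Schematically (this is just the shape of the process, not actual commands):

rustboot + rustc sources      ->  stage0 rustc
stage0 rustc + rustc sources  ->  stage1 rustc    ("bootstrapping")
stage1 rustc + rustc sources  ->  stage2 rustc    (bit-identical to stage1: "fixpoint")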

Shortly after we reached the fixpoint we discarded rustboot. We stored stage1 rustc binaries as snapshots on a shared download server and all subsequent rust builds were based on downloading and running that. Any time there was an incompatible language change made, we'd add support and re-snapshot the resulting stage1, gradually growing a long list of snapshots marking the progress of rust over time.

time travel and bit rot


Each snapshot can typically only compile rust code in the rust repository written between its birth and the next snapshot. This makes replaying the entire history awkward (see above). We're not going to do that here. This post is just about replaying the initial bootstrap and fixpoint, which happened back in April 2011, 14 years ago.

Unfortunately all the tools involved -- from the host OS and system libraries to compilers and compiler components -- were and are moving targets. Everything bitrots. Some examples discovered along the way:

  • Modern clang and gcc won't compile the LLVM used back then (C++ has changed too much -- and I tried several CXXFLAGS=-std=c++NN variants!)
  • Modern gcc won't even compile the gcc used back then (apparently C as well!)
  • Modern ocaml won't compile rustboot (ditto)
  • 14-year-old git won't even connect to modern github (ssh and ssl have changed too much)


debian


We're in a certain amount of luck though:

  • Debian has maintained both EOL'ed docker images and still-functioning fetchable package archives at the same URLs as 14 years ago. So we can time-travel using that. A VM image would also do, and if you have old install media you could presumably build one up again if you are patient.
  • It is easier to use i386 since that's all rustboot emitted. There's some indication in the Makefile of support for multilib-based builds from x86-64 (I honestly don't remember if my desktop was 64 bit at the time) but 32-bit is much more straightforward.
  • So: docker pull --platform linux/386 debian/eol:squeeze gets you an environment that works.
  • You'll need to install rust's prerequisites also: g++, make, ocaml, ocaml-native-compilers, python.
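
Inside the container that's something like this (a sketch; these are the squeeze-era package names):

apt-get update
apt-get install -y g++ make ocaml ocaml-native-compilers python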


rust


The next problem is figuring out the code to build. Not totally trivial but not too hard. The best resource for tracking this period of time in rust's history is actually the rust-dev mailing list archive. There's a copy online at mail-archive.com (and Brian keeps a public backup of the mbox file in case that goes away). Here's the announcement that we hit a fixpoint in April 2011. You kinda have to just know that's what to look for. So that's the rust commit to use: 6daf440037cb10baab332fde2b471712a3a42c76. This commit still exists in the rust-lang/rust repo, no problem getting it (besides having to copy it into the container since the container can't contact github, haha).
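
Since the container can't reach github, a sketch of the fetch-and-copy dance (the container name is a placeholder):

git clone https://github.com/rust-lang/rust.git
cd rust && git checkout 6daf440037cb10baab332fde2b471712a3a42c76
docker cp . CONTAINER:/src/rust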

LLVM


Unfortunately we only started pinning LLVM to specific versions, using submodules, after bootstrap, closer to the initial "0.1 release". So we have to guess at the LLVM version to use. To add some difficulty: LLVM at the time was developed on subversion, and we were developing rust against a fork of a git mirror of their SVN. Fishing around in that repo at least finds a version that builds -- 45e1a53efd40a594fa8bb59aee75bb0984770d29, which is "the commit that exposed LLVMAddEarlyCSEPass", a symbol used in the rustc LLVM interface. I bootstrapped with that (brson/llvm) commit but subversion also numbers all commits, and they were preserved in the conversion to the modern LLVM repo, so you can see the same svn id 129087 as e4e4e3758097d7967fa6edf4ff878ba430f84f6e over in the official LLVM git repo, in case brson/llvm goes away in the future.

Configuring LLVM for this build is also a little bit subtle. The best bet is to actually read the rust 0.1 configure script -- when it was managing the LLVM build itself -- and work out what it would have done. But I have done that and can now save you the effort: ./configure --enable-targets=x86 --build=i686-unknown-linux-gnu --host=i686-unknown-linux-gnu --target=i686-unknown-linux-gnu --disable-docs --disable-jit --enable-bindings=none --disable-threads --disable-pthreads --enable-optimized

So: configure and build that, stick the resulting bin dir in your path, and configure and make rust, and you're good to go!
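
Spelled out, that's something like the following (the paths are assumptions, and the LLVM bin directory may be Release/bin or Release+Asserts/bin depending on how configure resolved things):

cd /src/llvm
./configure --enable-targets=x86 --build=i686-unknown-linux-gnu --host=i686-unknown-linux-gnu --target=i686-unknown-linux-gnu --disable-docs --disable-jit --enable-bindings=none --disable-threads --disable-pthreads --enable-optimized
make
export PATH=/src/llvm/Release/bin:$PATH
cd /src/rust
./configure
make

If it all works, the fixpoint shows up in the stage hashes:
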
root@65b73ba6edcc:/src/rust# sha1sum stage*/rustc
639f3ab8351d839ede644b090dae90ec2245dfff  stage0/rustc
81e8f14fcf155e1946f4b7bf88cefc20dba32bb9  stage1/rustc
81e8f14fcf155e1946f4b7bf88cefc20dba32bb9  stage2/rustc


Observations


On my machine I get: 1m50s to build stage0, 3m40s to build stage1, 2m2s to build stage2. Also stage0/rustc is a 4.4mb binary whereas stage1/rustc and stage2/rustc are (identical) 13mb binaries.

While this is somewhat congruent with my recollections -- rustboot produced code faster, but its code ran slower -- the effect size is actually much less than I remember. I'd retroactively convinced myself that rustboot produced abysmally worse code than rustc-with-LLVM. But out of the gate LLVM only boosted performance by 2x (at a cost of 3x the code size)! Of course I also have a faster machine now. At the time bootstrap cycles took about half an hour each (according to this: 15 minutes for the 2nd stage).

Either way, you can of course still see this as a condemnation of the entire "super slow dynamic polymorphism" model of rust-at-the-time. It may seem funny that this version of rustc bootstraps faster than today's rustc, but this "can barely bootstrap" version was a mere 25kloc. Today's rustc is 600kloc. It's really comparing apples to oranges.
[personal profile] fanf

After I found some issues with my benchmark which invalidated my previous results, I have substantially revised my previous blog entry. There are two main differences:

  • A proper baseline revealed that my amd64 numbers were nonsense because I wasn’t fencing enough, and after tearing my hair out and eventually fixing that I found that the bithack conversion is one or two cycles faster.

  • A newer compiler can radically improve the multiply conversion on arm64 so it’s the same speed as the bithack conversion; I've added some source and assembly snippets to the blog post to highlight how nice arm64 is compared to amd64 for this task.

[personal profile] fanf

https://dotat.at/@/2025-06-08-floats.html

A couple of years ago I wrote about random floating point numbers. In that article I was mainly concerned about how neat the code is, and I didn't pay attention to its performance.

Recently, a comment from Oliver Hunt and a blog post from Alisa Sireneva prompted me to wonder if I made an unwarranted assumption. So I wrote a little benchmark, which you can find in pcg-dxsm.git.

(Note 2025-06-09: I've edited this post substantially after discovering some problems with the results.)

recap

Briefly, there are two basic ways to convert a random integer to a floating point number between 0.0 and 1.0:

  • Use bit fiddling to construct an integer whose format matches a float between 1.0 and 2.0; this is the same span as the result but with a simpler exponent. Bitcast the integer to a float and subtract 1.0 to get the result.

  • Shift the integer down to the same range as the mantissa, convert to float, then multiply by a scaling factor that reduces it to the desired range. This produces one more bit of randomness than the bithacking conversion.

(There are other less basic ways.)

code

The double precision code for the two kinds of conversion is below. (Single precision is very similar so I'll leave it out.)

It's mostly as I expect, but there are a couple of ARM instructions that surprised me.

bithack

The bithack function looks like:

double bithack52(uint64_t u) {
    u = ((uint64_t)(1023) << 52) | (u >> 12);
    return(bitcast(double, u) - 1.0);
}
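
Here bitcast isn't a standard C function; a plausible definition (an assumption - the usual union trick, not necessarily the exact helper from the original article) is:

/* reinterpret the bits of one type as another;
   typeof is a GCC/Clang extension */
#define bitcast(type, value) \
    ((union { typeof(value) from; type to; }){ .from = (value) }.to)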

bithack52 translates fairly directly to amd64 like this:

bithack52:
    shr     rdi, 12
    movabs  rax, 0x3ff0000000000000
    or      rax, rdi
    movq    xmm0, rax
    addsd   xmm0, qword ptr [rip + .number]
    ret
.number:
    .quad   0xbff0000000000000

On arm64 the shift-and-or becomes one bfxil instruction (which is a kind of bitfield move), and the constant -1.0 is encoded more briefly. Very neat!

bithack52:
    mov     x8, #0x3ff0000000000000
    fmov    d0, #-1.00000000
    bfxil   x8, x0, #12, #52
    fmov    d1, x8
    fadd    d0, d1, d0
    ret

multiply

The shift-convert-multiply function looks like this:

double multiply53(uint64_t u) {
    return ((double)(u >> 11) * 0x1.0p-53);
}

It translates directly to amd64 like this:

multiply53:
    shr       rdi, 11
    cvtsi2sd  xmm0, rdi
    mulsd     xmm0, qword ptr [rip + .number]
    ret
.number:
    .quad     0x3ca0000000000000

GCC and earlier versions of Clang produce the following arm64 code, which is similar though it requires more faff to get the constant into the right register.

multiply53:
    lsr     x8, x0, #11
    mov     x9, #0x3ca0000000000000
    ucvtf   d0, x8
    fmov    d1, x9
    fmul    d0, d0, d1
    ret

Recent versions of Clang produce this astonishingly brief two-instruction translation: apparently you can convert fixed-point to floating point in one instruction, which gives us the power-of-two scale factor for free!

multiply53:
    lsr     x8, x0, #11
    ucvtf   d0, x8, #53
    ret

benchmark

My benchmark has 2 x 2 x 2 tests:

  • bithacking vs multiplying

  • 32 bit vs 64 bit

  • sequential integers vs random integers

I ran the benchmark on my Apple M1 Pro and my AMD Ryzen 7950X.

These functions are very small and work entirely in registers so it has been tricky to measure them properly.

To prevent the compiler from inlining and optimizing the benchmark loop to nothing, the functions are compiled in a separate translation unit from the test harness. This is not enough to get plausible measurements because the CPU overlaps successive iterations of the loop, so we also use fence instructions.

On arm64, a single ISB (instruction synchronization barrier) in the loop is enough to get reasonable measurements.

I have not found an equivalent of ISB on amd64, so I'm using MFENCE. It isn't effective unless I pass the argument and return values via pointers (because it's a memory fence) and place MFENCE instructions just before reading the argument and just after writing the result.
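
As a sketch of what that means (my reconstruction of the idea, not the benchmark's exact code - bithack52 stands in for whichever conversion is being measured):

#include <stdint.h>

double bithack52(uint64_t u); /* compiled in a separate translation unit */

#if defined(__aarch64__)
/* arm64: one ISB per call stops the CPU overlapping iterations */
double measure(uint64_t u) {
    __asm__ volatile("isb");
    return bithack52(u);
}
#elif defined(__x86_64__)
/* amd64: MFENCE only orders memory, so the argument and result must
   pass through memory for the fences to serialise anything useful */
void measure(const uint64_t *in, double *out) {
    __asm__ volatile("mfence" ::: "memory");
    double r = bithack52(*in);
    *out = r;
    __asm__ volatile("mfence" ::: "memory");
}
#endif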

results

In the table below, the leftmost column is the number of random bits; "old" is arm64 with older clang, "arm" is newer clang, "amd" is gcc.

The first line is a baseline do-nothing function, showing the overheads of the benchmark loop, function call, load argument, store return, and fences.

The upper half measures sequential numbers, the lower half random numbers. The times are nanoseconds per operation.

  bits    old    arm    amd

    00  21.44  21.41  21.42

    23  24.28  24.31  22.19
    24  25.24  24.31  22.94
    52  24.31  24.28  21.98
    53  25.32  24.35  22.25

    23  25.59  25.56  22.86
    24  26.55  25.55  23.03
    52  27.83  27.81  23.93
    53  28.57  27.84  25.01

The times vary a little from run to run but the difference in speed of the various loops is reasonably consistent.

The numbers on arm64 are reasonably plausible. The most notable thing is that the "old" multiply conversion is about 3 or 4 clock cycles slower, but with a newer compiler that can eliminate the multiply, it's the same speed as the bithacking conversion.

On amd64 the multiply conversion is about 1 or 2 clock cycles slower than the bithacking conversion.

conclusion

The folklore says that bithacking floats is faster than normal integer to float conversion, and my results generally agree with that, apart from on arm64 with a good compiler. It would be interesting to compare other CPUs to get a better idea of when the folklore is right or wrong -- or if any CPUs perform the other way round!