You are here

Josef Sipek

Subscribe to Josef Sipek feed
Wasting time effectively blahgd
Updated: 1 week 3 days ago

Rust Pointers for C Programmers

Sun, 01/28/2018 - 17:47

I’ve been eyeing Rust for about a year now. Here and there, I tried to use it to make a silly little program, or to implement some simple function in it to see for myself how ergonomic it really was, and what sort of machine code rustc spit out. But last weekend I found a need for a tool to clean up some preprocessor mess, and so instead of hacking together some combination of shell and Python, I decided to write it in Rust.

From my earlier attempts, I knew that there are a lot of different “pointers” but I found all the descriptions of them lacking or confusing. Specifically, Rust calls itself a systems programming language, yet I found no clear description of how the different pointers map to C—the systems programming language. Eventually, I stumbled across The Periodic Table of Rust Types, which made things a bit clearer, but I still didn’t feel like I truly understood.

During my weekend expedition to Rust land, I think I’ve grokked things enough to write this explanation of how Rust does things. As always, feedback is welcomed.

I’ll describe what happens in terms of C. To keep things simple, I will:

  • assume that you are well-versed in C
  • assume that you can read Rust (any intro will teach you enough)
  • not bother with const for the C snippets
  • not talk about mutability

In the following text, I assume that we have some struct T. The actual contents don’t matter. In other words:

struct T { /* some members */ };

With that out of the way, let’s dive in!

*const T and *mut T

These are raw pointers. In general, you shouldn’t use them since only unsafe code can dereference them, and the whole point of Rust is to write as much safe code as possible.

Raw pointers are just like what you have in C. If you make a pointer, you end up using sizeof(struct T *) bytes for the pointer. In other words:

struct T *ptr;

&T and &mut T

These are borrowed references. They use the same amount of space as raw pointers and behave same exact way in the generated machine code. Consider this trivial example:

#[no_mangle] pub fn raw(p: *mut usize) { unsafe { *p = 5; } } #[no_mangle] pub fn safe(p: &mut usize) { *p = 5; }

A rustc invocation later, we have:

raw() raw: 55 pushq %rbp raw+0x1: 48 89 e5 movq %rsp,%rbp raw+0x4: 48 c7 07 05 00 00 movq $0x5,(%rdi) 00 raw+0xb: 5d popq %rbp raw+0xc: c3 ret safe() safe: 55 pushq %rbp safe+0x1: 48 89 e5 movq %rsp,%rbp safe+0x4: 48 c7 07 05 00 00 movq $0x5,(%rdi) 00 safe+0xb: 5d popq %rbp safe+0xc: c3 ret

Note that the two functions are bit-for-bit identical.

The only differences between borrowed references and raw pointers are:

  1. references will never point at bogus addresses (i.e., they are never NULL or uninitialized),
  2. the compiler doesn’t let you do arbitrary pointer arithmetic on references,
  3. the borrow checker will make you question your life choices for a while.

(#3 gets better over time.)

Box<T>

These are owned “pointers”. If you are a C++ programmer, you are already familiar with them. Never having truly worked with C++, I had to think about this a bit until it clicked, but it is really easy.

No matter what all the documentation and tutorials out there say, Box<T> is not a pointer but rather a structure containing a pointer to heap allocated memory just big enough to hold T. The heap allocation and freeing is handled automatically. (Allocation is done in the Box::new function, while freeing is done via the Drop trait, but that’s not relevant as far as the memory layout is concerned.) In other words, Box<T> is something like:

struct box_of_T { struct T *heap_ptr; };

Then, when you make a new box you end up putting only what amounts to sizeof(struct T *) on the stack and it magically starts pointing to somewhere on the heap. In other words, the Rust code like this:

let x = Box::new(T { ... });

is roughly equivalent to:

struct box_of_t x; x.heap_ptr = malloc(sizeof(struct T)); if (!x.heap_ptr) oom(); *x.heap_ptr = ...;

&[T] and &mut [T]

These are borrowed slices. This is where things get interesting. Even though it looks like they are just references (which, as stated earlier, translates into a simple C-style pointer), they are much more. These types of references use fat pointers—that is, a combination of a pointer and a length.

struct fat_pointer_to_T { struct T *ptr; size_t nelem; };

This is incredibly powerful, since it allows bounds checking at runtime and getting a subset of a slice is essentially free!

&[T; n] and &mut [T; n]

These are borrowed references to arrays. They are different from borrowed slices. Since the length of an array is a compile-time constant (the compiler will yell at you if n is not a constant), all the bounds checking can be performed statically. And therefore there is no need to pass around the length in a fat pointer. So, they are passed around as plain ol’ pointers.

struct T *ptr;

T, [T; n], and [T]

While these aren’t pointers, I thought I’d include them here for completeness’s sake.

T

Just like in C, a struct uses as much space as its type requires (i.e., sum of the sizes of its members plus padding).

[T; n]

Just like in C, an array of structs uses n times the size of the struct.

[T]

The simple answer here is that you cannot make a [T]. That actually makes perfect sense when you consider what that type means. It is saying that we have some variable sized slice of memory that we want to access as elements of type T. Since this is variable sized, the compiler cannot possibly reserve space for it at compile time and so we get a compiler error.

The more complicated answer involves the Sized trait, which I’ve skillfully managed to avoid thus far and so you are on your own.

Summary

That was a lot of text, so I decided to compact it and make the following table. In the table, I assume that our T struct is 100 bytes in size. In other words:

/* Rust */ struct T { stuff: [u8; 100], } /* C */ struct T { uint8_t stuff[100]; };

Now, the table in its full glory:

Rust C Size on
ILP32/LP64
(bytes) Value let x: T; struct T x; 100/100 Raw pointer let x: *const T; let x: *mut T; struct T *x; 4/8 Reference let x: &T; let x: &mut T; struct T *x; 4/8 Box let x: Box<T>; struct box_of_T { struct T *heap_ptr; }; struct box_of_T x; 4/8 Array of 2 let x: [T; 2]; struct T x[2]; 200/200 Reference to
an array of 2 let x: &[T; 2]; struct T *x; 4/8 A slice let x: [T]; struct T x[]; unknown at
compile time A reference
to a slice let x: &[T]; struct fat_ptr_to_T { struct T *ptr; size_t nelem; }; struct fat_ptr_to_T x; 8/16

A word of caution: I assume that the sizes of the various pointers are actually implementation details and shouldn’t be relied on to be that way. (Well, with the exception of raw pointers - without those being fixed FFI would be unnecessarily complicated.)

I didn’t cover str, &str, String, and Vec<T> since I don’t consider them fundamental types, but rather convenience types built on top of slices, structs, references, and boxes.

Anyway, I hope you found this useful. If you have any feedback (good or bad), let me know.

CBOR vs. JSON vs. libnvpair

Mon, 11/20/2017 - 15:34

My blahg uses nvlists for logging extra information about its operation. Historically, it used Sun libnvpair. That is, it used its data structures as well as the XDR encoding to serialize the data to disk.

A few months ago, I decided to replace libnvpair with my own nvlist implementation—one that was more flexible and better integrated with my code. (It is still a bit of a work-in-progress, but it is looking good.) The code conversion went smoothly, and since then all the new information was logged in JSON.

Last night, I decided to convert a bunch of the previously accumulated libnvpair data files into the new JSON-based format. After whipping up a quick conversion program, I ran it on the data. The result surprised me—the JSON version was about 55% of the size of the libnvpair encoded input!

This piqued my interest. I re-ran the conversion but with CBOR (RFC 7049) as the output format. The result was even better with the output being 45% of libnvpair’s encoding.

This made me realize just how inefficient libnvpair is when serialized. At least part of it is because XDR (the way libnvpair serializes data) uses a lot of padding, while both JSON and CBOR use a more compact encoding for many data types (e.g., an unsigned number in CBOR uses 1 byte for the type and 0, 1, 2, 4, or 8 additional bytes based on its magnitude, while libnvpair always encodes a uint64_t as 8 bytes plus 4 bytes for the type).

Since CBOR is 79% of JSON’s size (and significantly less underspecified compared to the minefield that is JSON), I am hoping to convert everything that makes sense to CBOR. (CBOR being a binary format makes it harder for people to hand-edit it. If hand-editing is desirable, then it makes sense to stick with JSON or other text-based formats.)

The Data & Playing with Compression

The blahg-generated dataset that I converted consisted of 230866 files, each containing an nvlist. The following byte counts are a simple concatenation of the files. (A more complicated format like tar would add a significant enough overhead to make the encoding efficiency comparison flawed.)

Format Size % of nvpair nvpair 471 MB 100% JSON 257 MB 54.6% CBOR 203 MB 45.1%

I also took each of the concatenated files and compressed it with gzip, bzip2, and xz. In each case, I used the most aggressive compression by using -9. The percentages in parentheses are comparing the compressed size to the same format’s uncompressed size. The results:

Format Uncomp. gzip bzip2 xz nvpair 471 MB 37.4 MB (7.9%) 21.0 MB (4.5%) 15.8 MB (3.3%) JSON 257 MB 28.7 MB (11.1%) 17.9 MB (7.0%) 14.5 MB (5.6%) CBOR 203 MB 26.8 MB (13.2%) 16.9 MB (8.3%) 13.7 MB (6.7%)

(The compression ratios are likely artificially better than normal since each of the 230k files has the same nvlist keys.)

Since tables like this are hard to digest, I turned the same data into a graph:

CBOR does very well uncompressed. Even after compressing it with a general purpose compression algorithm, it outperforms JSON with the same algorithm by about 5%.

I look forward to using CBOR everywhere I can.

Categories: illumos

2017-11-14

Wed, 11/15/2017 - 03:34

Doug Engelbart Institute — Online exhibits, historic videos, texts, archive photos, and stories about Doug Engelbart, the inventor of the mouse, hypertext, and GUIs…all in the 1960s

Flight recorders data inspection by Airbus

Parsing JSON is a Minefield

Completely Painless Programmer’s Guide to XYZ, RGB, ICC, xyY, and TRCs — Brain-hurting amount of information about color profiles, etc.

darktable — A Lightroom-like open source software

World plugs — Info about every electric plug form factor in the world

Categories: illumos

Google Traffic, iOS edition

Mon, 11/13/2017 - 23:33

Several years ago, I wrote about how Google gets traffic information and how to turn off this location reporting on Android phones. Since then I’ve switched to iPhones. While I normally use the built-in Maps app, I keep Google Maps installed as a fallback—just in case.

I upgraded my phone recently and so I spent some time going through all the apps and making sure they worked and didn’t have more access than necessary. This is when I discovered that the Google Maps app for iOS defaults to collecting location information and sharing it with Google. Given my previous post, this isn’t really surprising.

Turning it off

Anyway, as with the Android post, here are the steps to limit Google’s collection of location information.

First of all, in Settings → Privacy → “Location Services”, I changed Google Maps’s permission to “While Using the App”. If I’m not using the app, then it doesn’t need to know where I am.

Second, in the app itself, go to: Settings → “About, terms & privacy” → “Location data collection”. That’s right, this setting is buried in what appears to be a page with the boring legal notices.

And then turn off the the toggle:

That should do it…at least assuming that Google honors the settings in its own app.

Categories: illumos