Peeking into Rust

Peeking into Rust

Featured on Hashnode

Recently, I was working on a Rust exam that had the following question:

/// Consider this Rust code:

fn sort(items: &mut Vec<u32>) {
    // Implementation goes here
}

///
/// The signature above makes sense for a sorting algorithm that:
///
/// - a) Returns a copy of the data in a new Vec in sorted order.
/// - b) Returns a copy of the data in a new slice in sorted order.
/// - c) Returns a new subslice of the same data in sorted order
/// - d) Sorts the data "in place" by moving elements around.
/// - e) Sorts the data and removes duplicate items.

If you are an experienced Rust programmer, this isn't a difficult question for you to answer. However, if you are new to Rust, it may not be as clear.

After I finished this exam, it occurred to me that this question might be an interesting question to explore in code. Let's figure out how to get Rust itself to help us understand what is happening with the memory usage and internal data representation, here.

To do this, let's add an implementation to that method signature so that some code can be executed:

// Implement a bubble sort.
fn sort(array: &mut Vec<u32>) {
    for i in 0..array.len() {
        for j in 0..array.len() - i - 1 {
            if array[j + 1] < array[j] {
                array.swap(j, j + 1);
            }
        }
    }
}

fn main() {
    let mut array = vec![3_u32, 5, 7, 9, 1, 0, 4, 8, 2, 6];
    sort(&mut array);
    println!("{:?}", array);
}

If you compile and run that program, it will print the sorted array. Feel free to do that, just so that you know that the code is working.

To be able to verify what is happening when that code runs, it would be nice to have a way to see what Rust is doing with the memory. If we could see the memory fingerprint of the array before it is sorted, and again after it is sorted, we would have the information that we need to prove which answer is the correct answer.

Fortuitously, Rust has a feature that can help with this. The into_raw_parts() method can be used to deconstruct a Vec into three values: a pointer to the start of the Vec data, a usize containing the current length of the Vec, and a usize containing the current maximum capacity of the Vec.

If you want to try this yourself, please note that this feature is currently only available in the Rust nightly builds. Enabling this using rustup is simple, however. Once that is done, you can tell Rust to allow the use of this feature via the following line:

#![feature(vec_into_raw_parts)]

Before going further, it is important to understand how Rust represents a Vec in memory. Rust uses two different regions of memory, managed separately, for storing program data.

The first is called the stack. In Rust, the stack is a region of memory used to store data with a known, fixed size, like scalar types, tuples, and enumerations. It is called the "stack" because it operates on a last-in, first-out (LIFO) basis. If three items, "A", "B", and "C" are allocated onto the stack, then when deallocation happens, it will happen in the reverse order: "C", "B", and then "A". Thus, when a function is called, its local variables and parameters are "pushed" onto the top of the stack, and when the function returns, they are "popped" off the top of the stack.

In Rust, the stack is used to store a variety of types of data, including:

  • Local variables: These are variables that are defined within a function or block of Scalar types: These are types that represent a single value, such as integers (i32, u64, etc.), floating-point numbers (f32, f64), and boolean values (bool).

  • Tuples: A tuple is a fixed-size group of values, each with a distinct type.

  • Enumerations: An enumeration, or enum, is a type that can have one of a fixed number of values, each of which can be associated with zero or more data.

  • References: A reference is a type of pointer that refers to a value stored elsewhere in memory.

  • Closures: A closure is a function that can capture values from its environment.

The second is called the heap. In Rust, the heap is a region of memory that is dynamically allocated at runtime and is used to store data that doesn't have a fixed size that can be known at compile time.

Allocating memory on the heap is a more expensive operation than allocating memory on the stack because it requires the program to ask the operating system for additional memory, which may involve more complex memory management operations. However, the heap has no fixed size, so it can grow and shrink as needed to accommodate the program's needs.

In Rust, the heap is used to store a variety of types of data, including:

  1. Boxes (Box<T>): A box is a smart pointer that represents ownership of a value on the heap. When a box goes out of scope, the value it points to is deallocated.

  2. Vectors (Vec<T>): A vector is a dynamically-sized array that can grow or shrink as needed. Vectors are stored on the heap because their size is not known at compile time and may change during the course of the program.

  3. Strings (String): A string is a dynamically-sized string type that is stored on the heap. It can grow or shrink as needed to accommodate new characters.

  4. Hash maps (HashMap<K, V>): A hash map is a data structure that stores key-value pairs and allows for efficient insertion, deletion, and lookup of values based on their keys. Hash maps are stored on the heap because their size is not known at compile time and may change during the course of the program.

  5. Other collections: Many other collection types in Rust, such as linked lists (LinkedList<T>) and binary trees (BTreeMap<K, V>), are also stored on the heap.

It's worth noting that some data types in Rust can be stored either on the stack or on the heap, depending on how they are used. For example, a struct can be stored on the stack if all of its fields are stack-allocated, or it can be stored on the heap if it contains a pointer to heap-allocated data.

This description doesn't capture the full details regarding a Vec, however. The truth of the matter is that a Vec has data on both the stack and the heap. As mentioned earlier, a Vec is composed of three discrete parts. It has a length, a capacity, and a buffer that holds the data elements of the Vec.

The first two of these, the length and the capacity, are stored on the stack. Along with these, also on the stack, is a pointer to the region of memory on the heap that contains the actual data for the Vec.

So, continuing, let's add some lines to show both the address of the Vec itself, and the address of the data buffer that holds the actual data elements of the Vec.

fn main() {
    let array = vec![3_u32, 5, 7, 9, 1, 0, 4, 8, 2, 6];

    // Print the original vec.
    println!("before: {:#?}", array);

    // Break the vec down into its constituent parts
    // When this happens, "array" can no longer be
    // used, because it has been decomposed into its
    // parts, a pointer, a length, and a capacity.

    println!("pointer to the vec's stack allocated structure: {:p}", &array);

    let (ptr, _len, _cap) = array.into_raw_parts();
    // Show that pointer.
    println!("pointer to raw vec buffer: {:p}", ptr);
}

To make it clear what is happening, the first thing that the code does is to print the unsorted Vec. After doing that, the code uses {:p} to print a human-readable value of the reference to array, &array. This will be the location on the stack of the length, capacity, and raw pointer to the data element buffer for the Vec.

next, it uses into_raw_parts() to break that Vec into those constituent parts, and it assigns them to local variables, and then it prints the location of that heap-allocated data element buffer.

The output of running this should be something similar to this:

❯ ./sort
before: [
    3,
    5,
    7,
    9,
    1,
    0,
    4,
    8,
    2,
    6,
]
pointer to the vec's stack allocated structure: 0x7ffd0d9ccf30
pointer to raw vec buffer: 0x56358bcf7ae0

A Rust program (like all programs) operates within a virtual memory space. Any given address, like 0x7ffd0d9ccf30 above, does not refer to a specific physical memory location. It is a virtual location, and the kernel of the operating system takes care of the very low-level details of mapping that to a physical memory location.

The exact details of the memory allocation for a Rust program will vary depending on the operating system and architecture, but they all tend to follow a similar pattern. Within that virtual memory space for a Rust program, there is a text segment, often at the bottom of the virtual memory space, that holds the actual executable code of the program. There is a data segment, typically above the text segment, that holds fixed, unchanging data items like constants. Above that is typically the heap. The heap's size will grow as needed and will grow upwards in the memory space. The stack is typically allocated at the top of the space, with stack allocations proceeding from high to low.

For this article, everything was executed on an Ubuntu Linux system running on an x86_64 architecture. In that environment, the highest address in the virtual memory space for a Rust program space is 0x7fffffffffff. Thus, you can see that the address of &array, at &7ffd0d9ccf30 is quite high. It is in the stack.

The other address, though, the pointer to the data element buffer for the Vec, is in a very different location, 0x56358bcf7ae0. This much lower memory address is on the heap, illustrating what was mentioned earlier about how Vec uses both stack and heap-allocated data.

To continue with our memory detective work, the next thing that needs to happen is that the Vec should be passed to the sort() function, and then the pointer should be examined again to see if it is pointing to the same place in memory, or to a different place. However, there is a complication that must first be addressed.

When array is deconstructed into the separate pointer, length, and capacity values, array can no longer be accessed. If you try to do so, Rust will complain:

fn main() {
    let array = vec![3_u32, 5, 7, 9, 1, 0, 4, 8, 2, 6];

    // Print the original vec.
    println!("before: {:#?}", array);

    // Break the vec down into its constituent parts
    // When this happens, "array" can no longer be
    // used, because it has been decomposed into its
    // parts, a pointer, a length, and a capacity.

    println!("pointer to the vec's stack allocated structure: {:p}", &array);

    let (ptr, _len, _cap) = array.into_raw_parts();
    // Show that pointer.
    println!("pointer to raw vec buffer: {:p}", ptr);

   // Sort it.
   sort(&mut array);
}

Running this will result in:

error[E0382]: borrow of moved value: `array`
   --> b.rs:32:9
    |
15  |     let array = vec![3_u32, 5, 7, 9, 1, 0, 4, 8, 2, 6];
    |         ----- move occurs because `array` has type `Vec<u32>`, which does not implement the `Copy` trait
...
27  |     let (ptr, _len, _cap) = array.into_raw_parts();
    |                                   ---------------- `array` moved due to this method call
...
32  |    sort(&mut array);
    |         ^^^^^^^^^^ value borrowed here after move

The solution is to rebuild the original Vec from the raw parts before trying to sort it. It is also very important to note here that when the Vec is broken down into its constituent parts, Rust assumes that the caller of into_raw_parts() is now responsible for the memory that contains the Vec. Rust's normal memory management will not manage that memory until it is turned back into a Vec.

fn main() {
    let array = vec![3_u32, 5, 7, 9, 1, 0, 4, 8, 2, 6];

    // Print the original vec.
    println!("before: {:#?}", array);

    // Break the vec down into its constituent parts
    // When this happens, "array" can no longer be
    // used, because it has been decomposed into its
    // parts, a pointer, a length, and a capacity.

    println!("pointer to the vec's stack allocated structure: {:p}", &array);

    let (ptr, len, cap) = array.into_raw_parts();

    // Show that pointer.
    println!("pointer to raw vec buffer: {:p}", ptr);

    // Reconstitute the broken-down vec back into a Vec.
    let mut array = unsafe {
        let ptr = ptr as *mut u32;

        Vec::from_raw_parts(ptr, len, cap)
    };

    // Sort it.
    sort(&mut array);
}

That will run without an error, but there is a caveat that I would be remiss to not mention. This code uses the unsafe keyword. When the Rust compiler encounters unsafe, it assumes that the programmer knows what they are doing with whatever that keyword modifies. It becomes incumbent on the programmer to ensure that the code which is being marked as unsafe is, in fact, safe. In this case, the act of rebuilding a Vec from a section of memory, a length, and a capacity is not something that the compiler can make safety guarantees about; there is no way for the compiler to know if it is being done correctly. So that code must be wrapped in an unsafe block for the compiler to allow it. Generally speaking, Rust programmers should avoid using unsafe unless there is no other option, and even then it should only be used if the programmer is certain that the code is safe.

With that out of the way, all that remains is to print &array again, and to print the pointer to the Vec data element buffer a second time and see if it has changed. To do that, into_raw_parts() must be called a second time after array has been sorted. The entire completed program should look something like this:

#![feature(vec_into_raw_parts)]

// Implement a bubble sort.
fn sort(array: &mut Vec<u32>) {
    for i in 0..array.len() {
        for j in 0..array.len() - i - 1 {
            if array[j + 1] < array[j] {
                array.swap(j, j + 1);
            }
        }
    }
}

fn main() {
    let array = vec![3_u32, 5, 7, 9, 1, 0, 4, 8, 2, 6];

    // Print the original vec.
    println!("before: {:#?}", array);

    // Break the vec down into its constituent parts
    // When this happens, "array" can no longer be
    // used, because it has been decomposed into its
    // parts, a pointer, a length, and a capacity.

    println!("pointer to the vec's stack allocated structure: {:p}", &array);

    let (ptr, len, cap) = array.into_raw_parts();

    // Show that pointer.
    println!("pointer to raw vec buffer: {:p}", ptr);

    // Reconstitute the broken-down vec back into a Vec.
    let mut array = unsafe {
        let ptr = ptr as *mut u32;

        Vec::from_raw_parts(ptr, len, cap)
    };

    // Sort it.
    sort(&mut array);

    // Show the sorted values.
    println!("after: {:#?}", array);

    println!("pointer to the vec's stack allocated structure: {:p}", &array);
    // Extract that pointer, length, and capacity again.
    let (ptr, _len, _cap) = array.into_raw_parts();

    // Show that pointer. Did the sort result in a new pointer?
    println!("pointer to raw vec buffer: {:p}", ptr);
}

If you run the code with that version of main(), you will see something similar to this:

❯ ./sort
before: [
    3,
    5,
    7,
    9,
    1,
    0,
    4,
    8,
    2,
    6,
]
pointer to the vec's stack allocated structure: 0x7ffe7248a880
pointer to raw vec buffer: 0x55cab9b8bae0
after: [
    0,
    1,
    2,
    3,
    4,
    5,
    6,
    7,
    8,
    9,
]
pointer to the vec's stack allocated structure: 0x7ffe7248a9a0
pointer to raw vec buffer: 0x55cab9b8bae0

There are two interesting things here. First, take a look at the two pointers to the vec's stack-allocated structure. They are different. When the Vec is rebuilt, using Vec::from_raw_parts(), it creates a new stack-allocated Vec segment. However, if you look at the two pointers to the raw Vec buffer, you see that they both point to the same location in memory. This location is unchanged, both by the deconstruction/reconstruction of the Vec with the into_raw_parts()/from_raw_parts() calls, and from the call to the sort() function. So, going back to the original question:

/// Consider this Rust code:

fn sort(items: &mut Vec<u32>) {
    // Implementation goes here
}

///
/// The signature above makes sense for a sorting algorithm that:
///
/// - a) Returns a copy of the data in a new Vec in sorted order.
/// - b) Returns a copy of the data in a new slice in sorted order.
/// - c) Returns a new subslice of the same data in sorted order
/// - d) Sorts the data "in place" by moving elements around.
/// - e) Sorts the data and removes duplicate items.

It can't be a, b, or c for a couple of reasons.

The most blatant reason is that the method signature, sort(items: &mut Vec<u32>), does not have a return value, which means that the method doesn't return anything. This eliminates those first three options. However, the above exercise also clearly shows that the underlying memory structure of the Vec isn't changing, which further reinforces the fact that there is nothing new being created by the function.

Also, e doesn't make any sense. That option is referencing implementation details that have nothing to do with the method signature - we have no way of knowing if duplicate items will be removed or not. So while it's conceivable that e could be a correct answer, it is not a clearly correct answer.

The correct answer, and the one that you see visual evidence of when you run this code, is d - it sorts the data "in place" by moving elements around. The fact that the method signature doesn't return anything already suggests that it must be acting on the Vec in place, altering it, and that is reinforced by the evidence above showing that no new underlying, internal memory buffer is ever allocated.

If you are new to Rust, that question might have been a bit tricky to follow at first. The keys are to pay attention both to what is passed into a function as well as what a function returns. If you are still unsure about what is going on, Rust does provide introspection tools, and even when an answer is clear, it can be interesting to dive into the weeds and see what is happening under the surface.