Vectors

In computer science, a vector (also known as a dynamic array) is an array that can dynamically adjust its size depending on the application needs. This is crucial when handling variable-sized data in any kind of application.

Vectors and memory

In Systems Programming languages such as C, C++, Rust, etc, you have some freedom to decide where in memory your data should be stored:

On the stack: This is usually the default behavior since allocating memory on the stack is very fast.
On the heap: You often need to opt-in for this because is a more convoluted and slow process; the memory allocator has to find a slot of memory big enough to store your data.

By its nature, then, Vectors need to be stored on the heap, otherwise we would corrupt the stack when trying to resize it.

A common implementation of a Vector in these languages would be a struct of three fields:

Capacity: How many values can be stored in the Vector without having to resize it.
Length: How many values are stored in the Vector, in other words, what chunk of its capacity is being used to actually store user data.
Pointer to data buffer: A pointer to the actual buffer on the heap where the user data is being stored.

Locality

Access to memory have some cost depending on the type of memory we are accessing. To give you and idea, here's a comparison with approximate cost in CPU cycles¹:

Memory	Cost in cycles	Primary storage	Secondary storage
Registers	1	✔
Caches (L1/L2/L3)	~10	✔
Main memory	~100	✔
Flash disk	~1M		✔
Traditional disk	~10M		✔
Network attached storage			✔

Because of that, processors use the heuristic of locality when reading data from main memory. The heuristic consists on that, if one application accesses one memory address, then it will probably need to access nearby address soon (sort of like a psychologist says that the best predictor of future behavior is past behavior).

I highly recommend checking out this computerphile video from few years ago illustrating the radical performance gains of locality.

And this talk from Chandler Carruth is also very illuminating on the role of vectors in speeding up the llvm compiler.

Resizing strategy

The strategy for adjusting the size of a Vector may vary depending on your needs, but one wildly used because of its amortized time complexity is the exponential resizing.

Exponential resizing

This strategy works like this:

Start with a capacity of 1.
During insertion
- Check if the Vector is full.
- If it is, create a new data buffer with twice the capacity.
- Copy all the data from the old buffer to the newly created one.
During removal
- Check if the Vector is at 1/4 of its capacity (The reason is done like described is to avoid Thrashing).
- If it is, create a new data buffer with halve the capacity.
- Copy all the data from the old buffer to the newly created one, which will be half full.

Amortized analysis

For an array of capacity $n$ there must have been $\log_2{n}$ resizes, and we can calculate an upper bound for the total copy operations after such resizes.

Total = 1 + 2 + 4 + ... + \frac{n}{4} + \frac{n}{2} + n

We can flip that and get a better picture of what is happening:

Total = n + \frac{n}{2} + \frac{n}{4} + ... + 4 + 2 + 1 = n + (\frac{n}{2} + \frac{n}{4} + ... + 4 + 2 + 1) \le n + n \le 2n

For a $\log_2{n}$ number of insertions, we need to perform at most $2n$ copy operations. On the other hand, there is no need to perform copy operations for remaining $n - log_2{n}$ insertions. Total copy operation per insertion is $2n / n = 2$ . So on average, each $n$ element moves only two times, which is constant.

Expanding the array size by a factor of 2 ensures that inserting $n$ elements takes $O(n)$ time overall. In other words, each insertion will take $O(1)$ time on average, almost equivalent to the usual array situation.

Vector visualization

Capacity: 1
Lenght: 0

References and bookmarks

Book: Dive Into Systems ↩