Skip to content

doublethink.co.uk

Getting Started with Nitrokey HSM using Rust

I recently decided to experiment with a Nitrokey HSM 2 using the cryptoki Rust crate for performing PKCS11 operations with it. This documents my experiences.

The Nitrokey HSM is a tamperproof smart card embedded into a USB device, it can be interacted with through PKCS11 and provided tools. HSM come in all shapes and sizes, but this one is a cheap, easily integrated and opensource module for storing secrets outside of the memory of the computer itself.

The cryptoki API is a good Rust wrapper around the PKCS11 primitives, but it had a few gotchas to getting started with a real device. There didn't seem to be any straight forward and easy to follow examples using the crate. I decided to publish this as a simple example and explain some of the non-obvious steps required in this post.

Objective

The repository code describes attaching to the device, which is expected to have a pre-existing RSA key stored. It then extracts the public key in order to encrypt data on the host and then finally tests the decryption via the device.

Preparation

Prior to starting this assumes that the Nitrokey device has been set up following the getting started instructions, an RSA key has been generated, and the opensc libraries have been installed onto the host.

Walk through

In the code firstly we set up the PKCS11 client, passing the path of the shared object for the type of device we're interacting with. On Arch (btw) it was /usr/lib/opensc-pkcs11.so after installing all the dependencies from the Nitrokey installation instructions. The initialize call needs to be called, but it seems to currently only have one possible argument.

let pkcs11client = Pkcs11::new(opt.module)?;
pkcs11client.initialize(CInitializeArgs::OsThreads)?;

Next set up the PIN for the type of user you want to login to the HSM as. For this example we use the regular 'User' PIN because we're later going to be accessing the secrets. This needs to be set before the login attempt. The slot here is usually a parameter to allow multiple devices.

pkcs11client.set_pin(slot, opt.pin.as_str())?;

Next up the flags now need to be set up to start a logged-in session with the device. Importantly the 'serial session' flag must be set on all sessions and without it it'll reject it as a deprecated access.

let mut flags = Flags::new();
flags.set_serial_session(true);

Once we have opened the session, we log in as the 'User' type.

let session = pkcs11client.open_session_no_callback(slot, flags)?;
session.login(cryptoki::types::session::UserType::User)?;

Next we want to find the objects at a particular Key ID location on the device, this is a user-provided ID to identify a stored key. The location is split into separate public and private objects, suitable for encrypting and decrypting respectively. The session is used to query for both the Id and Encrypt/Decrypt attributes.

let enc_objects = session.find_objects(&[
    Attribute::Encrypt(true.into()),
    Attribute::Id(keyid.clone()),
])?;
let dec_objects = session.find_objects(&[
    Attribute::Decrypt(true.into()),
    Attribute::Id(keyid.clone()),
])?;

if enc_objects.len() != 1 && dec_objects.len() != 1 {
    bail!("Can't uniquely determine encryption and decryption objects for key id: {}", opt.id);
}

One we have an ObjectHandle we can query its stored attributes. Attributes are optional, but the interface supports multiple at once, so we define a couple of helper functions to extract the public key parts of an RSA key.

fn extract_modulus(session: &Session, object: ObjectHandle) -> Result<BigUint> {
    let attributes = session.get_attributes(object, &[AttributeType::Modulus])?;

    if let Some(Attribute::Modulus(vec)) = attributes.get(0) {
        Ok(BigUint::from_bytes_be(&vec))
    } else {
        bail!("Modulus Attribute is not available");
    }
}

fn extract_public_exponent(session: &Session, object: ObjectHandle) -> Result<BigUint> {
    // ... Similar implementation ...
}

We then use the helper functions to extract the modulus and public exponents to generate the public key.

let modulus = extract_modulus(&session, enc_objects[0])?;
let pubexp = extract_public_exponent(&session, enc_objects[0])?;

The excellent RustCrypto project provides cryptographic primitives, at the time of writing there's been an audit of the RSA crate but it's results have not been published. However, for this example program it's clearly good enough. We construct the public key and then encrypt a test message with a provided padding scheme.

It's important to note that the device doesn't support encryption on the device itself, so we have to do it on the host. Although cryptoki does expose the encryption action, the Nitrokey HSM device doesn't support it.

let mut rng = OsRng;
let pubkey = rsa::RSAPublicKey::new(modulus, pubexp)?;
let secret = "This is my secret".as_bytes().to_vec();
let output = pubkey.encrypt(&mut rng, PaddingScheme::new_pkcs1v15_encrypt(), &secret)?;

Lastly as the key-material only lives on the device, conversely we must do the decrypt using the device. This is hilariously slow for a 4096-bit RSA operation, ie multiple seconds, as this is just a low-powered smart card type device. As these devices are designed to be readily available, easily installed and used only for securing key encryption keys it's not a problem.

let plaintext = session.decrypt(
    &cryptoki::types::mechanism::Mechanism::RsaPkcs,
    dec_objects[0],
    &output,
)?;

And finally...

This was a fun exercise in trying out PKCS11, and was surprisingly easy (with a few quirks) to get going on a rainy Saturday. I'm excited to see Rust crypto crates working well and getting audited; however, as with any project like this: this is for educational purposes only, do not trust it as safe or secure production code.

Rust Thread-Safety Patterns

I wanted to write a short post to highlight one of main features I'm excited about from Rust and how I think it's differentiated from other languages, namely it's thread-safety. I want to build up to show a neat pattern which demonstrates the power of Rust's type system, but before that we'll cover some basics. I'll assume some interest in Rust but not necessarily experience with anything other than C++.

Coming from a language such as C++ a lot of focus is initially directed towards the memory-safe nature of Rust and the lack of garbage collection via the borrow checking mechanism. This is understandable given that the borrow checker is rather unique; however, I believe it provides less of 'killer feature' for experienced C++ devs than the thread-safety guarantees that Rust can provide.

Rust leverages its type system to prevent types from being used in a thread-unsafe manner. It does this through two marker traits, Send and Sync. For a C++ developer these can be thought of as similar to type-traits in C++ but they should understand that traits are used much more heavily and ergonomically in Rust, as they form the primary way to describe behaviours that a type may have such as Concepts in C++20.

The Send marker trait is used to signify that a value can be safely moved between threads, which is true for most types. The compiler and the type system will enforce this requirement when a value is passed to an API that might cause it to be moved onto a different thread.

Correspondingly the Sync marker trait is used to signify that the value can be used from the context of multiple threads. Rust will prevent any value without this trait from being accessed in a context it can't determine to be necessarily single-threaded.

For both these marker traits, it's useful to know that they also compose together, such that any type constructed from only Sync types will be Sync itself by default. For most users, and especially when writing safe Rust, you'll not implement Send or Sync yourself and instead rely on composition and standard library types that introduce these marker trait properties.

The most obvious of these types introducing Sync to start with is the std::sync::Mutex type, here illustrated with a contrived example:

let locked_data: Mutex<HashMap<Key, Value>> = Mutex::new(HashMap::new());
let mut data = locked_data.lock().unwrap();
data.insert(key, value);

To unpack what is happening here: the Mutex type in Rust is generic over a type, ie Mutex<T>, and even if T is not Sync itself, the Mutex<T> type will be Sync. It has an API and implementation that can provide those guarantees for the wrapped type. It's up to the Mutex author to ensure that using unsafe Rust.

What is nice here from an API design is that the Mutex<T> always wraps the data it protects, so it becomes impossible to access the value outside of how the Mutex's API provides the guarantees it needs to for Sync. Normally the access is gated through the lock() method, which give or take some error-handling, returns a MutexGuard which enforces the underlying locking and unlocking happens through RAII. The value being protected is only accessible during the lifetime of this guard object.

Between the Sync requirement for safe multi-threaded access and design of the above API design enforcing the underlying type is only accessible when certain run-time requirements are met, this ensures we always have thread-safety provided by the type system.

Compare that to a similar example in C++:

std::mutex lock;
std::unordered_map<Key, Value> map;

std::lock_guard<std::mutex> lock_guard(lock);
map.insert({key, value});

Here there is no additional protection of the map value, and nothing ties the lock object to the map. If the locking were accidentally omited the compiler would not notice. Of course, there's always a variety of tools that might help detect it: an experienced code reviewer, some static analysis tools or sanitizers. Conversely in Rust this would always be an immediate compilation error, probably informing you that the map type didn't have the Sync type.

There are of course other thread-safety problems you can still run into in Rust, such as deadlock, but just removing this class of error is a great step forward and means you're free to concentrate on other parts of the program.

There are design implications that will be different in programming Rust versus C++ because of these constraints. A simple example is that if you want one lock guarding multiple data, you'd separate the data into its own type and have it as the T in the the Mutex<T> rather than having the mutex in the same object as the list of data.

A more complicated pattern is one I came across and wanted to get to walk us through, because it shows some more power that Rust has when carefully crafted. This pattern allows iteration over a collection while a lock is being held.

We'll start by defining our collection of values:

pub struct ValueMap {
    values: Arc<Mutex<HashMap<Key, Value>>>,
}

As a sidenote here, Arc<T> can be thought of like a std::shared_ptr, and is used here to ensure that there can be multiple references to the same memory (the mutex) alive from multiple threads, and the reference counting determines the lifetime, not solely the borrow-checker.

Next we'll define a helper type that just wraps a MutexGuard object with a defined lifetime 'a. This lifetime is an annotation that here is basically ensuring that the LockedValueMap will last at least as long as the values referenced inside the MutexGuard. This means that

pub struct LockedValueMap<'a> {
    items: MutexGuard<'a, HashMap<Key, Value>>,
}

Next we'll add a function lock() onto the original ValueMap that locks the ValueMap's mutex and returns the above LockedValueMap type bound to the lifetime of the contents of locked guard object.

impl ValueMap {
    pub fn lock(&self) -> LockedValueMap {
        LockedValueMap {
            items: self.values.lock().unwrap(),
        }
    }
}

On the LockedValueMap we also introduce a method returning an object that implements the Iterator trait from the enclosed (now unlocked) value:

impl<'a> LockedValueMap<'a> {
    pub fn iter(&self) -> impl Iterator<Item = (&Key, &Value)> {
        self.items.iter()
    }
}

These pieces acting together allow the following type of rich iteration:

let x = map
         .lock()                        // Lock the collection
         .iter()                        // Produce an iterator
         .filter(|x| *x.0 > 100)        // Only keys greater than 100
         .fold(0, |acc, x| acc + x.1);  // Fold the values together

This is clear and succinct and allows full use of the powerful Iterator type in Rust, while ensuring that the Mutex is held for the duration of the iteration. If the programmer attempted to do anything such as storing one of a iterated references into another datastructure which has a lifetime longer than the 'a, the borrow-checker would prevent it, ensuring no dangling or un-mutex'd references are allowed.

For me, as a recovering C++ programmer, these thread-safety features make programming concurrent Rust programs much easier and safer than how we'd traditionally do it without resorting to a completely different model such as golang/channels.

More things I've been up to

Public information on one of my team's projects in recent years, bringing RabbitMQ into wide use within Bloomberg. The proxy for AMQP mentioned in this is one of my more recent projects where I've got to code.

Modern Hashtables

As it has been a decade since last learning any compsci I've recently been looking over some more advanced and modern hashtable variations that weren't taught much back and came across a blog series by Emmanuel Goossaert that covered these advancements in open addressed hashtables that's worth highlighting:

Robin Hood Linear Probing

A slight variation on linear probing where when dealing with collisions the item to be inserted is updated with an existing item and the existing item's slot is used if the original item to insert would have a further distance from its intended bucket. Thus stealing from the rich (nearer their intended bucket) and giving to the poor (the more displaced). Main article and a follow up specifically about deletions.

Hopscotch Hashing

A variation where the entry to be inserted will be placed within the neighbourhood of the intended bucket, when the neighbourhood is full an item is displaced recursively until all the items have found a place in their respective neighbourhoods. This algorithm is likely to be very friendly to cache performance, similar to linear probing. Main article.

Cuckoo Hashing

A collision resolution that has a list of distinct hashes to check for each key on lookup, and on insertion guarantees value will be inserted into one of them by displacing one of the existing values (which then needs to find a home).

Article and another discussing the ability to do lock-free cuckoo hashes.

None of these are likely to be make it into a general purpose implementation such as std::unordered_map, that will likely used chained collision resolution, but they are interesting to know for specialized applications and where an open addressed hashtable is most suited (disk and shared memory implementations).

What I've been up to

Since 2005 I've been working in the software infrastructure department in Bloomberg. Bloomberg doesn't give a lot away about its internal technology or what we do with it, but occasionally things are made publically available. There is an increasing presence on github, most notably with our version of the STL and some basic libraries following our coding standards. More specifically to my own work there's a video after the fold of a talk by a colleague at jsconf 2011 about the internal development environment we built a few years ago for the developers building the Bloomberg terminal: