Can Yrs fit my table?

In this blog post we'll design a collaborative 2D table structure using yrs - a Rust port of the popular yjs library of conflict-free replicated data types for building real-time, local-first apps. We'll also cover some optimizations to improve the scalability of our solution.

Over the years I've often been asked if yrs is a good library for things like spreadsheets, 2D data sets etc. As usual, the answer is: it depends. Inputs vary, and depending on who you ask a big table can mean 10 000 or 1 000 000 rows. Today, however, we'll walk through a concrete scenario to show that we can organize a real-life data set - in our case an Uber fares CSV, a 23MB table of 200 000 x 9 cells - in under a second.

You can find the entire demo in this GitHub repository. To understand it better, we'll go over some design decisions and tips to optimize this particular case.

Designing Yrs for tables

Now let's start with the basic features that we want our table to support:

  • Keep a structure of rows and columns.
  • The ability to import data of various types from CSV: in our demo we only differentiate between integers, floats and strings, and each column can contain data of mixed types.
  • We want to be able to associate extra attributes with each column and row, e.g. header names or width/height. We also acknowledge that these may be concurrently changed by different users.
  • We want to be able to add, remove and reorder rows and columns.

With these in mind, let's start with a word of warning: the most straightforward approach would be to simply map an array of rows onto an ArrayRef of MapRefs. However, it comes with some drawbacks: column names are duplicated for every row, column ordering is not guaranteed, etc.

Another approach could be an ArrayRef of ArrayRefs. However, arrays only allow inserting and removing elements, not updating them, so keeping rows consistent would be a pain if several people concurrently edited the same row. Representing every cell as a separate MapRef/TextRef (to enable concurrent edits) in turn comes with excessive overhead once we think about a scale of 1.8 million cells.

What we're going to do is:

  • Define two ArrayRefs: one for column metadata and one for row metadata - each entry individually represented as a MapRef to enable non-conflicting edits, e.g. of column header names and widths.
  • Define a MapRef for cell data.

We will associate them as follows: every row and column will get its own unique ID, and every cell will be identified by a string key in the format {row_id}:{column_id}. We'll go with hex strings of randomized u32 pairs, which makes collisions rare enough for our use case - we also regenerate an ID if it conflicts with an existing row/column ID.

Pretty much the same design has been studied and proposed in another research paper (see: A Study of Semantics for CRDT-based Collaborative Spreadsheets).
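Below is a minimal sketch of that ID scheme - the helper names are ours, not taken from the demo repository: random u32 identifiers regenerated on collision, plus a composite hex key per cell.

use std::collections::HashSet;

// Hypothetical helpers illustrating the ID scheme described above.
fn new_unique_id(existing: &mut HashSet<u32>) -> u32 {
    loop {
        let id = fastrand::u32(..);
        if existing.insert(id) {
            return id; // first time we've seen this ID - safe to use
        }
    }
}

fn cell_key(row_id: u32, col_id: u32) -> String {
    // e.g. "1a2b3c4d:5e6f7a8b"
    format!("{:x}:{:x}", row_id, col_id)
}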

Approach 1

Now, let's try to import some CSV data. We'll start with a naïve approach. For simplicity we'll assume that every row has the same number of cells and that the first line contains the headers. Since the number of columns isn't usually a problem, we'll just create them right away - we'll also reuse this method later:

struct Table {
    cols: ArrayRef, // column metadata
    rows: ArrayRef, // row metadata
    cells: MapRef,  // contents of the cells
}

impl Table {
    fn import_columns(&self, txn: &mut TransactionMut, reader: &mut Reader<File>) -> Vec<u32> {
        let headers = reader.headers().unwrap();
        let headers: Vec<_> = headers.iter().collect();
        let mut column_ids = Vec::with_capacity(headers.len());
        let mut column_id_index = HashSet::with_capacity(headers.len());
        for header in headers {
            let mut col_id = fastrand::u32(..);
            // (skipped) ensure unique col_id
            column_ids.push(col_id);
            let len = self.cols.len(txn);
            // this preliminary instance will become MapRef after being 
            // integrated into yrs Doc
            let col_meta = MapPrelim::from([
                ("id", Any::BigInt(col_id as i64)),
                ("name", Any::from(header)),
                ("width", Any::from(130)),
            ]);
            self.cols.insert(txn, len, col_meta);
        }
        drop(column_id_index);
        column_ids
    }
}

Above, we store columns in a dedicated ArrayRef of MapRefs - this way we can concurrently add, remove and reorder columns. We can also change their width and name concurrently, with potential conflicting updates resolved automatically thanks to the nature of yrs shared ref types.
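As a quick illustration of that last point, here's a hedged sketch (the helper is ours, and the exact downcasting API may differ between yrs versions) of how a client could resize a column by updating its metadata entry:

use yrs::{Any, Array, ArrayRef, Map, MapRef, TransactionMut};

// Hypothetical helper: update the "width" attribute of the n-th column.
// Concurrent width/name edits land on the same MapRef and merge without conflicts.
fn set_column_width(cols: &ArrayRef, txn: &mut TransactionMut, index: u32, width: i64) {
    if let Some(value) = cols.get(txn, index) {
        if let Ok(col_meta) = value.cast::<MapRef>() {
            col_meta.insert(txn, "width", Any::BigInt(width));
        }
    }
}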

As for rows, we'll use a similar strategy for now: read the CSV rows and create a corresponding row_meta: MapRef with a generated row ID. Finally, we add cells to the dedicated MapRef - we can navigate through them since their keys are made of row and column identifiers.

pub fn import_naive(&self, txn: &mut TransactionMut, reader: &mut Reader<File>) {
    let column_ids = self.import_columns(txn, reader);
    
    let mut row_id_index = HashSet::new();
    for result in reader.records() {
        let record = result.unwrap();
        let mut row_id = fastrand::u32(..);
        // (skipped) ensure unique row_id
        let row_meta = MapPrelim::from([
            ("id", Any::from(row_id)), 
            ("height", Any::from(30))
        ]);
        // append row metadata at the end of the rows array
        let len = self.rows.len(txn);
        self.rows.insert(txn, len, row_meta);
        for (col_idx, cell) in record.into_iter().enumerate() {
            // cell ID is a composite hex key of row and column IDs
            let cell_id = format!("{:x}:{:x}", row_id, column_ids[col_idx]);
            self.cells.insert(txn, cell_id, Self::parse(cell));
        }
    }
}
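As a side note, reading a single cell back is just a map lookup over the composite key. A hedged sketch (the helper below is ours, not part of the demo):

use yrs::{Any, Map, MapRef, ReadTxn};

// Hypothetical helper: fetch a cell value by its row and column IDs.
fn get_cell<T: ReadTxn>(cells: &MapRef, txn: &T, row_id: u32, col_id: u32) -> Option<Any> {
    let cell_id = format!("{:x}:{:x}", row_id, col_id);
    // cells hold plain values in this demo, so converting to Any is enough
    cells.get(txn, &cell_id).map(|value| value.to_json(txn))
}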

Now, if we run this import over our 23MB data set, we'll notice that it is quite slow. While the first 30 000 rows load fairly fast, the import slows down progressively as more rows are added. Why?

Why is it so slow?

In order to understand what's happening, we must first explain the internal structure of yrs' ArrayRef. All sequence types in yrs use the same underlying structure, which is essentially a doubly linked list of so-called blocks.

Each block contains a buffer of sequentially inserted elements - meaning that as long as we don't change the cursor position, all we need to do is append the element at the end of the block's contiguous buffer. Moving the cursor into the middle of an existing block causes that block to be split in two, with any new elements inserted in between.

These operations are very frequent in text editing scenarios - for which yjs/yrs are optimized - and in practice offer a very good balance of performance and memory footprint, even when a few thousand blocks are linked together.

On the other hand, programmers are usually discouraged from operations like prepending (instead of appending), since every such action forces a new block to be created - the reason being that a block's internal buffer can only expand on one end. However, there's a catch here: all of these optimizations apply only to elements that can be compressed together into a single block, e.g. text chunks or JSON-like values. Some content types require sole ownership over the blocks they're in... and that includes shared refs like MapRef.

The above means that if you have an ArrayRef of 200 000 rows, where each row is represented as a MapRef, internally it will always be a linked list of 200 000 separate blocks! Now, if we want to append row metadata 200 000 times, each iteration needs to walk to the end of that linked list first, turning this part of our import into an O(N^2) operation with N big enough for people to start noticing.
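To make the difference concrete, here's an illustrative, self-contained sketch (assumed setup, not taken from the demo) contrasting the two insertion patterns:

use yrs::{Any, Array, Doc, MapPrelim, Transact};

fn main() {
    let doc = Doc::new();
    let rows = doc.get_or_insert_array("rows");
    let mut txn = doc.transact_mut();
    for i in 0..10_000u32 {
        let meta = MapPrelim::from([("id", Any::BigInt(i as i64))]);
        // appending: the index lookup has to walk all existing blocks,
        // and every MapRef occupies its own block
        let len = rows.len(&txn);
        rows.insert(&mut txn, len, meta);
        // prepending instead would always start at the head of the list:
        // rows.insert(&mut txn, 0, meta);
    }
}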

Approach 2

Since we've identified the source of our performance problems, let's optimize it. The key realization is: as long as we don't bind cell data to a specific row, it doesn't really matter in what order we add our row metadata. Therefore, we'll split our process into two steps.

First, we'll generate all of our row metadata.

fn import_row_meta(&self, txn: &mut TransactionMut, row_count: usize) -> Vec<u32> {
    let mut row_ids = Vec::with_capacity(row_count);
    let mut row_id_index = HashSet::with_capacity(row_count);
    for _ in 0..row_count {
        let mut row_id = fastrand::u32(..);
        // (skipped) ensure unique row_id
        row_ids.push(row_id);
        let row_meta = MapPrelim::from([
            ("id", Any::from(row_id)), 
            ("height", Any::from(30))
        ]);
        self.rows.insert(txn, 0, row_meta);
    }
    row_ids
}

What's important here:

  1. self.rows.insert(txn, 0, row_meta) means that we're prepending our row metadata, which for an individual row is an O(1) operation.
  2. We'll need to know the number of rows ahead of time. This means that we need to load all of them into memory first.

As for the second point, since we're talking about 23MB of data processed on the client, it's not a big deal - yrs is an in-memory database, so it's fair to expect enough RAM to fit the entire content.

pub fn import(&self, txn: &mut TransactionMut, reader: &mut Reader<File>)  {
    let column_ids = self.import_columns(txn, reader);
    
    // prefetch all CSV data into memory - we need to know the row count
    let rows: Vec<_> = reader
        .records()
        .map(|record| {
            let record = record.unwrap();
            let record: Vec<_> = record.iter().map(Self::parse).collect();
            record
        })
        .collect();
    
    // create row metadata first
    let row_ids = self.import_row_meta(txn, rows.len());
    
    // import row data
    for (row_idx, record) in rows.into_iter().enumerate() {
        for (col_idx, cell) in record.into_iter().enumerate() {
            let cell_id = format!("{:x}:{:x}", row_ids[row_idx], column_ids[col_idx]);
            self.cells.insert(txn, cell_id, cell);
        }
    }
}

Once row metadata is ready, we can iterate again and create the cells with the necessary data. As presented in the README, this process can now be finished in a reasonable time:

imported 1800000 cells in 814.888708ms
encoded 1800000 cells (200000 rows x 9 columns) in 100.70175ms: 58213277 bytes (original file size: 23458139 bytes)

800ms for the import and 100ms for encoding the yrs document state using v2 encoding sounds acceptable. What may make us less happy is the fact that the serialized yrs Doc state is over 55MiB - more than twice the size of the original file. Unfortunately, while yrs is very good at serializing its own block metadata, it doesn't apply any compression to programmer-defined data, including cell keys.
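For reference, the full document state can be produced with the v2 encoding roughly like this (a hedged sketch; the demo may wire it up differently):

use yrs::{Doc, ReadTxn, StateVector, Transact};

// Encode the whole Doc as a v2 update (a diff against an empty state vector).
fn encode_full_state_v2(doc: &Doc) -> Vec<u8> {
    let txn = doc.transact();
    txn.encode_state_as_update_v2(&StateVector::default())
}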

More experiments

We can further reduce the payload size by adding another layer of compression on top of the encoded document. In our experiment we added zstd compression (level 6, for reference), which reduced the size from 55MiB to 11MiB at the cost of increasing the total encoding time from 100ms to roughly 400ms. Since this is data that we're going to dispatch over the network and write to disk, it can be considered a pretty good trade-off.
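A hedged sketch of that extra compression step, using the zstd crate (the exact integration in the demo may differ):

// Compress/decompress the encoded yrs update with zstd at level 6.
fn compress_update(update_v2: &[u8]) -> std::io::Result<Vec<u8>> {
    zstd::encode_all(update_v2, 6)
}

fn decompress_update(payload: &[u8]) -> std::io::Result<Vec<u8>> {
    zstd::decode_all(payload)
}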

Depending on the nature of the content, we can achieve better or worse outcomes - it's not uncommon to see 16x compression on real-life data sets. However, if you don't want to be surprised, you should always try it against your own data and never put blind trust in somebody else's benchmarks.