Tuning MyRocks for performance

There are two things I particularly like about MyRocks: (1) the LSM advantage, i.e. a smaller space footprint and lower write amplification, and (2) the best of MySQL, such as replication, storage engine centric database infrastructure operations and the MySQL orchestration tools. Facebook built RocksDB as an embeddable and persistent key-value store with a lower amplification factor compared to InnoDB. Let me explain a scenario where InnoDB proves less efficient than RocksDB on SSD:
We know InnoDB is constrained by a fixed compressed page size. Alignment during fragmentation and compression leaves extra unused space because the leaf nodes are not full. Consider an InnoDB table with a compressed page size of 8KB: a 16KB in-memory page that compresses to 5KB still occupies 8KB on storage. In addition, each entry in the primary key index carries 13 bytes of metadata (a 6-byte transaction id plus a 7-byte rollback pointer), and this metadata is not compressed, so the space overhead is significant for small rows. Flash devices are typically limited by write endurance. In a typical B-tree, where index values are stored in leaf nodes sorted by key, an active database often does not fit in memory and keys are updated in a random pattern, leading to high write amplification. In the worst case, updating one row requires a number of page reads, makes several pages dirty, and forces many dirty pages to be written back to storage.
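
For illustration, a minimal sketch of the kind of table the scenario above describes; the table and column names are hypothetical, and only ROW_FORMAT and KEY_BLOCK_SIZE matter here:

    -- Hypothetical InnoDB table with an 8KB compressed page size.
    -- A 16KB in-memory page that compresses to 5KB still occupies
    -- a full 8KB block on storage.
    CREATE TABLE post_master_innodb (
      id      BIGINT NOT NULL PRIMARY KEY,
      payload VARCHAR(2000)
    ) ENGINE=InnoDB
      ROW_FORMAT=COMPRESSED
      KEY_BLOCK_SIZE=8;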

So, what do I really love about MyRocks?

It is the much lower write amplification factor of RocksDB compared to InnoDB that impresses me the most. On pure flash, reducing write volume (write amplification) is important because flash wears out if you write too much data. Reducing write volume also helps improve overall throughput on flash. InnoDB uses an update-in-place architecture: even when updating just one record, the entire page where the row belongs becomes dirty, and the dirty page has to be written back to storage. On typical OLTP systems the modification unit (a row) is much smaller than the I/O unit (a page), which makes write amplification very high. I have published performance benchmarking of InnoDB, RocksDB and TokuDB; you can read about it here

Things to remember before tuning MyRocks:

  • Data loading limitations
    • Limitation – Transaction must fit in memory:
      • mysql > ALTER TABLE post_master ENGINE = RocksDB;
        • Error 2013 (HY000): Lost connection to MySQL server during query.
      • Higher memory consumption, until the process eventually gets killed by the OOM killer
    • When loading data into MyRocks tables, there are two recommended session variables:
      • SET session sql_log_bin=0;
      • SET session rocksdb_bulk_load=1;

There are a few interesting things to remember before bulk loading data into MyRocks and before tuning the system variable rocksdb_bulk_load:

  1. Data being bulk loaded can never overlap with existing data in the table. It is always recommended to bulk load into an empty table. That said, this mode will allow loading some data into the table, doing other operations, and then returning to bulk load additional data as long as there is no overlap between what is being loaded and what already exists.
  2. The data may not be visible until bulk load mode is ended (i.e. rocksdb_bulk_load is set to zero again). RocksDB writes the data into SST (Sorted String Table) files, and until a particular SST file has been added the data is not visible to the rest of the system, so a SELECT on a table currently being bulk loaded will only show older data and will likely not show the most recently added rows. Ending bulk load mode causes the most recent SST file to be added. When bulk loading multiple tables, starting a new table triggers the code to add the most recent SST file to the system; as a result, it is inadvisable to interleave INSERT statements to two or more tables during bulk load mode. A sketch of a typical bulk load session follows below.
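
A typical bulk load session could look like the following sketch; the table name and file path are hypothetical, and the input is assumed to be pre-sorted by primary key with no overlap against existing rows:

    -- Hypothetical session: load a large data set into an empty MyRocks table.
    SET session sql_log_bin = 0;        -- skip binary logging for the load
    SET session rocksdb_bulk_load = 1;  -- write SST files directly instead of going through the memtable

    LOAD DATA INFILE '/data/post_master.csv'
      INTO TABLE post_master
      FIELDS TERMINATED BY ',';

    -- Ending bulk load mode adds the final SST file and makes the rows visible.
    SET session rocksdb_bulk_load = 0;
    SET session sql_log_bin = 1;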

Configuring MyRocks for performance:

Character Sets:

  • MyRocks works more optimally with case sensitive collations (latin1_bin, utf8_bin, binary); see the example below
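
For example, a hypothetical MyRocks table defined with a case sensitive collation:

    -- Hypothetical table; the point is the binary (case sensitive) collation.
    CREATE TABLE users_rocksdb (
      id    BIGINT NOT NULL PRIMARY KEY,
      email VARCHAR(255)
    ) ENGINE=ROCKSDB
      DEFAULT CHARSET=latin1 COLLATE=latin1_bin;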

Transaction

  • The Read Committed isolation level is recommended. MyRocks' transaction isolation implementation is different from InnoDB's, but close to PostgreSQL's; the default transaction isolation in PostgreSQL is Read Committed. A quick sketch of setting it follows below.
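
A minimal sketch of setting the recommended isolation level, either per session or server-wide:

    -- Per session:
    SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

    -- Or server-wide in my.cnf:
    -- transaction-isolation = READ-COMMITTED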

Compression

  • Set kNoCompression (or kLZ4Compression) on the upper levels (L0-L1 or L0-L2)
  • In the bottommost level, using stronger compression algorithm (Zlib or ZSTD) is recommended.
  • If using zlib compression, set kZlibCompression at the bottommost level (bottommost_compression).
  • If using zlib compression, set the compression level accordingly. For example, compression_opts=-14:1:0 uses zlib compression level 1. If your application is not write intensive, compression_opts=-14:6:0 (zlib compression level 6) will give better space savings. A sample my.cnf fragment follows this list.
  • For other levels, set kLZ4Compression.
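
Putting the above together, a sketch of how the column family compression options might look in my.cnf; the option string is illustrative (it assumes the default 7 LSM levels), not a drop-in configuration:

    # Illustrative: no compression on the upper levels, LZ4 in the middle,
    # and Zlib level 6 at the bottommost level (via bottommost_compression).
    rocksdb_default_cf_options=compression_per_level=kNoCompression:kNoCompression:kLZ4Compression:kLZ4Compression:kLZ4Compression:kLZ4Compression:kLZ4Compression;bottommost_compression=kZlibCompression;compression_opts=-14:6:0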

Data blocks, files and compactions

  • Set level_compaction_dynamic_level_bytes=true
  • Set a proper rocksdb_block_size (default 4096). A larger block size reduces space but increases CPU overhead, because MyRocks has to uncompress many more bytes; it is a trade-off between space and CPU usage.
  • Set rocksdb_max_open_files=-1. If set to a value greater than 0, RocksDB still uses the table_cache, which locks a mutex every time a file is accessed. You will see much greater benefit with -1, because then there is no need to go through LRUCache to get the table you need.
  • Set a reasonable rocksdb_max_background_jobs
  • Set target_file_size_base to a reasonably large value (32MB is generally sufficient). The default is 4MB, which is usually too small and creates too many SST files; too many SST files make operations more difficult.
  • Set a rate limiter. Without one, compaction very often writes 300~500MB/s on pure flash, which may cause short stalls. In testing with 4 MyRocks instances, a 40MB/s rate limiter per instance gave fairly stable results (less than 200MB/s peak from iostat). A sample my.cnf fragment follows this list.
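
A sketch of how these settings might look in my.cnf; the values are illustrative, should be sized to your hardware, and the column family options would be merged into the single rocksdb_default_cf_options string shown earlier:

    rocksdb_block_size=16384                        # larger blocks: less space, more CPU to decompress
    rocksdb_max_open_files=-1                       # bypass the table_cache mutex
    rocksdb_max_background_jobs=8                   # size to your CPU and storage
    rocksdb_rate_limiter_bytes_per_sec=41943040     # ~40MB/s per instance to smooth compaction writes
    rocksdb_default_cf_options=level_compaction_dynamic_level_bytes=true;target_file_size_base=33554432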

Bloom Filter

  • Configure the bloom filter and prefix extractor. A full filter is recommended (the block based filter does not work for Get() + prefix bloom). The prefix extractor can be configured per column family and uses the first prefix_extractor bytes as the key prefix. If using one BIGINT column as a primary key, the recommended prefix length is 12 bytes (the first 4 bytes are the internal index id plus the 8-byte BIGINT).
  • Configure the memtable bloom filter. The memtable bloom filter is useful to reduce CPU usage if you see high CPU usage at rocksdb::MemTable::KeyComparator. Its size depends on the memtable size; for example, set memtable_prefix_bloom_bits=41943040 for a 128MB memtable (roughly 4M keys * 10 bits per key). A sample my.cnf fragment follows this list.
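
A sketch of the bloom filter related options in my.cnf, assuming a BIGINT primary key and a 128MB memtable; in practice these would be merged into the single rocksdb_default_cf_options string, and newer RocksDB versions replace memtable_prefix_bloom_bits with memtable_prefix_bloom_size_ratio:

    # Illustrative: 12-byte prefix extractor, full bloom filter with 10 bits per key,
    # whole-key filtering disabled so the prefix bloom is used for lookups.
    rocksdb_default_cf_options=prefix_extractor=capped:12;memtable_prefix_bloom_bits=41943040;block_based_table_factory={filter_policy=bloomfilter:10:false;whole_key_filtering=0}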

Cache

  • Do not set block_cache at rocksdb_default_cf_options (block_based_table_factory). If you do provide a block cache size on a default column family, the same cache is NOT reused for all such column families.
  • Consider setting shared write buffer size (db_write_buffer_size)
  • Consider using compaction_pri=kMinOverlappingRatio to write less during compaction; a sketch of these cache and compaction settings follows below.
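
A sketch of the cache related settings in my.cnf; the sizes are illustrative and depend on available memory, and the availability of rocksdb_db_write_buffer_size depends on your MyRocks build:

    rocksdb_block_cache_size=16G                    # one shared block cache for all column families
    rocksdb_db_write_buffer_size=4G                 # shared memtable (write buffer) budget
    rocksdb_default_cf_options=compaction_pri=kMinOverlappingRatio   # merge into the single cf options string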

 

Reference Source: https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning  
