Author Archives: Colin

MySQL storage of UTF-8 characters

While reviewing a recent pull request, I realized I didn’t fully understand how MySQL stores UTF-8 characters.  It is especially confusing due to inconsistencies with the family of text column types (TINYTEXT, TEXT…) that we use for aggregate storage (such as JSON blobs) versus typical string (VARCHAR) fields for most of our Rails models.

After some extensive research, below is an overview of how UTF-8 is stored in MySQL across the various data types.

UTF-8 Overview

UTF-8 is a multi-byte encoding of the Unicode code points. The Wikipedia article on UTF-8 is excellent.  Originally, when Unicode was a 16-bit standard (code points U+0000 to U+FFFF), UTF-8 was variable from 1 to 3 bytes.

Later, Unicode was extended beyond 16 bits (code point U+10000 and beyond) to make room for some ancient languages.  UTF-8 was extended to 4 bytes max.  This is when MySQL added support for UTF-8, but to optimize storage they only supported the 3-byte form.  Later, more emoji were added to Unicode beyond the ancient languages, and MySQL added a new character encoding utf8mb4 to support these.
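To make the 3-byte versus 4-byte distinction concrete, here is a small Ruby sketch. The helper name `needs_utf8mb4?` is made up for illustration; it only checks whether any code point lies beyond U+FFFF, which is the range the 3-byte form cannot represent.

```ruby
# Hypothetical helper: true if the string contains any code point above
# U+FFFF, which takes 4 bytes in UTF-8 and so does not fit the 3-byte form.
def needs_utf8mb4?(str)
  str.each_char.any? { |char| char.ord > 0xFFFF }
end

snowman = "\u2603"     # ☃, inside the original 16-bit range
cat     = "\u{1F639}"  # 😹, beyond U+FFFF

puts "a".bytesize             # 1
puts snowman.bytesize         # 3
puts cat.bytesize             # 4
puts needs_utf8mb4?(snowman)  # false
puts needs_utf8mb4?(cat)      # true
```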

In MySQL 5.x, sticking with the 3-byte form improved performance, but with the downside of limited emoji support.  MySQL 8.0 apparently has major speed improvements for utf8mb4 and actually deprecates utf8(mb3).

MySQL Types


CHAR, VARCHAR

For these MySQL types, the count given in parentheses is interpreted as characters, not bytes.  To match the SQL spec, MySQL doesn’t allow extra characters even if there are enough bytes.  For example, with utf8(mb3) encoding, MySQL will allow up to 30 bytes to hold a VARCHAR(10) column. Even though an 11-character ASCII string needs only 11 bytes and would therefore fit, MySQL will reject it as too long.


TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT

For these MySQL types, the documented size limits (2^8, 2^16, 2^24, 2^32) are given in bytes, not characters.  MySQL will store any text that fits.  So this is the opposite of CHAR and VARCHAR!  For example, for a TINYTEXT field with a maximum size of 255 bytes, a 255-character ASCII string can be stored.  But consider a string of 3-byte UTF-8 characters like ☃.  Only 85 of them will fit.  Newer 4-byte emoji like 😹 will work even with a database default of utf8mb3, but only 63 of them will fit.
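The two rules can be sketched in Ruby as follows. The helper names and the `TINYTEXT_MAX_BYTES` constant are hypothetical, and the sketch models only the length rules described here, not MySQL's own character set validation.

```ruby
# CHAR/VARCHAR(n): the limit counts characters.
def fits_varchar?(str, max_chars)
  str.length <= max_chars
end

# TINYTEXT: the limit counts bytes.
TINYTEXT_MAX_BYTES = 255

def fits_tinytext?(str)
  str.bytesize <= TINYTEXT_MAX_BYTES
end

snowman = "\u2603"     # ☃, 3 bytes in UTF-8
cat     = "\u{1F639}"  # 😹, 4 bytes in UTF-8

puts fits_varchar?("a" * 11, 10)   # false: 11 characters, even though only 11 bytes
puts fits_tinytext?("a" * 255)     # true
puts fits_tinytext?(snowman * 85)  # true:  255 bytes
puts fits_tinytext?(snowman * 86)  # false: 258 bytes
puts fits_tinytext?(cat * 63)      # true:  252 bytes
puts fits_tinytext?(cat * 64)      # false: 256 bytes
```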

Hobo Fields Migrations for Text

We use Hobo Fields to declare fields in our models and manage our Rails migrations.  The Hobo Fields schema generator allows any arbitrary limit to be set for a text field, but MySQL only supports the four sizes given above.  Hobo Fields interprets the limit as worst-case characters, so it applies a 3X conversion between characters and bytes.  So the only valid limits are (2^8 – 1)/3, (2^16 – 1)/3, (2^24 – 1)/3, (2^32 – 1)/3.  We encapsulated those as constants to use in our models:


MYSQL_BYTES_PER_UTF8_CHARACTER = 3 # worst case for utf8(mb3)
MYSQL_TINY_TEXT_UTF8_LIMIT   = 0x0000_00FF / MYSQL_BYTES_PER_UTF8_CHARACTER #            85 characters
MYSQL_TEXT_UTF8_LIMIT        = 0x0000_FFFF / MYSQL_BYTES_PER_UTF8_CHARACTER #        21,845 characters
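For reference, the arithmetic behind all four limits can be worked out with a short sketch. `BYTES_PER_CHAR`, `TEXT_TYPE_BITS`, and `utf8_character_limits` are names invented here, and the 3-byte worst case is an assumption that matches utf8(mb3) only.

```ruby
BYTES_PER_CHAR = 3 # assumed worst case for utf8(mb3)

# Column type => number of bits in its byte-length limit.
TEXT_TYPE_BITS = { "TINYTEXT" => 8, "TEXT" => 16, "MEDIUMTEXT" => 24, "LONGTEXT" => 32 }

# Worst-case character limit per type: (2^bits - 1) / 3, using integer division.
def utf8_character_limits
  TEXT_TYPE_BITS.map { |type, bits| [type, (2**bits - 1) / BYTES_PER_CHAR] }.to_h
end

utf8_character_limits.each { |type, limit| puts "#{type}: #{limit}" }
# TINYTEXT: 85
# TEXT: 21845
# MEDIUMTEXT: 5592405
# LONGTEXT: 1431655765
```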

Footnote: MySQL vs Ruby Methods

Note that Ruby and MySQL took opposite approaches to mapping the concept of “length” to characters vs bytes.  See length vs. LENGTH in the table below:

            Ruby              MySQL
character   .size, .length    CHAR_LENGTH()
byte        .bytesize         LENGTH()

ruby_dig Gem Adds Hash#dig and Array#dig from Ruby 2.3 to Earlier Versions

Introducing `ruby_dig`

It may take us some time to upgrade to Ruby 2.3.  But we’d like to be able to start using `dig` right away.  The `ruby_dig` gem solves this by adding the `dig` method to `Array` and `Hash` just like Ruby 2.3+ has natively.

The gem can be found on RubyGems and on GitHub.

Why Do We Need `dig`?

With the ever-growing popularity of JSON-based APIs, we all find ourselves writing code to “dig” through a parsed JSON response of nested hashes and arrays. This can be error-prone and tedious, so Ruby 2.3 added the `dig` instance method to Array and Hash to simplify the process.

For a simple example, let’s take some code that uses GitHub’s API to get the assignee of the first Pull Request for a given repo:

require 'net/http'
require 'json'

uri = URI.parse("")
pulls_response = Net::HTTP.get_response(uri)
# Net::HTTP returns the status code as a String
pulls_response.code == "200" or raise "Got non-200 response code: #{pulls_response.inspect}"

pulls = JSON.parse(pulls_response.body)

first_assignee = pulls[0]['assignee']['login']

But what if the response doesn’t come back in the expected format? Any of the `[]` operators above might return `nil`, and then the next `[]` would raise the dreaded (and nearly useless) Ruby exception:

NoMethodError: undefined method `[]' for nil:NilClass

Here is the last line rewritten to use `dig` and to raise a useful exception if the format is unexpected:

#   first_assignee = pulls[0]['assignee']['login']
first_assignee = pulls.dig(0, 'assignee', 'login') or raise "Got unexpected response #{pulls.inspect}"
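The difference is easy to try without a network call. The sketch below uses a made-up JSON payload shaped loosely like a pull request list, and assumes Ruby 2.3+ (or the `ruby_dig` gem loaded on an older Ruby).

```ruby
require 'json'

# Made-up payload: the first pull has an assignee, the second does not.
pulls = JSON.parse('[{"assignee": {"login": "octocat"}}, {"assignee": null}]')

puts pulls.dig(0, 'assignee', 'login')          # octocat
puts pulls.dig(1, 'assignee', 'login').inspect  # nil
puts pulls.dig(5, 'assignee', 'login').inspect  # nil (no such element)

# The chained [] form raises as soon as it hits the nil assignee.
begin
  pulls[1]['assignee']['login']
rescue NoMethodError => e
  puts "chained [] raised #{e.class}"
end
```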

Implementation Notes

The dig method is implemented by calling `self.[]`, so it will work with classes that derive from `Array` or `Hash`, most notably `ActiveSupport::HashWithIndifferentAccess`.  Therefore `dig` will work fine in a Rails application looking in `params`:

params.dig(:user, :emails, 0, :friendly_name)
params.dig("user", "emails", 0, "friendly_name") # equivalent to the above

Also negative array indexes will work fine:

params.dig(:user, :emails, -1, :friendly_name) # find the last email friendly name

However, neither the Ruby 2.3 documentation nor the tests in the commit make it clear what should happen in `Array#dig` if you pass a non-numeric index.  This can happen easily if the result you got had a hash where you expected an array.  It seems in keeping with the spirit of `dig` that it should return `nil` in this case rather than raising an exception:

TypeError: no implicit conversion of Symbol into Integer
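A minimal sketch of the behavior described in these notes follows. It is not the gem's actual source: `DigSketch` is a hypothetical module written as a standalone function so that it does not clash with the native `dig`. It looks each key up with `[]` and treats a `TypeError` from a mismatched index as "not found".

```ruby
module DigSketch
  # Look up `key` with [], then recurse on the remaining keys.
  def self.dig(container, key, *rest)
    value = begin
      container[key]
    rescue TypeError
      nil # e.g. a Symbol index on an Array: treat as "not found"
    end
    return value if value.nil? || rest.empty?
    raise TypeError, "#{value.class} does not respond to []" unless value.respond_to?(:[])
    dig(value, *rest)
  end
end

params = { user: { emails: [{ friendly_name: "Work" }, { friendly_name: "Home" }] } }

puts DigSketch.dig(params, :user, :emails, -1, :friendly_name)              # Home
puts DigSketch.dig(params, :user, :emails, :first, :friendly_name).inspect  # nil
puts DigSketch.dig(params, :user, :phones, 0).inspect                       # nil
```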

Feedback on magic comment ‘immutable: string’

Charles Nutter, the mastermind behind JRuby, had some feedback on the magic ‘immutable: string’ comment we proposed in Ruby issue 9278.  (BTW, our proposal is built directly on his Ruby 2.1.0 contribution, Ruby issue 9042.)

I figured it was worth copying the response up here where the formatting is more rich.

From Charles:

A magic comment should not completely change the semantics of a literal type. Encoding magic comments do not suffer from the same issue since they only change how the bytes of the strings are interpreted by the encoding subsystem…they do not change semantics.

I view this `immutable: string` comment as less intrusive than the encoding magic comment. The magic encoding was typically required in order to get code to compile at all in 1.9. In this case the magic `immutable: string` comment is just a tip to Ruby saying “Feel free to optimize string literals here”. The comment is completely optional, as is the optimization. But it’s worth it because the optimization has a big speed payoff: 1.6X in this benchmark. I think most projects will run at least 1.1X faster.

A magic comment is far removed from the actual literal strings, meaning that every developer that enters a file will have to keep in mind whether the strings have been forced to be immutable before doing any work with literal strings.

I expect that the `immutable: string` magic comment would typically be dropped at the top of all files in a project. (As many did with the encoding comment when porting to 1.9.) Any project that has automated test coverage is going to want to put the comment throughout as a project policy, because of the speed payoff.  [BTW to make it easy to add this magic comment, we’ve cloned the magic_encoding gem into a magic_immutable gem here.]

Although this eliminates or reduces calls to .freeze, it causes the opposite effect to get a mutable string in the same file…specifically, you have to call .dup.

Yes, exactly! This magic comment eliminates a giant/infeasible/ugly change (`.freeze` after every string literal) and replaces it with a tiny/feasible one. That is certainly its goal.

Consider Rails. I just did a quick/approximate regex search to count string literals in Rails 3.2.12 and found on the order of 50,000 string literals. How many of those do you think are mutated? I’m going to guess no more than 10 places across all of Rails. (I’m hoping to get some actual stats using a hack of this branch, but for a quick estimate I audited about 150 cases that mutate a string and didn’t find a single case where the receiver was a string literal.)

So if we add the magic comment in Rails, we’ll need to add `.dup` (or change to `String.new`) in up to 10 places instead of adding `.freeze` after 50,000 string literals!

I think it would be better to consider adding the String#f method proposed during the rework of .freeze optimizations. Adding a .f to a literal string is not very much to ask, and in your particular script, it would actually add *fewer* characters to the code than the magic comment.

I’m with Paul Graham, who argues that expressiveness is counted in elements (~tokens), not bytes. So `.f` is no more expressive than `.freeze`. Whether `.f` or `.freeze`, it’s still at least 1 element to be added after *every* string literal. One of Ruby’s main differentiators is its beauty and English-like readability. That’s certainly what drew me to Ruby. No one who feels that way is going to put `.freeze` or `.f` after every string literal.

In summary, there will certainly be projects that don’t bother with the `immutable: string` magic comment (for example, those that lack automated test coverage, or don’t run on MRI). But many projects will use it because they will get a big, immediate performance payoff in Ruby 2.1. We’re already looking forward to it!

Magic comment ‘immutable: string’ makes Ruby 2.1’s “literal”.freeze optimization the default

In Mutable strings in Ruby we saw that making Ruby strings immutable with .freeze can remove a common source of bugs.  We also saw that Ruby’s allowance for string mutation can cause significant performance degradation because string literals must be allocated (and later garbage collected) every time that code is run.

Ruby 2.1 takes a step towards addressing the performance problem with its “literal”.freeze (formerly “literal”f suffix) optimization. This addresses the performance problem described in Part 1, when you write ‘.freeze’ after your string literals:

def log_message(message)
  puts message + "[EOL]".freeze
end

The above change can make a big difference in performance.  But you have to remember to put ‘.freeze’ after every string literal, which is ugly and impractical to do broadly.
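One way to observe the optimization is to check object identity. In this sketch (the method name `frozen_suffix` is made up), a frozen literal evaluated twice on Ruby 2.1 or later yields the very same object rather than a fresh allocation each time.

```ruby
def frozen_suffix
  "[EOL]".freeze
end

first  = frozen_suffix
second = frozen_suffix

puts first.frozen?         # true
puts first.equal?(second)  # true on Ruby 2.1+: the same object both times
```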

# -*- immutable: string -*- Makes “literal”.freeze the Default

To be practical, we need a way to mark entire code files as having their string literals as frozen by default.  This pull request does just that with a magic ‘immutable: string’ comment at the top of the file.  Here’s a test that shows its usage:

# -*- immutable: string -*-

require 'minitest/autorun'
require_relative 'test2.rb' # file with a mutable string

heredoc_string = <<-EOS
  Hello World
EOS

strings = [
  heredoc_string
]

def mutate(str)
  str.slice!(1, 2)
end

def log(message)
  "I'm logging: #{message}"
end

describe "strings defined in this file" do
  strings.each do |s|
    it "should raise an error" do
      -> {
        mutate(s)
      }.must_raise(RuntimeError).message.must_match(/can't modify frozen String/)
    end
  end
end

describe "string interpolation" do
  it "should fail" do
    -> {
      str = log("blah blah")
      mutate(str)
    }.must_raise(RuntimeError).message.must_match(/can't modify frozen String/)
  end

  it "should succeed" do
    -> { str = "foo#{some_string}" }
  end
end

describe "strings not defined in this file" do
  it "should be mutable" do
    mutate(Foo::CONSTANT)
    Foo::CONSTANT.must_equal "SING"
  end
end

describe "static strings" do
  it "should always have the same object_id" do
    def some_string
      "A nice frozen string!"
    end
    some_string.object_id.must_equal some_string.object_id
  end
end

# test2.rb
module Foo
  CONSTANT = "STRING"
end

Overriding immutable: string

Sometimes in a file with the magic comment you may actually need a mutable string.  That is easy to do with `String.new` or `''.dup`:

# -*- immutable: string -*-

def concatenate(*args)
  result = String.new    # or:  result = ''.dup
  args.each do |arg|
    result << arg
  end
  result
end
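Both escape hatches can be tried in any current Ruby, independent of the proposed magic comment. This sketch uses an explicitly frozen constant (the name `GREETING` is illustrative) to stand in for a frozen literal.

```ruby
GREETING = "Hello".freeze

copy = GREETING.dup   # dup returns an unfrozen copy
copy << ", world"

fresh = String.new    # a new, empty, mutable string
fresh << "abc"

puts copy              # Hello, world
puts GREETING          # Hello
puts GREETING.frozen?  # true
puts copy.frozen?      # false
puts fresh             # abc
```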

Making Ruby More Functional

This new magic comment should allow some big performance gains and lessen bugs at the same time, taking Ruby one step closer to Functional Programming.  We sincerely hope this pull request will be accepted into the main Ruby 2.1 branch.

Questions or comments?  Feel free to post them here, email, or tweet to @colindkelley.

Mutable strings in Ruby

When people ask why we use Ruby instead of Python, I usually mention the beautiful, nearly-invisible English-like syntax, the ease of writing DSLs, trivially simple block arguments, meta-programming including reflection… But I have to admit that Python has a few clearly superior features.  First on the list: Python strings are better because they are immutable.

The Perils of Mutation

What’s so bad about Ruby’s mutable strings?  Consider this innocuous-looking Ruby method that puts “[EOL]” at the end of the message that is being logged:

def log_message(message)
  message << "[EOL]"
  puts message
end

LOOP_MESSAGE = "In the loop"

3.times do
  log_message(LOOP_MESSAGE)
end

Here’s the output:

In the loop[EOL]
In the loop[EOL][EOL]
In the loop[EOL][EOL][EOL]

What are those extra [EOL]s doing in there?!  Drop into pry and have a look at LOOP_MESSAGE after the loop has finished:

=> "In the loop[EOL][EOL][EOL]"

Uggh! Every pass through the loop mutated that “constant” LOOP_MESSAGE string.  Not good.
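The root cause is that the method argument and the caller's string are one and the same object. A short sketch makes that visible (`append_eol!` is a made-up name, and `String.new` is used so the example stays mutable regardless of interpreter settings):

```ruby
def append_eol!(message)
  message << "[EOL]" # mutates the caller's object in place
  message
end

original = String.new("In the loop")
returned = append_eol!(original)

puts returned.equal?(original)  # true: one object, two names
puts original                   # In the loop[EOL]
```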

Avoid Mutating Methods and Operators

Let’s banish the nasty mutating “String#<<” operator:

def log_message(message)
  puts message + "[EOL]"
end

With that, the bug is fixed!

In the loop[EOL]
In the loop[EOL]
In the loop[EOL]

Besides String#<<, most of the other mutating methods to avoid end in !, like slice!, downcase!, etc.

Run-time Immutability with Object#freeze

Ruby has always had a run-time form of immutability.  If you call “.freeze” on an object, the Ruby runtime will raise an exception if any code tries to mutate that object:

>> pets = "cat"
=> "cat"
>> pets << " dog"
=> "cat dog"
>> pets.freeze;
>> pets << " canary"
RuntimeError: can't modify frozen String
 from (irb):8

So if you remember to freeze your string literals, you can avoid bugs with mutating shared references. You may have seen this done in your favorite Ruby gem or in Rails itself:

CONTENT_TYPE = "Content-Type".freeze
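Applied to the earlier logging bug, freezing turns silent corruption into an immediate error. The sketch below reuses the mutating version of `log_message`; rescuing `RuntimeError` also covers `FrozenError`, which is its subclass in Ruby 2.5 and later.

```ruby
def log_message(message)
  message << "[EOL]" # the buggy, mutating version
  puts message
end

loop_message = "In the loop".freeze

begin
  log_message(loop_message)
rescue RuntimeError => e
  puts "#{e.class} raised instead of mutating the string"
end

puts loop_message  # In the loop
```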

Performance Problems Caused by Mutability

Every time the log_message method above is called, the string literal “[EOL]” is allocated again from the heap.  If this method were called inside the tightest loop in our code, the time to dynamically allocate this string literal (and later garbage collect it) could start to dominate the performance of the application.
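A rough way to see the allocations is to compare the interpreter's allocation counter before and after a loop. This is a sketch only: the helper names are made up, `GC.stat(:total_allocated_objects)` assumes MRI 2.2 or later, and the exact counts vary by Ruby version and settings.

```ruby
def plain_suffix
  "[EOL]"
end

def frozen_suffix
  "[EOL]".freeze
end

# Number of objects allocated while the block runs (MRI only).
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

n = 10_000
puts allocations { n.times { plain_suffix } }   # roughly n when literals are not frozen by default
puts allocations { n.times { frozen_suffix } }  # close to 0
```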

The only reason Ruby goes to all this effort to allocate string literals every time the code is run is to allow the string to be mutated.  For example:

def concatenate(*args)
  result = ''
  args.each do |arg|
    result << arg
  end
  result
end

Non-Mutation is the Functional Way

Mutating variables is bug-prone, particularly in a language with shared references like Ruby, because side-effects from one method can change the behavior of another.  More generally, it is much harder for a programmer to understand the behavior of code that mutates because the name of an object is insufficient to predict its content—the programmer must also know the point in the object’s lifetime.

It may be less intuitive, but non-mutating code can actually perform better because objects that are never mutated can be freely shared, both by the language and the application.

Non-mutation is a cornerstone of the functional programming style.