What MySQL 8.0.1 means to you as a Developer

This post will be updated as soon more information comes along.

This developer version wasn’t released yet, when it does, use at your own risk.

Oracle released the development version of MySQL 8.0.0-dmr on September 12th of 2016. Since then, the team have been working on the 8.0.1 development milestone. You can find the partial change list here.

The objective here is try to explain how this will have any real world impact for you from 8.0.1. Please remember though, that any changes made to this version will not be final until the General Availability (date not currently set).

These topics are aimed at the Software Engineering side and not DBA and this is why Replication, for instance, is not covered here.

Changes

Charset and Collation

MySQL 8.0 was defined as to have utf8mb4 as the default CHARACTER SET and utf8mb4_general_ci as the default COLLATION. 8.0.1 will change the default COLLATION to utf8mb4_0900_ai_ci.

Let’s analyse the name utf8mb4_0900_ai_ci:

How does that impact you? It means that by default, new tables will have that collation and will be able to handle more characters than Basic Multilingual Plane (more emoji! 🤦🏼‍♀️), plus it will be accent and case insensitive. If you want case and accent sensitive you will need to use utf8mb4_0900_as_cs.

If you wish to know more about the reasons for utf8 now being utf8mb4 you should read this post on MySQL official blog: Sushi = Beer ?! An introduction of UTF8 support in MySQL 8.0

Language Specific Charsets

There are cases where language takes precedence over the default general collation. For this you will need to use, for example in the instance of German phone book order, utf8mb4_de_pb_0900_as_cs.

Optimization

Descending Index

This particularly is one of my most desired features. Finally being implemented on this version, the ALTER TABLE ADD INDEX ix_column (column DESC) won’t be parsed as ASC anymore.

Since InnoDB uses BTREE indexes, when running a query that uses it in the case of single columns it doesn’t matter if the index is ASC or DESC. The index is used for DESC anyway.

However when working with multi-column indexes this will matter and having a descending index will actually increase your performance. Example:

A generic table users:

Field Type Null Key Default Extra
id int(10) unsigned NO PRI auto_increment
first_name varchar(127) NO
last_name varchar(128) NO
email varchar(255) NO
created_at timestamp NO CURRENT_TIMESTAMP
updated_at timestamp NO CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP

The following query will have better performance if run with DESC index on updated_at time column:

SELECT *
FROM `users`
ORDER BY updated_at DESC, first_name ASC
LIMIT 100;

For that we need the following index:

ALTER TABLE `users`
ADD INDEX `ix_updated_at_first_name`
(updated_at DESC, first_name ASC);

Before that, the query plan would do a full scan on the table and not use the index at all. After the index is created it searches on index, and not the table:

Before After
Before After

The query above is just a simplistic example and doesn’t filter anywhere the index, this is why it causes to do a full index scan.

Parser

I personally never made use for \N to indicate NULL in any query itself, but I have used it in CSV or TSV files.
This behaviour won’t be supported anymore.
This change won’t impact file import or export through LOAD DATA INFILE or SELECT ... INTO OUTFILE.

Functionalities

In-place operations

In MySQL 5.7 the Generated Columns feature was added. However ALTER TABLE in tables containing one would be a COPY operation (which is slower since it has to copy all data again of the table). Now it can be INPLACE as long as the column(s) being modified is not in a generated column. What this actually means is: the metadata for the column will be changed in real time, without the need to internally create a new table and copy data.

JSON

Features

Two new functions are added to aggregate JSON values: JSON_ARRAYAGG() and JSON_OBJECTAGG().

JSON_ARRAYAGG() takes a column or expression as an argument and aggregates the result in a single JSON array.
With JSON_OBJECTAGG(id, col) you can use two columns which will be interpreted as key and value and returns a single JSON object.

Performance

Performance on JSON columns when used with ORDER BY was improved. Before MySQL would allocate 1K of memory to a sort key, making it fixed length. The extra padding was removed in this version.

Error Handling

The indexing of JSON fields is only possible through Generated Columns. If you tried to index a JSON field before, would get the following as an error:

JSON column '%s' cannot be used in key specification.

The error message has now been made clearer:

JSON column '%s' supports indexing only via generated columns on a specified JSON path.

Bugs

More than 100 bugs were fixed on this version. It ranges from InnoDB through Replication and even compilation bugs. The list is too big to be tacked on to this article, but the complete (and yet partial) change list is available here.

Understanding Generated Columns

The Theory

Generated Columns is a feature released on MySQL 5.7. They can be used during CREATE TABLE or ALTER TABLE statements. It is a way of storing data without actually sending it through the INSERT or UPDATE clauses in SQL. The database resolves what the data will be.

There are two types of Generated Columns: Virtual and Stored. They work with:

  • mathematical expressions (product_price * quantity)
  • built-in functions (RIGHT(), CONCAT(), FROM_UNIXTIME(), JSON_EXTRACT())
  • literals (“2”, “new”, 0)

Besides that, they can be indexed but they don’t allow subqueries in it.
A Generated Column works within the table domain. If you need subqueries on a particular column, you may have to look at Views.

The basic example

As an example I am going to use an e-commerce database as based on my past experience of what I have seen and worked. You will probably have at least these tables or something similar:

  • users – stores user info
  • products – stores product info like price and description
  • orders – stores the user_id, date of order
  • orders_items – stores product_id, order_id, quantity and price at the time of purchase

This is the whole DB: Gist.

Notice the order_items definition:

The retrieval would bring:

id order_id product_id product_price quantity
1 369 1304 202.18 7
2 4 1736 250.40 3
3 270 1404 29.89 5
4 256 179 190.40 10
5 107 1911 146.98 1

One example is to get the total of that order_item row, something like total_item_price that would store the value of product_price * quantity to show how much the summed amount of an item would be. Some databases have the MONEY type to store price, as with MySQL it is recommended to work with DECIMAL.

People solve this problem in different ways:

  • store the calculated price on a new column to make it easier to retrieve;
  • create a view;
  • or they calculate in the application itself, which in this case might cause problems due to how the language handles floats. There are libraries to deal with money values in a lot of languages and frameworks, however, the overhead of converting each row into a money object could be costly depending on the amount of data being transferred.

Another way I’ve seen is: people calculate in the query the total amount for the orders_items row as product_price * quantity:

id order_id product_id product_price quantity total_item_price
1 369 1304 202.18 7 1415.26
2 4 1736 250.40 3 751.20
3 270 1404 29.89 5 149.45
4 256 179 190.40 10 1904.00
5 107 1911 146.98 1 146.98

Virtual Columns

  • They take no disk space, except when using a Virtual Column as in a Secondary Index.
  • They are an INPLACE operation: it means the table definition is changed without having to recopy all the data again. More info.
  • The values are calculated on the fly during read operations and BEFORE triggers.

Consider using virtual columns for data where changes happens in a significant number of times. The cost of a Virtual Column comes from reading a table constantly and the server has to compute every time what that column value will be.

Stored Columns

  • They do use disk space.
  • It has the same cost of adding a new column, so it is a COPY operation
  • Values are updated in every INSERT and UPDATE statement.

You should consider using Stored Columns for when the data doesn’t change significantly or at all after creation, like for instance, the example above with the orders_items table. Once a purchase is made, the price of the product is stored, not being changed, neither the quantity. Considering this information we could create total_item_price as a Stored Column.

The code

Creating a table

-- Virtual Column

CREATE TABLE `orders_items` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`order_id` int(10) unsigned NOT NULL,
`product_id` int(10) unsigned NOT NULL,
`product_price` decimal(10,2) unsigned NOT NULL DEFAULT '0.00',
`quantity` int(10) unsigned NOT NULL DEFAULT 1,
`total_item_price` decimal(10,2) AS (`quantity` * `product_price`),
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` varchar(45) NOT NULL DEFAULT 'CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;

-- Stored Column

CREATE TABLE `orders_items` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`order_id` int(10) unsigned NOT NULL,
`product_id` int(10) unsigned NOT NULL,
`product_price` decimal(10,2) unsigned NOT NULL DEFAULT '0.00',
`quantity` int(10) unsigned NOT NULL DEFAULT 1,
`total_item_price` decimal(10,2) AS (`quantity` * `product_price`) STORED,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` varchar(45) NOT NULL DEFAULT 'CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;

Notice how the definition changes on line 9 and 23: you have another keyword, AS, then an expression and specifically on line 23 you see a STORED keyword. In both lines they are generated columns, if nothing is specified will be a VIRTUAL column.

Altering a table

It uses the same syntax as adding a column, just adding the “AS (expression)” after the data type:

-- `full_name` as VIRTUAL COLUMN
ALTER TABLE users
ADD COLUMN `full_name` VARCHAR(500)
AS (CONCAT_WS(" ", `first_name`, `last_name`));

-- `total_item_price` as STORED COLUMN
ALTER TABLE orders_items
ADD COLUMN `total_item_price` DECIMAL(10, 2)
AS (`quantity` * `product_price`) STORED;

JSON fields

It is also possible to extract data from JSON fields using generated columns. As the functions for JSON are built-in, JSON_EXTRACT and JSON_UNQUOTE as well “->” and “->>” work as expressions for a generated column:

-- Stored Columns
ALTER TABLE `twitter_users`
ADD COLUMN `location` VARCHAR(255)
AS (response->>"$.location") STORED;

Final considerations

When the type is STORED, it must be specified after the expression otherwise the default behaviour will be to be VIRTUAL.

Generated columns can have indexes created as the following, no matter if stored, virtual or extracted from a JSON field:

ALTER TABLE users
ADD INDEX `ix_full_name` (`full_name`);

Which is the same syntax for normal columns.