
Converting comma separated fields to MySQL JSON – a case study

This post is a case study of a job I had to do in a legacy application. It may not apply to your situation, but it might.

There are many ways to store a tree in a relational database. The one shown here is far from the best option, but it is still common to see in the wild.

One of them is called a materialized path: the path is stored as values separated by a delimiter, in my case a comma (,).

You would have a tree stored in a manner like this:

id  parent_id  user_id  depth  asc_path   node
 1          0        1      1
 2          1        2      2  ,1,        L
 3          1       13      2  ,1,        R
 4          2        3      3  ,2,1,      L
 5          2       61      3  ,2,1,      R
 6         13       23      3  ,13,1,     L
 7         13       22      3  ,13,1,     R
 8          3        4      4  ,3,2,1,    L
 9          3      156      4  ,3,2,1,    R
10         22     1568      4  ,22,13,1,  L
11         22       26      4  ,22,13,1,  R
12         23     1476      4  ,23,13,1,  L
13         23   690716      4  ,23,13,1,  R
14         61     1051      4  ,61,2,1,   L
15         61       62      4  ,61,2,1,   R

The column asc_path stands for the ascending path of each node, that is, the list of its ancestors up to the root. In this example every node has two children, but a tree stored this way is not necessarily binary.

This column has commas at the beginning and at the end because of how queries check whether an element is present in a path: LIKE '%,id,%'. Without the surrounding commas, a search for node 2 would also match 23, 62 and any other number containing the digit 2.

Performance

The only way to make this a bit faster is to create a FULLTEXT index on asc_path. A BTREE index matches from the beginning of a string, so the leading wildcard % in the search makes it impossible to use such an index.
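
If you go that route, a minimal sketch (assuming the tree table used in the queries below, and InnoDB on MySQL 5.6 or newer, where FULLTEXT indexes are supported) would be:

-- Sketch: FULLTEXT index on the materialized path column.
ALTER TABLE tree
ADD FULLTEXT INDEX ftx_asc_path (asc_path);

Lookups would then go through MATCH (asc_path) AGAINST ('13') rather than LIKE, and for short numeric node ids you would also need to lower innodb_ft_min_token_size (default 3), otherwise tokens like 13 would not be indexed at all.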

This is the graphical representation of the example above:

[Figure: tree diagram]

Searching

To search for a specific element, the query would be:

SELECT
parent_id,
user_id,
depth,
asc_path,
node
FROM tree
WHERE asc_path LIKE '%,13,%';

Result:

parent_id  user_id  depth  asc_path   node
       13       23      3  ,13,1,     L
       13       22      3  ,13,1,     R
       22     1568      4  ,22,13,1,  L
       22       26      4  ,22,13,1,  R
       23     1476      4  ,23,13,1,  L
       23   690716      4  ,23,13,1,  R

Converting to a JSON array

Some databases, like PostgreSQL (section 9.42 of its manual), have more helper functions to convert strings to JSON. In my case, I wanted to store the ascending tree path in a JSON column, which would give me the possibility of using JSON_CONTAINS(json_doc, val) to find the records that have a given node in their path.

To do that, I had to transform the string into a JSON array.

1st step: remove the leading and trailing commas

We remove the surrounding commas with TRIM(), but before running any UPDATE, let's test what we are doing:

SELECT
parent_id,
user_id,
depth,
asc_path,
TRIM(BOTH ',' FROM asc_path) AS trimmed_commas
FROM tree;

Results:

parent_id  user_id  depth  asc_path   trimmed_commas
        0        1      1
        1        2      2  ,1,        1
        1       13      2  ,1,        1
        2        3      3  ,2,1,      2,1
        2       61      3  ,2,1,      2,1
       13       23      3  ,13,1,     13,1
       13       22      3  ,13,1,     13,1
        3        4      4  ,3,2,1,    3,2,1
        3      156      4  ,3,2,1,    3,2,1
       22     1568      4  ,22,13,1,  22,13,1

2nd step: add brackets to the string

A JSON array is wrapped in brackets ([]), and we need them around our string for it to be a valid JSON document:

SELECT
parent_id,
user_id,
depth,
asc_path,
TRIM(BOTH ',' FROM asc_path) AS trimmed_commas,
CONCAT("[", TRIM(BOTH ',' FROM asc_path), "]") AS added_brackets
FROM tree;

Results:

parent_id  user_id  depth  asc_path   trimmed_commas  added_brackets
        0        1      1
        1        2      2  ,1,        1               [1]
        1       13      2  ,1,        1               [1]
        2        3      3  ,2,1,      2,1             [2,1]
        2       61      3  ,2,1,      2,1             [2,1]
       13       23      3  ,13,1,     13,1            [13,1]
       13       22      3  ,13,1,     13,1            [13,1]
        3        4      4  ,3,2,1,    3,2,1           [3,2,1]
        3      156      4  ,3,2,1,    3,2,1           [3,2,1]
       22     1568      4  ,22,13,1,  22,13,1         [22,13,1]

3rd step: validate that the changes work

Let's use JSON_VALID() to check whether it will accept our new string as JSON. Keep in mind that when the argument is NULL, the return value is also NULL:

SELECT
parent_id,
user_id,
depth,
asc_path,
TRIM(BOTH ',' FROM asc_path) AS trimmed_commas,
CONCAT("[", TRIM(BOTH ',' FROM asc_path), "]") AS added_brackets,
JSON_VALID(CONCAT("[", TRIM(BOTH ',' FROM asc_path), "]")) AS json_valid
FROM tree;

Results:

parent_id  user_id  depth  asc_path   trimmed_commas  added_brackets  json_valid
        0        1      1
        1        2      2  ,1,        1               [1]             1
        1       13      2  ,1,        1               [1]             1
        2        3      3  ,2,1,      2,1             [2,1]           1
        2       61      3  ,2,1,      2,1             [2,1]           1
       13       23      3  ,13,1,     13,1            [13,1]          1
       13       22      3  ,13,1,     13,1            [13,1]          1
        3        4      4  ,3,2,1,    3,2,1           [3,2,1]         1
        3      156      4  ,3,2,1,    3,2,1           [3,2,1]         1
       22     1568      4  ,22,13,1,  22,13,1         [22,13,1]       1
       22       26      4  ,22,13,1,  22,13,1         [22,13,1]       1
       23     1476      4  ,23,13,1,  23,13,1         [23,13,1]       1
       23   690716      4  ,23,13,1,  23,13,1         [23,13,1]       1
       61     1051      4  ,61,2,1,   61,2,1          [61,2,1]        1
       61       62      4  ,61,2,1,   61,2,1          [61,2,1]        1

Replacing the 1st and 2nd steps with a function

So that your query gets easier to read and less messy, you can create a function. I decided to create to_json_array(input_string, delimiter_char):
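
As a minimal sketch (assuming delimiter-wrapped lists of integer ids, as in asc_path, and NULL in, NULL out behaviour), it could look like this:

DELIMITER $$

-- Sketch of the helper; the exact original definition may differ.
CREATE FUNCTION to_json_array(input_string TEXT, delimiter_char CHAR(1))
RETURNS TEXT DETERMINISTIC
BEGIN
  -- Trim the surrounding delimiters, add a space after each remaining
  -- one to match MySQL's JSON formatting, and wrap it all in brackets.
  -- CONCAT() propagates a NULL input through to the result.
  RETURN CONCAT('[',
                REPLACE(TRIM(BOTH delimiter_char FROM input_string),
                        delimiter_char, ', '),
                ']');
END$$

DELIMITER ;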

Running the query with just to_json_array on MySQL:

SELECT
parent_id,
user_id,
depth,
asc_path,
to_json_array(asc_path, ',') AS to_json_array,
JSON_VALID(to_json_array(asc_path, ',')) AS is_to_json_array_valid,
node
FROM tree;

Result:

parent_id  user_id  depth  asc_path   to_json_array  is_to_json_array_valid  node
        0        1      1
        1        2      2  ,1,        [1]            1                       L
        1       13      2  ,1,        [1]            1                       R
        2        3      3  ,2,1,      [2, 1]         1                       L
        2       61      3  ,2,1,      [2, 1]         1                       R
       13       23      3  ,13,1,     [13, 1]        1                       L
       13       22      3  ,13,1,     [13, 1]        1                       R
        3        4      4  ,3,2,1,    [3, 2, 1]      1                       L
        3      156      4  ,3,2,1,    [3, 2, 1]      1                       R
       22     1568      4  ,22,13,1,  [22, 13, 1]    1                       L

Disclaimer

This function is not native to MySQL, and its behaviour in production is not guaranteed.

Notice that the database reports the JSON as valid, making it possible to convert that TEXT into a new column, asc_path_json:

ALTER TABLE tree
ADD COLUMN asc_path_json JSON
AFTER asc_path;

UPDATE tree
SET asc_path_json = to_json_array(asc_path, ',');

Which gives us the ability to check more quickly if an item is in the path for that node:

SELECT *
FROM tree
WHERE JSON_CONTAINS(asc_path_json, '13');

Result:

id  parent_id  user_id  depth  asc_path   asc_path_json  node
 6         13       23      3  ,13,1,     [13, 1]        L
 7         13       22      3  ,13,1,     [13, 1]        R
10         22     1568      4  ,22,13,1,  [22, 13, 1]    L
11         22       26      4  ,22,13,1,  [22, 13, 1]    R
12         23     1476      4  ,23,13,1,  [23, 13, 1]    L
13         23   690716      4  ,23,13,1,  [23, 13, 1]    R

MySQL version poll: a not so scientific analysis

Prior to my talk at LaraconEU 2016, I was curious to know how much adoption MySQL 5.7 had within the community.

I tweeted a poll about it:

[Embedded tweet: first poll, August 2016]

Twitter polls only give you up to 4 options. I also wanted to know whether people were using MariaDB or other forks, like Percona Server, but I didn't have the space for it, so I only put three options.

This January I managed to get a bit more syndication on my tweet and more people replied. I added a 4th option, “Other”, which could cover the forks as well as people still using even MySQL 4:

[Embedded tweet: second poll, January 2017]

Analysis results

This has no scientific foundation whatsoever. Most of the people in my Twitter bubble work in tech and try to use cutting-edge technology, but I could see a bit of a trend (also taking into consideration the number of people that replied this time).

[Poll results: August 2016 and January 2017]

It is possible to notice that 5.7 gained ground where 5.5 used to be the most common version among those people. I would like to think they upgraded first to 5.6 and then to 5.7, rather than jumping versions and making it work by disabling strict mode like this:

SET @@GLOBAL.sql_mode = '';

Again, this is the equivalent of disabling errors in any language because you are not going to fix them, you just want to sweep them under the carpet. Don't do that.

It is nice to see that 5.5 is losing ground (again, a pinch of salt here) to newer, more modern versions.

What should I not consider?

Well, you can actually ignore the whole poll as a trend indicator. The first one ran for only a day and got 85 votes, without all the options; the second one ran for a week and got 669 votes. Plus, there is no way to set up a control group to calculate the margin of error.

What does this really mean?

MySQL 5.7 reached General Availability around October 2015, and major hosting and cloud companies started to make it available around February/March 2016. Adoption always takes a while, especially if you have to change code to support the new version of the database (hint: you probably will have to). It also means that those companies may at any point stop providing support for versions older than 5.6 (5.5, 5.1, etc.).

Also take into consideration that MySQL 8.0 is under development, and most of the strictness enabled by default in 5.7 will carry over to 8.0. So if you are reading this blog post and starting a new project, go ahead and start with 5.7, so that when 8.0 comes out you won't have trouble upgrading.

If you have a legacy application, there are ways of adapting your code so you can enjoy everything the new version has to offer. Just a final reminder: disabling strictness on the server to be able to use the JSON features may sound like a smart idea at first, but it also means putting your data consistency at risk.

What MySQL 8.0.1 means to you as a Developer

This post will be updated as soon as more information comes along.

This development version hasn't been released yet; when it is, use it at your own risk.

Oracle released the development milestone MySQL 8.0.0-dmr on September 12th, 2016. Since then, the team has been working on the 8.0.1 milestone. You can find the partial change list here.

The objective here is to explain the real-world impact 8.0.1 will have for you. Please remember, though, that any changes made in this version are not final until General Availability (date not currently set).

These topics are aimed at the software engineering side rather than at DBAs, which is why Replication, for instance, is not covered here.

Changes

Charset and Collation

MySQL 8.0 was defined to have utf8mb4 as the default CHARACTER SET and utf8mb4_general_ci as the default COLLATION. 8.0.1 changes the default COLLATION to utf8mb4_0900_ai_ci.

Let's analyse the name utf8mb4_0900_ai_ci: utf8mb4 is the character set, 0900 refers to the Unicode 9.0.0 standard it is based on, ai means accent insensitive, and ci means case insensitive.

How does that impact you? It means that, by default, new tables will have that collation, will be able to handle characters beyond the Basic Multilingual Plane (more emoji! 🤦🏼‍♀️), and comparisons will be accent and case insensitive. If you want case- and accent-sensitive comparisons, you will need utf8mb4_0900_as_cs.
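
As a quick illustration (the table and column names here are made up), explicitly opting in to the sensitive collation would look like this:

CREATE TABLE products (
  name VARCHAR(255)
) CHARACTER SET utf8mb4
  COLLATE utf8mb4_0900_as_cs; -- accent- and case-sensitive comparisons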

If you wish to know more about the reasons for utf8 now meaning utf8mb4, you should read this post on the official MySQL blog: Sushi = Beer ?! An introduction of UTF8 support in MySQL 8.0

Language Specific Charsets

There are cases where language rules take precedence over the default general collation. For those you need a language-specific collation: for German phone book order, for example, utf8mb4_de_pb_0900_as_cs.

Optimization

Descending Index

This is one of my most desired features, and it is finally implemented in this version: ALTER TABLE ADD INDEX ix_column (column DESC) won't be parsed as ASC anymore.

Since InnoDB uses BTREE indexes, for a single-column index it doesn't matter whether the index is ASC or DESC when running a query: the index can be traversed in either direction anyway.

However, with multi-column indexes that mix directions this does matter, and having a descending index will actually improve performance. Example:

A generic table users:

Field       Type              Null  Key  Default            Extra
id          int(10) unsigned  NO    PRI                     auto_increment
first_name  varchar(127)      NO
last_name   varchar(128)      NO
email       varchar(255)      NO
created_at  timestamp         NO         CURRENT_TIMESTAMP
updated_at  timestamp         NO         CURRENT_TIMESTAMP  on update CURRENT_TIMESTAMP

The following query will perform better with a DESC index on the updated_at column:

SELECT *
FROM `users`
ORDER BY updated_at DESC, first_name ASC
LIMIT 100;

For that we need the following index:

ALTER TABLE `users`
ADD INDEX `ix_updated_at_first_name`
(updated_at DESC, first_name ASC);

Before the index, the query plan would do a full table scan, not using any index at all. After the index is created, it scans the index rather than the table:

[EXPLAIN query plans: before and after adding the index]

The query above is just a simplistic example and doesn't filter on the indexed columns anywhere, which is why it results in a full index scan.
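
If you want to verify the behaviour yourself, a simple sketch with EXPLAIN is enough:

EXPLAIN
SELECT *
FROM `users`
ORDER BY updated_at DESC, first_name ASC
LIMIT 100;
-- Without the index, the plan shows a full table scan with "Using filesort";
-- with it, the ORDER BY is satisfied by the index and the filesort disappears.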

Parser

I personally never used \N to indicate NULL in a query itself, but I have used it in CSV or TSV files. This behaviour won't be supported in SQL statements anymore. The change doesn't affect file import or export through LOAD DATA INFILE or SELECT ... INTO OUTFILE.

Functionalities

In-place operations

MySQL 5.7 added the Generated Columns feature. However, an ALTER TABLE on a table containing one would be a COPY operation (which is slower, since all the table data has to be copied again). Now it can be INPLACE, as long as the column being modified is not used in a generated column expression. What this actually means is: the column metadata is changed in place, without the need to internally create a new table and copy the data.
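
One way to take advantage of this is to request the algorithm explicitly, so MySQL raises an error instead of silently falling back to a table copy. A sketch, with a hypothetical orders table that contains a generated column:

-- Change the default of a column not involved in the generated column;
-- ALGORITHM=INPLACE makes MySQL fail loudly if a copy would be required.
ALTER TABLE orders
ALTER COLUMN note SET DEFAULT '',
ALGORITHM=INPLACE, LOCK=NONE;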

JSON

Features

Two new functions are added to aggregate JSON values: JSON_ARRAYAGG() and JSON_OBJECTAGG().

JSON_ARRAYAGG() takes a column or expression as an argument and aggregates the results into a single JSON array. JSON_OBJECTAGG(id, col) takes two columns, which are interpreted as key and value, and returns a single JSON object.
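
A small usage sketch, reusing the generic users table from the index example above (the result values are made up):

-- All first names aggregated into a single JSON array:
SELECT JSON_ARRAYAGG(first_name) AS names FROM users;
-- e.g. ["Ana", "Bruno", "Carla"]

-- A single JSON object keyed by id:
SELECT JSON_OBJECTAGG(id, first_name) AS names_by_id FROM users;
-- e.g. {"1": "Ana", "2": "Bruno", "3": "Carla"}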

Performance

Performance of ORDER BY on JSON values was improved. Before, MySQL would allocate 1K of memory for each sort key, making them fixed length; that extra padding was removed in this version.
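
For instance, a query like this one (assuming a hypothetical documents table with a JSON column doc) builds one sort key per row and benefits from the change:

SELECT doc->>'$.title' AS title
FROM documents
ORDER BY title
LIMIT 10;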

Error Handling

Indexing JSON columns is only possible through generated columns. If you tried to index a JSON column directly, you would get the following error:

JSON column '%s' cannot be used in key specification.

The error message has now been made clearer:

JSON column '%s' supports indexing only via generated columns on a specified JSON path.
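
The supported pattern, shown here as a short sketch with a hypothetical events table, is to extract the JSON path into a generated column and index that:

CREATE TABLE events (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  doc JSON,
  -- Extract the value at a fixed JSON path into a regular column...
  event_type VARCHAR(32)
    GENERATED ALWAYS AS (doc->>'$.type') VIRTUAL,
  -- ...and index the generated column instead of the JSON column itself.
  INDEX ix_event_type (event_type)
);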

Bugs

More than 100 bugs were fixed in this version, ranging from InnoDB through Replication to compilation bugs. The list is too big to tack onto this article, but the complete (yet still partial) change list is available here.