GROUP BY, are you sure you know it?

New MySQL version, YAY!

MySQL 5.7 is full of new features, like virtual columns, virtual indexes and JSON fields! But, it came with some changes to the default configuration. When running:

SELECT @@GLOBAL.sql_mode;

We get:

ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION

What I want to talk about is the ONLY_FULL_GROUP_BY mode. This mode rejects queries where nonaggregated columns are expected, but aren’t on the GROUP BY or HAVING clause. Before MySQL 5.7.5, ONLY_FULL_GROUP_BY was disabled by default, now it is enabled.

You know the drill…

This is a simple statement, people use it everywhere, it shouldn’t be that hard to use, right?

Given the following schema:

Suppose I want to list all users that commented on post_id = 1, MySQL 5.6:

SELECT *
FROM comments c
INNER JOIN users u ON c.user_id = u.id
WHERE c.post_id = 1
GROUP BY c.user_id;

And this is the result:

Same query running on 5.7.11 gives the following results:

[42000][1055] Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'blog.c.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

What does it mean?

What MySQL is complaining about here is this: you grouped rows by c.user_id, but the problem is there are more than one result to be retrieved for the c.id column. Since you didn’t use any aggregators, as min(c.id) for instance, it doesn’t know which result to bring.

Previous versions of MySQL would solve this “magically”. This change is not MySQL being temperamental with you, it is them implementing long old industry standard specifications (SQL/92 and SQL/99) to the database. To rely on results brought in the previous versions of that query is not smart. Those results are unpredictable and totally arbitrary.

From the 5.6 documentation:

MySQL extends the standard SQL use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Result set sorting occurs after values have been chosen, and ORDER BY does not affect which values within each group the server chooses.

How do I fix it?

It will make your query more verbose, but it will make it right. There are two ways of doing this.

One way is using aggregators in the fields you need to retrieve and that will be grouped by the email field, for instance.

SELECT
any_value(u.id) AS user_id,
any_value(u.name) AS name,
u.email,
any_value(u.country) AS country,
any_value(u.created) AS registration_date,
max(c.created) AS last_comment,
count(*) AS total_comments
FROM comments c
INNER JOIN users u ON c.user_id = u.id
WHERE c.post_id = 1;

Another way is to name the fields that will be unique in the GROUP BY clause:

SELECT
u.id AS user_id,
u.name,
u.email,
u.country,
u.created AS registration_date,
max(c.created) AS last_comment,
count(*) AS total_comments
FROM comments c
INNER JOIN users u ON c.user_id = u.id
WHERE c.post_id = 1
GROUP BY u.email, u.id, u.name, u.country, u.created;

Result for both queries:

+---------+---------------+----------------------+---------+---------------------+---------------------+----------------+
| user_id | name | email | country | registration_date | last_comment | total_comments |
+---------+---------------+----------------------+---------+---------------------+---------------------+----------------+
| 2 | Bart Simpson | bart@simpsons.com | US | 2016-03-03 20:07:28 | 2016-03-03 21:21:08 | 2 |
| 1 | Lisa Simpson | lisa@simpsons.com | US | 2016-03-03 20:07:23 | 2016-03-03 21:20:50 | 2 |
| 3 | Homer Simpson | nobrain@simpsons.com | US | 2016-03-03 20:07:38 | 2016-03-03 21:20:56 | 1 |
+---------+---------------+----------------------+---------+---------------------+---------------------+----------------+
3 rows in set (0.00 sec)

In another words, both queries follows SQL/92 specification:

The SQL/92 standard for GROUP BY requires the following:

  • A column used in an expression of the SELECT clause must be in the GROUP BY clause. Otherwise, the expression using that column is an aggregate function.
  • A GROUP BY expression can only contain column names from the select list, but not those used only as arguments for vector aggregates.

The results of a standard GROUP BY with vector aggregate functions produce one row with one value per group.

In the 5.7.5 version, MySQL also implemented SQL/99, which means that if such a relationship exists between name and id, the query is legal. This would be the case, for example, where you group by a primary key or foreign key:

SELECT
u.id AS user_id,
u.name,
u.email,
u.country,
u.created AS registration_date,
max(c.created) AS last_comment,
count(*) AS total_comments
FROM comments c
INNER JOIN users u ON c.user_id = u.id
WHERE c.post_id = 1
GROUP BY u.id;

You can read more details about how MySQL handles GROUP BY in their documentation.

TL;DR;

According to the documentation, this configuration is being enabled by default because GROUP BY processing has become more sophisticated to include detection of functional dependencies. It also brings MySQL closer to the best practices for SQL language with the bonus of removing the “magic” element when grouping. Having that, grouping fields are no longer arbitrary selected.

Disabling ONLY_FULL_GROUP_BY

If you are upgrading your database server and want to avoid any possible breaks you can disable by removing it from your sql_mode.

Changing in runtime

SET @@GLOBAL.sql_mode = 'STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION'

A restart is not necessary, but a reconnection is.

Change permanently

If you want to disable it permanently, add/edit the following in your my.cnf file:

sql_mode = "STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"

For this change a service restart is required:

$ mysql service restart

One thought on “GROUP BY, are you sure you know it?

  1. NIce post. I understand how aggregates and functions works but it still feels weird to use ANY_VALUE(). Manually adding all the values to the group by feels like I’m telling MySQL “it’s OK, go ahead and give me that”. It adds to your “mental bloat” when you’re reading code, especially old code. It’s hard to let go of old habits and I suffered a lot when I had to use PG and ORA. At the same time since I started using ONLY_FULL_GROUP_BY (and all of 5.7’s new defaults) I can’t go back. Data integrity and free error reporting are too good to let go. But I’m still not happy when I’m writting queries. Not sure I’ll ever reach a point where I don’t re-read my group by queries around 50 times just to make sure I’m not screwing it up.

    Another thing that worries me it’s possible side effects of using DISTINCT x GROUP BY. I need an adult.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s