Fast data import trick

A few weeks ago my friend Frank de Jonge told me he managed to improve an import into a MySQL server down from more than 10 hours to 16 minutes. According to him it had to do with several field types (too long fields to really small data), the amount of indexes, and constraints on the tables. We were talking about 1 million records here. He wondered if it was possible to make it even faster.

The basics

Turns out there are many ways of importing data into a database, it all depends where are you getting the data from and where you want to put it. Let me give you a bit more context: you may want to get data from a legacy application that exports into CSV to your database server or even data from different servers.

If you are pulling data from a MySQL table into another MySQL table (lets assume they are into different servers) you might as well use mysqldump.

To export a single table:

$ mysqldump -h localhost -u root -p --extended-insert --quick --no-create-info mydb mytable | gzip > mytable.sql.gz

A bit more about this line:

  • --extended-insert: it makes sure that it is not one INSERT per line, meaning a single statement can have dozens of rows.
  • --quick: useful when dumping large tables, by default MySQL reads the whole table in memory then dumps into a file, that way the data is streamed without consuming much memory.
  • --no-create-info: this means only the data is being exported, no CREATE TABLE statements will be added

The complex

The problem my friend faced was a bit more complex. He needed to generate the dump file due to the source of his data coming from somewhere else (Later on I advised him on the benefits of LOAD FILE), but since 90% of his work was already done he wanted to know:

Why when I do blocks of 50 rows to be inserted is it faster then when I do with 500?

There could be N reasons for that:

  • buffering 500 rows into memory is slower than 50, remember, you are reading from the disk, it is always slow.
  • if no transactions are used, the indexes gets rebuilt after the end of each INSERT, to 1 million rows at a 50 values per statement we have 20k INSERTs, while with 500 it would be 2k statements. My speculation here is that indexes in InnodB engine are BTREE, slowling building means that you “know” where the values are in the tree, so it’s a fast search to sort and organise while with 500 items you need to reorganise a lot of information at once. Again, this is an speculation.

Suggestions

Transactions

My first suggestion was: wrap everything in a single transaction. Put a START TRANSACTION in the beginning and at the end a COMMIT statement. That way you do the rebuilding of the indexes and foreign key checks at the end of the script.

He reported a minor improvement on performance.

The Danger

I knew from the begining a way where his import would be really fast, but since the source of his data wasn’t as secure as the database itselft it could result in duplicated data, missing foreign keys, it could end really really bad.

MySQL by default when you use mysqldump put this option in place because it’s fair to assume you are going to be importing this to an empty database, so no data integrity problems. Which wasn’t the case.

The data was manipulated to be inserted, so the trick I said to him was and I quote:

SET foreign_key_checks = 0;
/* do you stuff REALLY CAREFULLY */
SET foreign_key_checks = 1;

The import went from 16 min to 6 min. He got super happy 😀:

And people on the internet got curious (because Frank is famous now, apparently):

I confess it was fun to see the time cut down and more than half, but use with caution.

An even more faster way

CSV files. Yes, that’s faster. Specifically TSV, since any string can have a comma.

To generate:

$ mysqldump -h localhost -u root -p --tab=/tmp mydb mytable

Or if you are manipulating the data yourself from another source, don’t forget to use \N for NULL values.

To Read:

$ mysql -h localhost -u root -p
mysql> LOAD DATA INFILE '/tmp/mytable.txt' INTO TABLE mytable;
Query OK, 881426 rows affected (29.30 sec)
Records: 881426 Deleted: 0 Skipped: 0 Warnings: 0

The same data with bulk INSERTs took over a minute. There are many variables when dealing with that statement such as buffer size, the checking of the keys itself, so for high volume data importing straight from a text file is still the fastest option.

Conclusion

As I said before, it was just a matter of disabling the constraint check in the script. Only do that if you are sure the data is good, else, other options like net_buffer_length, max_allowed_packet and read_buffer_size can help you import big SQL files. Also in most cases this should be considered: Data Integrity > Performance.

One thought on “Fast data import trick

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s