Category: Tutorial

Transferring ownership of repositories on GitHub

For the past couple of months, I’ve been studying. As a side effect, my GitHub account got cluttered with experimental code. I didn’t exactly want to trash that code, but I also didn’t want to keep it under my profile.

The solution I found was to create an organization and transfer my desired repositories there. I thought this was a great solution; there was just one little detail I was missing: the current GitHub API does not support repository ownership transfer. That meant I would have to go to each repository, click on “Settings”, click on “Transfer”, fill in the “Repository Name” and then enter the “Destination User”. A lot of steps for someone looking to move over 250 repositories.

The first thing that came to my mind was to use Selenium to automate this task, but my lack of exposure to that technology made me think a bit outside the box. One of the things I learned these past months was Capybara. Capybara is used alongside RSpec, extending the test suite’s DSL. It mimics user interaction with the browser and ships with a Selenium driver out of the box.

In other words, you can create a bot that goes to the browser, fills in forms and submits them. This is exactly what I was looking for. As I stated before, Capybara works with RSpec, so my code would actually have to be wrapped inside a test.

Caveats

  • You need to disable two-factor authentication on GitHub, and turn it back on after the script finishes running.
  • It does not transfer private repositories.
  • You need Mozilla’s geckodriver installed. If you use macOS, you can use brew to install it.

The Script

This was developed as a hack. Use at your own risk. You can download it at gabidavila/github-move-repositories. As of now the code gets all public repositories, unless the variable ONLY_FORKS is set to TRUE, and moves them to a destination user. Do not push your .env file. If you want to move only specific repositories, you will need to edit the code yourself. For more information, the README.md of this project is kept up to date.

Contributions are welcome if you feel you can help improve the tool, for example by adding options to choose which repositories to move.

Enjoy!

Converting comma separated fields to MySQL JSON – a case study

This post is a case study of a job I had to do on a legacy application. It doesn’t mean it will apply to you, but it might.

There are many ways to store a tree in a relational database. This is far from the best one, but it is still common to see it in the wild.

One of them is called a materialized path, which consists of the ancestor values separated by a delimiter, in my case a comma (,).

You would have a tree stored in a manner like this:

id  parent_id  user_id  depth  asc_path   node
1   0          1        1
2   1          2        2      ,1,        L
3   1          13       2      ,1,        R
4   2          3        3      ,2,1,      L
5   2          61       3      ,2,1,      R
6   13         23       3      ,13,1,     L
7   13         22       3      ,13,1,     R
8   3          4        4      ,3,2,1,    L
9   3          156      4      ,3,2,1,    R
10  22         1568     4      ,22,13,1,  L
11  22         26       4      ,22,13,1,  R
12  23         1476     4      ,23,13,1,  L
13  23         690716   4      ,23,13,1,  R
14  61         1051     4      ,61,2,1,   L
15  61         62       4      ,61,2,1,   R

The column asc_path holds the ascending path from each node up to the root. In this example every node has two children (L and R), but the tree being stored doesn’t have to be binary.

This column has commas at the beginning and at the end because of how the queries check whether an element is present in a path: LIKE "%,id,%". If someone searched for the number 2 as a node in any of the paths without the surrounding commas, it would also match 23, 62 and any other id containing a 2.
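For example, the difference between the two searches looks like this (a minimal illustration against the tree table above):

-- Too broad: also matches ,22,13,1, and ,23,13,1, because 22 and 23 contain the digit 2.
SELECT id, asc_path FROM tree WHERE asc_path LIKE '%2%';

-- Correct: matches only paths that actually contain the node 2.
SELECT id, asc_path FROM tree WHERE asc_path LIKE '%,2,%';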

Performance

The only way to make it a bit faster is to create a FULLTEXT index on asc_path. A BTREE index indexes from the beginning of the string, and since the search starts with the wildcard %, it is impossible to use that index.
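A sketch of that index and of a search that can use it (keep in mind InnoDB ignores tokens shorter than innodb_ft_min_token_size, 3 by default, so short ids like 1 or 13 may require lowering that setting and rebuilding the index):

ALTER TABLE tree ADD FULLTEXT INDEX ft_asc_path (asc_path);

-- Full-text search for paths containing the node 13.
SELECT * FROM tree WHERE MATCH(asc_path) AGAINST('13' IN BOOLEAN MODE);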

This is the graphical representation of the example above:

[Figure: tree diagram]

Searching

To search for a specific element, the query would be:

SELECT
parent_id,
user_id,
depth,
asc_path,
node
FROM tree
WHERE asc_path LIKE '%,13,%';

Result:

parent_id  user_id  depth  asc_path   node
13         23       3      ,13,1,     L
13         22       3      ,13,1,     R
22         1568     4      ,22,13,1,  L
22         26       4      ,22,13,1,  R
23         1476     4      ,23,13,1,  L
23         690716   4      ,23,13,1,  R

Converting to a JSON array

Some databases, like PostgreSQL (section 9.42), have more helper functions to convert strings to JSON. In my case, I wanted to store the ascending tree path in a JSON field, which would give me the possibility of using JSON_CONTAINS(json_doc, val) to find the records that have a given node in their path.

To do that, I had to transform the string into a JSON array.

1st step: remove the leading and trailing commas

We start by removing the leading and trailing commas, but before running any UPDATE, let’s test what we are doing:

SELECT
parent_id,
user_id,
depth,
asc_path,
TRIM(BOTH ',' FROM asc_path) AS trimmed_commas
FROM tree;

Results:

parent_id  user_id  depth  asc_path   trimmed_commas
0          1        1
1          2        2      ,1,        1
1          13       2      ,1,        1
2          3        3      ,2,1,      2,1
2          61       3      ,2,1,      2,1
13         23       3      ,13,1,     13,1
13         22       3      ,13,1,     13,1
3          4        4      ,3,2,1,    3,2,1
3          156      4      ,3,2,1,    3,2,1
22         1568     4      ,22,13,1,  22,13,1

2nd step: add brackets to the string

A JSON array is wrapped in brackets ([]), and we need them around our string for it to be a valid JSON document:

SELECT
parent_id,
user_id,
depth,
asc_path,
TRIM(BOTH ',' FROM asc_path) AS trimmed_commas,
CONCAT("[", TRIM(BOTH ',' FROM asc_path), "]") AS added_brackets
FROM tree;

Results:

parent_id  user_id  depth  asc_path   trimmed_commas  added_brackets
0          1        1
1          2        2      ,1,        1               [1]
1          13       2      ,1,        1               [1]
2          3        3      ,2,1,      2,1             [2,1]
2          61       3      ,2,1,      2,1             [2,1]
13         23       3      ,13,1,     13,1            [13,1]
13         22       3      ,13,1,     13,1            [13,1]
3          4        4      ,3,2,1,    3,2,1           [3,2,1]
3          156      4      ,3,2,1,    3,2,1           [3,2,1]
22         1568     4      ,22,13,1,  22,13,1         [22,13,1]

3rd step: validate that the change works

Let’s use JSON_VALID() to see if it will accept our new string as JSON. Keep in mind that when the argument is NULL, the return value is also NULL:

SELECT
parent_id,
user_id,
depth,
asc_path,
TRIM(BOTH ',' FROM asc_path) AS trimmed_commas,
CONCAT("[", TRIM(BOTH ',' FROM asc_path), "]") AS added_brackets,
JSON_VALID(CONCAT("[", TRIM(BOTH ',' FROM asc_path), "]")) AS json_valid
FROM tree;

Results:

parent_id  user_id  depth  asc_path   trimmed_commas  added_brackets  json_valid
0          1        1
1          2        2      ,1,        1               [1]             1
1          13       2      ,1,        1               [1]             1
2          3        3      ,2,1,      2,1             [2,1]           1
2          61       3      ,2,1,      2,1             [2,1]           1
13         23       3      ,13,1,     13,1            [13,1]          1
13         22       3      ,13,1,     13,1            [13,1]          1
3          4        4      ,3,2,1,    3,2,1           [3,2,1]         1
3          156      4      ,3,2,1,    3,2,1           [3,2,1]         1
22         1568     4      ,22,13,1,  22,13,1         [22,13,1]       1
22         26       4      ,22,13,1,  22,13,1         [22,13,1]       1
23         1476     4      ,23,13,1,  23,13,1         [23,13,1]       1
23         690716   4      ,23,13,1,  23,13,1         [23,13,1]       1
61         1051     4      ,61,2,1,   61,2,1          [61,2,1]        1
61         62       4      ,61,2,1,   61,2,1          [61,2,1]        1

Replacing the 1st and 2nd steps with a function

So that your query gets easier to read and less messy, you can create a function. I decided to create to_json_array(input_string, delimiter_char).
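A minimal sketch of such a function, assuming it simply combines the TRIM and CONCAT from the previous steps (the space added after each comma only mimics MySQL’s own JSON formatting and is not required for validity):

DELIMITER //

CREATE FUNCTION to_json_array(input_string TEXT, delimiter_char CHAR(1))
RETURNS TEXT DETERMINISTIC
BEGIN
  -- Trim the leading/trailing delimiters and wrap the list in brackets.
  -- A NULL input propagates and returns NULL, like the root row above.
  RETURN CONCAT('[',
                REPLACE(TRIM(BOTH delimiter_char FROM input_string),
                        delimiter_char, ', '),
                ']');
END //

DELIMITER ;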

Running the query only with to_json_array on MySQL:

SELECT
parent_id,
user_id,
depth,
asc_path,
to_json_array(asc_path, ',') AS to_json_array,
JSON_VALID(to_json_array(asc_path, ',')) AS is_to_json_array_valid,
node
FROM tree;

Result:

parent_id  user_id  depth  asc_path   to_json_array  is_to_json_array_valid  node
0          1        1
1          2        2      ,1,        [1]            1                       L
1          13       2      ,1,        [1]            1                       R
2          3        3      ,2,1,      [2, 1]         1                       L
2          61       3      ,2,1,      [2, 1]         1                       R
13         23       3      ,13,1,     [13, 1]        1                       L
13         22       3      ,13,1,     [13, 1]        1                       R
3          4        4      ,3,2,1,    [3, 2, 1]      1                       L
3          156      4      ,3,2,1,    [3, 2, 1]      1                       R
22         1568     4      ,22,13,1,  [22, 13, 1]    1                       L

Disclaimer

This function is not native to MySQL, and its behavior in production is not guaranteed.

Notice that the database reports the generated JSON as valid, making it possible to convert that TEXT into a new column, asc_path_json:

ALTER TABLE tree
ADD COLUMN asc_path_json JSON
AFTER asc_path;

UPDATE tree
SET asc_path_json = to_json_array(asc_path, ',');

Which gives us the ability to check more quickly if an item is in the path for that node:

SELECT *
FROM tree
WHERE JSON_CONTAINS(asc_path_json, '13');

Result:

id  parent_id  user_id  depth  asc_path   asc_path_json  node
6   13         23       3      ,13,1,     [13, 1]        L
7   13         22       3      ,13,1,     [13, 1]        R
10  22         1568     4      ,22,13,1,  [22, 13, 1]    L
11  22         26       4      ,22,13,1,  [22, 13, 1]    R
12  23         1476     4      ,23,13,1,  [23, 13, 1]    L
13  23         690716   4      ,23,13,1,  [23, 13, 1]    R

Fast data import trick

A few weeks ago my friend Frank de Jonge told me he managed to bring an import into a MySQL server down from more than 10 hours to 16 minutes. According to him, it had to do with fixing several field types (fields far too long for really small data), the number of indexes, and the constraints on the tables. We were talking about 1 million records here. He wondered if it was possible to make it even faster.

The basics

It turns out there are many ways of importing data into a database; it all depends on where you are getting the data from and where you want to put it. Let me give you a bit more context: you may want to get data from a legacy application that exports to CSV into your database server, or even move data between different servers.

If you are pulling data from one MySQL table into another MySQL table (let’s assume they are on different servers), you might as well use mysqldump.

To export a single table:

$ mysqldump -h localhost -u root -p --extended-insert --quick --no-create-info mydb mytable | gzip > mytable.sql.gz

A bit more about this line:

  • --extended-insert: makes sure the dump is not one INSERT per row; a single statement can insert dozens of rows.
  • --quick: useful when dumping large tables; by default MySQL reads the whole table into memory and then dumps it into a file, with this option the data is streamed without consuming much memory.
  • --no-create-info: only the data is exported; no CREATE TABLE statements will be added.

The complex

The problem my friend faced was a bit more complex. He needed to generate the dump file himself because his data came from somewhere else (later on I advised him about the benefits of LOAD DATA INFILE), but since 90% of his work was already done, he wanted to know:

Why is it faster when I insert blocks of 50 rows than when I do it with 500?

There could be N reasons for that:

  • buffering 500 rows into memory is slower than buffering 50; remember, you are reading from disk, and that is always slow.
  • if no transactions are used, the indexes get updated at the end of each INSERT; for 1 million rows at 50 values per statement that is 20k INSERTs, while with 500 it would be 2k statements. My speculation here is that InnoDB indexes are BTREEs: building them slowly means you “know” where the values go in the tree, so sorting and organising is a fast search, while with 500 items you need to reorganise a lot of information at once. Again, this is speculation.

Suggestions

Transactions

My first suggestion was: wrap everything in a single transaction. Put a START TRANSACTION at the beginning and a COMMIT statement at the end. That way the index updates and the foreign key checks are settled once, at the end of the script.
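In the dump file that boils down to something like this (the table and column names are just placeholders):

START TRANSACTION;

INSERT INTO mytable (id, name) VALUES (1, 'a'), (2, 'b');
INSERT INTO mytable (id, name) VALUES (3, 'c'), (4, 'd');
-- ... thousands more INSERT statements ...

COMMIT;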

He reported a minor improvement on performance.

The Danger

I knew from the beginning of a way to make his import really fast, but since the source of his data wasn’t as trustworthy as the database itself, it could result in duplicated data or missing foreign keys; it could end really, really badly.

When you use mysqldump, MySQL puts this option in the dump by default, because it’s fair to assume you are importing into an empty database, so there are no data integrity problems. That wasn’t the case here.

The data had been manipulated before being inserted, so the trick I told him, and I quote, was:

SET foreign_key_checks = 0;
/* do your stuff REALLY CAREFULLY */
SET foreign_key_checks = 1;

The import went from 16 minutes to 6 minutes. He got super happy 😀.

And people on the internet got curious (because Frank is famous now, apparently).

I confess it was fun to see the time cut down by more than half, but use this with caution.

An even faster way

CSV files. Yes, that’s faster. Specifically TSV, since any string can contain a comma.

To generate:

$ mysqldump -h localhost -u root -p --tab=/tmp mydb mytable

Or if you are manipulating the data yourself from another source, don’t forget to use \N for NULL values.

To load:

$ mysql -h localhost -u root -p
mysql> LOAD DATA INFILE '/tmp/mytable.txt' INTO TABLE mytable;
Query OK, 881426 rows affected (29.30 sec)
Records: 881426 Deleted: 0 Skipped: 0 Warnings: 0

The same data with bulk INSERTs took over a minute. There are many variables when dealing with that statement, such as buffer sizes and the key checking itself, so for high-volume data, importing straight from a text file is still the fastest option.

Conclusion

As I said before, it was just a matter of disabling the constraint checks in the script. Only do that if you are sure the data is good; otherwise, options like net_buffer_length, max_allowed_packet and read_buffer_size can help you import big SQL files. Also, in most cases this should be considered: Data Integrity > Performance.
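For reference, these are the kinds of statements involved when raising those variables for a big import (the values below are arbitrary examples, not recommendations, and changing globals requires the appropriate privileges):

SET GLOBAL max_allowed_packet = 268435456;  -- 256 MB; picked up by new connections
SET GLOBAL net_buffer_length = 1048576;     -- 1 MB starting size for the per-connection buffer
SET SESSION read_buffer_size = 8388608;     -- 8 MB for this session's sequential reads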