Month: April 2016

Why you need a Data Engineer

yzqgz
Dilbert

Tech evolves quickly. When the buzzword Big Data started showing up more and more, the market was in need of people able to analyse and give meaning to what was collected. For instance, an article published in 2012 by Harvard Business Review was entitled: Data Scientist: The Sexiest Job of the 21st Century.

Today we have DBA, Data Scientist, Data Engineer, Data Analyst, a wealth of options with the “Data” as prefix. More often than not, people put everyone in the same basket and assume everyone knows and has the same set of skills.

From my point of view and perspective as a Data Engineer, these are the differences:

  • DBA – Once the person I hated the most in any team. Seriously, why doesn’t that human give me the necessary permissions on the database? If I had access, I would have done my job sooner… Well, that was my thought as a Software Engineer at the time. Turns out, DBAs are, what my friends and I used to call, the database babysitter. You need to tune and figure out why performance is not as it should be? Need help with a complicated query? That’s the go-to person for it. But notice, this is RDBMS specific and heavily focused on the operational part.

  • Data Scientist – The market usually wants a professional with a PhD in statistics or an otherwise heavily math-oriented person. This person will be responsible for creating prediction models based on current data. Do you know how Amazon knows what you should buy next based on your browsing history? Yeah, this individual probably did the programming around that, has machine learning down to a T, and needs to possess Product, Engineering and Statistics knowledge.

  • Data Analyst – This person also deals with a bit of statistics, but more in the business sense, dealing with and creating reports for Business Intelligence. This role tries to answer business questions; identifying where data acquisition/quality is failing, for example.

Data Engineer – This role I can explain with more passion: it is what I do, so this probably will be biased. We are the bridge. We help Software Engineers to build the application for storage and retrieval in a manner which provides the Data Scientists and Data Analysts with the information they need to do their job.

So why do you need a Data Engineer on your team?

We do ETL (Extract-Transform-Load). We put data in the Data Warehouse, it’s from there that Data Analysts and the Data Scientists get part of the information they need. We may ended up sending data to Hadoop for instance too. They don’t query into your main relational database or on your MongoDB cluster (not usually),

I’ve seen queries taking hours to run because the main DB is not structured to deliver the information the way they need. That’s because when trying to do a new application we think in a normalised database, some of those professions need a star schema as in a Data Warehouse for example, or totally non normalized data if you are optimising searches on Solr.

We devise the best strategy for caching information, design database architecture, NoSQL clusters. Should this JSON return from the Facebook API really be stored in the relational database? (Short answer: no.) Should this query with a LIKE '%string%' really be running in the application (no.) and not getting data from Elasticsearch (probably.)?

We work with RDBMS, NoSQL, Search Engines, Cache engines. I particularly make a lot of use of RDBMS since most of my work has been on Legacy applications. As an example, one of our responsibilities is to lower the load on RDBMS for unnecessary stored data presented there.

It is still necessary to know about topics like: indexing, transactions, query profiling and performance tuning.

To sum up: We are the wild card of storage technology.

Developers don’t usually care about the precise details of data. They just want what is fast and easy to use. They think about delivery, not so much about long term data retrieval.

“I am going to store this access log for my website on that table.”

They probably didn’t stop to think that that table will potentially have million of records within a span of months. Why not ELK? Cassandra? Those tools allows the information to be fast retrievable with Elasticsearch and easily scalable with Cassandra. That way you don’t overload your main RDBM system with the same query that’s being repeated plenty of times through a TEXT field for instance. See also : Strip your TEXT field.

How should you work with a Data Engineer?

You know that feature you want to implement? Talk to us first. You can design the application the way you want, but we give you the insight into the data layer.

Do not isolate us on your team, we need support from your DevOps, we need the engineers to be our partners and to collaborate on problem solving; we are not only be there to run the queries they deem necessary because they don’t have access to the database. If you want to have that wicked fast search by category on your ecommerce, I guarantee you that LIKE is not the fast approach, The DevOps will help us setup the environment for the platform to take the best advantage, many developers may foreseen this, but most don’t. Again, we are the bridge. This is our job.

You can have another view of this on this blogpost: The different data science roles in the industry.

NOT NULL all the things!

Different types of languages deal with this “value” in diverse ways. You can have a more comprehensive list of what NULL can mean on this website. What I like to think about NULL is along the lines of invalid, as if some sort of garbage is stored there. It doesn’t mean it’s empty, it’s just mean that something is there, and it has no value to you.

Databases deal when storing this type in a similar way, PostgreSQL treats it as “unknown” while MySQL treats it as “no data“.

Both databases recommend using \N to represent NULL values where import or exporting of data is necessary.

When to use it

You don’t. Particularly, I DON’T recommend using NULL.

NULL doesn’t mean empty

So if you want to represent lack of data or optional fields use a default value. It’s bad sign of architecture having NULLABLE fields, there is an extra case to test and to write for. It adds unnecessary complexity.

However, there is one case where I do think NULL is acceptable. And that is when working with MySQL date related fields. I will talk more about this further down.

How to Query it

MySQL doesn’t recognize field = NULL because, remember, NULL means invalid, not empty. Thus using it will not return any rows.

As much as NULL value will never be equal to another NULL, when using ORDER BY, GROUP BY and DISTINCT, the server interprets the values as equal. Aggregators functions such as MIN(), SUM() and COUNT() ignore NULL values, except for COUNT(*) that counts rows, and not columns.

When using ORDER BY a column is nullable the NULL values appear first if instructed as ASC and in the end if DESC is requested.

PostgreSQL on the other hand has an option to convert equal comparisons expressions to field IS NULL, if enabled (transform_num_equals).

The ordering for ORDER BY depends on indexing of the field, by default NULL comes first, but you can specify when creating an index where the NULL values should be: top or bottom.

For aggregators functions, PostgreSQL works the same way.

Performance

Having NOT NULL columns permits similar performance on MySQL as an column = 1 do. However that doesn’t happen in LEFT JOIN operations while a field could be NULL. But this is for the type of queries where IS NULL is used:

# Assuming that `active` column is NOT NULL
SELECT * FROM users WHERE active = 1 OR active IS NULL;

Summing up directly from MySQL documentation:

Declare columns to be NOT NULL if possible. It makes SQL operations faster, by enabling better use of indexes and eliminating overhead for testing whether each value is NULL. You also save some storage space, one bit per column. If you really need NULL values in your tables, use them. Just avoid the default setting that allows NULL values in every column.

The COALESCE() function

This function return the first non-null result of a column. Keep in mind to perform this operation on non-indexed columns. It is slow by its nature, just a friendly warning of when you are using to be mindful of its fallback.

The exception to the rule

I think this applies to MySQL databases. DATE/DATETIME should not be allowed to be NULL if, and only if the sql_mode directive NO_ZERO_DATE is disabled.

What does it mean? NO_ZERO_DATE doesn’t allow for 0000-00-00 to be inserted in a DATE/DATETIME. MySQL 5.7 sql_mode insures some restrictions into the database by default. If for instance you have DATE column that is NOT NULL and doesn’t pass a value to it, 0000-00-00 will be saved, because the column is NOT NULL, BUT will give a warning:

mysql> DESCRIBE users;
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(45) | NO | | NULL | |
| created_at | datetime | NO | | NULL | |
+------------+------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

mysql> INSERT INTO users (name) VALUES ("Gabi");
Query OK, 1 row affected, 1 warning (0.01 sec)

mysql> SHOW WARNINGS;
+---------+------+-------------------------------------------------+
| Level | Code | Message |
+---------+------+-------------------------------------------------+
| Warning | 1364 | Field "created_at" doesn't have a default value |
+---------+------+-------------------------------------------------+
1 row in set (0.00 sec)

Making this operation to make sure no zero date will be allowed as it is by default in MySQL 5.7:

SET @@GLOBAL.sql_mode = "STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE;

When trying to insert a similar query:

mysql> INSERT INTO users (name) VALUES ("Blossom");
ERROR 1364 (HY000): Field "created_at" doesn't have a default value

To sum it up

If you have MySQL and NO_ZERO_DATE is in your sql_mode, you should ALWAYS use NOT NULL. MySQL 5.7 brings the mode enabled by default among with other things, read more here.

If you don’t have it enabled for any other reason, then DATE/DATETIME MAY be NULL, because data integrity > performance in this case.

Again, 0000-00-00 IS NOT a valid date.