July 9, 2016

The three most important habits of a data analyst

I've been doing data analysis for almost 15 years -- mostly using Excel and Business Intelligence tools. And from the very first year I believe that accuracy is the biggest challenge for a data analyst. Accuracy is fundamental because if a calculation result is incorrect then everything else that is based on it -- visualizations, judgements and conclusions, become irrelevant and worthless. Even performance is not so important, because sometimes you can solve a performance problem by throwing in more hardware, but that would never fix incorrect calculation logic.

Ensuring accuracy is probably the most important skill a data analyst should master. To me, striving for accuracy is a mental discipline developed as a result of constant self-training, rather than something that can be learned overnight. There are three practical habits to develop this skill:

1) Sanity checks. These are quick litmus tests that allow detecting grave errors on early stages. After you get a calculation result for the first time, ask yourself -- does it make sense? Does the order of magnitude look sane? If it's a share (percentage) of something else -- is it reasonably big/small? If it's a list of items -- is it reasonably long/short? Sounds like a no-brainer but people tend to skip sanity checks frequently.

2) Full assumption testing. In my experience this habit is most overlooked by beginner analysts. Assumptions should not be opinions, they must be verified facts. "We're told that field A has unique keys" -- verify it by trying to find duplicate values in it. "Field B has no nulls" -- again, verify it by counting nulls or check data model constraints (where applicable). "They said that Gender is encoded with M and F" -- verify it by counting distinct values in field Gender. Whatever assumptions are used for filtering, joining or calculation -- absolutely all of them must be tested and confirmed prior to doing anything else. Once you develop this habit you would be surprised how often assumptions turn out to be wrong. A good data analyst can spend a few days to verify assumptions before even starting analyzing data itself. Sometimes assumptions are implicit -- e.g. when we compare two text fields we usually implicitly assume that neither of if has special symbols or trailing spaces. A good data samurai is able to see invisible recognize implicit assumptions and test them explicitly.

3) Double calculation. This habit is sometimes overlooked by even experienced analysts. Probably because it requires sometimes rather tedious effort. This habit is about creating alternative calculations, often created in a different tool -- typically Excel. The point is to test the core logic, therefore such alternative calculation can include only a subset of original data and do not cover minor cases. The results achieved using alternative calculation should be equal to results of the main calculation logic, regardless whether it's done in SQL or some BI/ETL tool.

Let the Accuracy be with you.

July 5, 2016

Columnar in-memory ETL

Columnar databases are not exotic anymore. They're quite widespread and their benefits are well-known: data compression, high performance on analytical workloads, less demanding storage I/O throughput requirements. At the same time, ETL tools currently are still exclusively row-based, as they were 10 years ago or 20 years ago. Below, I'm describing a working principle of a columnar in-memory ETL [1], its differences in comparison with row-based ETL/ELT tools, and area of applicability (spoiler alert -- it's not just data transformation).

Incremental column-based transformations
Row-based ETL tools transform data row by row, when possible. For complex transformations, like aggregations, they build temporary data structures either in memory or on disks. This approach dictates a particular pattern -- row-based ETL tools strive to create as few temporary data structures as possible in order to reduce data redundancy and minimize number of full table scans, because they are very costly.

A column-based ETL tool, similarly to a columnar database, operates not with rows of uncompressed data, but with compressed columns (typically using vocabulary compression). Unlike a row-based ETL system, it can re-use datasets and transform compressed data directly (i.e. without decompression) and incrementally. Let me illustrate it with two examples: calculating a new column, and filtering a table:

1) Calculating a new column is simply adding a new column to existing table. Since a table is a set of columns, the result would be the original table (untransformed) + the newly calculated column. Note that in this case data is not moved -- we simply attach a new column to already existing table which is possible because our data is organized by columns, not by rows. In the case of a row-based ETL, we would have to update each row.

2) Filtering compressed columns can be done in two steps. Let's assume that we're using the vocabulary compression. In this case a column is represented as a combination of a vocabulary of unique entries, and a vector of pointers to the vocabulary entries (one pointer per row). Filtering can be done by marking selected vocabulary entries first, and then rebuilding the vector by removing pointers to not selected entries. Here, the benefit is double: we don't calculate a filtering condition (which is a heavy operation) for every row, but only for the vocabulary which is typically much shorter. Rebuilding the vector is a fast operation since it doesn't require calculating the filtering condition. Another benefit is that we don't have to rebuild vocabulary -- we can keep using the old vocabulary with the new vector, thus greatly reducing data redundancy.

In the examples above transformations are incremental -- i.e. new tables are obtained by incrementally modifying and re-using existing data.

Transformations like aggregation, sorting, unpivoting and some other can also be done by directly utilizing compressed data structure to a greater or lesser extent.

Incremental column-based transformation allows greatly reduced data redundancy, which brings us to the next topic:

In-memory transformations
Because of reduced redundancy and data compression, column-based ETL is a good candidate for in-memory processing. The obvious downside is, apparently, the limitation by RAM (which will be addressed below). The upsides of keeping all data in-memory are:
  • Increased performance due to elimination of slow disk I/O operations.
  • The ability to instantly view results of literally every transformation step without re-running transformations from the beginning. A columnar ETL effectively stores all results of transformations with a relatively little memory overhead, due to data compression and incremental logic.

Storing all data in memory has a quite interesting and useful consequence, barely possible for row-based ETL tools:

Reactive transformations
Having all intermediate transformation results in memory lets us re-calculate transformations starting from any point, instead of running everything from the beginning as in the case with traditional ETL tools. For instance, in a chain of 20 transformations we can modify a formula in the 19th transformation and recalculate only last two transformations. Or last 5, if we decide so. Or last 7. If transformations are a non-linear graph-like workflow, we can intelligently recalculate only necessary transformations, respecting dependencies.

(click to zoom)

Effectively, it enables an Excel-like experience, where transformations are recalculated automatically when one of them changes, similarly to Excel formulas that are re-evaluated when another formula or a cell value changes.

This creates a whole new experience of working with data -- reactive, exploratory, self-service data manipulation.

Resume: a new kind of tool for data
The columnar representation allows incremental transformation of compressed data, which in turn makes it possible to greatly reduce redundancy (typical for row-based ETL tools), and keep entire datasets in memory. This, in turn, speeds up calculations and enables reactive, interactive data exploration and data analysis capabilities.

Columnar in-memory ETL is basically a new kind of hybrid technology in which there is no distinct borderline between data transformation and data analysis. Instead of slicing and dicing a pre-calculated data model like OLAP tools do, we get the ability to explore data by transforming it on the fly. At the same time it does the traditional ETL job, typical for row-based ETL utilities.

The RAM limitation still remains though. It can be partially mitigated with a data partitioning strategy, where a big dataset is sliced into parts which then are processed in parallel in a map/reduce fashion. In the long term, the Moore's law is still effective for RAM prices which benefits in-memory data processing in general.

Row-based vs Columnar In-memory ETL

All-in-all, while processing billions of rows is still more appropriate for row-based ETL tools than for columnar ones, the latter represent a new paradigm of mixed data transformation and analysis, which makes it especially relevant now, when public interest to self-service data transformation is growing.

[1] You can take a look at an example of a columnar in-memory ETL tool here: http://easymorph.com.