Analytics at vaamo
We implemented a new data analytics solution. This is how it turned out.
We needed a change in analytics
At the beginning of this year we arrived at a conclusion. Vaamo as a company is now in a situation where it needs to put in place a proper analytics setup. Up until now we fulfilled our reporting needs in a rather pragmatic way. We wrote several JSON blobs for different data sets in one of our production databases daily, out of which we created CSV files we used for further analysis. This approach, as you can imagine is not that scalable. It required manual analyses with Excel every day and did not allow us to see connections between data sets without more extensive analyses.
We needed a more fitting approach. After looking into different approaches to do data warehousing and business intelligence, we decided on a prototype setup consisting of a data warehouse built as a Data Vault running on our production Postgres database. On top of that we decided to build database views to answer our most interesting questions. We also wanted a solution to give non-SQL-savvy users access to that data. We decided on Metabase to solve that problem.
How did it turn out?
For many, especially rather simple queries on smaller datasets this worked just fine. We could use Metabase to display meaningful metrics over time pretty simply. As we started to identify more complex queries that needed joins over multiple bigger tables, however, SQL views were way to slow to properly interact with data.
How did we approach that issue?
Our first approach was reducing the amount of data to only import the data since November 2016, which is the time we switched to our new discretionary portfolio management offering. This was the most interesting timeframe to generate knowledge from because from this point on we could compare the data for the different offerings. In some areas, this definitely brought performance gains, most views that were slow before were slightly less slow, but still not usable. As a quick win to squeeze more performance out of these views we materialized them. Materialized views are essentially one big snapshot of the view at a given time. This means the database does not have to join at query time, but rather once when generating the materialized view. This meant significant performance improvements, almost all the queries we defined were reasonably fast. The trade-off however is as I mentioned, that materialized views are only a snapshot, which means, every time you import new data, you have to refresh the view (i.e. rebuilding it from scratch, which can take quite some time). Furthermore, when we worked on a particularly big query, which joined almost all the big tables in our database, the temporary tables Postgres creates during the building process of the materialized view became so big, Postgres ran out of memory. This showed us, that materialized views were also not the tool we needed, especially with the datasets getting even bigger as time goes on.
Our current solution
As I described, the limiting factor we experienced was the creation of the table we wanted to perform queries on. Clearly, writing all that data for every single date into one table in one transaction didn’t work. Additionally, rebuilding that whole table every time new data arrives was not a good solution either. The solution we came up with was this: We now write all the relevant data for one particular query into one table, one day at a time. This means at every point during the import, only the data for the current day has to be kept in memory. Even for our biggest data sets, Postgres seemed to have no problem doing so. Another advantage is, that every day, only the data for that particular day has to be inserted into that table. Since as you can imagine this table becomes pretty big, idempotent inserts became pretty expensive after a while, however, indices on all the fields of said table, improved performance very significantly. Importing the data for one day took about 2.5 minutes, but with an index we could reduce the insert to about a second. Obviously, the index also helps with queries on that table, even queries including aggregating data in tables with multiple million rows, return in about a second. After using that time slice approach as a prototype for that one particularly big query, we intend to replace all our views with these time slice tables in the long run. Currently our setup is a mixture of all three approaches:
Our experience with Metabase
On the first look, Metabase looks pretty cool for a free BI solution. It is easy to describe questions without the need to know SQL. The data visualization is also pretty good, making it easy to generate pretty looking graphs and diagrams. But, the more we interacted with it, the more we realized Metabase is more of a tool to access data instead of performing extensive analyses on it. One shortcoming we stumbled across was that Metabase does not support joins. The only way to combine multiple tables is to define foreign keys from other tables, allowing you to combine data from exactly two, predefined tables. Our expectation was to be able to join tables to perform more extensive analyses. That way, every time you want to combine two new tables, you have to define a SQL view or table to access that data in Metabase. Yet, For the time being, we decided, this trade-off is still worth it for a free open source BI tool.
A Look into Vaamo’s analytics future
So, how do we plan to go on from here? Currently we are busy fixing some inconsistencies we experience in the data we currently have and automating the daily import and time slice creation. In the long term we want to add other data sources to our data warehouse and are evaluating at which point the amount of data we want to analyze gets to big for our Postgres instance to handle and when cloud based data warehousing solutions become necessary performance-wise.