International Conference of Developers
and Users of Free Software

Using Hadoop stack to build a cloud VAT declarations revising service

Alex Chistyakov, Saint-Petersburg, Russian Federation

LVEE 2016

Recent changes in Russian federal law require organizations to submit tax information in electronic form. The Federal Tax Service of Russia checks this information and sends automatically generated claims. The “Kontur.NDS+” project helps businesses perform the same check in the cloud before tax documents are submitted to the Federal Tax Service. “Kontur.NDS+” is built on the Hadoop stack (which is quite common), sharded and replicated Solr (which is quite common too, but required some tuning) and Perl (which is not common at all). The project has evolved over the last 1.5 years, and this evolution is the subject of this presentation.

Hadoop is a free/libre software project that includes tools, libraries and a framework for building distributed applications hosted on clusters of hundreds or thousands of nodes. Hadoop is used as the basis of search engines at many well-known high-load websites, including Facebook.

Our own Hadoop-based project presented here was started more than 1.5 years ago and was essentially a startup at the time. We did not get brand-new servers and therefore had to design a fully fault-tolerant system without a single point of failure (SPOF). The standard Hadoop stack is not fault-tolerant out of the box, so we had to carefully plan and implement a quorum-based system.
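The idea behind a quorum-based design can be illustrated with a minimal sketch. It assumes the simplest rule, a strict majority of nodes; the function names are hypothetical and are not taken from the project or from any Hadoop API.

```python
def has_quorum(alive_nodes: int, total_nodes: int) -> bool:
    """Return True when a strict majority of the nodes is alive."""
    if total_nodes <= 0 or not 0 <= alive_nodes <= total_nodes:
        raise ValueError("invalid node counts")
    return alive_nodes > total_nodes // 2


def tolerated_failures(total_nodes: int) -> int:
    """Return how many nodes may fail while a strict majority remains."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return (total_nodes - 1) // 2
```

Under this rule an ensemble of three nodes survives one failure and an ensemble of five survives two, while a fourth node adds no extra tolerance over three. This is why majority-based services are usually deployed on an odd number of nodes.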

Since our project was the only cloud-based solution for checking tax documents, it quickly came to dominate the market. Our team had to use advanced performance measurement and optimization techniques to meet customer expectations. We tried many single-node and cluster-based monitoring tools before we finally settled on a stack built from Graphite [1], Whisper [2] and Grafana [3].

We record and analyze many Hadoop- and HBase-related metrics, JVM (memory and GC-related) metrics and our custom business metrics. We use flame graphs (a technique popularized by Brendan Gregg [4]) to record and analyze Perl stack frames. We also had to dump Solr [5] memory to better understand why it needed so much RAM.
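As an illustration of how a custom business metric can reach Graphite, here is a minimal sketch that uses Carbon's plaintext protocol: one line of the form "path value timestamp" per data point, sent over TCP (port 2003 by default). The metric name, host and function names are assumptions made for this example and do not describe the project's actual setup.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: int) -> str:
    """Build one line of Carbon's plaintext protocol."""
    return f"{path} {value} {timestamp}\n"


def send_metric(path: str, value: float, host: str = "localhost",
                port: int = 2003, timestamp: Optional[int] = None) -> None:
    """Send a single data point to a Carbon plaintext listener."""
    ts = int(time.time()) if timestamp is None else timestamp
    line = format_metric(path, value, ts)
    # One short-lived TCP connection per data point keeps the sketch simple;
    # a real sender would normally batch points or reuse the connection.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A call such as send_metric("nds.declarations.checked", 42) would record one point under a hypothetical metric path, provided a Carbon listener is reachable on the given host and port.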

We have also tailored our software development process to meet performance requirements. The number of requests and the overall load vary during the year, peaking at the end of each quarter, so we were able to plan and perform thorough stress testing. We now repeat this stress-testing procedure routinely every quarter.

Perl in particular, and dynamically typed languages in general, do not seem to be a safe choice for implementing a large amount of business logic, but we were able to overcome the limitations of a dynamically typed language. We rely on a thorough code review process and a set of assertions and cross-checks to ensure our implementation is valid.
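The following sketch shows what assertions and cross-checks of this kind might look like. It is written in Python rather than Perl, and the record layout, field names and amounts are invented for the example; it is not the project's code.

```python
from decimal import Decimal


def check_invoice(invoice: dict) -> None:
    """Assert that a (hypothetical) invoice record is well formed."""
    number = invoice.get("number")
    assert isinstance(number, str) and number, "missing invoice number"
    for field in ("net", "vat", "gross"):
        assert isinstance(invoice.get(field), Decimal), f"{field} must be a Decimal"
    # Cross-check: the three amounts must agree with each other.
    assert invoice["net"] + invoice["vat"] == invoice["gross"], "net + vat != gross"


def total_vat(invoices: list) -> Decimal:
    """Sum VAT over all invoices and verify the result in a second way."""
    for invoice in invoices:
        check_invoice(invoice)
    total = sum((inv["vat"] for inv in invoices), Decimal("0"))
    # Cross-check: recompute the same total from gross and net amounts.
    recomputed = sum((inv["gross"] - inv["net"] for inv in invoices), Decimal("0"))
    assert total == recomputed, "VAT totals disagree"
    return total
```

The assertions compensate for the missing static type checks by failing early on malformed data, and computing the total in two independent ways catches logic errors that a type system would not detect anyway.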

Every quarter brings a lot of new documents, so we are constantly looking for the right tools to do our job better. We are currently testing the Phoenix extension to HBase. Phoenix [6] is basically an SQL-like engine which does not use Hadoop YARN resource management to schedule and run queries. Phoenix will allow us to experiment with ad-hoc OLAP and OLTP (online analytical processing and online transaction processing) queries.

And, of course, we use and love PostgreSQL, but, as is well known, it is for not-so-big data.


[1] Graphite monitoring tool

[2] Whisper database

[3] Grafana data visualizer

[4] B. Gregg. CPU Flame Graphs

[5] Solr search platform

[6] Apache Phoenix

Abstract licensed under the Creative Commons Attribution-ShareAlike 3.0 license