At Telecoming we started using Kylin as our main analytics database in a new Business Intelligence project that started by the end of 2017. We moved from a custom report engine based in MySQL and later in AWS Redshift into a fully Hadoop based solution with Kylin as last step prior reporting generation. We started with Kylin 2.2.1, migrated to 2.4.1 by mid 2018 and moved to 2.6.1 last month. Kylin 2.2.1 and our first release but had some annoying bugs and stability problems. It was not only due to less mature release but also because didn’t have experience running Kylin. Some design changes at cube level and optimizations (thank you, Alberto Ramon) helped us a lot in order to improve performance and stability but there were still some issues present that has been solved in later versions.
This is our experience.
What we were looking for with this upgrade:
1) Bug fixing. In particular one bug related with views when creating intermediate hive tables.
2) Performance enhancements. This is always welcomed.
3) Less dependency at HBase level. HBase is our main source of problems. In part, due to EMR (AWS) distribution that relies in S3 as main storage (now we are in HDP). Table metadata is frequenly accesed and response is not always good. So, we wanted to avoid Hbase storage for metadata table and move to mysql via jdbc.
1) Minimize downtime. Since our users do intensive use of reporting system, it’s imperative to minimize Kylin downtime.
2) Easy rollback. Just in case of a unsuccessful upgrade, we need an easy way to rollback.
What we did:
1) We prepared a new AWS instance with Kylin 2.6.1, already configured and tuned. Of course, previously we had run a for a few weeks a Kylin 2.6.1 prototype and tweaked all config files in order to run properly. This new instance was stopped when we started the migration. It also had a brand new local mysql 5.7 for metadata storage instead of relying on HBase.
2) We stopped all cube buildings on Kylin 2.4.1. We waited until those that were in progress had finished. So, at query level everything was working although no more data were added.
3) We performed a full backup of Kylin metadata (metastore.sh backup) when all builds had finished. It took only 2 minutes.
4) We performed an Hbase snapshot over all Hbase tables. So, rolling back in case of a total failure to previous version was an easy task: restore metadata + clone snapshot into new table (removing upgraded table). It took only a few seconds since you can run all snapshot commands in a single sentence.
5) We performed a metadata restore on our new Kylin 2.6.1 instance (metastore.sh restore). In this way, all metadata were automatically migrated from Hbase based storage to jdbc based storage in a few minutes.
6) We upgraded coprocessor from 2.6.1 (with kylin.sh org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all).
7) We started new kylin version and checked that everything was in place. We could query all cubes and got proper responses.
8) After all test were finished, we pointed our DNS entries to new kylin 2.6.1 instance and resumed the building processes from it. Of course, kylin 2.4.1 was shutdown. Some days later, we removed hbase snapshots in order to free up storage space.
As you can imagine, our downtime was near zero at query level since all kylin queries were run at 2.4.1 until 2.6.1 was ready to serve requests.
Rollback plan (not needed)
Rollback plan was very simple. In case of problem, we would switch to previous version and we would restore all Hbase tables from previously taken snapshots. This snapshots included Hbase metadata table, so, no need to restore metadata.
What we have got
- Kylin 2.6.1 is more stable than 2.4.1. Only problems we have found at server level has been due to memory issues at platform level. Now, many steps are spark based and take some extra memory when launching jobs. We solved easily by adjusting JVM (-Xmx, -Xms) parameters for kylin 2.6.1.
- Cube builds are faster due to improvements at cube build level.
- User interface is really fast. Previously browsing cube build list took up to 10 seconds. Now, it almost instantaneous.
- Some bugs are not present anymore. Two of them were in particular very problematic:
- Hive intermediate tables from views didn’t add an uuid suffix. So, if the same table was used for other builds, it could be accidentally deleted by previous job at cleaning phase. So, build failed.
- Param “kylin.job.cube-auto-ready-enabled” was not working. So, after building a segment it was automatically enabled and not always we wanted this.
In general, upgrade procedure has been easy and fast. However, a lot of previous work had to be done in order to test that upgrade would not be a problem. So, we built a parallel environment in order to check that all new functionalities were working and bugs were solved. We adapted our config files to new Kylin requirements, for example by replacing metadata storage from hbase to jdbc.
Upgrading a Kylin version to a new one is, in general, a simple task. However, it’s imperative to read carefully documentation because sometimes it require extra steps, and of course, check parameters in order to detect new features or changes that has to be taken in consideration in the new version. Having a backup and a rollback plan (tested) to come back as fast as possible is also very important.
From our users side, the most successful problem solved is the interface response. They are really happy how kylin interface is working right now. And from support side dealing with a MySQL database instead of an Hbase is also a great improvement. Editing a parameter inside a json on hbase can be hard since there are not many applications for editing a modifying a Hbase table.