Compared with Kylin 3.x, Kylin 4.0 implements a new Spark build engine and Parquet storage, which makes it possible to deploy Kylin without a Hadoop environment. Compared with deploying Kylin 3.x on AWS EMR, deploying Kylin 4.0 directly on AWS EC2 instances has the following advantages:
1. Lower cost. AWS EC2 nodes are cheaper than the equivalent AWS EMR nodes.
2. More flexibility. On EC2 nodes, users can freely choose which services and components to install and deploy.
3. No Hadoop dependency. The Hadoop ecosystem is heavyweight and requires considerable effort to maintain; removing Hadoop brings the deployment closer to cloud native.

After implementing support for building and querying in Spark Standalone mode, we tried deploying Kylin 4.0 without Hadoop on an AWS EC2 instance, and successfully built cubes and ran queries.

Environment preparation

  • Apply for AWS EC2 Linux instances as required
  • Create an Amazon RDS for MySQL instance to serve as the metadata database for Kylin and Hive
  • Use S3 as Kylin's storage (a quick connectivity check is sketched below)
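
As a quick sanity check, you can verify from the EC2 instance that the S3 bucket and the RDS endpoint are reachable before continuing. The bucket name and RDS host below are placeholders for your own values:

    # assumes the AWS CLI is installed and credentials are configured on this instance
    aws s3 ls s3://your-kylin-bucket/
    # check that the MySQL port of the RDS instance is reachable (the nc utility may need to be installed first)
    nc -zv your-rds-endpoint.rds.amazonaws.com 3306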

Component version information

The component versions listed here are the ones we used during the test. If you need to deploy with other versions, you can substitute them yourself, but make sure the versions are compatible with each other.

  • JDK 1.8
  • Hive 2.3.9
  • Zookeeper 3.4.13
  • Kylin 4.0 for spark3
  • Spark 3.1.1
  • Hadoop 3.2.0 (no Hadoop services need to be started)

Deployment process

1 Configure environment variables

  • Modify profile

    vim /etc/profile
      
    # Add the following at the end of the profile file
    export JAVA_HOME=/usr/local/java/jdk1.8.0_291
    export JRE_HOME=${JAVA_HOME}/jre
    export HADOOP_HOME=/etc/hadoop/hadoop-3.2.0
    export HIVE_HOME=/etc/hadoop/hive
    export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
    export PATH=$HIVE_HOME/bin:$HIVE_HOME/conf:${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
      
    # Execute after saving the contents of the above file
    source /etc/profile
    

2 Install JDK 1.8

  • Download JDK 1.8 to the prepared EC2 instance and unzip it to the /usr/local/java directory:

    mkdir /usr/local/java
    tar -xvf java-1.8.0-openjdk.tar -C /usr/local/java
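
    To confirm that the unpacked JDK matches the JAVA_HOME configured in step 1 (the exact directory name depends on the JDK build you downloaded), run a quick version check:

    /usr/local/java/jdk1.8.0_291/bin/java -version   # should report version "1.8.0_..."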
    

3 Config Hadoop

  • Download Hadoop and unzip it

    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
    mkdir /etc/hadoop
    tar -xvf hadoop-3.2.0.tar.gz -C /etc/hadoop
    
  • Copy the jar packages required for S3 access into Hadoop's class loading path, otherwise ClassNotFound errors may occur

    cd /etc/hadoop
    cp hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar hadoop-3.2.0/share/hadoop/common/lib/
    cp hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar hadoop-3.2.0/share/hadoop/common/lib/
    
  • Modify core-site.xml to configure the AWS credentials and endpoint. The following is an example:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
      
        http://www.apache.org/licenses/LICENSE-2.0
      
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
      
    <!-- Put site-specific property overrides in this file. -->
      
    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>SESSION-ACCESS-KEY</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>SESSION-SECRET-KEY</value>
      </property> 
      <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.$REGION.amazonaws.com</value>
      </property>
    </configuration> 
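
    With the credentials and endpoint in place, a quick way to verify S3 access through the S3A connector is to list a bucket you own (the bucket name below is a placeholder):

    $HADOOP_HOME/bin/hadoop fs -ls s3a://your-kylin-bucket/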
    

4 Install Hive

  • Download Hive and unzip it

    wget https://downloads.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
    tar -xvf apache-hive-2.3.9-bin.tar.gz -C /etc/hadoop
    mv /etc/hadoop/apache-hive-2.3.9-bin /etc/hadoop/hive
    
  • Configure environment variables (HIVE_HOME was already added to /etc/profile in step 1; this can be skipped if so)

    vim /etc/profile
      
    # Add the following at the end of the profile file
    export HIVE_HOME=/etc/hadoop/hive
    export PATH=$PATH:$HIVE_HOME/bin:$HIVE_HOME/conf
      
    # Execute after saving the contents of the above file
    source /etc/profile
    
  • Modify hive-site.xml, vim ${HIVE_HOME}/conf/hive-site.xml. Please start the Amazon RDS for MySQL database in advance so that you can obtain the MySQL connection URL, user name and password.

    Note: Please configure VPC and security group correctly to ensure that EC2 instances can access the database normally.

    The sample content of hive-site.xml is as follows:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
       Licensed to the Apache Software Foundation (ASF) under one or more
       contributor license agreements.  See the NOTICE file distributed with
       this work for additional information regarding copyright ownership.
       The ASF licenses this file to You under the Apache License, Version 2.0
       (the "License"); you may not use this file except in compliance with
       the License.  You may obtain a copy of the License at
      
           http://www.apache.org/licenses/LICENSE-2.0
      
       Unless required by applicable law or agreed to in writing, software
       distributed under the License is distributed on an "AS IS" BASIS,
       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
       See the License for the specific language governing permissions and
       limitations under the License.
    --><configuration>
      <!-- Hive Execution Parameters -->
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>password</value>
        <description>password to use against metastore database</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://host-name:3306/hive?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>admin</value>
        <description>Username to use against metastore database</description>
      </property>
      <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
        <description>
          Enforce metastore schema version consistency.
          True: Verify that version information stored in metastore matches with one from Hive jars.  Also disable automatic
                schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
                proper metastore schema migration. (Default)
          False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
        </description>
      </property>
    </configuration>
    
  • Hive metadata initialization

    # Download the jar package of MySQL JDBC and place it in $HIVE_HOME/lib directory
    cp mysql-connector-java-5.1.47.jar $HIVE_HOME/lib
    $HIVE_HOME/bin/schematool -dbType mysql -initSchema
    mkdir $HIVE_HOME/logs
    nohup $HIVE_HOME/bin/hive --service metastore >> $HIVE_HOME/logs/hivemetastorelog.log 2>&1 &
    

    Note: If the following error is reported in this step:

    java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
    

    This is caused by a Guava version mismatch between Hive 2 and Hadoop 3. Replace the guava jar in the $HIVE_HOME/lib directory with the guava jar from the $HADOOP_HOME/share/hadoop/common/lib/ directory.
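
    After initialization, you can confirm that the metastore service and its MySQL backend are reachable by running a trivial statement from the Hive CLI:

    $HIVE_HOME/bin/hive -e "show databases;"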

  • To prevent jar conflicts later in the process, remove some Spark and Scala related jars from Hive's class loading path:

    mkdir $HIVE_HOME/spark_jar
    mv $HIVE_HOME/lib/spark-* $HIVE_HOME/spark_jar
    mv $HIVE_HOME/lib/jackson-module-scala_2.11-2.6.5.jar $HIVE_HOME/spark_jar
    

    Note: Only the conflicting jars encountered during our test are listed here. If you run into similar jar conflicts, determine which jars conflict from the class loading path and remove them. When the same jar exists in different versions, it is recommended to keep the version on the Spark class loading path.

5 Deploy Spark Standalone

  • Download Spark 3.1.1 and unzip it

    wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
    tar -xvf spark-3.1.1-bin-hadoop3.2.tgz -C /etc/hadoop
    mv /etc/hadoop/spark-3.1.1-bin-hadoop3.2 /etc/hadoop/spark 
    export SPARK_HOME=/etc/hadoop/spark 
    
  • Copy jar package required by S3:

    cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar $SPARK_HOME/jars
    cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars
    cp mysql-connector-java-5.1.47.jar $SPARK_HOME/jars
    
  • Copy hive-site.xml into the Spark conf directory (the MySQL JDBC jar was already copied in the previous step)

    cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
    
  • Start the Spark master and worker (a quick smoke test is shown after the commands)

    $SPARK_HOME/sbin/start-master.sh
    $SPARK_HOME/sbin/start-worker.sh spark://hostname:7077
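
    To confirm the standalone cluster is up, check the master web UI (port 8080 by default) or submit the SparkPi example bundled with the distribution (the exact examples jar name may vary by build):

    $SPARK_HOME/bin/spark-submit --master spark://hostname:7077 \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.1.jar 10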
    

6 Deploy Zookeeper

  • Download zookeeper and unzip it

    wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
    tar -xvf zookeeper-3.4.13.tar.gz -C /etc/hadoop
    mv /etc/hadoop/zookeeper-3.4.13 /etc/hadoop/zookeeper
    
  • Prepare the ZooKeeper configuration files. Since only one EC2 node is used in this test, a ZooKeeper pseudo-cluster is deployed here.

    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo1.cfg
    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo2.cfg
    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo3.cfg
    
  • Modify the above three configuration files in turn and add the following contents. Note that each file must use its own dataDir/dataLogDir, and each must listen on a distinct clientPort (for example 2181, 2182, 2183):

    server.1=localhost:2287:3387
    server.2=localhost:2288:3388
    server.3=localhost:2289:3389
    dataDir=/tmp/zookeeper/zk1/data
    dataLogDir=/tmp/zookeeper/zk1/log
    clientPort=2181
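
    For example, zoo2.cfg keeps the same server list but points to its own directories and port (one possible layout, following the zk1 values above):

    # zoo2.cfg - only the lines that differ from zoo1.cfg
    dataDir=/tmp/zookeeper/zk2/data
    dataLogDir=/tmp/zookeeper/zk2/log
    clientPort=2182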
    
  • Create the required folders and files; write 1, 2 and 3 respectively into the three myid files:

    mkdir -p /tmp/zookeeper/zk1/data
    mkdir -p /tmp/zookeeper/zk1/log
    mkdir -p /tmp/zookeeper/zk2/data
    mkdir -p /tmp/zookeeper/zk2/log
    mkdir -p /tmp/zookeeper/zk3/data
    mkdir -p /tmp/zookeeper/zk3/log
    vim /tmp/zookeeper/zk1/data/myid 
    vim /tmp/zookeeper/zk2/data/myid 
    vim /tmp/zookeeper/zk3/data/myid 
    
  • Start the ZooKeeper cluster

    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo1.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo2.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo3.cfg
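
    You can verify that each instance is running and see its role (leader or follower) with:

    /etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo1.cfg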
    

7 Setup Kylin

  • Download the Kylin 4.0 binary package and unzip it

    wget https://mirror-hk.koddos.net/apache/kylin/apache-kylin-4.0.0/apache-kylin-4.0.0-bin.tar.gz
    tar -xvf apache-kylin-4.0.0-bin.tar.gz -C /etc/hadoop
    export KYLIN_HOME=/etc/hadoop/apache-kylin-4.0.0-bin
    mkdir $KYLIN_HOME/ext
    cp mysql-connector-java-5.1.47.jar $KYLIN_HOME/ext
    
  • Modify kylin.properties, vim $KYLIN_HOME/conf/kylin.properties

    kylin.metadata.url=kylin_metadata@jdbc,url=jdbc:mysql://hostname:3306/kylin,username=root,password=password,maxActive=10,maxIdle=10
    kylin.env.zookeeper-connect-string=hostname
    kylin.engine.spark-conf.spark.master=spark://hostname:7077
    kylin.engine.spark-conf.spark.submit.deployMode=client
    kylin.env.hdfs-working-dir=s3://bucket/kylin
    kylin.engine.spark-conf.spark.eventLog.dir=s3://bucket/kylin/spark-history
    kylin.engine.spark-conf.spark.history.fs.logDirectory=s3://bucket/kylin/spark-history
    kylin.engine.spark-conf.spark.yarn.jars=s3://bucket/spark2_jars/*
    kylin.query.spark-conf.spark.master=spark://hostname:7077
    kylin.query.spark-conf.spark.yarn.jars=s3://bucket/spark2_jars/*
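
    The spark.yarn.jars entries above point to an S3 path; if your setup relies on them, the Spark jars have to be uploaded there beforehand (the bucket name is a placeholder, and this sketch assumes the AWS CLI is configured):

    aws s3 sync $SPARK_HOME/jars/ s3://bucket/spark2_jars/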
    
  • Execute $KYLIN_HOME/bin/kylin.sh start
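
    If startup succeeds, the Kylin web UI should respond on port 7070 of the instance (default account ADMIN/KYLIN); the startup log is the first place to check if it does not:

    tail -f $KYLIN_HOME/logs/kylin.log
    # Web UI: http://<ec2-public-ip>:7070/kylin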

  • Kylin may encounter ClassNotFound errors during startup. In that case, copy the missing jars into Kylin's class loading path as follows and then restart Kylin:

    # Download commons-collections-3.2.2.jar 
    cp commons-collections-3.2.2.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    # Download commons-configuration-1.3.jar
    cp commons-configuration-1.3.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    cp $HADOOP_HOME/share/hadoop/common/lib/aws-java-sdk-bundle-1.11.375.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-aws-3.2.0.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
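
    To validate the whole chain end to end, one option is to load Kylin's bundled sample data, then build the sample cube from the web UI and run a test query against it:

    $KYLIN_HOME/bin/sample.sh   # creates the sample Hive tables and the sample cube metadata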