Service Monitoring
Kylin provides the service monitoring for main components to help administrators obtain the service status and maintain instances.
Currently, we provide the following methods to monitor the core components in Kylin:
- Query: each Query node will records its service status in InfluxDB
- Build: each All node will records the service status and job status in InfluxDB
Two Rest APIs are provided to monitor and obtain the service status so that customers can integrate it with their own monitor platform.
- Get the Kylin cluster status by monitor query and building services. If the status is WARNINGorCRASH, it means the cluster is unstable.
- Get the service unavailable time with the specified time range and some detailed monitor data to help admins to track and retrospect.
How to Use
Get Cluster Status
GET http://host:port/kylin/api/monitor/status
- 
HTTP Header - Accept: application/vnd.apache.kylin-v4-public+json
- Accept-Language: cn
- Content-Type: application/json;charset=utf-8
 
- 
Curl Request Example curl -X GET \
 'http://host:port/kylin/api/monitor/status' \
 -H 'Accept: application/vnd.apache.kylin-v4-public+json' \
 -H 'Accept-Language: cn' \
 -H 'Authorization: Basic QURNSU46S1lMSU4=' \
 -H 'Content-Type: application/json;charset=utf-8'
- 
Response Details - active_instancesnumber of active instances in current cluster.
- query_statusquery service status. It could be GOOD / WARNING / CRASH
- job_statusbuilding service status. It could be GOOD / WARNING / CRASH.
- Jobjob instance status. It will show the instance details and status.
- queryquery instance status. It will show the instance details and status.
 
- 
Response Example {
 "code": "000",
 "data": {
 "active_instances": 1,
 "query_status": "GOOD",
 "job_status": "GOOD",
 "job": [
 {
 "instance": "sandbox.hortonworks.com:7070",
 "status": "GOOD"
 }
 ],
 "query": [
 {
 "instance": "sandbox.hortonworks.com:7070",
 "status": "GOOD"
 }
 ]
 },
 "msg": ""
 }
Get Cluster Status with Specific Time Range
GET http://host:port/kylin/api/monitor/status/statistics
- 
HTTP Header - Accept: application/vnd.apache.kylin-v4-public+json
- Accept-Language: cn
- Content-Type: application/json;charset=utf-8
 
- 
URL Parameters - start-- required- Longtimestamp. Get the monitor data greater than or equal to the timestamp.
- end-- reuquired- Longtimestamp. Get the monitor data smaller than the timestamp.
 
- 
Curl Example curl -X GET \
 'http://host:port/kylin/api/monitor/status/statistics?start=1583562358466&end=1583562358466' \
 -H 'Accept: application/vnd.apache.kylin-v4-public+json' \
 -H 'Accept-Language: cn' \
 -H 'Authorization: Basic QURNSU46S1lMSU4=' \
 -H 'Content-Type: application/json;charset=utf-8'
- 
Response Details - Startstart time of monitoring. It will be rounded down based on the interval of monitoring data. If the interval is 1 minute, it will only record data in minute level. For example, if the argument is- 1587353550000, it will be recognized as- 1587353520000. Therefore, the data might be inaccurate.
- endend time of monitoring. It will be rounded down based on the interval of monitoring data. If the interval is 1 minute, it will only record data in minute level. For example, if the argument is- 1587353550000, it will be recognized as- 1587353520000. Therefore, the data might be inaccurate.
- intervalinterval of monitor data, default value is 60000 ms (1 min)
- jobjob instance status. It will show the instance details and status, which includes unavailable time and counts. The time unit of unavailable time is ms.
- queryquery instance status. It will show the instance details and status, which includes unavailable time and counts. The time unit of unavailable time is ms.
 
- 
Response Example {
 "code":"000",
 "data":{
 "start":1584151560000,
 "end":1584151680000,
 "interval":60000,
 "job":[
 {
 "instance":"sandbox.hortonworks.com:7070",
 "details":[
 {
 "time":1584151572650,
 "status":"GOOD"
 },
 {
 "time":1584151632770,
 "status":"GOOD"
 }
 ],
 "unavailable_time":0,
 "unavailable_count":0
 }
 ],
 "query":[
 {
 "instance":"sandbox.hortonworks.com:7070",
 "details":[
 {
 "time":1584151609142,
 "status":"GOOD"
 },
 {
 "time":1584151669142,
 "status":"GOOD"
 }
 ],
 "unavailable_time":0,
 "unavailable_count":0
 }
 ]
 },
 "msg":""
 }
Know Limitation
- The detected query is constant query which will not scan HDFS files.
- InfluxDB is not high available now. Hence, some monitor data will be lost if the InfluxDB service is down.
- The job status will be inaccurate if deleting or discarding plenty of jobs.
- Since system monitoring depends on InfluxDB, if the system monitoring is still enabled (enabled by default) when InfluxDB is not configured, some useless errors may appear in the log. So when InfluxDB is not configured, it is recommended to configure kylin.monitor.enabled = falseinkylin.propertiesto turn off the system monitoring function.