Elasticsearch-75-percentiles百分比算法及优化

需求背景

有一个网站,记录了每次请求访问的耗时,需要统计tp50,tp90,tp99.

  • tp50:50%的请求的耗时最长在多长时间
  • tp90:90%的请求的耗时最长在多长时间
  • tp99:99%的请求的耗时最长在多长时间

准备数据

1
DELETE /website

创建索引

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
PUT /website
{
"mappings": {
"logs": {
"properties": {
"latency": {
"type": "long"
},
"province": {
"type": "keyword"
},
"timestamp": {
"type": "date"
}
}
}
}
}

添加数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
POST /website/logs/_bulk
{ "index": {}}
{ "latency" : 105, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 83, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 92, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 112, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 68, "province" : "江苏", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 76, "province" : "江苏", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 101, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 275, "province" : "新疆", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 166, "province" : "新疆", "timestamp" : "2016-10-29" }
{ "index": {}}
{ "latency" : 654, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 389, "province" : "新疆", "timestamp" : "2016-10-28" }
{ "index": {}}
{ "latency" : 302, "province" : "新疆", "timestamp" : "2016-10-29" }

pencentiles

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
GET /website/logs/_search
{
"size": 0,
"aggs": {
"latency_percentiles": {
"percentiles": {
"field": "latency",
"percents": [
50,
90,
99
]
}
}
}
}

返回值:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 12,
"max_score": 0,
"hits": []
},
"aggregations": {
"latency_percentiles": {
"values": {
"50.0": 108.5,
"90.0": 380.3,
"99.0": 624.8500000000001
}
}
}
}

从返回值来看, tp50是108.5,tp90是380.3,tp99是624.85 这个结果是es算的,他并不是完全准确的

案例

先根据地区分组,然后计算上面的数据

请求:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
GET /website/logs/_search
{
"size": 0,
"aggs": {
"group_by_province":{
"terms": {
"field": "province"
},
"aggs": {
"latency_percentiles":{
"percentiles": {
"field": "latency",
"percents": [
50,
95,
99
]
}
}
}
}
}
}

返回值:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 12,
"max_score": 0,
"hits": []
},
"aggregations": {
"group_by_province": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "新疆",
"doc_count": 6,
"latency_percentiles": {
"values": {
"50.0": 288.5,
"95.0": 587.75,
"99.0": 640.75
}
}
},
{
"key": "江苏",
"doc_count": 6,
"latency_percentiles": {
"values": {
"50.0": 87.5,
"95.0": 110.25,
"99.0": 111.65
}
}
}
]
}
}
}

percentiles rank

比如现在要计算网站访问时延SLA统计
SLA: 就是提供的服务标准,比如确保所有的请求都要在200ms以内

需求: 在200ms以内的有百分之多少,在1000ms以内的有百分之多少

这个时候就需要用到percentiles ranks了,percentiles ranks其实比pencentile还要常用

请求:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
GET /website/logs/_search 
{
"size": 0,
"aggs": {
"group_by_province": {
"terms": {
"field": "province"
},
"aggs": {
"latency_percentile_ranks": {
"percentile_ranks": {
"field": "latency",
"values": [
200,
1000
]
}
}
}
}
}
}

返回值:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 12,
"max_score": 0,
"hits": []
},
"aggregations": {
"group_by_province": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "新疆",
"doc_count": 6,
"latency_percentile_ranks": {
"values": {
"200.0": 29.40613026819923,
"1000.0": 100
}
}
},
{
"key": "江苏",
"doc_count": 6,
"latency_percentile_ranks": {
"values": {
"200.0": 100,
"1000.0": 100
}
}
}
]
}
}
}

percentile的优化

percentile是采用TDigest算法的,用很多节点来执行百分比的计算,近似估计,所以会有一定的误差,用到的节点越多,就越精准

compression参数

默认值是100, 限制节点的数量最多 compression * 20 个去计算

设置的越大,占用的内存就越多,就越精准,但是性能也会越差, 一个节点占用32字节, 100node 20 32 = 64KB

所以,如果你想要percentile算法越精准,compression可以设置的越大