Elasticsearch-4-Aggregation Analysis

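The previous post added six movie documents. This post runs aggregations and statistics over them.

Each of the six movie documents contains a genres array.

Count the number of movies in each genre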
GET /movies/movie/_search
{
  "size": 0, // return no hits; otherwise the matching documents (10 by default) come back alongside the aggregation
  "aggs": {
    "group_by_genres": { // group_by_genres is a user-chosen name
      "terms": {
        "field": "genres" // the field to aggregate on
      }
    }
  }
}

Running this request returns the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [genres] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "movies",
        "node": "f57uV91xS_GRTQS2Ho81rg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [genres] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [genres] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
    }
  },
  "status": 400
}

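The error message states the cause:
Fielddata is disabled on text fields by default. Setting fielddata=true on [genres] loads fielddata into memory by uninverting the inverted index. Note that this can use a significant amount of memory.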
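
The phrase "uninverting the inverted index" can be made concrete with a small sketch. An inverted index maps each term to the documents that contain it, while an aggregation needs the opposite direction: for each document, its terms. The Python below is only a toy model with made-up document IDs and terms, not Elasticsearch's actual data structure. It suggests why building fielddata means walking every posting and keeping the result in memory.

```python
# Toy model only: a hypothetical inverted index for one field,
# mapping each term to the IDs of the documents that contain it.
inverted_index = {
    "crime": [1, 3],
    "drama": [1, 2],
    "thriller": [3],
}

def uninvert(index):
    """Build a doc-ID -> terms mapping by walking every posting in the index."""
    fielddata = {}
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            fielddata.setdefault(doc_id, []).append(term)
    return fielddata

fielddata = uninvert(inverted_index)
for doc_id in sorted(fielddata):
    print(doc_id, fielddata[doc_id])
```

The resulting structure grows with the number of documents and distinct terms, which is consistent with the memory warning in the error message.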

Here we set the fielddata property of the text field to true. The reasons are covered later.

PUT /movies/_mapping/movie
{
  "properties": {
    "genres": {
      "type": "text",
      "fielddata": true
    }
  }
}

Run the aggregation query again. It now returns:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_genres": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "drama",
          "doc_count": 4
        },
        {
          "key": "crime",
          "doc_count": 3
        },
        {
          "key": "biography",
          "doc_count": 2
        },
        {
          "key": "action",
          "doc_count": 1
        },
        {
          "key": "adventure",
          "doc_count": 1
        },
        {
          "key": "cirme",
          "doc_count": 1
        },
        {
          "key": "drame",
          "doc_count": 1
        },
        {
          "key": "mystery",
          "doc_count": 1
        },
        {
          "key": "thriller",
          "doc_count": 1
        },
        {
          "key": "war",
          "doc_count": 1
        }
      ]
    }
  }
}

The aggregation results are returned under aggregations, in buckets. Each key is one term from the genres field, and doc_count is the number of movies that contain that key.
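
To make the bucket semantics concrete, here is a minimal Python sketch of what a terms aggregation computes over a multi-valued field. The documents are hypothetical and are not the movies indexed in the previous post, and `terms_agg` is an illustrative helper, not an Elasticsearch API. The point to notice is that a movie with several genres is counted once in each of its genre buckets.

```python
from collections import Counter

# Hypothetical sample documents, not the movies indexed in the previous post.
docs = [
    {"title": "A", "genres": ["crime", "drama"]},
    {"title": "B", "genres": ["drama"]},
    {"title": "C", "genres": ["crime", "thriller"]},
]

def terms_agg(docs, field):
    """Count, for each distinct value of `field`, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        # set() so a document is counted once per bucket even if a value repeats
        for value in set(doc[field]):
            counts[value] += 1
    # Order by doc_count descending, then key ascending, which is the
    # ordering visible in the response above.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]

buckets = terms_agg(docs, "genres")
for bucket in buckets:
    print(bucket)
```

Because one document can land in several buckets, the doc_count values can add up to more than the total number of documents.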

For movies whose title contains kill, count the number of movies in each genre

GET /movies/movie/_search
{
  "query": {
    "match": {
      "title": "kill"
    }
  },
  "size": 0,
  "aggs": {
    "group_by_genres": {
      "terms": {
        "field": "genres"
      }
    }
  }
}

This is the previous request with a query clause added: the query runs first, and the aggregation is computed only over the documents it matches.
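
The two-step order (filter, then aggregate) can be sketched in Python as follows. The documents and titles are made up for illustration, and `matches` is only a rough stand-in for a match query (lowercase the title and split it into words), not the analyzer Elasticsearch actually applies.

```python
import re
from collections import Counter

# Hypothetical documents for illustration, not the data indexed earlier.
docs = [
    {"title": "Kill Switch", "genres": ["action", "crime"]},
    {"title": "How to Kill Time", "genres": ["crime", "drama"]},
    {"title": "Quiet Town", "genres": ["drama"]},
]

def matches(title, word):
    """Rough stand-in for a match query: compare against lowercased words."""
    return word.lower() in re.findall(r"\w+", title.lower())

def count_genres(docs):
    """Count how many documents fall under each genre."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc["genres"]))
    return dict(counts)

# Step 1: the query narrows the document set.
hits = [doc for doc in docs if matches(doc["title"], "kill")]
# Step 2: the aggregation only sees the documents that matched.
genre_counts = count_genres(hits)
print(genre_counts)
```

The "Quiet Town" document never reaches the aggregation, so its drama genre is counted only once, from the other matching movie.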

Group by genres, then compute the average year of the movies in each genre

GET movies/movie/_search
{
  "size": 0,
  "aggs": {
    "group_by_genres": { // user-chosen name for the grouping
      "terms": {
        "field": "genres" // bucket by genres
      },
      "aggs": {
        "avg_year": { // a sub-aggregation computed inside each bucket
          "avg": { // compute the average
            "field": "year"
          }
        }
      }
    }
  }
}

The average is computed separately over the documents in each bucket.
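
As a rough illustration of a metric computed per bucket, the Python sketch below groups hypothetical documents by genre and averages the year inside each group. The data and the helper name `avg_year_by_genre` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical documents for illustration.
docs = [
    {"genres": ["crime", "drama"], "year": 1972},
    {"genres": ["drama"], "year": 1962},
    {"genres": ["crime"], "year": 2004},
]

def avg_year_by_genre(docs):
    """Group years by genre, then average within each group."""
    years = defaultdict(list)
    for doc in docs:
        for genre in set(doc["genres"]):
            years[genre].append(doc["year"])
    return {genre: sum(vals) / len(vals) for genre, vals in years.items()}

avg_year = avg_year_by_genre(docs)
print(avg_year)
```

The 1972 movie contributes to both the crime and the drama average, because it belongs to both buckets.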

Compute the average year of the movies in each genre, sorted by average year in descending order

GET movies/movie/_search
{
  "size": 0,
  "aggs": {
    "group_by_genres": {
      "terms": {
        "field": "genres",
        "order": {
          "avg_year": "desc"
        }
      },
      "aggs": {
        "avg_year": {
          "avg": {
            "field": "year"
          }
        }
      }
    }
  }
}

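This builds on the previous grouped average: add an order to the outer terms aggregation, and use the name of the sub-aggregation that computes the average, "avg_year", as the sort key.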
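
The effect of ordering buckets by a sub-aggregation can be sketched in Python like this. The documents are hypothetical, and the bucket dictionaries only loosely imitate the response shape; the sort line plays the role of "order": {"avg_year": "desc"}.

```python
from collections import defaultdict

# Hypothetical documents for illustration.
docs = [
    {"genres": ["crime", "drama"], "year": 1972},
    {"genres": ["drama"], "year": 1962},
    {"genres": ["war"], "year": 1979},
    {"genres": ["crime"], "year": 2004},
]

def buckets_ordered_by_avg_year(docs):
    """Build one bucket per genre, then sort buckets by their average year."""
    years = defaultdict(list)
    for doc in docs:
        for genre in set(doc["genres"]):
            years[genre].append(doc["year"])
    buckets = [
        {"key": genre, "doc_count": len(vals), "avg_year": sum(vals) / len(vals)}
        for genre, vals in years.items()
    ]
    # Plays the role of "order": {"avg_year": "desc"} on the terms aggregation.
    return sorted(buckets, key=lambda b: b["avg_year"], reverse=True)

ordered = buckets_ordered_by_avg_year(docs)
for bucket in ordered:
    print(bucket["key"], bucket["avg_year"])
```

The buckets are no longer ordered by doc_count: war has a single movie but sorts ahead of drama because its average year is later.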

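Group by specified year ranges, then group by genres within each range, and finally compute the average year of each group

The movies added in the previous post have the years 1962, 1972, 1979, 2007 and 2003. A range aggregation splits them into three groups: 1960-1970, 1970-1980 and 2000-2010.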

GET /movies/movie/_search
{
  "size": 0,
  "aggs": {
    "group_by_year": {
      "range": {
        "field": "year",
        "ranges": [
          {
            "from": 1960,
            "to": 1970
          },
          {
            "from": 1970,
            "to": 1980
          },
          {
            "from": 2000,
            "to": 2010
          }
        ]
      }
    }
  }
}
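
Two of the ranges share the boundary 1970, so it helps to know that in a range aggregation the from value is inclusive and the to value is exclusive. The Python sketch below mirrors that rule on hypothetical years; the bucket labels such as "1960-1970" are this example's own naming and are not meant to reproduce the keys Elasticsearch returns.

```python
# Same boundaries as the request above, as (from, to) pairs.
RANGES = [(1960, 1970), (1970, 1980), (2000, 2010)]

def range_buckets(years, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = {f"{lo}-{hi}": 0 for lo, hi in ranges}
    for year in years:
        for lo, hi in ranges:
            if lo <= year < hi:
                buckets[f"{lo}-{hi}"] += 1
    return buckets

# Hypothetical years chosen to hit a boundary (1970) and a gap (1995).
counts = range_buckets([1962, 1970, 1979, 1995, 2007], RANGES)
print(counts)
```

A value of exactly 1970 lands only in the 1970-1980 bucket, and a value that falls outside every range, such as 1995, is not counted in any bucket.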

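With the movies split into three year ranges, add a terms aggregation on genres one level down, and below that an avg aggregation to compute the average year.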

GET /movies/movie/_search
{
  "size": 0,
  "aggs": {
    "group_by_year": {
      "range": {
        "field": "year",
        "ranges": [
          {
            "from": 1960,
            "to": 1970
          },
          {
            "from": 1970,
            "to": 1980
          },
          {
            "from": 2000,
            "to": 2010
          }
        ]
      },
      "aggs": {
        "group_by_genres": {
          "terms": {
            "field": "genres"
          },
          "aggs": {
            "avg_year": {
              "avg": {
                "field": "year"
              }
            }
          }
        }
      }
    }
  }
}
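
The three nesting levels in this request (range, then terms, then avg) can be pictured as successive grouping steps. The Python sketch below walks through them on hypothetical documents; `nested_agg` and the bucket labels are illustrative names, not part of Elasticsearch.

```python
from collections import defaultdict

# Same boundaries as the request above, as (from, to) pairs.
RANGES = [(1960, 1970), (1970, 1980), (2000, 2010)]

# Hypothetical documents for illustration.
docs = [
    {"genres": ["crime", "drama"], "year": 1972},
    {"genres": ["drama", "war"], "year": 1979},
    {"genres": ["drama"], "year": 1962},
    {"genres": ["crime"], "year": 2003},
]

def nested_agg(docs, ranges):
    """Range buckets -> genre buckets -> average year per genre bucket."""
    result = {}
    for lo, hi in ranges:
        # Level 1: keep the documents whose year falls in [lo, hi).
        in_range = [doc for doc in docs if lo <= doc["year"] < hi]
        # Level 2: group those documents by genre.
        years = defaultdict(list)
        for doc in in_range:
            for genre in set(doc["genres"]):
                years[genre].append(doc["year"])
        # Level 3: average the year inside each genre group.
        result[f"{lo}-{hi}"] = {g: sum(v) / len(v) for g, v in years.items()}
    return result

nested = nested_agg(docs, RANGES)
for label, genres in nested.items():
    print(label, genres)
```

Each inner level only sees the documents of its parent bucket, so the same genre can appear under several year ranges with a different average in each.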