Elasticsearch-47-Hands-on Case Study: Multi-field Search with the best fields Strategy Using dis_max

Preparation

Add a content field to the posts

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }

Requirement 1

Search for posts whose title or content contains java or solution
Build the search query

GET forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "java solution"
          }
        },
        {
          "match": {
            "content": "java solution"
          }
        }
      ]
    }
  }
}

Response:

{
"took": 23,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.8849759,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.8849759,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "4",
"_score": 0.7120095,
"_source": {
"articleID": "QQPX-R-3956-#aD8",
"userID": 2,
"hidden": true,
"postDate": "2017-01-02",
"tag": [
"java",
"elasticsearch"
],
"tag_cnt": 2,
"view_cnt": 80,
"title": "this is java, elasticsearch, hadoop blog",
"content": "elasticsearch and hadoop are all very good solution, i am a beginner"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 0.56008905,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2018-12-03",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "1",
"_score": 0.26742277,
"_source": {
"articleID": "XHDK-A-1293-#fJ3",
"userID": 1,
"hidden": false,
"postDate": "2017-01-01",
"tag": [
"java",
"hadoop"
],
"tag_cnt": 2,
"view_cnt": 30,
"title": "this is java and elasticsearch blog",
"content": "i like to write best elasticsearch article"
}
}
]
}
}

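Let's look at the response:
Ranked first is the document with id 2: both its title and its content contain java.
Ranked second is the document with id 4: its title contains java and its content contains solution.
Ranked third is the document with id 5: its content contains both java and solution.

Judged this way, the document with id=5 should be more relevant than the one with id=4, yet id=4 is ranked ahead of it. Why?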

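How es computes the score

In the simplified model used here, es computes each document's relevance score as the sum of the scores of the queries it matches, multiplied by the number of matched queries, divided by the total number of queries.
For each query (that is, each match inside should above), es computes a score; the matched query count is the number of those queries that the document actually matches.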

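Let's compute the score of the document with id=4 against the two queries in the request:
{ "match": { "title": "java solution" }}: document 4 gets a score for this query, say 1.1
{ "match": { "content": "java solution" }}: document 4 also gets a score for this query, say 1.2
The sum of the query scores is 1.1 + 1.2 = 2.3, the matched query count is 2, and the total query count is 2, so the result is 2.3 * 2 / 2 = 2.3

Now let's compute the score of document 5 against the same two queries:
{ "match": { "title": "java solution" }}: document 5 gets no score, because this query does not match document 5
{ "match": { "content": "java solution" }}: document 5 gets a score for this query, say 2.3
This time the sum of the query scores is 2.3, the matched query count is 1, and the total query count is 2, so the result is 2.3 * 1 / 2 = 1.15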

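2.3 > 1.15, so document 4 is ranked ahead of document 5.
The numbers above are assumed for illustration only. In the actual responses in this article, document 5 scores 0.56008905 under both the bool query and the dis_max query, which suggests that the matched/total multiplier was not applied on this cluster. Document 4 still wins under bool, because the scores of its two matching queries are added together.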
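The arithmetic above can be sketched in a few lines of Python. This is only a sketch of the simplified formula as this article describes it, not of Elasticsearch's real scoring code. The function name bool_should_score and the use of None for a query that does not match are assumptions made for illustration, and the scores are the assumed values from the walkthrough.

```python
def bool_should_score(clause_scores):
    """Simplified bool/should scoring as described in this article.

    clause_scores holds one entry per should clause;
    None means the clause did not match the document (an assumed convention).
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    # sum of matching scores * matched query count / total query count
    return sum(matched) * len(matched) / len(clause_scores)


# Assumed scores from the walkthrough: [title score, content score]
doc4_score = bool_should_score([1.1, 1.2])   # both queries match
doc5_score = bool_should_score([None, 2.3])  # only the content query matches

print(round(doc4_score, 2), round(doc5_score, 2))
print("document 4 ranks first:", doc4_score > doc5_score)
```

Document 4 comes out ahead even though its best single field scores lower than document 5's, which is exactly the behavior the best fields strategy is meant to avoid.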

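The best fields strategy: dis_max

The best fields strategy means that a document in which a single field matches as many of the keywords as possible should be ranked first, rather than a document that matches only a few keywords in each of several fields.

Search request: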

GET forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "java solution"
          }
        },
        {
          "match": {
            "content": "java solution"
          }
        }
      ]
    }
  }
}

Response:

{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.68640786,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.68640786,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 0.56008905,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2018-12-03",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "4",
"_score": 0.5565415,
"_source": {
"articleID": "QQPX-R-3956-#aD8",
"userID": 2,
"hidden": true,
"postDate": "2017-01-02",
"tag": [
"java",
"elasticsearch"
],
"tag_cnt": 2,
"view_cnt": 80,
"title": "this is java, elasticsearch, hadoop blog",
"content": "elasticsearch and hadoop are all very good solution, i am a beginner"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "1",
"_score": 0.26742277,
"_source": {
"articleID": "XHDK-A-1293-#fJ3",
"userID": 1,
"hidden": false,
"postDate": "2017-01-01",
"tag": [
"java",
"hadoop"
],
"tag_cnt": 2,
"view_cnt": 30,
"title": "this is java and elasticsearch blog",
"content": "i like to write best elasticsearch article"
}
}
]
}
}

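As you can see, this time document 5 is ranked ahead of document 4.

dis_max simply takes the score of the highest-scoring query among its queries. Let's walk through it:
{ "match": { "title": "java solution" }}: document 4 gets a score for this query, say 1.1
{ "match": { "content": "java solution" }}: document 4 also gets a score for this query, say 1.2
Take the maximum: 1.2

{ "match": { "title": "java solution" }}: document 5 gets no score for this query
{ "match": { "content": "java solution" }}: document 5 gets a score for this query, say 2.3
Take the maximum: 2.3

Document 4's score of 1.2 is lower than document 5's score of 2.3, so document 5 is ranked higher, which is what we want.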
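The same walkthrough can be written as a small Python sketch. It only mirrors the "take the maximum" rule described above and is not Elasticsearch's implementation. The function name dis_max_score and the None convention for a query that does not match are assumptions, and the scores are the assumed values from the walkthrough.

```python
def dis_max_score(clause_scores):
    """dis_max without tie_breaker: the best single query score wins.

    clause_scores holds one entry per query;
    None means the query did not match the document (an assumed convention).
    """
    matched = [s for s in clause_scores if s is not None]
    return max(matched) if matched else 0.0


# Assumed scores from the walkthrough: [title score, content score]
scores = {
    "document4": dis_max_score([1.1, 1.2]),
    "document5": dis_max_score([None, 2.3]),
}

# Sort document names by score, highest first
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['document5', 'document4']
```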

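Tuning dis_max search results with the tie_breaker parameter

Scenario

Search condition: find posts whose title or content contains java beginner

Suppose we have 3 documents:
document1: title contains java; content contains neither java nor beginner
document2: title contains neither keyword; content contains beginner
document3: title contains java; content contains beginner

When we run the search, document1 and document2 may end up ranked ahead of document3, whereas we expect document3 to be ranked first.

This happens because dis_max takes only the highest score among the queries and ignores the scores of the other queries entirely.

Using tie_breaker to improve the results

The tie_breaker parameter multiplies the scores of the other queries by tie_breaker and adds them to the highest score, so that the other queries' scores are taken into account in addition to the best one.

The value of tie_breaker is between 0 and 1.

Usage example: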

GET forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "java beginner"
          }
        },
        {
          "match": {
            "content": "java beginner"
          }
        }
      ],
      "tie_breaker": 0.3
    }
  }
}

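tie_breaker sits at the same level as queries. Run the query with and without tie_breaker and compare the scores to see the difference clearly; that comparison is not demonstrated here.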
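To make the effect concrete without a cluster, here is a minimal Python sketch of the three-document scenario above. It assumes the combined score is the best query score plus tie_breaker times the sum of the other matching query scores. The function name dis_max_score, the None convention for a query that does not match, and all the per-field scores are hypothetical values chosen for illustration; they are not taken from a real response.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best query score plus tie_breaker times the other matching scores."""
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    best = max(matched)
    others = sum(matched) - best
    return best + tie_breaker * others


# Hypothetical scores: [title score, content score]; None = no match
docs = {
    "document1": [1.2, None],  # title contains java
    "document2": [None, 1.2],  # content contains beginner
    "document3": [1.0, 1.0],   # title contains java, content contains beginner
}


def rank(tie_breaker):
    """Return document names ordered by score, highest first."""
    return sorted(
        docs,
        key=lambda name: dis_max_score(docs[name], tie_breaker),
        reverse=True,
    )


print("without tie_breaker:", rank(0.0))
print("tie_breaker=0.3:    ", rank(0.3))
```

With these assumed numbers, document3 is last when tie_breaker is 0 and first when it is 0.3, because its second matching field now contributes to the score.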