Elasticsearch-53: Proximity matching with the slop parameter

The slop parameter

Suppose we have the following search request:

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "java spark",
        "slop": 1
      }
    }
  }
}

What does slop do?
The terms in the query string may have to be moved a number of times before they line up with a document. That number of moves is the slop.

An example

Suppose a document's content field has the value
hello world, java is very good, spark is also very good.
A plain match_phrase search for java spark, as described earlier, will not find this document.

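If we specify slop, however, the terms of java spark are allowed to move in order to try to match the document. Take matching java spark against the sentence above:
[image: spark moved three positions to the right to line up with its place in the document]
As the figure shows, once spark has been moved back three positions, the query matches the document.
slop does not mean that the query terms move exactly that many times to match a doc. It is the maximum number of moves the query terms are allowed while trying to match a doc.
In this example, any slop of 3 or more matches the document. With slop set to 2 it does not match.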
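
The move counting above can be reproduced in a few lines of Python. This is only a minimal sketch, not how Elasticsearch is implemented: it assumes a simple lowercase word tokenizer in place of the real analyzer, and the helper names tokenize and min_slop are made up for illustration. For every way of picking one occurrence of each query term, it measures how far each term sits from its place in the query and takes the spread of those offsets as the number of moves.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase word tokenizer (an assumed stand-in for the real analyzer)."""
    return re.findall(r"\w+", text.lower())


def min_slop(query, document):
    """Smallest slop at which `query` would phrase-match `document`, or None."""
    q_terms = tokenize(query)
    d_terms = tokenize(document)
    # Positions of every query term inside the document.
    positions = [[i for i, t in enumerate(d_terms) if t == q] for q in q_terms]
    if any(not p for p in positions):
        return None  # a query term is missing, so no slop value can help
    best = None
    for combo in product(*positions):
        # Offset of each term: document position minus query position.
        offsets = [d - q for q, d in enumerate(combo)]
        moves = max(offsets) - min(offsets)
        if best is None or moves < best:
            best = moves
    return best


doc = "hello world, java is very good, spark is also very good."
print(min_slop("java spark", doc))  # 3
```

Here java sits at position 2 and spark at position 6, while the query expects them at 0 and 1, so the offsets are 2 and 5 and the spread is 3. The sketch does not try to handle a query that repeats the same term.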

Here is another example:

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "spark data",
        "slop": 3
      }
    }
  }
}

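Run the search. The content of the document it returns is:
spark is best big data solution based on scala ,an programming language similar to java spark
The search phrase is spark data. In content there are 3 words between spark and data (is, best, big), so 3 moves are enough to match, and the smallest slop that matches is 3.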

What if we search for data spark instead? How do the terms have to move then?

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "data spark",
        "slop": 5
      }
    }
  }
}

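[image: data and spark swapped in two moves, followed by three further moves]
In the figure above, the first two moves swap data and spark, and three more moves then produce the match, so the minimum slop for this request is 5.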
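
Why the reversed query costs 5 rather than 3 becomes clearer when each term's offset is written down. The sketch below is only an illustration under assumptions: the positions are typed in by hand from the content above (spark at 0, data at 4, ignoring the second spark at the end, which is farther away), and required_slop is a made-up helper, not an Elasticsearch API.

```python
def required_slop(query_positions, doc_positions):
    """Moves needed for one alignment: spread of (doc position - query position)."""
    offsets = {
        term: doc_positions[term] - query_positions[term]
        for term in query_positions
    }
    return max(offsets.values()) - min(offsets.values()), offsets


# Positions in "spark is best big data solution ...": spark=0, data=4.
doc_positions = {"spark": 0, "data": 4}

# Query "spark data": spark=0, data=1.
print(required_slop({"spark": 0, "data": 1}, doc_positions))
# (3, {'spark': 0, 'data': 3})

# Query "data spark": data=0, spark=1.
print(required_slop({"data": 0, "spark": 1}, doc_positions))
# (5, {'data': 4, 'spark': -1})
```

In the reversed query the two terms have to move in opposite directions (data by +4, spark by -1), which is where the two extra moves for the swap come from.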

In a slop search, the closer together the terms are in a document, the higher its relevance score. Here is one more case, searching for java best:

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java best",
        "slop": 15
      }
    }
  }
}

Response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.65380025,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.65380025,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.07111243,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2018-12-03",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}

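Compare the _score of the two hits first. The closer the two terms are, the higher the score: _id 2, where java and best are only a few words apart, scores 0.65380025, while _id 5, where they are far apart and in reverse order, scores 0.07111243.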
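
One way to picture why distance lowers the score: as far as I know, Lucene's default similarities weight a sloppy phrase match by 1 / (distance + 1), where distance is the number of moves that match needed. The sketch below applies that assumed factor to the two hits. It covers the proximity part only. The real _score also depends on field length and term statistics, so these numbers are not meant to reproduce the scores in the response. The name sloppy_weight is made up, and the distances were counted by hand from the two content values.

```python
def sloppy_weight(distance):
    """Proximity factor for one sloppy phrase match (assumed: 1 / (distance + 1))."""
    return 1.0 / (distance + 1)


# Moves needed for "java best", counted by hand from the two contents:
# _id 2: java at 2, best at 5  -> offsets 2 and 4  -> distance 2
# _id 5: best at 2, java at 14 -> offsets 14 and 1 -> distance 13
for doc_id, distance in [("2", 2), ("5", 13)]:
    print(doc_id, distance, round(sloppy_weight(distance), 3))
# 2 2 0.333
# 5 13 0.071
```

Both distances are within the slop of 15, so both documents match, but the document with the smaller distance gets the larger proximity weight.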

In fact, a phrase match with slop is a proximity match.