Elasticsearch-58: Implementing Index-Time Search Suggestions with the ngram Tokenization Mechanism

What is an ngram

For example, take the word quick. Depending on the ngram length, it is split as follows:
ngram length = 1: q u i c k
ngram length = 2: qu ui ic ck
ngram length = 3: qui uic ick
ngram length = 4: quic uick
ngram length = 5: quick

Each of the fragments produced above is one ngram.
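
You can check this with the _analyze API. A minimal sketch, assuming Elasticsearch 5.x or later (which accepts inline filter definitions in _analyze; on older versions the filter must first be registered in index settings):

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "ngram", "min_gram": 2, "max_gram": 2 }
  ],
  "text": "quick"
}

This returns exactly the length-2 grams from the list above: qu, ui, ic, ck.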

edge ngram

This article uses edge ngrams to implement the search-suggestion (autocomplete) feature.
So what is an edge ngram? Take quick again: with edge ngrams, only the grams anchored at the start of the word are kept, so it is split into
q
qu
qui
quic
quick
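
The same kind of sketch shows the edge ngram split (again assuming Elasticsearch 5.x or later for the inline filter definition):

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 5 }
  ],
  "text": "quick"
}

which yields q, qu, qui, quic, quick.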

A worked example

Suppose a document has the value hello world. After edge ngram splitting it becomes:
h
he
hel
hell
hello

w
wo
wor
worl
world

When we then search for hello w, the two terms hello and w are matched directly against the index, and the matching documents are returned.

This differs from the search-suggestion approach covered earlier: at search time there is no longer any need to scan the whole inverted index for terms starting with a prefix. The typed prefix is simply looked up as a term in the inverted index, just like an ordinary match full-text query.
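
For contrast, the pure query-time approach pays the prefix-expansion cost on every request. With match_phrase_prefix, for example, Elasticsearch scans the term dictionary for terms starting with w each time the query runs (an illustrative sketch):

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "hello w"
    }
  }
}

With edge ngrams, that expansion work is done once, at index time.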

Hands-on example

Delete the previous my_index first:
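
DELETE /my_index

Then recreate the index, configuring the custom analyzer in its settings: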

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
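
A note on the numbers: min_gram 1 and max_gram 20 mean every word is indexed as all of its 1- to 20-character prefixes, so suggestions can match from the very first keystroke, at the cost of a larger index. Raising min_gram shrinks the index but delays the first match; max_gram caps the longest prefix that can still match as a single term.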

Once the index is created, test the analyzer:

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}

Response:

{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "qui",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "quic",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "br",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "bro",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brow",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    }
  ]
}
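
Note that each gram keeps the offsets and position of the word it came from: all five grams of quick sit at position 0, all five grams of brown at position 1. This is what later allows a match_phrase query to line up w directly after hello.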

Once the analyzer checks out, set the mapping manually:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}
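
Two details worth noting: search_analyzer is deliberately set to standard, because applying the autocomplete analyzer to the query as well would split hello w into h, he, hel, ..., w and match far more documents than intended. Also, string is the pre-5.x field type; on Elasticsearch 5.x and later, use text instead.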

Then index a few test documents:

POST /my_index/my_type/1
{
  "title": "hello world"
}

POST /my_index/my_type/2
{
  "title": "hello we"
}

POST /my_index/my_type/3
{
  "title": "hello win"
}

POST /my_index/my_type/4
{
  "title": "hello wind"
}

POST /my_index/my_type/5
{
  "title": "hello dog"
}

POST /my_index/my_type/6
{
  "title": "hello cat"
}
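
Equivalently, the same six documents can be indexed in one round trip with the _bulk API (a sketch of the same data):

POST /my_index/my_type/_bulk
{ "index": { "_id": "1" } }
{ "title": "hello world" }
{ "index": { "_id": "2" } }
{ "title": "hello we" }
{ "index": { "_id": "3" } }
{ "title": "hello win" }
{ "index": { "_id": "4" } }
{ "title": "hello wind" }
{ "index": { "_id": "5" } }
{ "title": "hello dog" }
{ "index": { "_id": "6" } }
{ "title": "hello cat" }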

Finally, test the search with hello w:

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

Response:

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.8361317,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.8361317,
        "_source": {
          "title": "hello we"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 0.8361317,
        "_source": {
          "title": "hello wind"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": {
          "title": "hello win"
        }
      }
    ]
  }
}

If a match query were used here instead, documents containing only hello would also be returned; match is a full-text search, so those would simply score lower.
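
To see the difference, run the same search with match:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

All six documents come back, since every title contains hello; the ones that also match a w prefix simply score higher.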

match_phrase is the right choice here: it requires every term to be present, with token positions exactly one apart, which is just what we expect from autocomplete.