What is an ngram?
Take the word quick as an example. Depending on the ngram length, it is split as follows:
ngram length = 1: q u i c k
ngram length = 2: qu ui ic ck
ngram length = 3: qui uic ick
ngram length = 4: quic uick
ngram length = 5: quick
Each of the fragments produced this way is one ngram.
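The splitting above can be sketched in a few lines (illustrative only, not Elasticsearch's actual implementation):

```python
def ngrams(word, n):
    """Return all substrings of length n (the ngrams) of word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams("quick", 2))  # ['qu', 'ui', 'ic', 'ck']
print(ngrams("quick", 3))  # ['qui', 'uic', 'ick']
```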
edge ngram
In this article we will use edge ngrams to implement search-as-you-type suggestions.
So what is an edge ngram? Taking the word quick again, with edge ngrams it is split into:
q
qu
qui
quic
quick
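In other words, an edge ngram keeps only the ngrams anchored at the start of the word, i.e. its prefixes. A minimal sketch, mirroring the `min_gram`/`max_gram` parameters used later in the index settings:

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the prefixes of word, from min_gram up to max_gram characters."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]

print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```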
An example
Suppose a document contains the value hello world. After edge ngram splitting it becomes:
h
he
hel
hell
hello
w
wo
wor
worl
world
When we then search for "hello w", the terms hello and w are each matched against these tokens and the matching documents are returned.
This differs from the search-suggestion approach we used before: at search time there is no longer any need to scan the whole inverted index for a given prefix. Instead, the prefix is looked up in the inverted index directly, just like a match full-text query.
Hands-on example
Delete the previous my_index, then recreate the index, this time configuring a custom analyzer:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter":{
          "type":"edge_ngram",
          "min_gram":1,
          "max_gram":20
        }
      },
      "analyzer": {
        "autocomplete":{
          "type":"custom",
          "tokenizer": "standard",
          "filter": [
              "lowercase",
              "autocomplete_filter" 
          ]
        }
      }
    }
  }
}
Once the index is created, test the analyzer:

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}
Response:

{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "qui",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "quic",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "br",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "bro",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brow",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    }
  ]
}
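The token stream above can be reproduced with a small simulation of the analyzer chain (whitespace splitting as a rough stand-in for the standard tokenizer, then the lowercase filter, then the edge_ngram filter):

```python
def analyze(text, min_gram=1, max_gram=20):
    # Approximate the custom "autocomplete" analyzer: tokenize on whitespace
    # (a simplification of the standard tokenizer), lowercase each token,
    # then expand each token into its edge ngrams.
    tokens = []
    for word in text.lower().split():
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n])
    return tokens

print(analyze("quick brown"))
# ['q', 'qu', 'qui', 'quic', 'quick', 'b', 'br', 'bro', 'brow', 'brown']
```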
Once the analyzer checks out, set the mapping manually:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title":{
      "type": "string",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}
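Note that `search_analyzer` is set to `standard`: edge ngram expansion should happen only at index time. If the query were also run through the autocomplete analyzer, a query term would itself be expanded into every prefix and match far too broadly. A sketch of the intended behavior (illustrative only):

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]

# Index time: each word of the document is expanded into edge ngrams.
index_tokens = [g for w in "hello world".lower().split() for g in edge_ngrams(w)]

# Search time: the standard analyzer yields just the raw query terms.
search_tokens = "hello w".lower().split()

# Every query term matches an indexed edge ngram directly -- no prefix scan.
print(all(t in index_tokens for t in search_tokens))  # True
```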
Then index a few test documents:

POST /my_index/my_type/1
{
  "title":"hello world"
}
POST /my_index/my_type/2
{
  "title":"hello we"
}
POST /my_index/my_type/3
{
  "title":"hello win"
}
POST /my_index/my_type/4
{
  "title":"hello wind"
}
POST /my_index/my_type/5
{
  "title":"hello dog"
}
POST /my_index/my_type/6
{
  "title":"hello cat"
}
Finally, run a search for "hello w":

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}
Response:

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.8361317,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.8361317,
        "_source": {
          "title": "hello we"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 0.8361317,
        "_source": {
          "title": "hello wind"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": {
          "title": "hello win"
        }
      }
    ]
  }
}
If we used a plain match query here, documents containing only hello would also be returned: it is full-text search, so they would simply score lower.
match_phrase is recommended instead: it requires every term to be present with positions exactly one apart, which is what we expect from autocomplete.
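The difference can be sketched with a toy position-aware index over the edge ngram tokens (a simplification of what Elasticsearch actually does; scoring is omitted):

```python
def edge_grams(word):
    return {word[:n] for n in range(1, len(word) + 1)}

def index_doc(title):
    # Map each word position to the set of edge ngrams for the word there.
    return {pos: edge_grams(w) for pos, w in enumerate(title.lower().split())}

def match(doc, query):
    # match semantics: any query term appearing anywhere is enough.
    grams = set().union(*doc.values())
    return any(t in grams for t in query.lower().split())

def match_phrase(doc, query):
    # match_phrase semantics: all terms present, at consecutive positions.
    terms = query.lower().split()
    return any(
        all(terms[i] in doc[start + i] for i in range(len(terms)))
        for start in range(len(doc) - len(terms) + 1)
    )

for title in ["hello world", "hello dog"]:
    idx = index_doc(title)
    print(title, match(idx, "hello w"), match_phrase(idx, "hello w"))
# hello world True True
# hello dog True False
```

"hello dog" matches the plain match query (it contains hello) but fails match_phrase, because no word at the position after hello starts with w.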