Elasticsearch-58: Implementing Index-Time Search Suggestions with the ngram Tokenization Mechanism

What is an ngram

For example, take the word quick. Depending on the ngram length, it is split as follows:
ngram length = 1: q u i c k
ngram length = 2: qu ui ic ck
ngram length = 3: qui uic ick
ngram length = 4: quic uick
ngram length = 5: quick

Each of the fragments produced above is one ngram.
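
You can check this with the _analyze API. A minimal sketch, assuming Elasticsearch 5.x or later (which accepts inline filter definitions in _analyze; on older versions the filter must first be registered in index settings):

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "ngram", "min_gram": 2, "max_gram": 2 }
  ],
  "text": "quick"
}

This returns exactly the length-2 grams from the list above: qu, ui, ic, ck.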

edge ngram

This article uses edge ngrams to implement the search-suggestion (autocomplete) feature.
So what is an edge ngram? Take quick again: with edge ngrams, only the grams anchored at the start of the word are kept, so it is split into
q
qu
qui
quic
quick
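
The same kind of sketch shows the edge ngram split (again assuming Elasticsearch 5.x or later for the inline filter definition):

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 5 }
  ],
  "text": "quick"
}

which yields q, qu, qui, quic, quick.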

A worked example

Suppose a document has the value hello world. After edge ngram splitting it becomes:
h
he
hel
hell
hello

w
wo
wor
worl
world

When we then search for hello w, the two terms hello and w are matched directly against the index, and the matching documents are returned.

This differs from the search-suggestion approach covered earlier: at search time there is no longer any need to scan the whole inverted index for terms starting with a prefix. The typed prefix is simply looked up as a term in the inverted index, just like an ordinary match full-text query.
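
For contrast, the pure query-time approach pays the prefix-expansion cost on every request. With match_phrase_prefix, for example, Elasticsearch scans the term dictionary for terms starting with w each time the query runs (an illustrative sketch):

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "hello w"
    }
  }
}

With edge ngrams, that expansion work is done once, at index time.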

Hands-on example

Delete the previous my_index first:
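
DELETE /my_index

Then recreate the index, configuring the custom analyzer in its settings: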

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
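
A note on the numbers: min_gram 1 and max_gram 20 mean every word is indexed as all of its 1- to 20-character prefixes, so suggestions can match from the very first keystroke, at the cost of a larger index. Raising min_gram shrinks the index but delays the first match; max_gram caps the longest prefix that can still match as a single term.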

Once the index is created, test the analyzer:

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}

Response:

{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "qui",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "quic",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "br",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "bro",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brow",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    }
  ]
}
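
Note that each gram keeps the offsets and position of the word it came from: all five grams of quick sit at position 0, all five grams of brown at position 1. This is what later allows a match_phrase query to line up w directly after hello.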

Once the analyzer checks out, set the mapping manually:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}
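
Two details worth noting: search_analyzer is deliberately set to standard, because applying the autocomplete analyzer to the query as well would split hello w into h, he, hel, ..., w and match far more documents than intended. Also, string is the pre-5.x field type; on Elasticsearch 5.x and later, use text instead.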

Then index a few test documents:

POST /my_index/my_type/1
{
  "title": "hello world"
}

POST /my_index/my_type/2
{
  "title": "hello we"
}

POST /my_index/my_type/3
{
  "title": "hello win"
}

POST /my_index/my_type/4
{
  "title": "hello wind"
}

POST /my_index/my_type/5
{
  "title": "hello dog"
}

POST /my_index/my_type/6
{
  "title": "hello cat"
}
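
Equivalently, the same six documents can be indexed in one round trip with the _bulk API (a sketch of the same data):

POST /my_index/my_type/_bulk
{ "index": { "_id": "1" } }
{ "title": "hello world" }
{ "index": { "_id": "2" } }
{ "title": "hello we" }
{ "index": { "_id": "3" } }
{ "title": "hello win" }
{ "index": { "_id": "4" } }
{ "title": "hello wind" }
{ "index": { "_id": "5" } }
{ "title": "hello dog" }
{ "index": { "_id": "6" } }
{ "title": "hello cat" }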

Finally, test the search with hello w:

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

Response:

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.8361317,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.8361317,
        "_source": {
          "title": "hello we"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 0.8361317,
        "_source": {
          "title": "hello wind"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": {
          "title": "hello win"
        }
      }
    ]
  }
}

If a match query were used here instead, documents containing only hello would also be returned; match is a full-text search, so those would simply score lower.
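
To see the difference, run the same search with match:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

All six documents come back, since every title contains hello; the ones that also match a w prefix simply score higher.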

match_phrase is the right choice here: it requires every term to be present, with token positions exactly one apart, which is just what we expect from autocomplete.