Elasticsearch-53: Proximity matching with the slop parameter

The slop parameter

Suppose we have the following search request:

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "java spark",
        "slop":1
      }
    }
  }
}

What does slop do?
The terms in the query string may have to be moved some number of positions before they line up with a document. That number of moves is the slop.

An example

Suppose a document's content field has the value:
hello world, java is very good, spark is also very good.
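If we search for java spark with the plain match_phrase query covered earlier, this document is not found, because java and spark are not adjacent in it.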

If we specify slop, however, the terms java and spark are allowed to move in order to try to match the document. Take matching java spark against the sentence above:

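In the query, spark comes directly after java. In the document, three terms (is, very, good) sit between them. Moving spark back three positions lines the query up with the document, so it matches.
slop is not the exact number of moves the query string terms make to match a doc. It is the maximum number of moves the terms are allowed to make while trying to match a doc.
In the example above, any slop greater than or equal to 3 matches the document. With slop set to 2 it does not match.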
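
The move counting for the java spark example can be sketched in Python. This is only a simplified model, not Elasticsearch's actual implementation: the tokenize and min_slop helpers are hypothetical names, the tokenizer is a rough stand-in for the standard analyzer (lowercase, split on non-alphanumeric characters), and the sketch assumes each query term is distinct.

```python
import re
from itertools import product

def tokenize(text):
    # Assumption: rough stand-in for the standard analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())

def min_slop(query, text):
    """Smallest slop at which the phrase would match, or None if a term is missing."""
    doc_terms = tokenize(text)
    query_terms = tokenize(query)
    # Every document position of each query term, kept in query order.
    positions = [[i for i, t in enumerate(doc_terms) if t == q] for q in query_terms]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        # Shift each document position back by the term's position in the query.
        # An exact phrase gives identical shifted values, i.e. a spread of 0.
        shifted = [d - q for q, d in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        if best is None or spread < best:
            best = spread
    return best

content = "hello world, java is very good, spark is also very good."
needed = min_slop("java spark", content)
print(needed)        # 3
print(needed <= 2)   # False: slop 2 is not enough
print(needed <= 3)   # True: slop 3 matches
```

A request matches under this model whenever its slop value is at least the number returned by min_slop.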

Here is another example:

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "spark data",
        "slop": 3
      }
    }
  }
}

Run the search. The content value of the document it returns is:
spark is best big data solution based on scala ,an programming language similar to java spark
The query is spark data. In content, three terms (is, best, big) sit between spark and data, so three moves are enough to match, and the smallest slop that matches is 3.

What if we search for data spark instead? How do the terms move then?

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "data spark",
        "slop": 5
      }
    }
  }
}

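Here the query terms are in the opposite order from the document. The first two moves swap the positions of data and spark, and three more moves then bridge the three terms between them. So the smallest slop that matches this request is 5.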
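
The difference between the two orderings can be checked with a little arithmetic. The sketch below assumes the positions a standard-analyzer-like tokenizer would assign to the first sentence of the content (spark at 0, data at 4), and required_slop is a hypothetical helper, not an Elasticsearch API: it shifts each document position back by the term's position in the query and measures how far apart the results are.

```python
def required_slop(doc_positions, query_positions):
    # Shift each term's document position by its query position.
    # An exact phrase match gives identical shifted values (spread 0).
    shifted = [d - q for d, q in zip(doc_positions, query_positions)]
    return max(shifted) - min(shifted)

# Query "spark data": spark (query position 0) is at 0, data (query position 1) is at 4.
print(required_slop([0, 4], [0, 1]))  # 3

# Query "data spark": data (query position 0) is at 4, spark (query position 1) is at 0.
print(required_slop([4, 0], [0, 1]))  # 5
```

Reversing two adjacent terms costs 2 on its own, which is why the reversed query needs 3 + 2 = 5.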

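In a search with slop, the closer the query terms are to each other in a document, the higher its relevance score. Here is another case, this time searching for java best: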

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java best",
        "slop":15
      }
    }
  }
}

Response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.65380025,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.65380025,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.07111243,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2018-12-03",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}

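Compare the two _score values. Document 2, where java and best are only two terms apart, scores 0.65380025. Document 5, where they are far apart and in reverse order, scores 0.07111243. The closer the two terms are, the higher the score.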
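
Why distance matters so much can be sketched as follows. As far as I know, Lucene's built-in similarities weight each sloppy match by 1 / (1 + distance), but treat that formula as an assumption here. The helper names are hypothetical, the positions assume a standard-analyzer-like tokenizer, and the real _score also depends on other factors such as field length and term statistics, so these numbers are not the scores from the response.

```python
def match_distance(doc_positions, query_positions=(0, 1)):
    # Number of moves needed to line the query terms up with the document.
    shifted = [d - q for d, q in zip(doc_positions, query_positions)]
    return max(shifted) - min(shifted)

def sloppy_weight(distance):
    # Assumed weighting: a closer match contributes more to the phrase frequency.
    return 1.0 / (1 + distance)

# Query "java best": java has query position 0, best has query position 1.
doc2 = match_distance([2, 5])    # "i think java is the best ...": java=2, best=5
doc5 = match_distance([14, 2])   # "spark is best ... to java spark": java=14, best=2

print(doc2, round(sloppy_weight(doc2), 4))   # 2 0.3333
print(doc5, round(sloppy_weight(doc5), 4))   # 13 0.0714
```

Both documents fall within slop 15, but the match in document 2 is weighted several times more heavily than the one in document 5 under this model.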

In fact, a phrase match with slop added is exactly what is called a proximity match.