Proximity Match
Suppose we have two documents whose content values are:
java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark’s programming language and scala is also based on jvm like java.
If we search for java spark with a match query:

```
{
  "match": {
    "content": "java spark"
  }
}
```
A match query only finds documents that contain java or spark; it tells us nothing about how close java and spark actually are to each other, so every document containing either word is returned.
If we want java and spark with no other characters in between, a plain match full-text search simply cannot do it.
And if we want documents where java and spark sit close together to be returned first, with a higher relevance score, that is exactly what proximity match is for.
Suppose we have two requirements:
- java spark must appear together, with no characters in between
- java spark do not have to be adjacent, but the closer the two words are, the higher the doc's score and the higher it ranks
A match full-text search cannot satisfy either requirement; we have to use proximity match (see the sketch below).
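As a rough sketch of how the second requirement maps onto a query (the forum/article index and the content field are the ones used in the examples below, and the slop value of 50 is just an illustrative choice): requirement 1 is the plain match_phrase demonstrated later in this post, while requirement 2 is typically handled by adding a slop to match_phrase, which lets the terms sit further apart while still scoring closer matches higher.

```
GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 50
      }
    }
  }
}
```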
The two relevant techniques are phrase match and proximity match.
This post focuses mainly on phrase match: it only returns docs where java and spark sit right next to each other. A doc containing java use'd spark will not match, while something like java spark are very good friends will.
phrase match: multiple terms are searched together as a single phrase, and only documents containing that phrase are returned.
Example
First, run a regular match full-text search for java spark:

```
GET /forum/article/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  }
}
```
Every document containing java or spark comes back, which is not what we want. Since none of the current data fits our requirement, update the content of the document with id 5:

```
POST /forum/article/5/_update
{
  "doc": {
    "content": "spark is best big data solution based on scala ,an programming language similar to java spark"
  }
}
```
Now search with phrase match:

```
GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}
```
The response:

```
{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2018-12-03",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
```
That is exactly the document we just updated: only documents containing the phrase java spark are returned; everything else is excluded.
How it works
term position
Suppose two documents have the following content values:
document1: hello world, java spark
document2: hi, spark java
When this data is analyzed into terms, the position at which each term appears in each doc is recorded:
word | term position |
---|---|
hello | doc1(0) |
world | doc1(1) |
hi | doc2(0) |
java | doc1(2) doc2(2) |
spark | doc1(3) doc2(1) |
We can use the analyzer to check this:

```
GET _analyze
{
  "text": "hello world, java spark",
  "analyzer": "standard"
}
```
The response:

```
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 13,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "spark",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
```
The position field is simply where each term sits in the sentence.
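As a quick cross-check (this request is a sketch and is not part of the original walkthrough), running the same _analyze call on document2's content should report hi at position 0, spark at position 1, and java at position 2, which is exactly where the doc2(1) and doc2(2) entries in the table above come from:

```
GET _analyze
{
  "text": "hi, spark java",
  "analyzer": "standard"
}
```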
How match_phrase search works:
Search for java spark against the two documents above.
java maps to doc1(2) doc2(2), and spark maps to doc1(3) doc2(1).
A doc must contain every term before it is even considered for the position check.
Look at doc1 first: in document1, spark's position is exactly 1 greater than java's (java is at 2, spark is at 3), so the condition is satisfied.
Now doc2: in document2, java is at position 2 and spark is at position 1, so spark comes 1 position before java rather than 1 after it, and doc2 does not match.
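To check this behavior against real indexed data, one option (a sketch; it reuses the forum/article index and the document with id 5 from the example above) is the explain API, which reports whether a specific document matches a query and how its score was computed:

```
GET /forum/article/5/_explain
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}
```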