Elasticsearch-52: phrase match search and how it works

Proximity matching

Suppose there are two documents whose content values are:
java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark’s programming language and scala is also based on jvm like java.

Search for java spark with a match query:

{
  "match": {
    "content": "java spark"
  }
}

A match query can only find documents that contain java or spark; it tells us nothing about how close java and spark are to each other.

Every document containing either term comes back, and we have no way of knowing in which of them java and spark actually sit close together. If what we want is java and spark with no other characters in between, full-text retrieval with match simply cannot do it.

If we want documents in which java and spark are close together to be returned first, i.e. to receive a higher relevance score, that is what proximity match is for.

Suppose there are two requirements:

  1. java spark must appear together exactly, with nothing in between
  2. java spark need not be adjacent, but the closer the two words are, the higher the doc's relevance score and the higher it ranks

Plain match full-text search cannot satisfy either requirement; we have to use proximity match.

phrase match (exact phrase match) vs. proximity match (approximate match):
This article focuses on phrase match, which returns only the docs where java and spark sit directly next to each other. A doc whose content is, say, java use’d spark must not match; one like java spark are very good friends must.
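
As an aside, requirement 2 above (the terms need not be adjacent, but closer should score higher) is handled in Elasticsearch by match_phrase with the slop parameter. A minimal sketch for reference; slop caps how many position moves are allowed when lining the terms up, and closer matches still score higher:

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 5
      }
    }
  }
}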

phrase match: take multiple terms as one phrase and search for them together; only documents that contain the phrase are returned as results.

Example

First, search for java spark with a full-text match query:

GET /forum/article/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  }
}

Documents containing java or spark all come back, which is not what we want. None of the existing data fits our scenario, so modify the content of the document whose id is 5:

POST /forum/article/5/_update
{
  "doc": {
    "content": "spark is best big data solution based on scala ,an programming language similar to java spark"
  }
}
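
Optionally, fetch the document back to confirm the update took effect:

GET /forum/article/5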

Now search with phrase match:

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

Response:

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2018-12-03",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}

Exactly the document we just modified comes back: only documents containing the phrase java spark are returned, and nothing else.

How it works

term position
Suppose two documents have the following content values:
document1: hello world, java spark
document2: hi, spark java

When this text is indexed, it is tokenized, and the position at which each term occurs in each doc is recorded:

word    term position
hello   doc1(0)
world   doc1(1)
hi      doc2(0)
java    doc1(2) doc2(2)
spark   doc1(3) doc2(1)

We can see this with the analyzer:

GET _analyze
{
  "text": "hello world, java spark",
  "analyzer": "standard"
}

Response:

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 13,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "spark",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

position is simply where each term occurs in the text.

How match_phrase works:
Search for java spark against the two documents above.
java maps to doc1(2) doc2(2); spark maps to doc1(3) doc2(1).
A doc must contain every term of the phrase before it is even considered; both docs pass this check.

First look at doc1: java is at position 2 and spark at position 3, so spark's position is exactly java's position + 1, which satisfies the phrase condition.
Now look at doc2: java is at position 2 and spark at position 1; spark's position is 1 below java's, not 1 above, so doc2 does not match.
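
To verify this end to end, we can index the two example documents into a scratch index and run the phrase query; test_index below is a hypothetical, throwaway index using the same index/type style as above:

PUT /test_index/article/1
{
  "content": "hello world, java spark"
}

PUT /test_index/article/2
{
  "content": "hi, spark java"
}

GET /test_index/article/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

Only doc 1 is returned: doc 2 contains both terms, but spark sits one position before java instead of one after, so the position check fails.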