Elasticsearch 21: Query string analysis and the unresolved mapping example explained

Query string analysis

The query string must be analyzed with the same analyzer that was used when the index was built.

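For example, suppose a document has a field whose value is: hello you and me, and an inverted index is built from it.
We want to search the index that holds this document with the text hello me, so the search request is: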

GET /index/type/_search?q=field:hello me

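“hello me” is the query string. By default, ES analyzes and normalizes it with the same analyzer that was used to build the inverted index for the field being searched. Only then can the search match correctly.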

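As an example, suppose dogs is converted to dog when the document's inverted index is built. If the search then passed dogs through unchanged, nothing would be found, because the index only contains dog. The dogs in the search must therefore also be converted to dog for the search to work.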
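The dogs/dog case can be sketched in Python with a toy inverted index. The `analyze` function below is a hypothetical stand-in (lowercasing plus a naive rule that strips a trailing "s"), not Elasticsearch's real analysis chain, and the sample sentence "I like dogs" is made up for illustration. It only shows why index-time and search-time analysis have to agree.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (an assumption, not the real ES analyzer):
    lowercase, split on non-alphanumerics, strip a trailing 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query, analyzed=True):
    """Return ids of documents containing any query term."""
    terms = analyze(query) if analyzed else query.split()
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


docs = {1: "I like dogs", 2: "hello you and me"}
index = build_index(docs)

# The raw term "dogs" is not in the index, so nothing matches.
print(search(index, "dogs", analyzed=False))
# Analyzed the same way as at index time, "dogs" becomes "dog" and matches.
print(search(index, "dogs"))
```

With the same analyzer on both sides, "hello me" also finds the second document, because both terms were indexed in exactly the form the query produces.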

The unresolved question from the mapping introduction example, explained

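One point to keep in mind here: fields of different types are handled differently. Some are full text (analyzed for full-text search), while others are exact value (matched exactly).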

When mapping was first introduced, we looked at a small example, and the query results at the time were:

GET /website/article/_search?q=2017                    3 results
GET /website/article/_search?q=2017-01-01              3 results
GET /website/article/_search?q=post_date:2017-01-01    1 result
GET /website/article/_search?q=post_date:2017          1 result

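Take the first query. No field is specified, so it runs against _all by default. As mentioned earlier, _all concatenates the values of all fields in a document into a single string, and a search on _all is a full-text search: the query string is analyzed and normalized before matching.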

Here is the data in the first document:

{
  "post_date": "2017-01-01",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}

Its _all is “2017-01-01 my first article this is my first article in this website 11400”

The _all fields of the three documents contain 2017-01-01, 2017-01-02 and 2017-01-03 respectively. The inverted index built from these dates is:

word	document1	document2	document3
2017	√	√	√
01	√	√	√
02		√
03			√

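With this index, the first search (2017 against _all) clearly finds all 3 documents.
The query string of the second search is split into 2017, 01 and 01, so it also finds all 3 documents.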
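The reasoning for the first two requests can be reproduced with a small sketch. Document 1 is taken from the example above. The article only gives the dates of documents 2 and 3, so their other fields are left out here. The `tokenize` function is a rough stand-in for the standard analyzer (lowercase, split on non-alphanumeric characters), and `all_field` is a hypothetical helper that mimics how _all concatenates field values.

```python
import re
from collections import defaultdict

docs = {
    1: {
        "post_date": "2017-01-01",
        "title": "my first article",
        "content": "this is my first article in this website",
        "author_id": 11400,
    },
    # Only the dates of documents 2 and 3 are known from the article.
    2: {"post_date": "2017-01-02"},
    3: {"post_date": "2017-01-03"},
}


def tokenize(text):
    """Rough approximation of the standard analyzer (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def all_field(doc):
    """Concatenate every field value into one string, like _all."""
    return " ".join(str(value) for value in doc.values())


# Inverted index over the _all string: term -> set of document ids.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for term in tokenize(all_field(doc)):
        index[term].add(doc_id)


def search_all(query):
    """Analyze the query string, then return ids matching any term."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return sorted(hits)


print(search_all("2017"))        # [1, 2, 3]
print(search_all("2017-01-01"))  # [1, 2, 3]
```

Both queries hit all three documents because the term 2017 alone is present in every _all string. The extra 01 terms in the second query do not narrow the result, since a match on any term is enough in this sketch.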

The third request searches the post_date field specifically. post_date is of type date, not a string type, and date values are indexed as exact values:

word	document1	document2	document3
2017-01-01	√
2017-01-02		√
2017-01-03			√

So the third request finds 1 result.
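As a contrast to the tokenized case, here is a minimal sketch of an exact-value index under the simplified model used in this article: the whole value is the only key and no tokenization happens. The names `exact_index` and `search_exact` are hypothetical. Real Elasticsearch handles date fields internally in its own way and, as noted below, behaves differently for the query 2017.

```python
post_dates = {1: "2017-01-01", 2: "2017-01-02", 3: "2017-01-03"}

# Exact-value index: the complete value is the key, nothing is split.
exact_index = {}
for doc_id, value in post_dates.items():
    exact_index.setdefault(value, set()).add(doc_id)


def search_exact(value):
    """Return ids of documents whose field equals the value exactly."""
    return sorted(exact_index.get(value, set()))


print(search_exact("2017-01-01"))  # [1]
print(search_exact("2017"))        # [] under this simplified model
```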

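By that reasoning, the fourth request should find nothing, yet it actually returns one result. That is not explained here, because it comes from an optimization made in ES 5.2 and later.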

Testing an analyzer

GET /_analyze
{
  "analyzer": "standard",       // the analyzer to use
  "text": "Text to analyze"     // the text to split into terms
}
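
The same request can also be sent from a script. The sketch below uses only the Python standard library and assumes an unsecured node at http://localhost:9200. That URL and the helper names `build_analyze_request` and `analyze` are assumptions for illustration, not part of the original example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node without authentication


def build_analyze_request(text, analyzer="standard", base_url=ES_URL):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, analyzer="standard"):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(text, analyzer)) as resp:
        payload = json.load(resp)
    # Each entry in "tokens" describes one term produced by the analyzer.
    return [entry["token"] for entry in payload["tokens"]]
```

Against a running node, `analyze("Text to analyze")` returns the terms the standard analyzer would put into the inverted index, and `analyze("2017-01-01")` is a quick way to check how the date string from the earlier example is split.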