Elasticsearch-27-Search relevance: the TF/IDF algorithm

Algorithm overview

Put simply, the relevance score algorithm computes how closely the text stored in an index matches the search text.

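Elasticsearch uses the term frequency/inverse document frequency algorithm, TF/IDF for short. (From version 5.0 on the default similarity is BM25, which is built on the same factors.)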

TF (term frequency)

Term frequency: how many times each term of the search text appears in the field text. The more often the terms appear, the more relevant the document.

An example:
Search request: hello world
document1: hello you, and world is very good
document2: hello, how are you

Together, hello and world occur twice in document1 (once each) but only once in document2 (hello only), so document1 is more relevant.
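
The counting in this example can be sketched in a few lines of Python. This is only an illustration: the tokenize helper is a hypothetical stand-in for a real Elasticsearch analyzer, and the raw counts are just the input to the score, not the score itself.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of letters; a crude stand-in for an analyzer.
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(query, field_text):
    """Count how often each query term occurs in the field text."""
    tokens = tokenize(field_text)
    return {term: tokens.count(term) for term in tokenize(query)}

doc1 = "hello you, and world is very good"
doc2 = "hello, how are you"

tf1 = term_frequency("hello world", doc1)
tf2 = term_frequency("hello world", doc2)

# Print the per-term counts and the total number of matches per document.
print("document1:", tf1, "total:", sum(tf1.values()))
print("document2:", tf2, "total:", sum(tf2.values()))
```

document1 ends up with the higher total, which matches the reasoning above.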

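IDF (inverse document frequency)

Inverse document frequency: how many times each term of the search text appears across all documents in the whole index. The more often a term appears, the less a match on it contributes to relevance.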

An example:
Search request: hello world
document1: hello, today is very good
document2: hi world, how are you

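At first glance each document contains exactly one of the two terms, once, so they look equally relevant. But document2 should rank higher. Suppose the index holds 10,000 documents, hello appears 1,000 times across them, and world appears only 100 times. world is the rarer term, so a match on it counts for more, and document2 is more relevant.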
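
To put rough numbers on this, here is a minimal sketch. It assumes the formula documented for Lucene's classic TF/IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)), and it treats the occurrence counts from the example as the number of documents containing each term; both are assumptions of this sketch, not something the example states.

```python
import math

def idf(num_docs, doc_freq):
    # Assumed formula: 1 + ln(numDocs / (docFreq + 1)).
    # The +1 keeps the division safe when a term appears in no document.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 10_000

idf_hello = idf(NUM_DOCS, 1_000)  # common term
idf_world = idf(NUM_DOCS, 100)    # rarer term

print("idf(hello) =", round(idf_hello, 3))
print("idf(world) =", round(idf_world, 3))
```

The rarer term world gets the larger weight, which is why the document matching world wins.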

Field-length norm

Field-length norm: the longer the value of a field, the weaker the relevance of a match in it.

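An example:
Search request: hello world
document1: { "title": "hello article", "content": "babaaba….. (10,000 words)" }
document2: { "title": "my article", "content": "blablabala…. (10,000 words), hi world" }
Assume hello and world appear equally often across the whole index. document1 is still more relevant, because its match is in the title field, whose value is much shorter than the content field.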
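
The effect of field length can be sketched the same way. This assumes the norm documented for Lucene's classic similarity, 1 / sqrt(number of terms in the field); the term counts below (2 for the title, about 10,002 for the content) are read off the example and are approximate.

```python
import math

def field_length_norm(num_terms):
    # Assumed formula: norm = 1 / sqrt(numTerms); shorter fields score higher.
    return 1.0 / math.sqrt(num_terms)

title_norm = field_length_norm(2)         # "hello article"
content_norm = field_length_norm(10_002)  # roughly 10,000 words plus "hi world"

print("title norm   =", round(title_norm, 4))
print("content norm =", round(content_norm, 4))
```

A match in the two-word title is weighted far more heavily than a match buried in a 10,000-word content field.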

How the _score of a query is calculated

Syntax:

GET /index/type/_search?explain
{
  "query": {
    "match": {
      "field": "text"
    }
  }
}

Analyzing how a single document was matched

Syntax:

GET /index/type/id/_explain
{
  "query": {
    "match": {
      "field": "text"
    }
  }
}
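
Both requests return the score breakdown as a nested tree in which each node carries a value, a description, and a list of details. Reading that tree as raw JSON is tedious, so a small helper like the one below can flatten it into indented lines. This is a sketch only: format_explanation is a hypothetical helper, and the sample dictionary is a hand-written placeholder with made-up numbers, not real Elasticsearch output.

```python
def format_explanation(node, depth=0):
    """Flatten a nested explanation node into indented text lines."""
    lines = ["  " * depth + f"{node['value']} : {node['description']}"]
    # Recurse into child nodes, indenting one level deeper each time.
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Placeholder input shaped like an explanation tree (values are invented).
sample = {
    "value": 2.0,
    "description": "sum of:",
    "details": [
        {"value": 1.5, "description": "weight(field:hello)", "details": []},
        {"value": 0.5, "description": "weight(field:world)", "details": []},
    ],
}

for line in format_explanation(sample):
    print(line)
```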