Elasticsearch-92-Querying term vectors

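Introduction to term vectors

Term vectors can be thought of as statistics about the terms in a document's fields.

The information you can retrieve includes: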

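Term information: position, start_offset, end_offset, and the term's payloads (mainly used for custom per-term weighting)
Term statistics: doc_freq (the number of documents that contain the term), ttf (total term frequency, the number of occurrences of the term across all documents), and term_freq (the number of occurrences of the term in the current document)
Field statistics: sum_doc_freq (the sum of doc_freq over all terms in the field), sum_ttf (the sum of ttf over all terms in the field), and doc_count (the number of documents that contain the field)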
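
To make the position and offset fields concrete, here is a minimal Python sketch that mimics a whitespace tokenizer followed by lowercasing. It only illustrates what the fields mean and is not how Elasticsearch computes them internally; the function name whitespace_tokens is made up for this example.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace and lowercase, recording position and character offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "term": match.group().lower(),
            "position": position,           # ordinal of the token within the field
            "start_offset": match.start(),  # index of the token's first character
            "end_offset": match.end(),      # index just past the token's last character
        })
    return tokens

for token in whitespace_tokens("hello test test test "):
    print(token)
```

Each repeated word yields its own token with its own position and offsets, even though all of them map to the same term.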

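By default these statistics are computed per shard. Setting dfs to true returns statistics gathered across all shards, but this carries a performance cost and is not recommended. Parameters also let you filter what is returned, so that only the parts you are interested in come back.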
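
The difference between shard-level and index-wide statistics can be sketched in a few lines of Python. The shard layout and the doc_freq helper below are hypothetical and exist only for illustration; this is not Elasticsearch's implementation.

```python
from collections import Counter

# Hypothetical layout: two shards, each holding already-tokenized documents.
shards = {
    "shard_0": [["hello", "test", "test"], ["hello"]],
    "shard_1": [["test", "other"]],
}

def doc_freq(docs):
    """Count, for each term, how many of the given documents contain it."""
    freq = Counter()
    for tokens in docs:
        freq.update(set(tokens))
    return freq

per_shard = {name: doc_freq(docs) for name, docs in shards.items()}
# Index-wide view: add up the per-shard counts.
global_freq = sum(per_shard.values(), Counter())

print(per_shard["shard_0"]["test"])  # what a request answered by shard_0 alone would see
print(global_freq["test"])           # the figure across all shards
```

Gathering the index-wide figure means consulting every shard, which is where the extra cost comes from.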

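There are two ways to obtain term vector information:

index-time: configure it in the mapping when creating the index, and the term and field statistics are generated as documents are indexed
query-time: nothing has to be set up in advance; just ask for term vectors when querying, and they are computed on the fly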

Generating term vectors at index time

Create the index

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

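The bottom of the request defines an analyzer, and settings also sets the number of shards. The mappings section above it defines two fields. Note that text has term_vector configured, while fullname does not.

Add two documents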

PUT /my_index/my_type/1
{
  "fullname" : "Leo Li",
  "text" : "hello test test test "
}

PUT /my_index/my_type/2
{
  "fullname" : "Leo Li",
  "text" : "other hello test ..."
}

Query the term vectors

GET /my_index/my_type/1/_termvectors
{
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

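The _termvectors request returns the terms in the text field of the document whose id is 1. The offsets, payloads, positions and other flags below control whether the corresponding data appears in the response.

Response: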

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 9,
  "term_vectors": {
    "text": {
      "field_statistics": {
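        "sum_doc_freq": 6,   // sum of doc_freq over every term in this field
        "doc_count": 2,   // number of documents that contain this field
        "sum_ttf": 8    // sum of ttf over every term in this field
      },
      "terms": {   // every term in this field of the requested document
        "hello": { // the term itself
          "doc_freq": 2,  // number of documents that contain this term
          "ttf": 2, // total number of occurrences of this term across all documents
          "term_freq": 1, // number of occurrences of this term in the current document
          "tokens": [  // a term can occur several times in the document; each occurrence is a token
            {
              "position": 0,  // token position
              "start_offset": 0, // start character offset
              "end_offset": 5,  // end character offset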
              "payload": "d29yZA=="
            }
          ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 15,
              "payload": "d29yZA=="
            },
            {
              "position": 3,
              "start_offset": 16,
              "end_offset": 20,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
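
The numbers in this response can be reproduced by hand. The following Python sketch recomputes them from the two sample documents, using a plain whitespace split plus lowercasing as a stand-in for fulltext_analyzer. It is a rough model for checking the arithmetic, not Elasticsearch's own code. The last line decodes the base64 payload; it comes out as the token type recorded by the type_as_payload filter.

```python
import base64
from collections import Counter

docs = {
    "1": "hello test test test ",
    "2": "other hello test ...",
}

# Stand-in for fulltext_analyzer: whitespace split plus lowercase.
analyzed = {doc_id: text.lower().split() for doc_id, text in docs.items()}

term_freq = Counter(analyzed["1"])  # occurrences within document 1
doc_freq = Counter()                # number of documents containing each term
ttf = Counter()                     # occurrences across all documents
for tokens in analyzed.values():
    doc_freq.update(set(tokens))
    ttf.update(tokens)

field_statistics = {
    "sum_doc_freq": sum(doc_freq.values()),
    "doc_count": len(analyzed),
    "sum_ttf": sum(ttf.values()),
}

print(field_statistics)
for term in ("hello", "test"):
    print(term, doc_freq[term], ttf[term], term_freq[term])

# Payloads come back base64-encoded.
print(base64.b64decode("d29yZA==").decode("utf-8"))
```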

Querying term vectors at query time

When we created the index above we did not configure term_vector for fullname, so the term vectors for fullname are generated at query time.

The syntax is the same as before

GET /my_index/my_type/1/_termvectors
{
  "fields" : ["fullname"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

The response has the same shape as the one above. In general, if circumstances allow, query-time term vectors are good enough.

Term vectors for a manually supplied doc

Request:

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

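Here a doc is supplied by hand. Elasticsearch does not look this document up; instead, you are specifying the terms you want to probe. The request above targets the text field, so the text value in doc is analyzed, and for each resulting term the statistics are computed against the documents that already exist in the index.

This is quite useful: you can hand-pick the terms whose statistics you want to inspect.

Specifying the analyzer manually when generating term vectors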

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer" : {
    "text": "standard"
  }
}

Everything is the same as before, except that an analyzer is specified at the end of the request.

terms filter

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "filter" : {
    "max_num_terms" : 3,
    "min_term_freq" : 1,
    "min_doc_freq" : 1
  }
}

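The request adds a filter parameter. Commonly used options are:

max_num_terms: the maximum number of terms returned per field
min_term_freq: the minimum term frequency; terms that occur fewer times than this in the field are ignored
max_term_freq: the maximum term frequency; terms that occur more often than this are ignored
min_doc_freq: the minimum document frequency; terms that appear in fewer documents than this are ignored
max_doc_freq: the maximum document frequency; terms that appear in more documents than this are ignored
min_word_length: the minimum word length; shorter words are ignored
max_word_length: the maximum word length; longer words are ignored

In other words, the filter uses term statistics to narrow the term vector output down to what you want to see.
For example, you can drop terms whose frequency is too low.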
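
The thresholds can be pictured as a simple pass over the per-term statistics. The sketch below is a simplified model: the filter_terms function is hypothetical, and ranking by term_freq is an assumption made for this example, not Elasticsearch's actual scoring.

```python
def filter_terms(terms, max_num_terms=None, min_term_freq=1, min_doc_freq=1):
    """Keep terms that pass the thresholds, then cap the number of terms.

    Assumption: terms are ranked by term_freq as a simple stand-in;
    this is not Elasticsearch's actual scoring.
    """
    kept = {
        term: stats
        for term, stats in terms.items()
        if stats["term_freq"] >= min_term_freq and stats["doc_freq"] >= min_doc_freq
    }
    ranked = sorted(kept, key=lambda term: kept[term]["term_freq"], reverse=True)
    if max_num_terms is not None:
        ranked = ranked[:max_num_terms]
    return {term: kept[term] for term in ranked}

terms = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 3, "doc_freq": 2},
}
print(filter_terms(terms, max_num_terms=3, min_term_freq=2))
```

With min_term_freq set to 2, hello is dropped because it occurs only once in the document, and only test remains.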

multi term vector

Specifying index, type and id in the request body

GET _mtermvectors
{
   "docs": [
      {
         "_index": "my_index",
         "_type": "my_type",
         "_id": "2",
         "term_statistics": true
      },
      {
         "_index": "my_index",
         "_type": "my_type",
         "_id": "1",
         "fields": [
            "text"
         ]
      }
   ]
}

Specifying type and id in the request body

GET /my_index/_mtermvectors
{
   "docs": [
      {
         "_type": "my_type",
         "_id": "2",
         "fields": [
            "text"
         ],
         "term_statistics": true
      },
      {
         "_type": "my_type",
         "_id": "1"
      }
   ]
}

Specifying only the id in the request body

GET /my_index/my_type/_mtermvectors
{
   "docs": [
      {
         "_id": "2",
         "fields": [
            "text"
         ],
         "term_statistics": true
      },
      {
         "_id": "1"
      }
   ]
}
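
When the index and type are fixed in the URL, the request body is just a list of per-document entries, so it is easy to generate. A minimal Python sketch follows; the helper name build_mtermvectors_body is made up for this example, and the code only builds the JSON body without sending any request.

```python
import json

def build_mtermvectors_body(doc_ids, fields=None, term_statistics=False):
    """Build a _mtermvectors request body for documents that share one index and type."""
    docs = []
    for doc_id in doc_ids:
        entry = {"_id": doc_id}
        if fields:
            entry["fields"] = list(fields)
        if term_statistics:
            entry["term_statistics"] = True
        docs.append(entry)
    return {"docs": docs}

body = build_mtermvectors_body(["2", "1"], fields=["text"], term_statistics=True)
print(json.dumps(body, indent=2))
```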

Supplying docs directly in the request body

GET /_mtermvectors
{
   "docs": [
      {
         "_index": "my_index",
         "_type": "my_type",
         "doc" : {
            "fullname" : "Leo Li",
            "text" : "hello test test test"
         }
      },
      {
         "_index": "my_index",
         "_type": "my_type",
         "doc" : {
           "fullname" : "Leo Li",
           "text" : "other hello test ..."
         }
      }
   ]
}

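This works much like the multi-type search pattern: the index and type can be given in the URL or inside each entry of the request body.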