Elasticsearch-31-Manually creating an index and customizing the analyzer

Indices

Creating an index

Syntax:

PUT /index
{
    "settings":{
        // any settings...
    },
    "mappings":{
        "type1":{
            // any settings...
        },
        "type2":{
            // any settings...
        }
    }
}

Example:

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,  // number of primary shards
    "number_of_replicas": 0 // number of replica shards
  },
  "mappings": {
    "my_type":{
      "properties": {
        "field1":{
          "type": "text"
        }
      }
    }
  }
}
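
The same request can also be sent from a script rather than the console. Below is a minimal Python sketch that only builds the HTTP request for the example above and does not send it. The function name build_create_index_request and the address http://localhost:9200 are assumptions for illustration, and the body keeps the my_type mapping type, so it assumes an Elasticsearch version that still supports mapping types (before 7.0).

```python
import json
from urllib.request import Request


def build_create_index_request(index, shards=1, replicas=0,
                               base_url="http://localhost:9200"):
    """Build (but do not send) a PUT /<index> request.

    base_url is an assumed local node address; adjust it for your cluster.
    """
    body = {
        "settings": {
            "number_of_shards": shards,      # number of primary shards
            "number_of_replicas": replicas,  # number of replica shards
        },
        "mappings": {
            "my_type": {
                "properties": {
                    "field1": {"type": "text"},
                },
            },
        },
    }
    return Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


create_request = build_create_index_request("my_index")
print(create_request.get_method(), create_request.full_url)
print(create_request.data.decode("utf-8"))
```

Passing the returned object to urllib.request.urlopen would actually send it. Note that a real JSON body cannot contain the // comments used in the console examples here.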

Modifying an index

Syntax:

PUT /index/_settings
{
  // any settings
}

Example:

PUT /my_index/_settings
{
  "number_of_replicas": 1   // change the number of replica shards
}

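Deleting an index

DELETE /index
DELETE /index1,index2
DELETE /index_*   // delete by wildcard
DELETE /_all    // delete all indices

Setting action.destructive_requires_name: true in elasticsearch.yml makes destructive requests name their indices explicitly, so from then on DELETE /_all and wildcard deletes are rejected.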
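
To make the rule concrete, here is a small Python sketch of the check that this setting implies. It is only an illustration of the rule, not Elasticsearch's actual implementation, and the function name is_delete_target_allowed is made up for this example.

```python
def is_delete_target_allowed(target, requires_name=True):
    """Return True if DELETE /<target> names every index explicitly.

    Illustrative only: mimics the effect of
    action.destructive_requires_name, not the real server code.
    """
    if not requires_name:
        return True
    for name in target.split(","):
        # _all and wildcard expressions do not name an index explicitly
        if name == "_all" or "*" in name:
            return False
    return True


for target in ["my_index", "index1,index2", "index_*", "_all"]:
    print(target, is_delete_target_allowed(target))
```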

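Analyzers

Modifying the analyzer

As mentioned earlier, the default analyzer in ES is standard. It is made up of the following components:
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it and so on

Let's create a new index and enable the English stop token filter. If my_index from the earlier example still exists, delete it first with DELETE /my_index, because creating an index that already exists fails.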

PUT /my_index
{
  "settings": {
    "analysis": {  // analysis settings
      "analyzer": { // analyzer definitions
        "es_std":{  // custom name
          "type":"standard", // analyzer type
          "stopwords":"_english_"
        }
      }
    }
  }
}

Once that succeeds, we can test it with the analyzer test method described earlier:

GET /my_index/_analyze
{
  "analyzer": "es_std",  // the analyzer name we defined above
  "text": "a dog is in the house"
}

Response:

{
  "tokens": [
    {
      "token": "dog",
      "start_offset": 2,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "house",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}

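As you can see, the stop words a, is, in and the have been removed. The positions of the remaining tokens (1 and 5) still reflect where the words stood in the original text.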
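
To see why the response looks like this, the pipeline can be approximated in a few lines of Python. This is a rough sketch and not the real analyzer: the regular expression \w+ stands in for the standard tokenizer, and ENGLISH_STOP_WORDS is, as far as I know, the default English list that _english_ refers to. The function name is an assumption for this example.

```python
import re

# Assumed contents of the _english_ stop word list (Lucene's default set)
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}


def analyze_standard_with_stop(text, stopwords=ENGLISH_STOP_WORDS):
    """Approximate tokenizer -> lowercase filter -> stop filter."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        token = match.group().lower()  # lowercase token filter
        if token in stopwords:
            continue                   # stop filter: drop it, keep the gap
        tokens.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


std_tokens = analyze_standard_with_stop("a dog is in the house")
for item in std_tokens:
    print(item)
```

The sketch yields dog at position 1 (offsets 2 to 5) and house at position 5 (offsets 16 to 21), which matches the response above.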

Building your own analyzer

First, delete the index we just created:

DELETE /my_index

Then define a custom analyzer by hand:


PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {  // character filters
        "&_to_and":{
          "type":"mapping",
          "mappings":["& => and "]  // convert & to and
        }
      },
      "filter": {
        "my_stop_words":{  // custom stop word filter
          "type":"stop",
          "stopwords":["the","a"]  // words to remove
        }
      },
      "analyzer": {
        "my_analyzer":{ // custom name
          "type":"custom",
          "char_filter":["html_strip","&_to_and"], // strip HTML, then apply the &_to_and filter defined above
          "tokenizer":"standard",
          "filter":["lowercase","my_stop_words"] // lowercase, then apply the stop word filter defined above
        }
      }
    }
  }
}

Once it has run, let's test it:

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

Response:

{
  "tokens": [
    {
      "token": "tomandjerry",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "are",
      "start_offset": 10,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "friend",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "in",
      "start_offset": 23,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "house",
      "start_offset": 30,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "haha",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}

The stop words a and the were removed, & was converted to and, the <a> tag was stripped, and the uppercase HAHA at the end was lowercased.
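
The order of the stages matters here: character filters run on the raw text first, then the tokenizer, then the token filters. The Python sketch below imitates that order to reproduce the tokens and positions from the response. It is a simplification under several assumptions: the regular expressions are crude stand-ins for html_strip and the standard tokenizer, the replacement is applied as and without the trailing space (which is what the response above suggests), and offsets are left out because the real character filters also correct them. The function name analyze_custom is made up for this example.

```python
import re


def analyze_custom(text, stopwords=("the", "a")):
    """Approximate my_analyzer: char filters -> tokenizer -> token filters."""
    # char_filter 1: crude stand-in for html_strip
    text = re.sub(r"<[^>]*>", " ", text)
    # char_filter 2: the &_to_and mapping
    text = text.replace("&", "and")
    # tokenizer: runs of word characters approximate the standard tokenizer
    words = re.findall(r"\w+", text)
    tokens = []
    for position, word in enumerate(words):
        word = word.lower()          # lowercase token filter
        if word in stopwords:        # my_stop_words token filter
            continue
        tokens.append((word, position))
    return tokens


custom_tokens = analyze_custom(
    "tom&jerry are a friend in the house, <a>, HAHA!!"
)
print(custom_tokens)
```

Because & is replaced before tokenizing, tom&jerry becomes the single token tomandjerry instead of being split at the &. Positions 2 and 5 are missing because that is where a and the used to be.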

Using the custom analyzer

The custom analyzer defined above is now ready to use. To make a field of a type use it, set the analyzer in that field's mapping:

PUT /my_index/_mapping/my_type 
{
  "properties": {
    "content":{  // field name
      "type": "text",
      "analyzer": "my_analyzer" // analyzer name
    }
  }
}
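
One way to check that the field really picked up the analyzer is to call _analyze with a field instead of an analyzer name, so that ES uses whatever analyzer is mapped to that field. The Python sketch below only builds such a request and does not send it. The function name and the address http://localhost:9200 are assumptions for illustration.

```python
import json
from urllib.request import Request


def build_field_analyze_request(index, field, text,
                                base_url="http://localhost:9200"):
    """Build (but do not send) an _analyze request for a mapped field.

    base_url is an assumed local node address; adjust it for your cluster.
    """
    body = {"field": field, "text": text}
    return Request(
        url=f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


analyze_request = build_field_analyze_request(
    "my_index", "content", "tom&jerry are a friend in the house"
)
print(analyze_request.get_method(), analyze_request.full_url)
print(analyze_request.data.decode("utf-8"))
```

If the mapping was applied, the tokens returned for the content field should match what my_analyzer produces for the same text.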