Elasticsearch-19: Core Principles of the Inverted Index

Scenario

Suppose we have two documents.

document1: I really liked my small dogs, and I think my mom also liked them.

document2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

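Step 1: Tokenize the text and build an initial inverted index

The text of both documents is split into individual words (tokens), for example like this: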

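word      document1   document2
I         *           *
really    *
liked     *           *
my        *           *
small     *
dogs      *           *
and       *
think     *
mom       *           *
also      *
them      *
He                    *
never                 *
any                   *
so                    *
hope                  *
that                  *
will                  *
not                   *
expect                *
me                    *
to                    *
him                   *

(* marks a document that contains the word.)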
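
The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not how Elasticsearch implements it: the `tokenize` and `build_index` helpers are hypothetical names, and the tokenizer simply extracts runs of letters without any lowercasing or stemming, to mirror the un-normalized table.

```python
import re
from collections import defaultdict

docs = {
    "document1": "I really liked my small dogs, and I think my mom also liked them.",
    "document2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    # Assumption: a token is any run of ASCII letters; no normalization yet.
    return re.findall(r"[A-Za-z]+", text)

def build_index(documents):
    # Map each word to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

index = build_index(docs)
for word, doc_ids in index.items():
    print(f"{word:8} {sorted(doc_ids)}")
```

Each printed line corresponds to one row of the table: the word, followed by the documents whose column would be marked.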

If we search for mother like little dog at this point, we get no results at all.
The search query is first split into terms:
mother
like
little
dog

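When these terms are matched against the inverted index above, not a single one of them is found: the index contains mom, liked, small, and dogs, but not mother, like, little, or dog. This is clearly not the search result we want.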

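In fact, building an inverted index involves one more step: normalization. This covers tense conversion, singular/plural forms, synonyms, upper/lower case, and so on. Each token is processed accordingly, which raises the chance that a later search finds the related documents.

The inverted index after normalization: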

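word      document1   document2   normalization
I         *           *
really    *
like      *           *           liked -> like
my        *           *
little    *                       small -> little
dog       *           *           dogs -> dog
and       *
think     *
mom       *           *
also      *
them      *
He                    *
never                 *
any                   *
so                    *
hope                  *
that                  *
will                  *
not                   *
expect                *
me                    *
to                    *
him                   *

(* marks a document that contains the word.)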

Now run the same search, mother like little dog, again. The query is tokenized and then normalized:
mother -> mom
like -> like
little -> little
dog -> dog

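Matching these normalized terms against the inverted index above now returns both document1 and document2: document1 contains all four terms, and document2 contains mom, like, and dog.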
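
The effect of normalization on this search can be sketched in Python as follows. Everything here is an illustrative assumption rather than Elasticsearch's actual behavior: the `NORMALIZE` lookup table is a hand-written stand-in for the stemming and synonym handling that an Elasticsearch analyzer would perform, the helper names are hypothetical, and a document counts as a hit if it contains at least one query term. Lowercasing is left out so that the index matches the table above.

```python
import re
from collections import defaultdict

docs = {
    "document1": "I really liked my small dogs, and I think my mom also liked them.",
    "document2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical normalization rules covering only the words in this example.
NORMALIZE = {"liked": "like", "small": "little", "dogs": "dog", "mother": "mom"}

def analyze(text, normalize=True):
    # Tokenize into runs of letters, then optionally normalize each token.
    tokens = re.findall(r"[A-Za-z]+", text)
    if not normalize:
        return tokens
    return [NORMALIZE.get(token, token) for token in tokens]

def build_index(documents, normalize=True):
    # Map each (possibly normalized) term to the documents containing it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text, normalize):
            index[term].add(doc_id)
    return index

def search(index, query, normalize=True):
    # Assumption: a document matches if it contains any of the query terms.
    hits = set()
    for term in analyze(query, normalize):
        hits |= index.get(term, set())
    return sorted(hits)

query = "mother like little dog"
raw_hits = search(build_index(docs, normalize=False), query, normalize=False)
normalized_hits = search(build_index(docs), query)

print(raw_hits)         # []
print(normalized_hits)  # ['document1', 'document2']
```

The same analysis is applied to both the documents and the query. That is the key point: the query term mother only matches because it is reduced to mom, the same form stored in the index.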