Scenario
Suppose we have two documents.
document1: I really liked my small dogs, and I think my mom also liked them.
document2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
Step 1: tokenize and build an initial inverted index
The text of both documents is split into words, for example like this:
word | document1 | document2 |
---|---|---|
I | √ | √ |
really | √ | |
liked | √ | √ |
my | √ | √ |
small | √ | |
dogs | √ | √ |
and | √ | |
think | √ | |
mom | √ | √ |
also | √ | |
them | √ | |
He | | √ |
never | | √ |
any | | √ |
so | | √ |
hope | | √ |
that | | √ |
will | | √ |
not | | √ |
expect | | √ |
me | | √ |
to | | √ |
him | | √ |
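A minimal sketch of this step in Python, assuming a simple tokenizer that splits on non-letter characters. The names `tokenize` and `build_index` are illustrative and not taken from any particular search engine.

```python
import re
from collections import defaultdict

docs = {
    "document1": "I really liked my small dogs, and I think my mom also liked them.",
    "document2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    # Assumption: a token is a run of ASCII letters; no lowercasing or stemming yet.
    return re.findall(r"[A-Za-z]+", text)

def build_index(docs):
    # Map each distinct token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

index = build_index(docs)
print(sorted(index["liked"]))  # ['document1', 'document2']
print(sorted(index["small"]))  # ['document1']
```

Each key of `index` corresponds to one row of the table, and the set stored under it corresponds to the check marks in that row.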
If we now search for mother like little dog, we get no results.
The query is first split into terms:
mother
like
little
dog
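Looking these terms up in the inverted index above matches nothing: the index contains liked, small, dogs and mom, but not like, little, dog or mother. This is clearly not the search result we want.
In fact, building an inverted index involves one more step: normalization. Each token is processed for tense, plural forms, synonyms, letter case and so on, which raises the chance that a later search finds the related documents.
The inverted index after normalization: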
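The failed lookup can be reproduced with a short Python sketch. It assumes an index built from the raw tokens and a hypothetical `search` helper that returns every document containing at least one query term; the any-term matching is an assumption made for illustration.

```python
import re

docs = {
    "document1": "I really liked my small dogs, and I think my mom also liked them.",
    "document2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Raw-token index: term -> set of document ids, with no normalization applied.
index = {}
for doc_id, text in docs.items():
    for token in re.findall(r"[A-Za-z]+", text):
        index.setdefault(token, set()).add(doc_id)

def search(index, query):
    # Assumption: a document is a hit if it contains any one of the query terms.
    hits = set()
    for term in query.split():
        hits |= index.get(term, set())
    return hits

print(search(index, "mother like little dog"))  # set()
print(sorted(search(index, "liked dogs")))      # ['document1', 'document2']
```

The second query succeeds only because it happens to use the exact surface forms stored in the index.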
word | document1 | document2 | normalization |
---|---|---|---|
I | √ | √ | |
really | √ | | |
like | √ | √ | liked –> like |
my | √ | √ | |
little | √ | | small –> little |
dog | √ | √ | dogs –> dog |
and | √ | | |
think | √ | | |
mom | √ | √ | |
also | √ | | |
them | √ | | |
He | | √ | |
never | | √ | |
any | | √ | |
so | | √ | |
hope | | √ | |
that | | √ | |
will | | √ | |
not | | √ | |
expect | | √ | |
me | | √ | |
to | | √ | |
him | | √ | |
Now run the same search, mother like little dog, again. The query is split into terms and normalized as well:
mother –> mom
like –> like
little –> little
dog –> dog
Matching these normalized terms against the inverted index above now returns both document1 and document2.
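The whole flow can be sketched in Python as follows. `STEMS` and `SYNONYMS` are hypothetical hand-written lookup tables, just large enough for this example; a real analyzer would typically use a stemmer and a synonym dictionary instead. Lowercasing is included because letter case is one of the normalization steps listed above, although the table does not show it. The key point is that the same `analyze` function is applied both to the documents at index time and to the query at search time.

```python
import re

docs = {
    "document1": "I really liked my small dogs, and I think my mom also liked them.",
    "document2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical normalization tables (assumptions for illustration only).
STEMS = {"liked": "like", "dogs": "dog"}          # tense and plural
SYNONYMS = {"small": "little", "mother": "mom"}   # synonyms

def normalize(token):
    token = token.lower()            # letter case
    token = STEMS.get(token, token)  # tense / plural
    return SYNONYMS.get(token, token)

def analyze(text):
    # Tokenize, then normalize every token.
    return [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]

# Build the index from normalized terms.
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(index, query):
    # The query goes through the same analysis as the documents.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

print(analyze("mother like little dog"))                # ['mom', 'like', 'little', 'dog']
print(sorted(search(index, "mother like little dog")))  # ['document1', 'document2']
```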