Scenario
Suppose we have two documents.
document1: I really liked my small dogs, and I think my mom also liked them.
document2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
Step 1: Tokenize and build an initial inverted index
The text of both documents is tokenized, for example like this:
| word | document1 | document2 |
|---|---|---|
| I | √ | √ |
| really | √ | |
| liked | √ | √ |
| my | √ | √ |
| small | √ | |
| dogs | √ | √ |
| and | √ | |
| think | √ | |
| mom | √ | √ |
| also | √ | |
| them | √ | |
| He | | √ |
| never | | √ |
| any | | √ |
| so | | √ |
| hope | | √ |
| that | | √ |
| will | | √ |
| not | | √ |
| expect | | √ |
| me | | √ |
| to | | √ |
| him | | √ |
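The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not how Elasticsearch implements it: the `tokenize` and `build_index` helpers are hypothetical names, and the tokenizer simply splits on non-letter characters.

```python
import re
from collections import defaultdict

docs = {
    "document1": "I really liked my small dogs, and I think my mom also liked them.",
    "document2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    # Assumption: a term is any run of ASCII letters; case is kept as-is.
    return re.findall(r"[A-Za-z]+", text)

def build_index(documents):
    # Map each term to the set of document names that contain it.
    index = defaultdict(set)
    for name, text in documents.items():
        for term in tokenize(text):
            index[term].add(name)
    return index

index = build_index(docs)
print(sorted(index["liked"]))  # ['document1', 'document2']
print(sorted(index["small"]))  # ['document1']
```

Each key of `index` corresponds to one row of the table, and the set stored under it lists the columns marked with √.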
If we now search for mother like little dog, we get no results at all.
The query is first split into terms:
mother
like
little
dog
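Looking these terms up in the inverted index above, not a single one matches: the index contains liked, dogs, small and mom, but not like, dog, little or mother. This is clearly not the search result we want.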
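The failed lookup can be sketched as follows. This is an illustrative sketch under simple assumptions: `tokenize` and `search` are hypothetical helpers, and a document counts as a hit if it contains any one of the query terms exactly.

```python
import re

docs = {
    "document1": "I really liked my small dogs, and I think my mom also liked them.",
    "document2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    # Assumption: a term is any run of ASCII letters; no normalization yet.
    return re.findall(r"[A-Za-z]+", text)

# term -> set of documents containing that exact term
index = {}
for name, text in docs.items():
    for term in tokenize(text):
        index.setdefault(term, set()).add(name)

def search(query):
    # Union of the document sets of every query term found in the index.
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return hits

print(search("mother like little dog"))  # set()
```

Because the lookup is an exact string comparison, `like` does not match `liked` and `dog` does not match `dogs`, so the result set stays empty.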
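In fact, building the inverted index involves one more step: normalization. Each extracted term is processed for tense, plural forms, synonyms, letter case and so on, which raises the chance that a later search finds the related documents.
The inverted index after normalization: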
| word | document1 | document2 | normalization |
|---|---|---|---|
| I | √ | √ | |
| really | √ | | |
| like | √ | √ | liked –> like |
| my | √ | √ | |
| little | √ | | small –> little |
| dog | √ | √ | dogs –> dog |
| and | √ | | |
| think | √ | | |
| mom | √ | √ | |
| also | √ | | |
| them | √ | | |
| He | | √ | |
| never | | √ | |
| any | | √ | |
| so | | √ | |
| hope | | √ | |
| that | | √ | |
| will | | √ | |
| not | | √ | |
| expect | | √ | |
| me | | √ | |
| to | | √ | |
| him | | √ | |
Now run the same query, mother like little dog, again. The query is tokenized and each term is normalized:
mother –> mom
like –> like
little –> little
dog –> dog
Matching these normalized terms against the inverted index above now returns both document1 and document2.
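A minimal sketch of the whole flow, applying the same normalization at index time and at query time. The `NORMALIZE` table is a hypothetical hand-written rule set that covers only this example; a real analyzer would use stemmers, synonym filters and lowercasing instead. Letter case is left untouched here so that the terms match the table above.

```python
import re

docs = {
    "document1": "I really liked my small dogs, and I think my mom also liked them.",
    "document2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical rules for this example only: tense, plural and synonyms.
NORMALIZE = {
    "liked": "like",
    "dogs": "dog",
    "small": "little",
    "mother": "mom",
}

def analyze(text):
    # Tokenize on runs of letters, then map each term to its normalized form.
    terms = re.findall(r"[A-Za-z]+", text)
    return [NORMALIZE.get(term, term) for term in terms]

# Build the index from normalized terms.
index = {}
for name, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(name)

def search(query):
    # The query goes through the same analyze() step as the documents.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

print(sorted(search("mother like little dog")))  # ['document1', 'document2']
```

The key design point is that documents and queries pass through the same `analyze` function; if only one side were normalized, the terms would still fail to line up.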