Haystack Highlighting and Filtering

September 5, 2017, 21:47 · Django, Python

django-haystack is an open-source search framework for Django.

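  • The framework supports four full-text search engines: Solr, Elasticsearch, Whoosh, and Xapian (see the official website for details).
  • Whoosh: a full-text search engine written in pure Python. Its performance does not match Sphinx, Xapian, or Elasticsearch, but it needs no binary packages, so the application will not crash for obscure reasons. For a small site, Whoosh is more than enough.
  • jieba: a free Chinese word-segmentation package. If it does not suit you, commercial alternatives exist.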

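I won't go over the configuration here; the official documentation and plenty of online tutorials cover it. The focus this time is highlighting the search keyword in the articles a search returns.
Haystack provides the {% highlight %} tag for this: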

{% highlight <text_block> with <query> [css_class "class_name"] [html_tag "span"] [max_length 200] %}  

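It wraps each occurrence of query in text_block in an html_tag element carrying css_class, and max_length is the length of the text finally returned. By default html_tag is span, css_class is highlighted, and max_length is 200. You can then style that class with CSS to get the highlight effect you want.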
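To make the wrapping step concrete, here is a minimal sketch in plain Python. It is not Haystack's implementation: the function name wrap_matches is made up for illustration, and it ignores max_length entirely. It only shows how each query word ends up inside the tag with the CSS class.

```python
import re


def wrap_matches(text, query, html_tag="span", css_class="highlighted"):
    """Wrap every case-insensitive occurrence of a query word in an HTML tag.

    Hypothetical helper for illustration only; the real tag also trims the
    text to max_length.
    """
    words = [re.escape(word) for word in query.split()]
    if not words:
        return text

    pattern = re.compile("|".join(words), re.IGNORECASE)
    open_tag = '<%s class="%s">' % (html_tag, css_class)
    close_tag = "</%s>" % html_tag
    # Keep the original casing of the matched text inside the tag.
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)


print(wrap_matches("Django haystack search", "haystack"))
# -> Django <span class="highlighted">haystack</span> search
```

A CSS rule for the highlighted class, for example a background colour, is then enough to make the matches stand out.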

# ... (preceding template omitted)
{% highlight post.object.title with query %}
{% highlight post.object.body with query %}
# ... (rest of template omitted)

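However, I noticed by accident that when the search term appears in the title, everything before the term is replaced with "...", whereas I want the title to be shown in full. In addition, because I write articles in Markdown in the admin backend, the search results display the Markdown markup.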
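The Markdown half of the problem comes from the fact that stripping tags only removes HTML tags, and raw Markdown contains none. Below is a minimal sketch of that, using Python's standard-library HTMLParser as a stand-in for Django's strip_tags. The helper name strip_html_tags and the sample strings are made up for illustration, and the HTML string is hand-written to resemble what a Markdown renderer would produce.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html_tags(value):
    """Stand-in for a tag stripper: drop HTML tags, keep the text."""
    parser = _TextOnly()
    parser.feed(value)
    parser.close()
    return "".join(parser.parts)


raw_markdown = "## Install\nRun **pip** first"
# Assumed rendering of the Markdown above, written by hand for this sketch.
rendered_html = "<h2>Install</h2>\n<p>Run <strong>pip</strong> first</p>"

print(strip_html_tags(raw_markdown))   # Markdown markers survive untouched
print(strip_html_tags(rendered_html))  # only the plain text remains
```

So the Markdown has to be rendered to HTML first, and only then can tag stripping clean it up.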

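So I went and read the source of the highlight tag. It turns out that when there is text before the matched keyword (that is, when the start offset, start_offset, is greater than 0), the source drops that leading text and replaces it with "...".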
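The behaviour described above can be reproduced with a short sketch. This is a simplification under stated assumptions: it handles a single match only, skips the actual highlighting markup, and the name default_snippet is hypothetical.

```python
def default_snippet(text, query, max_length=200):
    """Simplified single-match version of the stock windowing behaviour."""
    start_offset = text.lower().find(query.lower())
    if start_offset == -1:
        # No match: fall back to the beginning of the text.
        start_offset = 0

    end_offset = start_offset + max_length
    snippet = text[start_offset:end_offset]

    # Anything cut off in front of the window is replaced by an ellipsis.
    if start_offset > 0:
        snippet = "..." + snippet

    # The same happens for text cut off behind the window.
    if end_offset < len(text):
        snippet = snippet + "..."

    return snippet


title = "Notes on Django haystack highlighting"
print(default_snippet(title, "haystack"))
# -> ...haystack highlighting
```

For a body excerpt this is reasonable, but for a short title it throws away the words in front of the match.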
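To fix this I copied out the highlight source. Rather than patch the library directly, I defined my own tag under a new name and added two options.
Add new_highlight.py and highlighting.py under templatetags (the originals live in haystack/templatetags/highlight.py and haystack/utils/highlighting.py), modified as follows: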

new_highlight.py

# encoding: utf-8
from __future__ import absolute_import, division, print_function, unicode_literals

from django import template
from django.conf import settings
from django.core.exceptions import ImproperlyConfigured
from django.utils import six  # only bundled with Django < 3.0; removed in Django 3.0

from haystack.utils import importlib

register = template.Library()


class HighlightNode(template.Node):
    def __init__(self, text_block, query, html_tag=None, css_class=None, max_length=None, start_head=None,
                 filter_mark_down=None):
        self.text_block = template.Variable(text_block)
        self.query = template.Variable(query)
        self.html_tag = html_tag
        self.css_class = css_class
        self.max_length = max_length
        self.start_head = start_head
        self.filter_mark_down = filter_mark_down

        if html_tag is not None:
            self.html_tag = template.Variable(html_tag)

        if css_class is not None:
            self.css_class = template.Variable(css_class)

        if max_length is not None:
            self.max_length = template.Variable(max_length)

        if start_head is not None:
            self.start_head = template.Variable(start_head)

        if filter_mark_down is not None:
            self.filter_mark_down = template.Variable(filter_mark_down)

    def render(self, context):
        text_block = self.text_block.resolve(context)
        query = self.query.resolve(context)
        kwargs = {}

        if self.html_tag is not None:
            kwargs['html_tag'] = self.html_tag.resolve(context)

        if self.css_class is not None:
            kwargs['css_class'] = self.css_class.resolve(context)

        if self.max_length is not None:
            kwargs['max_length'] = self.max_length.resolve(context)

        if self.start_head is not None:
            kwargs['start_head'] = self.start_head.resolve(context)

        if self.filter_mark_down is not None:
            kwargs['filter_mark_down'] = self.filter_mark_down.resolve(context)

        # Handle a user-defined highlighting function.
        if hasattr(settings, 'HAYSTACK_CUSTOM_HIGHLIGHTER') and settings.HAYSTACK_CUSTOM_HIGHLIGHTER:
            # Do the import dance.
            try:
                path_bits = settings.HAYSTACK_CUSTOM_HIGHLIGHTER.split('.')
                highlighter_path, highlighter_classname = '.'.join(path_bits[:-1]), path_bits[-1]
                highlighter_module = importlib.import_module(highlighter_path)
                highlighter_class = getattr(highlighter_module, highlighter_classname)
            except (ImportError, AttributeError) as e:
                raise ImproperlyConfigured(
                    "The highlighter '%s' could not be imported: %s" % (settings.HAYSTACK_CUSTOM_HIGHLIGHTER, e))
        else:
            from .highlighting import Highlighter
            highlighter_class = Highlighter

        highlighter = highlighter_class(query, **kwargs)
        highlighted_text = highlighter.highlight(text_block)
        return highlighted_text


@register.tag
def new_highlight(parser, token):

    """
    Takes a block of text and highlights words from a provided query within that
    block of text. Optionally accepts arguments to provide the HTML tag to wrap
    highlighted word in, a CSS class to use with the tag and a maximum length of
    the blurb in characters. On top of the stock tag it also accepts
    ``start_head`` (keep the text before the first match) and
    ``filter_mark_down`` (render Markdown and strip the resulting tags).

    Syntax::

        {% new_highlight <text_block> with <query> [css_class "class_name"] [html_tag "span"] [max_length 200] [start_head True] [filter_mark_down True] %}

    Example::

        # Highlight summary with default behavior.
        {% new_highlight result.summary with request.query %}

        # Highlight summary but wrap highlighted words with a div and the
        # following CSS class.
        {% new_highlight result.summary with request.query html_tag "div" css_class "highlight_me_please" %}

        # Highlight summary but only show 40 characters.
        {% new_highlight result.summary with request.query max_length 40 %}
    """
    bits = token.split_contents()
    tag_name = bits[0]

    if not len(bits) % 2 == 0:
        raise template.TemplateSyntaxError(u"'%s' tag requires valid pairings arguments." % tag_name)

    text_block = bits[1]

    if len(bits) < 4:
        raise template.TemplateSyntaxError(u"'%s' tag requires an object and a query provided by 'with'." % tag_name)

    if bits[2] != 'with':
        raise template.TemplateSyntaxError(u"'%s' tag's second argument should be 'with'." % tag_name)

    query = bits[3]

    arg_bits = iter(bits[4:])
    kwargs = {}

    for bit in arg_bits:
        if bit == 'css_class':
            kwargs['css_class'] = six.next(arg_bits)

        if bit == 'html_tag':
            kwargs['html_tag'] = six.next(arg_bits)

        if bit == 'max_length':
            kwargs['max_length'] = six.next(arg_bits)

        if bit == 'start_head':
            kwargs['start_head'] = six.next(arg_bits)

        if bit == 'filter_mark_down':
            kwargs['filter_mark_down'] = six.next(arg_bits)

    return HighlightNode(text_block, query, **kwargs)

highlighting.py

# encoding: utf-8

from __future__ import absolute_import, division, print_function, unicode_literals

from django.utils.html import strip_tags
import markdown


class Highlighter(object):
    # Default values
    css_class = 'highlighted'
    html_tag = 'span'
    max_length = 200
    start_head = False
    filter_mark_down = False
    text_block = ''

    def __init__(self, query, **kwargs):
        self.query = query

        if 'max_length' in kwargs:
            self.max_length = int(kwargs['max_length'])

        if 'html_tag' in kwargs:
            self.html_tag = kwargs['html_tag']

        if 'css_class' in kwargs:
            self.css_class = kwargs['css_class']

        if 'start_head' in kwargs:
            self.start_head = kwargs['start_head']

        if 'filter_mark_down' in kwargs:
            self.filter_mark_down = kwargs['filter_mark_down']

        self.query_words = set([word.lower() for word in self.query.split() if not word.startswith('-')])

    def highlight(self, text_block):

        # Set filter_mark_down to True to remove Markdown markup from the search
        # result: the text is rendered to HTML and the tags are then stripped.
        if self.filter_mark_down:
            md = markdown.Markdown(extensions=[
                'markdown.extensions.extra',
                'markdown.extensions.codehilite',
            ])

            text_block = strip_tags(md.convert(text_block))

        self.text_block = strip_tags(text_block)
        highlight_locations = self.find_highlightable_words()
        start_offset, end_offset = self.find_window(highlight_locations)
        return self.render_html(highlight_locations, start_offset, end_offset)

    def find_highlightable_words(self):
        # Use a set so we only do this once per unique word.
        word_positions = {}

        # Pre-compute the length.
        end_offset = len(self.text_block)
        lower_text_block = self.text_block.lower()

        for word in self.query_words:
            if not word in word_positions:
                word_positions[word] = []

            start_offset = 0

            while start_offset < end_offset:
                next_offset = lower_text_block.find(word, start_offset, end_offset)

                # If we get a -1 out of find, it wasn't found. Bomb out and
                # start the next word.
                if next_offset == -1:
                    break

                word_positions[word].append(next_offset)
                start_offset = next_offset + len(word)

        return word_positions

    def find_window(self, highlight_locations):
        best_start = 0
        best_end = self.max_length

        # First, make sure we have words.
        if not len(highlight_locations):
            return (best_start, best_end)

        words_found = []

        # Next, make sure we found any words at all.
        for word, offset_list in highlight_locations.items():
            if len(offset_list):
                # Add all of the locations to the list.
                words_found.extend(offset_list)

        if not len(words_found):
            return (best_start, best_end)

        if len(words_found) == 1:
            return (words_found[0], words_found[0] + self.max_length)

        # Sort the list so it's in ascending order.
        words_found = sorted(words_found)

        # We now have a denormalized list of all positions were a word was
        # found. We'll iterate through and find the densest window we can by
        # counting the number of found offsets (-1 to fit in the window).
        highest_density = 0

        if words_found[:-1][0] > self.max_length:
            best_start = words_found[:-1][0]
            best_end = best_start + self.max_length

        for count, start in enumerate(words_found[:-1]):
            current_density = 1

            for end in words_found[count + 1:]:
                if end - start < self.max_length:
                    current_density += 1
                else:
                    current_density = 0

                # Only replace if we have a bigger (not equal density) so we
                # give deference to windows earlier in the document.
                if current_density > highest_density:
                    best_start = start
                    best_end = start + self.max_length
                    highest_density = current_density

        return (best_start, best_end)

    def render_html(self, highlight_locations=None, start_offset=None, end_offset=None):
        # Start by chopping the block down to the proper window.
        # text_block is the full content; start_offset and end_offset are where
        # the first query match starts and where the text is cut off by length.
        text = self.text_block[start_offset:end_offset]

        # Invert highlight_locations to a location -> term list
        term_list = []

        for term, locations in highlight_locations.items():
            term_list += [(loc - start_offset, term) for loc in locations]

        loc_to_term = sorted(term_list)

        # Prepare the highlight template
        if self.css_class:
            hl_start = '<%s class="%s">' % (self.html_tag, self.css_class)
        else:
            hl_start = '<%s>' % (self.html_tag)

        hl_end = '</%s>' % self.html_tag

        # Copy the part from the start of the string to the first match,
        # and there replace the match with a highlighted version.
        # matched_so_far ends up as the end of the last query match in text.
        highlighted_chunk = ""
        matched_so_far = 0
        prev = 0
        prev_str = ""

        for cur, cur_str in loc_to_term:
            # This can be in a different case than cur_str
            actual_term = text[cur:cur + len(cur_str)]

            # Handle incorrect highlight_locations by first checking for the term
            if actual_term.lower() == cur_str:
                if cur < prev + len(prev_str):
                    continue

                # Append the text between the previous match and this one,
                # followed by the highlighted match itself.
                highlighted_chunk += text[prev + len(prev_str):cur] + hl_start + actual_term + hl_end
                prev = cur
                prev_str = cur_str

                # Keep track of how far we've copied so far, for the last step
                matched_so_far = cur + len(actual_term)

        # Don't forget the chunk after the last matched term.
        highlighted_chunk += text[matched_so_far:]

        # Only add the leading ellipsis when the head is not kept (not start_head).
        if start_offset > 0 and not self.start_head:
            highlighted_chunk = '...%s' % highlighted_chunk

        if end_offset < len(self.text_block):
            highlighted_chunk = '%s...' % highlighted_chunk

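        # Up to here the chunk does not include anything before start_offset,
        # i.e. the text before the first match (text_block[:start_offset]).
        # Prepend it when it should be shown (start_head is True).
        if self.start_head:
            highlighted_chunk = self.text_block[:start_offset] + highlighted_chunk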
        return highlighted_chunk

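These two files add two options on top of the original source. start_head defaults to False; when set to True, the text before the search keyword is no longer dropped. filter_mark_down defaults to False; when set to True, Markdown markup is filtered out of the content. Then load the newly defined new_highlight tag in the template (with {% load new_highlight %}) and use it: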

# ... (preceding template omitted)
{% new_highlight post.object.title with query start_head True %}
{% new_highlight post.object.body with query filter_mark_down True %}
# ... (rest of template omitted)
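What start_head changes can be seen in isolation with a small sketch that mirrors only the last few lines of render_html. The function name finish_chunk and the sample title are made up for illustration, and highlighting markup is left out.

```python
def finish_chunk(text_block, chunk, start_offset, end_offset, start_head=False):
    """Mirror the ellipsis and prefix handling at the end of render_html."""
    # Leading ellipsis only when the head is being dropped.
    if start_offset > 0 and not start_head:
        chunk = "..." + chunk

    # Trailing ellipsis when the window ends before the text does.
    if end_offset < len(text_block):
        chunk = chunk + "..."

    # With start_head, put the skipped head back in front of the window.
    if start_head:
        chunk = text_block[:start_offset] + chunk

    return chunk


title = "Notes on Django haystack highlighting"
start = title.find("haystack")
end = start + 200
window = title[start:end]

print(finish_chunk(title, window, start, end))
# -> ...haystack highlighting
print(finish_chunk(title, window, start, end, start_head=True))
# -> Notes on Django haystack highlighting
```

Note that with start_head set to True the whole head is prepended, so the result can be longer than max_length. That is fine for titles but is worth keeping in mind before using it on long bodies.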

Let's see how it looks after the change: