Using Haystack to index non-database content

Over on ReadTheDocs, I wanted to build search around the documentation that we’re hosting. I chose Haystack and Solr for this, because I think that combination is the best way to do search in Django these days. However, I’ve only ever used Haystack to index content that lives in the database. I thought about adding all the rendered HTML from the documentation to the database, but that would have bloated it, so it was a non-starter.

I ended up adding an ImportedFile model to the database, which contains only the metadata for each HTML file:

#!python
from django.db import models

# Project is the existing project model, defined elsewhere.
class ImportedFile(models.Model):
    project = models.ForeignKey(Project, related_name='imported_files')
    name = models.CharField(max_length=255)
    slug = models.SlugField()
    path = models.CharField(max_length=255)
    md5 = models.CharField(max_length=255)
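
As a rough illustration of the kind of metadata involved, here is a minimal sketch that walks a directory of built HTML and gathers the values those fields would hold. The `collect_file_metadata` helper, the slug rule, and the directory layout are all assumptions made for the example, not the code ReadTheDocs actually uses.

```python
import hashlib
import os
import re


def collect_file_metadata(html_root):
    """Yield one dict per .html file under html_root (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(html_root):
        for filename in sorted(filenames):
            if not filename.endswith('.html'):
                continue
            absolute = os.path.join(dirpath, filename)
            # Path relative to the project's HTML root, as stored in `path`.
            relative = os.path.relpath(absolute, html_root)
            with open(absolute, 'rb') as handle:
                digest = hashlib.md5(handle.read()).hexdigest()
            name = os.path.splitext(filename)[0]
            yield {
                'name': name,
                # Assumed slug rule: lowercase, other characters become "-".
                'slug': re.sub(r'[^a-z0-9]+', '-', name.lower()).strip('-'),
                'path': relative,
                'md5': digest,
            }
```

Each dict maps onto the non-relational fields of the model, so something like `ImportedFile.objects.create(project=project, **meta)` could store it, assuming a `project` instance is at hand.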

This allows me to link the Haystack SearchIndex to a model. The interesting part is in the SearchIndex itself, where I override the prepare_text method so that the data is read from the filesystem instead of from the database:

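#!python
import codecs
import os

# Haystack 1.x style imports and registration.
from haystack import site
from haystack.indexes import CharField, SearchIndex


class ImportedFileIndex(SearchIndex):
    text = CharField(document=True)
    author = CharField(model_attr='project__user')
    project = CharField(model_attr='project__name')
    title = CharField(model_attr='name')

    def prepare_text(self, obj):
        # Read the rendered HTML from disk instead of from a model field.
        full_path = obj.project.full_html_path
        to_read = os.path.join(full_path, obj.path.lstrip('/'))
        try:
            # The with block closes the file handle once it has been read.
            with codecs.open(to_read, encoding="utf-8", mode='r') as f:
                return f.read()
        except IOError:
            # Report the file that was actually attempted.
            print("%s not found" % to_read)

site.register(ImportedFile, ImportedFileIndex)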
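
The file-reading step can be tried on its own, outside of Haystack. Below is a minimal sketch of the same logic as a plain function. The name `read_html_for_index` is hypothetical, and returning an empty string for a missing file is an assumption made for this example rather than what the index above does. It also shows why the `lstrip('/')` matters: `os.path.join` discards everything before a component that starts with a slash.

```python
import codecs
import os


def read_html_for_index(full_html_path, relative_path):
    """Return the text of one rendered file, or '' if it is missing."""
    # Without lstrip('/'), a stored path such as '/index.html' would make
    # os.path.join throw away full_html_path entirely.
    to_read = os.path.join(full_html_path, relative_path.lstrip('/'))
    try:
        with codecs.open(to_read, encoding='utf-8', mode='r') as handle:
            return handle.read()
    except IOError:
        # Assumed fallback for the sketch: index an empty document.
        return u''
```

Pulling the logic out like this makes it easy to check against a temporary directory before wiring it into `prepare_text`.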

This means that I don’t have to bloat my database with all my rendered HTML, yet the full HTML is still stored in Solr, where it can be queried.



Hey there! I'm Eric and I work on communities in the world of software documentation. Feel free to email me if you have comments on this post!