Using Haystack to index non-database content¶
Over on ReadTheDocs, I wanted to build search around the documentation that we’re hosting. I chose Haystack and Solr for this, because it’s the best way to do search in Django these days. However, I’ve only ever used Haystack to index content that is in the database. I thought about trying to add all the rendered HTML from the documentation into the database, but that was a non-starter.
I ended up adding a ImportedFile model to the database, which would contain the metadata for the HTML file:
#!python class ImportedFile(models.Model): project = models.ForeignKey(Project, related_name='imported_files') name = models.CharField(max_length=255) slug = models.SlugField() path = models.CharField(max_length=255) md5 = models.CharField(max_length=255)
This allows me to link the SearchIndex in haystack to a model. Then the interesting part is in the Haystack SearchIndex, where I override the prepare_text method, allowing me to read the data in from the filesystem instead of from the database.
#!python class ImportedFileIndex(SearchIndex): text = CharField(document=True) author = CharField(model_attr='project__user') project = CharField(model_attr='project__name') title = CharField(model_attr='name') def prepare_text(self, obj): full_path = obj.project.full_html_path to_read = os.path.join(full_path, obj.path.lstrip('/')) try: content = codecs.open(to_read, encoding="utf-8", mode='r').read() return content except IOError: print "%s not found" % full_path site.register(ImportedFile, ImportedFileIndex)
This means that I don’t have to bloat my database with all my rendered HTML, but have the full HTML stored in Solr which works for querying.