Great idea! I was thinking about doing this myself too but you beat me to it.
Have you tested that the monkey patching at the top doesn't affect other areas of the codebase that depend on the native database connection provided by Django?
I tested this on my "prod" dataset, and the index it created was huge: 1 GB, or 30% of the size of the data/archive folder. This seemed odd, because it uses FTS5 "contentless" indexes. I'm fairly sure that SQLite is indexing very long "terms", namely the base64 strings in data: URLs, when it is given the contents of singlefile.html as the indexable content.
I thought about and experimented with several approaches to this, and I think the best approach is to (very loosely) parse singlefile.html and pass only text and some attributes to the search backend, rather than the entire HTML contents. I've opened a pull request with this approach: #1244. This resulted in a 10x reduction in index size. I think that merging this PR without somehow addressing the ballooning index size would result in issues down the road.
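To illustrate the idea, here is a minimal sketch of that kind of loose parsing, using only Python's standard-library `html.parser`. It is not the code from #1244: the class name, the function name `extract_indexable_text`, and the choice of which tags to skip and which attributes to keep are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch, not taken from the PR.
SKIPPED_TAGS = {"script", "style"}        # contents are code, not prose
INDEXED_ATTRS = {"alt", "title", "href"}  # attributes worth searching


class TextExtractor(HTMLParser):
    """Collect visible text and a few attribute values, ignoring markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        for name, value in attrs:
            # Drop data: URLs so base64 payloads never reach the index.
            if name in INDEXED_ATTRS and value and not value.startswith("data:"):
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_indexable_text(html: str) -> str:
    """Return the text that would be passed to the search backend."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With this, a page whose `<img>` carries a base64 `src` contributes only its `alt` text, and inline scripts and stylesheets contribute nothing.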
I think that Sonic doesn't have this problem because it limits maximum term length, but FTS5 doesn't have a configuration option for maximum (or minimum) term length. Or, maybe it's just the overall MAX_SONIC_TEXT_TOTAL_LENGTH? Even if that's the case, the approach of parsing and including only meaningful text content, not JavaScript code, data URLs, and markup, would improve the signal-to-noise ratio of indexed content, so searching for "html" wouldn't hit on every document that had its singlefile.html indexed.
Use SQLite's FTS5 extension to power full-text search without any
additional dependencies. FTS5 was introduced in SQLite 3.9.0,
[released][1] in 2015, so it should be available on most SQLite
installations by now.
[1]: https://www.sqlite.org/changes.html#version_3_9_0
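Because FTS5 is a compile-time option, a given SQLite build may still lack it. A minimal sketch of a runtime probe using Python's `sqlite3` module follows; the function name `fts5_available` and the table name `fts5_probe` are assumptions for this example, not part of the PR.

```python
import sqlite3


def fts5_available() -> bool:
    """Return True if this SQLite build can create FTS5 virtual tables."""
    conn = sqlite3.connect(":memory:")
    try:
        # Fails with "no such module: fts5" when the extension is missing.
        conn.execute("CREATE VIRTUAL TABLE fts5_probe USING fts5(body)")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()
```

The probe runs against a throwaway in-memory database, so it leaves the real database untouched.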
Clean up error handling, and report a better error message
on search and flush if FTS5 tables haven't yet been created.
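As a rough sketch of what such error handling could look like (not the code in this commit), a search helper can translate SQLite's "no such table" error into a message that says what to do next. The table name `snapshot_fts` and the wording of the message are assumptions for this example.

```python
import sqlite3

FTS_TABLE = "snapshot_fts"  # hypothetical table name for this sketch


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return matching rowids, or raise a readable error if the index is missing."""
    try:
        rows = conn.execute(
            f"SELECT rowid FROM {FTS_TABLE} WHERE {FTS_TABLE} MATCH ?",
            (query,),
        ).fetchall()
    except sqlite3.OperationalError as exc:
        if "no such table" in str(exc):
            raise RuntimeError(
                f"FTS5 table {FTS_TABLE!r} does not exist yet; "
                "index some content before searching."
            ) from exc
        raise  # any other SQLite error is passed through unchanged
    return [row[0] for row in rows]
```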
Add some mypy comments to clean up type-checking errors.