
The technology

So... what makes SeekSeek tick? Let's get the boring bits out of the way first:

None of that is really very interesting, but people always ask about it. Let's move on to the interesting bits!

The goal

Before we can talk about the technology, we need to talk about what the technology was built for. SeekSeek is radical software. From the ground up, it was designed to be FOSS, collaborative and community-driven, non-commercial, ad-free, and to improve the world - in the case of SeekSeek specifically, to improve on the poor state of keyword-only searches by providing highly specialized search engines instead!

But... that introduces some unusual requirements:

At the time of writing, there's only a datasheet search engine. However, the long-term goal is for SeekSeek to become a large collection of specialized search engines - each one with a tailor-made UI that's ideal for the thing being searched through. So all of the above needs to be satisfied not just for a datasheet search engine, but for a potentially unlimited series of search engines, many of which are not even on the roadmap yet!

And well, the very short version is that none of the existing options I evaluated came close to meeting these requirements. Existing scraping stacks, job queues, and so on very much tend to be designed for corporate environments with tight control over who works on what. That wasn't an option here. So let's talk about what we ended up with instead!

The scraping server

The core component in SeekSeek is the 'scraping server' - an experimental project called srap that was built specifically for SeekSeek, though it's also designed to be more generically useful. You can think of srap as a persistent job queue that's optimized for scraping.

So what does that mean? The basic idea behind srap is that you have a big pile of "items" - each item isn't much more than a unique identifier and some 'initial data' to represent the work to be done. Each item can have zero or more 'tags' assigned, which are just short strings. Crucially, none of these items do anything yet - they're really just a mapping from an identifier to some arbitrarily-shaped JSON.
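
To make that a bit more concrete, here's roughly what such an item could look like, sketched as a JavaScript object. This is just a simplified illustration - the exact field names may differ from what srap really stores:

    // Illustrative only: a srap "item" is little more than a unique identifier,
    // some arbitrary JSON describing the work to be done, and a set of tags.
    const item = {
        id: "lcsc:category:312:page:1",        // unique identifier
        data: { categoryID: 312, page: 1 },    // 'initial data': arbitrarily-shaped JSON
        tags: [ "lcsc:categoryPage" ]          // tags decide which tasks will pick it up
    };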

The real work starts with the scraper configuration. Even though it's called a 'configuration', it's really more of a codebase - you can find the configuration that SeekSeek uses here. You'll notice that it defines a number of tasks and seed items. The seed items are simply inserted automatically if they don't exist yet, and define the 'starting point' for the scraper.

The tasks, however, define what the scraper does. Every task represents one specific operation in the scraping process; typically there will be multiple tasks per source: one to find product categories, one to extract products from a category listing, one to extract data from a product page, and so on. Each of these tasks has its own concurrency settings, as well as a TTL (Time-To-Live) that defines how long the scraper should wait before revisiting an item with that task.
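
As a rough sketch of how that might hang together, a (heavily simplified) configuration could look like the following; the property names such as seed, ttl and parallelTasks are chosen for illustration and aren't necessarily the real configuration format:

    // A simplified, hypothetical scraper configuration; the real srap
    // configuration format may use different property names.
    module.exports = {
        // Seed items: inserted automatically if they don't exist yet,
        // forming the starting point of the scrape.
        seed: [
            { id: "lcsc:home", tags: [ "lcsc:home" ], data: {} }
        ],
        // Tasks: one specific operation each, with their own concurrency
        // settings and a TTL after which an item gets revisited.
        tasks: {
            "lcsc:findCategories": {
                ttl: 14 * 24 * 60 * 60 * 1000, // revisit after two weeks (illustrative format)
                parallelTasks: 1,              // at most one of these at a time
                run: async (item, { createItem }) => {
                    /* fetch the category listing, create one item per category page... */
                }
            },
            "lcsc:scrapeCategory": {
                ttl: 3 * 24 * 60 * 60 * 1000,
                parallelTasks: 4,
                run: async (item, { createItem }) => {
                    /* fetch a single category page, create one item per product... */
                }
            }
        }
    };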

Finally, what wires it all together are the tag mappings. These define which tasks should be executed for which tags - or more accurately, for all the items that are tagged with those tags. Tags associated with items are dynamic: they can be added or removed by any scraping task. This provides a huge amount of flexibility, because any task can essentially queue any other task, just by giving an item the right tag. The scraping server then makes sure that it lands at the right spot in the queue at the right time - the task itself doesn't need to care about any of that.
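
Continuing the same simplified sketch, the tag mappings could then look something like this - again, the exact shape is illustrative rather than the real format:

    // Hypothetical tag mappings: "whenever an item carries this tag, run these
    // tasks on it". A task can queue further work simply by tagging an item;
    // the scraping server takes care of the actual scheduling.
    module.exports.tagMappings = {
        "lcsc:home":         [ "lcsc:findCategories" ],
        "lcsc:categoryPage": [ "lcsc:scrapeCategory" ],
        "lcsc:product":      [ "lcsc:scrapeProduct" ]
    };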

Here's a practical example, from the datasheet search tasks:

One thing that's not mentioned above is that lcsc:scrapeCategory doesn't actually scrape all of the items for a category - it just scrapes a specific page of them! The initial lcsc:findCategories task would have created as many of these 'page tasks' as there are pages to scrape, based on the number of items a category is said to have.

More interesting, though, is that the scraping flow doesn't have to be this unidirectional - if the total number of pages could only be learned from scraping the first page, it would have been entirely possible for the lcsc:scrapeCategory task to create additional lcsc:category items! The tag-based system makes recursive discovery like this a breeze, and because everything is keyed by a unique identifier and persistent, loops are automatically prevented.
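
Here's a sketch of what that fan-out could look like inside a findCategories-style task; the createItem helper and its signature are placeholders for illustration, not necessarily srap's actual API:

    // Sketch of the fan-out described above: for every category, create one
    // item per page, so that each page becomes its own unit of work. The
    // `createItem` helper is a placeholder, not necessarily srap's real API.
    const PAGE_SIZE = 25; // hypothetical items-per-page on the source site

    async function createPageItems(categories, createItem) {
        for (const category of categories) {
            const pageCount = Math.ceil(category.productCount / PAGE_SIZE);

            for (let page = 1; page <= pageCount; page++) {
                await createItem({
                    id: `lcsc:category:${category.id}:page:${page}`,
                    tags: [ "lcsc:categoryPage" ], // this tag is what queues lcsc:scrapeCategory
                    data: { categoryID: category.id, page: page }
                });
            }
        }
    }

    // A page task could just as well create new lcsc:category items here;
    // because items are keyed by their unique identifier and persisted,
    // rediscovering a known category doesn't spiral into an endless loop.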

You'll probably have noticed that none of the above mentions HTTP requests. That's because srap doesn't care - it has no idea what HTTP even is! All of the actual scraping logic is completely defined by the configuration - and that's what makes it a codebase. This is the scraping logic for extracting products from an LCSC category, for example. This is also why each page is its own item: it allows srap to rate-limit requests despite having absolutely no hooks into the HTTP library being used, by virtue of limiting each task to a single HTTP request.
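
To illustrate, a scrapeCategory-style task could look roughly like this; the endpoint URL, response shape and createItem helper are all made up for the example, and it assumes a recent Node.js with a global fetch:

    // Sketch of a scrapeCategory-style task: exactly one HTTP request per run,
    // which is what lets srap rate-limit a source without knowing anything
    // about HTTP. The endpoint, response shape and `createItem` helper are
    // all made up for illustration; assumes Node.js 18+ with a global fetch.
    async function scrapeCategoryPage(item, { createItem }) {
        const { categoryID, page } = item.data;

        // The single HTTP request for this task run (hypothetical endpoint).
        const response = await fetch(
            `https://lcsc.example/api/category/${categoryID}/products?page=${page}`
        );
        const products = await response.json();

        // Every product found on this page becomes its own item; the tag is
        // what causes the product-scraping task to pick it up later.
        for (const product of products) {
            await createItem({
                id: `lcsc:product:${product.id}`,
                tags: [ "lcsc:product" ],
                data: { productID: product.id, name: product.name }
            });
        }
    }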

There are more features in srap, like deliberately invalidating past scraping results, item merges, and 'out of band' task result storage, but these are the basic concepts that make the whole thing work. As you can see, srap is highly flexible and unopinionated, and a scraper configuration is easy to maintain collaboratively - every task functions more or less independently.

The datasheet search frontend

If you've used the datasheet search, you've probably noticed that it's really fast - it almost feels like it's all local. But no, your search queries really are going to a server. So how can it be that fast?

It turns out to be surprisingly simple: by default, the search is a prefix search only. That means that it will only search for items that start with the query you entered. This is usually what you want when you search for part numbers, and it also has some very interesting performance implications - because a prefix search can be done entirely on an index!

There's actually very little magic here - the PostgreSQL database that runs behind the frontend simply has a (normalized) index on the part number column, and the server does a LIKE 'yourquery%' query against it. That's it! This generally yields a search result in under 2 milliseconds, i.e. nearly instantly. All it has to do is an index lookup, and those are fast.
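
For illustration, the whole trick fits in a handful of lines. This sketch uses the node-postgres (pg) library, with made-up table and column names:

    // Sketch of the prefix search, using the node-postgres (pg) library;
    // the table and column names are made up for illustration.
    const { Pool } = require("pg");
    const pool = new Pool();

    // The same normalization that was applied to the indexed column,
    // e.g. lowercasing and stripping separators (hypothetical).
    function normalize(partNumber) {
        return partNumber.toLowerCase().replace(/[^a-z0-9]/g, "");
    }

    async function searchPartNumbers(query) {
        // With a suitable index on normalized_part_number (text_pattern_ops or a
        // C collation), this LIKE 'prefix%' query is just an index lookup.
        const result = await pool.query(
            "SELECT * FROM products WHERE normalized_part_number LIKE $1 LIMIT 50",
            [ normalize(query) + "%" ]
        );

        return result.rows;
    }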

On the browser side, things aren't much more complicated. Every time the query changes, the frontend makes a new search request to the server, cancelling the old one if it's still in progress. When it gets results, it renders them on the screen. That's it. There are no trackers on the site, no weird custom input boxes, nothing else to slow it down. The result is a search that feels local :)
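
A minimal sketch of that cancel-and-refetch behaviour, using the browser's standard fetch and AbortController (the /api/search endpoint and response shape are made up for illustration):

    // Sketch of the browser side: on every query change, cancel the previous
    // in-flight request and fire off a new one. The /api/search endpoint and
    // response shape are made up for illustration.
    let currentController = null;

    async function onQueryChanged(query) {
        if (currentController != null) {
            currentController.abort(); // cancel the previous request, if any
        }

        currentController = new AbortController();

        try {
            const response = await fetch(`/api/search?q=${encodeURIComponent(query)}`, {
                signal: currentController.signal
            });

            renderResults(await response.json());
        } catch (error) {
            if (error.name !== "AbortError") {
                throw error; // only swallow cancellations
            }
        }
    }

    function renderResults(results) {
        /* hypothetical: update the DOM with the new results */
    }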

The source code

Right now, the source code for all of these things lives across three repositories:

At the time of writing, documentation is still pretty lacking across these repositories, and the code in the srap and UI repositories in particular is pretty rough! This will be improved upon quite soon, as SeekSeek becomes more polished.

Final words

Of course, there are many more details that I haven't covered in this post, but hopefully this gives you an idea of how SeekSeek is put together, and why!

Has this post made you interested in working on SeekSeek, or maybe your own custom srap-based project? Drop by in the chat! We'd be happy to give you pointers :)