On Thursday, the lead developer presented me with the solution they had chosen for the database behind the Showcase portal. Finding the right database or CMS (or building one from scratch) that will do what we want is a crucial decision. Search is hard, which is why most places leave it to Google or another major company that has mastered search algorithms.
What Dave and Patrick have found is crate.io and elastic.co. There were a lot of factors that went into their decision. Friday they built a test version with very basic data. It works for the simplest of our fields. Next week they will move on and add more complicated fields, ones that include synonyms. That's a big factor in making this work.
At this point, though, they have taken the first step toward a proof of concept. It's pretty exciting to see a working prototype, even if it has fake data and limited fields.
As for my part, I am reading through the documentation on crate and elastic. There are decisions to make:
- Do we want to expand or contract the synonym search? (See below.)
- Do we want to apply analyzers and filters at the point of indexing, at the point of search, or both?
- We need to finalize the controlled vocabulary and synonyms so the synonym file can be built.
- Do we need custom stop words or stems?
These are some of the questions we will need to address in the meeting scheduled for Monday morning.
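To make the index-time versus search-time question a little more concrete, here is a toy sketch in Python. It is not how crate or elastic work internally; the stop list, the sample documents, and every function name are made up for illustration. It only shows why the same analysis usually needs to run on both sides: if stop words are removed when indexing but not when searching, a query containing "the" can miss documents it should match.

```python
import re

# Hypothetical custom stop list; the real one is still to be decided.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def analyze(text, stop_words=STOP_WORDS):
    """Lowercase, split on non-letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


def build_index(docs, analyzer):
    """Map each analyzed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, analyzer):
    """Return ids of documents that contain every analyzed query token."""
    tokens = analyzer(query)
    if not tokens:
        return set()
    return set.intersection(*[index.get(t, set()) for t in tokens])


# Made-up sample data.
docs = {1: "The Physics of Sound", 2: "History of the Book"}
index = build_index(docs, analyze)

# Same analyzer at search time: "the" is dropped, "physics" matches doc 1.
same = search(index, "the physics", analyze)

# No stop-word removal at search time: "the" is not in the index, so nothing matches.
raw = search(index, "the physics", lambda q: q.lower().split())
```

In this sketch `same` contains document 1 and `raw` is empty, which is the kind of mismatch the "indexing, search, or both" decision is meant to avoid.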
It looks like I will help build the synonym file once we decide whether we are expanding or contracting. I have an Excel file started, but that isn't the format the working file needs to be in. I am not a programmer, but I am certainly capable of following the format to write the synonym file. My working on it should also speed up development, since Dave and Patrick can concentrate on the rest of the code rather than spending time on a tedious task.
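As a rough idea of how the spreadsheet could become a synonym file, here is a small Python sketch. It assumes the Excel sheet is exported as CSV with two columns named `term` and `synonym` (my assumption, not the actual layout), and it writes Solr-style mapping lines, which is one of the formats Elasticsearch's synonym filter accepts. Whether our setup ends up using exactly this format still needs to be checked against the documentation. The "chemistry" row is invented sample data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV export of the spreadsheet: one synonym per row.
SAMPLE_CSV = """term,synonym
science and mathematics,physics
science and mathematics,chemistry
"""


def read_synonyms(csv_text):
    """Group synonyms under their controlled-vocabulary term."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        term = row["term"].strip().lower()
        groups[term].append(row["synonym"].strip().lower())
    return groups


def synonym_lines(groups, expand):
    """Yield one Solr-style explicit mapping line per synonym."""
    for term, synonyms in groups.items():
        for syn in synonyms:
            # Expanding keeps the searched word and adds the tagged term;
            # contracting replaces the searched word with the tagged term.
            targets = [syn, term] if expand else [term]
            yield f"{syn} => {', '.join(targets)}"


groups = read_synonyms(SAMPLE_CSV)
contract_lines = list(synonym_lines(groups, expand=False))
expand_lines = list(synonym_lines(groups, expand=True))
```

With the sample data, the contracting line for physics is `physics => science and mathematics` and the expanding line is `physics => physics, science and mathematics`, so the same spreadsheet can feed either choice.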
Regarding expanding or contracting, it isn't as simple as it seems at first. Here's an example:
In our synonym file, "physics" is a synonym for our term "science and mathematics". Projects will be tagged with "science and mathematics", so if someone searches for "physics", we want the search to actually match "science and mathematics".
If the search contracts, which seems like the logical choice, a search for "physics" would contract and return only projects tagged with "science and mathematics".
If the search expands, a search for "physics" would expand and return results that match "physics" or "science and mathematics". Even though "science and mathematics" is the term that will be tagged, another field (title, description) could contain the word "physics", and we would probably want those projects to be returned first in the search results, since they are more likely to be physics projects ("science and mathematics" alone would also return biology, chemistry, etc.).
So, it seems like expanding is actually the way we want to go. The file is similar either way; it's the syntax that is different.
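The difference between the two options can be sketched in a few lines of Python. This is only a toy model with invented projects and a deliberately crude scoring rule (count how many query terms a project matches); a real search engine ranks results in a far more sophisticated way. The names `SYNONYMS`, `PROJECTS`, and the functions are all hypothetical.

```python
# Synonym -> controlled-vocabulary term, as in the example above.
SYNONYMS = {"physics": "science and mathematics"}

# Made-up projects; tags come from the controlled vocabulary.
PROJECTS = [
    {"title": "Tide pool biology survey", "tags": ["science and mathematics"]},
    {"title": "Pendulum physics lab", "tags": ["science and mathematics"]},
    {"title": "Mural design", "tags": ["art"]},
]


def query_terms(query, expand):
    """Turn the typed query into the terms actually searched for."""
    q = query.lower()
    term = SYNONYMS.get(q)
    if term is None:
        return [q]
    return [q, term] if expand else [term]


def score(project, terms):
    """Count how many query terms appear in the title or the tags."""
    title = project["title"].lower()
    return sum(1 for t in terms if t in title or t in project["tags"])


def search(query, expand):
    """Return matching titles, best score first (ties keep list order)."""
    terms = query_terms(query, expand)
    scored = [(score(p, terms), p["title"]) for p in PROJECTS]
    hits = [(s, title) for s, title in scored if s > 0]
    hits.sort(key=lambda h: -h[0])
    return [title for _, title in hits]


contracted = search("physics", expand=False)
expanded = search("physics", expand=True)
```

Both searches return the same two projects, but only the expanded one puts "Pendulum physics lab" first, because it matches the word "physics" in the title as well as the tag.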
Takeaway: There are a lot of decisions involved in the development side of this process. It's very interesting and useful to be included in some of those decisions and to see a little of how that side of things is done. That might be one of the more useful parts of this internship, actually.