Hi,
so ive been trying to figure this one out for quite some time here, and I’ve looked all over the internet… Seems simple enough, but I’ve been stuck for days.
Basically my problem is trying to figure out how to efficiently search based on both keywords and content when the keywords are normalized, and ideally count how many of my search keywords were found.
So I have a database of resumes. Each resume includes information about where the user is willing to work in the form of associations to entire states as well as associations to rectangular regions defined by 4 latitude and longitude numbers. The resume may have some associated text, like a description of the person, a desired job title, as well as the extracted text of the resume. Each resume can also be described by keywords, which are stored in a normalized keywords table.
The tables:
Resumes
-resume_id
-desired_job_title
-personal_statement
-extracted_text
States
-state_id
-state_name
-state_abbr
Resumes_states
-resume_id
-state_id
Resumes_regions
-resume_id
-north (these are all decimal numbers which represent a lat-lon box)
-south
-east
-west
Keywords
-keyword_id
-keyword
-rank (increases as people use it over time, used for suggestions and clouds)
Resumes_keywords
-resume_id
-keyword_id
Ideally, I want to be able to search based on a location, a set of keywords, and a minimum number of matching keywords. Searching based on a set of locations and a set of keywords is fairly easy, if I’m ok with showing all results which contain at least one keyword, and not ranking the results in any way. But I’d like to be able to order the results by the number of keywords that matched and then apply a minimum threshold, for example, "show me all resumes that are willing to work in DC and contain at least 2 of the keywords from {‘director’, ‘manager’, ‘computers’, ‘technology’}.
Less ideally, some sort of efficient query that will still order things appropriately would be acceptable, but the only solution I can think of for that would be using fulltext search on the text and on a concatenated list of the keywords.
Using 4 joins, I get a big table where each resume has LxK rows where L is the number of locations and K is the number of keywords. I’ve thought about searching this, but problems arise because of finding the search term in both the keywords AND in the text; Every row matches when the term is in the text (ie description or extracted text) and only a single row matches when its a keyword. I thought about using a UNION to only include the text data in one row, but then the row only counts once even if multiple keywords are present. I suspect the answer lies in using some addition to add these things up (which I cant figure out) but even then, problems also arise because of multiple locations (say I search for resumes in DC and in VA and the user has listed both!)
So I thought about first applying all my location and other meta-data query and then using a sub-query, but I know it’s bad practice and gets slow real fast. Maybe that doesnt matter because the query optimizer is smart and will actually only do text comparisons once? I hope?!
Any ideas out there? Is this even possible? Seems like lots of people need to search based on text content and on normalized keywords but I can’t find anything about it! Any help is MUCH appreciated!!! (Also probably important to mention, is that my tables are quite small, but I’d really like to figure this out and do it the right way so that things can grow. Also, using ajaxy searches, many queries are happening so its gotta be fast).
Thanks!