Complex Keyword + Content Searching Help

theozero · April 17, 2010, 4:15am

Hi,
so ive been trying to figure this one out for quite some time here, and I’ve looked all over the internet… Seems simple enough, but I’ve been stuck for days.
Basically my problem is trying to figure out how to efficiently search based on both keywords and content when the keywords are normalized, and ideally count how many of my search keywords were found.

So I have a database of resumes. Each resume includes information about where the user is willing to work in the form of associations to entire states as well as associations to rectangular regions defined by 4 latitude and longitude numbers. The resume may have some associated text, like a description of the person, a desired job title, as well as the extracted text of the resume. Each resume can also be described by keywords, which are stored in a normalized keywords table.

The tables:

Resumes
-resume_id
-desired_job_title
-personal_statement
-extracted_text

States
-state_id
-state_name
-state_abbr

Resumes_states
-resume_id
-state_id

Resumes_regions
-resume_id
-north (these are all decimal numbers which represent a lat-lon box)
-south
-east
-west

Keywords
-keyword_id
-keyword
-rank (increases as people use it over time, used for suggestions and clouds)

Resumes_keywords
-resume_id
-keyword_id

Ideally, I want to be able to search based on a location, a set of keywords, and a minimum number of matching keywords. Searching based on a set of locations and a set of keywords is fairly easy, if I’m ok with showing all results which contain at least one keyword, and not ranking the results in any way. But I’d like to be able to order the results by the number of keywords that matched and then apply a minimum threshold, for example, "show me all resumes that are willing to work in DC and contain at least 2 of the keywords from {‘director’, ‘manager’, ‘computers’, ‘technology’}.

Less ideally, some sort of efficient query that will still order things appropriately would be acceptable, but the only solution I can think of for that would be using fulltext search on the text and on a concatenated list of the keywords.

Using 4 joins, I get a big table where each resume has LxK rows where L is the number of locations and K is the number of keywords. I’ve thought about searching this, but problems arise because of finding the search term in both the keywords AND in the text; Every row matches when the term is in the text (ie description or extracted text) and only a single row matches when its a keyword. I thought about using a UNION to only include the text data in one row, but then the row only counts once even if multiple keywords are present. I suspect the answer lies in using some addition to add these things up (which I cant figure out) but even then, problems also arise because of multiple locations (say I search for resumes in DC and in VA and the user has listed both!)

So I thought about first applying all my location and other meta-data query and then using a sub-query, but I know it’s bad practice and gets slow real fast. Maybe that doesnt matter because the query optimizer is smart and will actually only do text comparisons once? I hope?!

Any ideas out there? Is this even possible? Seems like lots of people need to search based on text content and on normalized keywords but I can’t find anything about it! Any help is MUCH appreciated!!! (Also probably important to mention, is that my tables are quite small, but I’d really like to figure this out and do it the right way so that things can grow. Also, using ajaxy searches, many queries are happening so its gotta be fast).

Thanks!

r937 · April 17, 2010, 4:46am

"show me all resumes that are willing to work in DC and contain at least 2 of the keywords from {‘director’, ‘manager’, ‘computers’, ‘technology’}.

SELECT t.columns
     , k.how_many_keywords
  FROM Resumes AS t
INNER
  JOIN ( SELECT resume_id
              , COUNT(*) AS how_many_keywords
           FROM Resumes_keywords
          WHERE keyword_id IN 
                ( SELECT keyword_id
                    FROM Keywords
                   WHERE keyword IN
                         ( 'director'
                         , 'manager'
                         , 'computers'
                         , 'technology' ) )
         GROUP
             BY resume_id
         HAVING COUNT(*) >= 2 ) AS k
    ON k.resume_id = t.resume_id
 WHERE "willing to work in DC" -- you do this part

and of course if you want to know which keywords, you’ll have to join back to the Resumes_keywords and Keywords tables to get them

theozero · April 17, 2010, 5:13am

Hi Rudy,
Thanks so much for the quick reply. I was actually posting here hoping that you would respond (saw some of your other posts and you really know what you’re talking about). A few issues with your suggestion…
1- Ideally I would like to use LIKE or LOCATION to include substring matches as well (‘computer’ would also match ‘computers’, etc). If this is too big of a performance hit though, I’m willing to sacrifice it, and maybe do some suffix removal before searching… your thoughts?
2- While your solution works for checking the keywords table and requiring a minimum number of matches (and is probably way more efficient than the joins I was thinking about), it doesn’t solve the real problem; I need to search the text information as well, ie personal_statement and extracted_text. It gets more complicated to avoid counting them twice if they were found as a keyword AND in there…

r937 · April 17, 2010, 7:06am

not really sure how to respond to those additional nuances

perhaps you should declare your keywords in advance? or filter them somehow? i’m not sure how i would handle the “computer” versus “computers” keyword issue other than not allowing bofadem

to search the personal information, i would just use additional LIKE conditions in the WHERE clause, alongside whatever it is you’re doing with the regions

theozero · April 17, 2010, 6:14pm

One idea comes to mind of creating 2 binary variables (columns) for each search term, one for if its found as a keyword and one if its found in the text, then adding them up using an OR to avoid double counting. But I’m worried that doing it this way (if its even possible) would end up searching redundant text because of all the joins.

but if this all sounds silly, what would you recommend?
should I just give up on normalization, concatenate the keywords, and use full-text search?

r937 · April 17, 2010, 6:50pm

normalization should never be abandoned, full stop

maybe you’re over-thinking this

theozero · April 17, 2010, 8:23pm

sorry, shouldnt have said “give up on normalization” but rather “denormalize”

I think for now, Im going to just do a bunch of joins and look for anything that matches any of the search terms. I’m just stumped and am tired of being stuck for so long on this…

Searching normalized keywords AND full text content at the same time seems like a common enough situation. I really wonder how others are doing it.