Storing an Index in SQL Server

I’m not too sure if this is the correct forum to post this in, as if a database isn’t the best solution then it becomes a programming question, nevertheless…

I have developed an Inverted Index script using C# to create a structure of Dictionary<string, List<Word>> where the struct Word contains the row ID of the table and its whereabouts within the entry. Within my database I have a table consisting roughly of the following. To make the question easier to follow I’ll claim to have the following within my table taken from this Wikipedia entry:


ID        URL            CONTENT
------------------------------------------------------------
1         a.com          it is what it is
2         b.com          what is it
3         c.com          it is a banana

So, my inverted index will contain the following:


"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)} 
"what":   {(0, 2), (1, 0)}

All I want to know is how I can store the above within my SQL Server database and whether it is wise to do so. I am writing an information-retrieval system that is queried to return a list of the most relevant rows within the table and whilst I would like to utilise SQL Server for this I have opted not to choose SQL Server Full Text Search because it is not flexible enough to accommodate the extensions that will be required once this is finished.

Can a .NET/SQL Server guru help me out?

yes, it is wise to store this into a database table

CREATE TABLE wordindex
( word VARCHAR(37) NOT NULL 
, urlid INTEGER NOT NULL
, pos SMALLINT NOT NULL
, PRIMARY KEY ( word,urlid,pos )
);
INSERT INTO wordindex VALUES
 ('a',3,3)
,('banana',3,4)
,('is',1,2),('is',1,5),('is',2,2),('is',3,2)
,('it',1,1),('it',1,4),('it',2,3),('it',3,1)
,('what',1,3),('what',2,1)

I originally thought of trying this, but when I realised the scope of what I want this system to do (handle around 100-1000 separate pages of pure text at around 300-500 words per page) the size of said index would be huge. I’ve never handled a table larger than a few hundred entries and I can easily see a table like this reaching into the thousands.

Additionally, to carry on with the examples on the Wikipedia page, if I were to search for “what is it” what query would be required to get a list of the relevant examples ordered by relevancy?

how do you define relevancy?

The document(s) that best match the search term. As stated in the Wikipedia entry if the user were to “search” for the term “what is it” the set:

{0,1} &#8745; {0,1,2} &#8745; {0,1,2} = {0,1}

Is returned, meaning that the terms exist in all current rows. As the term occurs consecutively in row 1 that would be the closest-matching row and would appear first. If I were to search for “what is it” I would like to see the following returned:


ID        URL            CONTENT
------------------------------------------------------------
2         b.com          what is it
1         a.com          it is what it is
3         c.com          it is a banana

“as the term occurs consecutively” is gonna be really hard to do with sql

here’s a query which searches for three words and returns urlids sequenced by how many of those words are found in the content column –

SELECT urlid
     , COUNT(*) AS hits
  FROM wordindex
 WHERE word IN ('what','is','it')
GROUP
    BY urlid
ORDER 
    BY COUNT(*) DESC

maybe you can do the relevancy from there in your application logic

I guess when the SQL expert on SitePoint says that something is really hard it’s gonna be really hard…

I was considering using code like this to do what you’ve said, and then moving the results into a DataTable and manipulating them from there, although as I am not an expert with C#/.NET it’ll be a struggle.

If I read this correctly, this will rank the rows based on how many of the words are found within the text, right? Wit not be possible to write some kind of stored procedure to do multiple checks, first for “what”, “is” and “it” (as individual hits), then for “what is” or “is it” (as double hits) and finally for “what is it” as the main hit, then to rank the overall results by all three? To be honest this is the first time I’ve ever had to do such a thing with SQL so it’s all a bit confusing.

yes, you could do it the way you suggest, first testing for individual words, then for two, then for three, and assign relevancy that way

however, you wouldn’t need your inverted index table for that, you’d use LIKE on the original content column, and the query would be really slow

That’s true, I guess I won’t be getting a true database solution for this outside of SQL Server’s Full Text Search, although storing the inverted index within the database and calling out the referenced rows will probably speed things up. Thanks for the help!