Most commonly used keywords in a single field

Hi guys

I’m really hoping that this is at all possible.

Basically, I have a vehicle database with over 150,000 records. I have a single field called “description” which has approx. 200 words in it. An example of a description can be seen here:


Insurance Group 15, 2 Year Unlimited Mileage Ford Direct Warranty, 2 Year Roadside Assistance Package, Appearance Pack, DAB Radio/CD & Ford SYNC, One Owner, Full Service History, Quickclear Heated Windscreen, 17" Alloy Wheels, Privacy Glass, Remote Central Locking With Retractable Key, Electric Door Mirrors, Five Star Euro N-Cap Rating, Rear Parking Sensors, Power Assisted Steering (PAS), Anti Lock Braking System(ABS), Rear Child locks, Ford Easyfuel, Immobiliser, Dual Airbags, Front & Side Airbags, Isofix Brackets, Side Impact Protection, Adjustable Door Mirrors, Six Speed Gearbox, Air Conditioning, Adjustable Steering Column, Electric Front Windows, Height Adjustable Drivers seat, Multi Function Display, Outside Temperature Indicator, Rev Counter, Split Folding Rear Seats, Trip Computer, iPod/MP3 Connectivity, USB Functionality, Body Colour Door Handles, Body Coloured Rear Spoiler, Front Fog Lights, Metallic Paint, Rear Wash Wipe, Part Exchange Taken as Full Deposit, Photographs Of Actual Car, Physical Car Available Viewing Today

What I need to do is search the entire SQL database table and compile a list of the most commonly used words or phrases (the items that appear between the commas). So, for example, if the word “Immobiliser” is used in more than 10 records, then I need to display it. If “Rev Counter” is used in more than 10 records then it would be displayed too. However, if “Outside Temperature Indicator” only appears in two separate records then it would not appear in my list.

In short, I need to create a list of the most commonly used words and phrases (between commas) found in the description fields within my table.

I have no idea where to start. What is the easiest way to achieve this?

Any help would be greatly appreciated.

Best regards

Rod from the UK

I think doing it in SQL alone would be quite complicated, so a good idea would be to get some help from a scripting language like PHP (let’s assume you are using PHP). The question is how quickly you need this kind of report to be generated. If you have 150,000 records with text fields like that, then searching them for anything will take some time, since you can’t really take advantage of any indexes. I would suggest creating another table to hold the results of the report; then, when it’s needed, you can display it instantly, and a cron job can update the table every 24 hours (or whatever interval suits you).

So the other table would look like this:


CREATE TABLE `keywords_report` (
	`keyword` VARCHAR(100) NOT NULL,
	`count_no` MEDIUMINT(9) NOT NULL,
	PRIMARY KEY (`keyword`)
)
ENGINE=InnoDB;

In PHP you would then loop over all your vehicle records and split each description into separate keywords on the commas. Then you add each keyword to your report table like this:


INSERT INTO keywords_report (keyword, count_no) VALUES ('Immobiliser', 1)
ON DUPLICATE KEY UPDATE count_no=count_no+1;

So if a keyword doesn’t exist yet, it is added with a count of 1; if it already exists, the counter is increased by 1.

You can expect this process to run for a while (maybe around a minute or more), and don’t forget to wrap the loop in a transaction, otherwise the individual inserts can be very slow! Once it’s done, getting the data for display is as simple as:


SELECT keyword, count_no
  FROM keywords_report
  WHERE count_no > 10
  ORDER BY count_no DESC;
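
By the way, in MySQL wrapping the loop in a transaction simply means bracketing all of the inserts from the loop, roughly like this:

START TRANSACTION;
-- ... all the INSERT ... ON DUPLICATE KEY UPDATE statements from the loop run here ...
COMMIT;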

Depending on the size of the report table you may want to add an index on count_no as well to make the SELECT faster.
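
For example, something like this (the index name here is just an example):

CREATE INDEX idx_count_no ON keywords_report (count_no);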

Hi Lemon Juice

Thank you so much for your response to my thread.

Unfortunately, my entire website is written in classic ASP and I have no knowledge of PHP.

If it helps, I only need a snapshot of the most commonly used words/phrases, so it’s not as though I need to keep running scripts/jobs to get this data on a frequent basis. Therefore, your idea of creating an additional table and inserting the most common words/phrases into it seems the best approach.

I do have access to our database, so I am able to use a stored procedure if this makes things easier. However, I have no idea where to start regarding the script that needs to be executed within the stored procedure.

Again, any help would be greatly appreciated.

Best regards

Rod from the UK

If you are using ASP it makes no difference; the idea is still the same. Also, you didn’t specify what kind of database you are using, so I assumed MySQL.

Yes, you could do it in a stored procedure, but you’d have to use SQL, and SQL is not very friendly for the task of splitting strings into segments (at least not MySQL; I don’t know about other databases). I suggested a scripting language because there you have language constructs that let you easily split a string, put the result into an array and loop over it. ASP should do the job fine.

Hi Lemon Juice

Thanks so much for your reply.

How would I achieve this in ASP? Unfortunately, I have no idea where to start.

PS: I am using Windows Server 2008 R2 Standard and SQL Server 2008 R2

I look forward to hearing from you

Best regards

Rod from the UK

Unfortunately, I’m a PHP guy and I don’t know ASP :confused:

If you’ve got 2008, you should be able to use this method to split the content: http://ole.michelsen.dk/blog/split-string-to-table-using-transact-sql/
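
I haven’t copied the article’s code here, but a split function along these general lines does the same job (treat it as an untested sketch; the article’s exact code will differ) and returns the id and data columns used below:

CREATE FUNCTION dbo.Split (@String NVARCHAR(MAX), @Delimiter NCHAR(1))
RETURNS @Result TABLE (id INT IDENTITY(1,1), data NVARCHAR(200))
AS
BEGIN
    DECLARE @pos INT, @piece NVARCHAR(200);

    -- make sure the string ends with a delimiter so the last piece is captured
    IF RIGHT(@String, 1) <> @Delimiter
        SET @String = @String + @Delimiter;

    SET @pos = CHARINDEX(@Delimiter, @String);
    WHILE @pos > 0
    BEGIN
        -- take everything up to the next delimiter, trimmed
        SET @piece = LTRIM(RTRIM(LEFT(@String, @pos - 1)));
        IF LEN(@piece) > 0
            INSERT INTO @Result (data) VALUES (@piece);

        -- drop the piece we just handled and look for the next delimiter
        SET @String = SUBSTRING(@String, @pos + 1, LEN(@String));
        SET @pos = CHARINDEX(@Delimiter, @String);
    END

    RETURN;
END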

So pseudo-code for your stored proc would be (a fuller sketch follows after the list):

  1. Open a cursor which brings in the description on each record
  2. Run the cursor, passing each description to the split function and inserting the resulting recordset into a temp table
     2.1. INSERT INTO #temptable (data) SELECT data FROM dbo.Split(@inputString, ',')
  3. Once the cursor is done, do a SELECT with GROUP BY on the temp table
     3.1. SELECT data, COUNT(*) FROM #temptable GROUP BY data ORDER BY COUNT(*) DESC
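
Putting the pieces together, a rough and untested sketch of the whole proc might look like this. The vehicles table and description column names are just placeholders for your actual schema, and I’ve added a HAVING clause to cover your “more than 10 records” requirement:

CREATE PROCEDURE dbo.GetCommonKeywords
AS
BEGIN
    SET NOCOUNT ON;

    -- temp table that collects every keyword from every description
    CREATE TABLE #keywords (data NVARCHAR(200));

    DECLARE @description NVARCHAR(MAX);

    -- 1. cursor that brings in the description from each record
    --    ("vehicles" / "description" are placeholders for your table/column)
    DECLARE desc_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT description FROM dbo.vehicles;

    OPEN desc_cursor;
    FETCH NEXT FROM desc_cursor INTO @description;

    -- 2. split each description and store the pieces in the temp table
    WHILE @@FETCH_STATUS = 0
    BEGIN
        INSERT INTO #keywords (data)
        SELECT data FROM dbo.Split(@description, ',');

        FETCH NEXT FROM desc_cursor INTO @description;
    END

    CLOSE desc_cursor;
    DEALLOCATE desc_cursor;

    -- 3. keywords that appear in more than 10 records, most common first
    SELECT data, COUNT(*) AS count_no
    FROM #keywords
    GROUP BY data
    HAVING COUNT(*) > 10
    ORDER BY COUNT(*) DESC;
END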

You can do something similar in ASP (the staging table and final report query are sketched after this list), but it would most likely be a little slower:

  1. Open a recordset which contains just the description from each record
  2. Loop through the recordset (load it into an array first using GetRows for faster processing)
     2.1. Split the field using Split - arrDescriptions = Split(rs("description"), ",")
  3. Loop through the array from 2.1, inserting each keyword into a temp table. Return to step 2 until you are done with the recordset
  4. Do a SELECT with GROUP BY on the temp table
     4.1. SELECT data, COUNT(*) FROM temptable GROUP BY data ORDER BY COUNT(*) DESC
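
On the database side of the ASP version, an ordinary staging table is probably the simplest thing to insert into from the ASP loop (all of the names below are just examples), and the final query can apply your “more than 10 records” rule directly:

-- staging table the ASP loop inserts into (name and column size are just examples)
CREATE TABLE keyword_staging (data VARCHAR(200) NOT NULL);

-- each pass of the ASP loop would run something like:
-- INSERT INTO keyword_staging (data) VALUES ('Immobiliser');

-- once the loop has finished, the report itself:
SELECT data, COUNT(*) AS count_no
FROM keyword_staging
GROUP BY data
HAVING COUNT(*) > 10
ORDER BY COUNT(*) DESC;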

Hi Dave

Thanks so much for your response.

Unfortunately, my database is SQL Server 2000. Will your 2nd solution (ASP) work for me, or would I still need 2008? If you could let me know, that would be great.

I look forward to hearing from you

Best regards

Rod from the UK

Ugh! Really? If it’s at all within your power, that’s something you should really consider upgrading. The advances since then are astonishing.

The classic ASP approach (the 2nd one) will work regardless of DBMS platform/version. It’s the Split function (which is what that link points to) that won’t work for you in 2000.

Hi Dave

Thanks so much for your response.

I think you’re right; it makes sense to upgrade the database, so I have instructed our DBA to sort this out for us.

Thanks so much for all your help on this

Best regards

Rod from the UK