Best table type/solution for > +10M rows a day

scoutz · June 26, 2010, 11:25am

I’m working on creating my own adserver and came across a “problem”.

I’ve created a table that logs the raw impressions served (more than 10 million rows a day added). I’m using an innodb table to prevent locking when running queries on this table.

Running simple queries is taking way too long tho (> 5 minutes).

I’m not sure if innodb has a hard limit to the amount of rows but in the end i’ll run into problems anyways as the amount of rows increase over time.

What would be best to keep performance and scalability up (one that would work with +100 million rows a day for example)? I was thinking of creating a table for each day.

I’m also up to looking for something else besides mysql as long as the database supports php.

Table structure:

date int 10
ip int 4 unsigned
frequency int 4
size int 2
userid int 5

guelphdad · July 4, 2010, 12:58pm

I’m curious, how does the FROM_UNIXTIME in both the select and group by clause affect the speed of the query, would it be faster if a timestamp or datetime column was used instead? Anyone know the expected performance hit?

scoutz · July 4, 2010, 1:14pm

I’m already doing hourly updates via cron instead of real-time. With the correct indexes the queries are faster but still take a few minutes to complete.

I currently don’t need the raw impressions data, but I might need them in the future for other statistics. I rather just keep the raw data for possible future usage.

The raw data table currently has 97803714 rows and is about 16GB… I think in the future I could dump data from old months to a different table/db with the archive table type.

TimIgoe · July 3, 2010, 10:43pm

Do you need to keep the raw impressions or would it be worth writing a (hourly?) cron to ‘compress’ them down to more useful statistics - that way your reporting happens on the pre-processed data rather than on the raw. (OpenX, the largest open source banner solution does this method).

Keeping raw logs could generate huge amounts of data.

r937 · June 26, 2010, 1:42pm

for that particular query, consider either

KEY user_date_ix ( userid, date )

or more likely

KEY date_user_ix ( date, userid )

also, you cannot have networkid in your ORDER BY clause if it isn’t in the SELECT clause (and likely also in the GROUP BY clause)

scoutz · June 26, 2010, 1:38pm

CREATE TABLE impressionsraw (
userid int(5) NOT NULL,
networkid int(4) NOT NULL,
ip int(4) unsigned NOT NULL,
geo varchar(2) NOT NULL,
freq int(4) NOT NULL,
size int(6) NOT NULL,
date int(10) NOT NULL,
KEY ip (ip),
KEY networkid (networkid),
KEY userid (userid),
KEY date (date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

example query:
SELECT count(0) as impressions, FROM_UNIXTIME(date, ‘%h’) as hour
FROM impressionsraw
WHERE userid = 1 AND date > (time()-43200)
GROUP BY FROM_UNIXTIME(i.date, ‘%h’), networkid
ORDER BY date, networkid;

table currently has about 700M rows.

r937 · June 26, 2010, 1:24pm

could you do a SHOW CREATE TABLE please

if running simple queries takes minutes, i’m guessing you don’t have the proper indexes defined

a table for each day is a very poor solution, especially if you hope to track any kind of activity across more than one day

scoutz · June 26, 2010, 6:46pm

ah sorry yeah missed that it is actually also getting the network id in the select clause.

I’m creating the index now but I think it will take a few hours with about 700M rows in the table right now.

How will innodb handle a table with for example 7B rows though?

I’ve also been looking at the archive table type which seems ok as well, slightly slower though (I won’t need to edit or remove any rows.).