I’m working on creating my own adserver and came across a “problem”.
I’ve created a table that logs the raw impressions served (more than 10 million rows a day added). I’m using an innodb table to prevent locking when running queries on this table.
Running simple queries is taking way too long tho (> 5 minutes).
I’m not sure if innodb has a hard limit to the amount of rows but in the end i’ll run into problems anyways as the amount of rows increase over time.
What would be best to keep performance and scalability up (one that would work with +100 million rows a day for example)? I was thinking of creating a table for each day.
I’m also up to looking for something else besides mysql as long as the database supports php.
Table structure:
date int 10
ip int 4 unsigned
frequency int 4
size int 2
userid int 5
I’m curious, how does the FROM_UNIXTIME in both the select and group by clause affect the speed of the query, would it be faster if a timestamp or datetime column was used instead? Anyone know the expected performance hit?
I’m already doing hourly updates via cron instead of real-time. With the correct indexes the queries are faster but still take a few minutes to complete.
I currently don’t need the raw impressions data, but I might need them in the future for other statistics. I rather just keep the raw data for possible future usage.
The raw data table currently has 97803714 rows and is about 16GB… I think in the future I could dump data from old months to a different table/db with the archive table type.
Do you need to keep the raw impressions or would it be worth writing a (hourly?) cron to ‘compress’ them down to more useful statistics - that way your reporting happens on the pre-processed data rather than on the raw. (OpenX, the largest open source banner solution does this method).
Keeping raw logs could generate huge amounts of data.
CREATE TABLE impressionsraw ( userid int(5) NOT NULL, networkid int(4) NOT NULL, ip int(4) unsigned NOT NULL, geo varchar(2) NOT NULL, freq int(4) NOT NULL, size int(6) NOT NULL, date int(10) NOT NULL,
KEY ip (ip),
KEY networkid (networkid),
KEY userid (userid),
KEY date (date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
example query:
SELECT count(0) as impressions, FROM_UNIXTIME(date, ‘%h’) as hour
FROM impressionsraw
WHERE userid = 1 AND date > (time()-43200)
GROUP BY FROM_UNIXTIME(i.date, ‘%h’), networkid
ORDER BY date, networkid;