How to select first and last created in given time frame

Hi,

I have a table which collects vehicle information (fms1). Most of these things count up, thus to get the ‘amount’ added in a period I need to subtract the last and the first message in that period.
These messages belong to a unit (gps_unit_id) and units belong to a relation (relation_id)

I have this query:

SELECT latest.gps_unit_id, latest.odometer - earliest.odometer AS diff_odometer, latest.total_fuel - earliest.total_fuel AS diff_total_fuel, latest.engine_hours - earliest.engine_hours AS diff_engine_hours, TIMEDIFF( latest.created, earliest.created ) AS diff_created
FROM (

SELECT gps_unit_id, MIN( created ) AS min_created, MAX( created ) AS max_created
FROM fms1
WHERE created
BETWEEN '2012-04-01'
AND '2012-04-30'
AND gps_unit_id
IN ( 76, 81, 133 )
GROUP BY gps_unit_id
) AS minmax
INNER JOIN fms1 AS earliest ON earliest.gps_unit_id = minmax.gps_unit_id
AND earliest.created = minmax.min_created
INNER JOIN fms1 AS latest ON latest.gps_unit_id = minmax.gps_unit_id
AND latest.created = minmax.max_created
WHERE minmax.gps_unit_id
IN (

SELECT id
FROM gps_units
WHERE relation_id =16
AND id
IN ( 76, 81, 133 )
)

If I change to a bigger time interval, let’s say 2 months, or a year, the query gets very slow since every message is viewed for the min() and max(). The problem lies in the subquery:


SELECT gps_unit_id, MIN( created ) AS min_created, MAX( created ) AS max_created
FROM fms1
WHERE created
BETWEEN '2012-04-01'
AND '2012-04-30'
AND gps_unit_id
IN ( 76, 81, 133 )
GROUP BY gps_unit_id

The explain command says:

id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 3 Using where
1 PRIMARY earliest ref created_gps_unit_id_idx,gps_unit_id created_gps_unit_id_idx 12 minmax.min_created,minmax.gps_unit_id 1
1 PRIMARY latest ref created_gps_unit_id_idx,gps_unit_id created_gps_unit_id_idx 12 minmax.max_created,freetrack.earliest.gps_unit_id 1 Using where
3 DEPENDENT SUBQUERY gps_units unique_subquery PRIMARY,relation_id PRIMARY 4 func 1 Using where
2 DERIVED fms1 range created_gps_unit_id_idx,gps_unit_id created_gps_unit_id_idx 12 NULL 224517 Using where; Using index; Using temporary; Using filesort

the last reveals that 224517 rows are evaluated for the subquery.
besides the query not being ‘scalable’ this seems like a lott of work for something rather simple.

We can assume that messages with a higher id also are also later in time. Thus you can select the first message for a given unit next to minimum or previous to the maximum date.

I just can’t think of a way to put that in query…

please do a SHOW CREATE TABLE for the fms1 table, so that we can see which indexes it has

please do an EXPLAIN for that subquery on its own


EXPLAIN SELECT gps_unit_id, MIN( created ) AS min_created, MAX( created ) AS max_created
FROM fms1
WHERE created
BETWEEN '2012-04-01'
AND '2012-04-30'
AND gps_unit_id
IN ( 76, 81, 133 )
GROUP BY gps_unit_id 

id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	fms1 	range 	created_gps_unit_id_idx,gps_unit_id 	created_gps_unit_id_idx 	12 	NULL 	224517 	Using where; Using index; Using temporary; Using filesort

SHOW CREATE TABLE fms1

CREATE TABLE `fms1` (
 `id` int(10) unsigned NOT NULL auto_increment,
 `gps_unit_id` int(10) unsigned NOT NULL,
 `odometer` double default NULL,
 `total_fuel` float default NULL,
 `engine_hours` float default NULL,
 `actual_speed` float default NULL,
 `actual_engine_speed` float default NULL,
 `actual_engine_torque` int(11) default NULL,
 `kickdown_switch` tinyint(1) default NULL,
 `accelerator_position` float default NULL,
 `brake_switch` tinyint(1) default NULL,
 `clutch_switch` tinyint(1) default NULL,
 `cruise_active` tinyint(1) default NULL,
 `pto_active` tinyint(1) default NULL,
 `fuel_level` float default NULL,
 `engine_temperature` int(11) default NULL,
 `turbo_pressure` float default NULL,
 `axle_weight_0` float default NULL,
 `axle_weight_1` float default NULL,
 `axle_weight_2` float default NULL,
 `axle_weight_3` float default NULL,
 `service_distance` int(11) default NULL,
 `created` datetime NOT NULL,
 PRIMARY KEY  (`id`),
 KEY `created_gps_unit_id_idx` (`created`,`gps_unit_id`),
 KEY `gps_unit_id` (`gps_unit_id`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=2673414 DEFAULT CHARSET=latin1

in your EXPLAIN, “Using index;” is very good, but “Using temporary; Using filesort;” is very bad

i’m going to guess that if you had an index on (gps_unit_id,created), you would see a massive improvement in that query

by the way, your condition of –

WHERE created BETWEEN '2012-04-01' AND '2012-04-30'

overlooks an entire day’s worth of data in april :slight_smile:

Hey Thanks,

that’s a real speed boost! we’re back in tens of milliseconds.

I do have a rather weird problem. For some units it returns double results, even some slightly different ones. If i run the subqueries, I get just 3 results for the 3 selected id’s:


SELECT gps_unit_id, MIN( created ) AS min_created, MAX( created ) AS max_created
FROM fms1
WHERE created
BETWEEN '2012-05-01'
AND '2012-05-30'
AND gps_unit_id
IN ( 90, 91, 92 )
GROUP BY gps_unit_id

// returns
// min_created and max_created for 90, 91 and 92


SELECT id
FROM gps_units
WHERE relation_id =20
AND id
IN ( 90, 91, 92 )
LIMIT 0 , 30

// returns 90, 91, 92

but, the whole query:


SELECT latest.gps_unit_id, latest.odometer - earliest.odometer AS diff_odometer, latest.total_fuel - earliest.total_fuel AS diff_total_fuel, latest.engine_hours - earliest.engine_hours AS diff_engine_hours, TIMEDIFF( latest.created, earliest.created ) AS diff_created
FROM (

SELECT gps_unit_id, MIN( created ) AS min_created, MAX( created ) AS max_created
FROM fms1
WHERE created
BETWEEN '2012-05-01'
AND '2012-05-30'
AND gps_unit_id
IN ( 90, 91, 92 )
GROUP BY gps_unit_id
) AS minmax
INNER JOIN fms1 AS earliest ON earliest.gps_unit_id = minmax.gps_unit_id
AND earliest.created = minmax.min_created
INNER JOIN fms1 AS latest ON latest.gps_unit_id = minmax.gps_unit_id
AND latest.created = minmax.max_created
WHERE minmax.gps_unit_id
IN (

SELECT id
FROM gps_units
WHERE relation_id =20
AND id
IN ( 90, 91, 92 )
)

returns:


Rows: 7
gps_unit_id 	diff_odometer 	diff_total_fuel 	diff_engine_hours 	diff_created
90 	7825.73000000001 	3607 	193.2001953125 	685:21:57
91 	5513.67499999999 	2805 	154.7001953125 	681:05:40
92 	9569.245 	4043.5 	214.75 	682:44:35
92 	9569.19 	4043 	214.75 	682:44:35
92 	9569.175 	4043 	214.75 	682:44:35
92 	9569.175 	4043 	214.75 	682:44:35
92 	9569.17 	4043 	214.75 	682:44:35

The explain (notice the difference in rows evaluated, just 1 instead of more then 200000):


Rows: 5
id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	PRIMARY 	<derived2> 	ALL 	NULL 	NULL 	NULL 	NULL 	3 	Using where
1 	PRIMARY 	earliest 	ref 	created_gps_unit_id_idx,gps_unit_id,gps_unit_id_cr... 	gps_unit_id_created_idx 	12 	minmax.gps_unit_id,minmax.min_created 	1 	 
1 	PRIMARY 	latest 	ref 	created_gps_unit_id_idx,gps_unit_id,gps_unit_id_cr... 	gps_unit_id_created_idx 	12 	freetrack.earliest.gps_unit_id,minmax.max_created 	1 	Using where
3 	DEPENDENT SUBQUERY 	gps_units 	unique_subquery 	PRIMARY,relation_id 	PRIMARY 	4 	func 	1 	Using where
2 	DERIVED 	fms1 	range 	created_gps_unit_id_idx,gps_unit_id,gps_unit_id_cr... 	gps_unit_id_created_idx 	12 	NULL 	1 	Using where; Using index for group-by

this also happend before I added the index, but I just noticed it since I was playing around with date between period and different units. Any ideas why?

it’s gotta be your data

yeah, you’re right. I added “latest.created, earliest.created” to the fieldlist and offcourse noticed these are the same for the double rows. Searching on these times in de fms1 table I got 5 results for one created time.

Is it possible to fix this in the query?By getting the lowest id for the startdate and the highest id for the stopdate?

These messages are transmitted to my server if a unit has a GPRS connection and has the engine running. They are send every 30seconds. If there is no GPRS connection, the messages are stored until connection is made. Then they are transmitted in buld. They have no timestamp, so you don’t really know to what actual time they belong.

um…

so using the id is a viable solution to not knowing when they were made?

yes, they are for sure added in chronological order and if multiple messages are added to the database in one time-instance (that is, the same second) then they all made before that time!
So if there are multiple results for the stop time, we need to select the one with the highest ID
and if there are multiple resutls for the start time, we need to select the one with the highest ID, since that would be the one closest to the actual time, all the other messages were generated before that and most likely do not belong to the selected period. Offcourse I know this is not true for all cases, but I guess it’s the best choice and the easiest way to implement.

Any ideas how to incorperate this into the query? Many thanks.

sorry, no