SELECT bookmarks.uri
, COUNT(DISTINCT bookmarks.id) AS dupes
, GROUP_CONCAT(userbookmarks.user_id) AS users
FROM bookmarks
LEFT OUTER
JOIN userbookmarks
ON userbookmarks.bookmark_id = bookmarks.id
GROUP
BY bookmarks.uri
the count of distinct ids for each bookmark should always be 1… i just threw that in there in case you might have duplicates (which you can avoid by defining a UNIQUE index on the uri, and then you won’t need this count)
I got an error relating to a non unique alias for the links table, which I removed from the FROM. But then I got another error:
Unknown column 'bookmarks.id' in 'on clause'
Here’s the sequel statement in full:
SELECT bookmarks.id, COUNT(DISTINCT bookmarks.id) AS duplicates, GROUP_CONCAT(links.user_id) AS users, links.user_id, bookmarks.url, bookmarks.title, SUBSTR(bookmarks.snippet, 1, 100) as snippet, GROUP_CONCAT(tags.tag) AS tags, bookmarks.datetime, links.status FROM bookmarks, tags LEFT OUTER JOIN links ON (links.bookmark_id = bookmarks.id) WHERE (links.status = 'public') AND (links.bookmark_id = bookmarks.id) GROUP BY bookmarks.url ORDER BY bookmarks.datetime DESC LIMIT 0, 9
Needless to say, there most certainly is a column called ‘id’ for the bookmarks table.
You’ll perhaps notice that there’s a duplicate (links.bookmark_id = bookmarks.id) in the statement; that’s a hold over from the previous statement. Even if I remove it, I still get the error.
okay, we’ll have to deconstruct your latest query to focus on what you’re really after
what i gave you in post #3 was based on your original requirements, but your latest query adds all kinds of extra stuff, much of which is quite incompatible with the original GROUP BY clause
let me give you an analogy to illustrate the conceptual problem you’re having
suppose you have a school consisting of multiple classrooms, where each classroom has multiple students
now i will ask you to write a query which returns the number of students in each classroom, along with the student’s last name
your first and immediate response should be “which student?”
it is the same issue – when you have a GROUP BY clause, every column in the SELECT clause must either be mentioned in the GROUP BY clause or be the argument of an aggregate function, like COUNT() or MAX()
Why would I be asking you which student? The whole purpose of this query is to just return the names of those in the same classrooms, not to return a specific student by name.
That aside, what do you want me to do? I’ assuming the statement will need carving into two, or something like that.
because a GROUP BY on the classroom returns a single aggregate row for the classroom, and since there are multiple ~different~ student names in a classroom, asking for one of them, without stating which one, is nonsensical
similarly, you have both
, GROUP_CONCAT(links.user_id) AS users
, links.user_id
in the SELECT clause – the first one is okay, because it aggregates all the users into a single string, but the second one isn’t, because it’s not an aggregate function
let’s get back to your fixing your query, though
we’ll begin with simple retrieval from the bookmarks table –
SELECT bookmarks.id
, bookmarks.url
, bookmarks.title
, SUBSTR(bookmarks.snippet, 1, 100) as snippet
, bookmarks.datetime
FROM bookmarks
you mentioned that there could be duplicate urls in this table – may i ask why?
okay, so given that a particular url can be in the table more than once, will ~all~ rows for the same url have identical titles and snippets and datetimes?
presumably the answer is no
so let me ask you then, if you were to aggregate/collapse all the rows for a particular url into one result row, which title or snippet or datetime would you like to see? keep in mind the student name analogy for the classroom count
So far, Under Cloud can handle duplicate bookmarks on an individual basis, like the aforementioned.
As with the example I provided, the first instance would be the proper bookmark to use, as all others are more recent.
So in the index view, when a duplication is discovered, the very first bookmark would be used, along with the date and time by which it was added.
I suppose the next logical question to you would be, are the duplicates detected within the rows retrieved within the remit of the LIMIT clause, or across the whole of the table?
here’s the query which counts the bookmarks, augmented with the “earliest” data…
SELECT bookmarks.url
, m.dupes
, bookmarks.title
, SUBSTR(bookmarks.snippet, 1, 100) as snippet
, bookmarks.datetime
FROM ( SELECT url
, COUNT(*) AS dupes
, MIN(datetime) AS earliest
FROM bookmarks
GROUP
BY url ) AS m
INNER
JOIN bookmarks
ON bookmarks.url = m.url
AND bookmarks.datetime = m.earliest
Index is simply the listing of all bookmarks in date order.
I just tried that on the live data and it didn’t combine the two most recent duplicates “Alan Turing’s Patterns in Nature, and Beyond”, which was added by another user and myself.
They were, but the I’ve since edited mine (removing extraneous name-value parameters).
Having run the query again, the “dupes” column is now showing two and there’s only one bookmark being listed, and the date and time relates to the first bookmark added by Anna. So that’s now working.
Well it was such an appalling mess, I just reverted back!
I was trying to merge the parts of the original query (posted previously) with your own.
I’ll be honest, all of this way beyond me — the queries I write are, for the most part, very basic.
SELECT
bookmarks.id, source.duplicates, GROUP_CONCAT(links.user_id) AS users, links.user_id, bookmarks.url, bookmarks.title, SUBSTR(bookmarks.snippet, 1, 100) as snippet, GROUP_CONCAT(tags.tag) AS tags, bookmarks.datetime, links.status
FROM (
SELECT url, COUNT(*) AS duplicates, MIN(datetime) AS earliest
FROM bookmarks GROUP BY url) AS source, links, tags
INNER JOIN bookmarks ON (bookmarks.url = source.url)
AND (bookmarks.datetime = source.earliest)
AND (links.status = 'public')
AND (links.bookmark_id = bookmarks.id) GROUP BY bookmarks.url
ORDER BY bookmarks.datetime DESC LIMIT 0, 9
All of which gives me an error:
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘ON (bookmarks.url = source.url) AND (bookmarks.datetime = source.earliest) AND (’ at line 1
From what I can grasp, MySQL doesn’t know to handle the various references to columns belonging to the links and tags tables, and I don’t know where to put the table names.