DISTINCT vs GROUPBY ?

#1 September 12th, 2005, 08:12 PM

DISTINCT vs GROUPBY ?

Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx

#2 September 12th, 2005, 08:45 PM

Do you have separate Supplier and Invoice Tables? If not, I would start
there, and use one to many relationships. Ex. One Supplier will have many
Invoices; One Invoice will have many Orders.
K Board

Stef wrote:
DISTINCT vs GROUPBY ?

Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx

--
Message posted via http://www.accessmonster.com

#3 September 12th, 2005, 09:17 PM

Hi,

GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list, when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group":

SELECT FirstName, LastName, LAST(address), LAST(phoneNumber)
FROM myTable
GROUP BY FirstName, LastName

could be faster than

SELECT FirstName, LastName, address, phoneNumber
FROM myTable
GROUP BY FirstName, LastName, address, phoneNumber

if FirstName and LastName define the "groups", because here, the sorting
list has a larger number of bytes to handle, for each entry in the sorted
list.

Hoping it may help,
Vanderghast, Access MVP

"Stef" wrote in message
...
DISTINCT vs GROUPBY ?

Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx

#4 September 13th, 2005, 09:23 AM

Michel Walsh wrote:
GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list, when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group"

I'm not convinced by your argument

Neither rely on sorting. Instinct tells me DISTINCT would be faster
because it doesn't have to perform any aggregation, dummy or otherwise.
DISTINCT merely ignores duplicated rows, which is easy for the
optimizer e.g. a hash table.

Here are my test results:

SELECT DISTINCT data_col
FROM MillionRowTable;
1.266 secs

SELECT COUNT(*), data_col
FROM MillionRowTable
GROUP BY data_col;
1.953 secs

SELECT LAST(data_col), data_col
FROM MillionRowTable
GROUP BY data_col;
1.937 secs

Based on the above, I wouldn't consider GROUP BY to be significantly
slower but still think DISTINCT is the best choice.

#5 September 13th, 2005, 12:12 PM

Hi,

Even if it hashes, the hash has to be computed on each and every selected
expression, in the case you use more that ONE expression in the SELECT
statement, and nothing insures that two different row would never produce
the SAME hash value....

You are just sure that if two rows produce a different hash, then they are
different, but not the reverse.

Hashing is O(1), while plain sorting by comparison is O(2), so internally,
the database engine should perform the hashing, anyhow, when sorting on
large set. I fail to really see your point.

Your test did not include the GROUP BY statement equivalent to your
distinct, which is:

SELECT data_col
FROM MillionRowTable
GROUP BY data_col;

All your actual GROUP BY statement are penalized by returning TWO columns
(on a million rows) while your DISTINCT just return one column. That is
plainly unfair (almost twice the IO to produce the result).

I can be unfair too, what about:

SELECT DISTINCT pk, f1, f2, f3, f4, f5
FROM myTableWithAprimaryKey

versus

SELECT pk, LAST(f1), LAST(f2), LAST(f3), LAST(f4), LAST(f5)
FROM myTableWithAprimaryKey
GROUP BY pk

Vanderghast, Access MVP

wrote in message
ups.com...

Michel Walsh wrote:
GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it
will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list,
when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group"

I'm not convinced by your argument

Neither rely on sorting. Instinct tells me DISTINCT would be faster
because it doesn't have to perform any aggregation, dummy or otherwise.
DISTINCT merely ignores duplicated rows, which is easy for the
optimizer e.g. a hash table.

Here are my test results:

SELECT DISTINCT data_col
FROM MillionRowTable;
1.266 secs

SELECT COUNT(*), data_col
FROM MillionRowTable
GROUP BY data_col;
1.953 secs

SELECT LAST(data_col), data_col
FROM MillionRowTable
GROUP BY data_col;
1.937 secs

Based on the above, I wouldn't consider GROUP BY to be significantly
slower but still think DISTINCT is the best choice.

#6 September 13th, 2005, 01:58 PM

Michel Walsh wrote:
Even if it hashes

I don't know that it does. I was simply guessing that a DISTINCT would
be easy to optimize.

Your test did not include the GROUP BY statement equivalent to your
distinct, which is:

SELECT data_col
FROM MillionRowTable
GROUP BY data_col;

I get 0.985 secs.

I can be unfair too

I wasn't trying to be unfair, I was merely being foolish

Thread Tools
Show Printable Version Email this Page
Display Modes
Linear Mode Switch to Hybrid Mode Switch to Threaded Mode

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Counting Distinct Entries	Rick Deemer	New Users	2	September 10th, 2005 06:55 PM
DISTINCT option of an aggregate function in Access 2003	[email protected]	Running & Setting Up Queries	1	May 5th, 2005 04:46 PM
Newbie: Distinct values	steve	Running & Setting Up Queries	3	October 28th, 2004 08:35 PM
A2K Strange Characters in Distinct Query/Form Reference/combo box	david epsom dot com dot au	Running & Setting Up Queries	0	September 15th, 2004 10:33 AM
SELECT DISTINCT	MacDuff	Running & Setting Up Queries	3	July 24th, 2004 07:46 PM