A Microsoft Office (Excel, Word) forum. OfficeFrustration

If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

Go Back   Home » OfficeFrustration forum » Microsoft Access » Running & Setting Up Queries
Site Map Home Register Authors List Search Today's Posts Mark Forums Read  

DISTINCT vs GROUPBY ?



 
 
Thread Tools Display Modes
  #1  
Old September 12th, 2005, 08:12 PM
Stef
external usenet poster
 
Posts: n/a
Default DISTINCT vs GROUPBY ?

DISTINCT vs GROUPBY ?


Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx


  #2  
Old September 12th, 2005, 08:45 PM
BabyATX13 via AccessMonster.com
external usenet poster
 
Posts: n/a
Default

Do you have separate Supplier and Invoice Tables? If not, I would start
there, and use one to many relationships. Ex. One Supplier will have many
Invoices; One Invoice will have many Orders.
K Board

Stef wrote:
DISTINCT vs GROUPBY ?

Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx



--
Message posted via http://www.accessmonster.com
  #3  
Old September 12th, 2005, 09:17 PM
Michel Walsh
external usenet poster
 
Posts: n/a
Default

Hi,


GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list, when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group":


SELECT FirstName, LastName, LAST(address), LAST(phoneNumber)
FROM myTable
GROUP BY FirstName, LastName



could be faster than

SELECT FirstName, LastName, address, phoneNumber
FROM myTable
GROUP BY FirstName, LastName, address, phoneNumber


if FirstName and LastName define the "groups", because here, the sorting
list has a larger number of bytes to handle, for each entry in the sorted
list.



Hoping it may help,
Vanderghast, Access MVP



"Stef" wrote in message
...
DISTINCT vs GROUPBY ?


Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx



  #4  
Old September 13th, 2005, 09:23 AM
external usenet poster
 
Posts: n/a
Default


Michel Walsh wrote:
GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list, when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group"


I'm not convinced by your argument

Neither rely on sorting. Instinct tells me DISTINCT would be faster
because it doesn't have to perform any aggregation, dummy or otherwise.
DISTINCT merely ignores duplicated rows, which is easy for the
optimizer e.g. a hash table.

Here are my test results:

SELECT DISTINCT data_col
FROM MillionRowTable;
1.266 secs

SELECT COUNT(*), data_col
FROM MillionRowTable
GROUP BY data_col;
1.953 secs

SELECT LAST(data_col), data_col
FROM MillionRowTable
GROUP BY data_col;
1.937 secs

Based on the above, I wouldn't consider GROUP BY to be significantly
slower but still think DISTINCT is the best choice.

  #5  
Old September 13th, 2005, 12:12 PM
Michel Walsh
external usenet poster
 
Posts: n/a
Default

Hi,


Even if it hashes, the hash has to be computed on each and every selected
expression, in the case you use more that ONE expression in the SELECT
statement, and nothing insures that two different row would never produce
the SAME hash value....

You are just sure that if two rows produce a different hash, then they are
different, but not the reverse.

Hashing is O(1), while plain sorting by comparison is O(2), so internally,
the database engine should perform the hashing, anyhow, when sorting on
large set. I fail to really see your point.

Your test did not include the GROUP BY statement equivalent to your
distinct, which is:


SELECT data_col
FROM MillionRowTable
GROUP BY data_col;


All your actual GROUP BY statement are penalized by returning TWO columns
(on a million rows) while your DISTINCT just return one column. That is
plainly unfair (almost twice the IO to produce the result).

I can be unfair too, what about:


SELECT DISTINCT pk, f1, f2, f3, f4, f5
FROM myTableWithAprimaryKey


versus


SELECT pk, LAST(f1), LAST(f2), LAST(f3), LAST(f4), LAST(f5)
FROM myTableWithAprimaryKey
GROUP BY pk





Vanderghast, Access MVP



wrote in message
ups.com...

Michel Walsh wrote:
GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it
will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list,
when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group"


I'm not convinced by your argument

Neither rely on sorting. Instinct tells me DISTINCT would be faster
because it doesn't have to perform any aggregation, dummy or otherwise.
DISTINCT merely ignores duplicated rows, which is easy for the
optimizer e.g. a hash table.

Here are my test results:

SELECT DISTINCT data_col
FROM MillionRowTable;
1.266 secs

SELECT COUNT(*), data_col
FROM MillionRowTable
GROUP BY data_col;
1.953 secs

SELECT LAST(data_col), data_col
FROM MillionRowTable
GROUP BY data_col;
1.937 secs

Based on the above, I wouldn't consider GROUP BY to be significantly
slower but still think DISTINCT is the best choice.



  #6  
Old September 13th, 2005, 01:58 PM
external usenet poster
 
Posts: n/a
Default


Michel Walsh wrote:
Even if it hashes


I don't know that it does. I was simply guessing that a DISTINCT would
be easy to optimize.

Your test did not include the GROUP BY statement equivalent to your
distinct, which is:

SELECT data_col
FROM MillionRowTable
GROUP BY data_col;


I get 0.985 secs.

I can be unfair too


I wasn't trying to be unfair, I was merely being foolish

 




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Counting Distinct Entries Rick Deemer New Users 2 September 10th, 2005 06:55 PM
DISTINCT option of an aggregate function in Access 2003 [email protected] Running & Setting Up Queries 1 May 5th, 2005 04:46 PM
Newbie: Distinct values steve Running & Setting Up Queries 3 October 28th, 2004 08:35 PM
A2K Strange Characters in Distinct Query/Form Reference/combo box david epsom dot com dot au Running & Setting Up Queries 0 September 15th, 2004 10:33 AM
SELECT DISTINCT MacDuff Running & Setting Up Queries 3 July 24th, 2004 07:46 PM


All times are GMT +1. The time now is 02:35 PM.


Powered by vBulletin® Version 3.6.4
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 OfficeFrustration.
The comments are property of their posters.