Useless Joins


Print this article  Send the URL to a friend

Environment

Tests on products from this company are performed by a (large) pool of volunteers, who are recorded in the TAB_VOLUNTEERS table and identified by a number, VOL_ID.
When they participate in a test (identified by TST_ID) they become testers, and their association with a given test is recorded, among other things, in the TAB_TESTERS table.

Code

The following query was caught in memory - all constant values being hard-coded, which is already a pretty bad sign. Obviously the purpose was to identify the volunteers who had participated in three, probably correlated, tests.

SQL> Select distinct '2004-0143',
  2                  34567,
  3                  VOL.VOL_ID,
  4                  1,
  5                  sysdate
  6  From TAB_VOLUNTEERS VOL,
  7       TAB_TESTERS TESTER0,
  8       TAB_TESTERS TESTER1,
  9       TAB_TESTERS TESTER2
 10  Where
 11   VOL.VOL_ACTIVE = 1
 12   And VOL.VOL_DATE_VM >= Trunc (sysdate - 425)
 13   And ( TESTER0.TST_ID='2003-0126'
 14         OR TESTER1.TST_ID='2003-0125'
 15         OR TESTER2.TST_ID='2003-0127')
 16   And TESTER0.TESTER_STATUS = 0
 17   And TESTER1.TESTER_STATUS = 0
 18   And TESTER2.TESTER_STATUS = 0
 19   And VOL.VOL_ID = TESTER0.VOL_ID
 20   And VOL.VOL_ID = TESTER1.VOL_ID
 21   And VOL.VOL_ID = TESTER2.VOL_ID

An inquisitive look into DBA_TABLES show the tables to be of a rather modest size (we know the statistics to be up-to-date). So, this is a problem query ?

OWNER           TABLE_NAME                       NUM_ROWS
--------------- ------------------------------ ----------
TESTING         TAB_VOLUNTEERS                       3529
TESTING         TAB_TESTERS                         32797

A good way to get a starting point is to run the query as is under SQL*Plus (in AUTOTRACE TRACEONLY mode):

SQL> /

30 rows selected.

Elapsed: 00:02:32.55

As a general rule, the time to run a query should be, as much as possible, related to the amount of data returned.
One of the numerous ergonomical problems associated with database queries is that what the end user sees is what is returned, not the amount of work required to filter out the data. In an ideal world, ’no data found’ should return instantly. A user may be ready to wait 2 and a half minutes for 500 or 1,000 rows - not for 30. Can we make this query run faster ?

Here are the execution plan and statistics displayed by the AUTOTRACE TRACEONLY SQL*Plus setting :

Execution Plan
----------------------------------------------------------
 0    SELECT STATEMENT Optimizer=CHOOSE (Cost=145 Card=28 Bytes=1736)
 1  0   SORT (UNIQUE) (Cost=145 Card=28 Bytes=1736)
 2  1     HASH JOIN (Cost=137 Card=28 Bytes=1736)
 3  2       HASH JOIN (Cost=95 Card=1467 Bytes=67482)
 4  3         HASH JOIN (Cost=56 Card=299 Bytes=8970)
 5  4           TABLE ACCESS (FULL) OF 'TAB_VOLUNTEERS' (Cost=17 Card=61 Bytes=854)
 6  4         INDEX (FAST FULL SCAN) OF 'I_1' (NON-UNIQUE) (Cost=38 Card=17253 Bytes=276048)
 7  3       INDEX (FAST FULL SCAN) OF 'I_1' (NON-UNIQUE) (Cost=38 Card=17253 Bytes=276048)
 8  2     INDEX (FAST FULL SCAN) OF 'I_1' (NON-UNIQUE) (Cost=38 Card=17253 Bytes=276048)

Statistics
----------------------------------------------------------
       6119  recursive calls
       1078  db block gets
      19025  consistent gets
       9661  physical reads
     162976  redo size
       2681  bytes sent via SQL*Net to client
        716  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
         30  rows processed

There is nothing to worry about the full scan of TAB_VOLUNTEERS - which isn’t that big. However, the amount of logical gets (the sum of db block gets and consistent gets) looks enormous for only 30 rows returned. Do we really need to visit over 20,000 Oracle blocks to return 30 miserable rows ? The search conditions don’t look so complicated. What is wrong ? Hash-joins ? Should we force nested loops instead ? Must we create new indexes ?

Improving the query

There are several mistakes in this query. First of all, whence does the data we want come ? A single table, TAB_VOLUNTEERS. Why, then, is there anything else than TAB_VOLUNTEERS in the FROM clause ?
TAB_TESTERS is here as a filtering condition - and has its place in the WHERE clause, as a subquery. This may look like an alternate way of writing the query, but it runs deeper than that. For one thing, it will allow to get rid of this ugly DISTINCT. Come on, we are fetching from TAB_VOLUNTEERS a column which awfully looks like the primary key of that very same table. Why would we need a DISTINCT ? It contradicts the very concept of primary key. Unless (perish the thought) that it is an attempt at hiding faulty logic ?

Another, and actually more lethal, mistake are the multiple of occurrences of TAB_TESTERS in the query. It is fairly easy to understand the developer’s reasoning : I get the volunteers who have participated in a test by joining TAB_VOLUNTEERS to TAB_TESTERS, I have three tests to check, therefore I join TAB_TESTERS three times. The title of this paper has leaked it : this is an utterly wrong way to look at the data. The proper way to see what is occurring in the execution of the query is that we are going three times to the same table because we don’t think in terms of sets - we have a set of three tests and just want the set of volunteers who have participated in them. We need TAB_VOLUNTEERS once, and TAB_TESTERS once.
How could we write the query then ? Well, very easily :

SQL> set timing on
SQL> set autotrace on
SQL> Select '2004-0143',
  2         34567,
  3         VOL.VOL_ID,
  4         1,
  5         sysdate
  6  From TAB_VOLUNTEERS VOL
  7  Where
  8   VOL.VOL_ACTIVE = 1
  9   And VOL.VOL_DATE_VM >= Trunc (sysdate - 425)
 10   AND VOL.VOL_ID in
 11        (select TESTER.VOL_ID
 12         from TAB_TESTERS TESTER
 13         where TESTER.TESTER_STATUS = 0
 14           And TESTER.TST_ID in('2003-0126',
 15                                '2003-0125',
 16                                '2003-0127'))
 17  /


30 rows selected.

Elapsed: 00:00:02.18

Execution Plan
----------------------------------------------------------
  0     SELECT STATEMENT Optimizer=CHOOSE (Cost=29 Card=61 Bytes=1586)
  1  0    HASH JOIN (Cost=29 Card=61 Bytes=1586)
  2  1      TABLE ACCESS (FULL) OF 'TAB_VOLUNTEERS' (Cost=17 Card=61 Bytes=854)
  3  1      VIEW OF 'VW_NSO_1' (Cost=11 Card=67 Bytes=804)
  4  3        SORT (UNIQUE) (Cost=11 Card=67 Bytes=1072)
  5  4          INLIST ITERATOR
  6  5            INDEX (RANGE SCAN) OF 'TAB_TESTERS_IDX2' (NON-UNIQUE) (Cost=4 Card=68 Bytes=1088)

Statistics
----------------------------------------------------------
          0  recursive calls
          6  db block gets
        280  consistent gets
        267  physical reads
          0  redo size
       2681  bytes sent via SQL*Net to client
        716  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          4  sorts (memory)
          0  sorts (disk)
         30  rows processed

A sudden 60-fold improvement ...

At this stage, we can wonder about indexes. Are we using them as we should ?

SQL> @idx TAB_VOLUNTEERS

INDEX_NAME                              KEYS POS COLUMN_NAME            DIST_VAL
--------------------------------- ---------- --- -------------------- ----------
TAB_VOLUNTEERS_IDX1                     3527   1 VOL_ACTIVE                    2
                                               2 VOL_NAME                   3071
                                               3 VOL_FIRSTNAME              1372
TAB_VOLUNTEERS_PK(U)                    3529   1 VOL_ID                     3529

Elapsed: 00:00:00.91

SQL> @idx TAB_TESTERS

INDEX_NAME                     KEYS POS COLUMN_NAME            DIST_VAL
------------------------ ---------- --- -------------------- ----------
TAB_TESTERS_FK1                3519   1 VOL_ID                     3519
TAB_TESTERS_FK2                2979   1 TST_ID                     2523
                                      2 SES_ID                     2979
TAB_TESTERS_IDX1              32167   1 TESTER_PAYMENT                2
                                      3 VOL_ID                     3519
                                      2 TST_ID                     2523
TAB_TESTERS_IDX2              32167   1 TST_ID                     2523
                                      2 TESTER_STATUS                 4
                                      3 VOL_ID                     3519
TAB_TESTERS_PK(U)             32797   1 TST_ID                     2523
                                      2 SES_ID                     2979
                                      3 VOL_ID                     3519
I_1                           32167   1 TST_ID                     2523
                                      2 VOL_ID                     3519
                                      3 TESTER_STATUS                 4


15 rows selected.

It may seem surprising that Oracle chooses not to use the primary key index on TAB_VOLUNTEERS. We are not great fans of hints, but this looks a little strange and at this stage, we should try to see which of /*+ FIRST_ROWS */ (to favour indexes) and /*+ ALL_ROWS */ (to favour hash joins) runs fastest. But from now on it becomes trivial ...

Copyright © RoughSea Ltd, 2004