6 Lessons Learned From Bad SQL

Here are some interesting variations of SQL syntax. The simple example I chose is probably ill suited for any actual program, but that makes it more interesting.

The Assignment

We want to gather the hours reported on timecards for each employee over a two week pay period. We’ll use the PS_EMPLOYEES table for the employee ID (EMPLID) and name (NAME). We’ll use the PS_TL_RPTD_TIME table for the hours reported (RPTD) by each employee.

PS_EMPLOYEES is a summary table that many Peoplesoft HCM customers generate each night. Originally, it was generated by an SQR program, written by Peoplesoft Corporation, called PER099.SQR. Currently, it is generated by an Application Engine program, also by Peoplesoft Corporation, still called PER099.

PS_EMPLOYEES contains all the people, and only the people, who have ‘EMP’ in the PER_ORG field of their row in PS_PER_ORG_ASGN and ‘A’ in the HR_STATUS field of their current row of PS_JOB. In other words, PS_EMPLOYEES contains all current employees.

PS_TL_RPTD_TIME is the reported time table in the Time & Labor module. It can contain punch time records (first row: employee punched in at 8 AM, second row: employee punched out at 5 PM) and elapsed time records (employee worked for eight hours).

My organization uses elapsed time only. It also has “exception time reporters” who don’t need to enter their usual eight work hour days, but need to enter any other day (4 hours of work, 4 hours of sick time). We also have “time” card rows for non-time entries (reimbursements for miles driven or for work-related expenses).

1. Row By Row

Here is one way to code it.

move '' to $prev_emplid
begin-select
E.EMPLID
E.NAME
T.TL_QUANTITY
if &E.EMPLID <> $prev_emplid
if $prev_emplid <> ''
show $prev_emplid ' ' $prev_name ' ' #total_quantity
end-if
move &E.EMPLID to $prev_emplid
move &E.NAME to $prev_name
move 0 to #total_quantity
end-if
add &T.TL_QUANTITY to #total_quantity
from PS_EMPLOYEES E, PS_TL_RPTD_TIME T
where T.EMPLID = E.EMPLID
and T.EMPL_RCD = E.EMPL_RCD
and T.DUR between '31-MAY-2009' and '13-JUN-2009'
order by E.EMPLID
end-select
show $prev_emplid ' ' $prev_name ' ' #total_quantity

The TL_QUANTITY field is the “time and labor quantity,” which can be hours, dollars, or some other unit (miles). The DUR field is the “date under report,” which is the day on which the hours were worked, or the expense was incurred, or the miles were driven. We sort by EMPLID, and whenever it changes we output the ID, name, and total for the employee we’ve just finished. We have to remember to do this one more time when the begin-select loop finishes.

We could use SQR’s on-break feature to avoid testing for EMPLID changes and to handle the end-of-loop output, but there’s a simpler way to dispense with the change-of-value issue entirely.

2. SQL Aggregate Function

Here’s another way to code it.

begin-select
E.EMPLID
E.NAME
sum(T.TL_QUANTITY) &T.TOTAL_QUANTITY
show &E.EMPLID ' ' &E.NAME ' ' &T.TOTAL_QUANTITY
from PS_EMPLOYEES E, PS_TL_RPTD_TIME T
where T.EMPLID = E.EMPLID
and T.EMPL_RCD = E.EMPL_RCD
and T.DUR between '31-MAY-2009' and '13-JUN-2009'
group by E.EMPLID, E.NAME
order by E.EMPLID
end-select

We’ve delegated work to the database server and eliminated much of the potential for program bugs. We may have performance improvements as well if the database server has a faster (or less busy) CPU than the application server, or if we eliminate significant traffic from the network.

This particular example is fine, but sometimes a selection has many unaggregated fields that correlate one-to-one, and an extensive, unnecessary grouping process. For example, suppose we list EMPLID, NAME, DEPTID, JOBCODE, POSITION_NBR, and HOURLY_RT from PS_EMPLOYEES. There is only one row per person in PS_EMPLOYEES, yet the database server has to prepare to group each employee with multiple names, each name with multiple departments, each department with multiple jobcodes, and so on.

3. SQL Aggregate Function, Part 2

Here’s another way to code it.

begin-select
E.EMPLID
max(E.NAME) &E.NAME
sum(T.TL_QUANTITY) &T.TOTAL_QUANTITY
show &E.EMPLID ' ' &E.NAME ' ' &T.TOTAL_QUANTITY
from PS_EMPLOYEES E, PS_TL_RPTD_TIME T
where T.EMPLID = E.EMPLID
and T.EMPL_RCD = E.EMPL_RCD
and T.DUR between '31-MAY-2009' and '13-JUN-2009'
group by E.EMPLID
order by E.EMPLID
end-select

In this example, we’ve replaced an unnecessary grouping (by E.NAME) with an unnecessary aggregate function (max(E.NAME)). Is it faster? Is it more readable? It’s hard to say. Character fields can use the max() or min() functions. Numeric fields can also use the avg() or sum() functions.

We can also take this approach when there are multiple rows in which (1) we know all the values for that field are the same or (2) we don’t care which value we use. For example, a single PS_PAY_CHECK row may have one or more PS_PAY_EARNINGS rows and each PS_PAY_EARNINGS row may have zero or more PS_PAY_OTH_EARNS rows. The same ERNCD (earnings code) could appear several times for one paycheck. We might join PS_PAY_EARNINGS and PS_PAY_OTH_EARNS, group by ERNCD, and sum OTH_EARNS (dollar amount). We want the EMPLID, DEPTID, and JOBCODE from PS_PAY_EARNINGS, and we know they are the same for all the rows of the particular paycheck.

4. A Table Can Be A Selection

Here’s another way to code it.

begin-select
E.EMPLID
E.NAME
T.TOTAL_QUANTITY
show &E.EMPLID ' ' &E.NAME ' ' &T.TOTAL_QUANTITY
from PS_EMPLOYEES E,
(select EMPLID, sum(TL_QUANTITY) TOTAL_QUANTITY
from PS_TL_RPTD_TIME
where DUR between '31-MAY-2009' and '13-JUN-2009'
group by EMPLID) T
where T.EMPLID = E.EMPLID
end-select

We’ve replaced a table name with a select command which groups and aggregates a single table, and stores it temporarily with columns EMPLID and TOTAL_QUANTITY. Now our actual grouping requirement is clearer; that we intend to group by EMPLID only, and not by NAME. This technique can be extended. If we’re joining many tables together, we can separate the criteria and grouping that apply to each table from the criteria that join the tables. That may improve readibility or maintainability. I can’t predict how it will affect performance.

5. A Column Can Be A Selection

Here’s another way to code it.

begin-select
E.EMPLID
E.NAME
(select sum(TL_QUANTITY) from PS_TL_RPTD_TIME
where EMPLID = E.EMPLID
and DUR between '31-MAY-2009' and '13-JUN-2009') &T.TOTAL_QUANTITY
show &E.EMPLID ' ' &E.NAME ' ' &T.TOTAL_QUANTITY
from PS_EMPLOYEES E
end-select

We’ve replaced a column name with a select command. Note that we can still name the column &T.TOTAL_QUANTITY even though we never assign the alias T to any table. Also note that the selection, in parentheses, can span multiple lines and the following lines need not start at the beginning of their line. However, a follow-on line of a subselection should never start with the from keyword because the SQR compiler will interpret it as belonging to the main selection.

When we move the join into the subselection, the results change significantly. In all the previous selections, we had one row for each employee in PS_EMPLOYEES who also had timecard rows in the two week interval. That omitted the exception time reporters who had no exceptions and the employees on leave of absence (who also have no timecards). In this selection, we have all those employees and their &T.TOTAL_QUANTITY equals null.

Of the other aggregation functions, min() and max() will also return null for those employees. The count() function will return zero. The avg() function will return null, which is a good result, even though averages are sums (null) divided by counts (zero).

6. Think First, Then Code

The original theme of this blog entry was to show the various ways to join two tables in SQL and in begin-select. I picked tables with a one-to-many relationship that would contrast two approaches, (1) aggregating one table before joining the other versus (2) joining two tables and then aggregating.

In preparing the examples, I realized that there was no actual business problem that could be solved correctly and efficiently with this approach. I used the examples solely to illustrate the syntax, but they also illustrate a subtle pitfall any programmer may face. Our original understanding of our assignment was “to gather the hours reported on timecards for each employee over a two week pay period.” We reacted to the keywords “timecards” and “employee,” chose our tables, and started to program.

As programmer/analysts or software engineers, we recognize that design and coding are two separate stages of development. The challenge is to know when we’ve spent enough time thinking about our problem and our solution before we start to write a program. The assignment may have seemed easier than it really was. Let’s reconsider the problem.

  • Include everyone who was an employee during a given two week period.
  • Include everyone who had timecard entries within that two week period, even if they were not an employee at the time. (Perhaps they made timecard entries in advance and then left the organization.)
  • Report the total quantity from each person’s timecards. If someone on the list didn’t have timecards in this period, report the total quantity as zero. (Normally we would challenge this; it doesn’t make sense to add hours, dollars, miles, and other units, but let’s assume our organization only has hours.)

This is far more difficult than our first impression of the assignment. And before we leap back into the text editor with SQL outer joins or unions, let’s think about it … until next week.

Comments are closed.