As you know, the WHERE statement or option in the DATA step or other procedures is very useful in selecting observations from a data set based on some criteria. In PROC SQL, the WHERE clause in the SELECT statement can also be used to subset data based on specified conditions. Any valid SAS expression can be put inside the WHERE clause, including functions, comparison or logical operators, and even some special operators. Making good use of it can increase programming efficiency and save computing resources greatly. As always, we will work through this subject with examples.
Example 32.9 Section
The following example uses the WHERE clause to select employees who work at a police department and have the job title of sergeant:
PROC SQL;
select Name,
Department,
Employee_annual_salary
from stat482.salary
where Department='POLICE' AND Position_title='SERGEANT';
QUIT;
Name | Department | Employee Annual Salary |
---|---|---|
PENDARVIS | POLICE | 103590 |
ODUM | POLICE | 106920 |
DANIELS | POLICE | 106920 |
TATE JR | POLICE | 103590 |
GOLDSMITH | POLICE | 110370 |
FRANKO | POLICE | 106920 |
RADDATZ | POLICE | 110370 |
FLEMING | POLICE | 100440 |
CASEY | POLICE | 110370 |
MC COY | POLICE | 110370 |
JACOBS | POLICE | 106920 |
To run this program, please download the SAS data set salary.sas7bdat and save it to your computer.
Reading through the program, you must have known that it selects the name, department, and annual salary information from salary data for police sergeants. Note that the columns in the WHERE clause do not have to be specified in the SELECT clause, (such as Position title), which is used in the WHERE clause but not in the SELECT clause. However, for the sake of the results checking, I would suggest keeping these columns in the query until verified.
Launch and run the SAS program, and review the output to convince yourself that the records have been selected as described.
We saw two types of operators used in the above program, the comparison (=) and the logical (and). Besides these common ones, another type that could be very useful in your programming is called a conditional operator. You may know some of them already, like IN, CONTAINS, and MISSING. You can find the complete list of operators in the SAS documentation. Next, let’s look at a couple of examples of this using BETWEEN AND and LIKE.
BETWEEN value-1 AND value-2
Both value-1 and value-2 are end values. So you can use the BETWEEN AND operator to specify a range of values, such as from one date to another, or from lower limit to upper limit. The smaller value does not have to be the first.
Example 32.10 Section
The following program uses the operator, BETWEEN AND, to select observations from salary data whose annual salary is between \65,000 and \70,000, and also works in the Fire department:
PROC SQL;
select Name,
Department,
Employee_annual_salary label='Salary' format=DOLLAR12.2
from stat482.salary
where Employee_annual_salary between 65000 and 70000
and Department='FIRE';
QUIT;
Name | Department | Salary |
---|---|---|
KELLY | FIRE | $65,946.00 |
KOCHANEY | FIRE | $65,946.00 |
Launch and run the SAS program, and review the query output to convince yourself that the SAS yield the result as expected.
Column LIKE ‘pattern’
With the LIKE operator, you have to specify a column name and the pattern to be matched. Regarding the pattern, first, it is case-sensitive and has to be enclosed in quotation marks; secondly, it may contain a special character, either an underscore(_) and/or a percent sign(%). The underscore character stands for any single character and the percent sign for any sequence of zero or more characters. For example, assume that you are working with a table containing these values for a column.
- Cathy
- Kathy
- Kathie
- Katherine
Now using different patterns, the selection results are different:
Patterns | Results |
---|---|
Kath_ | Kathy |
Kath__ | Kathie |
Kath% | Kathy, Kathie, Katherine |
_ath% | All of the names above |
Example 32.11 Section
The following program shows the use of the LIKE operator in a WHERE clause to select name, department, position title and annual salary information for people whose name starts with R and the third letter is B:
PROC SQL;
select Name,
Department,
Position_title,
Employee_annual_salary label='Salary' format=DOLLAR12.2
from stat482.salary
where Name like 'R_B%';
QUIT;
Name | Department | Position Title | Salary |
---|---|---|---|
ROBINSON | WATER MGMNT | OPERATING ENGINEER-GROUP C | $93,745.60 |
RABANALES | FINANCE | AUDITOR II | $87,912.00 |
ROBERTS | FIRE | PARAMEDIC I/C | $79,404.00 |
ROBINSON | WATER MGMNT | CONSTRUCTION LABORER | $79,040.00 |
Launch and run the SAS program, and review the query output to convince yourself that the SAS behaves as described.
Another point worthy of being made here is the CALCULATED keywords. In the last section, you learned that we can perform calculations in the SELECT statement and assign an alias to that new column. However, because SAS processes the WHERE clause prior to the SELECT clause, you will run into a problem if the calculated column is used in a WHERE clause as a condition. Therefore, the keyword CALCULATED has to be inserted into the WHERE clause along with the alias to inform SAS that the value is calculated within the query. This point will be illustrated by the following programs.
Example 32.12 Section
The following program attempts to calculate the bonus for every employee, then select ones who have more than $2,000 as a bonus:
PROC SQL;
select Name,
Department,
Employee_annual_salary label='Salary' format=DOLLAR12.2,
Employee_annual_salary * 0.02 as Bonus
from stat482.salary
where Bonus >2000 ;
QUIT;
Launch and run the SAS program. You may want to see what’s going wrong yourself. In the log window, SAS delivered an error message that the column Bonus cannot be found.
ERROR: The Following columns were not found in the contributing tables: Bonus.
That’s because SAS processes the WHERE clause before the SELECT clause. To make it right, add CALCULATED in the WHERE clause as shown below.
PROC SQL;
select Name,
Department,
Employee_annual_salary label='Salary' format=DOLLAR12.2,
Employee_annual_salary * 0.02 as Bonus format=DOLLAR10.2
from stat482.salary
where CALCULATED Bonus >2000 ;
QUIT;
Name | Department | Salary | Bonus |
---|---|---|---|
PENDARVIS | POLICE | $103,590.00 | $2,071.80 |
SAWYER | CITY COUNCIL | $117,333.00 | $2,346.66 |
ODUM | POLICE | $106,920.00 | $2,138.40 |
PLANTZ | GENERAL SERVICES | $102,060.00 | $2,041.20 |
FORD | FIRE | $100,440.00 | $2,008.80 |
DANIELS | POLICE | $106,920.00 | $2,138.40 |
MALONEY | FIRE | $127,566.00 | $2,551.32 |
MOLLOY | COMMUNITY DEVELOPMENT | $102,060.00 | $2,041.20 |
NIEGO | FIRE | $143,682.00 | $2,873.64 |
PRICE | FIRE | $103,590.00 | $2,071.80 |
TATE JR | POLICE | $103,590.00 | $2,071.80 |
JIMENEZ | PROCUREMENT | $113,448.00 | $2,268.96 |
CEBALLOS | WATER MGMNT | $113,448.00 | $2,268.96 |
FERMAN | FIRE | $131,466.00 | $2,629.32 |
SHUM | TRANSPORTN | $104,736.00 | $2,094.72 |
FUNK | FIRE | $101,688.00 | $2,033.76 |
WRZESINSKI | FIRE | $105,918.00 | $2,118.36 |
GOLDSMITH | POLICE | $110,370.00 | $2,207.40 |
FRANKO | POLICE | $106,920.00 | $2,138.40 |
RADDATZ | POLICE | $110,370.00 | $2,207.40 |
MC NABB | FIRE | $105,918.00 | $2,118.36 |
FLEMING | POLICE | $100,440.00 | $2,008.80 |
CASEY | POLICE | $110,370.00 | $2,207.40 |
MACELLAIO JR | WATER MGMNT | $101,920.00 | $2,038.40 |
DARLING | LAW | $149,160.00 | $2,983.20 |
NASH | FINANCE | $103,740.00 | $2,074.80 |
MC COY | POLICE | $110,370.00 | $2,207.40 |
HOLDER | WATER MGMNT | $108,534.40 | $2,170.69 |
TAYLOR | FIRE | $108,462.00 | $2,169.24 |
PERFETTI | POLICE | $112,068.00 | $2,241.36 |
JACOBS | POLICE | $106,920.00 | $2,138.40 |
HENRY | AVIATION | $108,534.40 | $2,170.69 |
IRELAND | FIRE | $113,400.00 | $2,268.00 |
Now it’s working! Make the same change to your program. Check the output to make sure that SAS processes the data properly.
An alternative to using the keyword CALCULATED is to repeat the calculation expression in the WHERE clause. In the preceding program, the WHERE clause can be rewritten as:
where Employee_annual_salary *0.02 >2000;
But note that this is not an efficient way to do this because SAS has to do the calculation twice.