In this section, we learn how to use a BY statement, in conjunction with a SET statement, to identify the first and last observations among a group of observations. As we'll see in the examples that follow, once we've identified the first and last observations in a group, we can use the information to modify existing variables, create new variables, change the structure of our data set, and so on.
Throughout this section, we'll work with a permanent SAS data set called sales:
Store | Dept | Quarter | Sales |
---|---|---|---|
101 | 10 | 1 | 110001.50 |
101 | 10 | 2 | 113101.20 |
101 | 10 | 3 | 111932.15 |
101 | 10 | 4 | 99901.10 |
101 | 20 | 1 | 110002.36 |
101 | 20 | 2 | 99922.39 |
101 | 20 | 3 | 98832.98 |
101 | 20 | 4 | 110101.70 |
121 | 20 | 1 | 121947.10 |
121 | 20 | 2 | 119964.69 |
121 | 20 | 3 | 122136.28 |
121 | 20 | 4 | 120111.11 |
121 | 10 | 1 | 127192.92 |
121 | 10 | 2 | 125280.13 |
121 | 10 | 3 | 128203.56 |
121 | 10 | 4 | 123632.29 |
109 | 10 | 1 | 120422.77 |
109 | 10 | 2 | 123984.32 |
109 | 10 | 3 | 121801.29 |
109 | 10 | 4 | 122125.66 |
109 | 30 | 1 | 98310.13 |
109 | 30 | 2 | 97331.25 |
109 | 30 | 3 | 96386.28 |
109 | 30 | 4 | 98511.90 |
109 | 20 | 1 | 115239.09 |
109 | 20 | 2 | 113001.98 |
109 | 20 | 3 | 114234.32 |
109 | 20 | 4 | 114122.65 |
containing four variables:
- Store: a three-digit identification number of a store
- Dept: a two-digit number identifying a department within each store (10 = clothing, 20 = housewares, 30 = sporting goods, say)
- Quarter: quarter of the year (1 = first three months, 2 = second three months, ..., and so on)
- Sales: the total dollar amount of the sales for the quarter in the given store's department
Let's be concrete here ... suppose we use a BY statement, in conjunction with a SET statement, to tell SAS to group the observations in the sales data set by Store:
DATA salesgrouped;
set sales;
by Store;
RUN;
Then, SAS knows that the first eight observations when Store = 101 comprise the first group, the next eight observations when Store = 121 comprise the second group, and the last twelve observations when Store = 109 comprise the last group. Well, okay, that's not technically quite correct! There's a little rule that we can't ignore ... when you use the BY statement with a SET statement, the data set(s) that are listed in the SET statement must be sorted by the values of the BY variable(s). As you can see, the sales data set is not yet sorted by the Store value. We'll worry about such details in the programs that follow. For now, know that our interest is in identifying the first and last observations within the Store = 101 group, the first and last observations within the Store = 121 group, and so on. Let's work our way through an example!
Finding the First and Last Observations in a Group Section
When you specify just one variable name in a BY statement, you can identify the first and last observations in a group.
Example 13.11
The following program tells SAS to process the sales data set by Store, just so we can get a behind-the-scenes look at how SAS groups observations and how we can subsequently find the first and last observations of each group:
LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
by Store;
RUN;
DATA storesales;
set srtdsales;
by Store;
firstStore = first.Store;
lastStore = last.Store;
RUN;
PROC PRINT data= storesales;
title 'Behind the scene view of the storesales data set';
id Store;
RUN;
Store | Dept | Quarter | Sales | firstStore | lastStore |
---|---|---|---|---|---|
101 | 10 | 1 | 110001.50 | 1 | 0 |
101 | 10 | 2 | 113101.20 | 0 | 0 |
101 | 10 | 3 | 111932.15 | 0 | 0 |
101 | 10 | 4 | 99901.10 | 0 | 0 |
101 | 20 | 1 | 110002.36 | 0 | 0 |
101 | 20 | 2 | 99922.39 | 0 | 0 |
101 | 20 | 3 | 98832.98 | 0 | 0 |
101 | 20 | 4 | 110101.70 | 0 | 1 |
109 | 10 | 1 | 120422.77 | 1 | 0 |
109 | 10 | 2 | 123984.32 | 0 | 0 |
109 | 10 | 3 | 121801.29 | 0 | 0 |
109 | 10 | 4 | 122125.66 | 0 | 0 |
109 | 30 | 1 | 98310.13 | 0 | 0 |
109 | 30 | 2 | 97331.25 | 0 | 0 |
109 | 30 | 3 | 96386.28 | 0 | 0 |
109 | 30 | 4 | 98511.90 | 0 | 0 |
109 | 20 | 1 | 115239.09 | 0 | 0 |
109 | 20 | 2 | 113001.98 | 0 | 0 |
109 | 20 | 3 | 114234.32 | 0 | 0 |
109 | 20 | 4 | 114122.65 | 0 | 1 |
121 | 20 | 1 | 121947.10 | 1 | 0 |
121 | 20 | 2 | 119964.69 | 0 | 0 |
121 | 20 | 3 | 122136.28 | 0 | 0 |
121 | 20 | 4 | 120111.11 | 0 | 0 |
121 | 10 | 1 | 127192.92 | 0 | 0 |
121 | 10 | 2 | 125280.13 | 0 | 0 |
121 | 10 | 3 | 128203.56 | 0 | 0 |
121 | 10 | 4 | 123632.29 | 0 | 1 |
The first procedure just takes care of the details that we ignored earlier. Because we want SAS to process the sales data set BY Store, we first need to sort the data. The SORT procedure tells SAS to sort the sales data set by the value of Store and to call the resulting sorted data set srtdsales.
Now for the DATA step. The SET and BY statements tell SAS to process the data by grouping observations with a similar Store value together. How does SAS accomplish this? Whenever you use a BY statement with a SET statement, SAS automatically creates two temporary variables for each variable name that appears in the BY statement. One of the temporary variables is called FIRST.variable, where variable is the variable name appearing in the BY statement. The other temporary variable is called, what else but, LAST.variable. The two variables always equal either 1 or 0:
- FIRST.variable = 1 when an observation is the first observation in a BY group
- FIRST.variable = 0 when an observation is not the first observation in a BY group
- LAST.variable = 1 when an observation is the last observation in a BY group
- LAST.variable = 0 when an observation is not the last observation in a BY group
SAS uses the values of the FIRST.variable and LAST.variable temporary variables to identify the first and last observations in a group, and therefore the group itself. Oh, a comment about that adjective temporary ... SAS places FIRST.variable and LAST.variable in the program data vector and they are therefore available for DATA step programming, but SAS does not add them to the SAS data set being created. It is in that sense that they are temporary.
Because SAS does not write FIRST.variables and LAST.variables to output data sets, we have to do some finagling to see their contents. The two assignment statements:
firstStore = first.Store;
lastStore = last.Store;
simply tell SAS to assign the values of the temporary variables, FIRST.Store and LAST.Store, to permanent variables, firstStore and lastStore, respectively. The PRINT procedure tells SAS to print the resulting data set so that we can take an inside peek at the values of the FIRST.variables and LAST.variables.
If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run the program, and review the output.
You should see that firstStore (and hence the internal FIRST.Store variable) equals 1 for the first observation of Store 101, for the first observation of Store 109, and for the first observation of Store 121. And, firstStore equals 0 for all of the other observations in the data set. Likewise, lastStore (and hence the internal LAST.Store variable) equals 1 for the last observation of Store 101, for the last observation of Store 109, and for the last observation of Store 121. And, lastStore equals 0 for all of the other observations in the data set.
Now that we know the mechanism by which SAS can identify the first and last observations in a group, let's take a look at an example in which we use the information to create a new variable, as well as change the structure of a data set.
Example 13.12
The following program uses the SET and BY statements to tell SAS to identify the first and last observations for each Store that appears in the sales data set, and to subsequently use that information to determine and display the total sales (StoreTotal) — that is, across all of the departments and quarters — for each Store:
LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
by Store;
RUN;
DATA storesales;
set srtdsales;
by Store;
if first.Store then StoreTotal = 0;
StoreTotal + Sales;
if last.Store;
drop Dept Quarter Sales;
format StoreTotal Dollar13.2;
RUN;
PROC PRINT data= storesales;
title 'Sales by Store';
id Store;
sum StoreTotal;
RUN;
Store | StoreTotal |
---|---|
101 | $853,795.38 |
109 | $1,335,471.64 |
121 | $988,468.08 |
$3,177,735.10 |
We better start by having you run the program just to make sure you see first what the program does ... If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run the program, and review the output. In short, the program takes the sales data set displayed earlier and collapses it to create a new data set called storesales that contains just three observations — one for each Store — and two variables — the store (Store) and the total sales (StoreTotal).
Now, let's work our way through the code. Again, the SORT procedure tells SAS to sort the sales data set by the value of Store and to call the resulting sorted data set srtdsales. The SET and BY statements tell SAS to create the temporary variables FIRST.Store and LAST.Store. The next two statements in the DATA step:
if first.Store then StoreTotal = 0;
StoreTotal + Sales;
can almost be read as if they were in English. Set the store's total sales (StoreTotal) to 0 when you encounter the first observation for the store. Add the Sales for that observation to the StoreTotal. For every subsequent observation of the Store, add the Sales amount for the Store to the StoreTotal. When you encounter the last observation for the Store:
if last.Store;
drop Dept Quarter Sales;
format StoreTotal Dollar13.2;
include the observation in the storesales data set, and while so doing exclude the Dept, Quarter, and Sales variables and format the StoreTotal variable so its values are displayed as dollar amounts.
It is the subsetting IF statement:
if last.Store;
that deserves a bit more of a mention. First, it is equivalent to saying:
if last.Store = 1;
Recall that a subsetting IF statement tells SAS which variables to include in the output data set. Here, the subsetting IF statement tells SAS to include only those observations that correspond to the last observation for each Store. It is therefore the statement that takes the sales data set containing 28 observations and collapses it into the storesales data set containing just 3 observations. After all, the LAST.Store variable = 1 only for three observations — the three observations that correspond to the last observation for each Store.
Finding the First and Last Observations in Subgroups Section
In the previous examples, we specified just one variable name, Store, in the BY statement. Doing so allowed us to identify the first and last observations for each Store group. When you specify multiple variable names in a BY statement, you can identify the first and last observations in, not just one group, but multiple subgroups.
Example 13.13
The following program tells SAS to process the sales data set by Store and Dept, so we can get a behind-the-scenes look at how we can find the first and last observations of two subgroups:
LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
by Store Dept;
RUN;
DATA storesales;
set srtdsales;
by Store Dept;
firstStore = first.Store;
lastStore = last.Store;
firstDept = first.Dept;
lastDept = last.Dept;
RUN;
PROC PRINT data = storesales;
title 'Behind the scene view of the storesales data set';
id Store;
RUN;
Store | Dept | Quarter | Sales | firstStore | lastStore | firstDept | lastDept |
---|---|---|---|---|---|---|---|
101 | 10 | 1 | 110001.50 | 1 | 0 | 1 | 0 |
101 | 10 | 2 | 113101.20 | 0 | 0 | 0 | 0 |
101 | 10 | 3 | 111932.15 | 0 | 0 | 0 | 0 |
101 | 10 | 4 | 99901.10 | 0 | 0 | 0 | 1 |
101 | 20 | 1 | 110002.36 | 0 | 0 | 1 | 0 |
101 | 20 | 2 | 99922.39 | 0 | 0 | 0 | 0 |
101 | 20 | 3 | 98832.98 | 0 | 0 | 0 | 0 |
101 | 20 | 4 | 110101.70 | 0 | 1 | 0 | 1 |
109 | 10 | 1 | 120422.77 | 1 | 0 | 1 | 0 |
109 | 10 | 2 | 123984.32 | 0 | 0 | 0 | 0 |
109 | 10 | 3 | 121801.29 | 0 | 0 | 0 | 0 |
109 | 10 | 4 | 122125.66 | 0 | 0 | 0 | 1 |
109 | 20 | 1 | 115239.09 | 0 | 0 | 1 | 0 |
109 | 20 | 2 | 113001.98 | 0 | 0 | 0 | 0 |
109 | 20 | 3 | 114234.32 | 0 | 0 | 0 | 0 |
109 | 20 | 4 | 114122.65 | 0 | 0 | 0 | 1 |
109 | 30 | 1 | 98310.13 | 0 | 0 | 1 | 0 |
109 | 30 | 2 | 97331.25 | 0 | 0 | 0 | 0 |
109 | 30 | 3 | 96386.28 | 0 | 0 | 0 | 0 |
109 | 30 | 4 | 98511.90 | 0 | 1 | 0 | 1 |
121 | 10 | 1 | 127192.92 | 1 | 0 | 1 | 0 |
121 | 10 | 2 | 125280.13 | 0 | 0 | 0 | 0 |
121 | 10 | 3 | 128203.56 | 0 | 0 | 0 | 0 |
121 | 10 | 4 | 123632.29 | 0 | 0 | 0 | 1 |
121 | 20 | 1 | 121947.10 | 0 | 0 | 1 | 0 |
121 | 20 | 2 | 119964.69 | 0 | 0 | 0 | 0 |
121 | 20 | 3 | 122136.28 | 0 | 0 | 0 | 0 |
121 | 20 | 4 | 120111.11 | 0 | 1 | 0 | 1 |
Again, because we want SAS to process the sales data set BY Store and Dept, we first need to sort the data. The SORT procedure tells SAS to sort the sales data set by the value of Store and Dept and to call the resulting sorted data set srtdsales. The SET and BY statements tell SAS to process the srtdsales data set by Store and Dept. Because two variables appear in the BY statement, SAS creates two temporary variables for each one:
- FIRST.Store = 1 when SAS encounters a Store's first observation, and 0 otherwise
- LAST.Store = 1,
- when SAS encounters a Store's last observation, and 0 otherwise
- FIRST.Dept = 1, when SAS encounters a Dept's first observation, and 0 otherwise
- LAST.Dept = 1, when SAS encounters a Dept's last observation, and 0 otherwise
Because SAS does not write FIRST.variables and LAST.variables to output data sets, we again do some finagling to see their contents. The four assignment statements:
firstStore = first.Store;
lastStore = last.Store;
firstDept = first.Dept;
lastDept = last.Dept;
tell SAS to assign the values of the temporary variables to permanent variables so we can view their contents. The PRINT procedure tells SAS to print the resulting data set so that we can take an inside peek at the values of the FIRST.variables and LAST.variables.
If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run the program, and review the output to convince yourself that SAS uses the two FIRST.variables and the two LAST.variables to identify the first and last observations of each Store and of each Dept within each Store.
Example 13.14
The following program uses the SET and BY statements to tell SAS to identify the first and last observations for each Store and for each Dept within each Store, and to subsequently use that information to determine and display the total sales (DeptTotal) for each Dept within each Store:
LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
by Store Dept;
RUN;
DATA storesales;
set srtdsales;
by Store Dept;
if first.Dept then DeptTotal = 0;
DeptTotal + Sales;
if last.Dept;
drop Quarter Sales;
format DeptTotal Dollar13.2;
RUN;
PROC PRINT data = storesales NOOBS;
title 'Sales by Store and Department';
sum DeptTotal;
RUN;
Store | Dept | DeptTotal |
---|---|---|
101 | 10 | $434,935.95 |
101 | 20 | $418,859.43 |
109 | 10 | $488,334.04 |
109 | 20 | $456,598.04 |
109 | 30 | $390,539.56 |
121 | 10 | $504,308.90 |
121 | 20 | $484,159.18 |
$3,177,735.10 |
In reviewing the program, you should see that the program is just a slightly modified version of that of Example 1.12. Let's start again by having you launch and run the SAS program, and reviewing the output just to make sure you see first what the program does. (Before running, don't forget to edit the LIBNAME statement to reflect the location of your stored sales data set.) In short, the program takes the sales data set displayed earlier and collapses it to create a new data set called storesales that contains seven observations — one for each Dept within each Store — and three variables — the store (Store), the department (Dept) and the total sales (DeptTotal).
Again, the SORT procedure tells SAS to sort the sales data set by the value of Store and Dept and to call the resulting sorted data set srtdsales. The SET and BY statements tell SAS to create the temporary variables FIRST.Store, LAST.Store, FIRST.Dept, and LAST.Dept. The next two statements in the DATA step:
if first.Dept then DeptTotal = 0;
DeptTotal + Sales;
can again be read as if they were in English. Set the department's total sales (DeptTotal) to 0 when you encounter the first observation for the department. Add the Sales for that observation to the DeptTotal. For every subsequent observation for the Dept, add the Sales amount for the Dept to the DeptTotal. When you encounter the last observation for the Dept (within each Store):
if last.Dept;
drop Quarter Sales;
format DeptTotal Dollar13.2;
include the observation in the storesales data set, and while so doing exclude the Quarter and Sales variables, and format the DeptTotal variable so its values are displayed as dollar amounts.
It is again the subsetting IF statement:
if last.Dept;
that takes the sales data set containing 28 observations and collapses it into the storesales data set containing just 7 observations. After all, the LAST.Dept variable = 1 for just seven observations — the seven observations that correspond to the last observation for each Dept within each Store.