# 13.3 - Finding First and Last Observations

13.3 - Finding First and Last ObservationsIn this section, we learn how to use a BY statement, in conjunction with a SET statement, to identify the first and last observations among a group of observations. As we'll see in the examples that follow, once we've identified the first and last observations in a group, we can use the information to modify existing variables, create new variables, change the structure of our data set, and so on.

Throughout this section, we'll work with a permanent SAS data set called *sales*:

Store | Dept | Quarter | Sales |
---|---|---|---|

101 | 10 | 1 | 110001.50 |

101 | 10 | 2 | 113101.20 |

101 | 10 | 3 | 111932.15 |

101 | 10 | 4 | 99901.10 |

101 | 20 | 1 | 110002.36 |

101 | 20 | 2 | 99922.39 |

101 | 20 | 3 | 98832.98 |

101 | 20 | 4 | 110101.70 |

121 | 20 | 1 | 121947.10 |

121 | 20 | 2 | 119964.69 |

121 | 20 | 3 | 122136.28 |

121 | 20 | 4 | 120111.11 |

121 | 10 | 1 | 127192.92 |

121 | 10 | 2 | 125280.13 |

121 | 10 | 3 | 128203.56 |

121 | 10 | 4 | 123632.29 |

109 | 10 | 1 | 120422.77 |

109 | 10 | 2 | 123984.32 |

109 | 10 | 3 | 121801.29 |

109 | 10 | 4 | 122125.66 |

109 | 30 | 1 | 98310.13 |

109 | 30 | 2 | 97331.25 |

109 | 30 | 3 | 96386.28 |

109 | 30 | 4 | 98511.90 |

109 | 20 | 1 | 115239.09 |

109 | 20 | 2 | 113001.98 |

109 | 20 | 3 | 114234.32 |

109 | 20 | 4 | 114122.65 |

containing four variables:

*Store*: a three-digit identification number of a store*Dept*: a two-digit number identifying a department within each store (10 = clothing, 20 = housewares, 30 = sporting goods, say)*Quarter*: quarter of the year (1 = first three months, 2 = second three months, ..., and so on)*Sales*: the total dollar amount of the sales for the quarter in the given store's department

Let's be concrete here ... suppose we use a BY statement, in conjunction with a SET statement, to tell SAS to group the observations in the *sales* data set by *Store*:

```
DATA salesgrouped;
set sales;
by Store;
RUN;
```

Then, SAS knows that the first eight observations when *Store* = 101 comprise the first group, the next eight observations when *Store* = 121 comprise the second group, and the last twelve observations when *Store* = 109 comprise the last group. Well, okay, that's not technically quite correct! There's a little rule that we can't ignore ... **when you use the BY statement with a SET statement, the data set(s) that are listed in the SET statement must be sorted by the values of the BY variable(s). **As you can see, the *sales* data set is not yet sorted by the *Store* value. We'll worry about such details in the programs that follow. For now, know that our interest is in identifying the first and last observations within the *Store* = 101 group, the first and last observations within the *Store* = 121 group, and so on. Let's work our way through an example!

## Finding the First and Last Observations in a Group

When you specify just one variable name in a BY statement, you can identify the first and last observations in a group.

## Example 13.11

The following program tells SAS to process the `sales` data set by `Store`, just so we can get a behind-the-scenes look at how SAS groups observations and how we can subsequently find the first and last observations of each group:

```
LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
by Store;
RUN;
DATA storesales;
set srtdsales;
by Store;
firstStore = first.Store;
lastStore = last.Store;
RUN;
PROC PRINT data= storesales;
title 'Behind the scene view of the storesales data set';
id Store;
RUN;
```

Store | Dept | Quarter | Sales | firstStore | lastStore |
---|---|---|---|---|---|

101 | 10 | 1 | 110001.50 | 1 | 0 |

101 | 10 | 2 | 113101.20 | 0 | 0 |

101 | 10 | 3 | 111932.15 | 0 | 0 |

101 | 10 | 4 | 99901.10 | 0 | 0 |

101 | 20 | 1 | 110002.36 | 0 | 0 |

101 | 20 | 2 | 99922.39 | 0 | 0 |

101 | 20 | 3 | 98832.98 | 0 | 0 |

101 | 20 | 4 | 110101.70 | 0 | 1 |

109 | 10 | 1 | 120422.77 | 1 | 0 |

109 | 10 | 2 | 123984.32 | 0 | 0 |

109 | 10 | 3 | 121801.29 | 0 | 0 |

109 | 10 | 4 | 122125.66 | 0 | 0 |

109 | 30 | 1 | 98310.13 | 0 | 0 |

109 | 30 | 2 | 97331.25 | 0 | 0 |

109 | 30 | 3 | 96386.28 | 0 | 0 |

109 | 30 | 4 | 98511.90 | 0 | 0 |

109 | 20 | 1 | 115239.09 | 0 | 0 |

109 | 20 | 2 | 113001.98 | 0 | 0 |

109 | 20 | 3 | 114234.32 | 0 | 0 |

109 | 20 | 4 | 114122.65 | 0 | 1 |

121 | 20 | 1 | 121947.10 | 1 | 0 |

121 | 20 | 2 | 119964.69 | 0 | 0 |

121 | 20 | 3 | 122136.28 | 0 | 0 |

121 | 20 | 4 | 120111.11 | 0 | 0 |

121 | 10 | 1 | 127192.92 | 0 | 0 |

121 | 10 | 2 | 125280.13 | 0 | 0 |

121 | 10 | 3 | 128203.56 | 0 | 0 |

121 | 10 | 4 | 123632.29 | 0 | 1 |

The first procedure just takes care of the details that we ignored earlier. Because we want SAS to process the `sales` data set BY `Store`, we first need to sort the data. The SORT procedure tells SAS to sort the `sales` data set by the value of `Store` and to call the resulting sorted data set `srtdsales`.

Now for the DATA step. The SET and BY statements tell SAS to process the data by grouping observations with a similar `Store` value together. How does SAS accomplish this? Whenever you use a BY statement with a SET statement, SAS automatically creates two temporary variables for each variable name that appears in the BY statement. One of the temporary variables is called * FIRST.variable*, where

*variable*is the variable name appearing in the BY statement. The other temporary variable is called, what else but,

*. The two variables always equal either 1 or 0:*

**LAST.variable***FIRST.variable*= 1 when an observation is the first observation in a BY group*FIRST.variable*= 0 when an observation is not the first observation in a BY group*LAST.variable*= 1 when an observation is the last observation in a BY group*LAST.variable*= 0 when an observation is not the last observation in a BY group

SAS uses the values of the *FIRST.variable* and *LAST.variable* temporary variables to identify the first and last observations in a group, and therefore the group itself. Oh, a comment about that adjective temporary ... SAS places *FIRST.variable* and *LAST.variable* in the program data vector and they are therefore available for DATA step programming, but SAS **does not add them to the SAS data set being created**. It is in that sense that they are temporary.

Because SAS does not write *FIRST.variables* and *LAST.variables* to output data sets, we have to do some finagling to see their contents. The two assignment statements:

```
firstStore = first.Store;
lastStore = last.Store;
```

simply tell SAS to assign the values of the temporary variables, *FIRST.Store* and *LAST.Store*, to permanent variables, `firstStore` and `lastStore`, respectively. The PRINT procedure tells SAS to print the resulting data set so that we can take an inside peek at the values of the *FIRST.variables* and *LAST.variables*.

If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run * * the program, and review the output.

You should see that `firstStore` (and hence the internal *FIRST.Store* variable) equals 1 for the first observation of `Store 101`, for the first observation of `Store 109`, and for the first observation of `Store 121`. And, `firstStore` equals 0 for all of the other observations in the data set. Likewise, `lastStore` (and hence the internal *LAST.Store* variable) equals 1 for the last observation of `Store 101`, for the last observation of `Store 109`, and for the last observation of `Store 121`. And, `lastStore` equals 0 for all of the other observations in the data set.

Now that we know the mechanism by which SAS can identify the first and last observations in a group, let's take a look at an example in which we use the information to create a new variable, as well as change the structure of a data set.

## Example 13.12

The following program uses the SET and BY statements to tell SAS to identify the first and last observations for each `Store` that appears in the `sales` data set, and to subsequently use that information to determine and display the total sales (`StoreTotal`) — that is, across all of the departments and quarters — for each `Store`:

```
LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
by Store;
RUN;
DATA storesales;
set srtdsales;
by Store;
if first.Store then StoreTotal = 0;
StoreTotal + Sales;
if last.Store;
drop Dept Quarter Sales;
format StoreTotal Dollar13.2;
RUN;
PROC PRINT data= storesales;
title 'Sales by Store';
id Store;
sum StoreTotal;
RUN;
```

Store | StoreTotal |
---|---|

101 | $853,795.38 |

109 | $1,335,471.64 |

121 | $988,468.08 |

$3,177,735.10 |

We better start by having you run the program just to make sure you see first what the program does ... If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run * * the program, and review the output. In short, the program takes the

`sales`data set displayed earlier and collapses it to create a new data set called

`storesales`that contains just three observations — one for each

`Store`— and two variables — the store (

`Store`) and the total sales (

`StoreTotal`).

Now, let's work our way through the code. Again, the SORT procedure tells SAS to sort the `sales` data set by the value of `Store` and to call the resulting sorted data set `srtdsales`. The SET and BY statements tell SAS to create the temporary variables *FIRST.Store* and *LAST.Store*. The next two statements in the DATA step:

```
if first.Store then StoreTotal = 0;
StoreTotal + Sales;
```

can almost be read as if they were in English. Set the store's total sales (`StoreTotal`) to 0 when you encounter the first observation for the store. Add the `Sales` for that observation to the `StoreTotal`. For every subsequent observation of the `Store`, add the `Sales` amount for the `Store` to the `StoreTotal`. When you encounter the last observation for the `Store`:

```
if last.Store;
drop Dept Quarter Sales;
format StoreTotal Dollar13.2;
```

include the observation in the `storesales` data set, and while so doing exclude the `Dept`, `Quarter`, and `Sales` variables and format the `StoreTotal` variable so its values are displayed as dollar amounts.

It is the subsetting IF statement:

`if last.Store;`

that deserves a bit more of a mention. First, it is equivalent to saying:

`if last.Store = 1;`

Recall that a subsetting IF statement tells SAS which variables to *include* in the output data set. Here, the subsetting IF statement tells SAS to include only those observations that correspond to the last observation for each `Store`. It is therefore the statement that takes the `sales` data set containing 28 observations and collapses it into the `storesales` data set containing just 3 observations. After all, the *LAST.Store* variable = 1 only for three observations — the three observations that correspond to the last observation for each `Store`.

## Finding the First and Last Observations in Subgroups

In the previous examples, we specified just one variable name, *Store*, in the BY statement. Doing so allowed us to identify the first and last observations for each *Store* group. When you specify multiple variable names in a BY statement, you can identify the first and last observations in, not just one group, but multiple subgroups.

## Example 13.13

The following program tells SAS to process the *sales* data set by *Store *and* Dept*, so we can get a behind-the-scenes look at how we can find the first and last observations of two subgroups:

```
LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
by Store Dept;
RUN;
DATA storesales;
set srtdsales;
by Store Dept;
firstStore = first.Store;
lastStore = last.Store;
firstDept = first.Dept;
lastDept = last.Dept;
RUN;
PROC PRINT data = storesales;
title 'Behind the scene view of the storesales data set';
id Store;
RUN;
```

Store | Dept | Quarter | Sales | firstStore | lastStore | firstDept | lastDept |
---|---|---|---|---|---|---|---|

101 | 10 | 1 | 110001.50 | 1 | 0 | 1 | 0 |

101 | 10 | 2 | 113101.20 | 0 | 0 | 0 | 0 |

101 | 10 | 3 | 111932.15 | 0 | 0 | 0 | 0 |

101 | 10 | 4 | 99901.10 | 0 | 0 | 0 | 1 |

101 | 20 | 1 | 110002.36 | 0 | 0 | 1 | 0 |

101 | 20 | 2 | 99922.39 | 0 | 0 | 0 | 0 |

101 | 20 | 3 | 98832.98 | 0 | 0 | 0 | 0 |

101 | 20 | 4 | 110101.70 | 0 | 1 | 0 | 1 |

109 | 10 | 1 | 120422.77 | 1 | 0 | 1 | 0 |

109 | 10 | 2 | 123984.32 | 0 | 0 | 0 | 0 |

109 | 10 | 3 | 121801.29 | 0 | 0 | 0 | 0 |

109 | 10 | 4 | 122125.66 | 0 | 0 | 0 | 1 |

109 | 20 | 1 | 115239.09 | 0 | 0 | 1 | 0 |

109 | 20 | 2 | 113001.98 | 0 | 0 | 0 | 0 |

109 | 20 | 3 | 114234.32 | 0 | 0 | 0 | 0 |

109 | 20 | 4 | 114122.65 | 0 | 0 | 0 | 1 |

109 | 30 | 1 | 98310.13 | 0 | 0 | 1 | 0 |

109 | 30 | 2 | 97331.25 | 0 | 0 | 0 | 0 |

109 | 30 | 3 | 96386.28 | 0 | 0 | 0 | 0 |

109 | 30 | 4 | 98511.90 | 0 | 1 | 0 | 1 |

121 | 10 | 1 | 127192.92 | 1 | 0 | 1 | 0 |

121 | 10 | 2 | 125280.13 | 0 | 0 | 0 | 0 |

121 | 10 | 3 | 128203.56 | 0 | 0 | 0 | 0 |

121 | 10 | 4 | 123632.29 | 0 | 0 | 0 | 1 |

121 | 20 | 1 | 121947.10 | 0 | 0 | 1 | 0 |

121 | 20 | 2 | 119964.69 | 0 | 0 | 0 | 0 |

121 | 20 | 3 | 122136.28 | 0 | 0 | 0 | 0 |

121 | 20 | 4 | 120111.11 | 0 | 1 | 0 | 1 |

Again, because we want SAS to process the *sales* data set BY *Store* and *Dept*, we first need to sort the data. The SORT procedure tells SAS to sort the *sales* data set by the value of *Store* and *Dept* and to call the resulting sorted data set *srtdsales*. The SET and BY statements tell SAS to process the *srtdsales* data set by *Store* and *Dept*. Because two variables appear in the BY statement, SAS creates two temporary variables for each one:

*FIRST.Store*= 1 when SAS encounters a*Store*'s first observation, and 0 otherwise*LAST.Store*= 1,- when SAS encounters a
*Store*'s last observation, and 0 otherwise *FIRST.Dept*= 1, when SAS encounters a*Dept*'s first observation, and 0 otherwise*LAST.Dept*= 1, when SAS encounters a*Dept*'s last observation, and 0 otherwise

Because SAS does not write *FIRST.variables* and *LAST.variables* to output data sets, we again do some finagling to see their contents. The four assignment statements:

```
firstStore = first.Store;
lastStore = last.Store;
firstDept = first.Dept;
lastDept = last.Dept;
```

tell SAS to assign the values of the temporary variables to permanent variables so we can view their contents. The PRINT procedure tells SAS to print the resulting data set so that we can take an inside peek at the values of the *FIRST.variables* and *LAST.variables*.

If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run * * the program, and review the output to convince yourself that SAS uses the two

*FIRST.variables*and the two

*LAST.variables*to identify the first and last observations of each

*Store*and of each

*Dept*within each

*Store*.

## Example 13.14

The following program uses the SET and BY statements to tell SAS to identify the first and last observations for each *Store* and for each *Dept* within each *Store*, and to subsequently use that information to determine and display the total sales (*DeptTotal*) for each *Dept* within each *Store*:

```
LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
by Store Dept;
RUN;
DATA storesales;
set srtdsales;
by Store Dept;
if first.Dept then DeptTotal = 0;
DeptTotal + Sales;
if last.Dept;
drop Quarter Sales;
format DeptTotal Dollar13.2;
RUN;
PROC PRINT data = storesales NOOBS;
title 'Sales by Store and Department';
sum DeptTotal;
RUN;
```

Store | Dept | DeptTotal |
---|---|---|

101 | 10 | $434,935.95 |

101 | 20 | $418,859.43 |

109 | 10 | $488,334.04 |

109 | 20 | $456,598.04 |

109 | 30 | $390,539.56 |

121 | 10 | $504,308.90 |

121 | 20 | $484,159.18 |

$3,177,735.10 |

In reviewing the program, you should see that the program is just a slightly modified version of that of Example 1.12. Let's start again by having you launch and run * * the SAS program, and reviewing the output just to make sure you see first what the program does. (Before running, don't forget to edit the LIBNAME statement to reflect the location of your stored

*sales*data set.) In short, the program takes the

*sales*data set displayed earlier and collapses it to create a new data set called

*storesales*that contains seven observations — one for each

*Dept*within each

*Store*— and three variables — the store (

*Store*), the department (

*Dept*) and the total sales (

*DeptTotal*).

Again, the SORT procedure tells SAS to sort the *sales* data set by the value of *Store* and *Dept* and to call the resulting sorted data set *srtdsales*. The SET and BY statements tell SAS to create the temporary variables *FIRST.Store*,* LAST.Store*,* FIRST.Dept*, and* LAST.Dept*. The next two statements in the DATA step:

```
if first.Dept then DeptTotal = 0;
DeptTotal + Sales;
```

can again be read as if they were in English. Set the department's total sales (*DeptTotal*) to 0 when you encounter the first observation for the department. Add the *Sales* for that observation to the *DeptTotal*. For every subsequent observation for the *Dept*, add the *Sales* amount for the *Dept* to the *DeptTotal*. When you encounter the last observation for the *Dept *(within each *Store*):

```
if last.Dept;
drop Quarter Sales;
format DeptTotal Dollar13.2;
```

include the observation in the *storesales* data set, and while so doing exclude the *Quarter* and *Sales* variables, and format the *DeptTotal* variable so its values are displayed as dollar amounts.

It is again the subsetting IF statement:

`if last.Dept;`

that takes the *sales* data set containing 28 observations and collapses it into the *storesales* data set containing just 7 observations. After all, the *LAST.Dept* variable = 1 for just seven observations — the seven observations that correspond to the last observation for each *Dept *within each* Store*.