# 16.1 - Match-Merging SAS Data Sets

16.1 - Match-Merging SAS Data SetsMatch-merging is one of the most powerful methods of combining two or more SAS data sets. A match-merge combines observations across data sets based on the values of one or more common variables. For example, when data set *bird*:

Year | X |
---|---|

2000 | 1 |

2001 | 2 |

2002 | 3 |

2003 | 4 |

2004 | 5 |

and data set *bee*:

Year | X |
---|---|

2000 | 1 |

2001 | 2 |

2002 | 3 |

2003 | 4 |

2004 | 5 |

are match-merged by the variable *year*, we get data set *baby*:

Year | X | Y |
---|---|---|

2000 | 1 | 1 |

2000 | 1 | 2 |

2001 | 2 | . |

2002 | 3 | 3 |

2003 | 4 | 4 |

2004 | 5 | 5 |

As you can see, the observations in the *baby* data set are created by joining observations from the *bird* and *bee* data sets that share the same *year*. To match-merge, you simply specify the data sets you would like to merge in a MERGE statement, and indicate the variables on which you would like to merge in a BY statement. One thing to keep in mind, though, is **you can't match-merge SAS data sets unless they are sorted by the variables appearing in the BY statement**.

Throughout the rest of this section, we will look at examples of match-merges under various situations. In increasing order of complexity, the situations are:

- a match-merge in which the data sets to be merged contain the same number of observations, and each observation in the first data set matches with exactly one observation in the second data set
- a match-merge in which the data sets to be merged are sorted by the BY variable in descending order
- a match-merge in which the data sets to be merged contain a different number of observations, and each observation in the first data set matches with no more than one observation in the second data set
- a match-merge in which the data sets to be merged share, aside from the variable that links the data sets together, at least one common variable name
- a match-merge in which the observations in the first data set to be merged matches one or more observations in the second data set to be merged

Let's jump right in!

## Example 16.1

The following program illustrates the simplest case of a match-merge, in which the data sets to be merged contain the same number of observations, and each observation in the first data set matches with exactly one observation in the second data set. Specifically, the *demog* and *status* data sets each contains five observations — one for each `subj` 1000, 1001, 1002, 1003, and 1004:

```
DATA demog;
input subj gender $ age;
cards;
1000 M 42
1001 M 20
1002 F 53
1003 F 40
1004 M 29
;
RUN;
DATA status;
input subj disease $ test $ ;
cards;
1000 Y Y
1001 N Y
1002 N N
1003 Y Y
1004 N N
;
RUN;
DATA patients;
merge demog status;
by subj;
RUN;
PROC PRINT data=patients NOOBS;
title 'The patients data set';
RUN;
```

Of course, the first two DATA steps just read in the respective *demog* and *status* data sets. Note that the two data sets are "**linked**" by the subject's ID number (`subj`). The meat of the match-merge takes place in the third (and last) DATA step, in which we see a MERGE statement and a BY statement. The DATA step tells SAS that we want to match-merge the data sets appearing in the MERGE statement (*demog* and *status*), by the variable appearing in the BY statement (`subj`), and to store the result in a new data set called *patients*. This is a perfectly valid request because both of the data sets (*demog* and *status*) are already sorted — in ascending order — by the `subj` variable.

Launch and run * * the SAS program. Review the output from the PRINT procedure to convince yourself that the data set *patients* contains five observations — one observation for each of the five subjects appearing in both the *demog* and *status* data sets.

## Example 16.2

The following program is another example of a simple match-merge, except this time the data sets to be merged — *descdemog* and *descstatus* — are sorted by the BY variable — `subj` — in descending order:

` ````
PROC SORT data = demog out = descdemog;
by descending subj;
RUN;
PROC SORT data = status out = descstatus;
by descending subj;
RUN;
DATA descpatients;
merge descdemog descstatus;
by descending subj;
RUN;
PROC PRINT data = descpatients NOOBS;
title 'The descpatients data set';
RUN;
```

The two SORT procedures tell SAS to sort the *demog* and *status* data sets by the variable `subj`, and to store the results, respectively, in data sets called *descdemog* and *descstatus*. The keyword DESCENDING that precedes the `subj` variable in each SORT procedure tells SAS to sort the `subj` values in descending order — that is, with the largest `subj` value appearing first, and the smallest *subj* value appearing last.

The match-merge again takes place in the third DATA step, in which we see a MERGE statement and a BY statement. The DATA step tells SAS that we want to match-merge the data sets appearing in the MERGE statement (*descdemog* and *descstatus*), by the variable appearing in the BY statement (`subj`), and to store the result in a new data set called *descpatients*. Note that, to accomplish the match-merge, we have to place the DESCENDING keyword in the BY statement before the `subj` variable, so that SAS knows that the input data sets (*descdemog* and *descstatus*) are sorted in descending order.

Launch and run * * the SAS program. Review the output from the PRINT procedure to convince yourself that the data set *descpatients* contains five observations created from the observations in *descdemog* and *descstatus*, and arranged in descending order of the *subj* variable.

## Example 16.3

The following program illustrates a match-merge in which the data sets to be merged contain a different number of observations, and each observation in the first data set matches with no more than one observation in the second data set. Specifically, the *newdemog* data set contains six observations — one for each `subj` 1000, 1001, 1002, 1003, 1004, and 1005 — while the *status* data set contains just five observations — one for each `subj` 1000, 1001, 1002, 1003, and 1004:

` ````
DATA newdemog;
input subj gender $ age;
cards;
1000 M 42
1001 M 20
1002 F 53
1003 F 40
1004 M 29
1005 F 29
;
RUN;
DATA status;
input subj disease $ test $ ;
cards;
1000 Y Y
1001 N Y
1002 N N
1003 Y Y
1004 N N
;
RUN;
DATA newpatients;
merge newdemog status;
by subj;
RUN;
PROC PRINT data=newpatients NOOBS;
title 'The newpatients data set';
RUN;
```

Again, the first two DATA steps just tell SAS to read in the *newdemog* and *status* data sets. The third DATA step tells SAS to match-merge the *newdemog* and *status* data sets by the `subj` variable, and to store the result in a new data set called *newpatients*. Although the *newdemog* and *status* data sets contain different numbers of observations, SAS continues to merge the two data sets until it runs out of observations. Where necessary, SAS fills variables with missing values.

Launch and run * * the SAS program. Review the output from the PRINT procedure to convince yourself that the merged data set *newpatients* contains six subjects. SAS fills the variables in the *newpatients* data set with values based on the variables' values in the data set from which they came. Because subject 1005 does not appear in the* status *data set, SAS assigns a missing value to the variables *disease* and *test* for subject 1005.

## Example 16.4

In the examples we've looked at so far, the variables in the data sets to be merged were unique. That is, the data sets did not share any common variable names other than the variable that linked the data sets together. The following program illustrates how SAS merges two data sets when there are common variables across the data sets besides the linking variable(s). Specifically, in addition to the linking variable *subj*, the *moredemog* and *morestatus* data sets share a variable called *v_date* (for visit date):

```
DATA moredemog;
input subj gender $ age v_date mmddyy8.;
format v_date mmddyy8.;
DATALINES;
1000 M 42 03/10/96
1001 M 20 02/19/96
1002 F 53 02/01/96
1003 F 40 12/31/95
1004 M 29 01/10/97
;
RUN;
DATA morestatus;
input subj disease $ test $ v_date mmddyy8.;
format v_date mmddyy8.;
DATALINES;
1000 Y Y 03/17/96
1001 N Y 03/01/96
1002 N N 02/18/96
1003 Y Y 01/15/96
1004 N N 02/01/97
;
RUN;
DATA morepatients;
merge moredemog morestatus;
by subj;
RUN;
PROC PRINT data=morepatients NOOBS;
title 'The morepatients data set';
RUN;
```

Again, the first two DATA steps just tell SAS to read in the *moredemog* and *morestatus* data sets. As you can see, the *moredemog *and *morestatus* data sets have in common not only the linking variable, *subj*, but also the *v_date* variable. So, when the two data sets are merged into a data set called *morepatients*, as happens in the third DATA step, the variable *v_date* gets its values from the *morestatus* data set. That's because *morestatus* appears last in the MERGE statement. So, the values of *v_date* from* moredemog *get over-written in the program data vector by the values of *v_date* in *morestatus*. Launch and run * * the SAS program. Review the output from the PRINT procedure to convince yourself that the values for *v_date* in the merged data set *morepatients* are the same as the values for *v_date* in the *morestatus* data set.

So, in general, if data sets share common variable names, the variable in the merged data set takes its value from the data set appearing last in the MERGE statement. To reinforce the point, the following data step merges data sets *morestatus* and *moredemog* again, but this time with *moredemog *appearing after *morestatus* in the MERGE statement:

```
DATA morepatients2;
merge morestatus moredemog;
by subj;
RUN;
PROC PRINT data=morepatients2 NOOBS;
title 'The morepatients2 data set';
RUN;
```

Here, the variable *v_date* in the *morepatients2* data set gets its value from the *moredemog* data set. That's because *moredemog* appears last in the MERGE statement. Therefore, the values of *v_date* from* morestatus *get over-written in the program data vector by the values of *v_date* in *moredemog*. Launch and run * * the SAS program. Review the output from the PRINT procedure to convince yourself that this time the values for *v_date* in the merged data set *morepatients2 * are the same as the values for *v_date* in the *moredemog* data set.

## Example 16.5

The following program illustrates a match-merge in which the observations in the first data set to be merged matches one or more observations in the second data set to be merged. Case in point, the *salesone* data set contains two observations for the 2004 year, while the *salestwo* data set contains three observations for the 2004 year:

```
DATA salesone;
input year prd sales;
DATALINES;
2004 1 100
2004 2 200
2005 3 300
2006 4 400
2007 5 500
2008 6 600
;
RUN;
DATA salestwo;
input year loc sales;
DATALINES;
2004 7 700
2004 8 800
2004 9 900
2006 10 950
2007 11 960
2008 12 970
;
RUN;
DATA allsales;
merge salesone salestwo;
by year;
RUN;
PROC PRINT data=allsales NOOBS;
title 'The allsales data set';
RUN;
```

Again, the first two DATA steps just tell SAS to read in the *salesone* and *salestwo* data sets. Note that the *salesone *data set* *contains two 2004 years, while the *salestwo* data set contains three 2004 years. And, although the *salesone* data set contains a 2005 year, the *salestwo *data set contains no 2005 year. The third DATA step tells SAS to match-merge the *salesone* and *salestwo* data sets by the *year* variable, and to store the result in a new data set called *allsales*. When you launch and run * * this program, you should get the following output:

##### the allsales data set

year | prd | sales | loc |
---|---|---|---|

2004 | 1 | 700 | 7 |

2004 | 2 | 800 | 8 |

2004 | 2 | 900 | 9 |

2005 | 3 | 300 | . |

2006 | 4 | 950 | 10 |

2007 | 5 | 960 | 11 |

2008 | 6 | 970 | 12 |

Read what follows slowly and carefully, as it gets messy! As you can see, the last three observations for the 2006, 2007, and 2008 years are obtained simply:

- by merging the 2006 observation in
*salesone*with the 2006 observation in*salestwo*, - by merging the 2007 observation in
*salesone*with the 2007 observation in*salestwo*, - and by merging the 2008 observation in
*salesone*with the 2008 observation in*salestwo*.

Note that because the *sales* variable is common to both the *salesone* and* salestwo *data sets, its values in the merged *allsales* data set comes from *salestwo*,* *the data set appearing last in the MERGE statement.

Let's now focus on the more difficult years of 2004 and 2005. Recall that the *salesone* data set contains two 2004 years and one 2005 year, while the* salestwo* data set contains three 2004 years and no 2005 years. The first two observations are again obtained simply enough:

- by merging the first 2004 observation in the
*salesone*data set with the first 2004 observation in the*salestwo*data set, - and by merging the second 2004 observation in the
*salesone*data set with the second 2004 observation in the*salestwo*data set*.*

Again, because the *sales* variable is common to both the *salesone* and* salestwo *data sets, its values in the merged *allsales* data set comes from *salestwo*,* *the data set appearing last in the MERGE statement.

Now, for the messiest observation of all! SAS creates the third observation for 2004 by first getting what it can from the *salesone* data set — it gets *year* 2004, *prd* 2 (the last known value of it), and *sales* 200 (the last known value of it). Then, it gets what it can from the *salestwo* data set — it gets *year* 2004, *loc *9, and *sales* 900 (and thereby writing over the value of 200 it obtained from the *salesone* data set).

Similarly, SAS creates the fourth observation for 2005 by first getting what it can from the *salesone* data set* * — it gets *year* 2005, *prd* 3, and *sales* 300. Then, SAS gets what it can from the *salestwo * data set —which is nothing. Hence, *loc* is assigned a missing value for 2005.

So, do you get it? What do you think the data set would look like if you merged in the reverse order? The following DATA step tells SAS to merge *salestwo* and *salesone* by *year* and to store the result in a new data set called *allsales2*:

` ````
DATA allsales2;
merge salestwo salesone;
by year;
RUN;
PROC PRINT data=allsales2 NOOBS;
title 'The allsales2 data set';
RUN;
```

Launch and run * * this program. Then, see if you can justify each of the values obtained in the resulting output:

##### the allsales2 data set

year | loc | sales | prd |
---|---|---|---|

2004 | 7 | 100 | 1 |

2004 | 8 | 200 | 2 |

2004 | 9 | 900 | 2 |

2005 | . | 300 | 3 |

2006 | 10 | 400 | 4 |

2007 | 11 | 500 | 5 |

2008 | 12 | 600 | 6 |

**One closing comment**. Note that although each of our match-merging examples involved combining just two data sets, you can specify any number of data sets in a MERGE statement.