16.2 - How SAS Match-Merges

The match-merge examples we looked at in the previous section started first with the simplest case and then increased in complexity. The truth is that match-merging can be even more complex than we've seen. For example, we haven't yet investigated the situation in which we merge by more than one variable at a time. It is because of this potential complexity that you will be best served by understanding how the DATA step performs match-merges. Only then will you be able to predict the results of, and therefore perform, match-merges correctly.

In this section, we'll revisit two examples from the previous section first Example 16.4 and then Example 16.5. Our focus this time won't be on how to perform the match-merge, but rather to see how SAS performs the match-merge.

Example 16.6 Section

The following program is identical to the first program that appears in Example 16.4:

DATA moredemog;
    input subj gender $ age v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 M 42 03/10/96
    1001 M 20 02/19/96
    1002 F 53 02/01/96
    1003 F 40 12/31/95
    1004 M 29 01/10/97
    ;
RUN;

DATA morestatus;
    input subj disease $ test $ v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 Y Y 03/17/96
    1001 N Y 03/01/96
    1002 N N 02/18/96
    1003 Y Y 01/15/96
    1004 N N 02/01/97
    ;
RUN;

DATA morepatients;
    merge moredemog morestatus;
    by subj;
RUN;
      
PROC PRINT data=morepatients NOOBS;
        title 'The morepatients data set';
RUN;

You might want to launch and re-run the program, so that you can review the output again to reacquaint yourself with the issues of this particular match-merge. Specifically, recall that, in addition to the linking variable subj, the moredemog and morestatus data sets share a variable called v_date. In this case, the variable v_date gets its values from the morestatus data set, since it is the last data set to appear in the MERGE statement.

Our focus this time is going to be to work our way through the program to see how SAS match-merges the moredemog and morestatus data sets to get the morepatients data set. As always, at the end of the compile phase, SAS will have created a program data vector containing the variables from the moredemog and morestatus data sets in the order in which they appear in the DATA step:

_N_ _ERROR_ subj gender age v_date disease test
1 0 .     .    

As always, SAS starts the execution phase having initialized the values of all of the data set variables in the program data vector to missing. Then, SAS looks at the first "BY group" in each data set to determine which BY group should appear first. In this case, the first BY group, that is, observations for which subj = 1000, is the same in both data sets. So, SAS reads the first observation from the moredemog data set:

_N_ _ERROR_ subj gender age v_date disease test
1 0 1000 M 42 03/10/96    

and then SAS reads the first observation from the morestatus data set:

_N_ _ERROR_ subj gender age v_date disease test
1 0 1000 M 42 03/17/96 Y Y

As you can see, the value of v_date from the moredemog data set (03/10/96) gets written over by the value of v_date from the morestatus data set (03/17/96). Having no more data sets from which to read, SAS writes the first observation to the morepatients data set, and retains the values in the program data vector. (If the program data vector contained variables created in the DATA step, SAS would set them to missing after writing their values to the new data set.)

SAS then looks in each data set for a second observation in the subj = 1000 BY group. Neither data set has one. Therefore, SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

_N_ _ERROR_ subj gender age v_date disease test
2 0 .   . .    

SAS reads the second observation from the moredemog data set:

_N_ _ERROR_ subj gender age v_date disease test
2 0 1001 M 20 02/19/96    

and then SAS reads the second observation from the morestatus data set:

_N_ _ERROR_ subj gender age v_date disease test
2 0 1001 M 20 03/01/96 N Y

Again, the value of v_date from the moredemog data set (02/19/96) gets written over by the value of v_date from the morestatus data set (03/01/96). Having no more data sets from which to read, SAS writes the second observation to the morepatients data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the subj = 1001 BY group. Because neither data set has one, SAS sets all of the values in the program data vector to missing and begins processing the next BY group. Shall we stop this seemingly endless cycle? You should have the basic idea now... SAS continues processing observations as just described until it exhausts all of the observations in both data sets.

Example 16.7 Section

The following program is identical to the first program that appears in Example 16.5:

    DATA salesone;
    input year prd sales;
    DATALINES;
    2004 1 100
    2004 2 200
    2005 3 300
    2006 4 400
    2007 5 500
    2008 6 600
;
RUN;
    
    DATA salestwo;
    input year loc sales;
    DATALINES;
    2004 7  700
    2004 8  800
    2004 9  900
    2006 10 950
    2007 11 960
    2008 12 970
;
RUN;
    
DATA allsales;
        merge salesone salestwo;
        by year;
RUN;
    
PROC PRINT data=allsales NOOBS;
    title 'The allsales data set';
RUN;

Again, you might want to launch and re-run the program, so that you can review the output again to reacquaint yourself with the issues of this particular match-merge. Specifically, recall that the observations in the first data set to be merged matches one or more observations in the second data set to be merged. For example, the salesone data set contains two observations for the 2004 year, while the salestwo data set contains three observations for the 2004 year.

Let's work our way through this match-merge. At the end of the compile phase, SAS will have created a program data vector containing the variables from the salesone and salestwo data sets in the order in which they appear in the DATA step:

_N_ _ERROR_ year prd sales loc
1 0 .     .

Again, SAS starts the execution phase having initialized the values of all of the data set variables in the program data vector to missing. Then, SAS looks at the first "BY group" in each data set to determine which BY group should appear first. In this case, the first BY group, that is, observations for which year = 2004, is the same in both data sets. So, SAS reads the first observation from the salesone data set:

_N_ _ERROR_ year prd sales loc
1 0 2004 1 100 .

and then SAS reads the first observation from the salestwo data set:

_N_ _ERROR_ year prd sales loc
1 0 2004 1 700 7

You now shouldn't be surprised to see that the value of sales from the salesone data set (100) gets written over by the value of sales from the salestwo data set (700). Having no more data sets from which to read, SAS writes the first observation to the allsales data set, and retains the values in the program data vector. (Again, if the program data vector contained variables created in the DATA step, SAS would set them to missing after writing their values to the new data set.)

SAS then looks in each data set for a second observation in the year = 2004 BY group. Both data sets have one. Therefore, SAS reads the second observation from the salesone data set:

_N_ _ERROR_ year prd sales loc
2 0 2004 2 200 7

The value 7 stored in the loc variable is a remnant from the previous iteration of the DATA step. It doesn't get written over until SAS reads the second observation from the salestwo data set:

_N_ _ERROR_ year prd sales loc
2 0 2004 2 800 8

Again, the value of sales from the salesone data set (200) gets written over by the value of sales from the salestwo data set (800). Having no more data sets from which to read, SAS writes the second observation to the allsales data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the subj = 2004 BY group. The salesone data set has no more, but the salestwo data set has one more. So, SAS retains the value of the prd variable from the previous iteration and reads the next observation from the salestwo data set:

_N_ _ERROR_ year prd sales loc
3 0 2004 2 900 9

Again, the value of sales that was retained from the last iteration (800) gets written over by the value of sales from the salestwo data set (900). Having no more data sets from which to read, SAS writes the third observation to the allsales data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the year = 2004 BY group. Neither data set has one. Therefore, SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

_N_ _ERROR_ year prd sales loc
4 0 . . . .

When is this circus going to end?! Soon... bear with us! Okay... so SAS determines that the year = 2005 is the next BY group and that salesone has an observation for that BY group, while salestwo does not. SAS reads the available observation from the salesone data set:

_N_ _ERROR_ year prd sales loc
4 0 2005 3 300 .

and having no more observations to read, SAS writes the contents of the program data vector to the fourth observation in the allsales data set.

SAS then looks in each data set for another observation in the year = 2005 BY group. Neither data set has one. Therefore, SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

_N_ _ERROR_ subj gender age v_date
5 0 .   . .

Okay, let's cry uncle! At this point, it really is a straightforward trek to the finish line provided you follow the general guidelines, which are probably worth summarizing once and for all:

  • At the beginning of the execution phase, set the values in the program data vector to missing.
  • Determine the first BY group, and read all of the observations from all of the data sets having the BY group.
  • Write the contents of the program data vector to the output data set.
  • If there are more observations in the BY group, retain the values in the program data vector.
  • If there are no more observations in the BY group, set all of the values in the program data vector to missing and begin processing the next BY group.
  • Continue processing observations as described until all of the observations in all of the input data sets are exhausted.

Using these guidelines, you should be able to predict the contents of just about any match-merge you encounter.