16.4 - Excluding Unmatched Observations
16.4 - Excluding Unmatched ObservationsBy default, when match-merging, the DATA step combines all of the observations in all of the input data sets. There may be situations, however, in which we want SAS to select only those observations for which a match exists in all of the input data sets. This is a rather common thing to do when, for instance, one of the data sets contains a master list of patients, say, and the other data sets contain information on actions taken by the patients. When we merge the data sets, we'd only want to include those patients who appear both in the master list data set and in the action data set. For example, suppose we have a data set containing a master list of patients:
id | v_date |
---|---|
110011 | 01/01/06 |
110012 | 01/02/06 |
110013 | 01/04/06 |
and a data set called allvoids containing information about some patients' urinary voiding habits (void_no and volume) for a given visit date (v_date):
id | v_date | void_no | volume |
---|---|---|---|
110011 | 01/01/06 | 1 | 250 |
110011 | 01/01/06 | 2 | 300 |
110011 | 01/01/06 | 3 | 302 |
110011 | 01/01/06 | 4 | 231 |
110012 | 01/02/06 | 1 | 305 |
110012 | 01/02/06 | 2 | 225 |
110012 | 01/02/06 | 3 | 400 |
110013 | 01/04/06 | 1 | 300 |
110013 | 01/04/06 | 2 | 333 |
110013 | 01/04/06 | 3 | 401 |
110013 | 01/04/06 | 4 | 404 |
110014 | 01/06/06 | 1 | 398 |
110014 | 01/06/06 | 2 | 413 |
Note that the patients data set contains three id numbers (110011, 110012, and 110013), whereas the allvoids data set contains information on four id numbers (110011, 110012, 110013, and 110014). Because the patients data set is viewed as the master list of patients to include in any statistical analysis, we would want to merge the patients and allvoids data set in such a way to ensure that only those patients whose id appears in both input data sets are included in the output analysis data set:
id | v_date | void_no | volume |
---|---|---|---|
110011 | 01/01/06 | 1 | 250 |
110011 | 01/01/06 | 2 | 300 |
110011 | 01/01/06 | 3 | 302 |
110011 | 01/01/06 | 4 | 231 |
110012 | 01/02/06 | 1 | 305 |
110012 | 01/02/06 | 2 | 225 |
110012 | 01/02/06 | 3 | 400 |
110013 | 01/04/06 | 1 | 300 |
110013 | 01/04/06 | 2 | 333 |
110013 | 01/04/06 | 3 | 401 |
110013 | 01/04/06 | 4 | 404 |
Fortunately, telling SAS to exclude unmatched observations is as simple as using the IN= data set option along with a subsetting IF statement. Specifically:
- We use the IN= DATA step option to create a variable that indicates whether the data set contributed to the current observation.
- Then, we use a subsetting IF statement to check the IN= values and to write to the output (merged) data set only those observations that appear in the data sets for which IN= specified.
Let's try this approach out on our patients and allvoids data sets!
Example 16.9
The following program simply reads in the patients and allvoids data sets:
DATA patients;
input id v_date : mmddyy8.;
format v_date mmddyy8.;
DATALINES;
110011 01/01/06
110012 01/02/06
110013 01/04/06;
RUN;
DATA allvoids;
input id v_date : mmddyy8. void_no volume;
format v_date mmddyy8.;
DATALINES;
110011 01/01/06 1 250
110011 01/01/06 2 300
110011 01/01/06 3 302
110011 01/01/06 4 231
110012 01/02/06 1 305
110012 01/02/06 2 225
110012 01/02/06 3 400
110013 01/04/06 1 300
110013 01/04/06 2 333
110013 01/04/06 3 401
110013 01/04/06 4 404
110014 01/06/06 1 398
110014 01/06/06 2 413;
RUN;
Launch and run the program, so that we can work with the data sets. Note that the two data sets are linked by the variables id and v_date. The patients data set contains one observation for each of the three patients. And, the allvoids data set contains multiple observations for each of the four patients. Incidentally, in case you are curious, each observation in the allvoids data set corresponds to a patient's reported void of urine. For example, subject 110011 reported four voids on 01/01/06, subject 110012 reported three voids on 01/02/06, and so on.
Now, if we wanted to analyze the voiding data, we'd have to make sure that we didn't include any data from patients not included in the patients data set. That is, we'd want to exclude the voiding data corresponding to subject 110014. If we merge the patients and allvoids data sets by the id and v_date variables, we get an analysis data set that has one observation for each patient's reported void:
DATA analysis;
merge patients allvoids;
by id v_date;
RUN;
PROC PRINT data=analysis NOOBS;
title 'The analysis data set';
RUN;
id | v_date | void_no | volume |
---|---|---|---|
110011 | 01/01/06 | 1 | 250 |
110011 | 01/01/06 | 2 | 300 |
110011 | 01/01/06 | 3 | 302 |
110011 | 01/01/06 | 4 | 231 |
110012 | 01/02/06 | 1 | 305 |
110012 | 01/02/06 | 2 | 225 |
110012 | 01/02/06 | 3 | 400 |
110013 | 01/04/06 | 1 | 300 |
110013 | 01/04/06 | 2 | 333 |
110013 | 01/04/06 | 3 | 401 |
110013 | 01/04/06 | 4 | 404 |
110014 | 01/06/06 | 1 | 398 |
110014 | 01/06/06 | 2 | 413 |
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data sets were merged as described. You might also want to take particular note that this is the first example we've seen in which we have specified two variables in the BY statement. There is nothing that much more difficult about it. Other than making sure that both of the input data sets are sorted by both the id and v_date variables.
In reviewing the output, you should also note that we have not yet achieved what we set out to do, namely to create an analysis data set that contains only the voiding data for the patients appearing in the patients data set. That is, the analysis data set still includes the voiding data on subject 110014. The following code uses the IN= option and a subsetting IF statement to help us accomplish our task:
DATA analysis;
merge patients (in = inpatients)
allvoids (in = inallvoids);
by id v_date;
if inpatients and inallvoids;
RUN;
PROC PRINT data=analysis NOOBS;
title 'The analysis data set';
RUN;
id | v_date | void_no | volume |
---|---|---|---|
110011 | 01/01/06 | 1 | 250 |
110011 | 01/01/06 | 2 | 300 |
110011 | 01/01/06 | 3 | 302 |
110011 | 01/01/06 | 4 | 231 |
110012 | 01/02/06 | 1 | 305 |
110012 | 01/02/06 | 2 | 225 |
110012 | 01/02/06 | 3 | 400 |
110013 | 01/04/06 | 1 | 300 |
110013 | 01/04/06 | 2 | 333 |
110013 | 01/04/06 | 3 | 401 |
110013 | 01/04/06 | 4 | 404 |
The IN = inpatients option tells SAS to assign a value of 1 to the inpatients variable when an observation from the patients data set contributes to the current observation. Likewise, the IN = inallvoids option tells SAS to assign a value of 1 to the inallvoids variable when an observation from the allvoids data set contributes to the current observation. The subsetting IF statement tells SAS to write only those observations to the analysis data set whose value for both inpatients and inallvoids is 1, that is, only those observations that were created from observations in both the patients and allvoids data sets. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the analysis data set now contains only the voiding data for the patients appearing in the patients data set.