There may be (many!) occasions in which you can get away with using fewer variables that
appear in your input SAS data set. The advantage is that, when you can get away with reading in fewer variables, SAS uses a smaller program data vector and therefore runs much more efficiently.
- The DROP= option tells SAS which variables you want to drop from a data set. If you place the DROP= option on the SET statement, SAS drops the specified variables when it reads the input data set. On the other hand, if you place the DROP= option on the DATA statement, SAS drops the specified variables when it writes to the output data set.
- The KEEP= option tells SAS which variables you want to keep in a data set. If you place the KEEP= option on the SET statement, SAS keeps the specified variables when it reads the input data set. On the other hand, if you place the KEEP= option on the DATA statement, SAS keeps the specified variables when it writes to the output data set.
Which option should you use? The choice of whether to use the DROP= option or the KEEP= option in a given situation is personal. Or rather, let's say that most people will choose the option that requires the least amount of typing! I still try though, whenever possible, to use the KEEP= option, as then when I read my programs I know exactly which variables my data sets contain. Using the DROP= option, on the other hand, requires one to think about what variables were in the data set before the dropping took place.
On the SET Statement Section
Example 14.3
The following program tells SAS to keep just three variables — subj, v_date, and b_date — when reading from the back1 data set in order to create the back2 data set:
DATA back2;
set back1 (keep = subj v_date b_date);
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back2;
title 'Output Dataset: BACK2';
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110051 | 01/25/94 | 12/02/42 | 51.2 |
2 | 110052 | 01/27/94 | 01/04/25 | 69.1 |
3 | 110053 | 02/22/94 | 03/15/22 | 72.0 |
4 | 110055 | 03/15/94 | 03/31/41 | 53.0 |
5 | 110057 | 03/15/94 | 07/10/44 | 49.7 |
6 | 110058 | 03/18/94 | 09/09/50 | 43.6 |
7 | 110059 | 03/18/94 | 07/25/34 | 59.7 |
8 | 110060 | 06/14/94 | 05/29/36 | 58.1 |
9 | 110062 | 03/31/94 | 04/21/36 | 58.0 |
10 | 110065 | 04/04/94 | 10/12/52 | 41.5 |
11 | 110066 | 04/12/94 | 08/28/62 | 31.6 |
12 | 110067 | 04/26/94 | 02/22/72 | 22.2 |
13 | 110068 | 06/13/94 | 09/10/55 | 38.8 |
14 | 110069 | 05/31/94 | 08/17/38 | 55.8 |
The SET statement's KEEP= option is used to tell SAS to read three variables — subject (subj), visit date (v_date), and birth date (b_date) — from the temporary data set back1 and to store them in a new temporary data set back2. Note that the KEEP= option must be placed in parentheses and follow the name of the data set from you which you want SAS to select variables. The list of variables that you specify to keep must be separated by at least one blank space.
Then, the subject's age (age) at the time of the visit is calculated by subtracting the difference between the visit date (v_date) and birth date (b_date) and dividing by 365 to get (an approximate) age in years.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the temporary data set back2 contains the three kept variables (subj, v_date, and b_date), as well as the newly calculated variable (age).
Example 14.4
The following program tells SAS to drop nine variables — r_id, race, ..., and income — when reading from the back1 data set in order to create the back3 data set:
DATA back3;
set back1 (drop = r_id race ethnic relig mar_st
ed_level emp_st job_chng income);
RUN;
PROC PRINT data=back3;
title 'Output Dataset: BACK3';
RUN;
Obs | subj | v_type | v_date | b_date | sex | state | country |
---|---|---|---|---|---|---|---|
1 | 110051 | 0 | 01/25/94 | 12/02/42 | 2 | 42 | 1 |
2 | 110052 | 0 | 01/27/94 | 01/04/25 | 2 | 5 | 1 |
3 | 110053 | 0 | 02/22/94 | 03/15/22 | 2 | 5 | 1 |
4 | 110055 | 0 | 03/15/94 | 03/31/41 | 2 | 5 | 1 |
5 | 110057 | 0 | 03/15/94 | 07/10/44 | 2 | 5 | 1 |
6 | 110058 | 0 | 03/18/94 | 09/09/50 | 2 | . | 13 |
7 | 110059 | 0 | 03/18/94 | 07/25/34 | 2 | 13 | 1 |
8 | 110060 | 0 | 06/14/94 | 05/29/36 | 2 | 13 | 1 |
9 | 110062 | 0 | 03/31/94 | 04/21/36 | 2 | 3 | 1 |
10 | 110065 | 0 | 04/04/94 | 10/12/52 | 2 | 5 | 1 |
11 | 110066 | 0 | 04/12/94 | 08/28/62 | 2 | 5 | 1 |
12 | 110067 | 0 | 04/26/94 | 02/22/72 | 2 | 5 | 1 |
13 | 110068 | 0 | 06/13/94 | 09/10/55 | 2 | 25 | 1 |
14 | 110069 | 0 | 05/31/94 | 08/17/38 | 2 | 32 | 1 |
The SET statement's DROP= option tells SAS to drop nine variables — the id number of the reviewer of the data form (r_id), as well as the subject's race (race), ethnicity (ethnic), religion (relig), marital status (mar_st), education level (ed_level), employment status (emp_st), job change status (job_chng), and income (income) — from the temporary data set back1 and to store the remaining variables in a new temporary data set back3. Therefore, the program data vector used to create the SAS data set back3 contains only seven variables — subject id (subj), visit type (v_type), visit date (v_date), birth date (b_date), gender (sex), and the state (state) and country (country) in which the subject resides.
Note that, just like the KEEP= option, the DROP= option must be placed in parentheses and follow the name of the data set from you that you want SAS to exclude variables. The list of variables that you specify to drop must be separated by at least one blank space.
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the temporary data set back3 contains the seven remaining variables that were not dropped from the data set.
Example 14.5: On the DATA Statement Section
Rather than using the KEEP= option on the SET statement, you can use the KEEP= option on the DATA statement. The following program tells SAS to keep four variables — subj, v_date, b_date, and age — when writing to the output data set called back2a:
DATA back2a (keep = subj v_date b_date age);
set back1;
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back2a;
title 'Output Dataset: BACK2A';
RUN;
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110051 | 01/25/94 | 12/02/42 | 51.2 |
2 | 110052 | 01/27/94 | 01/04/25 | 69.1 |
3 | 110053 | 02/22/94 | 03/15/22 | 72.0 |
4 | 110055 | 03/15/94 | 03/31/41 | 53.0 |
5 | 110057 | 03/15/94 | 07/10/44 | 49.7 |
6 | 110058 | 03/18/94 | 09/09/50 | 43.6 |
7 | 110059 | 03/18/94 | 07/25/34 | 59.7 |
8 | 110060 | 06/14/94 | 05/29/36 | 58.1 |
9 | 110062 | 03/31/94 | 04/21/36 | 58.0 |
10 | 110065 | 04/04/94 | 10/12/52 | 41.5 |
11 | 110066 | 04/12/94 | 08/28/62 | 31.6 |
12 | 110067 | 04/26/94 | 02/22/72 | 22.2 |
13 | 110068 | 06/13/94 | 09/10/55 | 38.8 |
14 | 110069 | 05/31/94 | 08/17/38 | 55.8 |
The DATA statement's KEEP= option is used to tell SAS to write four variables — subj, v_date, b_date, and age — from the program data vector at the end of each iteration of the DATA step to the back2a data set. As before, the KEEP= option must be placed in parentheses and follow the name of the data set from you which you want SAS to select variables. The list of variables that you specify to keep must be separated by at least one blank space.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the temporary data set back2a is identical to the back2 data set created in Example 2.3. That is, the back2a data set, like the back2 data set, contains the four kept variables (subj, v_date, b_date, and age).
Example 14.6 Section
Rather than using the DROP= option on the SET statement, you can use the DROP= option on the DATA statement. The following program tells SAS to drop nine variables — r_id, race, ..., and income — when writing to the output data set back3a:
DATA back3a (drop = r_id race ethnic relig mar_st
ed_level emp_st job_chng income);
set back1;
RUN;
PROC PRINT data=back3a;
title 'Output Dataset: BACK3A';
RUN;
Obs | subj | v_type | v_date | b_date | sex | state | country |
---|---|---|---|---|---|---|---|
1 | 110051 | 0 | 01/25/94 | 12/02/42 | 2 | 42 | 1 |
2 | 110052 | 0 | 01/27/94 | 01/04/25 | 2 | 5 | 1 |
3 | 110053 | 0 | 02/22/94 | 03/15/22 | 2 | 5 | 1 |
4 | 110055 | 0 | 03/15/94 | 03/31/41 | 2 | 5 | 1 |
5 | 110057 | 0 | 03/15/94 | 07/10/44 | 2 | 5 | 1 |
6 | 110058 | 0 | 03/18/94 | 09/09/50 | 2 | . | 13 |
7 | 110059 | 0 | 03/18/94 | 07/25/34 | 2 | 13 | 1 |
8 | 110060 | 0 | 06/14/94 | 05/29/36 | 2 | 13 | 1 |
9 | 110062 | 0 | 03/31/94 | 04/21/36 | 2 | 3 | 1 |
10 | 110065 | 0 | 04/04/94 | 10/12/52 | 2 | 5 | 1 |
11 | 110066 | 0 | 04/12/94 | 08/28/62 | 2 | 5 | 1 |
12 | 110067 | 0 | 04/26/94 | 02/22/72 | 2 | 5 | 1 |
13 | 110068 | 0 | 06/13/94 | 09/10/55 | 2 | 25 | 1 |
14 | 110069 | 0 | 05/31/94 | 08/17/38 | 2 | 32 | 1 |
The DATA statement's DROP= option tells SAS to drop nine variables — r_id, race, ethnic, relig, mar_st, ed_level, emp_st, job_chng, and income — before writing the contents of the program data vector into the back3a data set. Therefore, the program data vector starts out with sixteen data set variables, but only writes seven of them to the back3a data set. As before, the DROP= option must be placed in parentheses and follow the name of the data set from you that you want SAS to exclude variables. The list of variables that you specify to drop must be separated by at least one blank space.
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the back3a data set, just like the back3 data set, contains the seven variables that were not dropped from the data set.
What to Use When Section
After reviewing the previous four examples, your head might be spinning and you might be wondering how you would know which to use — the KEEP= option on the SET statement? the DROP= option on the SET statement? the KEEP= option on the DATA statement? or the DROP= option on the DATA statement? Here are the key facts that will help us devise a strategy:
- As stated earlier, the choice between KEEP= and DROP= is a matter of personal choice. It's whether you place the options on the SET statement or the DATA statement that makes a difference.
- In the SET statement, the DROP= and KEEP= options determine which variables are read from the input SAS dataset. That is, when you specify the DROP= or KEEP= option in the SET statement, the excluded variables are never even read into the program data vector.
- In the DATA statement, the DROP= and KEEP= options determine which variables are written from the program data vector to the resulting SAS data set. That is, when you specify the DROP= or KEEP= option in the DATA statement, all of the variables in the input data set are read into the program data vector, but only the selected variables are written from the program data vector to the output data set.
Those facts stated it becomes obvious that we can construct a more efficient DATA step by not reading unneeded variables from the input data set. That said, we might have a working strategy:
- use the KEEP= option (or DROP= option) in the SET statement to tell SAS only those variables that you need from the input data set read into the program data vector in order to achieve the goals of your program
- use the KEEP= option (or DROP= option) in the DATA statement to tell SAS all of the variables that you want to be written from the program data vector into the output data set
Let's try our working strategy out on an example!
Example 14.7
The following SAS program illustrates our working strategy of when to use the KEEP= option in the SET statement and when to use the KEEP= option in the DATA statement:
DATA back6 (keep = subj age);
set back1 (keep= subj v_date b_date);
age = (v_date - b_date)/365;
format age 4.1;
RUN;
PROC PRINT data=back6;
title 'Output Dataset: BACK6';
RUN;
Obs | subj | age |
---|---|---|
1 | 110051 | 51.2 |
2 | 110052 | 69.1 |
3 | 110053 | 72.0 |
4 | 110055 | 53.0 |
5 | 110057 | 49.7 |
6 | 110058 | 43.6 |
7 | 110059 | 59.7 |
8 | 110060 | 58.1 |
9 | 110062 | 58.0 |
10 | 110065 | 41.5 |
11 | 110066 | 31.6 |
12 | 110067 | 22.2 |
13 | 110068 | 38.8 |
14 | 110069 | 55.8 |
Here, the KEEP= option on the SET statement tells SAS to keep only three variables — subj, v_date, and b_date — from the input data set. The v_date and b_date variables are kept because we need them in order to calculate the age variable. The subject's age is calculated by subtracting b_date from v_date and dividing by 365. Therefore, the program data vector contains four variables — subj, v_date, b_date, and age —all, none or any of which can be written to the output back6 data set. The KEEP= option of the DATA statement tells SAS which variables to write from the program data vector to the output data set. Here, SAS is told to write only the subject's id (subj) and age (age) to back6.
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the temporary data set back6 does indeed contain only the subject's id (subj) and age (age).
Alternative KEEP and DROP Statements Section
Let's add another potential layer of confusion! Here goes ... the DROP= and KEEP= options in the DATA statement are comparable, respectively, to the DROP and KEEP statements placed within the DATA step. That is, the KEEP and DROP statements, just like the DROP= and KEEP= options on a DATA statement, affect which variables are written from the program data vector to the resulting SAS data set. Let's take a look at two examples.
Example 14.8
Rather than using the KEEP= option of the DATA statement, you can also tell SAS which variables in a data set to keep using a KEEP statement within your DATA step. The following SAS program illustrates the use of the KEEP statement:
DATA back4;
set back1;
age = (v_date - b_date)/365;
format age 4.1;
keep subj v_date b_date;
RUN;
PROC PRINT data=back4;
title 'Output Dataset: BACK4';
RUN;
Obs | subj | v_date | b_date |
---|---|---|---|
1 | 110051 | 01/25/94 | 12/02/42 |
2 | 110052 | 01/27/94 | 01/04/25 |
3 | 110053 | 02/22/94 | 03/15/22 |
4 | 110055 | 03/15/94 | 03/31/41 |
5 | 110057 | 03/15/94 | 07/10/44 |
6 | 110058 | 03/18/94 | 09/09/50 |
7 | 110059 | 03/18/94 | 07/25/34 |
8 | 110060 | 06/14/94 | 05/29/36 |
9 | 110062 | 03/31/94 | 04/21/36 |
10 | 110065 | 04/04/94 | 10/12/52 |
11 | 110066 | 04/12/94 | 08/28/62 |
12 | 110067 | 04/26/94 | 02/22/72 |
13 | 110068 | 06/13/94 | 09/10/55 |
14 | 110069 | 05/31/94 | 08/17/38 |
If you look at this program quickly, you might think that its effect should be the same as that of the program used in Example 2.3 to create the back2 data set. Note, though, that in using the KEEP statement, you must identify all of the variables that you want to keep in your output data set, regardless of whether the variables were obtained from the original input data set or whether they were created within the DATA step. Case in point — note that the age variable is not kept in the back4 data set, because SAS was not explicitly told to do so. Also, note that an equal sign (=) is not used in a KEEP statement as it is in the KEEP= option of a SET or DATA statement.
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the back4 data set contains the three kept variables (subj, v_date, and b_date), but not the newly calculated variable (age).
Example 14.9
Of course, rather than using the DROP= option of the DATA statement, you can also tell SAS which variables in a data set to drop using a DROP statement within your DATA step. The following SAS program illustrates the use of the DROP statement:
DATA back5;
set back1;
age = (v_date - b_date)/365;
format age 4.1;
drop r_id race ethnic relig mar_st ed_level emp_st job_chng
income sex state country;
RUN;
PROC PRINT data=back5;
title 'Output Dataset: BACK5';
RUN;
Obs | subj | v_type | v_date | b_date | age |
---|---|---|---|---|---|
1 | 110051 | 0 | 01/25/94 | 12/02/42 | 51.2 |
2 | 110052 | 0 | 01/27/94 | 01/04/25 | 69.1 |
3 | 110053 | 0 | 02/22/94 | 03/15/22 | 72.0 |
4 | 110055 | 0 | 03/15/94 | 03/31/41 | 53.0 |
5 | 110057 | 0 | 03/15/94 | 07/10/44 | 49.7 |
6 | 110058 | 0 | 03/18/94 | 09/09/50 | 43.6 |
7 | 110059 | 0 | 03/18/94 | 07/25/34 | 59.7 |
8 | 110060 | 0 | 06/14/94 | 05/29/36 | 58.1 |
9 | 110062 | 0 | 03/31/94 | 04/21/36 | 58.0 |
10 | 110065 | 0 | 04/04/94 | 10/12/52 | 41.5 |
11 | 110066 | 0 | 04/12/94 | 08/28/62 | 31.6 |
12 | 110067 | 0 | 04/26/94 | 02/22/72 | 22.2 |
13 | 110068 | 0 | 06/13/94 | 09/10/55 | 38.8 |
14 | 110069 | 0 | 05/31/94 | 08/17/38 | 55.8 |
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the back5 data set has, as it should, the same structure and contents of the back2 data set that was created above in Example 14.3, except for the additional v_type variable.