9.5 - Using Codebooks to Help Define Formats

It is very common for discrete (categorical) variables to have many, many (hundreds, perhaps even thousands of) possible values. Examples include:

  • diseases may be coded by an integer
  • surgical treatments may be coded by an integer
  • medications may be coded by an integer

An electronic "codebook" is typically used to keep track of the meaning of each of the integer codes. That is, codebooks contain two variables, the code and a text description of the code. For example, disease 1124 could be defined in the text as "Rheumatoid Arthritis."

One would find it extremely tedious and time-consuming to have to type a FORMAT procedure that redefines the codes and text contained in these codebooks. Instead, one can take advantage of the fact that the codebook is already in an electronic format, such as a database table, an ASCII file, or even a SAS data set.

When the codebook is contained in a SAS data set with three required variables:

  • start: the variable that contains the starting range values (that is, the codes)
  • label: the variable that contains the text definition
  • fmtname: the format name

SAS can create the appropriate format using the CNTLIN = option in the PROC FORMAT statement. Let's take a look at an example.

Example 9.11 Section

The following SAS program creates a SAS data set called states from state_cd, which is the codebook for the variable state that is collected on the ICDB background form. Here's what the first ten observations of the state_cd data set look like:

Codebook for States

Obs

name

CODE

1

alabama

1

2

alaska

2

3

arizona

3

4

arkansas

4

5

california

5

6

colorado

6

7

connecticuit

7

8

delaware

8

9

florida

9

10

georgia

10

The data set states is then used in the FORMAT procedure to define the format for the variable state in the back data set:

PROC PRINT data=icdb.state_cd;
   title 'Codebook for States';
 RUN;
 DATA states;
   set icdb.state_cd (rename = (code = start name=label));
   fmtname = 'stat2fmt';
 RUN;
 PROC FORMAT cntlin=states;
 RUN;
 PROC FREQ data=back;
   title 'Freqency Count of STATE (stat2fmt)';
   format state stat2fmt.;
   table state;
 RUN;

Frequency count of STATE (stat2fmt)
The FREQ Procedure

state

Frequency

Percent

Cumulative Frequency

Cumulative percent

indiana

1

12.50

1

12.50

massachusetts

1

12.50

2

25.00

michigan

2

25.00

4

50.00

minnesota

1

12.50

5

62.50

pennsylvania

1

12.50

6

75.00

tennessee

1

12.50

7

87.50

wisconsin

1

12.50

8

100.00

Frequency Missing = 2

Before running this program, you'll have to download the codebook state_cd. Save it to the location on your computer that you referenced in the earlier LIBNAME statement by the libref icdb. Then, go ahead and launch and run  the program.

As you can see from the output, the PRINT procedure merely prints the icdb.state_cd codebook for your review. You should notice that the variable names in state_cd do not meet SAS requirements for codebooks. Therefore, the DATA step that creates the data set states merely renames the code and name variables in icdb.state_cd so that they meet SAS requirements. The RENAME= option on the SET statement is what is used to change the code variable to start and the name variable to label. The general syntax of the RENAME= option on the SET statement is:

set dsname (rename = (oldvr1 = newvr1 oldvr2 = newvr2 ...));

An assignment statement is then used to assign the value stat2fmt to the variable fmtname for each observation (that is, code) that appears in the icdb.state_cd data set.

Then, the FORMAT procedure with the CNTLIN = states option tells SAS to create the format stat2fmt based on the contents of the data set states. Finally, the FREQ procedure illustrates the use of the stat2fmt after it was created in this manner. SAS merely counts and reports the number of subjects coming from each of the states. Note that since we didn't include the MISSING option on the TABLE statement, SAS reports the number of missing values after the table, rather than as a row of the table.