# 11.2.1 - Modeling Ordinal Data with Log-linear Models

Earlier in the course we described ways to perform significance tests for independence and conditional independence, and to measure (linear) associations with ordinal categorical variables.

For example, we focused on the CMH statistic and correlation measures for testing independence and linear association in the example with heart disease and cholesterol level. We concluded that the variables are not independent and that there is a significant linear association, with M2 = 26.15, df = 1 (see Lesson 4, "Ordinal data" section).

Serum cholesterol (mg/100 cc):

|        | 0–199 | 200–219 | 220–259 | 260+ | Total |
|--------|------:|--------:|--------:|-----:|------:|
| CHD    |    12 |       8 |      31 |   41 |    92 |
| no CHD |   307 |     246 |     439 |  245 |  1237 |
| Total  |   319 |     254 |     470 |  286 |  1329 |

Can we answer the same questions (and more) via log-linear models?

### Modeling Ordinal Data in 2-way Tables

Loglinear models for contingency tables, by default, treat all variables as nominal variables.

If there is an ordering of the categories of the variables, it is not taken into account. That means we could rearrange the rows and/or columns of a table and we would get the same fitted odds ratios for the data as we would with the original ordering of the rows and/or columns.

To model ordinal data with log-linear models, we can apply some of the general ideas we saw with incomplete tables and the analysis of ordinal data from earlier in the semester.

That is, we typically

• assign scores to the levels of our categorical variables, and
• include additional parameters (which represent these scores) into a log-linear model to model the dependency between two variables.

### Linear by Linear Association Model

This is the most common log-linear model when you have ordinal data.

Objective: Model the log counts while accounting for the ordering of the categories of the discrete variables.

Suppose we assign scores $u_1 \le u_2 \le \dots \le u_I$ to the categories of the row variable and $v_1 \le v_2 \le \dots \le v_J$ to the categories of the column variable. These are numbers that describe the relative distances between the categories. Then we can model the dependency between two variables, e.g., C = CHD and S = serum cholesterol.

Model Structure:

$\text{log}(\mu_{ij})=\lambda+\lambda_i^C+\lambda_j^S+\beta u_i v_j$

For each row i, the log fitted values are a linear function of the columns. For each column j, the log fitted values are a linear function of the rows.
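As a concrete sketch, this model can be fit in R as a Poisson log-linear model with one extra numeric covariate built from the scores. The variable names below are illustrative, not taken from the course files, and integer scores are assumed:

```r
# Heart disease by serum cholesterol counts, from the table above
chd   <- factor(rep(c("chd", "nochd"), each = 4))
serum <- factor(rep(c("0-199", "200-219", "220-259", "260+"), times = 2),
                levels = c("0-199", "200-219", "220-259", "260+"))
count <- c(12, 8, 31, 41, 307, 246, 439, 245)

# Integer scores u_i (rows) and v_j (columns)
u <- as.numeric(chd)     # 1 = chd, 2 = nochd
v <- as.numeric(serum)   # 1, 2, 3, 4

# log(mu_ij) = lambda + lambda_i^C + lambda_j^S + beta * u_i * v_j
fit <- glm(count ~ chd + serum + I(u * v), family = poisson())
coef(fit)[["I(u * v)"]]  # beta-hat, about -0.574
```

Note that the ordinality enters only through the single numeric covariate `u * v`; the factor main effects are the same as in the independence model.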

Parameter estimates and interpretation:

This model has only one more parameter than the independence model (the coefficient β in the term $\beta u_i v_j$), so it lies between the independence model and the saturated model in complexity. It captures the 'linear by linear association' between the two variables through the scores that you have assigned.

• If β > 0, then C and S are positively associated (i.e., C tends to go up as S goes up).
• If β < 0, then C and S are negatively associated (i.e., C tends to go down as S goes up).
• The odds ratio for any 2 × 2 sub-table is a direct function of the row and column scores and β:

\begin{align}
\text{log }\theta(ij,i'j') &= \text{log}(\mu_{ij})+\text{log}(\mu_{i'j'})-\text{log}(\mu_{i'j})-\text{log}(\mu_{ij'})\\
&= \beta(u_i-u_{i'})(v_j-v_{j'})\\
\end{align}
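This identity can be checked numerically against the fitted model; the sketch below assumes the same integer-score setup as the worked example:

```r
# Refit the linear-by-linear model with integer scores
chd   <- factor(rep(c("chd", "nochd"), each = 4))
serum <- factor(rep(c("0-199", "200-219", "220-259", "260+"), times = 2),
                levels = c("0-199", "200-219", "220-259", "260+"))
count <- c(12, 8, 31, 41, 307, 246, 439, 245)
u <- as.numeric(chd); v <- as.numeric(serum)
fit  <- glm(count ~ chd + serum + I(u * v), family = poisson())
mu   <- matrix(fitted(fit), nrow = 2, byrow = TRUE)  # row 1: chd, row 2: nochd
beta <- coef(fit)[["I(u * v)"]]

# log odds ratio for rows (1,2), columns (1,2) vs. beta*(u1-u2)*(v1-v2)
log_theta <- log(mu[1, 1]) + log(mu[2, 2]) - log(mu[2, 1]) - log(mu[1, 2])
all.equal(log_theta, beta * (1 - 2) * (1 - 2))  # TRUE
```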

Model Fit:

We use G2 and ΔG2 as with any other log-linear model. For our example, see HeartDiseaseLoglin.sas and the resulting output, HeartDiseaseLoglin.lst. In this SAS program we create two new numerical variables, as seen in the datalines statement. Here is the R code for this example: HeartDiseaseLoglin.R.

We observe G2 = 4.09, df = 2, p-value = 0.13, which indicates that the linear by linear association model fits well, and significantly better than the independence model: ΔG2 = 27.832, df = 1, p-value < 0.001. Notice the equivalence of the values of ΔG2, M2, and the likelihood-ratio statistic for the "xCHD*yserum" parameter under the significance testing for the individual parameters (e.g., the 'Type 3 Analysis' output of GENMOD).
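These fit statistics can be reproduced with a sketch like the following (an assumed reconstruction, not the course's HeartDiseaseLoglin.R itself):

```r
chd   <- factor(rep(c("chd", "nochd"), each = 4))
serum <- factor(rep(c("0-199", "200-219", "220-259", "260+"), times = 2),
                levels = c("0-199", "200-219", "220-259", "260+"))
count <- c(12, 8, 31, 41, 307, 246, 439, 245)
u <- as.numeric(chd); v <- as.numeric(serum)

indep  <- glm(count ~ chd + serum, family = poisson())            # independence
linlin <- glm(count ~ chd + serum + I(u * v), family = poisson()) # linear by linear

deviance(linlin)                     # G2, about 4.09 on 2 df
deviance(indep) - deviance(linlin)   # Delta G2, about 27.83 on 1 df
anova(indep, linlin, test = "LRT")   # same comparison as a likelihood-ratio test
```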

$\hat{\beta}=-0.574$, and exp(−0.574) = 0.56 is the estimated odds ratio for a unit change in the row and column scores, e.g., comparing 'chd' vs. 'no chd' between the adjacent levels '0–199' and '200–219'.

Linear by Linear Association Model              01:27 Tuesday, April 4, 2006   6

The GENMOD Procedure

Analysis Of Parameter Estimates

Likelihood Ratio
Standard      95% Confidence        Chi-
Parameter               DF   Estimate      Error          Limits          Square

serum         220-259    1    -0.5937     0.2303    -1.0563    -0.1513      6.65
serum         260+       0     0.0000     0.0000     0.0000     0.0000       .
xCHD*yserum              1    -0.5740     0.1161    -0.8089    -0.3527     24.45
Scale                    0     1.0000     0.0000     1.0000     1.0000


Look at the model fitted values ('Pred' from the SAS "Observation Statistics" table, or from the "fitted" function in R):

$\hat{\theta}=(8.16\times 242.69)/(11.31\times 310.84)=0.56$

The relevant fitted values appear in the output below:

Observation Statistics

Obs  count  xCHD  yserum  CHD    serum        Pred   Xbeta     Std   HessWgt     Lower     Upper   Resraw   Reschi   Resdev  StResdev  StReschi   Reslik
  1     12     1       1  chd    0-199      8.1581  2.0990  0.2624    8.1581    4.8774   13.6454   3.8419   1.3451   1.2561    1.8977    2.0322   1.9744
  2      8     1       2  chd    200-219   11.3082  2.4255  0.1694   11.3082    8.1128   15.7623  -3.3082  -0.9838  -1.0388   -1.2640   -1.1971  -1.2427
  3     31     1       3  chd    220-259   35.9094  3.5810  0.1111   35.9094   28.8823   44.6461  -4.9094  -0.8193  -0.8391   -1.1246   -1.0980  -1.1129
  4     41     1       4  chd    260+      36.6244  3.6007  0.1469   36.6244   27.4636   48.8409   4.3756   0.7230   0.7093    1.5478    1.5777   1.5715
  5    307     2       1  nochd  0-199    310.8419  5.7393  0.0564  310.8419  278.3162  347.1688  -3.8419  -0.2179  -0.2184   -2.0365   -2.0323  -2.0323
  6    246     2       2  nochd  200-219  242.6918  5.4918  0.0632  242.6918  214.4285  274.6805   3.3082   0.2124   0.2119    1.1944    1.1971   1.1970
  7    439     2       3  nochd  220-259  434.0907  6.0733  0.0469  434.0907  395.9839  475.8645   4.9094   0.2356   0.2352    1.0960    1.0980   1.0979
  8    245     2       4  nochd  260+     249.3756  5.5190  0.0623  249.3756  220.6936  281.7853  -4.3756  -0.2771  -0.2779   -1.5824   -1.5777  -1.5779

The estimated odds ratio of 'chd' comparing the lowest cholesterol level with a higher level, e.g., '0–199' vs. '260+', under this model is

$\hat{\theta}=\text{exp}(-0.574(2-1)(4-1))=0.18=\dfrac{8.16\times 249.37}{36.62\times 310.84}$

Since 1/0.18 ≈ 5.6, the estimated odds of heart disease for a person in the '260+' cholesterol group are about 5.6 times the odds for a person in the '0–199' group.
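Checking this arithmetic directly in R, using the reported $\hat{\beta}$ and the fitted counts from the output above:

```r
beta_hat <- -0.574

# Odds ratio comparing '0-199' (v = 1) with '260+' (v = 4) across the CHD rows
theta_hat <- exp(beta_hat * (2 - 1) * (4 - 1))
round(theta_hat, 2)                               # 0.18

# The same value from the four fitted counts in the corner cells
round((8.158 * 249.376) / (36.624 * 310.842), 2)  # 0.18

# Inverting gives the odds of CHD at '260+' relative to '0-199'
round(1 / theta_hat, 1)                           # 5.6
```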

### Choice of Scores

There are many options for assigning the score values, and these can be of equal or unequal spacing.

The most common choice of scores is consecutive integers; that is, u1 = 1, u2 = 2, ..., uI = I and v1 = 1, v2 = 2, ..., vJ = J (which is what we used in the example above).

The model with such scores is a special case of the linear by linear association model and is known as the Uniform Association Model. It is called the uniform association model because the odds ratios for any two adjacent rows and any two adjacent columns are the same and equal to

$\theta=\text{exp}(\beta(u_i-u_{i-1})(v_j-v_{j-1}))=\text{exp}(\beta)$

In other words, the Local Odds Ratio equals exp(β) and is the same for adjacent rows and columns.
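With integer scores, every local (adjacent-cell) odds ratio of the fitted counts should equal exp(β̂); a short sketch to verify this on the heart disease data:

```r
chd   <- factor(rep(c("chd", "nochd"), each = 4))
serum <- factor(rep(c("0-199", "200-219", "220-259", "260+"), times = 2),
                levels = c("0-199", "200-219", "220-259", "260+"))
count <- c(12, 8, 31, 41, 307, 246, 439, 245)
u <- as.numeric(chd); v <- as.numeric(serum)
fit <- glm(count ~ chd + serum + I(u * v), family = poisson())
mu  <- matrix(fitted(fit), nrow = 2, byrow = TRUE)  # row 1: chd, row 2: nochd

# Local odds ratios for each pair of adjacent columns (the two rows are adjacent)
local_or <- sapply(1:3, function(j)
  (mu[1, j] * mu[2, j + 1]) / (mu[2, j] * mu[1, j + 1]))
local_or                        # all three equal
exp(coef(fit)[["I(u * v)"]])    # exp(beta-hat), about 0.56
```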

Also, sets of scores with the same spacing between them will lead to the same goodness-of-fit statistics, fitted counts, odds ratios, and $\hat{\beta}$ .

For example, v1 = 1, v2 = 2, v3 = 3, v4 = 4 and v1 = 8, v2 = 9, v3 = 10, v4 = 11 will yield the same results.

However, please note: Two sets of scores with the same relative spacing will lead to the same goodness-of-fit statistics, fitted counts, and odds ratios, BUT different estimates of β.

For example, v1 = 1, v2 = 2, v3 = 4, v4 = 8 and v1 = 2, v2 = 4, v3 = 8, v4 = 16 give the same fit, but the second set yields an estimate $\hat{\beta}$ that is half as large.
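The effect of rescaling scores can be seen directly: doubling the column scores (using the hypothetical score sets from the example above) halves β̂ but leaves the fit unchanged:

```r
chd   <- factor(rep(c("chd", "nochd"), each = 4))
serum <- factor(rep(c("0-199", "200-219", "220-259", "260+"), times = 2),
                levels = c("0-199", "200-219", "220-259", "260+"))
count <- c(12, 8, 31, 41, 307, 246, 439, 245)
u  <- as.numeric(chd)
v1 <- c(1, 2, 4, 8)[as.numeric(serum)]   # first set of column scores
v2 <- 2 * v1                             # same relative spacing, doubled

fit1 <- glm(count ~ chd + serum + I(u * v1), family = poisson())
fit2 <- glm(count ~ chd + serum + I(u * v2), family = poisson())

deviance(fit1) - deviance(fit2)                         # ~0: identical fit
coef(fit1)[["I(u * v1)"]] / coef(fit2)[["I(u * v2)"]]   # ~2: beta-hat halves
```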

The choice of scores may depend heavily on your data and the context of your problem. There are other ways of using and modeling ordinality, e.g., cumulative logit models (ref. Agresti (2002), Sec. 7.2 and 7.3, Agresti (2007), Sec. 6.2, and Agresti (1996), Sec. 8.2 and 8.3), which have already been discussed.

### Generalization to Higher-dimensional Tables

For higher-dimensions we already know how to test for associations and conditional independence with ordinal data, and combinations of ordinal and nominal, via CMH statistic (recall Lesson 4).

The modeling approach described today generalizes to higher-dimensional tables as well. We can always create new variables representing the scores.

Association models are generalizations of the linear by linear association model for multi-way tables.

We can also combine ordinal and nominal variables, where we assign scores only to the ordinal variables and estimate the remaining scores from the data. Some of these models are known as the row effects, column effects, and row and column effects models. These are more advanced topics; for more information, see Agresti (2013), Sec. 9.5-9.6 and Sec. 7.2-7.3, and Agresti (2007), Sec. 6.2-6.3.