Multiclass ODA: Convergent Validity of Protein Classification Methods

oda

2026-06-09

Research question

Nishikawa, Kubota, and Ooi (1983) independently classified 325 proteins into one of four mutually exclusive types using two different methods: one based on biological characteristics and one based on amino acid composition.[1] Because the two methods should theoretically converge on the same type for each protein, the directional hypothesis is that protein type codes are identical across methods, which would demonstrate convergent validity.

Optimal Data Analysis (MultiODA) tests whether amino acid composition type discriminates biological type, with the a priori prediction that type codes match across methods.

Data

Biological type (1-4) is the class variable; amino acid composition type (1-4) is the attribute. Published cell frequencies are reconstructed directly into observation-level vectors, so no external data file is required.

library(oda)

# Cross-classification: rows = biological type, cols = amino acid type.
# (column-major reconstruction matches published Table 1)
#                  AA=1  AA=2  AA=3  AA=4   total
#  Biological=1     98    16     5     3      122
#  Biological=2     13    50     2     8       73
#  Biological=3      6     4    23    12       45
#  Biological=4      7    19    14    45       85
#  total           124    89    44    68      325

biological_type <- c(
  rep(1L, 98), rep(2L, 13), rep(3L,  6), rep(4L,  7),  # amino_acid = 1
  rep(1L, 16), rep(2L, 50), rep(3L,  4), rep(4L, 19),  # amino_acid = 2
  rep(1L,  5), rep(2L,  2), rep(3L, 23), rep(4L, 14),  # amino_acid = 3
  rep(1L,  3), rep(2L,  8), rep(3L, 12), rep(4L, 45)   # amino_acid = 4
)
amino_acid_type <- c(rep(1L, 124), rep(2L, 89), rep(3L, 44), rep(4L, 68))

table(amino_acid_type, biological_type,
      dnn = c("Amino Acid Type (1-4)", "Biological Type (1-4)"))
#>                      Biological Type (1-4)
#> Amino Acid Type (1-4)  1  2  3  4
#>                     1 98 13  6  7
#>                     2 16 50  4 19
#>                     3  5  2 23 14
#>                     4  3  8 12 45
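Because the vectors are rebuilt by hand from published counts, a quick margin check helps catch transcription slips. The following is a minimal base R sketch (it does not use the oda package); the expected totals are the row and column totals from the commented cross-classification above.

```r
# Rebuild the two vectors exactly as in the chunk above (base R only).
biological_type <- c(
  rep(1L, 98), rep(2L, 13), rep(3L,  6), rep(4L,  7),  # amino_acid = 1
  rep(1L, 16), rep(2L, 50), rep(3L,  4), rep(4L, 19),  # amino_acid = 2
  rep(1L,  5), rep(2L,  2), rep(3L, 23), rep(4L, 14),  # amino_acid = 3
  rep(1L,  3), rep(2L,  8), rep(3L, 12), rep(4L, 45)   # amino_acid = 4
)
amino_acid_type <- c(rep(1L, 124), rep(2L, 89), rep(3L, 44), rep(4L, 68))

tab <- table(amino_acid_type, biological_type)

# Stop with an error if any margin disagrees with the published totals.
stopifnot(
  length(biological_type) == length(amino_acid_type),
  sum(tab) == 325L,
  all(rowSums(tab) == c(124, 89, 44, 68)),  # amino acid type totals
  all(colSums(tab) == c(122, 73, 45, 85))   # biological type totals
)
```

If every condition holds, stopifnot() returns silently and the reconstruction matches the published margins.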

Fit the ODA model

Amino acid type is a four-category nominal variable. ODA searches all possible mappings from the four amino-acid-type categories to the four biological-type classes and selects the mapping that maximises the effect strength for sensitivity (ESS). Although the research hypothesis is directional, no a priori direction is supplied to this fit, so the search is nondirectional (Hypothesis: NONDIRECTIONAL in MegaODA output). Leave-one-out (LOO) jackknife validity is requested via loo = "on", and the LOO confusion matrix and ESS are reported. No LOO p-value is given because no canonical Fisher-exact LOO p-value is defined for multicategorical problems with more than two classes (C > 2).

# Canonical reference run (mc_iter = 25000L; not evaluated in CRAN vignette)
fit <- oda_fit(
  x         = amino_acid_type,
  y         = biological_type,
  attr_type = "categorical",
  mc_iter   = 25000L,
  loo       = "on"
)
# CRAN-safe run: mc_iter = 500L for vignette rendering speed.
# Training rule, ESS, and confusion matrix are identical to the canonical run.
fit <- oda_fit(
  x         = amino_acid_type,
  y         = biological_type,
  attr_type = "categorical",
  mc_iter   = 500L,
  mc_seed   = 42L,
  loo       = "on"
)
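The exhaustive search described above can be illustrated in base R. This is a sketch under stated assumptions, not the oda package's implementation: it assumes "all possible mappings" means any assignment of the four categories to the four classes (many-to-one included, 4^4 = 256 candidates) and that ESS rescales mean PAC so that chance (25% with four classes) is 0 and perfect classification is 100. The names counts, ess_of, and best_map are illustrative.

```r
# Cell counts from the table above: rows = amino acid type, cols = biological type.
counts <- matrix(
  c(98, 13,  6,  7,
    16, 50,  4, 19,
     5,  2, 23, 14,
     3,  8, 12, 45),
  nrow = 4, byrow = TRUE
)
n_by_class <- colSums(counts)  # class sizes: 122, 73, 45, 85
n_classes  <- ncol(counts)

# ESS for one candidate mapping; map[a] is the class predicted for category a.
ess_of <- function(map) {
  # hits[k]: observations of class k whose category is mapped to class k
  hits <- vapply(seq_len(n_classes),
                 function(k) sum(counts[map == k, k]), numeric(1))
  mean_pac <- 100 * mean(hits / n_by_class)
  chance   <- 100 / n_classes
  100 * (mean_pac - chance) / (100 - chance)
}

# Enumerate all 4^4 = 256 assignments of categories to classes.
mappings <- as.matrix(expand.grid(rep(list(seq_len(n_classes)), nrow(counts))))
ess      <- apply(mappings, 1, ess_of)

best_map <- unname(mappings[which.max(ess), ])
best_map            # class predicted for amino acid types 1..4
round(max(ess), 2)  # ESS of the best mapping, in percent
```

Under these assumptions the best mapping is the identity rule (1, 2, 3, 4) with ESS of 50.96%, in line with the rule reported by print(fit).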

Rule and confusion matrix

print(fit)
#> 
#> ODA (multiclass)  attr_type=categorical  priors=TRUE  n=325
#> 
#> Rule: 1 --> 1   |   2 --> 2   |   3 --> 3   |   4 --> 4
#> 
#>   CLASS     PAC
#>       1   80.3%
#>       2   68.5%
#>       3   51.1%
#>       4   52.9%
#> 
#>   Mean PAC: 63.22%   ESS: 50.96%  p(MC): < .001
#> 
#>   -- LOO --
#>   CLASS     PAC
#>       1   80.3%
#>       2   68.5%
#>       3   51.1%
#>       4   52.9%
#> 
#>   LOO Mean PAC: 63.22%   LOO ESS: 50.96%
#>   p(LOO): not reported for multicategorical ODA

ODA’s nondirectional search identified the identity mapping as the optimal categorical partition:

# Confusion matrix (actual x predicted); strip dimnames for clean display
conf_mat <- unname(fit$confusion)
rownames(conf_mat) <- paste0("Bio=", 1:4)
colnames(conf_mat) <- paste0("Pred=", 1:4)
print(conf_mat)
#>       Pred=1 Pred=2 Pred=3 Pred=4
#> Bio=1     98     16      5      3
#> Bio=2     13     50      2      8
#> Bio=3      6      4     23     12
#> Bio=4      7     19     14     45

ESS / PAC / PV interpretation

summary(fit)
#> 
#> ODA Summary (multiclass)  status=valid  n=325
#>   attr_type=categorical  priors=TRUE  weights=FALSE
#>   Rule: 1 --> 1   |   2 --> 2   |   3 --> 3   |   4 --> 4
#> 
#>   -- Train --
#>     Mean PAC: 63.22%   ESS: 50.96%
#>     p(MC): < .001  [MC permutation, two-tailed]
#>   -- LOO --
#>     CLASS     PAC
#>         1   80.3%
#>         2   68.5%
#>         3   51.1%
#>         4   52.9%
#>     LOO ESS: 50.96%
#>     LOO Mean PAC: 63.22%
#>     p(LOO): not reported for multicategorical ODA
m <- oda_metrics(fit)

# PAC (sensitivity) per class - pac_by_class is already on percentage scale
cat("PAC by biological type:\n")
#> PAC by biological type:
cat("  Type 1:", round(m$pac_by_class[1], 1), "%\n")
#>   Type 1: 80.3 %
cat("  Type 2:", round(m$pac_by_class[2], 1), "%\n")
#>   Type 2: 68.5 %
cat("  Type 3:", round(m$pac_by_class[3], 1), "%\n")
#>   Type 3: 51.1 %
cat("  Type 4:", round(m$pac_by_class[4], 1), "%\n")
#>   Type 4: 52.9 %

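# Predictive value (PV): correct predictions / all predictions of each type
# (diagonal / column sums of the actual x predicted confusion matrix)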
pv <- diag(fit$confusion) / colSums(fit$confusion) * 100
cat("\nPV by biological type:\n")
#> 
#> PV by biological type:
cat("  Type 1:", round(pv[1], 1), "%\n")
#>   Type 1: 79 %
cat("  Type 2:", round(pv[2], 1), "%\n")
#>   Type 2: 56.2 %
cat("  Type 3:", round(pv[3], 1), "%\n")
#>   Type 3: 52.3 %
cat("  Type 4:", round(pv[4], 1), "%\n")
#>   Type 4: 66.2 %
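The relationship between the three indices can be checked by hand from the confusion matrix. The sketch below uses base R only and hard-codes the matrix printed earlier; it assumes the usual ODA convention that ESS rescales mean PAC between chance (100 / C, here 25%) and 100%.

```r
# Confusion matrix from the output above: rows = actual, cols = predicted.
conf <- matrix(
  c(98, 16,  5,  3,
    13, 50,  2,  8,
     6,  4, 23, 12,
     7, 19, 14, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual = 1:4, predicted = 1:4)
)

pac <- 100 * diag(conf) / rowSums(conf)  # share of each actual class classified correctly
pv  <- 100 * diag(conf) / colSums(conf)  # share of each predicted class that is correct

n_classes <- nrow(conf)
chance    <- 100 / n_classes             # mean PAC expected by chance: 25% for 4 classes
mean_pac  <- mean(pac)
ess       <- 100 * (mean_pac - chance) / (100 - chance)

round(c(mean_pac = mean_pac, ess = ess), 2)
```

This reproduces the mean PAC of 63.22% and ESS of 50.96% reported by summary(fit). PAC conditions on the actual biological type (row), whereas PV conditions on the predicted type (column), which is why the two sets of percentages differ.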

Monte Carlo and LOO validity

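The Monte Carlo p-value (p < .001) and the LOO results appear in the print and summary output above. LOO PAC and ESS equal their training values (LOO ESS = 50.96%), so the identity rule is stable under the jackknife.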
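The LOO result can be illustrated with a small base R sketch. It is an assumption-laden illustration rather than the package's algorithm: it assumes any category-to-class assignment is allowed, in which case mean PAC is a sum of per-category terms and the max-ESS rule sends each category to the class where it has the largest within-class proportion. Since all observations in a cell are interchangeable, only 16 refits are needed. The helper fit_map is hypothetical.

```r
# Cell counts: rows = amino acid type, cols = biological type.
counts <- matrix(
  c(98, 13,  6,  7,
    16, 50,  4, 19,
     5,  2, 23, 14,
     3,  8, 12, 45),
  nrow = 4, byrow = TRUE
)

# Max-ESS rule for unconstrained mappings: each category goes to the class
# in which it has the largest within-class proportion.
fit_map <- function(tab) {
  prop <- sweep(tab, 2, colSums(tab), "/")
  unname(apply(prop, 1, which.max))
}

# LOO confusion matrix: rows = actual class, cols = predicted class.
loo_conf <- matrix(0, 4, 4, dimnames = list(actual = 1:4, predicted = 1:4))
for (a in 1:4) {
  for (k in 1:4) {
    if (counts[a, k] == 0) next
    held_out <- counts
    held_out[a, k] <- held_out[a, k] - 1  # drop one observation from this cell
    pred <- fit_map(held_out)[a]          # refit, then classify the held-out case
    loo_conf[k, pred] <- loo_conf[k, pred] + counts[a, k]
  }
}

loo_pac <- 100 * diag(loo_conf) / rowSums(loo_conf)
loo_ess <- 100 * (mean(loo_pac) - 25) / 75
round(loo_ess, 2)
```

In this sketch no held-out observation changes the fitted rule, so the LOO confusion matrix equals the training confusion matrix and LOO ESS stays at 50.96%.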

Notes on reproducibility

Fixture parity. The training rule, confusion matrix, and ESS are verified against MegaODA.exe output in the package test suite (tests/testthat/test-fixture-vignettes.R, Example 3).

MC p-value calibration. The MC p shown here reflects mc_iter = 500L in this CRAN vignette. Use the canonical run with mc_iter = 25000L (chunk fit-canonical, eval=FALSE) for publication-quality results.

Nondirectional search. No direction argument is supplied. ODA evaluates all possible mappings from the four amino-acid categories to the four biological-type classes and selects the mapping that maximises ESS. This matches the MegaODA.exe gold run (Hypothesis: NONDIRECTIONAL).

Optional directional analysis. A researcher with an a priori convergent-validity hypothesis (amino acid type i predicts biological type i) can supply direction = "ascending" for a constrained identity-map analysis (MPE Chapter 4 Phase 6C). For this dataset the two analyses yield identical ESS and confusion matrices because the identity mapping happens to be the global optimum; they differ in how the MC p-value is interpreted (directional vs. nondirectional).
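The difference between the two Monte Carlo interpretations can be sketched in base R. The assumptions here are loud ones: the nondirectional null is taken to refit the best mapping on every permutation of the class labels, the directional null is taken to hold the identity map fixed, and the p-value uses the (count + 1) / (iterations + 1) convention. None of this is confirmed to match the package's internals, and the seed is arbitrary. The helpers ess_of_pred and fit_map are hypothetical.

```r
set.seed(42)  # arbitrary seed for this sketch; not tied to the package's mc_seed

y <- c(rep(1L, 98), rep(2L, 13), rep(3L,  6), rep(4L,  7),
       rep(1L, 16), rep(2L, 50), rep(3L,  4), rep(4L, 19),
       rep(1L,  5), rep(2L,  2), rep(3L, 23), rep(4L, 14),
       rep(1L,  3), rep(2L,  8), rep(3L, 12), rep(4L, 45))  # biological type
x <- c(rep(1L, 124), rep(2L, 89), rep(3L, 44), rep(4L, 68))  # amino acid type

# ESS (percent) of predictions against actual classes, four classes assumed.
ess_of_pred <- function(pred, actual) {
  pac <- vapply(1:4, function(k) mean(pred[actual == k] == k), numeric(1))
  100 * (mean(pac) - 0.25) / 0.75
}

# Best unconstrained mapping: category -> class with largest within-class share.
fit_map <- function(x, y) {
  tab  <- table(factor(x, levels = 1:4), factor(y, levels = 1:4))
  prop <- sweep(tab, 2, colSums(tab), "/")
  unname(apply(prop, 1, which.max))
}

observed <- ess_of_pred(fit_map(x, y)[x], y)  # identity map is optimal here

n_iter      <- 500L
null_nondir <- numeric(n_iter)
null_dir    <- numeric(n_iter)
for (i in seq_len(n_iter)) {
  y_perm         <- sample(y)                              # break any x-y association
  null_nondir[i] <- ess_of_pred(fit_map(x, y_perm)[x], y_perm)  # refit each time
  null_dir[i]    <- ess_of_pred(x, y_perm)                 # identity map held fixed
}

p_nondir <- (sum(null_nondir >= observed) + 1) / (n_iter + 1)
p_dir    <- (sum(null_dir    >= observed) + 1) / (n_iter + 1)
c(p_nondir = p_nondir, p_dir = p_dir)
```

Because refitting can only match or improve on the fixed identity map, each nondirectional null ESS is at least as large as its directional counterpart, so the nondirectional test is the more conservative of the two. With 500 iterations the smallest attainable p under this convention is 1/501, which is one reason the canonical run uses more iterations.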


  1. Nishikawa, K., Kubota, Y., & Ooi, T. (1983). Classification of proteins into groups based on amino acid composition and other characters, II: Grouping into four types. Journal of Biochemistry, 94, 997-1007.

  2. Yarnold, P.R., & Soltysik, R.C. (2005). Optimal Data Analysis: A Guidebook with Software for Windows. Washington, D.C.: APA Books.