Parsing POS Layout Files for Descriptive Names

Robert J. Gambrel


This vignette shows how a Provider of Services (POS) report layout file can be parsed quickly to extract the descriptive variable names it contains. POS datasets from 2010 and earlier have generic variable names such as PROV0001, PROV0002, ... that offer no insight into what each variable is. The layout file contains a data dictionary explaining each variable’s values, along with a COBOL descriptive name for it. The pos_names_extract() function parses this file and returns the descriptive names in the order that matches the variables in the dataset.

Provider of Services Data

I have included a sample of the 2010 Provider of Services data for hospices. The full 2010 file (along with many other years) is available from the NBER and contains data from other provider types as well.

# load the package data
data(pos2010, package = "medicare")
# show the first ten variable names
names(pos2010)[1:10]
##  [1] "prov0085" "prov0075" "prov0095" "prov0100" "prov3225" "prov0220"
##  [7] "prov2715" "prov2695" "prov0300" "prov0500"

These variable names are uninformative, and with over 500 variables it is impractical to look up each one by hand. Instead, we can parse the layout file to obtain useful names. In this example, I have bundled the 2010 layout file with this package, but in practice the user is expected to download the layout text file that corresponds to each dataset in use.

# change filepath to point to your own downloaded layout file
filepath <- system.file("extdata", "layout10.txt", package = "medicare")
names_2010 <- pos_names_extract(filepath, pos2010)
# show the first eight descriptive names
names_2010[1:8]
##  [1] "CATEGORY_SUBTYPE_IND"  "CATEGORY"             
##  [3] "CHOW_CNT"              "CHOW_DT"              
##  [5] "CITY"                  "COMPL_ACCEPT_PLAN_COR"
##  [7] "STATUS_COMPL"          "SSA_COUNTY"           

These are much more descriptive variable names and worth using.

pos2010_renamed <- pos2010
names(pos2010_renamed) <- names_2010
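
Because CMS documentation may refer to either naming scheme, it can help to record the mapping before the generic names are overwritten. The following is a minimal sketch of that idea, not part of the package: toy_pos and toy_names are made-up stand-ins for pos2010 and for the vector returned by pos_names_extract(), and crosswalk is a hypothetical helper table.

```r
# Toy stand-ins (assumptions): a tiny data frame with generic names, and the
# descriptive names a layout file would supply, in column order.
toy_pos <- data.frame(prov0085 = c("01", "02"),
                      prov0075 = c("05", "05"),
                      prov0095 = c(0L, 2L),
                      stringsAsFactors = FALSE)
toy_names <- c("CATEGORY_SUBTYPE_IND", "CATEGORY", "CHOW_CNT")

# Record the generic-to-descriptive mapping before overwriting the names.
crosswalk <- data.frame(generic = names(toy_pos),
                        descriptive = toy_names,
                        stringsAsFactors = FALSE)

# Rename a copy so the original data frame keeps its generic names.
toy_renamed <- toy_pos
names(toy_renamed) <- toy_names

# Look up the generic name behind a descriptive one.
crosswalk$generic[crosswalk$descriptive == "CHOW_CNT"]
```

Keeping the crosswalk as a data frame makes it easy to write out alongside the renamed dataset.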

Note that it is up to the user to make sure that the layout file is appropriate for the chosen data file. Each year’s layout file is different, so each year must be parsed separately. The function checks whether the number of variables in the layout file and dataset match and whether the generic variable names are the same in both, and it stops with an error if either check fails. If the generic names from dataset 20XX are the same as in layout 20YY, the parsing will run, but the resulting names won’t necessarily be accurate, because CMS is not 100% consistent with variable naming across years.

# keep only the first 500 columns, so the variable count no longer matches
pos2010_short <- pos2010[, 1:500]
names_2010_short <- pos_names_extract(filepath, pos2010_short)
## Error in pos_names_extract(filepath, pos2010_short): Number of variables in Layout file did not match number of variables in the dataset. Are you sure the layout file year and dataset year match?

# change three column names, so the generic names no longer match
pos2010_wrong_names <- pos2010
names(pos2010_wrong_names)[1:3] <- c("wrong1", "wrong2", "wrong3")
names_2010_wrong_names <- pos_names_extract(filepath, pos2010_wrong_names)
## Error in pos_names_extract(filepath, pos2010_wrong_names): Generic variable names in Layout file do not match names in dataset. Are you sure the layout file year and dataset year match?
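
The two checks described above can be sketched on plain character vectors. This is only an illustration of the logic, not the package’s actual implementation: check_layout_match() is a hypothetical helper, the error messages are shortened, and the case-insensitive comparison is an assumption.

```r
# Hypothetical helper (not part of the medicare package): mimics the two
# checks described above on plain character vectors.
check_layout_match <- function(layout_generic, data_names) {
  # Check 1: the layout and the dataset must have the same number of variables.
  if (length(layout_generic) != length(data_names)) {
    stop("Number of variables in layout and dataset differ.")
  }
  # Check 2: the generic names must agree, in order.
  # Assumption: compare case-insensitively (e.g. PROV0085 vs prov0085).
  if (!identical(toupper(layout_generic), toupper(data_names))) {
    stop("Generic variable names in layout and dataset differ.")
  }
  invisible(TRUE)
}

layout_generic <- c("PROV0085", "PROV0075", "PROV0095")

# Passes silently: same length, same names up to case.
check_layout_match(layout_generic, c("prov0085", "prov0075", "prov0095"))

# tryCatch() lets a script capture the error message instead of halting.
msg <- tryCatch(
  check_layout_match(layout_generic, c("prov0085", "prov0075")),
  error = function(e) conditionMessage(e)
)
msg
```

Wrapping the call in tryCatch(), as in the last step, is one way to process many years in a loop and collect the years whose layout and dataset do not match.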

Pre-compiled dataset names

To save the user the time and headache of downloading each year’s layout file, I have pre-compiled dataset names for years 2000-2010. These can be accessed via the pos_names() function. Looking at variables from the middle of each year’s list also illustrates how the dataset layouts change over time:

for (year in 2000:2010) {
## [1] 2000
## [1] "ORG_FAMILY_GRP"       "ORG_RESID_GRP"        "NUM_OTH_CONTRACT"    
## [1] 2001
## [1] "ORG_RESID_GRP"         "NUM_OTH_CONTRACT"      "NUM_OTH_FULL_TIME"    
## [1] 2002
## [1] "ORG_RESID_GRP"         "NUM_OTH_CONTRACT"      "NUM_OTH_FULL_TIME"    
## [1] 2003
## [1] "ORG_RESID_GRP"         "NUM_OTH_CONTRACT"      "NUM_OTH_FULL_TIME"    
## [1] 2004
## [1] "ORG_RESID_GRP"         "NUM_OTH_CONTRACT"      "NUM_OTH_FULL_TIME"    
## [1] 2005
## [1] 2006
## [1] 2007
## [1] 2008
## [1] 2009
## [1] 2010
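
To see how two years’ name lists differ, base R’s setdiff() and match() are usually enough. The sketch below uses two short made-up vectors, names_a and names_b, loosely modeled on the output above; they are assumptions for illustration and not the real name lists for any year.

```r
# Toy name vectors (assumptions): short stand-ins for two years' name lists.
names_a <- c("CATEGORY", "ORG_FAMILY_GRP", "ORG_RESID_GRP", "NUM_OTH_CONTRACT")
names_b <- c("CATEGORY", "ORG_RESID_GRP", "NUM_OTH_CONTRACT", "NUM_OTH_FULL_TIME")

# Names that appear in one list but not the other.
only_in_a <- setdiff(names_a, names_b)
only_in_b <- setdiff(names_b, names_a)

# Position of each names_b entry within names_a (NA if absent); a shifted
# position means the same variable sits in a different column.
pos_in_a <- match(names_b, names_a)

only_in_a
only_in_b
pos_in_a
```

With the real lists, the same calls would show which variables were added or dropped between two years and which ones merely moved to a different column.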
