The fctutils package provides a comprehensive suite of utilities for advanced manipulation and analysis of factor vectors in R. It offers tools for splitting, combining, reordering, filtering, and transforming factor levels based on various criteria. Designed to enhance the handling of categorical data, fctutils simplifies complex factor operations, making it easier to preprocess and analyze data in R.
Key Features:
Install the package with its dependencies and load it for usage in R.
library(devtools) # Load the devtools package
install_github("guokai8/fctutils") # Install the package
fct_pos Reorders the levels of a factor vector based on the characters at specified positions within the factor levels.
library(fctutils)
factor_vec <- factor(c('Apple', 'banana', 'Cherry', 'date', 'Fig', 'grape'))
# Reorder based on positions 1 and 3, case-insensitive
fct_pos(factor_vec, positions = c(1, 3))
## [1] Apple banana Cherry date Fig grape
## Levels: Apple banana Cherry date Fig grape
# Reorder based on position 3, case-insensitive, inplace = TRUE
fct_pos(factor_vec, positions = 3, inplace = TRUE)
## [1] grape Cherry Fig banana Apple date
## Levels: grape Cherry Fig banana Apple date
# Reorder in decreasing order, case-sensitive
fct_pos(factor_vec, positions = 1:2, case = TRUE, decreasing = TRUE)
## [1] Apple banana Cherry date Fig grape
## Levels: grape date banana Fig Cherry Apple
fct_count Reorders the levels of a factor vector based on the count of each level in the data.
factor_vec <- factor(c('apple', 'banana', 'apple', 'cherry', 'banana', 'banana', 'date'))
# Reorder levels by decreasing count
fct_count(factor_vec)
## [1] apple banana apple cherry banana banana date
## Levels: banana apple cherry date
# Reorder levels by increasing count
fct_count(factor_vec, decreasing = FALSE)
## [1] apple banana apple cherry banana banana date
## Levels: cherry date apple banana
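For readers curious about the mechanics, count-based reordering can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's implementation; the helper name `reorder_by_count` is hypothetical.

```r
# Hypothetical helper: reorder levels by how often each level occurs.
reorder_by_count <- function(f, decreasing = TRUE) {
  counts <- table(f)                 # occurrences per level
  keys <- as.integer(counts)
  if (decreasing) keys <- -keys      # negate so order() puts frequent levels first
  # order() leaves tied levels in their existing relative order
  factor(f, levels = names(counts)[order(keys)])
}

factor_vec <- factor(c('apple', 'banana', 'apple', 'cherry', 'banana', 'banana', 'date'))
by_count_desc <- reorder_by_count(factor_vec)
by_count_asc <- reorder_by_count(factor_vec, decreasing = FALSE)
levels(by_count_desc)
levels(by_count_asc)
```

Only the level order changes; the data values stay in their original positions.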
fct_sub Reorders the levels of a factor vector based on substrings extracted from the factor levels.
factor_vec <- factor(c('Apple', 'banana', 'Cherry', 'date', 'Fig', 'grape'))
# Reorder based on substring from position 2 to 4
fct_sub(factor_vec, start_pos = 2, end_pos = 4)
## [1] banana date Cherry Fig Apple grape
## Levels: banana date Cherry Fig Apple grape
# Reorder from position 3 to end, case-sensitive
fct_sub(factor_vec, start_pos = 3, case = TRUE)
## [1] grape Cherry Fig banana Apple date
## Levels: grape Cherry Fig banana Apple date
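The substring-based ordering can be sketched in base R with `substr()` and `order()`. This is an illustration of the idea only, assuming that case-insensitive comparison means lowercasing the extracted keys; it is not taken from the package source.

```r
factor_vec <- factor(c('Apple', 'banana', 'Cherry', 'date', 'Fig', 'grape'))
lv <- levels(factor_vec)

# Keys from positions 2 to 4, lowercased for a case-insensitive comparison
keys_2_4 <- tolower(substr(lv, 2, 4))
by_sub <- factor(factor_vec, levels = lv[order(keys_2_4)])

# Keys from position 3 to the end of each level, compared as-is
keys_3_end <- substr(lv, 3, nchar(lv))
by_tail <- factor(factor_vec, levels = lv[order(keys_3_end)])

levels(by_sub)
levels(by_tail)
```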
fct_freq Reorders the levels of a factor vector based on the total frequency of characters appearing in the vector.
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date', 'banana', 'apple', 'fig'))
# Reorder levels based on total character frequency
fct_freq(factor_vec)
## [1] apple banana cherry date banana apple fig
## Levels: banana apple date cherry fig
# Reorder levels, case-sensitive
factor_vec_case <- factor(c('Apple', 'banana', 'Cherry', 'date', 'banana', 'apple', 'Fig'))
fct_freq(factor_vec_case, case = TRUE)
## [1] Apple banana Cherry date banana apple Fig
## Levels: banana apple Apple date Cherry Fig
fct_char_freq Reorders the levels of a factor vector based on the frequency of characters at specified positions within the data.
factor_vec <- factor(c('apple', 'banana', 'apricot', 'cherry', 'banana', 'banana', 'date'))
# Reorder based on characters at positions 1 and 2
fct_char_freq(factor_vec, positions = 1:2)
## [1] banana banana banana apricot apple date cherry
## Levels: banana apricot apple date cherry
# Reorder, case-sensitive, decreasing order
fct_char_freq(factor_vec, positions = c(1, 3), case = TRUE)
## [1] banana banana banana date cherry apricot apple
## Levels: banana date cherry apricot apple
fct_substr_freq Reorders the levels of a factor vector based on the frequency of substrings extracted from the data.
factor_vec <- factor(c('apple', 'banana', 'apricot', 'cherry', 'banana', 'banana', 'date'))
fct_substr_freq(factor_vec, start_pos = 2, end_pos=3)
## [1] banana banana banana date cherry apricot apple
## Levels: banana date cherry apricot apple
fct_regex_freq Reorders the levels of a factor vector based on the frequency of substrings matching a regular expression.
factor_vec <- factor(c('apple', 'banana', 'apricot', 'cherry', 'blueberry', 'blackberry', 'date'))
# Reorder based on pattern matching 'a'
fct_regex_freq(factor_vec, pattern = 'a')
## [1] date blackberry banana apricot apple cherry blueberry
## Levels: date blackberry banana apricot apple cherry blueberry
# Reorder with case-sensitive matching
fct_regex_freq(factor_vec, pattern = '^[A-Z]', case = TRUE)
## [1] date cherry blueberry blackberry banana apricot apple
## Levels: date cherry blueberry blackberry banana apricot apple
fct_split Splits the levels of a factor vector using specified patterns or positions and reorders based on specified parts or criteria.
# Example factor vector with patterns
factor_vec <- factor(c('item1-sub1', 'atem2_aub2', 'item3|sub3', 'item1-sub4'))
# Split by patterns '-', '_', or '|' and reorder based on the first part
fct_split(factor_vec, split_pattern = c('-', '_', '\\|'), part = 1)
## [1] item1-sub1 atem2_aub2 item3|sub3 item1-sub4
## Levels: atem2_aub2 item1-sub1 item1-sub4 item3|sub3
# Use the second pattern '_' for splitting
fct_split(factor_vec, split_pattern = c('-', '_', '\\|'), use_pattern = 2, part = 2)
## [1] item1-sub1 atem2_aub2 item3|sub3 item1-sub4
## Levels: item1-sub1 item3|sub3 item1-sub4 atem2_aub2
# Reorder based on character frequencies in the specified part
fct_split(factor_vec, split_pattern = '-', part = 2, char_freq = TRUE)
## [1] item1-sub1 atem2_aub2 item3|sub3 item1-sub4
## Levels: atem2_aub2 item3|sub3 item1-sub1 item1-sub4
fct_len Reorders the levels of a factor vector based on the character length of each level.
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date'))
# Sort levels by length
fct_len(factor_vec)
## [1] apple banana cherry date
## Levels: date apple banana cherry
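A rough base-R equivalent of length-based ordering, shown only to make the behaviour concrete (a sketch, not the package's code):

```r
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date'))
lv <- levels(factor_vec)

# Order levels by number of characters; ties keep their existing order
by_length <- factor(factor_vec, levels = lv[order(nchar(lv))])
levels(by_length)
```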
fct_sort Sorts the levels of a factor vector based on the values of another vector or a column from a data frame. Handles cases where the sorting vector may contain NAs.
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date'))
by_vec <- c(2, 3, 1, NA)
fct_sort(factor_vec, by = by_vec)
## [1] apple banana cherry date
## Levels: cherry apple banana date
# Example using a data frame column
data <- data.frame(
Category = factor(c('apple', 'banana', 'cherry', 'date')),
Value = c(2, 3, 1, NA)
)
fct_sort(data$Category, by = data$Value)
## [1] apple banana cherry date
## Levels: cherry apple banana date
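The same ordering can be reproduced in base R. The sketch below assumes that each level takes the sorting value found at its first occurrence and that NA keys sort last; both are assumptions made for illustration, not documented behaviour of fct_sort.

```r
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date'))
by_vec <- c(2, 3, 1, NA)

# One key per level: the value at the level's first occurrence (assumption)
level_keys <- by_vec[match(levels(factor_vec), as.character(factor_vec))]

# na.last = TRUE places levels with a missing key at the end
sorted <- factor(factor_vec,
                 levels = levels(factor_vec)[order(level_keys, na.last = TRUE)])
levels(sorted)
```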
fct_sort_custom Reorders the levels of a factor vector based on a custom function applied to each level.
factor_vec <- factor(c('apple', 'banana', 'cherry'))
# Sort levels by reverse alphabetical order
fct_sort_custom(factor_vec, function(x) -rank(x))
## [1] apple banana cherry
## Levels: cherry banana apple
# Sort levels by length of the level name
fct_sort_custom(factor_vec, function(x) nchar(x))
## [1] apple banana cherry
## Levels: apple banana cherry
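Conceptually, a custom sort computes one key per level and orders the levels by those keys. A minimal base-R sketch follows; the helper name `sort_levels_by` is hypothetical and the package may differ in details such as tie handling.

```r
# Hypothetical helper: order levels by the keys returned from key_fun
sort_levels_by <- function(f, key_fun) {
  lv <- levels(f)
  factor(f, levels = lv[order(key_fun(lv))])
}

factor_vec <- factor(c('apple', 'banana', 'cherry'))
rev_alpha <- sort_levels_by(factor_vec, function(x) -rank(x))
by_nchar <- sort_levels_by(factor_vec, nchar)
levels(rev_alpha)
levels(by_nchar)
```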
fct_replace Replaces a specified level in a factor vector with a new level. If a position is provided, the new level is inserted at the specified position among the levels; otherwise, the original level order is preserved.
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date', 'fig', 'grape'))
# Replace 'banana' with 'blueberry' and keep the original level order
fct_replace(factor_vec, old_level = 'banana', new_level = 'blueberry')
## [1] apple blueberry cherry date fig grape
## Levels: apple blueberry cherry date fig grape
# Replace 'banana' with 'blueberry' and place the new level at position 2
fct_replace(factor_vec, old_level = 'banana', new_level = 'blueberry', position = 2)
## [1] apple blueberry cherry date fig grape
## Levels: apple blueberry cherry date fig grape
fct_replace_pattern Replaces parts of the factor levels that match a specified pattern with a new string.
factor_vec <- factor(c('apple_pie', 'banana_bread', 'cherry_cake'))
# Replace '_pie', '_bread', '_cake' with '_dessert'
fct_replace_pattern(factor_vec, pattern = '_.*', replacement = '_dessert')
## [1] apple_dessert banana_dessert cherry_dessert
## Levels: apple_dessert banana_dessert cherry_dessert
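For comparison, the same renaming can be done in base R by rewriting the level labels directly. This is a sketch for orientation rather than the package's implementation.

```r
factor_vec <- factor(c('apple_pie', 'banana_bread', 'cherry_cake'))

renamed <- factor_vec
# Rewrite the labels; levels that end up with identical labels would be merged
levels(renamed) <- sub('_.*', '_dessert', levels(renamed))
levels(renamed)
```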
fct_filter_freq Filters out factor levels that occur less than a specified frequency threshold and recalculates character frequencies excluding the removed levels. Offers options to handle NA values and returns additional information.
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date', 'banana', 'apple', 'fig', NA))
# Filter levels occurring less than 2 times and reorder by character frequency
fct_filter_freq(factor_vec, min_freq = 2)
## [1] apple banana banana apple
## Levels: banana apple
# Filter levels, remove NA values, and return additional information
result <- fct_filter_freq(factor_vec, min_freq = 2, na.rm = TRUE, return_info = TRUE)
result$filtered_factor
## [1] apple banana banana apple
## Levels: banana apple
result$removed_levels
## [1] "cherry" "date" "fig"
result$char_freq_table
## all_chars
## a b e l n p
## 8 2 2 2 4 4
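The character frequency table above can be checked by hand in base R. The sketch below filters by level count and then tabulates the characters of the remaining elements; it does not reproduce the reordering of levels that fct_filter_freq applies, and it is not the package's code.

```r
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date', 'banana', 'apple', 'fig', NA))

# Levels that occur at least twice
counts <- table(factor_vec)
keep <- names(counts)[counts >= 2]

# Drop NA values and infrequent levels
kept <- droplevels(factor_vec[!is.na(factor_vec) & factor_vec %in% keep])

# Character frequencies across the remaining elements
char_freq <- table(unlist(strsplit(as.character(kept), '')))
char_freq
```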
fct_filter_pos Removes factor levels where a specified character appears at specified positions within the levels.
factor_vec <- factor(c('apple', 'banana', 'apricot', 'cherry', 'date', 'fig', 'grape'))
# Remove levels where 'a' appears at position 1
fct_filter_pos(factor_vec, positions = 1, char = 'a')
## [1] banana cherry date fig grape
## Levels: banana cherry date fig grape
# Remove levels where 'e' appears at positions 2 or 3
fct_filter_pos(factor_vec, positions = c(2, 3), char = 'e')
## [1] apple banana apricot date fig grape
## Levels: apple apricot banana date fig grape
# Case-sensitive removal
factor_vec_case <- factor(c('Apple', 'banana', 'Apricot', 'Cherry', 'Date', 'Fig', 'grape'))
fct_filter_pos(factor_vec_case, positions = 1, char = 'A', case = TRUE)
## [1] banana Cherry Date Fig grape
## Levels: Cherry Date Fig banana grape
fct_remove_levels Removes specified levels from a factor vector, keeping the remaining levels and their order unchanged.
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date', 'fig', 'grape'))
# Remove levels 'banana' and 'date'
fct_remove_levels(factor_vec, levels_to_remove = c('banana', 'date'))
## [1] apple cherry fig grape
## Levels: apple cherry fig grape
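In base R, the equivalent operation is a subset followed by `droplevels()`. A brief sketch for comparison:

```r
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date', 'fig', 'grape'))
to_remove <- c('banana', 'date')

# Drop the matching elements, then discard the now-unused levels
trimmed <- droplevels(factor_vec[!factor_vec %in% to_remove])
levels(trimmed)
```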
fct_filter_func Removes levels from a factor vector based on a user-defined function.
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date'))
# Remove levels that start with 'b'
fct_filter_func(factor_vec, function(x) !grepl('^b', x))
## [1] apple <NA> cherry date
## Levels: apple cherry date
fct_merge_similar Merges levels of a factor that are similar based on string distance.
factor_vec <- factor(c('apple', 'appel', 'banana', 'bananna', 'cherry'))
# Merge similar levels
fct_merge_similar(factor_vec, max_distance = 1)
## [1] apple appel banana banana cherry
## Levels: appel apple banana cherry
fct_concat Combines multiple factor vectors into a single factor, unifying the levels.
factor_vec1 <- factor(c('apple', 'banana'))
factor_vec2 <- factor(c('cherry', 'date'))
# Concatenate factors
concatenated_factor <- fct_concat(factor_vec1, factor_vec2)
levels(concatenated_factor)
## [1] "apple" "banana" "cherry" "date"
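A helper of this kind is useful because calling `c()` on factors has not always preserved labels (older versions of R returned the underlying integer codes). A base-R sketch of the idea, not the package's code, converts to character first and rebuilds the factor over the union of levels:

```r
factor_vec1 <- factor(c('apple', 'banana'))
factor_vec2 <- factor(c('cherry', 'date'))

# Combine the values as character, then rebuild with the unified level set
combined <- factor(
  c(as.character(factor_vec1), as.character(factor_vec2)),
  levels = union(levels(factor_vec1), levels(factor_vec2))
)
levels(combined)
```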
fct_combine Combines two vectors, which may be of unequal lengths, into a factor vector and sorts based on the levels of either the first or second vector.
vector1 <- c('apple', 'banana', 'cherry')
vector2 <- c('date', 'fig', 'grape', 'honeydew')
# Combine and sort based on vector1 levels
fct_combine(vector1, vector2, sort_by = 1)
## [1] apple banana cherry date fig grape honeydew
## Levels: apple banana cherry date fig grape honeydew
# Combine and sort based on vector2 levels
fct_combine(vector1, vector2, sort_by = 2)
## [1] apple banana cherry date fig grape honeydew
## Levels: date fig grape honeydew apple banana cherry
fct_insert Inserts one or more new levels into a factor vector immediately after specified target levels. Targets can be identified by exact matches, positions, or pattern-based matching. Supports case sensitivity and handling of NA values. Can handle multiple insertions and maintains the original order of other levels. If a new level already exists in the factor and allow_duplicates is FALSE, it is moved to the desired position without duplication. If allow_duplicates is TRUE, unique duplicates are created.
factor_vec <- factor(c('apple', 'banana', 'cherry', 'date', 'fig', 'grape'))
fct_insert(factor_vec, insert = 'date', target = 'banana', inplace = TRUE)
## [1] apple banana date cherry fig grape
## Levels: apple banana date cherry fig grape
fct_insert(factor_vec, insert = c('date', 'grape'), positions = c(2, 4))
## [1] apple banana cherry date fig grape
## Levels: apple banana date cherry grape fig
fct_insert(factor_vec, insert = 'honeydew', pattern = '^c')
## [1] apple banana cherry date fig grape
## Levels: apple banana cherry honeydew date fig grape
factor_vec_na <- factor(c('apple', NA, 'banana', 'cherry', NA, 'date'))
fct_insert(factor_vec_na, insert = 'lychee', insert_after_na = TRUE)
## Warning in fct_insert(factor_vec_na, insert = "lychee", insert_after_na =
## TRUE): No target levels found for insertion. Returning the original factor.
## [1] apple <NA> banana cherry <NA> date
## Levels: apple banana cherry date
fct_intersect Combines multiple factor vectors and returns a factor vector containing only the levels common to all.
factor_vec1 <- factor(c('apple', 'banana', 'cherry'))
factor_vec2 <- factor(c('banana', 'date', 'cherry'))
factor_vec3 <- factor(c('banana', 'cherry', 'fig'))
# Get intersection of levels
fct_intersect(factor_vec1, factor_vec2, factor_vec3)
## [1] banana cherry banana cherry banana cherry
## Levels: banana cherry
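The output above is consistent with concatenating all inputs and keeping only the elements whose level appears in every factor. A base-R sketch of that reading (an interpretation of the printed result, not the package's source):

```r
factor_vec1 <- factor(c('apple', 'banana', 'cherry'))
factor_vec2 <- factor(c('banana', 'date', 'cherry'))
factor_vec3 <- factor(c('banana', 'cherry', 'fig'))
factors <- list(factor_vec1, factor_vec2, factor_vec3)

# Levels present in every input factor
common <- Reduce(intersect, lapply(factors, levels))

# Concatenate all values and keep only those with a common level
all_values <- unlist(lapply(factors, as.character))
intersected <- factor(all_values[all_values %in% common], levels = common)
intersected
```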
fct_union Combines multiple factor vectors and returns a factor vector containing all unique levels.
factor_vec1 <- factor(c('apple', 'banana'))
factor_vec2 <- factor(c('banana', 'cherry'))
factor_vec3 <- factor(c('date', 'fig'))
# Get union of levels
fct_union(factor_vec1, factor_vec2, factor_vec3)
## [1] apple banana banana cherry date fig
## Levels: apple banana cherry date fig
fct_reorder_within Reorders the levels of a factor vector within groups defined by another factor vector.
data <- data.frame(
item = factor(c('A', 'B', 'C', 'D', 'E', 'F')),
group = factor(c('G1', 'G1', 'G1', 'G2', 'G2', 'G2')),
value = c(10, 15, 5, 20, 25, 15)
)
data <- rbind(data, data)
# Reorder 'item' within 'group' by 'value'
data$item <- fct_reorder_within(data$item, data$group, data$value, mean)
fct_extract Extracts substrings from the levels of a factor vector based on a regular expression pattern and creates a new factor.
factor_vec <- factor(c('item123', 'item456', 'item789'))
# Extract numeric part
fct_extract(factor_vec, pattern = '\\d+')
## [1] 123 456 789
## Levels: 123 456 789
# Extract with capturing group
factor_vec <- factor(c('apple: red', 'banana: yellow', 'cherry: red'))
fct_extract(factor_vec, pattern = '^(\\w+):', capture_group = 1)
## [1] apple banana cherry
## Levels: apple banana cherry
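Pattern extraction of this kind can be approximated in base R with `regexpr()` and `regmatches()`. The sketch assumes every level contains a match, because `regmatches()` silently drops non-matching elements; how fct_extract treats non-matching levels is not covered here.

```r
factor_vec <- factor(c('item123', 'item456', 'item789'))
lv <- levels(factor_vec)

# First run of digits in each level (every level matches in this example)
digits <- regmatches(lv, regexpr('\\d+', lv))

extracted <- factor_vec
levels(extracted) <- digits
extracted
```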
fct_pad_levels Pads the levels of a factor vector with leading characters to achieve a specified width.
# Example factor vector
factor_vec <- factor(c('A', 'B', 'C', 'D'))
# Pad levels to width 4 using '0' as padding character
padded_factor <- fct_pad_levels(factor_vec, width = 4, pad_char = '0')
print(levels(padded_factor))
## [1] "000A" "000B" "000C" "000D"
# Output: "000A" "000B" "000C" "000D"
# Pad levels to width 6 using '%A' as padding string
padded_factor <- fct_pad_levels(factor_vec, width = 6, pad_char = '%A')
print(levels(padded_factor))
## [1] "%A%%A%A" "%A%%A%B" "%A%%A%C" "%A%%A%D"
# Output: "%%A%A" "%%A%B" "%%A%C" "%%A%D"
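For a single padding character, left-padding is easy to reproduce in base R with `strrep()`. The helper `pad_left` below is hypothetical and deliberately handles only the single-character case; multi-character padding strings, as in the second example above, are out of scope for this sketch.

```r
# Hypothetical helper: left-pad strings to a target width with one character
pad_left <- function(x, width, pad_char = '0') {
  n_pad <- pmax(0, width - nchar(x))   # never a negative repeat count
  paste0(strrep(pad_char, n_pad), x)
}

factor_vec <- factor(c('A', 'B', 'C', 'D'))
padded <- factor_vec
levels(padded) <- pad_left(levels(padded), width = 4)
levels(padded)
```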
fct_level_stats Computes statistical summaries for each level of a factor vector based on associated numeric data, in the manner of a group_by followed by summarize.
fct_pattern_replace Replaces substrings in factor levels that match a pattern with a replacement string.
fct_impute Replaces NA values in a factor vector using specified imputation methods.
fct_unique_comb Generates a new factor where each level represents a unique combination of levels from the input factors.
fct_map_func Transforms factor levels by applying a function that can include complex logic.
fct_collapse_lev Collapses specified levels of a factor into new levels based on a grouping list.
fct_duplicates Identifies duplicate levels in a factor vector and returns a logical vector indicating which elements are duplicates.
fct_dummy Generates a data frame of dummy variables (one-hot encoded) from a factor vector.
fct_replace_na Replaces NA values in a factor vector with a specified level.
fct_sample_levels Randomly selects a specified number of levels from a factor vector.
fct_apply Transforms factor levels by applying a function to each level.
fct_encode Converts the levels of a factor vector into numeric codes, optionally using a provided mapping.
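No examples are given for the functions listed above, and their argument names are not shown here, so the sketch below uses base R only. It illustrates the kinds of operations described for fct_replace_na, fct_level_stats, fct_dummy, and fct_encode; the variable names and the 'missing' label are chosen for illustration and are not taken from the package.

```r
f <- factor(c('apple', NA, 'banana', 'apple'))
values <- c(10, 5, 7, 20)

# NA replacement: add a new level, then assign it to the NA entries
filled <- f
levels(filled) <- c(levels(filled), 'missing')
filled[is.na(filled)] <- 'missing'

# Per-level summary of associated numeric data
level_means <- tapply(values, filled, mean)

# Dummy (one-hot) variables: one indicator column per level
dummies <- as.data.frame(model.matrix(~ filled - 1))

# Numeric codes: the position of each element's level in levels(filled)
codes <- as.integer(filled)
```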
The fctutils package provides a comprehensive set of functions to efficiently manage and manipulate factor vectors in R. From ordering and sorting to replacing, filtering, merging, and beyond, these tools enhance your ability to handle categorical data with ease. The additional essential functions further extend the package’s capabilities, ensuring that all common factor operations are covered.
For any questions, please contact guokai8@gmail.com or submit an issue at https://github.com/guokai8/fctutils/issues