inspectr

My first R package, inspectr, is now available on CRAN! You can also check out the latest development version on GitHub. The package consists of functions adapted from a quality control script I developed for performing data checks on large datasets from an educational assessment, then generalized for broader use.

The inspectr package contains two classes of functions: column checks and basic fidelity checks. Column check functions allow the user to check data for fidelity without having to master apply functions, and basic fidelity check functions can be used to facilitate some common checks. The user can also define their own checks to use with the column check functions, making the package generalizable to unique data requirements.

To give you an idea of what inspectr is all about, I’ve included below an HTML version of the introductory vignette that ships with the package.

Example Data Checks with inspectr

Jennifer Brussow

Package format

The inspectr package is designed to perform basic data checks without the user needing to understand the intricacies of apply functions.

The functions can be grouped into two categories: column checks and basic fidelity checks.

Column check functions

These are the basic functions used to perform checks. Each function checks one column for data fidelity, and functions exist to check that column against one or two additional columns. A data frame and a column name (or names) go in; a filtered set of records exhibiting issues comes out (either as a data frame or as an .xlsx document, your choice!).

Basic fidelity check functions

These functions are designed to be used with the column check functions. They perform basic checks on the data, like ensuring that all data in a column are of the same type or ensuring that all values in column 1 are less than their corresponding values in column 2.

Checking an example dataset

To illustrate how to use these functions, let’s apply some basic checks to a sample dataset:

ID_var       FName     Var1  Var2  Perf_Lvl      dates
475410871    David     23    24    Basic         11/3/1985
7702757443   Dorian    5     6     Basic         10/4/1984
KS - Kansas  Silas     51    52    Intermediate  9/3/1981
9674734384�  Jacob     97    98    Advanced      6/24/1979
7522008646   Gabriel   85    86    Advanced      11/5/1983
3062460685   CLAYTON   21    22    Basic         3/25/1987
3462891407   Errell    1     2     Basic         7/7/1983
4327020559   NA        11    12    Basic         9/14/1988
6042592424             3     4     Basic         6/2/1982
4087289192   Dylan     14    14    Basic         5/15/1986
4322037348   Randy     15    16    Basic         10/4/1982
22831223     Caden     35    36    Basic         6/11/1989
3577493348   Aspen     56    28    Basic         5/14/1985
3496003836   Tyreek    29    30    Basic         4/11/1982
4589950456   Aerielle  31    32    Basic         10/9/1981
789524583    Brandon   45    4     Intermediate  2/1/1984
7824406944   Paul      19    20    NA            bad_day
9770520729   Khalyd    17    17    NA            10/6/1988
39965152     Marlayni  63    64    BAsic         8/9/1982
2004790709   Anthony   65    66    Advanced      12/21/1983

Single-column checks

The col_check function is designed to check a single column of data for fidelity to a given check. Several of the basic functions are appropriate for the single column check: numeric_check, character_check, character_blanks_check, date_check, and val_check.

Numeric checks

The numeric_check function checks to ensure all of the values in the column can be coerced into numeric values by as.numeric. For example, in the example dataset, the goal is to ensure that all of the IDs in the ID_var column are numeric.
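
The coercion idea behind this kind of check can be sketched in base R. This is only an illustration of the concept, not inspectr's implementation; the helper name is_numeric_like and the sample vector are hypothetical.

```r
# Hypothetical sample of ID values stored as character strings
ids <- c("475410871", "KS - Kansas", "7522008646")

# TRUE where the value survives coercion. as.numeric() yields NA (with a
# warning) for strings it cannot parse, so the warning is suppressed here.
is_numeric_like <- function(x) {
  !is.na(suppressWarnings(as.numeric(x)))
}

flags <- is_numeric_like(ids)

# The values that would need attention
ids[!flags]
```

Note that a value that is already NA would also be flagged by this sketch, since it cannot be told apart from a failed coercion.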

When the example dataset is checked with numeric_check, the results show two records with non-numeric characters in their ID values. With output = FALSE set in the arguments, the function returns a data frame containing only the records with errors.

col_check("ID_var", dataset, numeric_check, output = FALSE)
#>          ID_var FName Var1 Var2     Perf_Lvl     dates
#> 3   KS - Kansas Silas   51   52 Intermediate  9/3/1981
#> 4 9674734384� Jacob   97   98     Advanced 6/24/1979

Character checks

The character_check and character_blanks_check functions ensure that all of the values in the column can be coerced into character strings by as.character. While character_check does not tolerate blank values, character_blanks_check allows blanks as acceptable values for the purposes of the check. This difference is illustrated by the different results each check yields from the sample dataset:

col_check("FName", dataset, character_check)
#>       ID_var FName Var1 Var2 Perf_Lvl     dates
#> 8 4327020559  <NA>   11   12    Basic 9/14/1988

col_check("FName", dataset, character_blanks_check)
#>       ID_var FName Var1 Var2 Perf_Lvl     dates
#> 8 4327020559  <NA>   11   12    Basic 9/14/1988
#> 9 6042592424          3    4    Basic  6/2/1982

As you can see, neither of these checks tolerates NA values.
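
The distinction between missing (NA) and blank values that these two checks turn on can be illustrated in base R. The sketch below shows the general idea only and is not taken from the package source; the helper names not_missing and not_missing_or_blank are hypothetical.

```r
# Hypothetical sample of first names: one missing (NA) and one blank ("")
fnames <- c("David", NA, "", "Dylan")

# Passes every value that is present, including an empty string
not_missing <- function(x) !is.na(x)

# Additionally rejects empty strings
not_missing_or_blank <- function(x) !is.na(x) & x != ""

lenient <- not_missing(fnames)
strict <- not_missing_or_blank(fnames)
```

Both helpers reject the NA entry; only the second one also rejects the empty string.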

Value check

The val_check function allows the user to input their own values to set the parameters of the check. The user supplies a vector of accepted values to the values argument, and the check ensures that all values in the column are within that set of accepted values. Blank values and NA values are not tolerated by default, though they can be included in the vector of accepted values.

col_check("Perf_Lvl", dataset, val_check, values = c("Basic", "Intermediate", "Advanced"))
#>        ID_var    FName Var1 Var2 Perf_Lvl     dates
#> 17 7824406944     Paul   19   20     <NA>   bad_day
#> 18 9770520729   Khalyd   17   17     <NA> 10/6/1988
#> 19   39965152 Marlayni   63   64    BAsic  8/9/1982
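
To see why adding NA to the accepted values changes the result, here is a minimal base R sketch of set membership. It assumes the check behaves like %in%, which I have not confirmed against the package source; the variable names are hypothetical.

```r
# Hypothetical sample of performance levels, including NA and a typo
perf <- c("Basic", "Advanced", NA, "BAsic")
accepted <- c("Basic", "Intermediate", "Advanced")

# NA is not in the accepted set, so it fails
in_set <- perf %in% accepted

# With NA added to the accepted values, only the typo fails
in_set_na_ok <- perf %in% c(accepted, NA)
```

Matching is case-sensitive, which is why "BAsic" fails in both cases.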

col_check("Var1", dataset, val_check, values = c(1:25))
#>           ID_var    FName Var1 Var2     Perf_Lvl      dates
#> 3    KS - Kansas    Silas   51   52 Intermediate   9/3/1981
#> 4  9674734384�    Jacob   97   98     Advanced  6/24/1979
#> 5     7522008646  Gabriel   85   86     Advanced  11/5/1983
#> 12      22831223    Caden   35   36        Basic  6/11/1989
#> 13    3577493348    Aspen   56   28        Basic  5/14/1985
#> 14    3496003836   Tyreek   29   30        Basic  4/11/1982
#> 15    4589950456 Aerielle   31   32        Basic  10/9/1981
#> 16     789524583  Brandon   45    4 Intermediate   2/1/1984
#> 19      39965152 Marlayni   63   64        BAsic   8/9/1982
#> 20    2004790709  Anthony   65   66     Advanced 12/21/1983

Date check

The date_check function allows the user to input a beginning and end date to set the parameters of the check. The check ensures that all values in the column are equal to or between the specified beginning and end dates and returns all values that do not fall within the given range.

col_check("dates", dataset, date_check, begin = "06/02/1982", end = "11/11/1986")
#>           ID_var    FName Var1 Var2     Perf_Lvl     dates
#> 3    KS - Kansas    Silas   51   52 Intermediate  9/3/1981
#> 4  9674734384�    Jacob   97   98     Advanced 6/24/1979
#> 6     3062460685  CLAYTON   21   22        Basic 3/25/1987
#> 8     4327020559     <NA>   11   12        Basic 9/14/1988
#> 12      22831223    Caden   35   36        Basic 6/11/1989
#> 14    3496003836   Tyreek   29   30        Basic 4/11/1982
#> 15    4589950456 Aerielle   31   32        Basic 10/9/1981
#> 17    7824406944     Paul   19   20         <NA>   bad_day
#> 18    9770520729   Khalyd   17   17         <NA> 10/6/1988
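
The inclusive range logic, and the reason an unparseable value such as bad_day is returned, can be sketched in base R. This is an illustration under the assumption that dates are written as month/day/year; the helper name in_range and its fmt argument are hypothetical and not part of inspectr.

```r
# Hypothetical sample of date strings, including one that cannot be parsed
dates <- c("11/3/1985", "9/3/1981", "6/2/1982", "bad_day")

# TRUE where the value parses as a date and lies within [begin, end]
in_range <- function(x, begin, end, fmt = "%m/%d/%Y") {
  d <- as.Date(x, format = fmt)       # unparseable strings become NA
  lo <- as.Date(begin, format = fmt)
  hi <- as.Date(end, format = fmt)
  !is.na(d) & d >= lo & d <= hi
}

ok <- in_range(dates, "06/02/1982", "11/11/1986")

# The values that fall outside the range or are not dates at all
dates[!ok]
```

The third value equals the begin date and passes, because both ends of the range are inclusive.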

Two-column checks

The two_col_check function is designed to check one column of data against values in another column. Several of the basic functions are appropriate for the two column check: less_than, less_than_equalto, greater_than, and greater_than_equalto.

two_col_check("Var1", "Var2", dataset, less_than)
#>        ID_var   FName Var1 Var2     Perf_Lvl     dates
#> 10 4087289192   Dylan   14   14        Basic 5/15/1986
#> 13 3577493348   Aspen   56   28        Basic 5/14/1985
#> 16  789524583 Brandon   45    4 Intermediate  2/1/1984
#> 18 9770520729  Khalyd   17   17         <NA> 10/6/1988

two_col_check("Var1", "Var2", dataset, less_than_equalto)
#>        ID_var   FName Var1 Var2     Perf_Lvl     dates
#> 13 3577493348   Aspen   56   28        Basic 5/14/1985
#> 16  789524583 Brandon   45    4 Intermediate  2/1/1984

The greater_than and greater_than_equalto functions work similarly. Notice that for these checks, the order of the input columns is reversed: Var2 is the column being checked for fidelity, and Var1 is the reference column.

two_col_check("Var2", "Var1", dataset, greater_than)
#>        ID_var   FName Var1 Var2     Perf_Lvl     dates
#> 10 4087289192   Dylan   14   14        Basic 5/15/1986
#> 13 3577493348   Aspen   56   28        Basic 5/14/1985
#> 16  789524583 Brandon   45    4 Intermediate  2/1/1984
#> 18 9770520729  Khalyd   17   17         <NA> 10/6/1988

two_col_check("Var2", "Var1", dataset, greater_than_equalto)
#>        ID_var   FName Var1 Var2     Perf_Lvl     dates
#> 13 3577493348   Aspen   56   28        Basic 5/14/1985
#> 16  789524583 Brandon   45    4 Intermediate  2/1/1984
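
The difference between the strict and the "or equal to" comparisons comes down to how ties are treated. A minimal base R sketch, using a hypothetical three-row data frame rather than the package functions:

```r
# Hypothetical subset of the example data
scores <- data.frame(
  FName = c("David", "Dylan", "Aspen"),
  Var1 = c(23, 14, 56),
  Var2 = c(24, 14, 28),
  stringsAsFactors = FALSE
)

# Rows failing a strict "Var1 < Var2" rule: ties count as failures
fails_strict <- scores[!(scores$Var1 < scores$Var2), ]

# Rows failing "Var1 <= Var2": ties pass
fails_lenient <- scores[!(scores$Var1 <= scores$Var2), ]
```

Dylan's tied values (14 and 14) appear only in fails_strict, which mirrors the pattern in the output above.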

Three-column checks

As of version 1.0.0, inspectr does not include any basic fidelity check functions that are designed to work with three_col_check. However, you are encouraged to write your own and plug them in! The example below shows a function written to check the Perf_Lvl column against Var1 and Var2 as reference columns. To pass the check, the value of Perf_Lvl has to be one of “Basic”, “Intermediate”, or “Advanced”; or, if Perf_Lvl is NA, Var2 must be even and Var1 must be odd.

This is sort of a silly check, but it illustrates the way a user-defined function can be used with three_col_check. Of course, you can also use user-defined functions with col_check and two_col_check, as well.

three_col_check(colname1 = "Perf_Lvl", colname2 = "Var1", colname3 = "Var2",
                data = dataset, fun = function(col1, col2, col3) {
                  # Pass when Perf_Lvl is one of the accepted labels ...
                  col1 %in% c("Basic", "Intermediate", "Advanced") |
                    # ... or when it is NA, Var2 is even, and Var1 is odd
                    (is.na(col1) & (col3 %% 2 == 0) & (col2 %% 2 == 1))
                })
#>        ID_var    FName Var1 Var2 Perf_Lvl     dates
#> 18 9770520729   Khalyd   17   17     <NA> 10/6/1988
#> 19   39965152 Marlayni   63   64    BAsic  8/9/1982
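
A user-defined check for a single column can follow the same pattern as the anonymous function above: take the column values and return TRUE for each value that passes. The sketch below is a hypothetical example, tested here on a plain vector rather than through the package; the function name name_case_check and the sample values are my own inventions.

```r
# Hypothetical user-defined check: TRUE where a name starts with one
# uppercase letter followed only by lowercase letters
name_case_check <- function(col) {
  grepl("^[[:upper:]][[:lower:]]+$", col)
}

# Hypothetical sample values: a valid name, all caps, NA, and a blank
fnames <- c("David", "CLAYTON", NA, "")
passes <- name_case_check(fnames)
```

Assuming col_check accepts any function of this shape, name_case_check could presumably be supplied in place of character_check in a call like the ones shown earlier.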