Built 2022-01-01 using NMdata 0.0.9.954.
This vignette is still under development. Please make sure to see the latest version available here.
Getting data ready for modeling is a crucial and often underestimated task. Mistakes during the process of combining data sets, defining time variables etc. can lead to difficulties during modeling, a need for revisiting the data set preparation, and in the worst case wasted time working with an erroneous data set. Avoiding those mistakes by integrating checks into the data preparation process is a key element in an efficient and reliable data preparation work flow.
Furthermore, Nonmem has a number of restrictions on the format of the input data, and problems with the data set are a common reason for Nonmem not to behave as expected. When this happens, debugging can be time-consuming. NMdata includes some simple functions to prevent these situations.
This vignette uses data.table syntax for the little bit of data manipulation performed. However, you don’t need to use data.table at all to use these or any other tools in NMdata. The data set is a data.table:
pk <- readRDS(file = system.file("examples/data/xgxr2.rds", package = "NMdata"))
class(pk)
#> [1] "data.table" "data.frame"
If you are not familiar with data.table, you can keep reading this vignette and learn what NMdata can do. In all brevity, data.table is a powerful enhancement of the data.frame class. The syntax differs from data.frame, and in the few places where this affects the examples provided here, explanations will be given.
When stacking (rbind) and merging, it is most often necessary to check if two or more data sets are compatible for the operation. compareCols compares columns across two or more data sets.

To illustrate the output of compareCols, a slightly modified version of the pk dataset has been created. One column (CYCLE) has been removed, and AMT has been recoded to character.
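The construction of pk.reduced is not shown in this excerpt. Below is a minimal sketch, assuming data.table syntax, of how such a data set could be put together (the row subset and the two modifications are inferred from the compareCols output that follows):

pk.reduced <- copy(pk)
## keep every other row (751 of the 1502 rows)
pk.reduced <- pk.reduced[seq(1, nrow(pk.reduced), by = 2)]
## drop CYCLE and recode AMT to character
pk.reduced[, CYCLE := NULL]
pk.reduced[, AMT := as.character(AMT)]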
compareCols tells us about exactly these two differences:
compareCols(pk, pk.reduced)
#> Dimensions:
#> data nrows ncols
#> 1: pk 1502 24
#> 2: pk.reduced 751 23
#>
#> Columns that differ:
#> column pk pk.reduced
#> 1: CYCLE integer <NA>
#> 2: AMT integer character
Before merging or stacking, we may want to recode AMT in one of the datasets to get the class we need, and decide what to do about the CYCLE column which is missing in one of the datasets (add information or fill with NA?).
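Building on the hypothetical pk.reduced from the sketch above, reconciling the two differences before stacking could look like this (a sketch, not part of the original workflow):

## recode AMT to numeric and add a CYCLE column filled with NA
pk.reduced[, AMT := as.numeric(AMT)]
pk.reduced[, CYCLE := NA_integer_]
## stack the two data sets, matching columns by name
pk.all <- rbind(pk, pk.reduced, use.names = TRUE)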
The model estimation step depends heavily on numeric data values (in Nonmem it is almost entirely based on them). The source data will often contain character variables, i.e. columns with non-numeric data values. If the column names reflect whether the values are numeric, double-checking can be avoided. renameByContents renames columns if a function of their contents returns TRUE.
pk.renamed <- renameByContents(data = pktmp, fun.test = NMisNumeric,
    fun.rename = tolower, invert.test = TRUE)
We make use of the function NMisNumeric, which tests whether Nonmem can interpret the contents as numeric. Even if, say, the subject ID is of character class, it can be valid to Nonmem: subject ID "1039" will be numeric in Nonmem, "1-039" will not. NMisNumeric returns TRUE if and only if all elements are either missing or interpretable as numeric. We invert the condition (invert.test=TRUE), so the names of the columns that Nonmem cannot interpret as numeric become lowercase. We use compareCols to illustrate that three columns were renamed:
compareCols(pktmp, pk.renamed)
#> Dimensions:
#> data nrows ncols
#> 1: pktmp 1502 23
#> 2: pk.renamed 1502 23
#>
#> Columns that differ:
#> column pktmp pk.renamed
#> 1: EVENTU character <NA>
#> 2: NAME character <NA>
#> 3: TIMEUNIT character <NA>
#> 4: eventu <NA> character
#> 5: name <NA> character
#> 6: timeunit <NA> character
We can now easily see that if we wish to include the information contained in eventu, name, and timeunit in the model, we have to modify or translate their contents first.
Merges are a very common source of data creation bugs. As simple as they may seem, merges can leave you with an unexpected number of rows, some repeated and/or some omitted. Often, we can impose restrictions on the merge operation that allow for automated validation of the results.

Imagine the very common example that we have a longitudinal PK data set (called pk), and we want to add subject-level covariates from a secondary data set (dt.cov). We want to merge by ID, and all we can allow to happen is that columns are added to pk from dt.cov. If rows disappear or get repeated, or if columns get renamed, it’s unintended and should return an error.
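dt.cov is not created in this excerpt. A hypothetical construction that matches the situation described below (a single covariate column COV, a duplicated record for subject 31, and no record for subject 180) could look like this sketch:

## one row per subject in pk
dt.cov <- pk[, .(ID = unique(ID))]
## hypothetical covariate values
dt.cov[, COV := sample(c(40, 60), .N, replace = TRUE)]
## subject 180 has no covariate record, and subject 31 is accidentally duplicated
dt.cov <- dt.cov[ID != 180]
dt.cov <- rbind(dt.cov, dt.cov[ID == 31])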
We merge the two data sets and the check of the dimensions raises no alarm - the number of rows is unchanged from pk to pk2, and one of two columns in dt.cov was added.
pk2 <- merge(pk, dt.cov, by = "ID")
dims(pk, dt.cov, pk2)
#> data nrows ncols
#> 1: pk 1502 24
#> 2: dt.cov 150 2
#> 3: pk2 1502 25
What we didn’t realize is that we now have twice as many rows for subject 31.
pk[ID == 31, .N]
#> [1] 10
pk2[ID == 31, .N]
#> [1] 20
Using mergeCheck, we get an error. This is because mergeCheck compares the actual rows going in and out of the merge and not just the dimensions:
try(mergeCheck(pk, dt.cov, by = "ID"))
#> Rows disappeared during merge.
#> Rows duplicated during merge.
#> Overview of dimensions of input and output data:
#> data nrows ncols
#> 1: pk 1502 25
#> 2: dt.cov 150 2
#> 3: merged.df 1502 26
#> Overview of values of by where number of rows in df1 changes:
#> ID N.df1 N.result
#> 1: 31 10 20
#> 2: 180 10 0
#> Error in mergeCheck(pk, dt.cov, by = "ID") :
#> Merge added and/or removed rows.
Notice that mergeCheck tells us for which values of ID the input and output differ, so we can quickly look into the data sets and decide how we want to handle this. In this case we discard the covariate value for subject 31 and use the all.x=TRUE argument to get NA for subjects 31 and 180:
dt.cov2 <- dt.cov[ID != 31]
pk2.check <- mergeCheck(pk, dt.cov2, by = "ID", all.x = TRUE)
#> The following columns were added: COV
To ensure the consistency of rows before and after the merge, you could use merge(..., all.x=TRUE) and then check dimensions before and after (yes, both all.x=TRUE and the dimension check are necessary). This is not needed if you use mergeCheck.
mergeCheck does not try to reimplement merging. Under the hood, the merging is performed by data.table::merge.data.table, to which most arguments are passed. What mergeCheck does is add the checks that the results are consistent with the criteria outlined above. data.table::merge.data.table is generally very fast, and even though mergeCheck performs a bit of extra computation, it should never be slow.
Another problem the programmer may not realize during a merge arises when column names are shared across x1 and x2 (in addition to the columns being merged by). This will silently create column names like col.x and col.y in the output. mergeCheck will by default give a warning if that happens (this can be modified using the fun.commoncols argument). Also, there is an optional argument to tell mergeCheck how many columns are expected to be added by the merge, and mergeCheck will fail if a different number of columns is added. This can be useful for programming.
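If the number of added columns is known in advance, such a check could look like the sketch below (the argument name ncols.expect is an assumption here; see ?mergeCheck in your installed version for the exact interface):

## fail unless exactly one column (COV) is added by the merge
pk2.check <- mergeCheck(pk, dt.cov2, by = "ID", all.x = TRUE, ncols.expect = 1)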
The row order is by default maintained by mergeCheck. Apart from this, there is only one difference from the behavior of the merge.data.frame function syntax: the by argument must always be supplied to mergeCheck. The default behavior of merge.data.frame is to merge by all common column names, but for coding transparency, this is intentionally not allowed by mergeCheck.
In summary, mergeCheck verifies that the rows resulting from the merge are exactly the same as in one of the existing datasets, with only columns added from the second input dataset. You may think that this will limit your merges, and that you need merges for inner and outer joins etc. You are exactly right - mergeCheck is not intended for those merges and does not support them. That said, the kind of merges supported by mergeCheck are indeed very common. All merges in the NMdata package are performed with mergeCheck.
It is good practice not to discard records from a dataset but to flag them and omit them in the model estimation. When reporting the analysis, we also need to account for how many data records were discarded due to which criteria. A couple of functions in NMdata help you do this in a way that is easy to integrate with Nonmem.

The implementation in NMdata is based on sequentially checking exclusion conditions. This means we can summarize how many records and subjects were excluded from the analysis due to the different criteria. The information is represented in one numerical column for Nonmem, and one (value-to-value corresponding) character column for the rest of us, in the resulting data.
For use in Nonmem, the easiest is that inclusion/exclusion is determined by a single column in data - we call that column FLAG here, but any column name can be used. FLAG obviously draws on information from other columns such as TIME, DV, and many others, depending on your dataset and your way of working.

The function that applies inclusion/exclusion rules is called flagsAssign, and it takes a dataset and a data.frame of rules as arguments.
dt.flags <- fread(text = "FLAG,flag,condition
10,Below LLOQ,BLQ==1
100,Negative time,TIME<0")

pk <- flagsAssign(pk, tab.flags = dt.flags, subset.data = "EVID==0")
#> Coding FLAG = 100, flag = Negative time
#> Coding FLAG = 10, flag = Below LLOQ
pk <- flagsAssign(pk, subset.data = "EVID==1", flagc.0 = "Dosing")
fread is used to create a data.table (like read.csv creates a data.frame); the text is supplied inline for readability, with one line of text for each row of the resulting data.table. flagsAssign applies the conditions sequentially and by decreasing value of FLAG. FLAG=0 means that the observation is included in the analysis. You can use any expression that can be evaluated within the data.frame; in this case, BLQ has to exist in pk.
Finally, flags are assigned to EVID==1 rows. Here, no flag table is used. This means that all EVID==1 rows will get FLAG=0.
In Nonmem, you can include IGNORE=(FLAG.NE.0) in $DATA or $INFILE.
Again, the omission will be attributed to the first condition matched. The default is to apply the conditions in order of decreasing numerical flag value; use flags.increasing=TRUE if you prefer the opposite. However, what cannot be modified is that 0 is the numerical value for rows that are not matched by any condition.
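As a sketch, applying the conditions by increasing flag value could look like this (shown with flagsAssign; the flags.increasing argument is the one mentioned above):

pk <- flagsAssign(pk, tab.flags = dt.flags, subset.data = "EVID==0",
                  flags.increasing = TRUE)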
Which rows to omit from a dataset can vary from one analysis to another. Hence, the aim of the chosen design is that the inclusion criteria can be changed and applied to overwrite an existing inclusion/exclusion selection. Say that for another analysis we want to include the observations below LLOQ. We have two options: either we simply change the IGNORE statement given above to IGNORE=(FLAG.LT.10), or we create a different exclusion flag for that analysis. If you prefer to create a new set of exclusion flags, just use new names for the numerical and the character flag columns so you don’t overwrite the old ones. See the help of flagsAssign and flagsCount for how.
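A hedged sketch of such an alternative set of flags under new column names (the table below and the argument names col.flagn and col.flagc are assumptions; check ?flagsAssign for the exact interface):

## hypothetical flag table that keeps observations below LLOQ in the analysis
dt.flags2 <- fread(text = "FLAG,flag,condition
100,Negative time,TIME<0")
pk <- flagsAssign(pk, tab.flags = dt.flags2, subset.data = "EVID==0",
                  col.flagn = "FLAG2", col.flagc = "flag2")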
An overview of the number of observations disregarded due to the different conditions is then obtained using flagsCount:
tab.count <- flagsCount(data = pk[EVID == 0], tab.flags = dt.flags)
print(tab.count)
#> flag N.left Nobs.left N.discard N.disc.cum Nobs.discard
#> 1: All available data 150 1352 NA 0 NA
#> 2: Negative time 150 1350 0 0 2
#> 3: Below LLOQ 131 755 19 19 595
#> 4: Analysis set 131 755 NA 19 NA
#> Nobs.disc.cum
#> 1: 0
#> 2: 2
#> 3: 597
#> 4: 597
flagsCount includes a file argument to save the table right away.
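As a sketch (the file path is hypothetical):

tab.count <- flagsCount(data = pk[EVID == 0], tab.flags = dt.flags,
                        file = "derived/flags_pk.csv")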
Once the dataset is in place, NMdata provides a few useful functions to ensure the formatting of the written data is compatible with Nonmem. These functions include checks that Nonmem will be able to interpret the data as intended, and more features are under development in this area.

The order of columns in Nonmem is important for two reasons. One is that a character value in a variable read into Nonmem will make the run fail. The other is that there are restrictions on the number of variables you can read into Nonmem, depending on the version. NMorderColumns tries to put the used columns first, and other or maybe even unusable columns in the back of the dataset. It does so by a mix of recognition of column names and analysis of the column contents.
Columns that cannot be converted to numeric are put in the back, while columns bearing standard Nonmem variable names like ID, TIME, EVID etc. will be pulled up front. You can of course add column names to prioritize to the front (first) or the back (last). See ?NMorderColumns for more options.
pk <- NMorderColumns(pk)
We may want to add MDV and rerun NMorderColumns.
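A minimal sketch of that step (assuming the common Nonmem convention of MDV=1 for rows without an observation):

## MDV: 1 for dosing (and other non-observation) records, 0 for observations
pk[, MDV := as.numeric(EVID != 0)]
pk <- NMorderColumns(pk)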
For the final step of writing the dataset, NMwriteData is provided. Most importantly, it writes a csv file with appropriate options for Nonmem to read it as well as possible. It can also write an rds for R with equal contents (or RData if you prefer), the rds including all the information (such as factor levels) which cannot be saved in csv. If you then use NMscanData to read Nonmem results, this information can be used automatically. NMwriteData also provides a proposal for text to include in the $INPUT and $DATA sections of the Nonmem control streams.
NMwriteData(pk)
#> Data _not_ witten to any files.
#> For NONMEM:
#> $INPUT ROW ID NOMTIME TIME EVID CMT AMT DV FLAG STUDY BLQ CYCLE DOSE
#> PART PROFDAY PROFTIME WEIGHTB eff0
#> $DATA <data file>
#> IGN=@
#> IGNORE=(FLAG.NE.0)
Notice that NMwriteData detected the exclusion flag and suggests including it in $DATA.
If a file name had been provided, the data would have been written, and the path to the data file would have been included in the message written back to the user. There are several arguments that will affect the proposed text for the Nonmem run; see ?NMwriteData.
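As a sketch, writing the files could look like this (the path is hypothetical; depending on the arguments, an rds with the same contents can be written alongside the csv):

NMwriteData(pk, file = "derived/pk.csv")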
I may be the only one, but sometimes during the modeling stage, I want to go back and change or add something in the data creation step. Then, once I have written a new data file, my Nonmem $INPUT sections no longer match the data file. In NMwriteData you can use the last argument to get columns pushed towards the back so the Nonmem runs should still work, but maybe you need the new column in your Nonmem runs, and then there is no way around updating the control streams. And that can be quite a lot of control streams.
NMdata has a couple of functions to extract and write sections of Nonmem control streams, called NMreadSection and NMwriteSection. We are not going into detail with what these functions can do, but let’s stick to the example above. We can do
NMwriteSection("run001.mod", "INPUT", "$INPUT ROW ID TIME EVID CMT AMT DV FLAG STUDY BLQ CYCLE DOSE FLAG2 NOMTIME PART PROFDAY PROFTIME WEIGHTB eff0")
But in fact, we can go a step further and take the information straight from NMwriteData:
text.nm <- NMwriteData(pk)
#> Data _not_ witten to any files.
#> For NONMEM:
#> $INPUT ROW ID NOMTIME TIME EVID CMT AMT DV FLAG STUDY BLQ CYCLE DOSE
#> PART PROFDAY PROFTIME WEIGHTB eff0
#> $DATA <data file>
#> IGN=@
#> IGNORE=(FLAG.NE.0)
NMwriteData invisibly returns a list of sections ($INPUT and $DATA). NMwriteSection can use these directly. So to write only the $INPUT section to run001.mod, we do the following. Please notice the single brackets in text.nm["INPUT"], which mean that we still send a list to NMwriteSection.
NMwriteSection("run001.mod", list.sections = text.nm["INPUT"])
If you run this in a loop over the control streams that use the created data set, you are all set to rerun the models as needed.
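Such a loop could look like this sketch (the control stream names are hypothetical):

for (mod in c("run001.mod", "run002.mod", "run003.mod")) {
    NMwriteSection(mod, list.sections = text.nm["INPUT"])
}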
The last couple of functions that will be introduced here are used for tracing datasets to data creation scripts, including time stamps and other information you want to include with the data set.
pk <- NMstamp(pk, script = "vignettes/DataCreate.Rmd")
NMinfo(pk)
#> $dataCreate
#> $dataCreate$DataCreateScript
#> [1] "vignettes/DataCreate.Rmd"
#>
#> $dataCreate$CreationTime
#> [1] "2022-01-01 21:00:17 EST"
The script argument is recognized by NMstamp, but you can add anything to this. Say you want to keep a descriptive note too:
pk <- NMstamp(pk, script = "vignettes/DataCreate.Rmd", Description = "A PK dataset used for examples.")
NMinfo(pk)
#> $dataCreate
#> $dataCreate$DataCreateScript
#> [1] "vignettes/DataCreate.Rmd"
#>
#> $dataCreate$CreationTime
#> [1] "2022-01-01 21:00:17 EST"
#>
#> $dataCreate$Description
#> [1] "A PK dataset used for examples."
These are very simple functions, but they are simple to use as well, and hopefully they will help you avoid sitting with a data set trying to guess which script generated it so you can make a modification or understand how something was done.
When using NMwriteData, you don’t have to call NMstamp explicitly. Just pass the script argument to NMwriteData, and NMstamp will be applied automatically.
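A minimal sketch of that (the file path is hypothetical):

NMwriteData(pk, file = "derived/pk.csv", script = "vignettes/DataCreate.Rmd")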