Inclusion/Exclusion of rows in a DataFrame, based on specific criteria

 

Questions


I have a large set of data that contains pathology test data for a number of individuals. I present a scaled down data set describing the types of cases.

library(plyr)
library(tidyr)
library(dplyr)
library(lubridate)

options(stringsAsFactors = FALSE)
dat <- structure(list(PersID = c("am1", "am2", "am2", "am3", "am3", "am4", "am4", "am4", "am4", "am4", "am4"), Sex = c("M", "F","F", "M", "M", "F", "F", "F", "F", "F", "F"), DateTested = c("21/10/2015", "9/07/2010", "24/09/2010", "23/10/2013", "25/10/2013", "28/04/2010", "23/06/2010", "21/07/2010", "20/10/2010", "4/03/2011", "2/12/2011"), Res = c("NR", "R", "R", "NR", "R", "R", "R", "R", "R", "R", "R"), Status = c("Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No"), DateOrder = c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("PersID", "Sex", "DateTested", "Res", "Status", "DateOrder"), class = "data.frame", row.names = c(NA, -11L))

The data describes three types of person (1)those with a single result only (2) those with 2 results, and (3) those with many results.

My goal is to come up with a script that will only include rows for individuals according to a set of criteria. Technically it is a method to only count rows for individuals if their subsequent results are within a specified reinfection period (30 days).

I have converted my data to a list and passed a number of functions to it to start processing the data.

dat$DateTested <- dmy(dat$DateTested)
datList <- dlply(.data=dat, .variables=c('PersID'))

What I have done so far is:

Select all rows where there is a single result per person

fnSingleTests <- function(y){
    y <- y[length(y$DateOrder)==1,]
}

singleTests <- ldply(datList, fnSingleTests, .id = NULL)

Convert the data frame to a list and pass a function that
determines if (a) there are two rows per person within the 30-day
reinfection period, then select the first one, and (b) if there are
more than two rows per person, and the last record and the first
record are within 30 days, only keep the first one.

fnMultiTests <- function(y){
    y <- y[length(y$DateOrder) > 1,]
}

multiTests <- llply(datList, fnMultiTests)

fnMultiTestsSplit <- function(y){

    test <- difftime(y$DateTested[length(y$DateTested)], y$DateTested[1], units='days')


    if (nrow(y) <=2){

        if (test < 31){
            y <- y[y$DateOrder == 1, ]
            y <- y[!is.na(y$PerdID), ]
        } else {
            y <- y[y$DateOrder %in% 1:2, ]
            y <- y[!is.na(y$PersID), ]
        }

    } else  {
        if (test < 31){
            y <- y[y$DateOrder == 1, ]
            y <- y[!is.na(y$PersID), ]
        } else {
            break()
        }

    }
}

finalTests <-  ldply(multiTests, failwith(NULL, fnMultiTestsSplit, quiet = TRUE), .id = NULL)

I can then combine data frames with rbind:

allFinalTests <- rbind(singleTests, finalTests)

Where I am stuck is for cases where there are more than two rows per person, and within sequential rows there may be cases of a period of time greater than the 30-day reinfection period.

Can anyone suggest how I could extend this code to include only cases where there are more than two PersID and then only include results where there are subsequent cases occur outside the 30 day reinfection period.

Specifically, start from the oldest case and if the next case is within 30 days then exclude the second cases, or if the second case is more than 30 days since the previous case, then include both cases. It should do this for all cases for the same PersID

In this example the final output I am looking for is:

PersID  Sex DateTested  Res Status  DateOrder
am1 M   21/10/2015  NR  Yes 1
am2 F   9/07/2010   R   No  1
am2 F   24/09/2010  R   No  2
am3 M   23/10/2013  NR  Yes 1
am4 F   28/04/2010  R   No  1
am4 F   23/06/2010  R   No  2
am4 F   20/10/2010  R   No  4
am4 F   4/03/2011   R   No  5
am4 F   2/12/2011   R   No  6

 

 

————————————————-

Answer

In base R, I would approach it as follows:

# convert the 'DateTested' column to a date-format
dat$DateTested <- as.Date(dat$DateTested, format = "%d/%m/%Y")
# calculate the difference in days with the previous observation in the group
dat$tdiff <- unlist(tapply(dat$DateTested, INDEX = dat$PersID,
                           FUN = function(x) c(0, `units<-`(diff(x), "days"))))
# filter the observations that have either a timedifference of zero or more 
dat[(dat[,"tdiff"]==0 | dat[,"tdiff"] > 30),]

which gives:

   PersID Sex DateTested Res Status DateOrder tdiff
1     am1   M 2015-10-21  NR    Yes         1     0
2     am2   F 2010-07-09   R     No         1     0
3     am2   F 2010-09-24   R     No         2    77
4     am3   M 2013-10-23  NR    Yes         1     0
6     am4   F 2010-04-28   R     No         1     0
7     am4   F 2010-06-23   R     No         2    56
9     am4   F 2010-10-20   R     No         4    91
10    am4   F 2011-03-04   R     No         5   135
11    am4   F 2011-12-02   R     No         6   273

Using the data.table package:

library(data.table)
# convert the 'data.frame' to a 'data.table'
# and convert the 'DateTested' column to a date-format
setDT(dat)[, DateTested := as.Date(DateTested, format = "%d/%m/%Y")]
# calculate the difference in days with the previous observation in the group
dat[, tdiff := c(0, `units<-`(diff(DateTested), "days")), PersID]
# filter the observations that have either a timedifference of zero or more than 30 days
dat[(tdiff==0 | tdiff > 30)]

which will give you the same result. You can also chain this together as follows:

setDT(dat)[, DateTested := as.Date(DateTested, format = "%d/%m/%Y")
           ][, tdiff := c(0, `units<-`(diff(DateTested), "days")), by = PersID
             ][(tdiff==0 | tdiff > 30)]

And using dplyr:

library(dplyr)
dat %>% 
  mutate(DateTested = as.Date(DateTested, format = "%d/%m/%Y")) %>%
  group_by(PersID) %>%
  mutate(tdiff = c(0, `units<-`(diff(DateTested), "days"))) %>%
  filter(tdiff == 0 | tdiff > 30)

which will also give you the same result.

dataframe,r

Facebook Comments

Post a comment