Inclusion/Exclusion of rows in a DataFrame, based on specific criteria
I have a large data set of pathology test results for a number of individuals. I present a scaled-down data set describing the types of cases.
The data describes three types of person: (1) those with a single result only, (2) those with two results, and (3) those with many results.
My goal is to come up with a script that only includes rows for individuals according to a set of criteria. Technically, it is a method to count an individual's subsequent result only if it falls outside a specified reinfection period (30 days) after their previous result.
I have converted my data to a list and applied a number of functions to it to start processing the data.
What I have done so far is:
Select all rows where there is a single result per person
Convert the data frame to a list and pass a function that
I can then combine data frames with rbind:
Where I am stuck is cases where a person has more than two rows, and where the gap between some consecutive rows is greater than the 30-day reinfection period.
Can anyone suggest how I could extend this code to include only cases where there are more than two
Specifically, start from the oldest case: if the next case is within 30 days, exclude the second case; if the second case is more than 30 days after the previous case, include both cases. It should do this for all cases for the same
In this example the final output I am looking for is:
In base R, I would approach it as follows:
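A minimal sketch of that idea, assuming hypothetical column names `id` and `date` and a small toy data frame standing in for the real data. It also assumes each result is compared with the immediately preceding result for the same person, not with the last result that was kept:

```r
# Toy data (hypothetical): one person with 1 result, one with 2, one with 4
df <- data.frame(
  id   = c(1, 2, 2, 3, 3, 3, 3),
  date = as.Date(c("2015-01-01",
                   "2015-01-01", "2015-03-01",
                   "2015-01-01", "2015-01-10", "2015-03-01", "2015-03-15"))
)

# Sort by person and date so gaps are computed from the oldest result onwards
df <- df[order(df$id, df$date), ]

# Days since the previous result for the same person (NA for the first result)
df$gap <- ave(as.numeric(df$date), df$id,
              FUN = function(x) c(NA, diff(x)))

# Keep each person's first result and any result more than 30 days
# after the preceding one; this keeps 5 of the 7 toy rows
res <- df[is.na(df$gap) | df$gap > 30, c("id", "date")]
res
```

Under this assumption, a run of results that are each less than 30 days apart is reduced to its first row, even if the run as a whole spans more than 30 days. If the comparison should instead be against the last kept result, a loop over each person's dates would be needed.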
Using the data.table package:
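A sketch under the same kind of assumptions (hypothetical `id` and `date` columns, toy data, each result compared with the immediately preceding one for that person):

```r
library(data.table)

# Toy data (hypothetical column names and values)
dt <- data.table(
  id   = c(1, 2, 2, 3, 3, 3, 3),
  date = as.Date(c("2015-01-01",
                   "2015-01-01", "2015-03-01",
                   "2015-01-01", "2015-01-10", "2015-03-01", "2015-03-15"))
)

# Order by person and date (by reference)
setorder(dt, id, date)

# Days since the previous result within each person; NA for the first result
dt[, gap := as.numeric(date) - shift(as.numeric(date)), by = id]

# Keep first results and results more than 30 days after the preceding one
res <- dt[is.na(gap) | gap > 30, .(id, date)]
res
```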
which will give you the same result. You can also chain this together as follows:
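For illustration, a chained version might look like the following sketch. The column names `id` and `date` and the toy values are assumptions, and the rule applied is again "more than 30 days after the preceding result for the same person":

```r
library(data.table)

# Toy data (hypothetical column names and values)
dt <- data.table(
  id   = c(1, 2, 2, 3, 3, 3, 3),
  date = as.Date(c("2015-01-01",
                   "2015-01-01", "2015-03-01",
                   "2015-01-01", "2015-01-10", "2015-03-01", "2015-03-15"))
)

# Sort, compute the per-person gap, and filter in one chained expression
res <- dt[order(id, date)
          ][, gap := as.numeric(date) - shift(as.numeric(date)), by = id
          ][is.na(gap) | gap > 30, .(id, date)]
res
```

Because `dt[order(id, date)]` returns a new data.table, the `gap` column is added to that sorted copy and `dt` itself is left unchanged.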
And using dplyr:
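A sketch with hypothetical `id` and `date` columns and toy data, again comparing each result with the immediately preceding one for the same person:

```r
library(dplyr)

# Toy data (hypothetical column names and values)
df <- data.frame(
  id   = c(1, 2, 2, 3, 3, 3, 3),
  date = as.Date(c("2015-01-01",
                   "2015-01-01", "2015-03-01",
                   "2015-01-01", "2015-01-10", "2015-03-01", "2015-03-15"))
)

res <- df %>%
  arrange(id, date) %>%                          # oldest result first per person
  group_by(id) %>%
  mutate(gap = as.numeric(date - lag(date))) %>% # days since previous result
  filter(is.na(gap) | gap > 30) %>%              # first result, or > 30 days later
  ungroup() %>%
  select(id, date)
res
```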
which will also give you the same result.