133 code parser #139

Closed
wants to merge 107 commits into from
@m7pr m7pr commented Aug 23, 2023

Closes #133

This PR introduces a feature with broad potential use. Currently it only extends the qenv class, but the bigger picture is that it will let us change the way we provide data in teal::init.

Current behavior

Currently teal::init takes teal.data::teal_data() as input, which in turn takes actual R objects together with a companion code specification; in all cases this is the code used to create the R object being passed. This results in code duplication: we first write the code that creates the object, and then we copy-paste that code into the object specification in teal.data::teal_data().

new_iris <- transform(iris, id = seq_len(nrow(iris)))
new_mtcars <- transform(mtcars, id = seq_len(nrow(mtcars)))

app <- init(
  data = teal_data(
    dataset("new_iris", new_iris, code = "new_iris <- transform(iris, id = seq_len(nrow(iris)))"),
    dataset("new_mtcars", new_mtcars, code = "new_mtcars <- transform(mtcars, id = seq_len(nrow(mtcars)))")
  )
)

Proposed alternative

The alternative proposed in this PR is a functionality called code parser. It works out which parts of the code (passed as a character) are needed to create a specific object (with all its dependent objects and dependent side effects). Thanks to that, we don't need to pass object specifications and code separately: we can just pass the code, which will be evaluated and parsed so that, under the hood, objects are created and their respective code is assigned to them automatically.
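As a rough illustration of the idea (not the implementation in this PR), the dependency lookup can be sketched in base R. The helper name `code_for_object` is hypothetical. It only follows plain `<-`, `=` and `->` assignments at the top level, ignores replacement calls such as `attr(x, "label") <- ...`, and is deliberately conservative: it keeps every earlier assignment to a needed name.

```r
# Hypothetical helper for illustration; not the implementation in this PR.
# Returns the top-level calls needed to recreate `name`, following only
# plain `<-`, `=` and `->` assignments.
code_for_object <- function(code, name) {
  calls <- as.list(parse(text = code, keep.source = FALSE))

  # Name assigned by each top-level call, or NA if it is not a plain
  # assignment. The parser already turns `value -> x` into `x <- value`.
  target <- vapply(calls, function(cl) {
    is_assignment <- is.call(cl) &&
      is.name(cl[[1]]) &&
      as.character(cl[[1]]) %in% c("<-", "=") &&
      is.name(cl[[2]])
    if (is_assignment) as.character(cl[[2]]) else NA_character_
  }, character(1))

  needed <- name
  keep <- logical(length(calls))
  # Walk backwards so that dependencies of kept calls are picked up too.
  for (i in rev(seq_along(calls))) {
    if (!is.na(target[i]) && target[i] %in% needed) {
      keep[i] <- TRUE
      # Every symbol used on the right-hand side must be created earlier.
      needed <- union(needed, all.vars(calls[[i]][[3]]))
    }
  }

  vapply(calls[keep], function(cl) paste(deparse(cl), collapse = "\n"), character(1))
}

example_code <- "
a <- 1
b <- 2
3 -> unused
d = a + b
e <- d * 2
"
code_for_object(example_code, "d")
```

Asking for `d` keeps the assignments to `a`, `b` and `d` and drops the unrelated lines. Side effects that are not expressed as assignments are invisible to a lookup like this, which is the gap the `# @effect` tag is meant to cover.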

This PR only introduces changes to the qenv object. Further changes to how teal.data::teal_data() or the teal::init data parameter work will be needed. qenv received a new field called code_dependency, a list needed to restore an object and its side effects. Below are a few examples of extracting the code for specific objects such as ADLB and ADSL.

code objects
library(dplyr)
code = '
arm_mapping <- list(
  "A: Drug X" = "150mg QD",
  "B: Placebo" = "Placebo",
  "C: Combination" = "Combination"
)
color_manual <- c("150mg QD" = "#000000", "Placebo" = "#3498DB", "Combination" = "#E74C3C")
# assign LOQ flag symbols: circles for "N" and triangles for "Y", squares for "NA"
shape_manual <- c("N" = 1, "Y" = 2, "NA" = 0)
ADSL <- goshawk::rADSL
goshawk::rADLB-> ADLB
iris2 <- iris # @effect ADLB ADSL
var_labels <- lapply(ADLB, function(x) attributes(x)$label)
iris3 <- iris'
code2 = '
ADLB <- ADLB %>%
  dplyr::mutate(AVISITCD = dplyr::case_when(
    AVISIT == "SCREENING" ~ "SCR",
    AVISIT == "BASELINE" ~ "BL",
    grepl("WEEK", AVISIT) ~
      paste(
        "W",
        trimws(
          substr(
            AVISIT,
            start = 6,
            stop = stringr::str_locate(AVISIT, "DAY") - 1
          )
        )
      ),
    TRUE ~ NA_character_
  )) %>%
  dplyr::mutate(AVISITCDN = dplyr::case_when(
    AVISITCD == "SCR" ~ -2,
    AVISITCD == "BL" ~ 0,
    grepl("W", AVISITCD) ~ as.numeric(gsub("[^0-9]*", "", AVISITCD)),
    TRUE ~ NA_real_
  )) %>%
  # use ARMCD values to order treatment in visualization legend
  dplyr::mutate(TRTORD = ifelse(grepl("C", ARMCD), 1,
                                ifelse(grepl("B", ARMCD), 2,
                                       ifelse(grepl("A", ARMCD), 3, NA)
                                )
  )) %>%
  dplyr::mutate(ARM = as.character(arm_mapping[match(ARM, names(arm_mapping))])) %>%
  dplyr::mutate(ARM = factor(ARM) %>%
                  reorder(TRTORD)) %>%
  dplyr::mutate(
    ANRHI = dplyr::case_when(
      PARAMCD == "ALT" ~ 60,
      PARAMCD == "CRP" ~ 70,
      PARAMCD == "IGA" ~ 80,
      TRUE ~ NA_real_
    ),
    ANRLO = dplyr::case_when(
      PARAMCD == "ALT" ~ 20,
      PARAMCD == "CRP" ~ 30,
      PARAMCD == "IGA" ~ 40,
      TRUE ~ NA_real_
    )
  ) %>%
  dplyr::rowwise() %>%
  dplyr::group_by(PARAMCD) %>%
  dplyr::mutate(LBSTRESC = ifelse(
    USUBJID %in% sample(USUBJID, 1, replace = TRUE),
    paste("<", round(runif(1, min = 25, max = 30))), LBSTRESC
  )) %>%
  dplyr::mutate(LBSTRESC = ifelse(
    USUBJID %in% sample(USUBJID, 1, replace = TRUE),
    paste(">", round(runif(1, min = 70, max = 75))), LBSTRESC
  )) %>%
  ungroup()'

code3 = '
attr(ADLB[["ARM"]], "label") <- var_labels[["ARM"]]
attr(ADLB[["ANRHI"]], "label") <- "Analysis Normal Range Upper Limit"
attr(ADLB[["ANRLO"]], "label") <- "Analysis Normal Range Lower Limit"
mtcars # @effect ADLB
options(prompt = ">") # @effect ADLB

# add LLOQ and ULOQ variables
ADLB_LOQS<-goshawk:::h_identify_loq_values(ADLB)
goshawk:::h_identify_loq_values(ADLB)->ADLB_LOQS
ADLB = dplyr::left_join(ADLB, ADLB_LOQS, by = "PARAM")
iris6 <- list(ADLB, ADLB_LOQS, ADSL)
iris5 <- iris'
q1 <- teal.code:::new_qenv()
q2 <- teal.code::eval_code(q1, code = code)
q3 <- teal.code::eval_code(q2, code = code2)
q4 <- teal.code::eval_code(q3, code = code3)

get_code(q2, deparse = FALSE, names = "ADLB")
get_code(q3, deparse = FALSE, names = "ADLB")
get_code(q4, deparse = FALSE, names = "ADLB")
get_code(q4, deparse = FALSE, names = "var_labels")
get_code(q4, deparse = FALSE, names = "ADSL")
get_code(q4, deparse = FALSE, names = c("ADSL", "ADS", "C"))
get_code(q4, deparse = FALSE, names = c("var_labels", "ADSL"))
get_code(q4)

Side effects

The functionality might be a bit complicated, mainly because of how side effects are handled. Code often contains side effects that cannot be directly connected to specific objects. If objects are connected through assignment operators (<-, =, ->), it is easy to work out the dependency structure between objects and lines of code. However, a side effect such as the creation of a database connection, which influences all other operations in the code, cannot be inferred by static code analysis alone. Hence we introduce the possibility of adding a # @effect object_name tag at the end of a line to specify which objects that line affects. The bottleneck of this solution is that we operate on parsed code, which loses information about comments. The comments are kept in its srcref attribute, which is read with utils::getParseData(), so we need to store some extra meta-information if we also want to restore the lines that are side effects.
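A minimal sketch of how such tags could be read back from the parse data, assuming the code is parsed with `keep.source = TRUE`. The helper name `effect_tags` is hypothetical and this is not the PR's code; it only shows that comments are still available through `utils::getParseData()` even though they are absent from the parsed calls.

```r
# Hypothetical helper for illustration; not the implementation in this PR.
# Maps the source line of every `# @effect` comment to the object names
# listed after the tag.
effect_tags <- function(code) {
  # keep.source = TRUE is required, otherwise no parse data is attached.
  parsed <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(parsed)

  # Comments are dropped from the calls but survive as COMMENT tokens.
  is_tag <- pd$token == "COMMENT" & grepl("@effect", pd$text, fixed = TRUE)
  comments <- pd[is_tag, ]

  tags <- lapply(comments$text, function(txt) {
    strsplit(trimws(sub("^.*@effect", "", txt)), "[[:space:]]+")[[1]]
  })
  names(tags) <- comments$line1
  tags
}

example_code <- "con <- 1
options(prompt = \">\") # @effect ADLB
iris2 <- iris # @effect ADLB ADSL"
effect_tags(example_code)
```

The result is a list keyed by line number, so a tagged line can be matched back to the top-level call that starts on the same line.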

Notes

The relation between objects is assumed to be expressed through the <-, = or -> assignment operators. No other object creation methods (like assign, <<- or any non-standard-evaluation method) are supported. Such cases are meant to be handled with the # @effect tag.

@m7pr m7pr added the core label Aug 23, 2023
@m7pr m7pr requested a review from chlebowa August 23, 2023 09:54
@gogonzo gogonzo self-assigned this Sep 6, 2023
@@ -3,6 +3,7 @@
#' @name get_code
#' @param object (`qenv`)
#' @param deparse (`logical(1)`) if the returned code should be converted to character.
#' @param names (`character(n)`) if provided, returns the code only for objects specified in `names`.
Contributor

Food for thought: is it a better API to have this argument or to have a separate function, say get_object_code?

Contributor Author

We could also have get_code() return a list for all the objects, so you could call get_code()['object_name']. I'm not sure yet which way is best here.

m7pr commented Sep 26, 2023

Hey @chlebowa for this

Also,

q <- eval_code(new_qenv(), "a <- 1")
get_code(q, deparse = FALSE, names = "a")

returns character when names is not NULL.

Yeah, it returns character for deparse = TRUE

testthat::test_that(
  "get_code returns the same class when names is specified and when not",
  {
    q <- eval_code(new_qenv(), "a <- 1")
    testthat::expect_identical(
      get_code(q, deparse = FALSE, names = "a"),
      get_code(q, deparse = TRUE)
    )
  }
)

"Objects not found in 'qenv' environment: ",
paste(names[!(names %in% ls(qenv@env))], collapse = ", ")
)
}
Contributor

Suggested change
}
return(character(0L))
}

Contributor Author

I let the function carry on, because if someone asks for 3 objects and 1 of them does not exist, they at least get the code for the other two. Maybe it would be better to raise an error here.
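The two behaviours discussed here can be compared with a small sketch. The helper `check_names` and its `strict` argument are hypothetical and not part of teal.code; a plain environment stands in for `qenv@env`.

```r
# Hypothetical helper for illustration; not part of teal.code.
# Returns the requested names that exist in `env`. Missing names raise
# an error when `strict = TRUE` and a warning otherwise.
check_names <- function(names, env, strict = FALSE) {
  not_found <- setdiff(names, ls(env))
  if (length(not_found) > 0L) {
    msg <- paste0(
      "Objects not found in 'qenv' environment: ",
      paste(not_found, collapse = ", ")
    )
    if (strict) stop(msg, call. = FALSE) else warning(msg, call. = FALSE)
  }
  intersect(names, ls(env))
}

env <- new.env()
assign("ADSL", 1, envir = env)

# Lenient: warns about "ADS" and "C" but still returns "ADSL".
found <- suppressWarnings(check_names(c("ADSL", "ADS", "C"), env))

# Strict: the same request fails outright.
result <- tryCatch(
  check_names(c("ADSL", "ADS", "C"), env, strict = TRUE),
  error = function(e) conditionMessage(e)
)
```

The lenient variant matches the current behaviour described above; the strict variant is the alternative raised in this thread.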

@averissimo averissimo left a comment

The implementation of code parser is smart and complex 💯

I've been testing and thinking about this and can't shake the feeling that there are a bunch of exceptions outside the control of the insightsengineering team

However, I'm not finding a lot of them and the ones I find are a bit specific 😁

Mostly when accessing data from packages or an initial assignment via assign('yada') # @effect yada

Minor edge cases

It's been hard to find situations where it fails, which is nice in something as complex as this!!

I guess both the examples below come from the initial object not being detected as "assigned"

Data from packages

I believe this case might be plausible

testthat::test_that("code_parser load data from package & effect hint", {

  code <- 'data(iris) # @effect iris'
  
  q1 <- teal.code::eval_code(teal.code:::new_qenv(), code = code)
  
  # Makes sure the object is on qenv
  q2 <- testthat::expect_output(
    teal.code::eval_code(q1, code = "print(NROW(iris))"),
    "150"
  )
  
  get_code(q1, deparse = FALSE, names = "iris") |> 
    length() |> 
    expect_gt(0)

  parsed_code <- get_code(q2, deparse = FALSE, names = "iris")

  expect_gt(length(parsed_code), 0)
  expect_false(is.na(parsed_code))
})

assign as first call

Related to the one above. I guess it's less plausible, but it exists nonetheless

testthat::test_that("code_parser with assign & effect hint", {

  code <- 'assign("ADSL", iris) # @effect ADSL'
  
  q1 <- teal.code::eval_code(teal.code:::new_qenv(), code = code)
  
  # Makes sure the object is on qenv
  q2 <- testthat::expect_output(
    teal.code::eval_code(q1, code = "print(NROW(ADSL))"),
    "150"
  )
  
  get_code(q1, deparse = FALSE, names = "ADSL") |> 
    length() |> 
    expect_gt(0)
  
  parsed_code <- get_code(q2, deparse = FALSE, names = "ADSL")
  
  expect_gt(length(parsed_code), 0)
  expect_false(is.na(parsed_code))
  
})

m7pr commented Sep 28, 2023

Thanks @averissimo for the kind words. This is a joint team effort, so there were multiple people involved in coming up with great ideas and suggestions. For the cases that you found with assign and data, I think we already state that this will not work yet

#' @details The relation between objects is assumed to be passed by `<-`, `=` or `->` assignment operators. No other
#' object creation methods (like `assign`, or `<<-` or any non-standard-evaluation method) are supported. To specify

as we are aware of our limitations.

averissimo commented Sep 28, 2023

code <- 'assign("ADSL", iris) # @effect ADSL'
...
code <- 'data(iris) # @effect iris'

@m7pr I'm aware of that documentation 🙂 and the examples above have the hint, however, it's not catching it when getting the code.

m7pr commented Sep 28, 2023

Ah, got you! Alrighty then, thanks for pointing this out. I think I can take another look at this

m7pr commented Sep 28, 2023

We had a call today with @gogonzo and @chlebowa where we decided to simplify the approach.
The main change will be to the default behavior of eval_code. It will convert expression and language inputs to character, and the main functionality will live in the eval_code method for the character signature. We will just store object@code as a character vector (not an expression as it is now), and we will extend object@code on every eval_code call. The whole parsing machinery will be moved to get_code and executed on the whole object@code.
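A toy sketch of that storage model, using a plain list in place of the qenv S4 class. The names `new_store`, `as_code_chr` and `eval_chr` are hypothetical and the real eval_code may differ; note that this toy also mutates the environment in place.

```r
# Toy stand-in for qenv (hypothetical names, not the teal.code API):
# an environment plus the accumulated code as a character vector.
new_store <- function() {
  list(env = new.env(parent = globalenv()), code = character(0))
}

# Convert character, language or expression input to character.
as_code_chr <- function(code) {
  if (is.character(code)) {
    return(code)
  }
  if (is.expression(code)) {
    return(vapply(
      as.list(code),
      function(x) paste(deparse(x), collapse = "\n"),
      character(1)
    ))
  }
  paste(deparse(code), collapse = "\n")
}

# Evaluate the code and append it to the stored character vector.
# Any parsing for dependencies would happen later, on the whole vector.
eval_chr <- function(store, code) {
  code <- as_code_chr(code)
  eval(parse(text = code, keep.source = FALSE), envir = store$env)
  store$code <- c(store$code, code)
  store
}

s <- eval_chr(new_store(), "a <- 1")
s <- eval_chr(s, quote(b <- a + 1))
s <- eval_chr(s, expression(d <- b * 2))
s$code
```

Because all inputs end up as text in one vector, a get_code-style function can parse everything at once, including trailing comments, instead of keeping a separate dependency structure up to date on every evaluation.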

m7pr commented Sep 29, 2023

Hey, I'm working on the new approach on a separate branch so that it's easier to track the changes of the final approach against main
#146

Incorporated some of the feedback provided by @chlebowa and @averissimo but not all yet. Work in progress

m7pr commented Oct 2, 2023

Hey @averissimo, I incorporated your 2 examples into the tests in the other PR, #146

m7pr commented Oct 6, 2023

closing in favour of #146

@m7pr m7pr closed this Oct 6, 2023
m7pr added a commit that referenced this pull request Oct 9, 2023
Fixes #133 

Alternative to #139

---------

Signed-off-by: Marcin <[email protected]>
Co-authored-by: go_gonzo <[email protected]>
Co-authored-by: Aleksander Chlebowski <[email protected]>
Co-authored-by: Dawid Kałędkowski <[email protected]>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aleksander Chlebowski <[email protected]>
@m7pr m7pr deleted the 133_code_parser@main branch November 7, 2023 09:13
Successfully merging this pull request may close these issues.

[Code Parser]: Ability to figure out object dependencies based on a static code
4 participants