| Title: | EFSA Ensemble of Data Collections Tools |
|---|---|
| Description: | Provides tools for dataset operations and utilities designed to preserve data history within EFSA's ad hoc data collections. It also imports packages developed by EFSA that provide additional support for data collection activities. |
| Authors: | Lorenzo Copelli [aut] (ORCID: <https://orcid.org/0009-0002-4305-065X>), Luca Belmonte [aut, cre] (ORCID: <https://orcid.org/0000-0002-7977-9170>) |
| Maintainer: | Luca Belmonte <[email protected]> |
| License: | EUPL-1.2 |
| Version: | 1.0.0 |
| Built: | 2026-05-13 09:45:03 UTC |
| Source: | https://github.com/openefsa/efsatools |
This function drops all empty rows and columns from the specified data frame, i.e. every row and column that contains only NAs.
dropEmpty(dataframe)
dataframe |
|
The provided data frame without empty rows and columns, with all column types converted to character strings.

    # The first row is going to be dropped.
    irisTest_ <- iris
    irisTest_[1, ] <- NA
    irisTestDropped <- dropEmpty(irisTest_)

    # The Test column is going to be dropped.
    irisTest_ <- iris
    irisTest_$Test <- NA
    irisTestDropped <- dropEmpty(irisTest_)
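To make the behaviour described for dropEmpty concrete, the following base-R sketch converts every column to character and keeps only the rows and columns that hold at least one non-NA value. The helper name `dropEmptySketch` is hypothetical, and the code is an approximation based on the description above, not the package's own implementation.

```r
# Hypothetical helper: approximates the documented behaviour in base R.
# It is NOT the efsatools implementation.
dropEmptySketch <- function(dataframe) {
  # Convert every column to character, as the documented return value suggests.
  asText <- as.data.frame(lapply(dataframe, as.character),
                          stringsAsFactors = FALSE)
  # A row or column is kept when it holds at least one non-NA value.
  keepRows <- rowSums(!is.na(asText)) > 0
  keepCols <- colSums(!is.na(asText)) > 0
  asText[keepRows, keepCols, drop = FALSE]
}

# One all-NA row and one all-NA column are added, then removed again.
irisTest_ <- iris
irisTest_[1, ] <- NA
irisTest_$Test <- NA
cleaned_ <- dropEmptySketch(irisTest_)
dim(cleaned_)
```

Row and column filters are computed on the same converted copy, so the result does not depend on whether rows or columns are dropped first.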
This function takes a data frame and joins it with an EFSA catalogue. The catalogue must itself be a data frame.
enrich(dataframe, catalogue, joinBy, enrichedColumnName)
dataframe |
|
catalogue |
|
joinBy |
|
enrichedColumnName |
|
The specified data frame enriched with the catalogue data.

    # Rename Species to CODE so that it can serve as the join column.
    dataframe_ <- iris |>
      dplyr::rename(CODE = Species)

    # Build a small catalogue with one NAME per CODE.
    catalogue_ <- iris |>
      dplyr::rename(CODE = Species) |>
      dplyr::mutate(NAME = "test") |>
      dplyr::select(CODE, NAME) |>
      unique()

    enriched_ <- enrich(
      dataframe = dataframe_,
      catalogue = catalogue_,
      joinBy = "CODE",
      enrichedColumnName = "enrichedColumn")
This function drops and merges all the replicated columns from the specified data frame.
removeReplicatedColumns(dataframe, prefix)
dataframe |
|
prefix |
|
All occurrences of "N/A", "NA", and empty strings (case insensitive) in the provided data frame are replaced with character NAs. Then, only the columns whose names start with the specified prefix are selected and united into a single column whose name ends with "_deduplicated". Empty entries in the new deduplicated column are replaced with NAs. Finally, the new column is bound to the remaining columns of the initial data frame.
The specified data frame with an additional deduplicated column, with all column types converted to character strings.

    # Replicate the Species column under two prefixed names, then remove the original.
    irisTest_ <- iris
    irisTest_$Species_1 <- irisTest_$Species
    irisTest_$Species_2 <- irisTest_$Species
    irisTest_$Species <- NULL

    deduplicatedDataframe_ <- removeReplicatedColumns(
      dataframe = irisTest_,
      prefix = "Species_")
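The deduplication steps described for removeReplicatedColumns can be sketched in base R as follows. This is only an approximation: the helper name `removeReplicatedSketch` is hypothetical, the merge rule (keep the first non-missing value across the replicated columns) is an assumption, and so is the exact name of the new column (the prefix followed by "deduplicated"). The package may unite the columns differently.

```r
# Hypothetical helper: a base-R reading of the documented steps.
# It is NOT the efsatools implementation.
removeReplicatedSketch <- function(dataframe, prefix) {
  # Work on character columns only.
  asText <- as.data.frame(lapply(dataframe, as.character),
                          stringsAsFactors = FALSE)
  # Treat "N/A", "NA" and empty strings (any case) as missing values.
  asText[] <- lapply(asText, function(column) {
    column[toupper(trimws(column)) %in% c("N/A", "NA", "")] <- NA_character_
    column
  })
  replicated <- startsWith(names(asText), prefix)
  # Assumption: keep the first non-missing value across the replicated columns.
  merged <- Reduce(function(left, right) ifelse(is.na(left), right, left),
                   asText[replicated])
  others <- asText[!replicated]
  # Assumption: the new column is named prefix + "deduplicated".
  others[[paste0(prefix, "deduplicated")]] <- merged
  others
}

demo_ <- data.frame(
  id = 1:3,
  Tag_1 = c("x", "n/a", ""),
  Tag_2 = c("x", "y", "NA"),
  stringsAsFactors = FALSE
)
deduplicated_ <- removeReplicatedSketch(demo_, prefix = "Tag_")
deduplicated_
```

In this toy input the third row has no usable value in either replicated column, so its deduplicated entry stays NA.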
This function implements a Slowly Changing Dimension Type 2 (SCD2) update to merge new and current data while maintaining historical records. It deactivates outdated records and activates new ones, ensuring a history-preserving update strategy. Only the records that changed are marked as not active and replaced by new active ones.
SCD2(newData, currentData, key = names(newData))
newData |
|
currentData |
|
key |
|
The function:
1. Separates active and inactive records from the current data.
2. Gets the old records that are still present in the new data (i.e., the ones that can remain active).
3. Gets the records present in new data but not present in still active current data (i.e., the records to activate) and activates them.
4. Gets the current active records that are not present in the new data (i.e., the records to deactivate) and deactivates them.
A combined data frame with old data marked as not active and new data marked as active.

    currentData_ <- tibble::tribble(
      ~id, ~colA, ~colB, ~colC, ~IS_ACTIVE, ~START_DATE, ~END_DATE,
      1, "a1", "b1", "c1", TRUE, Sys.time(), as.Date(NA),
      2, "a2", "b2", "c2", TRUE, Sys.time(), as.Date(NA),
      3, "a3", "b3", "c3", TRUE, Sys.time(), as.Date(NA))

    # Row 1 is unchanged; rows 2 and 3 carry new values.
    newData_ <- tibble::tribble(
      ~id, ~colA, ~colB, ~colC,
      1, "a1", "b1", "c1",
      2, "a2", "b2", "c20",
      3, "a4", "b4", "c4")

    mergedData <- SCD2(newData = newData_, currentData = currentData_)
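The steps listed for SCD2 can be illustrated with a small base-R sketch. It is a simplified reading of the description, not the package's code: the helper name `scd2Sketch` and its `now` parameter are hypothetical, rows are compared through a pasted string of the key columns, and END_DATE is stored as POSIXct here (an assumption) so that start and end timestamps share one type.

```r
# Hypothetical helper: follows the four documented steps in base R.
# It is NOT the efsatools implementation.
scd2Sketch <- function(newData, currentData, key = names(newData),
                       now = Sys.time()) {
  # One comparable string per row, built from the key columns.
  signature <- function(data) {
    do.call(paste, c(unname(as.list(data[key])), sep = "\r"))
  }

  # 1. Separate active and inactive records.
  active <- currentData[currentData$IS_ACTIVE, , drop = FALSE]
  inactive <- currentData[!currentData$IS_ACTIVE, , drop = FALSE]

  # 2. Active records still present in the new data remain active.
  stillPresent <- signature(active) %in% signature(newData)
  unchanged <- active[stillPresent, , drop = FALSE]

  # 3. New records without an active counterpart are activated.
  isNew <- !(signature(newData) %in% signature(active))
  toActivate <- newData[isNew, , drop = FALSE]
  toActivate$IS_ACTIVE <- rep(TRUE, nrow(toActivate))
  toActivate$START_DATE <- rep(now, nrow(toActivate))
  toActivate$END_DATE <- rep(as.POSIXct(NA), nrow(toActivate))

  # 4. Active records missing from the new data are deactivated.
  toDeactivate <- active[!stillPresent, , drop = FALSE]
  toDeactivate$IS_ACTIVE <- rep(FALSE, nrow(toDeactivate))
  toDeactivate$END_DATE <- rep(now, nrow(toDeactivate))

  rbind(inactive, toDeactivate, unchanged, toActivate)
}

currentData_ <- data.frame(
  id = 1:3,
  colA = c("a1", "a2", "a3"),
  colB = c("b1", "b2", "b3"),
  colC = c("c1", "c2", "c3"),
  IS_ACTIVE = rep(TRUE, 3),
  START_DATE = rep(Sys.time(), 3),
  END_DATE = rep(as.POSIXct(NA), 3),
  stringsAsFactors = FALSE
)
newData_ <- data.frame(
  id = 1:3,
  colA = c("a1", "a2", "a4"),
  colB = c("b1", "b2", "b4"),
  colC = c("c1", "c20", "c4"),
  stringsAsFactors = FALSE
)
merged_ <- scd2Sketch(newData = newData_, currentData = currentData_)
merged_[, c("id", "colC", "IS_ACTIVE")]
```

With the default key (all columns of the new data), the unchanged record with id 1 stays active, while the old versions of ids 2 and 3 are closed and their new versions are opened.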
This function implements a simplified version of Slowly Changing Dimension Type 2 (SCD2) to merge new and current data while maintaining historical records. It deactivates all the old records and activates the new ones, ensuring a history-preserving update strategy. Unlike the standard SCD2, this simplified version applies no checks on the data: all the old records are deactivated and all the new ones activated, even if some of the old records are still active.
SSCD2(newData, currentData)
newData |
|
currentData |
|
A combined data frame with all old data marked as not active and new data marked as active.

    currentData_ <- tibble::tribble(
      ~id, ~colA, ~colB, ~colC, ~IS_ACTIVE, ~START_DATE, ~END_DATE,
      1, "a1", "b1", "c1", TRUE, Sys.time(), as.Date(NA),
      2, "a2", "b2", "c2", TRUE, Sys.time(), as.Date(NA),
      3, "a3", "b3", "c3", TRUE, Sys.time(), as.Date(NA))

    # Row 1 is identical to the current data; rows 2 and 3 carry new values.
    newData_ <- tibble::tribble(
      ~id, ~colA, ~colB, ~colC,
      1, "a1", "b1", "c1",
      2, "a2", "b2", "c20",
      3, "a4", "b4", "c4")

    mergedData <- SSCD2(newData = newData_, currentData = currentData_)
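The simplified strategy can be sketched in a few lines of base R, which makes the contrast with the standard SCD2 easy to see: no comparison between old and new rows takes place. The helper name `sscd2Sketch` and its `now` parameter are hypothetical. Two further assumptions are made: END_DATE is stored as POSIXct, and only records that were still active receive a new end date.

```r
# Hypothetical helper: a base-R reading of the simplified strategy.
# It is NOT the efsatools implementation.
sscd2Sketch <- function(newData, currentData, now = Sys.time()) {
  old <- currentData
  # Assumption: only records that were still active get a new end date.
  wasActive <- old$IS_ACTIVE
  old$END_DATE[wasActive] <- now
  old$IS_ACTIVE <- rep(FALSE, nrow(old))

  # Every incoming record is activated, with no comparison against old rows.
  fresh <- newData
  fresh$IS_ACTIVE <- rep(TRUE, nrow(fresh))
  fresh$START_DATE <- rep(now, nrow(fresh))
  fresh$END_DATE <- rep(as.POSIXct(NA), nrow(fresh))

  rbind(old, fresh)
}

currentData_ <- data.frame(
  id = 1:3,
  colA = c("a1", "a2", "a3"),
  colB = c("b1", "b2", "b3"),
  colC = c("c1", "c2", "c3"),
  IS_ACTIVE = rep(TRUE, 3),
  START_DATE = rep(Sys.time(), 3),
  END_DATE = rep(as.POSIXct(NA), 3),
  stringsAsFactors = FALSE
)
newData_ <- data.frame(
  id = 1:3,
  colA = c("a1", "a2", "a4"),
  colB = c("b1", "b2", "b4"),
  colC = c("c1", "c20", "c4"),
  stringsAsFactors = FALSE
)
merged_ <- sscd2Sketch(newData = newData_, currentData = currentData_)
merged_[, c("id", "colC", "IS_ACTIVE")]
```

Because nothing is compared, the unchanged record with id 1 appears twice in the result: once as a closed historical row and once as a new active row.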