This is a getting-started guide for the Repo R package, which implements an R objects repository manager. It is a data-centered data flow manager.
The Repo package builds one or more centralized local repositories where R objects are stored together with their annotations, tags, dependency notes, provenance traces and source code. Once a repository has been populated, stored objects can be easily searched, navigated, edited, imported and exported. Annotations can be exploited to reconstruct data flows and perform typical pipeline management operations.
Additional information can be found in the paper: Napolitano, F. repo: an R package for data-centered management of bioinformatic pipelines. BMC Bioinformatics 18, 112 (2017).
The latest version of Repo can be found at: https://github.com/franapoli/repo
Repo is also on CRAN at: https://cran.r-project.org/package=repo
The following command creates a new repository in a temporary path (the default would be “~/.R_repo”). The same function opens existing repositories. The variable rp will be used as the main interface to the repository throughout this guide.
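The command itself is not shown above. A minimal sketch of what it may look like follows, assuming the repo_open function with its force argument (which skips the confirmation prompt); the directory name is chosen here for illustration only.

```r
library(repo)

# Throwaway directory for the demo repository (illustrative path).
root <- tempfile("repo_demo_")
dir.create(root)

# force = TRUE avoids the interactive confirmation; calling repo_open
# again on the same path would open the existing repository instead.
rp <- repo_open(root, force = TRUE)
```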
Repo created.
This document is produced by a script named index.Rmd. The script itself can be added to the repository, and newly created resources can be annotated as being produced by it. The annotation is made automatic using the options command.
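As a hedged sketch of these two steps: the script is attached as a file, then set as the default source for later items. The placeholder file written below stands in for the real index.Rmd, and the src option name is an assumption based on the src parameter of put used later in this guide.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

# Placeholder standing in for the real script (hypothetical content).
script <- file.path(tempdir(), "index.Rmd")
writeLines("Placeholder for the script that produces this guide.", script)

# Store the script in the repository as an attachment.
rp$attach(script, "Source code for this guide", c("source", "Rmd"))

# Assumed option: items stored from now on are annotated as produced by it.
rp$options(src = "index.Rmd")

rp$put(1:10, "numbers", "Integers from 1 to 10.", c("example", "repodemo"))
rp$info("numbers")
```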
Here is a normalized version of the Iris dataset to be stored in the repository:
The shortest way to permanently store the myiris object in the repository is simply:
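A minimal sketch of the short form, assuming that put falls back to the variable name as the item identifier when no name is given. It is an alternative to the annotated call, not something to run in addition to it, since the item would already exist.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

# Normalized version of the iris measurements.
myiris <- scale(as.matrix(iris[, 1:4]))

# Assumption: with no name argument, the item is called "myiris".
rp$put(myiris)
```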
However, richer annotation is possible, for example:
## chunk "myiris" {
myiris <- scale(as.matrix(iris[,1:4]))
rp$put(
obj = myiris,
name = "myiris",
description = paste(
"A normalized version of the iris dataset coming with R.",
"Normalization is made with the scale function",
"with default parameters."
),
tags = c("dataset", "iris", "repodemo")
)
## }
The call provides the data to be stored (obj), an identifier (name), a longer description, and a list of tags.
The comment lines (## chunk "myiris" { and ## }) have a special meaning: they associate the enclosed code with the resource. The code can be shown as follows:
myiris <- scale(as.matrix(iris[,1:4]))
rp$put(
obj = myiris,
name = "myiris",
description = paste(
"A normalized version of the iris dataset coming with R.",
"Normalization is made with the scale function",
"with default parameters."
),
tags = c("dataset", "iris", "repodemo")
)
The code associated with an item should take care of building and storing it. The build command executes the code in the current environment. It can automatically build dependencies, too.
In this example, the Iris class annotation will be stored separately:
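A sketch of how the labels could be stored, using the same positional arguments as the other put calls in this guide (object, name, description, tags). The description text and the tag names are illustrative.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

# The Species column holds the class of each flower.
rp$put(iris$Species, "irisLabels", "The Iris class labels.",
       c("labels", "iris", "repodemo"))
```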
The following code produces a 2D visualization of the Iris data and shows it:
irispca <- princomp(myiris)
iris2d <- irispca$scores[,c(1,2)]
plot(iris2d, main="2D visualization of the Iris dataset",
col=rp$get("irisLabels"))
Note that irisLabels is loaded on the fly from the repository.
It would be nice to store the figure itself in the repo together with the Iris data. This is done using the attach method, which stores any file in the repo as is (as opposed to R objects), plus annotations. Two parameters differ from put:
filepath: Instead of an identifier, attach takes a file name (with path). The file name will also be the item identifier.
to: This optional parameter tells Repo which item the new one is attached to. It can be empty.
fpath <- file.path(rp$root(), "iris2D.pdf")
pdf(fpath)
plot(iris2d, main="2D visualization of the Iris dataset",
col=rp$get("irisLabels"))
invisible(dev.off())
rp$attach(fpath, "Iris 2D visualization obtained with PCA.",
c("visualization", "iris", "repodemo"),
to="myiris")
The attached PDF can be accessed using an external PDF viewer directly from within Repo through the sys command. On a Linux system, this command runs the Evince document viewer and shows iris2D.pdf:
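A hedged sketch of such a call, assuming sys takes the item name followed by the system command to run on the stored file. The viewer is launched only in an interactive session where evince is installed; the figure is a placeholder.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

# Placeholder figure to attach.
fpath <- file.path(tempdir(), "iris2D.pdf")
pdf(fpath)
plot(iris[, 1:2], main = "Placeholder figure")
invisible(dev.off())
rp$attach(fpath, "Placeholder figure for the sys example.",
          c("visualization", "repodemo"))

# Open the stored file with Evince, if available.
if (interactive() && nzchar(Sys.which("evince"))) {
  rp$sys("iris2D.pdf", "evince")
}
```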
The following code makes a clustering of the Iris data and stores it in the repository. There is one parameter to note:
depends: This parameter tells Repo that, in order to compute the kiris variable, myiris is necessary. (This information is used by build to build dependencies and by dependencies to show them.)
kiris <- kmeans(myiris, 5)$cluster
rp$put(kiris, "iris_5clu", "Kmeans clustering of the Iris data, k=5.",
c("metadata", "iris", "kmeans", "clustering", "repodemo"),
depends="myiris")
The following shows what the clustering looks like. The figure will be attached to the repository as well.
fpath <- file.path(rp$root(), "iris2Dclu.pdf")
pdf(fpath)
plot(iris2d, main="Iris dataset kmeans clustering", col=kiris)
invisible(dev.off())
rp$attach(fpath, "Iris K-means clustering.",
c("visualization", "iris", "clustering", "kmeans", "repodemo"),
to="iris_5clu")
Finally, a contingency table of the Iris classes versus clusters is computed below. The special tag hide prevents an item from being shown unless explicitly requested.
res <- table(rp$get("irisLabels"), kiris)
rp$put(res, "iris_cluVsSpecies",
paste("Contingency table of the kmeans clustering versus the",
"original labels of the Iris dataset."),
c("result", "iris","validation", "clustering", "repodemo", "hide"),
src="index.Rmd", depends=c("myiris", "irisLabels", "iris_5clu"))
The info command summarizes some information about a repository:
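A minimal sketch, assuming that info called with no arguments reports on the repository as a whole. The two items are illustrative.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(1:10, "numbers", "Integers from 1 to 10.", c("example", "repodemo"))
rp$put(letters, "alphabet", "Lowercase letters.", c("example", "repodemo"))

# Prints root path, number of items and total size.
rp$info()
```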
Root: /tmp/RtmpLz7p96
Number of items: 7
Total size: 41.39 kB
The Repo library supports an S3 print method that shows the contents of the repository. All non-hidden items will be shown, together with some details, which by default are: name, dimensions, size.
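A short sketch of the print method on a small repository. One item carries the hide tag and is therefore expected to be left out of the listing; item names are illustrative.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(1:10, "numbers", "Integers from 1 to 10.", c("example", "repodemo"))
rp$put(table(iris$Species), "species_counts", "Flowers per species.",
       c("result", "repodemo", "hide"))

# Typing rp at the prompt has the same effect.
print(rp)
```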
ID Dims Size
myiris 150x4 1.83 kB
irisLabels 150 132 B
iris_5clu 150 127 B
Hidden items are… hidden. The following will show them too:
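A sketch of listing hidden items as well, assuming the all argument of the print method of the repository object (rp$print is used here as the method form of print).

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(1:10, "numbers", "Integers from 1 to 10.", c("example", "repodemo"))
rp$put(10:1, "reversed", "Integers from 10 to 1.",
       c("example", "repodemo", "hide"))

# all = TRUE includes items tagged with "hide".
rp$print(all = TRUE)
```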
ID Dims Size
@index.Rmd - 12.42 kB
myiris 150x4 1.83 kB
irisLabels 150 132 B
@iris2D.pdf - 13.27 kB
iris_5clu 150 127 B
@iris2Dclu.pdf - 13.44 kB
iris_cluVsSpecies 3x5 189 B
Items can also be filtered. With the following call, only items tagged with “clustering” will be shown:
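A sketch of tag-based filtering, assuming the tags argument of the print method. Item names and tags are illustrative.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

myiris <- scale(as.matrix(iris[, 1:4]))
rp$put(myiris, "myiris", "Scaled iris measurements.", c("dataset", "repodemo"))

set.seed(1)  # kmeans starts from random centers
rp$put(kmeans(myiris, 3)$cluster, "iris_3clu", "K-means clustering, k=3.",
       c("clustering", "repodemo"), depends = "myiris")

# Only items tagged "clustering" are listed, hidden ones included.
rp$print(tags = "clustering", all = TRUE)
```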
ID Dims Size
iris_5clu 150 127 B
@iris2Dclu.pdf - 13.44 kB
iris_cluVsSpecies 3x5 189 B
print can show information selectively. This command shows tags and size on disk:
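A sketch of column selection, assuming the show argument accepts a string of single-letter codes, with "s" for size and "t" for tags. The exact set of codes is an assumption here.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(1:10, "numbers", "Integers from 1 to 10.", c("example", "repodemo"))
rp$put(letters, "alphabet", "Lowercase letters.", c("text", "repodemo"))

# Assumed codes: "s" = size on disk, "t" = tags.
rp$print(show = "st")
```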
ID Tags Size
myiris dataset, iris, repodemo 1.83 kB
irisLabels labels, iris, repodemo 132 B
iris_5clu metadata, iris, kmeans, clustering, repodemo 127 B
The find command will match a search string against all item fields in the repository:
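A sketch of a search, assuming find takes the search string first and an all argument to include hidden items. The string "clu" is meant to match the name and tags of the clustering item below.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

myiris <- scale(as.matrix(iris[, 1:4]))
rp$put(myiris, "myiris", "Scaled iris measurements.", c("dataset", "repodemo"))

set.seed(1)
rp$put(kmeans(myiris, 3)$cluster, "iris_3clu", "K-means clustering, k=3.",
       c("clustering", "repodemo"), depends = "myiris")

# Lists the items where "clu" appears in any field.
rp$find("clu", all = TRUE)
```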
ID Dims Size
iris_5clu 150 127 B
@iris2Dclu.pdf - 13.44 kB
iris_cluVsSpecies 3x5 189 B
It is also possible to obtain a visual synthetic summary of the repository by using the pies command:
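A minimal sketch, assuming pies can be called without arguments and draws on the current graphics device (in a non-interactive session the plot goes to the default device file).

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(rnorm(1000), "big", "A thousand random numbers.", c("example", "repodemo"))
rp$put(1:10, "small", "Integers from 1 to 10.", c("example", "repodemo"))

# Pie chart summarizing the repository items.
rp$pies()
```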
Finally, the check command runs an integrity check verifying that the stored data has not been modified or corrupted. The command will also check for the presence of extraneous (not indexed) files. Since the rp repository was created in a temporary directory, a few extraneous files will pop up.
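A minimal sketch of the integrity check on a small repository. In this sketch nothing else is written to the repository root, so no extraneous files are expected.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(1:10, "numbers", "Integers from 1 to 10.", c("example", "repodemo"))
rp$put(letters, "alphabet", "Lowercase letters.", c("example", "repodemo"))

# Verifies each stored item and looks for files that are not indexed.
rp$check()
```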
Checking: index.Rmd... ok.
Checking: myiris... ok.
Checking: irisLabels... ok.
Checking: iris2D.pdf... ok.
Checking: iris_5clu... ok.
Checking: iris2Dclu.pdf... ok.
Checking: iris_cluVsSpecies... ok.
Checking for extraneous files in repo root...
Some extraneous file found:
/tmp/RtmpLz7p96/iris2D.pdf
/tmp/RtmpLz7p96/iris2Dclu.pdf
In Repo, the relations “generated by”, “attached to” and “dependent on” are summarized in a dependency graph. The formal representation of the graph is a matrix, in which the entry (i,j) represents a relation from i to j of type 1, 2 or 3 (dependency, attachment or generation). Here’s what it looks like:
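A sketch of how such a matrix can be obtained on a two-item repository, assuming that dependencies called with plot = FALSE returns the matrix with item names as row and column names. Item names are illustrative.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

myiris <- scale(as.matrix(iris[, 1:4]))
rp$put(myiris, "myiris", "Scaled iris measurements.", c("dataset", "repodemo"))

set.seed(1)
rp$put(kmeans(myiris, 3)$cluster, "iris_3clu", "K-means clustering, k=3.",
       c("clustering", "repodemo"), depends = "myiris")

# No plot is drawn; the relation matrix is returned.
depgraph <- rp$dependencies(plot = FALSE)
print(depgraph)
```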
| | index.Rmd | myiris | irisLabels | iris2D.pdf | iris_5clu | iris2Dclu.pdf | iris_cluVsSpecies |
|---|---|---|---|---|---|---|---|
| index.Rmd | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| myiris | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| irisLabels | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| iris2D.pdf | 3 | 2 | 0 | 0 | 0 | 0 | 0 |
| iris_5clu | 3 | 1 | 0 | 0 | 0 | 0 | 0 |
| iris2Dclu.pdf | 3 | 0 | 0 | 0 | 2 | 0 | 0 |
| iris_cluVsSpecies | 3 | 1 | 1 | 0 | 1 | 0 | 0 |
When the plot=F parameter is omitted, the dependencies method will plot the dependency graph. This plot requires the igraph library.
The three types of edges can be shown selectively, so here’s what the graph looks like without the “generated” edges:
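A hedged sketch of both plots, assuming a generated argument that switches off the “generated” edges. The calls are guarded so that nothing is drawn when igraph is not installed.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

myiris <- scale(as.matrix(iris[, 1:4]))
rp$put(myiris, "myiris", "Scaled iris measurements.", c("dataset", "repodemo"))

set.seed(1)
rp$put(kmeans(myiris, 3)$cluster, "iris_3clu", "K-means clustering, k=3.",
       c("clustering", "repodemo"), depends = "myiris")

if (requireNamespace("igraph", quietly = TRUE)) {
  rp$dependencies()                   # full dependency graph
  rp$dependencies(generated = FALSE)  # without the "generated" edges
}
```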
The get command is used to retrieve items from a repository. In the following, the item myiris is loaded into the variable x in the current environment.
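A minimal sketch of retrieval with get; the item is stored first so that the example is self-contained.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(scale(as.matrix(iris[, 1:4])), "myiris",
       "Scaled iris measurements.", c("dataset", "repodemo"))

# The stored object is returned and assigned to x.
x <- rp$get("myiris")
dim(x)
```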
An even simpler command is load, which also uses the item name as the variable name:
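A sketch comparing the two commands, assuming load creates the variable in the calling environment. The object is stored without first assigning it to a variable named myiris, so that load is what creates that variable.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(scale(as.matrix(iris[, 1:4])), "myiris",
       "Scaled iris measurements.", c("dataset", "repodemo"))

x <- rp$get("myiris")  # explicit assignment
rp$load("myiris")      # creates a variable called myiris

identical(x, myiris)
```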
[1] TRUE
The info command can provide additional information about an entry:
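A minimal sketch of info applied to a single item, passing the item name as its argument.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(scale(as.matrix(iris[, 1:4])), "myiris",
       "Scaled iris measurements.", c("dataset", "iris", "repodemo"))

# Prints description, tags, dimensions, timestamp, size and checksum.
rp$info("myiris")
```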
ID: myiris
Description: A normalized version of the iris dataset coming with R. Normalization is made with the scale function with default parameters.
Tags: dataset, iris, repodemo
Dimensions: 150x4
Timestamp: 2024-11-09 03:52:25.411829
Size on disk: 1.83 kB
Provenance: index.Rmd
Attached to: -
Stored in: /tmp/RtmpLz7p96/m/myiris
MD5 checksum: 06b875d5abd76d957b3e356a84ce5895
URL: -
There are actually three different ways of adding an object to a repository: adding a new item (rp$put), overwriting an existing item (rp$put(replace=T)), and adding a new version of an existing item (rp$put(replace="addversion")). In addition, the contents of an existing entry can be downloaded if a URL is provided with it (rp$pull).
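The three local variants can be sketched on a toy item as follows. The item name is illustrative, and the "#1" suffix of the older version is assumed from the naming shown elsewhere in this guide.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

# 1. New item.
rp$put(1:3, "numbers", "First content.", "example")

# 2. Overwrite the existing item.
rp$put(4:6, "numbers", "Overwritten content.", "example", replace = TRUE)

# 3. Keep the existing content as an older version and add a new one.
rp$put(7:9, "numbers", "Newer content.", "example", replace = "addversion")

rp$get("numbers")    # current version
rp$get("numbers#1")  # previous version, renamed and hidden
```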
The K-means algorithm will likely provide different solutions over multiple runs. Alternative solutions can be stored as new versions of the iris_5clu item as follows:
kiris2 <- kmeans(myiris, 5)$cluster
rp$put(kiris2, "iris_5clu",
"Kmeans clustering of the Iris data, k=5. Today's version!",
depends="myiris", replace="addversion")
The new repository looks like the old one:
ID Dims Size
myiris 150x4 1.83 kB
irisLabels 150 132 B
iris_5clu 150 125 B
Except that iris_5clu is actually the one just put (look at the description):
ID: iris_5clu
Description: Kmeans clustering of the Iris data, k=5. Today's version!
Tags:
Dimensions: 150
Timestamp: 2024-11-09 03:52:26.697056
Size on disk: 125 B
Provenance: index.Rmd
Attached to: -
Stored in: /tmp/RtmpLz7p96/i/iris_5clu1
MD5 checksum: 46e1ed66e063c8ef803341d7c4f0e782
URL: -
The old one has been renamed and hidden:
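Both versions can be inspected with info. The sketch below rebuilds a small versioned item and assumes the older version is reachable under the name with the "#1" suffix.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

myiris <- scale(as.matrix(iris[, 1:4]))
rp$put(myiris, "myiris", "Scaled iris measurements.", c("dataset", "repodemo"))

set.seed(1)
kiris <- kmeans(myiris, 5)$cluster
rp$put(kiris, "iris_5clu", "Kmeans clustering of the Iris data, k=5.",
       c("clustering", "repodemo"), depends = "myiris")

set.seed(2)
kiris2 <- kmeans(myiris, 5)$cluster
rp$put(kiris2, "iris_5clu", "Kmeans clustering, second run.",
       depends = "myiris", replace = "addversion")

rp$info("iris_5clu")    # the version just stored
rp$info("iris_5clu#1")  # the previous version
```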
ID: iris_5clu#1
Description: Kmeans clustering of the Iris data, k=5.
Tags: metadata, iris, kmeans, clustering, repodemo, hide
Dimensions: 150
Timestamp: 2024-11-09 03:52:26.697877
Size on disk: 127 B
Provenance: index.Rmd
Attached to: -
Stored in: /tmp/RtmpLz7p96/i/iris_5clu
MD5 checksum: f7752f8f2ea2d39a40a360b55aabf221
URL: -
It is also possible to use the repository for caching purposes by using the lazydo command. It will run an expression and store the results. When the same expression is run again, the results will be loaded from the repository instead of being built again.
## First run
system.time(rp$lazydo(
{
Sys.sleep(.5)
result <- "This took half a second to compute"
}
))
lazydo is building resource from code.
Cached item name is: 36b10f25c707a3e623be91b1a2c526a8
user system elapsed
0.004 0.000 0.504
## Second run
system.time(rp$lazydo(
{
Sys.sleep(.5)
result <- "This took half a second to compute"
}
))
lazydo found precomputed resource.
user system elapsed
0.001 0.000 0.001
Existing items can feature a URL property. The pull function is meant to update item contents by downloading them from the Internet. This allows for the distribution of “stub” repositories containing all item information without the actual data. The following code creates an item provided with a remote URL. A call to pull overwrites the stub local content with the remote content.
rp$put("Local content", "item1",
"This points to big data you may want to download",
"tag", URL="http://exampleURL/repo")
print(rp$get("item1"))
[1] "Local content"
rp$pull("item1", replace=T)
print(rp$get("item1"))
[1] "Remote content"
The handlers method returns a list of functions with the same names as the items in the repo. Each of these functions can call Repo methods (get by default) on the corresponding items. In this way all item names are loaded, which may be useful, for example, to exploit auto-completion features of the editor.
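A minimal sketch of building the handlers and listing their names. The extra "repo" entry is the handler to the repository itself, as in the listing that follows.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(1:10, "numbers", "Integers from 1 to 10.", c("example", "repodemo"))
rp$put(letters, "alphabet", "Lowercase letters.", c("example", "repodemo"))

# One function per item, plus one entry for the repository.
h <- rp$handlers()
names(h)
```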
[1] "index.Rmd" "myiris"
[3] "irisLabels" "iris2D.pdf"
[5] "iris_5clu#1" "iris2Dclu.pdf"
[7] "iris_cluVsSpecies" "iris_5clu"
[9] "36b10f25c707a3e623be91b1a2c526a8" "item1"
[11] "repo"
Handlers call get by default:
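A sketch of the default behaviour on a toy item: calling a handler with no arguments is expected to return the stored object, just as get would.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

counts <- table(iris$Species)
rp$put(counts, "species_counts", "Flowers per species.", c("result", "repodemo"))

h <- rp$handlers()

# Same result as rp$get("species_counts").
h$species_counts()
```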
kiris
1 2 3 4 5
setosa 22 0 0 0 28
versicolor 0 21 27 2 0
virginica 0 2 21 27 0
The tag command (not yet described) adds a tag to an item:
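A hedged sketch of tagging through a handler, assuming that the first argument of a handler is the name of the Repo method to call and that the remaining arguments are passed on to it. The item and the new tag are illustrative.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)

rp$put(table(iris$Species), "species_counts", "Flowers per species.",
       c("result", "repodemo"))
h <- rp$handlers()

# Assumed to be equivalent to rp$tag("species_counts", "onenewtag").
h$species_counts("tag", "onenewtag")

# Show the item details, including the updated tag list.
h$species_counts("info")
```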
ID: iris_cluVsSpecies
Description: Contingency table of the kmeans clustering versus the original labels of the Iris dataset.
Tags: result, iris, validation, clustering, repodemo, hide, onenewtag
Dimensions: 3x5
Timestamp: 2024-11-09 03:52:27.441884
Size on disk: 189 B
Provenance: index.Rmd
Attached to: -
Stored in: /tmp/RtmpLz7p96/i/iris_cluVsSpecies
MD5 checksum: a84f692bde0194f6591b0e599fcd7b93
URL: -
One may want to open a repo directly with:
Found repo index in "/tmp/RtmpLz7p96/R_repo.RDS".
In that case, the handler to the repo itself will come in handy:
ID Dims Size
myiris 150x4 1.83 kB
irisLabels 150 132 B
iris_5clu 150 125 B
item1 1 67 B
If items are removed or added, handlers may need a refresh:
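The last three steps can be sketched together: opening a repository straight into its handlers, reaching the repository through the repo handler, and rebuilding the handlers after a change. Item names are illustrative.

```r
library(repo)

root <- tempfile("repo_demo_")
dir.create(root)
rp <- repo_open(root, force = TRUE)
rp$put(1:3, "first", "First item.", "example")

# Open the existing repository and keep only the handlers.
h <- repo_open(rp$root())$handlers()

# The repository itself is available as h$repo.
print(h$repo)

# After adding an item, the handler list is rebuilt to include it.
h$repo$put(4:6, "second", "Second item.", "example")
h <- h$repo$handlers()
names(h)
```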