BiocNeighbors 1.18.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including Annoy and HNSW, both of which are demonstrated below.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 324 2367 9205 2538 9324 8389 5971 5956 3024 2210
## [2,] 800 5935 9234 1350 8611 2513 3446 4203 1720 4058
## [3,] 1911 5541 13 2401 6794 4186 6725 1260 8295 6194
## [4,] 2057 9519 8226 8789 6718 8843 417 293 6648 8471
## [5,] 7348 9408 6515 5818 9531 8169 955 4086 1265 4044
## [6,] 1913 4562 1410 5281 5233 4148 3437 3766 5036 8379
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9836214 1.0246693 1.0309650 1.0421752 1.0582401 1.1058861 1.116445
## [2,] 0.9123601 0.9240700 0.9329356 0.9695373 0.9727747 0.9803308 1.043381
## [3,] 0.9440759 0.9470428 1.0324351 1.0447689 1.0495856 1.0559775 1.056176
## [4,] 0.9211345 0.9465020 0.9524145 0.9541939 0.9654297 0.9979405 1.001115
## [5,] 0.9978417 1.0260110 1.0737127 1.1029959 1.1342275 1.1359851 1.144647
## [6,] 0.8886356 0.9917402 1.0405064 1.0488504 1.0605050 1.0844269 1.101511
## [,8] [,9] [,10]
## [1,] 1.155324 1.155334 1.161989
## [2,] 1.049971 1.073174 1.088138
## [3,] 1.062944 1.082914 1.086168
## [4,] 1.004145 1.015664 1.017241
## [5,] 1.145844 1.152644 1.157907
## [6,] 1.119780 1.129085 1.132131
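Because Annoy is approximate, some of the reported neighbors may differ from the true nearest neighbors. The following is a minimal sketch of how one might gauge this, assuming that the exact KmknnParam() method from the same package serves as the reference. The `recall` summary and the seed are choices made for this illustration, not part of the package.

```r
library(BiocNeighbors)

# Simulated data with the same shape as in the example above.
set.seed(1000)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs * ndim), ncol = ndim)

# Approximate search with Annoy and exact search with KMKNN.
approx <- findKNN(data, k = 10, BNPARAM = AnnoyParam())
exact <- findKNN(data, k = 10, BNPARAM = KmknnParam())

# For each point, count how many of its exact neighbors Annoy recovered,
# then average over all points ('recall' is a name chosen for this sketch).
hits <- vapply(seq_len(nobs), function(i) {
    length(intersect(approx$index[i, ], exact$index[i, ]))
}, integer(1))
recall <- mean(hits) / 10
recall
```

Raising ntrees or search.mult in AnnoyParam() would generally be expected to improve this figure at the cost of speed.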
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 7499 6704 5077 8655 4156
## [2,] 7485 4356 3490 9134 2509
## [3,] 9013 7097 274 9450 1511
## [4,] 6770 7337 5695 1799 5830
## [5,] 5865 9415 1919 441 1714
## [6,] 5500 9234 96 326 8024
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9588218 0.9645501 0.9660923 0.9941998 1.0322046
## [2,] 0.9020717 0.9746211 0.9835749 1.0334380 1.0657200
## [3,] 0.8032002 0.8359988 0.8629534 0.8956631 0.8980619
## [4,] 0.8761650 0.9659457 0.9708010 1.0066589 1.0274303
## [5,] 0.8615956 0.8987432 0.9211295 0.9329082 0.9462915
## [6,] 1.0002086 1.0103360 1.0263430 1.0281612 1.0311813
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
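As a short sketch of the switch to HNSW, the calls below repeat the earlier searches with only the BNPARAM argument changed. The data are re-simulated here so that the block runs on its own; the object names and the seed are illustrative.

```r
library(BiocNeighbors)

set.seed(1001)
data <- matrix(runif(10000 * 20), ncol = 20)
query <- matrix(runif(1000 * 20), ncol = 20)

# Same calls as for Annoy; only the BNPARAM argument changes.
hfout <- findKNN(data, k = 10, BNPARAM = HnswParam())
hqout <- queryKNN(data, query, k = 5, BNPARAM = HnswParam())

dim(hfout$index)  # one row per point in 'data', one column per neighbor
dim(hqout$index)  # one row per point in 'query'
```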
Most of the options described for the exact methods are also applicable here. For example:
- subset, to identify neighbors for a subset of points.
- get.distance, to avoid retrieving distances when unnecessary.
- BPPARAM, to parallelize the calculations across multiple workers.
- BNINDEX, to build the index once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
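The other options listed above can be combined with a pre-built index in much the same way. The sketch below is one possible illustration; the choice of SnowParam() with two workers is an assumption made for portability, and any other BiocParallel backend could be substituted.

```r
library(BiocNeighbors)
library(BiocParallel)

set.seed(1002)
data <- matrix(runif(10000 * 20), ncol = 20)

# Build the Annoy index once and re-use it in every call below.
pre <- buildIndex(data, BNPARAM = AnnoyParam())

# Neighbors for the first 10 points only.
sub <- findKNN(BNINDEX = pre, k = 5, subset = 1:10)

# Indices only; the distance matrix is not returned.
idx.only <- findKNN(BNINDEX = pre, k = 5, get.distance = FALSE)

# Split the search across two workers.
par.out <- findKNN(BNINDEX = pre, k = 5, BPPARAM = SnowParam(workers = 2))
```

Skipping the distances is mainly useful when only the neighbor identities are needed downstream, for example when constructing a nearest-neighbor graph.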
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
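A brief sketch of a Manhattan-distance search with Annoy follows. As a sanity check, the distance to the first reported neighbor of the first point is recomputed by hand; the variable names and the seed are illustrative.

```r
library(BiocNeighbors)

set.seed(1003)
data <- matrix(runif(10000 * 20), ncol = 20)

# Approximate search under the Manhattan (L1) distance.
mout <- findKNN(data, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Recompute the distance from point 1 to its first reported neighbor by hand.
nn1 <- mout$index[1, 1]
manual <- sum(abs(data[1, ] - data[nn1, ]))
c(reported = mout$distance[1, 1], manual = manual)
```

The two values should agree up to small numerical differences, even if the reported neighbor is not the true nearest one.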
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can set the TMPDIR environment variable before starting R so that tempdir() points to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/var/folders/v1/y6dg5h4n163dzmrfl6t_r5480000gp/T//RtmpwIRmXb/file185e227c5d9ec.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
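The workflow described above might look roughly like the following sketch for Annoy. It rests on two assumptions that should be checked against the package documentation: that buildIndex() forwards an `fname` argument to the Annoy builder, and that AnnoyIndex() expects the data with points in columns. The file name is hypothetical, and tempdir() is used only so that the example cleans up after itself.

```r
library(BiocNeighbors)

set.seed(1004)
data <- matrix(runif(10000 * 20), ncol = 20)

# Stand-in for a permanent location; replace with a path outside tempdir()
# if the index really needs to outlive the session.
path <- file.path(tempdir(), "annoy_demo.idx")

# Assumption: 'fname' is forwarded by buildIndex() to the Annoy builder.
pre <- buildIndex(data, BNPARAM = AnnoyParam(), fname = path)
AnnoyIndex_path(pre)

# Later, point a new index object at the existing file.
# Assumption: AnnoyIndex() expects the data with points in columns,
# hence the transposition.
again <- AnnoyIndex(data = t(data), path = path,
    search.mult = 50, distance = "Euclidean")
out <- findKNN(BNINDEX = again, k = 5)

# The file is not cleaned up automatically, so remove it when done.
unlink(path)
```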
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84257)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.18.0 knitr_1.42 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 rlang_1.1.1 xfun_0.39
## [4] jsonlite_1.8.4 S4Vectors_0.38.1 htmltools_0.5.5
## [7] stats4_4.3.0 sass_0.4.6 rmarkdown_2.21
## [10] grid_4.3.0 evaluate_0.21 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.7 bookdown_0.34
## [16] BiocManager_1.30.20 compiler_4.3.0 codetools_0.2-19
## [19] Rcpp_1.0.10 BiocParallel_1.34.1 lattice_0.21-8
## [22] digest_0.6.31 R6_2.5.1 parallel_4.3.0
## [25] bslib_0.4.2 Matrix_1.5-4 tools_4.3.0
## [28] BiocGenerics_0.46.0 cachem_1.0.8