blob: 837c7fb702b53140d640d1e5dde41d86b5c1a302 [file] [log] [blame]
@menu
* Using the Ferret Library
* Portability
* Error Handling
* Utility Routines
* Data Types
* Vector and Vecset Distances
* Tables
* Multiple Modality Support
* Vector Distance Reference
* Vecset Distance Reference
* Index and Sketch Reference
@end menu
@node Using the Ferret Library
@section Using the Ferret Library
After successful installation, you only need to put the following line to your
code before the first reference to the Ferret Library data types or routines.
@smallexample
#include <cass.h>
@end smallexample
To link against the Ferret Library, add the option @option{-lcass} to the
command line of @command{ld} or @command{gcc}.
@node Portability
@section Portability
For portability purpose, the following primitive types are used in the Ferret
Library, so that the compile to the same size on all machines.
@smallexample
int32_t
uint32_t
uint64_t
uchar
@end smallexample
The database files are portable between 32-bit and 64-bit machines, but not
between machines of different endians for now. To make it portable between
different endians, only the file IO module (@file{cass_file.c}) needs to be modified.
@node Error Handling
@section Error Handling
Most of the Ferret Library routines return 0 for success and a non-zero error
code if some error happens. Following is a list of possible error codes.
@smallexample
CASS_ERR_OUTOFMEM
CASS_ERR_MALFORMATVEC
CASS_ERR_PARAMETER
CASS_ERR_IO
CASS_ERR_CORRUPTED
@end smallexample
The following function converts an error code into a string for display.
@table @asis
@item @emph{Prototype:}
@code{const char* cass_strerror (int err);}
@end table
@node Utility Routines
@section Utility Routines
This section documents the general utility data structures and routines that are
not perticularly related to nearest neighbor search, but will make program
easier. You do not need to initialize the library to use the routines
documented in this section.
@subsection Dynamic arrrays
Maintaining a dynamic (growable) array is always a pain in the plain C
language. The Ferret Library provides a set of macros to ease the programmer's
life. A dynamic array of type @code{@emph{T}} is declared as the following struct:
@smallexample
struct @{
cass_size_t inc;
cass_size_t size;
cass_size_t len;
@emph{T} *data
@} @emph{array};
@end smallexample
where @code{len} is the number of elements actually in the array and @code{size} is
the size of the memory allocated, pointed to by @code{data}. When the array
needs to grow, @code{size} is incremented by @code{inc}, and the memory is
reallocated. All @code{inc}, @code{size} and @code{len} are numbers of
elements instead of actual bytes. To access the @code{i}-th element of
@code{@emph{array}}, use @code{@emph{array}.data[i]}.
Any structure declared following the above template can be handled by the macros
documented in this section, despite the type of @code{@emph{T}} and the actual
name of the struct. Using the array declaration macro @code{ARRAY_TYPE}, the
declaration of an array can be as easy as
@smallexample
ARRAY_TYPE(@emph{T}) @emph{array};
@end smallexample
The following small example shows the operations of an array of @code{char}.
@smallexample
ARRAY_TYPE(char) s;
ARRAY_INIT(s);
ARRAY_APPEND(s, 'A');
ARRAY_APPEND(s, 'B');
ARRAY_GET(s, 0); /* should be 'A' */
ARRAY_LEN(s); /* should be 2 */
/* enumerate the array */
ARRAY_BEGIN_FOREACH(s, c)
@{
putchar(c);
@} ARRAY_END_FOREACH(s, c);
ARRAY_CLEANUP(array);
@end smallexample
The following items are to be noticed.
@itemize @bullet
@item All macros accepts the array variable instead of pointer to the variable.
@item The type of array elements can be obtained by the GCC extension
@code{typeof @code{@emph{array}.data[0]}}
@item The details of the array struct are considered exposed, and any changes
made to the struct are assumed safe, so long as the following holds:
@smallexample
len <= size
&& inc > 0
&& (size == 0 || data != NULL)
@end smallexample
@item The array struct can safely contain other elements and they will not be
touched by the macros. To add extra elements to the struct, the user need to
explicitly declare the struct, the @code{ARRAY_TYPE} macro will not help.
@item The user is responsible for managing the dynamically allocated memory
pointed to by the array element, should they be pointers.
@end itemize
@subsubsection Declaration
@table @asis
@item @emph{Macro}:
@code{ARRAY_TYPE(type) @emph{array};}
@code{@emph{array} = ARRAY_WRAPPER(data, len);}
@item @emph{Description}:
@code{ARRAY_TYPE} expands to the type declaration of a struct for a dynamic array of
the given type. @code{ARRAY_WRAPPER} makes an array variable out of a chunk of
memory allocated by @code{malloc}, specified by the pointer to the first
element, @code{data}, and the number of element, @code{len}.
@end table
@subsubsection Initialization
@table @asis
@item @emph{Macro}:
@code{ARRAY_INIT(array)}
@code{ARRAY_INIT_SIZE(array, initial_size)}
@item @emph{Description}:
Initialize an array. The latter set the size of array to @code{initial_size}
and allocates the memory of this size.
@end table
@subsubsection Cleanup
@table @asis
@item @emph{Macro}:
@code{ARRAY_CLEANUP(array)}
@item @emph{Description}:
Release the dynamically allocated memory of the array.
@end table
@subsubsection Field access
@table @asis
@item @emph{Macro}:
@code{ARRAY_INC(array)}
@code{ARRAY_SET_INC(array, inc)}
@code{ARRAY_LEN(array)}
@code{ARRAY_RAW_SIZE(array)}
@code{ARRAY_SIZE(array)}
@code{ARRAY_EXPAND(array, new_size)}
@item @emph{Description:}
These macros access the various fields of the array struct.
The macro @code{ARRAY_RAW_SIZE} returns the raw size of data in bytes actually
present in the array, which is equal to @code{@emph{array}.len *
sizeof(@emph{array}.data[0])}.
@code{ARRAY_EXPAND} will grow the array if necessary so that the size of the array is at
least @code{new_size}.
@end table
@subsubsection Element access
@table @asis
@item @emph{Macro:}
@code{ARRAY_GET(array, i)}
@code{ARRAY_SET(array, i, elem)}
@code{ARRAY_APPEND(array, elem)}
@code{ARRAY_APPEND_UNSAFE(array, elem)}
@code{ARRAY_TRUNC(array) /* same as array.len = 0 */}
@code{ARRAY_MERGE(array, array2)}
@code{ARRAY_MERGE_RAW(array, data, len)}
@item @emph{Description:}
For @code{ARRAY_GET} and @code{ARRAY_SET}, you need to ensure that @code{i <
@emph{array}.len}.
@code{ARRAY_APPEND} will automatically grow the array when
space is not enough, but @code{ARRAY_APPEND_UNSAFE} will not. The latter is
faster though, and can be used when total size of the array is known and fixed
at initialization.
@code{ARRAY_MERGE} appends all the elements in @code{array2} to @code{array},
growing @code(array) when necessary. @code{ARRAY_MERGE_RAW} is similar, but the
second array is given by the pointer to the first element of the array and the number of
elements.
@end table
@subsubsection Element enumeration
@table @asis
@item @emph{Macro:}
@smallexample
ARRAY_BEGIN_FOREACH(array, cursor) @{
...@emph{your code}...
@} ARRAY_END_FOREACH(array, cursor);
ARRAY_BEGIN_FOREACH_P(array, cursor_p) @{
...@emph{your code}...
@} ARRAY_END_FOREACH_P(array, cursor_p);
@end smallexample
@item @emph{Description:}
These two sets of macros enumerate the elements in an array. @code{cursor(_p)} does
not need to be declared outside the block (actually, any declaration of
@code{cursor(_p)} outside the block will be overridden). The former pair of
macros passes the element in value and the latter in reference, which can reduce
the overhead of copying the elements when they are large. You can use any name
for @code{cursor(_p)} so long as they match in the @code{..._BEGIN_...} and
@code{..._END_...} macros.
@end table
@subsection Heaps
The Ferret Library provides binary heap support on top of the dynamic array routines
described in the previous section. A heap is just a dynamic array, except that
the elements in the array follow certain order. The heap routines described
below directly operate on array structs.
@table @asis
@item @emph{Prototype:}
@code{HEAP_EMPTY(heap) /*boolean, whether the heap is empty */ }
@code{HEAP_HEAD(heap) /*the smallest element in the heap, in value */ }
@code{HEAP_ENQUEUE(heap, elem, ge)}
@code{HEAP_DEQUEUE(heap, ge) /* does not return the element */ }
@code{HEAP_ENQUEUE_UPDATE(heap, elem, ge, update)}
@code{HEAP_DEQUEUE_UPDATE(heap, elem, ge, update)}
@item @emph{Description:}
Both @code{HEAP_ENQUEUE} and @code{HEAP_DEQUEUE} require an extra parameter
@code{ge}, the comparator, which can be either a function pointer or a macro.
For any two variable @code{a} and @code{b} of the type of heap element,
@code{ge(&a,&b)} should return 1 if @code{a >= b} and 0 if not.
Sometimes, the elements of the heap need to known their own index in the heap.
The situation is complicated because the enqueue and dequeue operations need to
reorder some of the heap elements. The macros @code{HEAP_ENQUEUE_UPDATE} and
@code{HEAP_DEQUEUE_UPDATE} allow the user to track the offset of heap elements.
The extra parameter @code{update} and be a function pointer or a macro, and for
each element @code{a} which is rellocated to the new offset @code{i},
@code{update(&a, i)} will be invoked.
For @code{HEAP_DEQUEUE(_UPDATE)}, the head of the heap is thrown away silently.
You need to call @code{HEAP_HEAD} in advance if you need to keep the dequeued
element.
All the heap macros described here assume the elements of the array are in
binary heap order. You can safely access the heap data by other array macros in
read-only manner, but updates other than @code{ARRAY_EXPAND} are not advised
because they may destroy the heap order.
@end table
@subsection Maintaining top-K elements
Maintaining a list of the best K structs with respect to certain field is a
common job and the Ferret Toolkit provides a set of macros to do this
efficiently.
The macros documented below works on an array of @code{K} structs with a comparable
field @code{key}.
@table @asis
@item @emph{Prototype:}
@code{TOPK_INIT (array, key, K, init)}
@item @emph{Description:}
Initialize the @code{key} field of each of the @code{K} element to value
@code{init}. For @code{K} maximal values, @code{init} should be minimal value
possible, and for the minimal values the opposite.
@item @emph{Prototype:}
@code{TOPK_INSERT_MIN (array, key, K, elem)}
@item @emph{Description:}
Keep the @code{K} minimal element in @code{array} with regard to the field
@code{key}. If @code{elem} is smaller than the existing largest element in
@code{array}, it replace that largest element.
@item @emph{Prototype:}
@code{TOPK_INSERT_MAX (array, key, K, elem)}
@item @emph{Description:}
Similar to @code{TOPK_INSERT_MIN}, except for it keeps the @code{K} maximal
elements.
@item @emph{Prototype:}
@code{TOPK_SORT_MIN (array, key, K)}
@item @emph{Description:}
The above two macros use heap for high performance and the result is that the
result array is not sorted according to @code{key}. This macro sorts the array
in descending order according to @code{key}. Note that this macro can only sort
arrays prepared with @code{TOPK_INSERT_MIN}.
@item @emph{Prototype:}
@code{TOPK_INSERT_MIN_UNIQ (array, key, K, elem)}
@item @emph{Description:}
Similar to @code{TOPK_INSERT_MIN}, except it assures that every one in
@code{array} has a unique @code{index} field. For now, the name of the field
@code{index} is not customizable. This is to be fixed fixed in later versions of the
toolkit. To keep the uniqueness, the naive array insertion instead of heap is
used, and the result is sorted.
@end table
Following is a small example. Given the query point id $code{q} and the set of
data point id @code{data[]}, we want to get the @code{K} points in @code{data[]}
that is closest to @code{q}.
@smallexample
#define K 10
typedef struct @{
float key;
int index;
@} foo_t;
foo_t min[K], foo;
TOPK_INIT(min, key, K, MAXFLOAT);
for (i = 0; i < data_len; i++)
@{
foo.key = dist(q, data[i]);
foo.index = i;
TOPK_INSERT_MIN(min, key, K, foo);
@}
/* sort the result*/
TOPK_SORT_MIN(min, key, K);
@end smallexample
@subsection Portable file IO
@table @asis
@item @emph{Prototype:}
@code{typedef struct ... CASS_FILE;}
@code{CASS_FILE *cass_open (const char *path, const char *mode);}
@code{void cass_close (CASS_FILE *)};
@code{/* @emph{type} can be one of @{ int32, uint32, uint64, size, float,
double, char */}
@code{int cass_read_@emph{type} (@emph{type} *, size_t nmemb, CASS_FILE *);}
@code{int cass_write_@emph{type} (const @emph{type} *, size_t nmemb, CASS_FILE *);}
@code{char *cass_read_pchar (CASS_FILE *);}
@code{int cass_write_pchar (const char *, CASS_FILE *);}
@item @emph{Description:}
These routines intend to be a portable way of doing I/O, so that the file
produced from a machine of one architecture can be read by a machine of another
architecture. For now, the portability is not realized and these routines are
just a wrapper of stdio. @code{CASS_FILE} is same as @code{FILE} for now.
@end table
@subsection Timer
@table @asis
@item @emph{Prototype:}
@code{typedef struct ... stimer_t};
@code{void stimer_tick (stimer_t *timer);}
@code{float stimer_tuck (stimer_t *timer, const char *msg);}
@item @emph{Description:}
These routines are used to measure the wall time used by certain piece of code.
@code{stimer_tick} initialize the @code{timer} struct and record the start time,
@code{stimer_tuck} stop timing and returns the elapsed time. If @code{msg} is
not @code{NULL}, the elapsed time is printed to @code{stdout} together with
@code{msg}.
@end table
@subsection Replacement of @code{malloc}, @code{calloc} and @code{realloc}
@table @asis
@item @emph{Prototype:}
@code{@emph{type *} type_alloc (@emph{type});}
@code{@emph{type *} type_calloc (@emph{type}, cass_size_t len);}
@code{@emph{type *} type_realloc (@emph{type}, @emph{type} *ptr, cass_size_t len);}
@item @emph{Description:}
These macros are just wrappers of @code{malloc}, @code{calloc} and
@code{realloc}. They accept @code{@emph{type}} instead of
@code{sizeof(@emph{type})}, and return pointer of type @code{@emph{type *}}
instead of @code{void *}.
@end table
@subsection Vectors and Matrices
Two dimensional array, or matrix, of size @code{M*N} are stored in memory in the following way.
First, a chunk of memory to hold @code{M*N} elements are allocated. Then a
chunk of memory to hold @code{M} pointer to the rows are allocated, and
initialized to the start memory address of each row. @code{matrix3} is the
3-dimensional array.
@table @asis
@item @emph{Prototype:}
@code{@emph{type **} type_matrix_alloc (@emph(type), cass_size_t row,
cass_size_t col);}
@code{void matrix_free (@emph{type} **matrix);}
@code{@emph{type ***} type_matrix3_alloc (@emph(type), cass_size_t num,
cass_size_t row, cass_size_t col);}
@code{void matrix3_free (@emph{type} ***matrix);}
@item @emph{Description:}
Allocate/free memory for the matrix.
@item @emph{Prototype:}
@code{int type_matrix_load_stream (@emph{type}, FILE *stream,
cass_size_t *row, cass_size_t *col, @emph{type}
***matrix);}
@code{int type_matrix_load_file (@emph{type}, const char *path,
cass_size_t *row, cass_size_t *col, @emph{type}
***matrix);}
@code{int type_matrix_dump_stream (@emph{type}, FILE *stream,
cass_size_t row, cass_size_t col, @emph{type}
**matrix);}
@code{int type_matrix_dump_file (@emph{type}, const char *path,
cass_size_t row, cass_size_t col, @emph{type}
**matrix);}
@item @emph{Description:}
Read and write matrix, either with a stream or a file. These routines
are not portable and are to be modified to use @code{CASS_FILE} instead.
@item @emph{Prototype:}
@code{int type_matrix_map_file (@emph{type}, const char *path,
cass_size_t *row, cass_size_t *col, @emph{type}
***matrix);}
@code{int matrix_unmap_file (void **matrix);}
@item @emph{Description:}
Memory mapping, similar to @code{type_matrix_load_file}.
@end table
@node Library Initialization and Cleanup
@section Library Initialization and Cleanup
@table @asis
@item @emph{Prototype:}
@code{int cass_init (void);}
@code{int cass_cleanup (void);}
@item @emph{Description:}
Call @code{cass_init} before using any of the Ferret Library routines, and call
@code{cass_cleanup} at the end of your program.
@end table
@node Database Environment
@section Database Environment
@subsection Open and close
@table @asis
@item @emph{Prototype:}
@code{int cass_env_open (cass_env_t **env, char *db_home, uint32_t flags);}
@code{int cass_env_close(cass_env_t *env, uint32_t flags);}
@item @emph{Description:}
The function @code{cass_env_open} opens a Ferret database for future
use and send back the environment pointer through the first argument.
@code{db_home} should contain the path to the directory in which contains the
Ferret database. @code{flags} can be one or more of the following flags.
@table @code
@item CASS_READONLY
Open the database read-only. Without this flag specifiled, the
database will be opened read-write.
@item CASS_EXCL
If the directory does not exist, create the database; otherwise
fail.
@item
@end table
The function @code{cass_env_close} closes the database provided as
the first argument. Currently there's no flags supported and the user should
always pass in 0.
@end table
@subsection Logging
@table @asis
@item @emph{Prototype:}
@code{int cass_env_errmsg(cass_env_t *env, int error, const char *fmt, ...);}
@code{int cass_env_panic(cass_env_t *env, const char *msg);}
@item @emph{Description:}
The function @code{cass_env_errmsg} works like @code{fprintf}, except it writes
the message to the database log file.
The function @code{cass_env_panic} kills the application after printing out the
error message.
@end table
@subsection Checkpointing
@table @asis
@item @emph{Prototype:}
@code{int cass_env_checkpoint(cass_env_t *env);}
@item @emph{Description:}
Flush any changes of the database since last checkpoint into disk. This
function may lock the database and cause significant delay to the panding
modifications.
@end table
@subsection Describing
@table @asis
@item @emph{Prototype:}
@code{int cass_env_describe (cass_env_t *env, CASS_FILE *);}
@item @emph{Description:}
This function writes the textual description of the database to the provided
stream.
@end table
@node Mapping
@section Mapping
@node Vector, Vecset and Dataset
@section Vector, Vecset and Dataset
@subsection Data structures
A vecset is a set of vectors, and a dataset is a set of vecsets. A vecset
usually contains the feature vectors corresponding to an object, which is
segmented into several part, with each part described by a vector in the vecset.
A Ferret table is a dataset, which contains multiple vecsets. The three
structures are defined as follows.
@smallexample
/* vector */
typedef struct @{
float weight;
cass_vecset_id_t parent;
union @{
uchar data[0];
int32_t int_data[0];
float float_data[0];
chunk_t bit_data[0];
@};
@} cass_vec_t;
/* vecset */
typedef struct @{
uint32_t num_regions;
cass_vec_id_t start_vecid;
@} cass_vecset_t;
/* dataset */
typedef struct @{
uint32_t flags;
uint32_t loaded;
cass_size_t vec_size;
cass_size_t vec_dim;
cass_vec_id_t max_vec;
cass_vec_id_t num_vec;
void *vec;
cass_vecset_id_t max_vecset;
cass_vecset_id_t num_vecset;
cass_vecset_t *vecset;
@} cass_dataset_t;
@end smallexample
@code{cass_vec_t} and @code{cass_vecset_t} are not standalone; they depend on
information in @code{cass_dataset_t} and even
@xref{x-config,,@code{cass_vecset_cfg}} to determine how to interprete there
fields. The fields of the three structs are explained below.
First, @code{cass_dataset_t}.
@table @code
@item flags
A table can contain either vectors or vecsets, or both, and this field specifies
what is contained in the dataset. It can be one of the following values.
@table @code
@item CASS_DATASET_VEC
The datset contains only vector data.
@item CASS_DATASET_VECSET
The dataset contains only vecset data.
@item CASS_DATASET_BOTH
The dataset contains both vectors and vecsets.
@end table
@item loaded
Whether the feature data (vectors and/or vecsets) are in memory and can be
accessed through @code{vec}/@code{vecset} field.
@item vec_size
Size of a vector in bytes, including a 32-bit weight and the real feature data.
@item vec_dim
The dimension of the feature vector.
@item max_vec
The maximum number of vectors the memory allocated in @code{vec} can hold.
@item num_vec
The real number of vectors in the dataset.
@item vec
Pointer to the memory holding the vectors.
@item max_vecset
The maximum number of vecsets the memory allocated in @code{vecset} can hold.
@item num_vecset
The real number of vecsets in the dataset.
@item vecset
Pointer to the memory holding the vecsets.
@end table
Second, @code{cass_vecset_t}.
@table @code
@item num_regions
Number of vectors in this vecset.
@item start_vecid
The offset of the first vector in the dataset. The vectors in dataset that
belong to this vecset are those numbered from @code{start_vecid} to
@code{(start_vecid + num_regions -1)}. Thoses numbers are only meaningful to
the dataset that contains the vecset.
@end table
Finally, @code{cass_vec_t}.
@table @code
@item weight
The weight of the vector inside the vecset.
@item parent
The id of the vecset which contains the vector.
@item [int_|float_|bit_]data
This is the real data of the vector.
Different field of the union should be
used according to the configuration of the vecset. Also, because the
dimension is variable, these fields works as a place holder.
@end table
@code{cass_vec_t} requires a bit more explanation. This struct actually works
as a placeholder, and when the dimensionality of the vector is determined, the
real data will always be larger than the struct in size. In other words,
@code{sizeof(cass_vec_t)} should never be used except for taking the size
of the vector without the real feature data. The variable size of
@code{cass_vec_t} makes it a little tricky to locate a perticular vector in a
dataset. The following piece of code illustrates how to do the job.
@smallexample
/* Get the i-th vector in the dataset */
cass_vec_t *vec = (void *)dataset->vec + dataset->vec_size * i;
@end smallexample
@subsection Dataset routines
The definition of all the above defined structures are considered explicit, and
are safe to be modified by the user. The following routines make common
operations on datasets easier.
@table @asis
@item @emph{Prototype:}
@code{int cass_dataset_init (cass_dataset_t *ds, cass_size_t vec_size,
cass_size_t vec_dim, uint32_t flags);}
@item @emph{Description:}
Initialize a dataset, without allocating memory to hold the feature data.
@item @emph{Prototype:}
@code{int cass_dataset_grow (cass_dataset_t *ds, cass_size_t num_vecset,
cass_size_t num_vec); }
@item @emph{Description:}
(Re-)allocate memory of the dataset so that it can hold at least
@code{num_vecset} vecsets and @code{num_vec} vectors.
@item @emph{Prototype:}
@code{int cass_dataset_release (cass_dataset_t *ds);}
@item @emph{Description:}
Release the memory of the feature data. Not actually frees the @code{ds}
pointer.
@item @emph{Prototype:}
@code{int cass_dataset_merge (cass_dataset_t *ds, const cass_dataset_t *src,
cass_vecset_id_t start, cass_vecset_id_t num, cass_dataset_map_t
map, void *map_param);}
@item @emph{Description:}
Merge the @code{num} vecsets start from @code{start} of the dataset @code{src}
to the dataset @code{ds}, grow @code{ds} when necessary. If the dataset
contains both vectors and vecsets, the @code{parent} field of vectors are
updated according to the new position of the corresponding vecsets.
However, if the dataset contains vector only, the @code{parent} field is
directly copied.
@item @emph{Prototype:}
@code{int cass_dataset_checkpoint (cass_dataset_t *ds, CASS_FILE *);}
@code{int cass_dataset_restore (cass_dataset_t *ds, CASS_FILE *);}
@item @emph{Description:}
Read/write the meta data.
@item @emph{Prototype:}
@code{int cass_dataset_load (cass_dataset_t *ds, CASS_FILE *in);}
@code{int cass_dataset_dump (cass_dataset_t *ds, CASS_FILE *out);}
@item @emph{Description:}
Read/write the feature data.
@item @emph{Description:}
@end table
@node Class, Instance and Registry
@section Class, Instance and Registry
The Ferret Toolkit is designed such that many components work as plugins and can
be customized. For example, there are different index/sketch plugins that
meet different requirements of precision, speed and storage overhead. One
specific index algorithm can be viewd as a class, and it is instanciated when a
real index structure is created upon a dataset. The vecset distance and vector
distance also follow this paradigm. For example, a L2 distance algorithm that
only evaluate with a certain subset of all the dimensions is a class, and it is
instanciated when the specific interested subset of the dimensions to use is
provided. A class is defined with a number of properties and a set of method,
and a instance of a class also has a set of properties and a pointer to
the class. Following is an example showing the structs describing the vector distance
class and vector distance instances.
@smallexample
typedef cass_dist_t (*cass_vec_dist_func_t) (cass_size_t n, void *, void *, void *);
/* this is the class */
typedef struct @{
char *name;
cass_vec_type_t vec_type;
cass_vec_dist_type_t type;
cass_vec_dist_func_t dist;
int (*describe) (void *, CASS_FILE *);
int (*construct) (cass_vec_dist_t **, const char *);
int (*checkpoint) (void *, CASS_FILE *);
int (*restore) (void **, CASS_FILE *);
void (*free) (void *);
@} cass_vec_dist_class_t;
/* this is the instance */
typedef struct @{
uint32_t refcnt;
char *name;
cass_vec_dist_class_t *class;
/* private data... */
@} cass_vec_dist_t;
@end smallexample
Given a struct @code{dist_class} of the class, an instance can be created with
@code{dist_class->construct}, which returns the constructed instance through
its first argument. Because different class accept different parameters, a
string is used to encode the parameters. The parameter string could be
something like @code{"-M 128 -W 3.2"}.
See @xref{x-extend,,Extending Ferret} for more information.
Both class and instance are identified by an internal ID and an external textual name,
and a registry is used to map between the ID, name and the pointer to the struct.
Each type of class/instance has its own name space, and the namespaces of class
and instance are seperate. For example, you can create a distance instance
"trivial" from the distance class "trivial".
Following are the registry routines.
@table @asis
@item @emph{Prototype:}
@code{typedef struct ... cass_reg_t;}
@item @emph{Description:}
This is the registry type.
@item @emph{Prototype:}
@code{int cass_reg_init (cass_reg_t *reg);}
@code{int cass_reg_init_size (cass_reg_t *reg, cass_size_t size);}
@item @emph{Description:}
Initialize a registry. The latter initialize the registry to hold @code{size}
entries, and is used when the size of registry is known and fixed in advance.
@item @emph{Prototype:}
@code{int cass_reg_cleanup (cass_reg_t *reg);}
@item @emph{Description:}
Cleanup the registry, release the memory.
@item @emph{Prototype:}
int32_t cass_reg_lookup (cass_reg_t *, const char *name);
@item @emph{Description:}
Map name to ID. If the name does not exist, -1 is returned.
@item @emph{Prototype:}
int32_t cass_reg_find (cass_reg_t *, const void *ptr);
@item @emph{Description:}
Map pointer to ID. If the pointer does not exist in the registry, -1 is
returned.
@item @emph{Prototype:}
void *cass_reg_get (cass_reg_t *, uint32_t i);
@item @emph{Description:}
Map ID to pointer.
@item @emph{Prototype:}
int cass_reg_add (cass_reg_t *, const char *name, void *);
@item @emph{Description:}
Add an item to the registry, return the ID of the newly added item.
@end table
@node Vector Distance and Vecset Distance
@section Vector Distance and Vecset Distance
To enable customization, the vector distance and vecset distance come with a
parameter instead of simply a pointer to function, and the parameter and pointer
to function are wrapped up in a struct. Following are the relative declarations
of vector distance. Both vector/vecset distance follow the class/instance
model.
Following are declarations relative to vector distance.
@smallexample
typedef cass_dist_t (*cass_vec_dist_func_t) (cass_size_t n, void *v1, void *v2,
cass_vec_dist_t *);
typedef struct _cass_vec_dist_class {
char *name;
cass_vec_type_t vec_type;
cass_vec_dist_type_t type;
cass_vec_dist_func_t dist;
int (*describe) (void *, CASS_FILE *);
int (*construct) (void **, const char *param);
int (*checkpoint) (void *, CASS_FILE *);
int (*restore) (void **, CASS_FILE *);
void (*free) (void *);
} cass_vec_dist_class_t;
typedef struct {
uint32_t refcnt;
char *name;
cass_vec_dist_class_t *class;
/* private data... */
} cass_vec_dist_t;
@end smallexample
Following are the explanation of some of the fields of
@code{cass_vec_dist_class_t}.
@table @code
@item vec_type
The type of vector on which this distance is defined.
@item type
Type of the distance. Can be one of the following values.
@table @code
@item CASS_VEC_DIST_TYPE_TRIVIAL
@item CASS_VEC_DIST_TYPE_ID
@item CASS_VEC_DIST_TYPE_L1
@item CASS_VEC_DIST_TYPE_L2
@item CASS_VEC_DIST_TYPE_MAX
@item CASS_VEC_DIST_TYPE_HAMMING
@item CASS_VEC_DIST_TYPE_COS
@end table
Note that the distance type itself do not determine the distance. For example,
L1 distance evaluate using all the dimensions and those evaluated using a
subset of the dimensions are all of the type @code{CASS_VEC_DIST_TYPE_L1}.
@item dist
The pointer to the function.
@item describe
Writes textual describing information to the stream.
@item construct
Construct the distance instance using the given textual parameter.
@item checkpoint
Write the meta data of distance into a stream.
@item restore
Construct a distance instance from a stream.
@item free
Free the distance instance.
@end table
The fields of @code{cass_vec_dist_t} actually do not need explanation. One
thing to note that @code{cass_vec_dist_t} is a basic type, and for different
distance classes, there can be more specific types which can hold private
information. For example, if only certain subset of all dimensions are
interested, then the distance can be defined as follows:
@smallexample
typedef struct
@{
cass_vec_dist_t base;
ARRAY_TYPE(int) dim; /* array of interested dimensions */
@} cass_l1_dim_t;
@end smallexample
One can then safely cast @code{cass_l1_dim_t *} to @code{cass_vec_dist_t *}.
Following are routines to access the distance classes. They are just
specialized version of the vector distance class registry.
@table @asis
@item @emph{Prototype:}
@code{int32_t cass_vec_dist_class_lookup (const char *n);}
@code{int32_t cass_vec_dist_class_find (const cass_vec_dist_class_t *p);}
@code{cass_vec_dist_class_t *cass_vec_dist_class_get (uint32_t i);}
@end table
The declaration of vecset distance is similar.
@smallexample
typedef cass_dist_t (*cass_vecset_dist_func_t) (cass_dataset_t *, cass_vecset_id_t,
cass_dataset_t *, cass_vecset_id_t , cass_vec_dist_t *vec_dist,
cass_vecset_dist_t *);
typedef struct _cass_vecset_dist_class{
char *name;
cass_vecset_type_t vecset_type;
cass_vecset_dist_type_t type;
cass_vecset_dist_func_t dist;
/* void ** is actually cass_vecset_dist_t ** */
int (*describe) (void *, CASS_FILE *);
int (*construct) (void **, const char *param);
int (*checkpoint) (void *, CASS_FILE *);
int (*restore) (void **, CASS_FILE *);
void (*free) (void *);
/* private data... */
} cass_vecset_dist_class_t;
typedef struct {
uint32_t refcnt;
char *name;
int32_t vec_dist;
cass_vecset_dist_class_t *class;
} cass_vecset_dist_t;
@end smallexample
There are also routines to access the vecset distance class registry.
@table @asis
@item @emph{Prototype:}
@code{int32_t cass_vecset_dist_class_lookup (const char *n);}
@code{int32_t cass_vecset_dist_class_find (const cass_vecset_dist_class_t *p);}
@code{cass_vecset_dist_class_t *cass_vecset_dist_class_get (uint32_t i);}
@end table
@node Configurations and Tables
@section Configurations and Tables
@anchor{x-config}
Just like a table in a relational database has a scheme, a table in a Ferret
database has a configuration, which is specified by the user when the table is
created. Following is the definition of the configuration struct.
@smallexample
typedef struct @{
uint32_t refcnt;
char *name;
cass_vecset_type_t vecset_type;
cass_vec_type_t vec_type;
cass_size_t vec_dim;
cass_size_t vec_size;
uint32_t flags;
@} cass_vecset_cfg_t;
@end smallexample
The fields are explained below.
@table @code
@item refcnt
The reference count of the structure. Unlike the scheme in the relational
database, which is always associated with a specific table, a configuration in a
Ferret database is a independent object, and can be shared among more than one
tables. The reference count is used to maintain the number of tables, or other
objects that keep a pointer to the configuration struct.
@item name
The name of the configuration. The memory of the string is owned by the struct,
and will be freed when the configuration is destroyed.
@item vecset_type
Type of the vecset, can be one of the following values:
@table @code
@item CASS_VECSET_SINGLE
Any vecset in the table always has one and only one vector.
@item CASS_VECSET_SET
A vecset can have more than one vectors. This is the most common case.
@item CASS_VECSET_NONE
The table do not keep data for vecset, but only keep data for vectors. This is
used for sketch methods which disregard the vecset and only deal with vectors.
@end table
@item vec_type
The type of the vector, can be one of the following values:
@table @code
@item CASS_VEC_INT
32-bit integer, or @code{int32_t}.
@item CASS_VEC_FLOAT
32-bit floating point number, or @code{float}.
@item CASS_VEC_BIT
The feature vector is a bit vector.
@item CASS_VEC_QUANT
The feature vector is some kind of quantization. Used for certain index method.
@end table
@item vec_dim
The dimension of the feature vector. For bit vectors, it means the number of
bits.
@item vec_size
The size of the feature vector. Which include the feature vector and a weight.
@item flags
For now always set 0.
@end table
@subsection Table Creation
@table @asis
@item @emph{Prototype:}
@code{int cass_table_create (cass_table_t **table,
cass_env_t *env,
char *table_name,
uint32_t opr_id,
int32_t cfg_id,
int32_t parent_id,
int32_t parent_cfg_id,
int32_t map_id,
char *param);}
@item @emph{Description:}
This routine creates a new table. Note that this routine does not associate the
file with its parent. To do this, use
@xref{x-tableassociate,,@code{cass_table_associate}}.
@table @code
@item table
The pointer to the newly created table is return through this pointer.
@item env
The environment in which the table is created.
@item table_name
Name of the table.
@item opr_id
The ID of the table class. See @xref{x-index,,Index and Sketch Reference} for
possible values.
@item cfg_id
The ID of the table configuration. For tables do not need a configuration, like
index, @code{cfg_id} can be -1.
@item parent_id
The ID of the parent table. For tables without a parent, use -1.
@item parent_cfg_id
The ID of the parent table's configuration. Can be -1.
@item map_id
The ID of the map object to use.
@item param
Extra parameters to pass to the detail implementation. See @xref{x-index,,Index
and Sketch Reference} for details.
@end table
@end table
@subsection Free a Table
@table @asis
@item @emph{Prototype:}
@code{int cass_table_free(cass_table_t *table);}
@item @emph{Description:}
This routine frees the table from the main memory. The on-disk table is not
destroyed.
@end table
@subsection Describe a Table
@table @asis
@item @emph{Prototype:}
@code{int cass_table_describe (cass_table_t *, CASS_FILE *);}
@item @emph{Description}
This routine writes textual description of the table to the provided stream.
@end table
@subsection Import and Export Data
@table @asis
@item @emph{Prototype:}
@code{int cass_table_import_data(cass_table_t *table, char *fname); }
@code{int cass_table_export_data(cass_table_t *table, char *fname); }
@item @emph{Description:}
These two routines import/export a table from/to a external text file. See
@xref{x-fileformat,,Data File Format} for details of the text file format.
@end table
@subsection Table Association
@anchor{x-tableassociate}
@table @asis
@item @emph{Prototype:}
@code{int cass_table_associate (cass_table_t *table, int32_t child);}
@code{int cass_table_disassociate (cass_table_t *table, int32_t child);}
@code{cass_table_t *cass_table_parent (cass_table_t *table);}
@item @emph{Description:}
@code{cass_table_associate} associates table @code{child} (the ID) to @code{table}, so @code{child} becomes a
child of @code{table}. When there are insertion to a table, the same data are
automatically inserted to its children.
@code{cass_table_disassociate} revokes the relationship.
@code{cass_table_parent} returns the parent of the table, or @code{NULL} when
the table has no parent.
@end table
@subsection Load and Release Feature Data
@table @asis
@item @emph{Prototype:}
@code{int cass_table_load (cass_table_t *table);}
@code{int cass_table_release (cass_table_t *table);}
@item @emph{Description:}
When the database is opened, the meta data of all the tables, as well as other
objects are loaded into memory. However, the real feature data,
(or index/sketches) are not. In this way, the system knows the structural
information of the database without too much memory overhead. The feature data
are only loaded when necessary and are released when not needed. If the feature
data are modified, they are automatically dumped to disk before being released.
You can refer to the @code{loaded} field of @code{cass_table_t} to determine if
the feature data are loaded.
The feature data must be loaded before data insertions or queries.
@end table
@subsection Data Insertion
@table @asis
@item @emph{Prototype:}
@code{int cass_table_batch_insert (cass_table_t *table, cass_dataset_t *dataset,
cass_vecset_id_t start, cass_vecset_id_t end);}
@item @emph{Description:}
This routine inserts the vecsets in @code{dataset} indexed by the range [@code{start}, @code{end}]
into @code{table}.
@end table
@node Queries
@section Queries
There are two types of queries in the Ferret toolkit -- range queries and K-NN
queries. K-NN queries are well supported and there are index/sketches for
various distance measures. For each query, the user specifies one query vecset
and the required range/# nearest neighbors, and the system tries the best to
return the results. Queries can be in two levels -- vecset level or vector
level. In vecset level, the vecsets in the table that are closest to the query
vecset or within the specified range are returned. In vector level,
a queries is carried out for each vector in the query vecset, and the results
are returned in multiple sets. Also, in vector level queries, the dataset is
the union of all the vecsets in the table, and the vecset-vector relationship is
disregarded. That means, the more than one vectors in the result set can belong
to the same vecset. Further more, the result sets can either be returned as an
array or as a bitmap.
To issue a query, the user need to prepare a query data structure to specify the
query information, and provide a result structure to hold the return value.
The related
structures are described below.
@smallexample
typedef struct @{
cass_id_t id;
cass_dist_t dist;
@} cass_list_entry_t;
typedef ARRAY_TYPE(cass_list_entry_t) cass_list_t;
typedef struct @{
uint32_t flags;
union @{
bitmap_t bitmap;
cass_list_t list;
ARRAY_TYPE(bitmap_t) bitmaps;
ARRAY_TYPE(cass_list_t) lists;
@};
@} cass_result_t;
typedef struct @{
uint32_t flags;
cass_dataset_t *dataset;
cass_id_t vecset_id;
cass_size_t topk;
cass_dist_t range;
char *extra_params;
cass_result_t *candidate;
int32_t vec_dist_id;
int32_t vecset_dist_id;
@} cass_query_t;
@end smallexample
The @code{cass_result_t} is a union plus a @code{flags}, which also appears in
@code{cass_query_t} and gives information on how to interprete the union and other
requirements. @code{flags} in @code{cass_query_t} specifies the user requirement
and @code{flags} in @code{cass_result_t} specify the system response. The
possible values of the two are essentially the same.
Specifically, the user can specify one of the following four
values in the @code{flags} of the @code{cass_query_t} structure, to specify
which representation of the result in @code{cass_result_t} is required.
@table @code
@item CASS_RESULT_BITMAP
The @code{bitmap} field of the union is used. The query should be in vecset
level.
@item CASS_RESULT_LIST
The @code{list} field of the union is used. The query should also be in vecset
level.
@item CASS_RESULT_BITMAPS
The @code{bitmaps} field of the union is used. The query should be in vector
level.
@item CASS_RESULT_LISTS
The @code{lists} field of the union is used. The query should also be in vector
level.
@end table
Note that for vector level queries, even if there is always only one vector in
the vecset (the vecset is of the type @code{CASS_VECSET_SINGLE}), either
@code{bitmaps} or @code{lists} should be used.
The @code{flags} can also have one or more of the following ORed to its value.
@table @code
@item CASS_RESULT_MALLOC
The memory to hold the real data in @code{list}/@code{lists} have not been
allocated by the user and should be allocated by the system.
@item CASS_RESULT_USERMEM
The user has allocated memory in @code{list}/@code{lists} and the system should
use the user provided memory. This is useful when there are multiple contiguous
queries, allowing the user to allocate memory once and reuse for the following
queries .
@item CASS_RESULT_REALLOC
The user should not set this flag, but if it appears in the return value, it
means that the user allocated memory is not enough and the system has
reallocated the memory.
@item CASS_RESULT_SORT
The user requires that the result list to be sorted.
@item CASS_RESULT_DIST
The user requires that the @code{dist} field of @code{cass_list_entry_t} be filled if
@code{list} or @code{lists} is used. This value means the distance of the
corresponding vector/vecset to the query vector/vecset.
@end table
It is important to note that all the above flags are treated as suggestions
rather than mandatories. The user should always check the @code{flags} of
@code{cass_result_t} and interpret the result accordingly. For example, some of the
index/sketch algorithm do not refer to the original feature data, and have no
way to figure out the distance between the data vector/vecset and the query
vector/vecset, so even if the user requires @code{CASS_RESULT_DIST}, it will not
be returned. Some of the index/sketch algorithm keep the K-NN's in a heap, and
it will be cheap to sort the heap with heap sort than regular quicksort. In
that case, if the user requres @code{CASS_RESULT_SORT}, the results will be
sorted. For other algorithms, if sorting within the query procedure is no
faster than outside it by the user, the algorithm will disregard the sorting
requirement. Finally, some of the index method, like LSH, will always return
bitmaps instead of lists.
The other fields of the @code{cass_query_t} are explained below.
@table @code
@item dataset
@item vecset_id
The vecset in @code{dataset} specified by @code{vecset_id} is used as the query
vecset.
@item topk
@item range
If @code{topk > 0}, the system does the K-NN query; otherwise, the system does
the range query which is specified by @code{range}.
@item extra_params
The extra parameters, as a string, passed to the query algorithm. The string is
interpreted differently by different query index/sketch algorithms.
@item candidate
If @code{candidate != NULL}, then the query is carried out within the data
provided by @code{candidate} instead of the whole dataset. The often happens
when the user wants to refine the result of a fast but inaccurate index
algorithm with a more accurate one. In that case, @code{candidate} can directly
point to the previous query result. In the case of vector level query, the
order of lists/bitmaps in @code{candidate} should be the same as the order of
vectors in the query vecset.
@item vec_dist_id
@item vecset_dist_id
The vector distance and vecset distance to be used. If it is a vector level
query, @code{vecset_dist_id} is disregarded.
@end table
The following routines are used to issues queries to tables.
@table @asis
@item @emph{Prototype:}
@code{int cass_table_query (cass_table_t *table, cass_query_t *query,
cass_result_t *result);}
@code{int cass_table_batch_query (cass_table_t *table, uint32_t count,
cass_query_t **queries, cass_result_t **results);}
@item @emph{Description:}
The function of the batch query is same as doing single query multiple times.
But for some index/sketch algorithm, doing multiple queries in a batch allows
higher performance. But this is not always the case.
@end table
@node Multiple Modality Support
@section Multiple Modality Support
@node Vector Distance Reference
@section Vector Distance Reference
@include vecdist.texi
@node Vecset Distance Reference
@section Vecset Distance Reference
@include vecsetdist.texi
@node Index and Sketch Reference
@section Index and Sketch Reference
@include index.texi