kripodb.hdf5¶

Similarity matrix using hdf5 as storage backend.

class kripodb.hdf5.AbstractSimpleTable(table, append_chunk_size=100000000)[source]¶

Abstract wrapper around a HDF5 table

Parameters:

table (tables.Table) – HDF5 table
append_chunk_size (int) – Number of rows to append in one go. Defaults to 1e8, which requires 2 GB during append when a table row takes 10 bytes.

Attributes:

table (tables.Table) – HDF5 table
append_chunk_size (int) – Number of rows to read from other table during append.

append(other)[source]¶

Append rows of other table to self

Parameters:	other – Table of same type as self
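The chunked append described above can be illustrated with a minimal sketch. It uses plain Python lists as stand-ins for HDF5 tables; the function name `append_in_chunks` and the sample rows are hypothetical and not part of kripodb.

```python
def append_in_chunks(target, source, append_chunk_size=100000000):
    """Append all rows of `source` to `target`, copying at most
    `append_chunk_size` rows at a time."""
    for start in range(0, len(source), append_chunk_size):
        # Only this slice is held in memory while it is copied.
        chunk = source[start:start + append_chunk_size]
        target.extend(chunk)
    return target


# Hypothetical rows of (frag_id1, frag_id2, raw_score)
rows_a = [(1, 2, 900), (1, 3, 450)]
rows_b = [(2, 3, 700), (2, 4, 120), (3, 4, 300)]
append_in_chunks(rows_a, rows_b, append_chunk_size=2)
```

A smaller chunk size lowers peak memory use at the cost of more copy operations.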

class kripodb.hdf5.LabelsLookup(h5file, expectedrows=0)[source]¶

Table to look up label of fragment by id or id of fragment by label

When table does not exist in h5file it is created.

Parameters:

h5file (tables.File) – Object representing an open hdf5 file
expectedrows (int) – Expected number of pairs to be added. Required when similarity matrix is opened in write mode, helps optimize storage

by_id(frag_id)[source]¶

Look up label of fragment by id

Parameters:	frag_id (int) – Fragment identifier
Raises:	`IndexError` – When id of fragment is not found
Returns:	Label of fragment
Return type:	str

by_label(label)[source]¶

Look up id of fragment by label

Parameters:	label (str) – Fragment label
Raises:	`IndexError` – When label of fragment is not found
Returns:	Fragment identifier
Return type:	int

by_labels(labels)[source]¶

Look up ids of fragments by label

Parameters:	labels (set[str]) – Set of fragment labels
Raises:	`IndexError` – When label of fragment is not found
Returns:	Set of fragment identifiers
Return type:	set[int]
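The three lookups above share one contract: a missing id or label raises `IndexError`. A minimal in-memory sketch of that contract follows; the class `LabelIndex` and the labels `frag_a` and `frag_b` are hypothetical stand-ins, not the HDF5-backed implementation.

```python
class LabelIndex:
    """Hypothetical in-memory stand-in for a label/id lookup table."""

    def __init__(self, label2id):
        self._label2id = dict(label2id)
        # Reverse mapping so lookups by id are as cheap as lookups by label.
        self._id2label = {frag_id: label
                          for label, frag_id in self._label2id.items()}

    def by_id(self, frag_id):
        try:
            return self._id2label[frag_id]
        except KeyError:
            raise IndexError('Fragment id {0} not found'.format(frag_id))

    def by_label(self, label):
        try:
            return self._label2id[label]
        except KeyError:
            raise IndexError('Fragment label {0} not found'.format(label))

    def by_labels(self, labels):
        # Fails with IndexError as soon as one label is unknown.
        return {self.by_label(label) for label in labels}


index = LabelIndex({'frag_a': 1, 'frag_b': 2})
```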

keep(other, keep)[source]¶

Copy content of self to other and only keep given fragment identifiers

Parameters:

other (LabelsLookup) – Labels table to fill
keep (set[int]) – Fragment identifiers to keep

label2ids()[source]¶

Return whole table as a dictionary

Returns:	Dictionary with label as key and frag_id as value.
Return type:	dict

merge(label2id)[source]¶

Merge label2id dict into self

When a label does not exist, an id is generated and the label/id pair is added. When a label does exist, the id of the label in self is kept.

Parameters:	label2id (dict) – Dictionary with fragment label as key and fragment identifier as value.
Returns:	Dictionary of label/id pairs which were in label2id, but missing in self
Return type:	dict
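The merge rules above can be sketched with plain dictionaries. How new ids are generated is not stated here; this sketch assumes they continue after the largest id already present. The function `merge_labels` and the labels are hypothetical.

```python
def merge_labels(existing, label2id):
    """Merge `label2id` into `existing` (dict of label -> id).

    Returns the label/id pairs that were missing from `existing`.
    """
    # Assumption: generated ids continue after the current maximum id.
    next_id = max(existing.values(), default=0) + 1
    missing = {}
    for label in label2id:
        if label not in existing:
            # The id supplied in label2id is ignored; a fresh one is generated.
            missing[label] = next_id
            next_id += 1
    existing.update(missing)
    return missing


existing = {'frag_a': 1, 'frag_b': 2}
added = merge_labels(existing, {'frag_b': 7, 'frag_c': 8})
```

In this sketch `frag_b` keeps its existing id and only `frag_c` is reported as added.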

skip(other, skip)[source]¶

Copy content of self to other and skip given fragment identifiers

Parameters:

other (LabelsLookup) – Labels table to fill
skip (set[int]) – Fragment identifiers to skip

update(label2id)[source]¶

Update labels lookup by adding labels in label2id.

Parameters:	label2id (dict) – Dictionary with fragment label as key and fragment identifier as value.

class kripodb.hdf5.PairsTable(h5file, expectedrows=0)[source]¶

Table to store similarity score of a pair of fragment fingerprints

When table does not exist in h5file it is created.

Parameters:

h5file (tables.File) – Object representing an open hdf5 file
expectedrows (int) – Expected number of pairs to be added. Required when similarity matrix is opened in write mode, helps optimize storage

score_precision¶: int – Similarity score is a fraction; it is converted to an int by multiplying it by the precision
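The conversion between fraction and integer score can be sketched as below. The precision of 1000 and the use of rounding (rather than truncation) are assumptions made for illustration; the helper names are hypothetical.

```python
def to_raw_score(score, precision):
    """Convert a fraction score to its integer representation."""
    # Assumption: the product is rounded to the nearest integer.
    return int(round(score * precision))


def to_fraction_score(raw_score, precision):
    """Convert an integer score back to a fraction."""
    return raw_score / precision


precision = 1000  # hypothetical value, chosen for readability
raw = to_raw_score(0.4567, precision)
fraction = to_fraction_score(raw, precision)
```

Storing the integer loses any digits beyond the chosen precision, so the round trip returns 0.457 rather than 0.4567.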

full_matrix¶: bool – Matrix is filled above and below diagonal.

append(other)[source]¶

Append rows of other table to self

Parameters:	other – Table of same type as self

count(frame_size, raw_score=False)[source]¶

Count occurrences of each score

Parameters:

frame_size (int) – Size of matrix loaded each time. Larger requires more memory and smaller is slower.
raw_score (bool) – Return raw int16 score or fraction score

Returns:	Score and number of occurrences
Return type:	Tuple[(str, int)]
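Counting in frames can be sketched as follows, assuming the scores are available as a sequence of raw integers. The function `count_scores`, the `precision` argument and the sample scores are hypothetical; a list slice stands in for reading a frame from the HDF5 table.

```python
from collections import Counter


def count_scores(raw_scores, frame_size, precision=None):
    """Yield (score, count) tuples, reading `frame_size` scores at a time.

    When `precision` is given, raw scores are converted to fractions.
    """
    counts = Counter()
    for start in range(0, len(raw_scores), frame_size):
        # Only one frame is held in memory at a time.
        frame = raw_scores[start:start + frame_size]
        counts.update(frame)
    for raw, occurrences in sorted(counts.items()):
        score = raw if precision is None else raw / precision
        yield score, occurrences


raw_scores = [450, 900, 450, 700, 450]
raw_counts = list(count_scores(raw_scores, frame_size=2))
fraction_counts = list(count_scores(raw_scores, frame_size=2, precision=1000))
```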

find(frag_id, cutoff, limit)[source]¶

Find fragment hits which have a similarity score with frag_id above cutoff.

Parameters:

frag_id (int) – Query fragment identifier
cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
limit (int) – Maximum number of hits. Default is None for no limit.

Returns:	List of tuples, where the first tuple value is the hit fragment identifier and the second value is the similarity score
Return type:	List[Tuple]
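A minimal sketch of such a search over in-memory pairs is shown below. It assumes that each pair is stored once as (frag_id1, frag_id2, raw_score), that the query can appear in either column, and that hits are returned with the highest score first; none of these details are confirmed here. The function `find_hits` and the precision of 1000 are hypothetical.

```python
def find_hits(pairs, frag_id, cutoff, limit=None, precision=1000):
    """Return (hit_id, score) tuples for pairs that involve `frag_id`."""
    hits = []
    for id1, id2, raw_score in pairs:
        score = raw_score / precision
        if score < cutoff:
            # Scores below the cutoff are discarded.
            continue
        if id1 == frag_id:
            hits.append((id2, score))
        elif id2 == frag_id:
            hits.append((id1, score))
    # Assumption: best hits first.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits if limit is None else hits[:limit]


pairs = [(1, 2, 900), (1, 3, 450), (2, 3, 700), (1, 4, 620)]
hits = find_hits(pairs, frag_id=1, cutoff=0.5)
best_hit = find_hits(pairs, frag_id=1, cutoff=0.5, limit=1)
```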

keep(other, keep)[source]¶

Copy pairs from self to other and keep given fragment identifiers and the identifiers they pair with.

Parameters:

other (PairsTable) – Pairs table to fill
keep (set[int]) – Fragment identifiers to keep

Returns:	Fragment identifiers that have been copied to other
Return type:	set[int]

skip(other, skip)[source]¶

Copy content from self to other and skip given fragment identifiers

Parameters:

other (PairsTable) – Pairs table to fill
skip (set[int]) – Fragment identifiers to skip
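The difference between keep and skip can be sketched on in-memory pairs. The sketch assumes that keep copies a pair when at least one of its identifiers is in the keep set, and that skip drops a pair when either identifier is in the skip set. The function names and sample pairs are hypothetical.

```python
def keep_pairs(pairs, keep):
    """Return the pairs touching `keep` and all identifiers that were copied."""
    kept = [pair for pair in pairs if pair[0] in keep or pair[1] in keep]
    copied_ids = set()
    for id1, id2, _ in kept:
        # Partners of kept fragments are copied as well.
        copied_ids.update((id1, id2))
    return kept, copied_ids


def skip_pairs(pairs, skip):
    """Return the pairs in which neither identifier is in `skip`."""
    return [pair for pair in pairs
            if pair[0] not in skip and pair[1] not in skip]


pairs = [(1, 2, 900), (1, 3, 450), (2, 3, 700), (3, 4, 620)]
kept, copied = keep_pairs(pairs, keep={1})
remaining = skip_pairs(pairs, skip={1})
```

Under these assumptions the set returned by keep is larger than the set passed in, because it includes the partners of the kept fragments.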

update(similarities_iter, label2id)[source]¶

Store pairs of fragment identifiers with their similarity score

Parameters:

similarities_iter (Iterator) – Iterator which yields (label1, label2, similarity_score)
label2id (Dict) – Lookup with fragment label as key and fragment identifier as value
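The translation from labelled similarities to storable rows can be sketched as follows. The generator `to_id_pairs`, the precision of 1000 and the labels are hypothetical, and rounding the score is an assumption.

```python
def to_id_pairs(similarities_iter, label2id, precision=1000):
    """Yield (frag_id1, frag_id2, raw_score) for each labelled similarity."""
    for label1, label2, score in similarities_iter:
        # A label missing from label2id raises KeyError in this sketch.
        yield label2id[label1], label2id[label2], int(round(score * precision))


label2id = {'frag_a': 1, 'frag_b': 2, 'frag_c': 3}
similarities = iter([('frag_a', 'frag_b', 0.9), ('frag_b', 'frag_c', 0.7)])
rows = list(to_id_pairs(similarities, label2id))
```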

class kripodb.hdf5.SimilarityMatrix(filename, mode='r', expectedpairrows=None, expectedlabelrows=None, cache_labels=False, **kwargs)[source]¶

Similarity matrix

Parameters:

filename (str) – File name of hdf5 file to write or read similarity matrix from
mode (str) – Can be ‘r’ for reading or ‘w’ for writing
expectedpairrows (int) – Expected number of pairs to be added. Required when similarity matrix is opened in write mode, helps optimize storage
expectedlabelrows (int) – Expected number of labels to be added. Required when similarity matrix is opened in write mode, helps optimize storage
cache_labels (bool) – Cache labels, speed up label lookups

h5file¶: tables.File – Object representing an open hdf5 file

pairs¶: PairsTable – HDF5 Table that contains pairs

labels¶: LabelsLookup – Table to look up label of fragment by id or id of fragment by label

append(other)[source]¶

Append data from other similarity matrix to self

Parameters:	other (SimilarityMatrix) – Other similarity matrix

close()[source]¶: Closes the hdf5 file

count(frame_size, raw_score=False, lower_triangle=False)[source]¶

Count occurrences of each score

Parameters:

frame_size (int) – Size of matrix loaded each time. Larger requires more memory and smaller is slower.
raw_score (bool) – Return raw int16 score or fraction score
lower_triangle (bool) – Dummy argument to force same interface for thawed and frozen matrix

Returns:	Score and number of occurrences
Return type:	(str, int)

find(query, cutoff, limit=None)[source]¶

Find similar fragments to query.

Parameters:

query (str) – Query fragment identifier
cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
limit (int) – Maximum number of hits. Default is None for no limit.

Yields:	(str, float) – Hit fragment identifier and similarity score
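A possible way to combine the constructor, find and close is sketched below, using only the signatures documented on this page. The helper `top_hits` is hypothetical, and the file path and query label must be supplied by the caller. The import is placed inside the function so the sketch can be loaded without kripodb installed.

```python
def top_hits(matrix_path, query_label, cutoff=0.5, limit=10):
    """Return up to `limit` (label, score) hits for `query_label`."""
    # Imported lazily; requires kripodb when the function is called.
    from kripodb.hdf5 import SimilarityMatrix

    matrix = SimilarityMatrix(matrix_path, mode='r', cache_labels=True)
    try:
        # find() is documented to yield (label, score) tuples.
        return list(matrix.find(query_label, cutoff, limit=limit))
    finally:
        # Release the HDF5 file even if the lookup fails.
        matrix.close()
```

Collecting the hits into a list before closing matters because find is a generator; iterating it after the file is closed would presumably fail.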

keep(other, keep)[source]¶

Copy content of self to other and only keep given fragment labels and the labels they pair with

Parameters:

other (SimilarityMatrix) – Writable matrix to fill
keep (set[str]) – Fragment labels to keep

skip(other, skip)[source]¶

Copy content of self to other and skip all given fragment labels

Parameters:

other (SimilarityMatrix) – Writable matrix to fill
skip (set[str]) – Fragment labels to skip

update(similarities_iter, label2id)[source]¶

Store pairs of fragment identifiers with their similarity score, together with the label2id lookup

Parameters:

similarities_iter (iterator) – Iterator which yields (label1, label2, similarity_score)
label2id (dict) – Dictionary with fragment label as key and fragment identifier as value.