kripodb.hdf5

Similarity matrix using hdf5 as storage backend.

class kripodb.hdf5.AbstractSimpleTable(table, append_chunk_size=100000000)[source]

Abstract wrapper around a HDF5 table

Parameters:
  • table (tables.Table) – HDF5 table
  • append_chunk_size (int) – Number of rows to append in one go. Defaults to 1e8, which requires about 2 GB of memory during append when a table row is 10 bytes.
Attributes:
  • table (tables.Table) – HDF5 table
  • append_chunk_size (int) – Number of rows to read from other table during append.
append(other)[source]

Append rows of other table to self

Parameters:other – Table of same type as self
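
The chunked append can be pictured with plain Python lists standing in for HDF5 tables. This is only an illustrative sketch: `append_in_chunks` and the list-based rows are hypothetical stand-ins, not kripodb code.

```python
def append_in_chunks(target, other, append_chunk_size=100000000):
    """Append rows of `other` to `target`, reading at most `append_chunk_size` rows at a time."""
    for start in range(0, len(other), append_chunk_size):
        # Only this slice of `other` is copied in one go.
        chunk = other[start:start + append_chunk_size]
        target.extend(chunk)
    return target


# Hypothetical rows; lists stand in for tables.Table objects.
rows_a = [(1, 2, 100), (1, 3, 200)]
rows_b = [(2, 3, 300), (2, 4, 400), (3, 4, 500)]
merged = append_in_chunks(rows_a, rows_b, append_chunk_size=2)
```

A smaller chunk size lowers peak memory use at the cost of more read and write round trips.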
class kripodb.hdf5.LabelsLookup(h5file, expectedrows=0)[source]

Table to look up label of fragment by id or id of fragment by label

When table does not exist in h5file it is created.

Parameters:
  • h5file (tables.File) – Object representing an open hdf5 file
  • expectedrows (int) – Expected number of labels to be added. Required when similarity matrix is opened in write mode; helps optimize storage
by_id(frag_id)[source]

Look up label of fragment by id

Parameters:frag_id (int) – Fragment identifier
Raises:IndexError – When id of fragment is not found
Returns:Label of fragment
Return type:str
by_label(label)[source]

Look up id of fragment by label

Parameters:label (str) – Fragment label
Raises:IndexError – When label of fragment is not found
Returns:Fragment identifier
Return type:int
by_labels(labels)[source]

Look up ids of fragments by label

Parameters:labels (set[str]) – Set of fragment labels
Raises:IndexError – When label of fragment is not found
Returns:Set of fragment identifiers
Return type:set[int]
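
The lookup semantics of by_id, by_label and by_labels, including the documented IndexError, can be mimicked with two dictionaries. The class `DictLabelsLookup` and the labels `frag_a` and `frag_b` below are hypothetical; this sketch does not touch an HDF5 file.

```python
class DictLabelsLookup:
    """Hypothetical in-memory stand-in for a labels lookup table."""

    def __init__(self, label2id):
        self.label2id = dict(label2id)
        self.id2label = {frag_id: label for label, frag_id in self.label2id.items()}

    def by_id(self, frag_id):
        try:
            return self.id2label[frag_id]
        except KeyError:
            # Mirror the documented behaviour of raising IndexError.
            raise IndexError("Fragment id {0} not found".format(frag_id))

    def by_label(self, label):
        try:
            return self.label2id[label]
        except KeyError:
            raise IndexError("Fragment label {0} not found".format(label))

    def by_labels(self, labels):
        # Raises IndexError as soon as one label is unknown.
        return {self.by_label(label) for label in labels}


lookup = DictLabelsLookup({"frag_a": 1, "frag_b": 2})
```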
keep(other, keep)[source]

Copy content of self to other and only keep given fragment identifiers

Parameters:
  • other (LabelsLookup) – Labels table to fill
  • keep (set[int]) – Fragment identifiers to keep
label2ids()[source]

Return whole table as a dictionary

Returns:Dictionary with label as key and frag_id as value.
Return type:dict
merge(label2id)[source]

Merge label2id dict into self

When a label does not exist, an id is generated and the label/id pair is added. When a label does exist, the id of the label in self is kept.

Parameters:label2id (dict) – Dictionary with fragment label as key and fragment identifier as value.
Returns:Dictionary of label/id pairs which were in label2id, but missing in self
Return type:dict
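
A rough sketch of the merge semantics with plain dictionaries follows. How new ids are generated is not stated here, so the sketch assumes they continue after the highest existing id and that the returned dictionary carries those generated ids; the function name `merge` and the labels are hypothetical.

```python
def merge(existing, label2id):
    """Add labels from `label2id` that `existing` lacks; return the added label/id pairs."""
    # Assumption: new ids continue after the highest id already present.
    next_id = max(existing.values(), default=0) + 1
    missing = {}
    for label in sorted(label2id):
        if label not in existing:
            missing[label] = next_id
            next_id += 1
    # Labels already present keep the id they had in `existing`.
    existing.update(missing)
    return missing


existing = {"frag_a": 1, "frag_b": 2}
added = merge(existing, {"frag_b": 7, "frag_c": 8})
```

Here `frag_b` keeps id 2 even though the incoming dictionary proposed 7, and only `frag_c` is reported as added.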
skip(other, skip)[source]

Copy content of self to other and skip given fragment identifiers

Parameters:
  • other (LabelsLookup) – Labels table to fill
  • skip (set[int]) – Fragment identifiers to skip
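
keep and skip are complementary filters on fragment identifiers. The sketch below shows the idea on a plain dictionary; unlike the documented methods, which fill another table, these hypothetical helpers return a new dictionary.

```python
def keep_labels(label2id, keep):
    """Return only the entries whose fragment identifier is in `keep`."""
    return {label: frag_id for label, frag_id in label2id.items() if frag_id in keep}


def skip_labels(label2id, skip):
    """Return all entries except those whose fragment identifier is in `skip`."""
    return {label: frag_id for label, frag_id in label2id.items() if frag_id not in skip}


# Hypothetical labels and identifiers.
label2id = {"frag_a": 1, "frag_b": 2, "frag_c": 3}
kept = keep_labels(label2id, keep={1, 3})
remaining = skip_labels(label2id, skip={1, 3})
```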
update(label2id)[source]

Update labels lookup by adding labels in label2id.

Parameters:label2id (dict) – Dictionary with fragment label as key and fragment identifier as value.
class kripodb.hdf5.PairsTable(h5file, expectedrows=0)[source]

Table to store similarity score of a pair of fragment fingerprints

When table does not exist in h5file it is created.

Parameters:
  • h5file (tables.File) – Object representing an open hdf5 file
  • expectedrows (int) – Expected number of pairs to be added. Required when similarity matrix is opened in write mode, helps optimize storage
score_precision

int – Similarity score is a fraction; it is converted to an int by multiplying it by the precision
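
The conversion between fraction and integer scores can be sketched as follows. The precision of 10000 is an arbitrary value chosen for illustration, and rounding to the nearest integer is an assumption; the helper names are hypothetical.

```python
EXAMPLE_PRECISION = 10000  # arbitrary illustrative value, not the value kripodb stores


def to_raw(score, precision=EXAMPLE_PRECISION):
    """Convert a fraction score to an integer score."""
    # Assumption: round to the nearest integer.
    return int(round(score * precision))


def to_fraction(raw, precision=EXAMPLE_PRECISION):
    """Convert an integer score back to a fraction score."""
    return raw / precision


raw = to_raw(0.5)
fraction = to_fraction(raw)
```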

full_matrix

bool – Matrix is filled above and below diagonal.

append(other)[source]

Append rows of other table to self

Parameters:other – Table of same type as self
count(frame_size, raw_score=False)[source]

Count occurrences of each score

Parameters:
  • frame_size (int) – Size of matrix loaded each time. Larger requires more memory and smaller is slower.
  • raw_score (bool) – If True, return the raw int16 score; otherwise return the fraction score
Returns:

Score and number of occurrences

Return type:

Tuple[(str, int)]
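
Counting in frames trades memory for speed: each frame is tallied and the tallies are summed. A minimal sketch with a list of integer scores, assuming a hypothetical precision of 10000 and a hypothetical helper `count_scores`:

```python
from collections import Counter


def count_scores(raw_scores, frame_size, precision, raw_score=False):
    """Yield (score, number of occurrences), tallying `frame_size` scores at a time."""
    counts = Counter()
    for start in range(0, len(raw_scores), frame_size):
        # Only one frame of scores is tallied per step.
        counts.update(raw_scores[start:start + frame_size])
    for raw in sorted(counts):
        score = raw if raw_score else raw / precision
        yield score, counts[raw]


raw_scores = [5000, 2500, 5000, 7500, 5000]
raw_counts = list(count_scores(raw_scores, frame_size=2, precision=10000, raw_score=True))
fraction_counts = list(count_scores(raw_scores, frame_size=2, precision=10000))
```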

find(frag_id, cutoff, limit)[source]

Find fragment hits that have a similarity score with frag_id above cutoff.

Parameters:
  • frag_id (int) – query fragment identifier
  • cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
  • limit (int) – Maximum number of hits. Default is None for no limit.
Returns:

List of tuples, where the first value is the hit fragment identifier and the second value is the similarity score

Return type:

List[Tuple]
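
The search can be sketched over a list of (id, id, score) tuples. The sketch uses fraction scores for readability, checks both columns because a pair may be stored only once, and sorts the best hits first; the sort order and the helper name `find_hits` are assumptions.

```python
def find_hits(pairs, frag_id, cutoff, limit=None):
    """Return [(hit_id, score)] for pairs containing `frag_id` with score >= cutoff."""
    hits = []
    for a, b, score in pairs:
        if score < cutoff:
            continue  # scores below the cutoff are discarded
        if a == frag_id:
            hits.append((b, score))
        elif b == frag_id:
            hits.append((a, score))
    # Assumption: highest score first.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits if limit is None else hits[:limit]


# Hypothetical pairs of fragment identifiers with their similarity score.
pairs = [(1, 2, 0.9), (1, 3, 0.4), (2, 3, 0.7), (4, 1, 0.6)]
hits = find_hits(pairs, frag_id=1, cutoff=0.5)
top_hit = find_hits(pairs, frag_id=1, cutoff=0.5, limit=1)
```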

keep(other, keep)[source]

Copy pairs from self to other and keep given fragment identifiers and the identifiers they pair with.

Parameters:
  • other (PairsTable) – Pairs table to fill
  • keep (set[int]) – Fragment identifiers to keep
Returns:

Fragment identifiers that have been copied to other

Return type:

set[int]

skip(other, skip)[source]

Copy content from self to other and skip given fragment identifiers

Parameters:
  • other (PairsTable) – Pairs table to fill
  • skip (set[int]) – Fragment identifiers to skip
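
For pairs, keep and skip act on both members of a pair. The sketch below assumes a pair is kept when either member is in keep and skipped when either member is in skip; the documentation does not spell this out, and the helper names are hypothetical.

```python
def keep_pairs(pairs, keep):
    """Return pairs touching an id in `keep`, plus all ids present in those pairs."""
    kept = [pair for pair in pairs if pair[0] in keep or pair[1] in keep]
    copied_ids = set()
    for a, b, _score in kept:
        copied_ids.update((a, b))
    return kept, copied_ids


def skip_pairs(pairs, skip):
    """Return pairs in which neither member is in `skip`."""
    return [pair for pair in pairs if pair[0] not in skip and pair[1] not in skip]


pairs = [(1, 2, 0.9), (2, 3, 0.7), (3, 4, 0.5)]
kept_pairs, copied_ids = keep_pairs(pairs, keep={1})
remaining_pairs = skip_pairs(pairs, skip={1})
```

The returned set of copied identifiers can then be used to filter the matching labels table.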
update(similarities_iter, label2id)[source]

Store pairs of fragment identifier with their similarity score

Parameters:
  • similarities_iter (Iterator) – Iterator which yields (label1, label2, similarity_score)
  • label2id (Dict) – Lookup with fragment label as key and fragment identifier as value
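
Storing pairs amounts to translating each label to its identifier and each fraction score to an integer. A minimal sketch, in which a list stands in for the table and the precision of 10000, the rounding, and the helper name `store_pairs` are all assumptions:

```python
def store_pairs(rows, similarities_iter, label2id, precision):
    """Append (id1, id2, integer score) to `rows` for each (label1, label2, score)."""
    for label1, label2, score in similarities_iter:
        rows.append((label2id[label1], label2id[label2], int(round(score * precision))))
    return rows


# Hypothetical labels and scores.
label2id = {"frag_a": 1, "frag_b": 2, "frag_c": 3}
similarities = iter([("frag_a", "frag_b", 0.5), ("frag_b", "frag_c", 0.25)])
rows = store_pairs([], similarities, label2id, precision=10000)
```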
class kripodb.hdf5.SimilarityMatrix(filename, mode='r', expectedpairrows=None, expectedlabelrows=None, cache_labels=False, **kwargs)[source]

Similarity matrix

Parameters:
  • filename (str) – File name of hdf5 file to write or read similarity matrix from
  • mode (str) – Can be ‘r’ for reading or ‘w’ for writing
  • expectedpairrows (int) – Expected number of pairs to be added. Required when similarity matrix is opened in write mode, helps optimize storage
  • expectedlabelrows (int) – Expected number of labels to be added. Required when similarity matrix is opened in write mode, helps optimize storage
  • cache_labels (bool) – Cache labels, speed up label lookups
h5file

tables.File – Object representing an open hdf5 file

pairs

PairsTable – HDF5 Table that contains pairs

labels

LabelsLookup – Table to look up label of fragment by id or id of fragment by label

append(other)[source]

Append data from other similarity matrix to self

Parameters:other (SimilarityMatrix) – Other similarity matrix
close()[source]

Closes the hdf5 file

count(frame_size, raw_score=False, lower_triangle=False)[source]

Count occurrences of each score

Parameters:
  • frame_size (int) – Size of matrix loaded each time. Larger requires more memory and smaller is slower.
  • raw_score (bool) – If True, return the raw int16 score; otherwise return the fraction score
  • lower_triangle (bool) – Dummy argument to force same interface for thawed and frozen matrix
Returns:

Score and number of occurrences

Return type:

(str, int)

find(query, cutoff, limit=None)[source]

Find similar fragments to query.

Parameters:
  • query (str) – Query fragment identifier
  • cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
  • limit (int) – Maximum number of hits. Default is None for no limit.
Yields:

(str, float) – Hit fragment identifier and similarity score
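
A label-based search can be thought of as three steps: translate the query label to an id, search the pairs, and translate the hit ids back to labels. The generator below sketches this with plain Python structures; `find_by_label`, the labels, and the best-first ordering are assumptions rather than the kripodb implementation.

```python
def find_by_label(query, cutoff, label2id, pairs, limit=None):
    """Yield (label, score) for fragments similar to the `query` label."""
    id2label = {frag_id: label for label, frag_id in label2id.items()}
    frag_id = label2id[query]
    hits = []
    for a, b, score in pairs:
        if score >= cutoff and frag_id in (a, b):
            # The hit is whichever member of the pair is not the query.
            hits.append((b if a == frag_id else a, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)  # assumption: best hits first
    for hit_id, score in hits[:limit]:  # a limit of None keeps all hits
        yield id2label[hit_id], score


label2id = {"frag_a": 1, "frag_b": 2, "frag_c": 3}
pairs = [(1, 2, 0.9), (1, 3, 0.4), (2, 3, 0.7)]
similar = list(find_by_label("frag_b", 0.5, label2id, pairs))
```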

keep(other, keep)[source]

Copy content of self to other and only keep given fragment labels and the labels they pair with

Parameters:
  • other (SimilarityMatrix) – Similarity matrix to fill
  • keep (set[str]) – Fragment labels to keep
skip(other, skip)[source]

Copy content of self to other and skip all given fragment labels

Parameters:
  • other (SimilarityMatrix) – Similarity matrix to fill
  • skip (set[str]) – Fragment labels to skip
update(similarities_iter, label2id)[source]

Store pairs of fragment identifiers with their similarity score, together with the label-to-id lookup

Parameters:
  • similarities_iter (iterator) – Iterator which yields (label1, label2, similarity_score)
  • label2id (dict) – Dictionary with fragment label as key and fragment identifier as value.