Similarity matrix using hdf5 as storage backend.
(table, append_chunk_size=100000000) Abstract wrapper around an HDF5 table
Parameters: - table (tables.Table) – HDF5 table
- append_chunk_size (int) – Number of rows to append in one go. Defaults to 1e8, which requires about 2 GB of memory during an append when a table row is 10 bytes.
Attributes: - table (tables.Table) – HDF5 table
- append_chunk_size (int) – Number of rows to read from the other table during append.
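The chunked append described above can be sketched in plain Python. This is an illustration only: it uses lists instead of HDF5 tables, and the names chunk_ranges and append_in_chunks are hypothetical, not part of this module.

```python
def chunk_ranges(nrows, chunk_size):
    """Yield (start, stop) row ranges that together cover nrows rows."""
    for start in range(0, nrows, chunk_size):
        yield start, min(start + chunk_size, nrows)


def append_in_chunks(source_rows, target_rows, chunk_size):
    """Append source_rows to target_rows one chunk at a time."""
    for start, stop in chunk_ranges(len(source_rows), chunk_size):
        # Only this slice of the source is copied per step, so peak
        # memory is bounded by chunk_size rather than the table size.
        target_rows.extend(source_rows[start:stop])
    return target_rows


ranges = list(chunk_ranges(10, 4))
merged = append_in_chunks([4, 5, 6, 7, 8], [1, 2, 3], chunk_size=2)
```

A smaller chunk size lowers peak memory at the cost of more read and write calls.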
(h5file, expectedrows=0) Table to look up label of fragment by id or id of fragment by label
When table does not exist in h5file it is created.
Parameters: - h5file (tables.File) – Object representing an open hdf5 file
- expectedrows (int) – Expected number of pairs to be added. Required when similarity matrix is opened in write mode, helps optimize storage
(frag_id) Look up label of fragment by id
Parameters: frag_id (int) – Fragment identifier
Raises: IndexError – When id of fragment is not found
Returns: Label of fragment
Return type: str
(label) Look up id of fragment by label
Parameters: label (str) – Fragment label
Raises: IndexError – When label of fragment is not found
Returns: Fragment identifier
Return type: int
(labels) Look up ids of fragments by label
Parameters: labels (set[str]) – Set of fragment labels
Raises: IndexError – When label of fragment is not found
Returns: Set of fragment identifiers
Return type: set[int]
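The three lookups and their IndexError behaviour can be sketched with two dictionaries. The class and method names below (LookupSketch, label_of, id_of, ids_of) are hypothetical stand-ins chosen for illustration; the real table lives in an HDF5 file.

```python
class LookupSketch:
    """In-memory stand-in for a label <-> fragment id lookup table."""

    def __init__(self, label2id):
        self._label2id = dict(label2id)
        self._id2label = {v: k for k, v in self._label2id.items()}

    def label_of(self, frag_id):
        try:
            return self._id2label[frag_id]
        except KeyError:
            # Mirror the documented behaviour: unknown id -> IndexError
            raise IndexError("Fragment id not found: {0}".format(frag_id))

    def id_of(self, label):
        try:
            return self._label2id[label]
        except KeyError:
            raise IndexError("Fragment label not found: {0}".format(label))

    def ids_of(self, labels):
        # Fails on the first unknown label, like the single lookup.
        return {self.id_of(label) for label in labels}


lookup = LookupSketch({"frag_a": 1, "frag_b": 2})
```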
(other, keep) Copy content of self to other and only keep given fragment identifiers
Parameters: - other (LabelsLookup) – Labels table to fill
- keep (set[int]) – Fragment identifiers to keep
() Return whole table as a dictionary
Returns: Dictionary with label as key and frag_id as value. Return type: dict
(label2id) Merge label2id dict into self
When a label does not exist, an id is generated and the label/id pair is added. When a label does exist, the id of the label in self is kept.
Parameters: label2id (dict) – Dictionary with fragment label as key and fragment identifier as value.
Returns: Dictionary of label/id pairs which were in label2id, but missing in self
Return type: dict
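A minimal sketch of the merge rule, using a plain dict in place of the table. The function name merge_labels is hypothetical, and two details are assumptions not stated above: new ids continue from the highest existing id, and the returned dictionary carries the newly generated ids.

```python
def merge_labels(own, label2id):
    """Merge label2id into own; return the labels that had to be added."""
    missing = {}
    # Assumption: generated ids start just above the highest existing id.
    next_id = max(own.values(), default=0) + 1
    for label in sorted(label2id):
        if label in own:
            continue  # label exists, so the id already in own is kept
        own[label] = next_id
        missing[label] = next_id
        next_id += 1
    return missing


own = {"frag_a": 1, "frag_b": 2}
added = merge_labels(own, {"frag_b": 7, "frag_c": 9})
```

Here frag_b keeps its own id and frag_c receives a freshly generated one; the incoming ids 7 and 9 are ignored.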
(other, skip) Copy content of self to other and skip given fragment identifiers
Parameters: - other (LabelsLookup) – Labels table to fill
- skip (set[int]) – Fragment identifiers to skip
(h5file, expectedrows=0) Table to store similarity score of a pair of fragment fingerprints
When table does not exist in h5file it is created.
Parameters: - h5file (tables.File) – Object representing an open hdf5 file
- expectedrows (int) – Expected number of pairs to be added. Required when similarity matrix is opened in write mode, helps optimize storage
int – Similarity score is a fraction; it is converted to an int by multiplying it by the precision
bool – Whether the matrix is filled both above and below the diagonal.
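The score conversion can be illustrated as follows. The helper names and the precision of 100 are hypothetical, and rounding to the nearest integer is an assumption; the stored precision may differ.

```python
def score_to_int(score, precision):
    """Convert a fractional score to its stored integer form."""
    # Rounding (an assumption) avoids off-by-one results caused by
    # floating point noise such as 0.29 * 100 == 28.999999999999996.
    return int(round(score * precision))


def int_to_score(raw, precision):
    """Convert a stored integer back to a fractional score."""
    return raw / precision


raw = score_to_int(0.45, precision=100)
restored = int_to_score(raw, precision=100)
```

Storing integers keeps rows compact, at the cost of limiting score resolution to 1/precision.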
(other) Append rows of other table to self
Parameters: other – Table of same type as self
(frame_size, raw_score=False) Count occurrences of each score
Parameters: Returns: Score and number of occurrences
Return type:
(frag_id, cutoff, limit) Find fragment hits that have a similarity score with frag_id above cutoff.
Parameters: Returns: Tuples where the first value is the hit fragment identifier and the second value is the similarity score
Return type: List[Tuple]
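A sketch of such a hit search over an in-memory list of (frag_id1, frag_id2, score) tuples. The function name find_hits is hypothetical, and three choices are assumptions: the cutoff is treated as exclusive, hits are ordered best first, and limit=None means no limit.

```python
def find_hits(pairs, frag_id, cutoff, limit=None):
    """Return (hit_id, score) tuples for pairs that involve frag_id."""
    hits = []
    for frag_id1, frag_id2, score in pairs:
        if score <= cutoff:
            continue  # keep only scores strictly above the cutoff
        # When only one half of the matrix is filled, the query
        # fragment can appear in either column.
        if frag_id1 == frag_id:
            hits.append((frag_id2, score))
        elif frag_id2 == frag_id:
            hits.append((frag_id1, score))
    # Assumption: best hits first, then truncate to the limit.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits if limit is None else hits[:limit]


pairs = [(1, 2, 0.9), (1, 3, 0.4), (4, 1, 0.7), (2, 3, 0.8)]
all_hits = find_hits(pairs, frag_id=1, cutoff=0.5)
top_hit = find_hits(pairs, frag_id=1, cutoff=0.5, limit=1)
```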
(other, keep) Copy pairs from self to other and keep given fragment identifiers and the identifiers they pair with.
Parameters: - other (PairsTable) – Pairs table to fill
- keep (set[int]) – Fragment identifiers to keep
Returns: Fragment identifiers that have been copied to other
Return type:
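The keep rule can be sketched on a list of pairs. The function name keep_pairs is hypothetical, and it returns the copied pairs alongside the identifiers only so that the example is self-contained.

```python
def keep_pairs(pairs, keep):
    """Copy pairs in which at least one fragment id is in keep."""
    copied = []
    copied_ids = set()
    for frag_id1, frag_id2, score in pairs:
        if frag_id1 in keep or frag_id2 in keep:
            copied.append((frag_id1, frag_id2, score))
            # The partner id is recorded too, so its label can later
            # be copied to the labels table of the target.
            copied_ids.update((frag_id1, frag_id2))
    return copied, copied_ids


pairs = [(1, 2, 0.9), (3, 4, 0.8), (2, 5, 0.7)]
copied, copied_ids = keep_pairs(pairs, keep={2})
```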
(other, skip) Copy content from self to other and skip given fragment identifiers
Parameters: - other (PairsTable) – Pairs table to fill
- skip (set[int]) – Fragment identifiers to skip
(filename, mode='r', expectedpairrows=None, expectedlabelrows=None, cache_labels=False, **kwargs) Similarity matrix
Parameters: - filename (str) – File name of hdf5 file to write or read similarity matrix from
- mode (str) – Can be ‘r’ for reading or ‘w’ for writing
- expectedpairrows (int) – Expected number of pairs to be added. Required when similarity matrix is opened in write mode, helps optimize storage
- expectedlabelrows (int) – Expected number of labels to be added. Required when similarity matrix is opened in write mode, helps optimize storage
- cache_labels (bool) – Cache labels, speed up label lookups
tables.File – Object representing an open hdf5 file
PairsTable – HDF5 Table that contains pairs
LabelsLookup – Table to look up label of fragment by id or id of fragment by label
(other) Append data from other similarity matrix to self
Parameters: other (SimilarityMatrix) – Other similarity matrix
(frame_size, raw_score=False, lower_triangle=False) Count occurrences of each score
Parameters: Returns: Score and number of occurrences
Return type:
(query, cutoff, limit=None) Find similar fragments to query.
Parameters: Yields: (str, float) – Hit fragment identifier and similarity score
(other, keep) Copy content of self to other and only keep given fragment labels and the labels they pair with
Parameters: - other (SimilarityMatrix) – Writable matrix to fill
- keep (set[str]) – Fragment labels to keep
(other, skip) Copy content of self to other and skip all given fragment labels
Parameters: - other (SimilarityMatrix) – Writable matrix to fill
- skip (set[str]) – Fragment labels to skip
(similarities_iter, label2id) Store pairs of fragment identifiers with their similarity score, together with the label-to-id lookup
Parameters: - similarities_iter (iterator) – Iterator which yields (label1, label2, similarity_score)
- label2id (dict) – Dictionary with fragment label as key and fragment identifier as value.
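How the two arguments combine can be sketched as below: each (label1, label2, similarity_score) triple is translated to fragment identifiers through label2id before it is stored. The function name rows_from_similarities, the precision of 100 and the integer rounding are assumptions for illustration; rows are collected in a list instead of being written to HDF5.

```python
def rows_from_similarities(similarities_iter, label2id, precision=100):
    """Translate labelled similarities into (id1, id2, raw_score) rows."""
    rows = []
    for label1, label2, score in similarities_iter:
        rows.append((
            label2id[label1],               # id of the first fragment
            label2id[label2],               # id of the second fragment
            int(round(score * precision)),  # score as an integer
        ))
    return rows


label2id = {"frag_a": 1, "frag_b": 2, "frag_c": 3}
similarities = iter([("frag_a", "frag_b", 0.5), ("frag_b", "frag_c", 0.75)])
rows = rows_from_similarities(similarities, label2id)
```

Because the input is an iterator, the similarities are consumed once and never need to be held in memory all at the same time.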