kripodb.frozen

Similarity matrix using pytables carray

class kripodb.frozen.FrozenSimilarityMatrix(filename, mode='r', **kwargs)[source]

Frozen similarities matrix

Can retrieve all columns of a specific row fairly quickly. Stored as a compressed dense matrix. Due to compression, the zeros take up little space.

Warning! Can not be enlarged.
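To illustrate why zeros cost little in a compressed dense matrix, here is a minimal sketch using only the standard library. It is not kripodb code: `zlib` stands in for whatever compressor the HDF5 file uses, and the matrix size, the helper `dense_bytes` and the scores are made up for the example.

```python
import array
import zlib


def dense_bytes(n, nonzero):
    """Return an n x n dense matrix of signed 16-bit scores as raw bytes.

    `nonzero` maps (row, column) to a raw score; every other cell is zero.
    """
    cells = array.array('h', [0]) * (n * n)  # zero-filled, row-major
    for (row, col), score in nonzero.items():
        cells[row * n + col] = score
    return cells.tobytes()


n = 500
# Made-up sparse content: one non-zero score in every fifth row.
nonzero = {(i, (i * 7 + 3) % n): 30000 + i for i in range(0, n, 5)}
raw = dense_bytes(n, nonzero)
packed = zlib.compress(raw)
print(f"dense: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

The long runs of zero bytes compress to almost nothing, so the stored size is dominated by the non-zero scores.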

Comparison of find performance of FrozenSimilarityMatrix with SimilarityMatrix:

>>> from kripodb.db import FragmentsDb
>>> db = FragmentsDb('data/feb2016/Kripo20151223.sqlite')
>>> ids = [v[0] for v in db.cursor.execute('SELECT frag_id FROM fragments ORDER BY RANDOM() LIMIT 20')]
>>> from kripodb.frozen import FrozenSimilarityMatrix
>>> fdm = FrozenSimilarityMatrix('01-01_to_13-13.out.frozen.blosczlib.h5')
>>> from kripodb.hdf5 import SimilarityMatrix
>>> dm = SimilarityMatrix('data/feb2016/01-01_to_13-13.out.h5', cache_labels=True)
>>> %timeit list(dm.find(ids[0], 0.45, None))
... 1 loop, best of 3: 1.96 s per loop
>>> %timeit list(fdm.find(ids[0], 0.45, None))
... The slowest run took 6.21 times longer than the fastest. This could mean that an intermediate result is being cached.
... 10 loops, best of 3: 19.3 ms per loop
>>> ids = [v[0] for v in db.cursor.execute('SELECT frag_id FROM fragments ORDER BY RANDOM() LIMIT 20')]
>>> %timeit -n1 [list(fdm.find(v, 0.45, None)) for v in ids]
... 1 loop, best of 3: 677 ms per loop
>>> %timeit -n1 [list(dm.find(v, 0.45, None)) for v in ids]
... 1 loop, best of 3: 29.7 s per loop

Parameters:
  • filename (str) – File name of hdf5 file to write or read similarity matrix from
  • mode (str) – Can be ‘r’ for reading or ‘w’ for writing
  • **kwargs – Passed through to tables.open_file()
h5file

tables.File – Object representing an open hdf5 file

scores

tables.CArray – HDF5 array that contains the matrix

labels

tables.CArray – Table to look up label of fragment by id or id of fragment by label

close()[source]

Closes the hdf5 file

count(frame_size=None, raw_score=False, lower_triangle=False)[source]

Count occurrences of each score

Only scores in the upper triangle or the lower triangle are counted. Zero scores are skipped.

Parameters:
  • frame_size (int) – Dummy argument to force same interface for thawed and frozen matrix
  • raw_score (bool) – When true, return the raw int16 score, else return the fraction score
  • lower_triangle (bool) – When true, return scores from the lower triangle, else return scores from the upper triangle
Returns:

Score and number of occurrences

Return type:

Tuple[(str, int)]
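The counting described above can be sketched on a plain list of lists. This is an illustration, not the library's implementation: the function name `count_scores` and the `precision` divisor used to turn a raw score into a fraction are hypothetical, and excluding the diagonal is an assumption.

```python
from collections import Counter


def count_scores(matrix, precision, raw_score=False, lower_triangle=False):
    """Count non-zero scores in one triangle of a square matrix of raw scores.

    `precision` is a hypothetical divisor that converts a raw score to a fraction.
    Returns a sorted list of (score, number of occurrences) tuples.
    """
    counts = Counter()
    for i, row in enumerate(matrix):
        # Cells left of the diagonal are the lower triangle, cells right of it the upper.
        cells = row[:i] if lower_triangle else row[i + 1:]
        for raw in cells:
            if raw == 0:
                continue  # zero scores are skipped
            counts[raw if raw_score else raw / precision] += 1
    return sorted(counts.items())


matrix = [
    [0, 45, 0, 60],
    [45, 0, 60, 0],
    [0, 60, 0, 45],
    [60, 0, 45, 0],
]
print(count_scores(matrix, precision=100, raw_score=True))
```

For a symmetric matrix both triangles give the same counts, which is why counting only one of them avoids double counting.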

find(query, cutoff, limit=None)[source]

Find similar fragments to query.

Parameters:
  • query (str) – Query fragment identifier
  • cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
  • limit (int) – Maximum number of hits. Default is None for no limit.
Returns:

Hit fragment identifier and similarity score

Return type:

list[tuple[str,float]]
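A lookup of this kind on a dense matrix amounts to reading one row and filtering it. The sketch below works on in-memory lists rather than an HDF5 file; the function `find_in_dense`, the labels and the scores are made up, and both the best-first ordering and the exclusion of the query itself are assumptions.

```python
def find_in_dense(scores, labels, query, cutoff, limit=None):
    """Return (label, score) hits for `query` with score >= cutoff, best first."""
    row = scores[labels.index(query)]  # a single row holds every score for the query
    hits = [
        (label, score)
        for label, score in zip(labels, row)
        if label != query and score >= cutoff  # scores below the cutoff are discarded
    ]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits if limit is None else hits[:limit]


labels = ['frag_a', 'frag_b', 'frag_c', 'frag_d']  # hypothetical identifiers
scores = [
    [1.0, 0.5, 0.0, 0.7],
    [0.5, 1.0, 0.46, 0.0],
    [0.0, 0.46, 1.0, 0.3],
    [0.7, 0.0, 0.3, 1.0],
]
print(find_in_dense(scores, labels, 'frag_a', 0.45))
print(find_in_dense(scores, labels, 'frag_a', 0.45, limit=1))
```

Because only one row is read, the cost depends on the number of labels and not on the total number of stored pairs.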

from_array(data, labels)[source]

Fill matrix from 2 dimensional array

Parameters:
  • data (np.array) – 2 dimensional square array with scores
  • labels (list) – List of labels for each column and row index
from_pairs(similarity_matrix, frame_size, limit=None, single_sided=False)[source]

Fills self with matrix which is stored in pairs.

Also known as COOrdinate format, the ‘ijv’ or ‘triplet’ format.

Parameters:
  • similarity_matrix (kripodb.hdf5.SimilarityMatrix) –
  • frame_size (int) – Number of pairs to append in a single go
  • limit (int|None) – Number of pairs to add, None for no limit, default is None.
  • single_sided (bool) – If false add stored direction and reverse direction. Default is False.

time kripodb similarities freeze --limit 200000 -f 100000 data/feb2016/01-01_to_13-13.out.h5 percell.h5
47.2s
time kripodb similarities freeze --limit 200000 -f 100000 data/feb2016/01-01_to_13-13.out.h5 coo.h5
0.2m - 2m6s
.4m - 2m19s
.8m - 2m33s
1.6m - 2m48s
3.2m - 3m4s
6.4m - 3m50s
12.8m - 4m59s
25.6m - 7m27s
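Filling a dense matrix from pairs in frames can be sketched as follows. This is an in-memory illustration, not the library's code: the function `dense_from_pairs` is hypothetical, and the pairs are assumed to be (label, label, score) triplets.

```python
from itertools import islice


def dense_from_pairs(pairs, labels, frame_size, limit=None, single_sided=False):
    """Fill a dense square matrix from (label_a, label_b, score) triplets."""
    index = {label: i for i, label in enumerate(labels)}
    dense = [[0] * len(labels) for _ in labels]
    stream = iter(pairs if limit is None else islice(pairs, limit))
    while True:
        frame = list(islice(stream, frame_size))  # take `frame_size` pairs in a single go
        if not frame:
            break
        for a, b, score in frame:
            i, j = index[a], index[b]
            dense[i][j] = score  # stored direction
            if not single_sided:
                dense[j][i] = score  # reverse direction
    return dense


pairs = [('a', 'b', 5), ('b', 'c', 7)]
print(dense_from_pairs(pairs, ['a', 'b', 'c'], frame_size=1))
```

A larger frame means fewer, bigger batches, so it trades memory for fewer passes over the pair source.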

to_pairs(pairs)[source]

Copies labels and scores from self to pairs matrix.

Parameters: pairs (SimilarityMatrix) –
to_pandas()[source]

Pandas dataframe with labelled columns and rows.

Warning! Only use on matrices that fit in memory

Returns: pd.DataFrame
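Whether a matrix fits in memory can be estimated up front from the number of labels. The helper below is a rough sketch: `dense_size_bytes` is a hypothetical name, the default of 8 bytes per score assumes 64-bit floats, and the overhead of the row and column labels is ignored.

```python
def dense_size_bytes(n_labels, bytes_per_score=8):
    """Bytes needed for the values of an n x n dense matrix (labels not included)."""
    return n_labels * n_labels * bytes_per_score


for n in (1_000, 100_000):
    print(f"{n} labels: about {dense_size_bytes(n) / 1e9:g} GB")
```

The size grows with the square of the number of labels, so a tenfold increase in fragments needs a hundred times more memory.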