kripodb.frozen

Similarity matrix using a PyTables CArray

class kripodb.frozen.FrozenSimilarityMatrix(filename, mode='r', **kwargs)

Frozen similarities matrix

Can retrieve all columns of a specific row fairly quickly. The scores are stored as a compressed dense matrix; because of the compression, the zeros take up little space.

Warning! Cannot be enlarged.

Comparison of find performance between FrozenSimilarityMatrix and SimilarityMatrix:

>>> from kripodb.db import FragmentsDb
>>> db = FragmentsDb('data/feb2016/Kripo20151223.sqlite')
>>> ids = [v[0] for v in db.cursor.execute('SELECT frag_id FROM fragments ORDER BY RANDOM() LIMIT 20')]
>>> from kripodb.frozen import FrozenSimilarityMatrix
>>> fdm = FrozenSimilarityMatrix('01-01_to_13-13.out.frozen.blosczlib.h5')
>>> from kripodb.hdf5 import SimilarityMatrix
>>> dm = SimilarityMatrix('data/feb2016/01-01_to_13-13.out.h5', cache_labels=True)
>>> %timeit list(dm.find(ids[0], 0.45, None))
... 1 loop, best of 3: 1.96 s per loop
>>> %timeit list(fdm.find(ids[0], 0.45, None))
... The slowest run took 6.21 times longer than the fastest. This could mean that an intermediate result is being cached.
... 10 loops, best of 3: 19.3 ms per loop
>>> ids = [v[0] for v in db.cursor.execute('SELECT frag_id FROM fragments ORDER BY RANDOM() LIMIT 20')]
>>> %timeit -n1 [list(fdm.find(v, 0.45, None)) for v in ids]
... 1 loop, best of 3: 677 ms per loop
>>> %timeit -n1 [list(dm.find(v, 0.45, None)) for v in ids]
... 1 loop, best of 3: 29.7 s per loop

Parameters:
    filename (str) – File name of the hdf5 file to write the similarity matrix to or read it from
    mode (str) – Can be 'r' for reading or 'w' for writing
    **kwargs – Passed through to tables.open_file()

h5file: tables.File – Object representing an open hdf5 file

scores: tables.CArray – HDF5 array that contains the matrix

labels: tables.CArray – Array to look up the label of a fragment by id, or the id of a fragment by label

close(): Closes the hdf5 file

count(frame_size=None, raw_score=False, lower_triangle=False)

Count occurrences of each score

Only scores in one triangle of the matrix are counted, either the upper or the lower. Zero scores are skipped.

Parameters:
    frame_size (int) – Dummy argument to force the same interface for thawed and frozen matrix
    raw_score (bool) – When true, return raw int16 scores, otherwise return fraction scores
    lower_triangle (bool) – When true, return scores from the lower triangle, otherwise return scores from the upper triangle
Returns:	Score and number of occurrences
Return type:	Tuple[(str, int)]
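
To make the counting rule concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not kripodb's implementation: the helper name count_scores is hypothetical, and treating the diagonal as belonging to neither triangle is an assumption made for illustration.

```python
from collections import Counter


def count_scores(matrix, lower_triangle=False):
    """Count non-zero scores in one triangle of a square matrix.

    Hypothetical helper, not part of kripodb. Assumes the diagonal
    belongs to neither triangle.
    """
    counts = Counter()
    for i, row in enumerate(matrix):
        for j, score in enumerate(row):
            in_triangle = j < i if lower_triangle else j > i
            if in_triangle and score != 0:
                counts[score] += 1
    # Sorted list of (score, number of occurrences) pairs
    return sorted(counts.items())


# Small symmetric matrix of made-up raw scores
scores = [
    [0, 5, 0, 5],
    [5, 0, 7, 0],
    [0, 7, 0, 5],
    [5, 0, 5, 0],
]
upper = count_scores(scores)
lower = count_scores(scores, lower_triangle=True)
```

For a symmetric matrix both triangles give the same counts, which is why counting a single triangle avoids counting every pair twice.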

find(query, cutoff, limit=None)

Find fragments similar to the query.

Parameters:
    query (str) – Query fragment identifier
    cutoff (float) – Cutoff; similarity scores below the cutoff are discarded
    limit (int) – Maximum number of hits. Default is None for no limit.
Returns:	Hit fragment identifier and similarity score
Return type:	list[tuple[str,float]]
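
The benchmark near the top of this page calls find once per query. A small sketch of that pattern follows, written against any object that exposes find(query, cutoff, limit) as documented here. The helper name find_many and the stand-in class FakeMatrix are hypothetical; the stand-in exists only so the sketch runs without an HDF5 file, and its sorting of hits is an assumption, not documented behaviour.

```python
def find_many(matrix, queries, cutoff, limit=None):
    """Run matrix.find() for each query and collect the hits per query.

    Hypothetical helper; `matrix` is any object with the documented
    find(query, cutoff, limit) method, e.g. a FrozenSimilarityMatrix.
    """
    return {q: list(matrix.find(q, cutoff, limit)) for q in queries}


class FakeMatrix:
    """Stand-in for a real matrix (not kripodb), backed by a dict."""

    def __init__(self, scores):
        self._scores = scores  # {query: {hit: score}}

    def find(self, query, cutoff, limit=None):
        # Discard scores below the cutoff, as the documentation describes
        hits = [(h, s) for h, s in self._scores[query].items() if s >= cutoff]
        # Assumption of this stand-in: best hits first
        hits.sort(key=lambda pair: pair[1], reverse=True)
        return hits[:limit]


fake = FakeMatrix({
    "frag_a": {"frag_b": 0.9, "frag_c": 0.3},
    "frag_b": {"frag_a": 0.9},
})
hits = find_many(fake, ["frag_a", "frag_b"], cutoff=0.45)
```

With a real matrix, the same call would look like find_many(fdm, ids, 0.45), using the fdm and ids names from the benchmark above.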

from_array(data, labels)

Fill matrix from a 2-dimensional array

Parameters:
    data (np.array) – 2-dimensional square array with scores
    labels (list) – List of labels for each column and row index

from_pairs(similarity_matrix, frame_size, limit=None, single_sided=False)

Fills self with a matrix that is stored as pairs.

The pairs layout is also known as COOrdinate format, the 'ijv' or 'triplet' format.

Parameters:
    similarity_matrix (kripodb.hdf5.SimilarityMatrix) –
    frame_size (int) – Number of pairs to append in a single go
    limit (int|None) – Number of pairs to add, None for no limit. Default is None.
    single_sided (bool) – If false, add both the stored direction and the reverse direction. Default is False.

time kripodb similarities freeze --limit 200000 -f 100000 data/feb2016/01-01_to_13-13.out.h5 percell.h5
47.2s
time kripodb similarities freeze --limit 200000 -f 100000 data/feb2016/01-01_to_13-13.out.h5 coo.h5
0.2m - 2m6s
.4m - 2m19s
.8m - 2m33s
1.6m - 2m48s
3.2m - 3m4s
6.4m - 3m50s
12.8m - 4m59s
25.6m - 7m27s
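
The pair (COO) layout and the effect of single_sided can be illustrated with a minimal pure-Python sketch. It does not show how kripodb fills the HDF5 array; the helper name pairs_to_dense and the sample labels and scores are made up for illustration.

```python
def pairs_to_dense(pairs, labels, single_sided=False):
    """Expand (row_label, col_label, score) triplets into a dense square matrix.

    Hypothetical helper, not part of kripodb. Cells without a pair stay 0.
    """
    index = {label: i for i, label in enumerate(labels)}
    dense = [[0] * len(labels) for _ in labels]
    for a, b, score in pairs:
        dense[index[a]][index[b]] = score
        if not single_sided:
            # Also store the reverse direction so the matrix is symmetric
            dense[index[b]][index[a]] = score
    return dense


labels = ["frag1", "frag2", "frag3"]
pairs = [("frag1", "frag2", 0.5), ("frag2", "frag3", 0.8)]
both = pairs_to_dense(pairs, labels)
one = pairs_to_dense(pairs, labels, single_sided=True)
```

Here both is symmetric, while one holds only the stored direction of each pair. A square array and label list of this shape, converted to a NumPy array, match the inputs that from_array describes.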

to_pairs(pairs)

Copies labels and scores from self to pairs matrix.

Parameters:	pairs (SimilarityMatrix) –

to_pandas()

Pandas dataframe with labelled columns and rows.

Warning! Only use on matrices that fit in memory

Returns:	pd.DataFrame
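
Whether a matrix fits in memory can be estimated before calling to_pandas. The sketch below is back-of-the-envelope arithmetic only: the helper name dense_size_bytes is hypothetical, and the default of 8 bytes per cell (one float64 value) is an assumption, since the dtype of the returned frame is not stated here.

```python
def dense_size_bytes(n_labels, bytes_per_cell=8):
    """Rough size of an n x n dense matrix, ignoring index overhead.

    Hypothetical helper; 8 bytes per cell assumes float64 values.
    """
    return n_labels * n_labels * bytes_per_cell


size = dense_size_bytes(10_000)
size_gib = size / 2**30
```

For 10,000 labels this comes to 800,000,000 bytes, roughly 0.75 GiB. The size grows with the square of the number of labels, so doubling the labels quadruples the memory needed.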