kripodb.pairs

Module handling generation and retrieval of similarity of fingerprint pairs

kripodb.pairs.dump_pairs(bitsets1, bitsets2, out_format, out_file, out, number_of_bits, mean_onbit_density, cutoff, label2id, nomemory, ignore_upper_triangle=False)

Dump pairs of bitset collection.

Each pair is a row containing the identifiers of both bitsets and their similarity score.

Parameters:
  • bitsets1 (Dict{str, pyroaring.BitMap}) – First dict of fingerprints with fingerprint label as key and pyroaring.BitMap as value
  • bitsets2 (Dict{str, pyroaring.BitMap}) – Second dict of fingerprints with fingerprint label as key and pyroaring.BitMap as value
  • out_format – ‘tsv’ or ‘hdf5’
  • out_file – Filename of output file where ‘hdf5’ format is written to.
  • out (File) – File object where ‘tsv’ format is written to.
  • number_of_bits (int) – Number of bits for all bitsets
  • mean_onbit_density (float) – Mean on bit density
  • cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
  • label2id – dict to translate label to id (string to int)
  • nomemory – If true, bitsets2 is not loaded into memory
  • ignore_upper_triangle – When true, returns only similarities where label1 > label2; when false, returns all similarities
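
The effect of cutoff and ignore_upper_triangle can be illustrated with a minimal, self-contained sketch. It does not call kripodb: plain Python sets stand in for pyroaring.BitMap, the helper names tanimoto and iter_pairs are hypothetical, and an uncorrected Tanimoto coefficient is assumed as the score (the real function also takes number_of_bits and mean_onbit_density, which this sketch ignores). Whether the real function reports self-pairs is likewise not stated above.

```python
from itertools import product


def tanimoto(a, b):
    """Plain Tanimoto coefficient of two sets of on-bit positions (assumed score)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def iter_pairs(bitsets1, bitsets2, cutoff, ignore_upper_triangle=False):
    """Yield (label1, label2, score) for pairs scoring at or above cutoff.

    Hypothetical helper, not part of kripodb.
    """
    for (label1, bits1), (label2, bits2) in product(bitsets1.items(), bitsets2.items()):
        # Keep only label1 > label2 when the upper triangle is ignored.
        if ignore_upper_triangle and label1 <= label2:
            continue
        score = tanimoto(bits1, bits2)
        if score >= cutoff:
            yield label1, label2, score


# Toy fingerprints: label -> set of on-bit positions.
fingerprints = {
    "frag_a": frozenset({1, 2, 3, 4}),
    "frag_b": frozenset({3, 4, 5, 6}),
    "frag_c": frozenset({1, 2, 3, 5}),
}

all_pairs = list(iter_pairs(fingerprints, fingerprints, cutoff=0.5))
lower_pairs = list(
    iter_pairs(fingerprints, fingerprints, cutoff=0.5, ignore_upper_triangle=True)
)
```

With these toy fingerprints, lower_pairs keeps a single row per unordered pair of distinct labels, while all_pairs also contains the mirrored rows and the self-pairs.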
kripodb.pairs.dump_pairs_hdf5(similarities_iter, label2id, expectedrows, out_file)

Dump pairs in hdf5 file

Pro:
  • very small, 10 bytes for each pair + compression

Con:
  • requires hdf5 library to access

Parameters:
  • similarities_iter (Iterator) – Iterator with tuple with fingerprint 1 label, fingerprint 2 label, similarity as members
  • label2id (dict) – dict to translate label to id (string to int)
  • expectedrows
  • out_file
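
To show how a pair can fit in 10 bytes once labels are translated to integer ids through label2id, here is a rough sketch using the standard struct module. The record layout (two unsigned 32-bit ids plus an unsigned 16-bit score) and the score scaling are assumptions made for illustration, not a description of the actual hdf5 table; pack_pair and unpack_pair are hypothetical helpers.

```python
import struct

# Assumed layout: two unsigned 32-bit ids and one unsigned 16-bit score = 10 bytes.
PAIR = struct.Struct("<IIH")
SCORE_SCALE = 65535  # assumed mapping of a 0..1 score onto a 16-bit integer


def pack_pair(label1, label2, score, label2id):
    """Translate labels to ids and pack the pair into a fixed-size record."""
    return PAIR.pack(label2id[label1], label2id[label2], round(score * SCORE_SCALE))


def unpack_pair(record, id2label):
    """Reverse of pack_pair; the score loses a little precision."""
    id1, id2, raw = PAIR.unpack(record)
    return id2label[id1], id2label[id2], raw / SCORE_SCALE


label2id = {"frag_a": 1, "frag_b": 2}
id2label = {ident: label for label, ident in label2id.items()}

record = pack_pair("frag_a", "frag_b", 0.75, label2id)
restored = unpack_pair(record, id2label)
```

Storing integer ids instead of label strings is what keeps each record small; the trade-off is that a reader needs the label-to-id mapping to interpret the rows.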
kripodb.pairs.dump_pairs_tsv(similarities_iter, out)

Dump pairs in tab delimited file

Pro:
  • when stored in sqlite can be used outside of Python

Con:
  • big, unless output is compressed

Parameters:
  • similarities_iter (Iterator) – Iterator with tuple with fingerprint 1 label, fingerprint 2 label, similarity as members
  • out (File) – Writeable file
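
As an illustration of the tab-delimited output, the following sketch writes an iterator of (label 1, label 2, similarity) tuples to a writeable file object. It is a stand-in built on the standard csv module, not the kripodb implementation; the function name write_pairs_tsv and the five-decimal score formatting are assumptions.

```python
import csv
import io


def write_pairs_tsv(similarities_iter, out):
    """Write one tab-separated row per pair (hypothetical helper)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for label1, label2, score in similarities_iter:
        # The number of decimals is an arbitrary choice for this sketch.
        writer.writerow([label1, label2, f"{score:.5f}"])


buffer = io.StringIO()
write_pairs_tsv([("frag_a", "frag_b", 0.6), ("frag_a", "frag_c", 0.45)], buffer)
tsv_text = buffer.getvalue()
```

Any object with a write method works as out, so the same function can target an opened file or a compressed stream such as the one returned by gzip.open in text mode.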
kripodb.pairs.merge(ins, out)

Concatenate similarity matrix files into a single one.

Parameters:
  • ins (list[str]) – List of input similarity matrix filenames
  • out (str) – Output similarity matrix filename
Raises:

AssertionError – When the number of labels in the input files is not the same

kripodb.pairs.open_similarity_matrix(fn)

Open read-only similarity matrix file.

Parameters: fn (str) – Filename of similarity matrix
Returns: A read-only similarity matrix object
Return type: SimilarityMatrix | FrozenSimilarityMatrix
kripodb.pairs.similar(query, similarity_matrix, cutoff, limit=None)

Find similar fragments to query based on similarity matrix.

Parameters:
  • query (str) – Query fragment identifier
  • similarity_matrix (kripodb.db.SimilarityMatrix) – Similarity matrix
  • cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
  • limit (int) – Maximum number of hits. Default is None for no limit.
Yields:

Tuple[(str, str, float)] – Tuples of (query fragment identifier, hit fragment identifier, similarity score), sorted on similarity score
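
The cutoff and limit semantics can be sketched on an in-memory list of pairs. This is only an illustration: find_similar is a hypothetical helper rather than the kripodb function, a plain list of tuples stands in for the similarity matrix, and descending order is assumed for "sorted on similarity score".

```python
from itertools import islice


def find_similar(query, pairs, cutoff, limit=None):
    """Yield (query, hit, score) tuples, best score first (hypothetical helper)."""
    hits = []
    for label1, label2, score in pairs:
        if score < cutoff:
            continue  # scores below the cutoff are discarded
        # The query may appear on either side of a stored pair.
        if label1 == query:
            hits.append((query, label2, score))
        elif label2 == query:
            hits.append((query, label1, score))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    # islice with a stop of None yields every hit, i.e. no limit.
    yield from islice(hits, limit)


pairs = [
    ("frag_a", "frag_b", 0.6),
    ("frag_c", "frag_a", 0.9),
    ("frag_b", "frag_c", 0.8),
    ("frag_a", "frag_d", 0.3),
]

hits = list(find_similar("frag_a", pairs, cutoff=0.5))
top_hit = list(find_similar("frag_a", pairs, cutoff=0.5, limit=1))
```

Because the result is a generator, a caller can stop consuming hits early without building the full output list.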

kripodb.pairs.similar_run(query, pairsdbfn, cutoff, out)

Find similar fragments to query based on similarity matrix and write to tab delimited file.

Parameters:
  • query (str) – Query fragment identifier
  • pairsdbfn (str) – Filename of similarity matrix file or url of kripodb webservice
  • cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
  • out (File) – File object to write output to
kripodb.pairs.similarity2query(bitsets2, query, out, mean_onbit_density, cutoff, memory)

Calculate similarity of query against all fingerprints in bitsets2 and write to tab delimited file.

Parameters:
  • bitsets2 (kripodb.db.IntbitsetDict) – Fingerprints to compare the query against
  • query (str) – Query identifier or beginning of it
  • out (File) – File object to write output to
  • mean_onbit_density (float) – Mean on bit density
  • cutoff (float) – Cutoff, similarity scores below cutoff are discarded.
  • memory (Optional[bool]) – When true, bitsets2 is loaded into memory; when false, it is not
kripodb.pairs.total_number_of_pairs(fingerprint_filenames)

Count number of pairs in similarity matrix files

Parameters: fingerprint_filenames (list[str]) – List of file names of similarity matrices
Returns: Total number of pairs
Return type: int