DTI 1a: Ducking BindingDB
Introduction to drug-target interaction with BindingDB and DuckDB.
Introduction
Drug-target interaction, or DTI, refers to the binding of a drug (a chemical) to a target (typically, a protein). Beneficial binding between a drug and a target will either up-regulate favorable behavior or down-regulate unfavorable behavior. Beyond the ability to favorably change target behavior, drugs can be further filtered by their target and effect specificity (low chance of binding to unintended targets and low chance of side effects, respectively). It follows that large collections of DTIs are valuable during drug development. DTIs can be established experimentally, but computational methods are attractive for their ability to reduce drug discovery cost and time.
Computational methods
Computational methods innovate both
- the data structures used to represent the drugs and targets (SMILES, graphs, ECFP, etc.) and
- the models used to predict or simulate binding affinity.
Data structures and the models that consume them are usually tightly coupled. That said, there’s opportunity to experiment with both somewhat independently of each other.
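As a concrete (non-BindingDB) illustration of two such data structures, ethanol can be encoded either as a SMILES string or as a small molecular graph; the two forms suit very different model families:

```python
# Two common drug representations, illustrated with ethanol (hydrogens implicit).

# 1. SMILES: a linear string encoding of the molecular graph,
#    consumable by sequence models.
ethanol_smiles = "CCO"

# 2. Graph: heavy atoms as nodes, bonds as edges,
#    consumable by message-passing / graph neural networks.
ethanol_graph = {
    "atoms": ["C", "C", "O"],   # node labels
    "bonds": [(0, 1), (1, 2)],  # single bonds between atom indices
}

# Both describe the same molecule; the choice of representation
# largely dictates the choice of model, and vice versa.
assert len(ethanol_graph["atoms"]) == 3
```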
BindingDB
Released in 1995, BindingDB is one of the older datasets capturing drug-target interactions. Since the original release, new drug-target interactions have continued to be curated from scientific literature. As of May 7th, 2024, BindingDB contains 2,882,970 binding measurements across 1,243,034 compounds and 9,311 targets [1].
Goal
This series will focus on:
- Processing BindingDB drugs and targets for machine learning
- Experimenting with representations and embeddings
- Experimenting with binding affinity models
Due to the data scale, processing, embedding, and prediction will require thoughtful tool and pattern selection. For processing, frameworks like DuckDB, Spark, and Polars should make the problem tractable. Scaling to GPU-enabled cloud machines will likely be needed for embedding and prediction.
Coarse source data model
Coarsely, I’ll use a data model that captures reactions (which involve a ligand and target), targets, and ligands. Ligands will be stored as molecules (system level), atoms, and bonds. This data model is flexible to downstream feature engineering and embedding for ML.
Note: BindingDB tends to use "ligand" and "monomer" interchangeably with "drug", and "enzyme" interchangeably with "target".
This release focuses on building the Reaction and Target tables with DuckDB. The next release will focus on the Ligand tables.
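One way to picture the coarse data model is as a handful of record types. This is a hypothetical sketch only (field names loosely follow the tables built in this series, not BindingDB's actual schema); the Ligand breakdown into molecules, atoms, and bonds arrives in the next release:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Reaction:
    """A binding measurement between one ligand and one target."""
    reaction_id: int
    ligand_id: int
    target_id: int
    affinity_ic50: Optional[float] = None  # one of several affinity measures

@dataclass
class Target:
    """A (polymer) target, represented by its amino acid sequence."""
    id: int
    display_name: str
    amino_acid_sequence: str

@dataclass
class Ligand:
    """A ligand: system-level molecule plus atom and bond breakdowns."""
    id: int
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)
```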
BindingDB SQL Setup
Create MySQL database from dump file
- Log into your MySQL server
- Create the bindingdb database:
mysql> CREATE DATABASE bindingdb;
- Load data from the dump file into the newly created bindingdb database:
mysql -u username -p bindingdb < BDB-mySQL_All_202404.dmp
Initialize SQL connection
import duckdb
conn = duckdb.connect()
%load_ext sql
%config SqlMagic.displaylimit = 20
%sql conn --alias duckdb
%%sql
INSTALL mysql;
LOAD mysql;
DETACH DATABASE IF EXISTS bindingdb;
ATTACH 'host=localhost user=root port=3306 database=bindingdb' AS bindingdb (TYPE mysql_scanner, READ_ONLY);
Success |
---|
Build Reaction Table
Use the BindingDB SQL DBs to build the reaction table.
Build reaction table backbone
%%sql
WITH ligand_target_ids AS (
    SELECT
        reactant_set_id,
        inhibitor_monomerid AS ligand_id,
        COALESCE(
            enzyme_monomerid,
            enzyme_polymerid,
            enzyme_complexid
        ) AS target_id,
        CASE
            WHEN enzyme_monomerid IS NOT NULL THEN 'monomer'
            WHEN enzyme_polymerid IS NOT NULL THEN 'polymer'
            WHEN enzyme_complexid IS NOT NULL THEN 'complex'
        END AS target_type
    FROM
        bindingdb.enzyme_reactant_set
)
SELECT
    target_type,
    count(*) AS count
FROM
    ligand_target_ids
GROUP BY target_type
target_type | count |
---|---|
complex | 149181 |
polymer | 2695549 |
monomer | 19 |
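The COALESCE/CASE logic in the query above can be mimicked in plain Python to make the precedence explicit (a sketch, not part of the pipeline):

```python
def classify_target(enzyme_monomerid, enzyme_polymerid, enzyme_complexid):
    """Mirror the SQL: the first non-NULL id wins, checked in
    monomer -> polymer -> complex order, matching COALESCE's argument order."""
    for target_id, target_type in (
        (enzyme_monomerid, "monomer"),
        (enzyme_polymerid, "polymer"),
        (enzyme_complexid, "complex"),
    ):
        if target_id is not None:
            return target_id, target_type
    return None, None

assert classify_target(None, 4164, None) == (4164, "polymer")
assert classify_target(7, 4164, None) == (7, "monomer")
```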
Complex and monomer targets will be ignored given:
- Polymers are the majority target type.
- Representing monomers, polymers, and complex molecules in a single format is non-trivial. (We prefer a single representation format to simplify target embedding in later steps.)
reaction_backbone = conn.sql(
    """
    SELECT
        cast(reactant_set_id AS INTEGER) AS reaction_id,
        cast(inhibitor_monomerid AS INTEGER) AS ligand_id,
        cast(enzyme_polymerid AS INTEGER) AS target_id,
    FROM
        bindingdb.enzyme_reactant_set
    WHERE
        enzyme_polymerid IS NOT NULL
    """
)
Build reaction results subtable
%%sql
SELECT
    reactant_set_id AS reaction_id,
    ic50 AS affinity_ic50,
    ki AS affinity_ki,
    kd AS affinity_kd,
    ec50 AS affinity_ec50,
    delta_g,
    koff AS k_off,
    kon AS k_on,
    ph,
    temp,
FROM bindingdb.ki_result
reaction_id | affinity_ic50 | affinity_ki | affinity_kd | affinity_ec50 | delta_g | k_off | k_on | ph | temp |
---|---|---|---|---|---|---|---|---|---|
50958048 | None | 10000 | None | None | None | None | None | None | None |
50958049 | None | 150 | None | None | None | None | None | None | None |
50958050 | None | 6700 | None | None | None | None | None | None | None |
50958051 | 0.100000 | None | None | None | None | None | None | None | None |
50958052 | 0.700000 | None | None | None | None | None | None | None | None |
50958054 | 0.100000 | None | None | None | None | None | None | None | None |
50958056 | 5.1 | None | None | None | None | None | None | None | None |
50958057 | 5.7 | None | None | None | None | None | None | None | None |
50958058 | 25 | None | None | None | None | None | None | None | None |
50958060 | 5.8 | None | None | None | None | None | None | None | None |
50958062 | 7.8 | None | None | None | None | None | None | None | None |
50958063 | 2.7 | None | None | None | None | None | None | None | None |
50958064 | 11 | None | None | None | None | None | None | None | None |
50958066 | 3.8 | None | None | None | None | None | None | None | None |
50958068 | 22 | None | None | None | None | None | None | None | None |
50958069 | 25 | None | None | None | None | None | None | None | None |
50958070 | 73 | None | None | None | None | None | None | None | None |
50958072 | 26 | None | None | None | None | None | None | None | None |
50958074 | 0.600000 | None | None | None | None | None | None | None | None |
50958075 | 0.400000 | None | None | None | None | None | None | None | None |
The affinity measures and binding constants (koff, kon) are string types. These need to be converted.
conn.sql("SELECT ic50, ki, kd, ec50, delta_g, koff, kon, ph, temp FROM bindingdb.ki_result").dtypes
[VARCHAR,
VARCHAR,
VARCHAR,
VARCHAR,
DECIMAL(10,4),
VARCHAR,
VARCHAR,
DECIMAL(10,4),
DECIMAL(10,4)]
from duckdb.typing import VARCHAR, FLOAT
from duckdb import InvalidInputException
from typing import Optional
def string_to_float(string: str) -> Optional[float]:
    if isinstance(string, str):
        try:
            return float(string.replace(">", "").replace("<", ""))
        except ValueError:
            pass
try:
    conn.remove_function("string_to_float")
except InvalidInputException:
    pass

conn.create_function("string_to_float", string_to_float, [VARCHAR], FLOAT)
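A quick sanity check of the conversion logic outside DuckDB (the function is repeated here so the snippet is self-contained):

```python
from typing import Optional

def string_to_float(string: str) -> Optional[float]:
    # Censored measurements like ">10000" or "<0.5" are coerced by stripping
    # the comparison operator; unparseable values fall through to None (NULL).
    if isinstance(string, str):
        try:
            return float(string.replace(">", "").replace("<", ""))
        except ValueError:
            pass

assert string_to_float("10000") == 10000.0
assert string_to_float(">10000") == 10000.0  # censored value, operator dropped
assert string_to_float("n.d.") is None       # unparseable -> None (NULL in DuckDB)
```

Note that dropping the ">"/"<" operators discards the censoring information (a ">10000" measurement only bounds the true value), which may matter for downstream modeling.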
reaction_result = conn.sql(
    """
    SELECT
        cast(reactant_set_id AS INTEGER) AS reaction_id,
        string_to_float(ic50) AS affinity_ic50,
        string_to_float(ki) AS affinity_ki,
        string_to_float(kd) AS affinity_kd,
        string_to_float(ec50) AS affinity_ec50,
        delta_g,
        string_to_float(koff) AS k_off,
        string_to_float(kon) AS k_on,
        ph,
        temp,
    FROM
        bindingdb.ki_result
    """
)
reaction_result.fetchdf()
reaction_id | affinity_ic50 | affinity_ki | affinity_kd | affinity_ec50 | delta_g | k_off | k_on | ph | temp | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 50958048 | NaN | 10000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 50958049 | NaN | 150.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 50958050 | NaN | 6700.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 50958051 | 0.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 50958052 | 0.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2844743 | 51451441 | 6700.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844744 | 51451442 | 2700.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844745 | 51451443 | 490.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844746 | 51451444 | 17300.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844747 | 51451445 | 510.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844748 rows × 10 columns
Build affinity_measures column
The PySpark DataFrame API would allow this logic to be defined in a more concise way. The DuckDB Expression API is powerful but far more verbose.
from duckdb import (
CaseExpression,
ColumnExpression,
ConstantExpression,
FunctionExpression,
StarExpression,
)
from itertools import combinations
from functools import reduce
affinity_cols = ["affinity_ic50", "affinity_ki", "affinity_kd", "affinity_ec50"]

affinity_measure_candidates = sorted(
    [
        comb
        for comb_r in [combinations(affinity_cols, r=i) for i in range(1, len(affinity_cols) + 1)]
        for comb in comb_r
    ],
    key=lambda x: len(x),
    reverse=True,
)

case_args = lambda affinity_measures: (
    FunctionExpression(
        "list_aggregate",
        FunctionExpression(
            "list_value",
            *[
                ColumnExpression(affinity_measure).isnotnull()
                for affinity_measure in affinity_measures
            ],
        ),
        ConstantExpression("min"),
    )
    > 0,
    FunctionExpression(
        "list_value",
        *[ConstantExpression(affinity_measure) for affinity_measure in affinity_measures],
    ),
)

affinities_list_map = reduce(
    lambda acc, affinity_measures: acc.when(*case_args(affinity_measures)),
    affinity_measure_candidates[1:],
    CaseExpression(*case_args(affinity_measure_candidates[0])),
).alias("affinity_measures")
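Functionally, the expression mirrors this plain-Python version: candidate column combinations are tried from largest to smallest, and a row is labeled with the first combination whose columns are all non-null, which (given the ordering) is exactly the set of populated measures:

```python
from itertools import combinations

affinity_cols = ["affinity_ic50", "affinity_ki", "affinity_kd", "affinity_ec50"]

# All non-empty combinations, longest first (mirrors affinity_measure_candidates).
candidates = sorted(
    [c for i in range(1, len(affinity_cols) + 1) for c in combinations(affinity_cols, r=i)],
    key=len,
    reverse=True,
)

def affinity_measures(row: dict):
    # First (i.e. largest) combination whose columns are all populated.
    for combo in candidates:
        if all(row.get(col) is not None for col in combo):
            return list(combo)
    return None  # no affinity measure recorded -> NULL in the table

assert affinity_measures({"affinity_ic50": 5.1}) == ["affinity_ic50"]
assert affinity_measures({"affinity_ic50": 2.9, "affinity_ki": 12.0}) == ["affinity_ic50", "affinity_ki"]
assert affinity_measures({}) is None
```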
Join reaction backbone and result
reaction = reaction_backbone.join(reaction_result, "reaction_id", "left").select(
    "reaction_id",
    "ligand_id",
    "target_id",
    "affinity_ic50",
    "affinity_ki",
    "affinity_kd",
    "affinity_ec50",
    affinities_list_map,
    "delta_g",
    "k_off",
    "k_on",
    "ph",
    "temp",
)
reaction.dtypes
[INTEGER,
INTEGER,
INTEGER,
FLOAT,
FLOAT,
FLOAT,
FLOAT,
VARCHAR[],
DECIMAL(10,4),
FLOAT,
FLOAT,
DECIMAL(10,4),
DECIMAL(10,4)]
reaction.fetchdf()
reaction_id | ligand_id | target_id | affinity_ic50 | affinity_ki | affinity_kd | affinity_ec50 | affinity_measures | delta_g | k_off | k_on | ph | temp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9826 | 6121 | 13 | 5.1 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
1 | 9827 | 6122 | 13 | 2.8 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
2 | 9828 | 6123 | 13 | 12.0 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
3 | 9829 | 6124 | 13 | 25.0 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
4 | 9830 | 6125 | 13 | 1.5 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2695544 | 50748287 | 50239048 | 4164 | NaN | NaN | NaN | 0.8 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695545 | 50748290 | 50239040 | 4164 | NaN | NaN | NaN | 1.0 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695546 | 50748294 | 50239049 | 4164 | NaN | NaN | NaN | 2.0 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695547 | 50748235 | 50239045 | 4164 | NaN | NaN | NaN | 86.0 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695548 | 50748301 | 50239042 | 50002942 | NaN | NaN | NaN | 728.0 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695549 rows × 13 columns
reaction.aggregate("affinity_measures, count(*) AS count").order("len(affinity_measures)")
┌─────────────────────────────────────────────┬─────────┐
│ affinity_measures │ count │
│ varchar[] │ int64 │
├─────────────────────────────────────────────┼─────────┤
│ [affinity_ic50] │ 1778076 │
│ [affinity_ki] │ 553740 │
│ [affinity_ec50] │ 250765 │
│ [affinity_kd] │ 104744 │
│ [affinity_ic50, affinity_ec50] │ 1409 │
│ [affinity_ic50, affinity_ki] │ 2889 │
│ [affinity_kd, affinity_ec50] │ 11 │
│ [affinity_ki, affinity_kd] │ 61 │
│ [affinity_ic50, affinity_kd] │ 195 │
│ [affinity_ki, affinity_ec50] │ 817 │
│ [affinity_ic50, affinity_ki, affinity_kd] │ 9 │
│ [affinity_ic50, affinity_ki, affinity_ec50] │ 99 │
│ NULL │ 2734 │
├─────────────────────────────────────────────┴─────────┤
│ 13 rows 2 columns │
└───────────────────────────────────────────────────────┘
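A quick back-of-the-envelope check on the distribution (counts copied from the table above): reactions with exactly one affinity measure dominate.

```python
# Counts copied from the affinity_measures distribution above.
single = 1778076 + 553740 + 250765 + 104744          # exactly one measure
multi = 1409 + 2889 + 11 + 61 + 195 + 817 + 9 + 99   # two or more measures
null = 2734                                          # no measure recorded
total = single + multi + null

assert total == 2695549  # matches the reaction table's row count
print(f"{single / total:.1%} of reactions report exactly one affinity measure")
# 99.7% of reactions report exactly one affinity measure
```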
Write
conn.sql(
    f"COPY (SELECT * FROM reaction) TO '{processed_data_path / 'reaction.parquet'}' (FORMAT 'parquet')"
)
Build Target Table
Use the BindingDB SQL DBs to build the (polymer) target table.
Spot check
initial_target = conn.sql(
    """
    SELECT
        polymerid AS id,
        scientific_name,
        display_name,
        "type",
        topology,
        sequence AS amino_acid_sequence,
        weight
    FROM
        bindingdb.polymer
    WHERE
        component_id IS NULL
    """
)
Check for parsimony in low-cardinality categorical fields.
initial_target.select("topology").distinct()
┌───────────────────────────────┐
│ topology │
│ varchar │
├───────────────────────────────┤
│ NULL │
│ linear │
│ HTH motiff │
│ Bi-molecular anti-parallel G4 │
│ null │
│ Linear │
│ double helix │
│ Uni-molecular parallel G4 │
│ undefined │
│ NULL │
│ Four-stranded parallel G4 │
├───────────────────────────────┤
│ 11 rows │
└───────────────────────────────┘
Two issues:
- “Nullish” values are encoded using multiple formats – “undefined”, “NULL” (string), NULL (value), and “null”
- Inconsistent casing – “linear” vs. “Linear”
Detecting and filling “nullish” strings
from duckdb.typing import VARCHAR, BOOLEAN
from duckdb import InvalidInputException
def is_nullish_str(string: str) -> bool:
    return isinstance(string, str) and string.strip().lower() in (
        "null",
        "none",
        "undefined",
        "na",
        "nan",
        "n/a",
        "unknown",
        "",
    )
try:
    conn.remove_function("is_nullish_str")
except InvalidInputException:
    pass

conn.create_function("is_nullish_str", is_nullish_str, [VARCHAR], BOOLEAN)
<duckdb.duckdb.DuckDBPyConnection at 0x1071d2b30>
assert all(
    conn.sql(
        "SELECT is_nullish_str('null'), is_nullish_str('none'), is_nullish_str('undefined'), is_nullish_str('na'), is_nullish_str('nan'), is_nullish_str('n/a'), is_nullish_str('unknown'), is_nullish_str('')"
    ).fetchdf().iloc[0].values
)
Other string fields have nullish values that need to be corrected as well.
%sql SELECT * FROM initial_target WHERE is_nullish_str(scientific_name) OR is_nullish_str(display_name) OR is_nullish_str(type) OR is_nullish_str(topology)
id | scientific_name | display_name | type | topology | amino_acid_sequence | weight |
---|---|---|---|---|---|---|
1368 | None | Unknown | null | null | None | 358.43 |
416 | None | Metallo-beta-lactamase type 2 | undefined | undefined | MSKLSVFFIFLFCSIATAAESLPDLKIEKLDEGVYVHTSFEEVNGWGVVPKHGLVVLVNA EAYLIDTPFTAKDTEKLVTWFVERGYKIKGSISSHFHSDSTGGIEWLNSRSIPTYASELT NELLKKDGKVQATNSFSGVNYWLVKNKIEVFYPGPGHTPDNVVVWLPERKILFGGCFIKP YGLGNLGDANIEAWPKSAKLLKSKYGKAKLVVPSHSEVGDASLLKLTLEQAVKGLNESKK PSKPSN | 27125.88 |
424 | None | VIM-1 metallo-beta-lactamase | undefined | undefined | MLKVISSLLVYMTASVMAVASPLAHSGEPSGEYPTVNEIPVGEVRLYQIADGVWSHIATQ SFDGAVYPSNGLIVRDGDELLLIDTAWGAKNTAALLAEIEKQIGLPVTRAVSTHFHDDRV GGVDVLRAAGVATYASPSTRRLAEAEGNEIPTHSLEGLSSSGDAVRFGPVELFYPGAAHS TDNLVVYVPSANVLYGGCAVHELSSTSAGNVADADLAEWPTSVERIQKHYPEAEVVIPGH GLPGGLDLLQHTANVVKAHKNRSVAE | 28014.53 |
687 | None | 5-hydroxytryptamine receptor 2A | undefined | undefined | MDILCEENTSLSSTTNSLMQLNDDTRLYSNDFNSGEANTSDAFNWTVDSENRTNLSCEGCLSPSCLSLLHLQEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIADMLLGFLVMPVSMLTILYGYRWPLPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNPIHHSRFNSRTKAFLKIIAVWTISVGISMPIPVFGLQDDSKVFKEGSCLLADDNFVLIGSFVSFFIPLTIMVITYFLTIKSLQKEATLCVSDLGTRAKLASFSFLPQSSLSSEKLFQRSIHREPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICKESCNEDVIGALLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENKKPLQLILVNTIPALAYKSSQLQMGQKKNSKQDAKTTDNDCSMVALGKQHSEEASKDNSDGVNEKVSCV | 52607.65 |
1094 | None | Caspase-3 Substrate | null | Linear | Ac-Asp-Glu-Val-Asp-pNA | 1438.65 |
1097 | None | Caspase Substrate | null | null | None | 358.43 |
1109 | None | Caspase-1 Fluorogenic Substrate | null | null | Ac-Tyr-Val-Ala-Asp-AMC | 1713.07 |
1111 | None | Caspase-2 Fluorogenic Substrate | null | null | Ac-Val-Asp-Val-Ala-Asp-AMC | 1830.24 |
1113 | None | Caspase-4 Fluorogenic Substrate | null | null | Ac-Leu-Glu-Glu-Asp-AMC | 1334.60 |
1119 | None | Caspase-9 Fluorogenic Substrate | null | null | Ac-Leu-Glu-His-Asp-AFC | 1518.25 |
1124 | None | Caspase-6 Fluorogenic Substrate | null | Linear | Ac-Val-Glu-Ile-Asp-AMC | 1560.92 |
1397 | None | Not Specified | null | null | None | 358.43 |
1830 | None | Not Specified | null | null | None | 358.43 |
2388 | None | Fluorogenic peptide | null | null | Dabcyl-TSAVLQSGFR-Edans | 2145.26 |
2726 | Proteus vulgaris | Urease subunit gamma | null | null | MELTPREKDKLLLFTAGLVAERRLAKGLKLNYPEAVALISCAIMEGAREGKTVAQLMSEG RSVLAAEQVMEGVPEMIKDIQVECTFPDGTKLVSIHDPIV | 10908.21 |
8 | Escherichia coli | SecB | Protein | NULL | MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQKDWQPEVKLDLDTASSQLADDVYEVVL RVTVTASLGEETAFLCEVQQGGIFSIAGIEGTQMAHCLGAYCPNILFPYARECITSMVSR GTFPQLNLAPVNFDALFMNYLQQQAGEGTEEHQDA | 17264.58 |
14 | Mus musculus | HyHEL-5 Fab | Antibody | NULL | EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGEILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD | 23011.84 |
15 | Gallus gallus | Lysozyme C | Enzyme | NULL | MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQA TNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDG NGMNAWVAWRNRCKGTDVQAWIRGCRL | 16248.07 |
16 | Mus musculus | HyHEL-5 Mutant(E50D) | Antibody | NULL | EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGDILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD | 22997.82 |
17 | Mus musculus | HyHEL-5 Mutant(E50Q) | Antibody | NULL | EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGQILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD | 23011.87 |
Build
target = conn.sql(
    """
    SELECT
        polymerid AS id,
        CASE
            WHEN is_nullish_str(scientific_name) THEN NULL
            ELSE scientific_name
        END AS scientific_name,
        CASE
            WHEN is_nullish_str(display_name) THEN NULL
            ELSE display_name
        END AS display_name,
        CASE
            WHEN is_nullish_str(type) OR (type IS NULL) THEN NULL
            ELSE lower(type)
        END AS type,
        CASE
            WHEN is_nullish_str(topology) OR (topology IS NULL) THEN NULL
            ELSE lower(topology)
        END AS topology,
        sequence AS amino_acid_sequence,
        weight
    FROM
        bindingdb.polymer
    WHERE
        component_id IS NULL
    """
)
target.fetchdf()
id | scientific_name | display_name | type | topology | amino_acid_sequence | weight | |
---|---|---|---|---|---|---|---|
0 | 1 | Human herpesvirus 1 | Thymidine kinase | enzyme | linear | MASYPCHQHASAFDQAARSRGHNNRRTALRPRRQQEATEVRPEQKM... | 40978.57 |
1 | 2 | Streptomyces avidinii | Streptavadin(N23A) | protein | linear | DPSKDSKAQV SAAEAGITGT WYAQLGSTFI VTAGADGALT GT... | 16447.70 |
2 | 3 | Streptomyces avidinii | Streptavadin(Y43A) | protein | linear | DPSKDSKAQV SAAEAGITGT WYNQLGSTFI VTAGADGALT GT... | 16398.63 |
3 | 4 | Streptomyces avidinii | Streptavadin(S27A) | protein | linear | DPSKDSKAQV SAAEAGITGT WYNQLGATFI VTAGADGALT GT... | 16474.72 |
4 | 6 | Streptomyces avidinii | Streptavidin (Y43F) | protein | linear | DPSKDSKAQV SAAEAGITGT WYNQLGSTFI VTAGADGALT GT... | 16474.72 |
... | ... | ... | ... | ... | ... | ... | ... |
5565 | 50008001 | None | Penicillin-binding protein 2B | protein | None | MRKFNSHSIPIRLNLLFSIVILLFMTIIGRLLYMQVLNKDFYEKKL... | 73859.38 |
5566 | 50008002 | None | DNA gyrase subunit A | protein | None | MSDLAREITPVNIEEELKSSYLDYAMSVIVGRALPDVRDGLKPVHR... | 97006.20 |
5567 | 50008003 | None | 1,3-beta-glucan synthase | protein | None | MSYNDNNNNHYYDPNQQGGVPPQQGDGYYQQQYDDMSQQQQYDNMG... | 218939.56 |
5568 | 50008005 | None | 1,3-beta-glucan synthase | protein | None | MSYNNNGQQMSDQGYYDNNQGYYQPEDQQNGQAMYGDEGYYDPNIS... | 213936.94 |
5569 | 50008006 | None | Platelet glycoprotein 4 | protein | None | MGCDRNCGLIAGAVIGAVLAVFGGILMPVGDLLIQKTIKKQVVLEE... | 53057.78 |
5570 rows × 7 columns
Write
conn.sql(
    f"COPY (SELECT * FROM target) TO '{processed_data_path / 'target.parquet'}' (FORMAT 'parquet')"
)
DuckDB impressions
I’ve been following the DuckDB hype but haven’t had a chance to get firsthand experience until now. The entire BindingDB SQL database is only 6.5 GB, an easily manageable size even for Pandas. This isn’t what DuckDB was built for – it’s clear overkill. That said, it was a good opportunity to get familiar with the framework. While the data is too small to give DuckDB an obvious leg up, the framework does make lazily loading tables from the source MySQL database easy.

There’d be little reason to use another tool if it weren’t for DuckDB’s API design. As is, transformations that would be concise in Pandas, Polars, or even Spark’s DataFrame API are verbose in DuckDB – FunctionExpression, CaseExpression, list_aggregate, etc. Also, I prefer DataFrame APIs over SQL for their composability. DuckDB has implemented a Spark DataFrame API, but it’s experimental. The native “Expression” and “Relational” APIs support composition and piping but are idiosyncratic. Learning these could be justified for larger datasets, but it wasn’t worth the squeeze here.

Based on benchmarks (Database-like ops benchmark, TPC-H Benchmarks at Scale), Polars is comparable to DuckDB for datasets on the 50–100 GB scale. Considering code writing/reading efficiency as well as runtime efficiency, Polars is the framework of choice for problems of this scale. Larger datasets or high-frequency pipelines would benefit from DuckDB’s additional speed.