DTI 1a: Ducking BindingDB

Introduction to drug-target interaction with BindingDB and DuckDB.

Published

May 15, 2024

Introduction

Drug-target interaction, or DTI, refers to the binding of a drug (a chemical) to a target (typically a protein). A beneficial drug-target binding either up-regulates favorable behavior or down-regulates unfavorable behavior. Beyond the ability to favorably change target behavior, drugs can be further filtered by their target and effect specificity (low chance of binding to unintended targets and low chance of side effects, respectively). It follows that large collections of DTIs are valuable during drug development. DTIs can be established experimentally, but computational methods are attractive for their ability to reduce drug discovery cost and time.

Computational methods

Computational methods innovate both

  • the data structures used to represent the drugs and targets (SMILES, graphs, ECFP, etc.) and
  • the models used to predict or simulate binding affinity.

Data structures and the models that consume them are usually tightly coupled. That said, there’s opportunity to experiment with both somewhat independently of each other.
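To make the coupling concrete, here is an illustrative sketch (not BindingDB code) of the same molecule, ethanol, in two of the representations named above; the molecule and encodings are hypothetical examples chosen for brevity:

```python
# 1. SMILES: a linear string encoding of the molecular graph.
#    A sequence model (e.g. a transformer) would consume this as tokens.
ethanol_smiles = "CCO"

# 2. Graph: atoms as nodes, bonds as edges (a simple adjacency list here).
#    A graph neural network would consume nodes and edges directly.
ethanol_graph = {
    "atoms": ["C", "C", "O"],
    "bonds": [(0, 1), (1, 2)],  # single bonds between atom indices
}

print(len(ethanol_smiles), len(ethanol_graph["bonds"]))
```

The representation constrains the model family, which is why the two are usually developed together.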

BindingDB

Released in 1995, BindingDB is one of the older datasets capturing drug-target interactions. Since the original release, new drug-target interactions have continued to be curated from scientific literature. As of May 7th, 2024, BindingDB contains 2,882,970 measurements, 1,243,034 compounds, and 9,311 targets [1].

Goal

This series will focus on:

  1. Processing BindingDB drugs and targets for machine learning
  2. Experimenting with representations and embeddings
  3. Experimenting with binding affinity models

Due to the data scale, processing, embedding, and prediction will require thoughtful tool and pattern selection. For processing, frameworks like DuckDB, Spark, and Polars should make the problem tractable. Scaling to GPU-enabled cloud machines will likely be needed for embedding and prediction.

Coarse source data model

Coarsely, I’ll use a data model that captures reactions (which involve a ligand and target), targets, and ligands. Ligands will be stored as molecules (system level), atoms, and bonds. This data model is flexible to downstream feature engineering and embedding for ML.

erDiagram
    Reaction
    Target

    Ligand_Molecule
    Ligand_Atom
    Ligand_Bond

    Reaction }o--|| Ligand_Molecule : "reacts"
    Reaction }o--|| Target : "reacts"
    Ligand_Molecule ||--|{ Ligand_Atom : "contains"
    Ligand_Bond }|--|{ Ligand_Atom : "bonds"

Note: BindingDB tends to use “ligand” and “monomer” interchangeably with “drug”, and “enzyme” interchangeably with “target”.

This release focuses on building the Reaction and Target tables with DuckDB. The next release will focus on the Ligand tables.

BindingDB SQL Setup

Create MySQL database from dump file

  1. Log into your MySQL server
  2. Create the bindingdb database
mysql> create database bindingdb;
  3. Load data from the dump file into the newly created bindingdb database
mysql -u username -p bindingdb < BDB-mySQL_All_202404.dmp

Initialize SQL connection

import duckdb

conn = duckdb.connect()
%load_ext sql
%config SqlMagic.displaylimit = 20
%sql conn --alias duckdb
%%sql
INSTALL mysql;
DETACH DATABASE IF EXISTS bindingdb;
LOAD mysql;
ATTACH 'host=localhost user=root port=3306 database=bindingdb' AS bindingdb (TYPE mysql_scanner, READ_ONLY);
Running query in 'duckdb'
Success

Build Reaction Table

Use the BindingDB SQL DBs to build the reaction table.

erDiagram
    Reaction {
        int reaction_id PK
        int ligand_id FK
        int target_id FK
        float affinity_ic50
        float affinity_ki
        float affinity_kd
        float affinity_ec50
        varchar[] affinity_measures "list of non-NULL affinity measures by their column names"
        decimal delta_g
        float k_off
        float k_on
        decimal ph
        decimal temp
    }

Build reaction table backbone

%%sql
WITH ligand_target_ids AS (
    SELECT
        reactant_set_id,
        inhibitor_monomerid AS ligand_id,
        COALESCE(
            enzyme_monomerid,
            enzyme_polymerid,
            enzyme_complexid
        ) AS target_id,
        CASE
            WHEN enzyme_monomerid IS NOT NULL THEN 'monomer'
            WHEN enzyme_polymerid IS NOT NULL THEN 'polymer'
            WHEN enzyme_complexid IS NOT NULL THEN 'complex'
        END AS target_type
    FROM
        bindingdb.enzyme_reactant_set
)
SELECT
    target_type,
    count(*) AS count
FROM
    ligand_target_ids
GROUP BY
    target_type
Running query in 'duckdb'
target_type count
complex 149181
polymer 2695549
monomer 19
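The COALESCE/CASE pairing above means each reaction's target id and type come from the first non-NULL enzyme column. The same precedence in plain Python, using a hypothetical `target_of` helper over row dicts:

```python
def target_of(row):
    # Mirror of COALESCE + CASE: first non-NULL of monomer, polymer, complex.
    for col, kind in [
        ("enzyme_monomerid", "monomer"),
        ("enzyme_polymerid", "polymer"),
        ("enzyme_complexid", "complex"),
    ]:
        if row.get(col) is not None:
            return row[col], kind
    return None, None

print(target_of({"enzyme_polymerid": 13}))  # (13, 'polymer')
```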

Complex and monomer targets will be ignored given:

  1. Polymers are the majority target type.
  2. Representing monomers, polymers, and complex molecules in a single format is non-trivial. (We prefer a single representation format to simplify target embedding in later steps.)

reaction_backbone = conn.sql(
    """
SELECT
    cast(reactant_set_id AS INTEGER) AS reaction_id,
    cast(inhibitor_monomerid AS INTEGER) AS ligand_id,
    cast(enzyme_polymerid AS INTEGER) AS target_id,
FROM
    bindingdb.enzyme_reactant_set
WHERE
    enzyme_polymerid IS NOT NULL
"""
)

Build reaction results subtable

%%sql
SELECT
    reactant_set_id AS reaction_id,
    ic50 AS affinity_ic50,
    ki AS affinity_ki,
    kd AS affinity_kd,
    ec50 AS affinity_ec50,
    delta_g,
    koff AS k_off,
    kon AS k_on,
    ph,
    temp,
FROM
    bindingdb.ki_result
Running query in 'duckdb'
reaction_id affinity_ic50 affinity_ki affinity_kd affinity_ec50 delta_g k_off k_on ph temp
50958048 None 10000 None None None None None None None
50958049 None 150 None None None None None None None
50958050 None 6700 None None None None None None None
50958051 0.100000 None None None None None None None None
50958052 0.700000 None None None None None None None None
50958054 0.100000 None None None None None None None None
50958056 5.1 None None None None None None None None
50958057 5.7 None None None None None None None None
50958058 25 None None None None None None None None
50958060 5.8 None None None None None None None None
50958062 7.8 None None None None None None None None
50958063 2.7 None None None None None None None None
50958064 11 None None None None None None None None
50958066 3.8 None None None None None None None None
50958068 22 None None None None None None None None
50958069 25 None None None None None None None None
50958070 73 None None None None None None None None
50958072 26 None None None None None None None None
50958074 0.600000 None None None None None None None None
50958075 0.400000 None None None None None None None None
Truncated to displaylimit of 20.

The affinity measures and binding constants (koff, kon) are string types. These need to be converted.

conn.sql("SELECT ic50, ki, kd, ec50, delta_g, koff, kon, ph, temp FROM bindingdb.ki_result").dtypes
[VARCHAR,
 VARCHAR,
 VARCHAR,
 VARCHAR,
 DECIMAL(10,4),
 VARCHAR,
 VARCHAR,
 DECIMAL(10,4),
 DECIMAL(10,4)]
from duckdb.typing import VARCHAR, FLOAT
from duckdb import InvalidInputException
from typing import Optional


def string_to_float(string: str) -> Optional[float]:
    """Parse a numeric string to a float, stripping '>'/'<' qualifier
    prefixes; return None for non-strings and unparsable values."""
    if isinstance(string, str):
        try:
            return float(string.replace(">", "").replace("<", ""))
        except ValueError:
            pass
    return None


try:
    conn.remove_function("string_to_float");
except InvalidInputException:
    pass

conn.create_function("string_to_float", string_to_float, [VARCHAR], FLOAT);
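For context, the stripping exists because the affinity strings can carry qualifier prefixes like ">" or "<" (censored measurements). Standalone, the helper behaves as follows (re-declared here purely for illustration):

```python
from typing import Optional


def string_to_float(string: str) -> Optional[float]:
    # Strip qualifier prefixes before parsing; None for anything unparsable
    # or non-string (e.g. SQL NULL arrives as None).
    if isinstance(string, str):
        try:
            return float(string.replace(">", "").replace("<", ""))
        except ValueError:
            pass
    return None


print(string_to_float(">10000"), string_to_float("abc"), string_to_float(None))
# 10000.0 None None
```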
reaction_result = conn.sql(
    """
SELECT
    cast(reactant_set_id AS INTEGER) AS reaction_id,
    string_to_float(ic50) AS affinity_ic50,
    string_to_float(ki) AS affinity_ki,
    string_to_float(kd) AS affinity_kd,
    string_to_float(ec50) AS affinity_ec50,
    delta_g,
    string_to_float(koff) AS k_off,
    string_to_float(kon) AS k_on,
    ph,
    temp,
FROM
    bindingdb.ki_result
"""
)
reaction_result.fetchdf()
reaction_id affinity_ic50 affinity_ki affinity_kd affinity_ec50 delta_g k_off k_on ph temp
0 50958048 NaN 10000.0 NaN NaN NaN NaN NaN NaN NaN
1 50958049 NaN 150.0 NaN NaN NaN NaN NaN NaN NaN
2 50958050 NaN 6700.0 NaN NaN NaN NaN NaN NaN NaN
3 50958051 0.1 NaN NaN NaN NaN NaN NaN NaN NaN
4 50958052 0.7 NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
2844743 51451441 6700.0 NaN NaN NaN NaN NaN NaN NaN NaN
2844744 51451442 2700.0 NaN NaN NaN NaN NaN NaN NaN NaN
2844745 51451443 490.0 NaN NaN NaN NaN NaN NaN NaN NaN
2844746 51451444 17300.0 NaN NaN NaN NaN NaN NaN NaN NaN
2844747 51451445 510.0 NaN NaN NaN NaN NaN NaN NaN NaN

2844748 rows × 10 columns

Build affinity_measures column

The PySpark DataFrame API would allow this logic to be defined in a more concise way. The DuckDB Expression API is powerful but far more verbose.

from duckdb import (
    CaseExpression,
    ColumnExpression,
    ConstantExpression,
    FunctionExpression,
    StarExpression,
)

from itertools import combinations
from functools import reduce

affinity_cols = ["affinity_ic50", "affinity_ki", "affinity_kd", "affinity_ec50"]

affinity_measure_candidates = sorted(
    [
        comb
        for comb_r in [combinations(affinity_cols, r=i) for i in range(1, len(affinity_cols) + 1)]
        for comb in comb_r
    ],
    key=lambda x: len(x),
    reverse=True,
)

# case_args builds one CASE branch as a (condition, value) pair: the
# condition is true when every affinity column in the combination is
# non-NULL; the value is the list of those column names.
def case_args(affinity_measures):
    return (
        FunctionExpression(
            "list_aggregate",
            FunctionExpression(
                "list_value",
                *[
                    ColumnExpression(affinity_measure).isnotnull()
                    for affinity_measure in affinity_measures
                ]
            ),
            ConstantExpression("min"),
        )
        > 0,
        FunctionExpression(
            "list_value",
            *[ConstantExpression(affinity_measure) for affinity_measure in affinity_measures]
        ),
    )

affinities_list_map = reduce(
    lambda acc, affinity_measures: acc.when(*case_args(affinity_measures)),
    affinity_measure_candidates[1:],
    CaseExpression(*case_args(affinity_measure_candidates[0])),
).alias("affinity_measures")
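What the resulting CASE ladder computes is easier to see in plain Python. The sketch below (a hypothetical `affinity_measures` helper over row dicts, not part of the pipeline) mirrors the logic: because candidates are ordered by length descending, the longest all-non-NULL combination wins.

```python
from itertools import combinations

affinity_cols = ["affinity_ic50", "affinity_ki", "affinity_kd", "affinity_ec50"]

# All non-empty combinations, longest first (sorted is stable, so ties
# keep their generation order).
candidates = sorted(
    (c for i in range(1, len(affinity_cols) + 1) for c in combinations(affinity_cols, r=i)),
    key=len,
    reverse=True,
)

def affinity_measures(row):
    # First candidate whose columns are all non-NULL, else None.
    for cand in candidates:
        if all(row.get(col) is not None for col in cand):
            return list(cand)
    return None

print(affinity_measures({"affinity_ic50": 5.1}))
# ['affinity_ic50']
print(affinity_measures({"affinity_ic50": 0.1, "affinity_ki": 150.0}))
# ['affinity_ic50', 'affinity_ki']
```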

Join reaction backbone and result

reaction = reaction_backbone.join(reaction_result, "reaction_id", "left").select(
    "reaction_id",
    "ligand_id",
    "target_id",
    "affinity_ic50",
    "affinity_ki",
    "affinity_kd",
    "affinity_ec50",
    affinities_list_map,
    "delta_g",
    "k_off",
    "k_on",
    "ph",
    "temp",
)
reaction.dtypes
[INTEGER,
 INTEGER,
 INTEGER,
 FLOAT,
 FLOAT,
 FLOAT,
 FLOAT,
 VARCHAR[],
 DECIMAL(10,4),
 FLOAT,
 FLOAT,
 DECIMAL(10,4),
 DECIMAL(10,4)]
reaction.fetchdf()
reaction_id ligand_id target_id affinity_ic50 affinity_ki affinity_kd affinity_ec50 affinity_measures delta_g k_off k_on ph temp
0 9826 6121 13 5.1 NaN NaN NaN [affinity_ic50] NaN NaN NaN NaN NaN
1 9827 6122 13 2.8 NaN NaN NaN [affinity_ic50] NaN NaN NaN NaN NaN
2 9828 6123 13 12.0 NaN NaN NaN [affinity_ic50] NaN NaN NaN NaN NaN
3 9829 6124 13 25.0 NaN NaN NaN [affinity_ic50] NaN NaN NaN NaN NaN
4 9830 6125 13 1.5 NaN NaN NaN [affinity_ic50] NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2695544 50748287 50239048 4164 NaN NaN NaN 0.8 [affinity_ec50] NaN NaN NaN NaN NaN
2695545 50748290 50239040 4164 NaN NaN NaN 1.0 [affinity_ec50] NaN NaN NaN NaN NaN
2695546 50748294 50239049 4164 NaN NaN NaN 2.0 [affinity_ec50] NaN NaN NaN NaN NaN
2695547 50748235 50239045 4164 NaN NaN NaN 86.0 [affinity_ec50] NaN NaN NaN NaN NaN
2695548 50748301 50239042 50002942 NaN NaN NaN 728.0 [affinity_ec50] NaN NaN NaN NaN NaN

2695549 rows × 13 columns

reaction.aggregate("affinity_measures, count(*) AS count").order("len(affinity_measures)")
┌─────────────────────────────────────────────┬─────────┐
│              affinity_measures              │  count  │
│                  varchar[]                  │  int64  │
├─────────────────────────────────────────────┼─────────┤
│ [affinity_ic50]                             │ 1778076 │
│ [affinity_ki]                               │  553740 │
│ [affinity_ec50]                             │  250765 │
│ [affinity_kd]                               │  104744 │
│ [affinity_ic50, affinity_ec50]              │    1409 │
│ [affinity_ic50, affinity_ki]                │    2889 │
│ [affinity_kd, affinity_ec50]                │      11 │
│ [affinity_ki, affinity_kd]                  │      61 │
│ [affinity_ic50, affinity_kd]                │     195 │
│ [affinity_ki, affinity_ec50]                │     817 │
│ [affinity_ic50, affinity_ki, affinity_kd]   │       9 │
│ [affinity_ic50, affinity_ki, affinity_ec50] │      99 │
│ NULL                                        │    2734 │
├─────────────────────────────────────────────┴─────────┤
│ 13 rows                                     2 columns │
└───────────────────────────────────────────────────────┘
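As a quick consistency check, the per-combination counts above (plus the NULL bucket, reactions with no affinity measure at all) sum back to the 2,695,549 rows in the reaction table:

```python
# Counts copied from the aggregate above.
counts = [
    1778076, 553740, 250765, 104744,  # single-measure groups
    1409, 2889, 11, 61, 195, 817,     # two-measure groups
    9, 99,                            # three-measure groups
    2734,                             # NULL (no measures)
]
print(sum(counts))  # 2695549
```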

Write

conn.sql(
    f"COPY (SELECT * FROM reaction) TO '{processed_data_path / 'reaction.parquet'}' (FORMAT 'parquet')"
)

Build Target Table

Use the BindingDB SQL DBs to build the (polymer) target table.

erDiagram
    Target {
        int id PK
        varchar scientific_name
        varchar display_name
        varchar type
        varchar topology
        varchar amino_acid_sequence
        float weight
    }

Spot check

initial_target = conn.sql(
    """
SELECT
    polymerid AS id,
    scientific_name,
    display_name,
    "type",
    topology,
    sequence AS amino_acid_sequence,
    weight
FROM
    bindingdb.polymer
WHERE
    component_id IS NULL
"""
)

Check for parsimony in low-cardinality categorical fields.

initial_target.select("topology").distinct()
┌───────────────────────────────┐
│           topology            │
│            varchar            │
├───────────────────────────────┤
│ NULL                          │
│ linear                        │
│ HTH motiff                    │
│ Bi-molecular anti-parallel G4 │
│ null                          │
│ Linear                        │
│ double helix                  │
│ Uni-molecular parallel G4     │
│ undefined                     │
│ NULL                          │
│ Four-stranded parallel G4     │
├───────────────────────────────┤
│            11 rows            │
└───────────────────────────────┘

Two issues:

  1. “Nullish” values are encoded using multiple formats – “undefined”, “NULL” (string), NULL (value), and “null”
  2. Inconsistent casing – “linear” vs. “Linear”

Detecting and filling “nullish” strings

from duckdb.typing import VARCHAR, BOOLEAN
from duckdb import InvalidInputException


def is_nullish_str(string: str) -> bool:
    return isinstance(string, str) and string.strip().lower() in (
        "null",
        "none",
        "undefined",
        "na",
        "nan",
        "n/a",
        "unknown",
        "",
    )


try:
    conn.remove_function("is_nullish_str")
except InvalidInputException:
    pass

conn.create_function("is_nullish_str", is_nullish_str, [VARCHAR], BOOLEAN)
<duckdb.duckdb.DuckDBPyConnection at 0x1071d2b30>
assert all(
    conn.sql(
        "SELECT is_nullish_str('null'), is_nullish_str('none'), is_nullish_str('undefined'), is_nullish_str('na'), is_nullish_str('nan'), is_nullish_str('n/a'), is_nullish_str('unknown'), is_nullish_str('')"
    ).fetchdf().iloc[0].values
)

Other string fields have nullish values that need to be corrected as well.

%sql SELECT * FROM initial_target WHERE is_nullish_str(scientific_name) OR is_nullish_str(display_name) OR is_nullish_str(type) OR is_nullish_str(topology)
Running query in 'duckdb'
id scientific_name display_name type topology amino_acid_sequence weight
1368 None Unknown null null None 358.43
416 None Metallo-beta-lactamase type 2 undefined undefined MSKLSVFFIFLFCSIATAAESLPDLKIEKLDEGVYVHTSFEEVNGWGVVPKHGLVVLVNA EAYLIDTPFTAKDTEKLVTWFVERGYKIKGSISSHFHSDSTGGIEWLNSRSIPTYASELT NELLKKDGKVQATNSFSGVNYWLVKNKIEVFYPGPGHTPDNVVVWLPERKILFGGCFIKP YGLGNLGDANIEAWPKSAKLLKSKYGKAKLVVPSHSEVGDASLLKLTLEQAVKGLNESKK PSKPSN 27125.88
424 None VIM-1 metallo-beta-lactamase undefined undefined MLKVISSLLVYMTASVMAVASPLAHSGEPSGEYPTVNEIPVGEVRLYQIADGVWSHIATQ SFDGAVYPSNGLIVRDGDELLLIDTAWGAKNTAALLAEIEKQIGLPVTRAVSTHFHDDRV GGVDVLRAAGVATYASPSTRRLAEAEGNEIPTHSLEGLSSSGDAVRFGPVELFYPGAAHS TDNLVVYVPSANVLYGGCAVHELSSTSAGNVADADLAEWPTSVERIQKHYPEAEVVIPGH GLPGGLDLLQHTANVVKAHKNRSVAE 28014.53
687 None 5-hydroxytryptamine receptor 2A undefined undefined MDILCEENTSLSSTTNSLMQLNDDTRLYSNDFNSGEANTSDAFNWTVDSENRTNLSCEGCLSPSCLSLLHLQEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIADMLLGFLVMPVSMLTILYGYRWPLPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNPIHHSRFNSRTKAFLKIIAVWTISVGISMPIPVFGLQDDSKVFKEGSCLLADDNFVLIGSFVSFFIPLTIMVITYFLTIKSLQKEATLCVSDLGTRAKLASFSFLPQSSLSSEKLFQRSIHREPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICKESCNEDVIGALLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENKKPLQLILVNTIPALAYKSSQLQMGQKKNSKQDAKTTDNDCSMVALGKQHSEEASKDNSDGVNEKVSCV 52607.65
1094 None Caspase-3 Substrate null Linear Ac-Asp-Glu-Val-Asp-pNA 1438.65
1097 None Caspase Substrate null null None 358.43
1109 None Caspase-1 Fluorogenic Substrate null null Ac-Tyr-Val-Ala-Asp-AMC 1713.07
1111 None Caspase-2 Fluorogenic Substrate null null Ac-Val-Asp-Val-Ala-Asp-AMC 1830.24
1113 None Caspase-4 Fluorogenic Substrate null null Ac-Leu-Glu-Glu-Asp-AMC 1334.60
1119 None Caspase-9 Fluorogenic Substrate null null Ac-Leu-Glu-His-Asp-AFC 1518.25
1124 None Caspase-6 Fluorogenic Substrate null Linear Ac-Val-Glu-Ile-Asp-AMC 1560.92
1397 None Not Specified null null None 358.43
1830 None Not Specified null null None 358.43
2388 None Fluorogenic peptide null null Dabcyl-TSAVLQSGFR-Edans 2145.26
2726 Proteus vulgaris Urease subunit gamma null null MELTPREKDKLLLFTAGLVAERRLAKGLKLNYPEAVALISCAIMEGAREGKTVAQLMSEG RSVLAAEQVMEGVPEMIKDIQVECTFPDGTKLVSIHDPIV 10908.21
8 Escherichia coli SecB Protein NULL MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQKDWQPEVKLDLDTASSQLADDVYEVVL RVTVTASLGEETAFLCEVQQGGIFSIAGIEGTQMAHCLGAYCPNILFPYARECITSMVSR GTFPQLNLAPVNFDALFMNYLQQQAGEGTEEHQDA 17264.58
14 Mus musculus HyHEL-5 Fab Antibody NULL EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGEILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD 23011.84
15 Gallus gallus Lysozyme C Enzyme NULL MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQA TNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDG NGMNAWVAWRNRCKGTDVQAWIRGCRL 16248.07
16 Mus musculus HyHEL-5 Mutant(E50D) Antibody NULL EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGDILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD 22997.82
17 Mus musculus HyHEL-5 Mutant(E50Q) Antibody NULL EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGQILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD 23011.87
Truncated to displaylimit of 20.
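The cleanup applied in the build step amounts to two rules: map nullish strings (any casing or surrounding whitespace) to NULL, and lowercase surviving values in the categorical fields to collapse variants like “linear” vs. “Linear”. In plain Python terms (an illustrative `normalize` helper, not part of the pipeline):

```python
from typing import Optional

NULLISH = {"null", "none", "undefined", "na", "nan", "n/a", "unknown", ""}

def normalize(value: Optional[str], lowercase: bool = False) -> Optional[str]:
    # Rule 1: nullish strings (any casing/whitespace) become None.
    if value is None or value.strip().lower() in NULLISH:
        return None
    # Rule 2: optionally lowercase to collapse casing variants.
    return value.lower() if lowercase else value

print(normalize("NULL"), normalize("Linear", lowercase=True))
# None linear
```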

Build

target = conn.sql(
    """
SELECT
    polymerid AS id,
    CASE
        WHEN is_nullish_str(scientific_name) THEN NULL
        ELSE scientific_name
    END AS scientific_name,
    CASE
        WHEN is_nullish_str(display_name) THEN NULL
        ELSE display_name
    END AS display_name,
    CASE
        WHEN is_nullish_str(type) OR (type IS NULL) THEN NULL
        ELSE lower(type)
    END AS type,
    CASE
        WHEN is_nullish_str(topology) OR (topology IS NULL) THEN NULL
        ELSE lower(topology)
    END AS topology,
    sequence AS amino_acid_sequence,
    weight
FROM
    bindingdb.polymer
WHERE
    component_id IS NULL
"""
)
target.fetchdf()
id scientific_name display_name type topology amino_acid_sequence weight
0 1 Human herpesvirus 1 Thymidine kinase enzyme linear MASYPCHQHASAFDQAARSRGHNNRRTALRPRRQQEATEVRPEQKM... 40978.57
1 2 Streptomyces avidinii Streptavadin(N23A) protein linear DPSKDSKAQV SAAEAGITGT WYAQLGSTFI VTAGADGALT GT... 16447.70
2 3 Streptomyces avidinii Streptavadin(Y43A) protein linear DPSKDSKAQV SAAEAGITGT WYNQLGSTFI VTAGADGALT GT... 16398.63
3 4 Streptomyces avidinii Streptavadin(S27A) protein linear DPSKDSKAQV SAAEAGITGT WYNQLGATFI VTAGADGALT GT... 16474.72
4 6 Streptomyces avidinii Streptavidin (Y43F) protein linear DPSKDSKAQV SAAEAGITGT WYNQLGSTFI VTAGADGALT GT... 16474.72
... ... ... ... ... ... ... ...
5565 50008001 None Penicillin-binding protein 2B protein None MRKFNSHSIPIRLNLLFSIVILLFMTIIGRLLYMQVLNKDFYEKKL... 73859.38
5566 50008002 None DNA gyrase subunit A protein None MSDLAREITPVNIEEELKSSYLDYAMSVIVGRALPDVRDGLKPVHR... 97006.20
5567 50008003 None 1,3-beta-glucan synthase protein None MSYNDNNNNHYYDPNQQGGVPPQQGDGYYQQQYDDMSQQQQYDNMG... 218939.56
5568 50008005 None 1,3-beta-glucan synthase protein None MSYNNNGQQMSDQGYYDNNQGYYQPEDQQNGQAMYGDEGYYDPNIS... 213936.94
5569 50008006 None Platelet glycoprotein 4 protein None MGCDRNCGLIAGAVIGAVLAVFGGILMPVGDLLIQKTIKKQVVLEE... 53057.78

5570 rows × 7 columns

Write

conn.sql(
    f"COPY (SELECT * FROM target) TO '{processed_data_path / 'target.parquet'}' (FORMAT 'parquet')"
)

DuckDB impressions

I’ve been following the DuckDB hype but hadn’t had a chance to get firsthand experience until now. The entire BindingDB SQL database is only 6.5 GB, an easily manageable size even for Pandas. This isn’t what DuckDB was built for – it’s clear overkill. That said, it was a good opportunity to get familiar with the framework.

While the data is too small to give DuckDB an obvious leg up, the framework does make lazily loading tables from the source MySQL database easy. There’d be little reason to use another tool if it weren’t for DuckDB’s API design. As is, transformations that would be concise in Pandas, Polars, or even Spark’s DataFrame API are verbose in DuckDB – FunctionExpression, CaseExpression, list_aggregate, etc. I also prefer DataFrame APIs over SQL for their composability. DuckDB has implemented a Spark DataFrame API, but it’s experimental. The native “Expression” and “Relational” APIs support composition and piping but are idiosyncratic. Learning these could be justified for larger datasets, but it wasn’t worth the squeeze here.

Based on benchmarks (Database-like ops benchmark, TPC-H Benchmarks at Scale), Polars is comparable to DuckDB for datasets on the 50-100 GB scale. Considering code writing / reading efficiency as well as time efficiency, Polars is the framework of choice for problems of this scale. Larger datasets or high-frequency pipelines would benefit from DuckDB’s additional speed.