DTI 1a: Ducking BindingDB
Introduction to drug-target interaction with BindingDB and DuckDB.
Introduction
Drug-target interaction, or DTI, refers to the binding of a drug (a chemical) to a target (typically, a protein). Beneficial binding between a drug and a target will either up-regulate favorable behavior or down-regulate unfavorable behavior. Beyond the ability to favorably change target behavior, drugs can be further filtered by their target and effect specificity (low chance of binding to unintended targets and low chance of side effects, respectively). It follows that large collections of DTIs are valuable during drug development. DTIs can be established experimentally, but computational methods are attractive for their ability to reduce drug discovery cost and time.
Computational methods
Computational methods innovate both
- the data structures used to represent the drugs and targets (SMILES, graphs, ECFP, etc.) and
- the models used to predict or simulate binding affinity.
Data structures and the models that consume them are usually tightly coupled. That said, there’s opportunity to experiment with both somewhat independently of each other.
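As a concrete (non-BindingDB) illustration of two such data structures, ethanol can be encoded either as a SMILES string or as a small molecular graph; the two forms suit very different model families:

```python
# Two common drug representations, illustrated with ethanol (hydrogens implicit).

# 1. SMILES: a linear string encoding of the molecular graph,
#    consumable by sequence models.
ethanol_smiles = "CCO"

# 2. Graph: heavy atoms as nodes, bonds as edges,
#    consumable by message-passing / graph neural networks.
ethanol_graph = {
    "atoms": ["C", "C", "O"],   # node labels
    "bonds": [(0, 1), (1, 2)],  # single bonds between atom indices
}

# Both describe the same molecule; the choice of representation
# largely dictates the choice of model, and vice versa.
assert len(ethanol_graph["atoms"]) == 3
```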
BindingDB
Released in 1995, BindingDB is one of the older datasets capturing drug-target interactions. Since the original release, new drug-target interactions have continued to be curated from scientific literature. As of May 7th, 2024, BindingDB contains 2,882,970 binding measurements across 1,243,034 compounds and 9,311 targets [1].
Goal
This series will focus on:
- Processing BindingDB drugs and targets for machine learning
- Experimenting with representations and embeddings
- Experimenting with binding affinity models
Due to the data scale, processing, embedding, and prediction will require thoughtful tool and pattern selection. For processing, frameworks like DuckDB, Spark, and Polars should make the problem tractable. Scaling to GPU-enabled cloud machines will likely be needed for embedding and prediction.
Coarse source data model
Coarsely, I’ll use a data model that captures reactions (which involve a ligand and target), targets, and ligands. Ligands will be stored as molecules (system level), atoms, and bonds. This data model is flexible to downstream feature engineering and embedding for ML.
Note: BindingDB tends to use "ligand" and "monomer" interchangeably with "drug", and "enzyme" interchangeably with "target".
This release focuses on building the Reaction and Target tables with DuckDB. The next release will focus on the Ligand tables.
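One way to picture the coarse data model is as a handful of record types. This is a hypothetical sketch only (field names loosely follow the tables built in this series, not BindingDB's actual schema); the Ligand breakdown into molecules, atoms, and bonds arrives in the next release:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Reaction:
    """A binding measurement between one ligand and one target."""
    reaction_id: int
    ligand_id: int
    target_id: int
    affinity_ic50: Optional[float] = None  # one of several affinity measures

@dataclass
class Target:
    """A (polymer) target, represented by its amino acid sequence."""
    id: int
    display_name: str
    amino_acid_sequence: str

@dataclass
class Ligand:
    """A ligand: system-level molecule plus atom and bond breakdowns."""
    id: int
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)
```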
BindingDB SQL Setup
Create MySQL database from dump file
- Log into your MySQL server
- Create the bindingdb database:
mysql> CREATE DATABASE bindingdb;
- Load data from the dump file into the newly created bindingdb database:
mysql -u username -p bindingdb < BDB-mySQL_All_202404.dmp
Initialize SQL connection
import duckdb
conn = duckdb.connect()
%load_ext sql
%config SqlMagic.displaylimit = 20
%sql conn --alias duckdb
%%sql
INSTALL mysql;
LOAD mysql;
DETACH DATABASE IF EXISTS bindingdb;
ATTACH 'host=localhost user=root port=3306 database=bindingdb' AS bindingdb (TYPE mysql_scanner, READ_ONLY);
Success |
---|
Build Reaction Table
Use the BindingDB SQL DBs to build the reaction table.
Build reaction table backbone
%%sql
WITH ligand_target_ids AS (
    SELECT
        reactant_set_id,
        inhibitor_monomerid AS ligand_id,
        COALESCE(
            enzyme_monomerid,
            enzyme_polymerid,
            enzyme_complexid
        ) AS target_id,
        CASE
            WHEN enzyme_monomerid IS NOT NULL THEN 'monomer'
            WHEN enzyme_polymerid IS NOT NULL THEN 'polymer'
            WHEN enzyme_complexid IS NOT NULL THEN 'complex'
        END AS target_type
    FROM
        bindingdb.enzyme_reactant_set
)
SELECT
    target_type,
    count(*) AS count
FROM
    ligand_target_ids
GROUP BY target_type
target_type | count |
---|---|
complex | 149181 |
polymer | 2695549 |
monomer | 19 |
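The COALESCE/CASE logic in the query above can be mimicked in plain Python to make the precedence explicit (a sketch, not part of the pipeline):

```python
def classify_target(enzyme_monomerid, enzyme_polymerid, enzyme_complexid):
    """Mirror the SQL: the first non-NULL id wins, checked in
    monomer -> polymer -> complex order, matching COALESCE's argument order."""
    for target_id, target_type in (
        (enzyme_monomerid, "monomer"),
        (enzyme_polymerid, "polymer"),
        (enzyme_complexid, "complex"),
    ):
        if target_id is not None:
            return target_id, target_type
    return None, None

assert classify_target(None, 4164, None) == (4164, "polymer")
assert classify_target(7, 4164, None) == (7, "monomer")
```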
Complex and monomer targets will be ignored given:
- Polymers are the majority target type.
- Representing monomers, polymers, and complex molecules in a single format is non-trivial. (We prefer a single representation format to simplify target embedding in later steps.)
reaction_backbone = conn.sql(
    """
    SELECT
        cast(reactant_set_id AS INTEGER) AS reaction_id,
        cast(inhibitor_monomerid AS INTEGER) AS ligand_id,
        cast(enzyme_polymerid AS INTEGER) AS target_id,
    FROM
        bindingdb.enzyme_reactant_set
    WHERE
        enzyme_polymerid IS NOT NULL
    """
)
Build reaction results subtable
%%sql
SELECT
    reactant_set_id AS reaction_id,
    ic50 AS affinity_ic50,
    ki AS affinity_ki,
    kd AS affinity_kd,
    ec50 AS affinity_ec50,
    delta_g,
    koff AS k_off,
    kon AS k_on,
    ph,
    temp,
FROM bindingdb.ki_result
reaction_id | affinity_ic50 | affinity_ki | affinity_kd | affinity_ec50 | delta_g | k_off | k_on | ph | temp |
---|---|---|---|---|---|---|---|---|---|
50958048 | None | 10000 | None | None | None | None | None | None | None |
50958049 | None | 150 | None | None | None | None | None | None | None |
50958050 | None | 6700 | None | None | None | None | None | None | None |
50958051 | 0.100000 | None | None | None | None | None | None | None | None |
50958052 | 0.700000 | None | None | None | None | None | None | None | None |
50958054 | 0.100000 | None | None | None | None | None | None | None | None |
50958056 | 5.1 | None | None | None | None | None | None | None | None |
50958057 | 5.7 | None | None | None | None | None | None | None | None |
50958058 | 25 | None | None | None | None | None | None | None | None |
50958060 | 5.8 | None | None | None | None | None | None | None | None |
50958062 | 7.8 | None | None | None | None | None | None | None | None |
50958063 | 2.7 | None | None | None | None | None | None | None | None |
50958064 | 11 | None | None | None | None | None | None | None | None |
50958066 | 3.8 | None | None | None | None | None | None | None | None |
50958068 | 22 | None | None | None | None | None | None | None | None |
50958069 | 25 | None | None | None | None | None | None | None | None |
50958070 | 73 | None | None | None | None | None | None | None | None |
50958072 | 26 | None | None | None | None | None | None | None | None |
50958074 | 0.600000 | None | None | None | None | None | None | None | None |
50958075 | 0.400000 | None | None | None | None | None | None | None | None |
The affinity measures and binding constants (koff, kon) are string types. These need to be converted.
conn.sql("SELECT ic50, ki, kd, ec50, delta_g, koff, kon, ph, temp FROM bindingdb.ki_result").dtypes
[VARCHAR,
VARCHAR,
VARCHAR,
VARCHAR,
DECIMAL(10,4),
VARCHAR,
VARCHAR,
DECIMAL(10,4),
DECIMAL(10,4)]
from duckdb.typing import VARCHAR, FLOAT
from duckdb import InvalidInputException
from typing import Optional
def string_to_float(string: str) -> Optional[float]:
    if isinstance(string, str):
        try:
            return float(string.replace(">", "").replace("<", ""))
        except ValueError:
            pass
try:
    conn.remove_function("string_to_float")
except InvalidInputException:
    pass

conn.create_function("string_to_float", string_to_float, [VARCHAR], FLOAT)
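A quick sanity check of the conversion logic outside DuckDB (the function is repeated here so the snippet is self-contained):

```python
from typing import Optional

def string_to_float(string: str) -> Optional[float]:
    # Censored measurements like ">10000" or "<0.5" are coerced by stripping
    # the comparison operator; unparseable values fall through to None (NULL).
    if isinstance(string, str):
        try:
            return float(string.replace(">", "").replace("<", ""))
        except ValueError:
            pass

assert string_to_float("10000") == 10000.0
assert string_to_float(">10000") == 10000.0  # censored value, operator dropped
assert string_to_float("n.d.") is None       # unparseable -> None (NULL in DuckDB)
```

Note that dropping the ">"/"<" operators discards the censoring information (a ">10000" measurement only bounds the true value), which may matter for downstream modeling.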
reaction_result = conn.sql(
    """
    SELECT
        cast(reactant_set_id AS INTEGER) AS reaction_id,
        string_to_float(ic50) AS affinity_ic50,
        string_to_float(ki) AS affinity_ki,
        string_to_float(kd) AS affinity_kd,
        string_to_float(ec50) AS affinity_ec50,
        delta_g,
        string_to_float(koff) AS k_off,
        string_to_float(kon) AS k_on,
        ph,
        temp,
    FROM
        bindingdb.ki_result
    """
)
reaction_result.fetchdf()
reaction_id | affinity_ic50 | affinity_ki | affinity_kd | affinity_ec50 | delta_g | k_off | k_on | ph | temp | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 50958048 | NaN | 10000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 50958049 | NaN | 150.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 50958050 | NaN | 6700.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 50958051 | 0.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 50958052 | 0.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2844743 | 51451441 | 6700.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844744 | 51451442 | 2700.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844745 | 51451443 | 490.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844746 | 51451444 | 17300.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844747 | 51451445 | 510.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2844748 rows × 10 columns
Build affinity_measures column
The PySpark DataFrame API would allow this logic to be defined in a more concise way. The DuckDB Expression API is powerful but far more verbose.
from duckdb import (
CaseExpression,
ColumnExpression,
ConstantExpression,
FunctionExpression,
StarExpression,
)
from itertools import combinations
from functools import reduce
affinity_cols = ["affinity_ic50", "affinity_ki", "affinity_kd", "affinity_ec50"]

affinity_measure_candidates = sorted(
    [
        comb
        for comb_r in [combinations(affinity_cols, r=i) for i in range(1, len(affinity_cols) + 1)]
        for comb in comb_r
    ],
    key=lambda x: len(x),
    reverse=True,
)

case_args = lambda affinity_measures: (
    FunctionExpression(
        "list_aggregate",
        FunctionExpression(
            "list_value",
            *[
                ColumnExpression(affinity_measure).isnotnull()
                for affinity_measure in affinity_measures
            ],
        ),
        ConstantExpression("min"),
    )
    > 0,
    FunctionExpression(
        "list_value",
        *[ConstantExpression(affinity_measure) for affinity_measure in affinity_measures],
    ),
)

affinities_list_map = reduce(
    lambda acc, affinity_measures: acc.when(*case_args(affinity_measures)),
    affinity_measure_candidates[1:],
    CaseExpression(*case_args(affinity_measure_candidates[0])),
).alias("affinity_measures")
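Functionally, the expression mirrors this plain-Python version: candidate column combinations are tried from largest to smallest, and a row is labeled with the first combination whose columns are all non-null, which (given the ordering) is exactly the set of populated measures:

```python
from itertools import combinations

affinity_cols = ["affinity_ic50", "affinity_ki", "affinity_kd", "affinity_ec50"]

# All non-empty combinations, longest first (mirrors affinity_measure_candidates).
candidates = sorted(
    [c for i in range(1, len(affinity_cols) + 1) for c in combinations(affinity_cols, r=i)],
    key=len,
    reverse=True,
)

def affinity_measures(row: dict):
    # First (i.e. largest) combination whose columns are all populated.
    for combo in candidates:
        if all(row.get(col) is not None for col in combo):
            return list(combo)
    return None  # no affinity measure recorded -> NULL in the table

assert affinity_measures({"affinity_ic50": 5.1}) == ["affinity_ic50"]
assert affinity_measures({"affinity_ic50": 2.9, "affinity_ki": 12.0}) == ["affinity_ic50", "affinity_ki"]
assert affinity_measures({}) is None
```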
Join reaction backbone and result
reaction = reaction_backbone.join(reaction_result, "reaction_id", "left").select(
    "reaction_id",
    "ligand_id",
    "target_id",
    "affinity_ic50",
    "affinity_ki",
    "affinity_kd",
    "affinity_ec50",
    affinities_list_map,
    "delta_g",
    "k_off",
    "k_on",
    "ph",
    "temp",
)
reaction.dtypes
[INTEGER,
INTEGER,
INTEGER,
FLOAT,
FLOAT,
FLOAT,
FLOAT,
VARCHAR[],
DECIMAL(10,4),
FLOAT,
FLOAT,
DECIMAL(10,4),
DECIMAL(10,4)]
reaction.fetchdf()
reaction_id | ligand_id | target_id | affinity_ic50 | affinity_ki | affinity_kd | affinity_ec50 | affinity_measures | delta_g | k_off | k_on | ph | temp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9826 | 6121 | 13 | 5.1 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
1 | 9827 | 6122 | 13 | 2.8 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
2 | 9828 | 6123 | 13 | 12.0 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
3 | 9829 | 6124 | 13 | 25.0 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
4 | 9830 | 6125 | 13 | 1.5 | NaN | NaN | NaN | [affinity_ic50] | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2695544 | 50748287 | 50239048 | 4164 | NaN | NaN | NaN | 0.8 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695545 | 50748290 | 50239040 | 4164 | NaN | NaN | NaN | 1.0 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695546 | 50748294 | 50239049 | 4164 | NaN | NaN | NaN | 2.0 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695547 | 50748235 | 50239045 | 4164 | NaN | NaN | NaN | 86.0 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695548 | 50748301 | 50239042 | 50002942 | NaN | NaN | NaN | 728.0 | [affinity_ec50] | NaN | NaN | NaN | NaN | NaN |
2695549 rows × 13 columns
reaction.aggregate("affinity_measures, count(*) AS count").order("len(affinity_measures)")
┌─────────────────────────────────────────────┬─────────┐
│ affinity_measures │ count │
│ varchar[] │ int64 │
├─────────────────────────────────────────────┼─────────┤
│ [affinity_ic50] │ 1778076 │
│ [affinity_ki] │ 553740 │
│ [affinity_ec50] │ 250765 │
│ [affinity_kd] │ 104744 │
│ [affinity_ic50, affinity_ec50] │ 1409 │
│ [affinity_ic50, affinity_ki] │ 2889 │
│ [affinity_kd, affinity_ec50] │ 11 │
│ [affinity_ki, affinity_kd] │ 61 │
│ [affinity_ic50, affinity_kd] │ 195 │
│ [affinity_ki, affinity_ec50] │ 817 │
│ [affinity_ic50, affinity_ki, affinity_kd] │ 9 │
│ [affinity_ic50, affinity_ki, affinity_ec50] │ 99 │
│ NULL │ 2734 │
├─────────────────────────────────────────────┴─────────┤
│ 13 rows 2 columns │
└───────────────────────────────────────────────────────┘
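A quick back-of-the-envelope check on the distribution (counts copied from the table above): reactions with exactly one affinity measure dominate.

```python
# Counts copied from the affinity_measures distribution above.
single = 1778076 + 553740 + 250765 + 104744          # exactly one measure
multi = 1409 + 2889 + 11 + 61 + 195 + 817 + 9 + 99   # two or more measures
null = 2734                                          # no measure recorded
total = single + multi + null

assert total == 2695549  # matches the reaction table's row count
print(f"{single / total:.1%} of reactions report exactly one affinity measure")
# 99.7% of reactions report exactly one affinity measure
```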
Write
conn.sql(
    f"COPY (SELECT * FROM reaction) TO '{processed_data_path / 'reaction.parquet'}' (FORMAT 'parquet')"
)
Build Target Table
Use the BindingDB SQL DBs to build the (polymer) target table.
Spot check
initial_target = conn.sql(
    """
    SELECT
        polymerid AS id,
        scientific_name,
        display_name,
        "type",
        topology,
        sequence AS amino_acid_sequence,
        weight
    FROM
        bindingdb.polymer
    WHERE
        component_id IS NULL
    """
)
Check for parsimony in low-cardinality categorical fields.
initial_target.select("topology").distinct()
┌───────────────────────────────┐
│ topology │
│ varchar │
├───────────────────────────────┤
│ NULL │
│ linear │
│ HTH motiff │
│ Bi-molecular anti-parallel G4 │
│ null │
│ Linear │
│ double helix │
│ Uni-molecular parallel G4 │
│ undefined │
│ NULL │
│ Four-stranded parallel G4 │
├───────────────────────────────┤
│ 11 rows │
└───────────────────────────────┘
Two issues:
- “Nullish” values are encoded using multiple formats – “undefined”, “NULL” (string), NULL (value), and “null”
- Inconsistent casing – “linear” vs. “Linear”
Detecting and filling “nullish” strings
from duckdb.typing import VARCHAR, BOOLEAN
from duckdb import InvalidInputException
def is_nullish_str(string: str) -> bool:
    return isinstance(string, str) and string.strip().lower() in (
        "null",
        "none",
        "undefined",
        "na",
        "nan",
        "n/a",
        "unknown",
        "",
    )
try:
    conn.remove_function("is_nullish_str")
except InvalidInputException:
    pass

conn.create_function("is_nullish_str", is_nullish_str, [VARCHAR], BOOLEAN)
<duckdb.duckdb.DuckDBPyConnection at 0x1071d2b30>
assert all(
    conn.sql(
        "SELECT is_nullish_str('null'), is_nullish_str('none'), is_nullish_str('undefined'), is_nullish_str('na'), is_nullish_str('nan'), is_nullish_str('n/a'), is_nullish_str('unknown'), is_nullish_str('')"
    ).fetchdf().iloc[0].values
)
Other string fields have nullish values that need to be corrected as well.
%sql SELECT * FROM initial_target WHERE is_nullish_str(scientific_name) OR is_nullish_str(display_name) OR is_nullish_str(type) OR is_nullish_str(topology)
id | scientific_name | display_name | type | topology | amino_acid_sequence | weight |
---|---|---|---|---|---|---|
1368 | None | Unknown | null | null | None | 358.43 |
416 | None | Metallo-beta-lactamase type 2 | undefined | undefined | MSKLSVFFIFLFCSIATAAESLPDLKIEKLDEGVYVHTSFEEVNGWGVVPKHGLVVLVNA EAYLIDTPFTAKDTEKLVTWFVERGYKIKGSISSHFHSDSTGGIEWLNSRSIPTYASELT NELLKKDGKVQATNSFSGVNYWLVKNKIEVFYPGPGHTPDNVVVWLPERKILFGGCFIKP YGLGNLGDANIEAWPKSAKLLKSKYGKAKLVVPSHSEVGDASLLKLTLEQAVKGLNESKK PSKPSN | 27125.88 |
424 | None | VIM-1 metallo-beta-lactamase | undefined | undefined | MLKVISSLLVYMTASVMAVASPLAHSGEPSGEYPTVNEIPVGEVRLYQIADGVWSHIATQ SFDGAVYPSNGLIVRDGDELLLIDTAWGAKNTAALLAEIEKQIGLPVTRAVSTHFHDDRV GGVDVLRAAGVATYASPSTRRLAEAEGNEIPTHSLEGLSSSGDAVRFGPVELFYPGAAHS TDNLVVYVPSANVLYGGCAVHELSSTSAGNVADADLAEWPTSVERIQKHYPEAEVVIPGH GLPGGLDLLQHTANVVKAHKNRSVAE | 28014.53 |
687 | None | 5-hydroxytryptamine receptor 2A | undefined | undefined | MDILCEENTSLSSTTNSLMQLNDDTRLYSNDFNSGEANTSDAFNWTVDSENRTNLSCEGCLSPSCLSLLHLQEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIADMLLGFLVMPVSMLTILYGYRWPLPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNPIHHSRFNSRTKAFLKIIAVWTISVGISMPIPVFGLQDDSKVFKEGSCLLADDNFVLIGSFVSFFIPLTIMVITYFLTIKSLQKEATLCVSDLGTRAKLASFSFLPQSSLSSEKLFQRSIHREPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICKESCNEDVIGALLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENKKPLQLILVNTIPALAYKSSQLQMGQKKNSKQDAKTTDNDCSMVALGKQHSEEASKDNSDGVNEKVSCV | 52607.65 |
1094 | None | Caspase-3 Substrate | null | Linear | Ac-Asp-Glu-Val-Asp-pNA | 1438.65 |
1097 | None | Caspase Substrate | null | null | None | 358.43 |
1109 | None | Caspase-1 Fluorogenic Substrate | null | null | Ac-Tyr-Val-Ala-Asp-AMC | 1713.07 |
1111 | None | Caspase-2 Fluorogenic Substrate | null | null | Ac-Val-Asp-Val-Ala-Asp-AMC | 1830.24 |
1113 | None | Caspase-4 Fluorogenic Substrate | null | null | Ac-Leu-Glu-Glu-Asp-AMC | 1334.60 |
1119 | None | Caspase-9 Fluorogenic Substrate | null | null | Ac-Leu-Glu-His-Asp-AFC | 1518.25 |
1124 | None | Caspase-6 Fluorogenic Substrate | null | Linear | Ac-Val-Glu-Ile-Asp-AMC | 1560.92 |
1397 | None | Not Specified | null | null | None | 358.43 |
1830 | None | Not Specified | null | null | None | 358.43 |
2388 | None | Fluorogenic peptide | null | null | Dabcyl-TSAVLQSGFR-Edans | 2145.26 |
2726 | Proteus vulgaris | Urease subunit gamma | null | null | MELTPREKDKLLLFTAGLVAERRLAKGLKLNYPEAVALISCAIMEGAREGKTVAQLMSEG RSVLAAEQVMEGVPEMIKDIQVECTFPDGTKLVSIHDPIV | 10908.21 |
8 | Escherichia coli | SecB | Protein | NULL | MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQKDWQPEVKLDLDTASSQLADDVYEVVL RVTVTASLGEETAFLCEVQQGGIFSIAGIEGTQMAHCLGAYCPNILFPYARECITSMVSR GTFPQLNLAPVNFDALFMNYLQQQAGEGTEEHQDA | 17264.58 |
14 | Mus musculus | HyHEL-5 Fab | Antibody | NULL | EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGEILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD | 23011.84 |
15 | Gallus gallus | Lysozyme C | Enzyme | NULL | MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQA TNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDG NGMNAWVAWRNRCKGTDVQAWIRGCRL | 16248.07 |
16 | Mus musculus | HyHEL-5 Mutant(E50D) | Antibody | NULL | EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGDILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD | 22997.82 |
17 | Mus musculus | HyHEL-5 Mutant(E50Q) | Antibody | NULL | EVQLQQSGAELMKPGASVKISCKASGYTFSDYWIEWVKQRPGHGLEWIGQILPGSGSTNY HERFKGKATFTADTSSSTAYMQLNSLTSEDSGVYYCLHGNYDFDGWGQGTTLTVSSAKTT PPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTL SSSVTVPSSTWPSETVTCNVAHPASSTKVDKKILD | 23011.87 |
Build
target = conn.sql(
    """
    SELECT
        polymerid AS id,
        CASE
            WHEN is_nullish_str(scientific_name) THEN NULL
            ELSE scientific_name
        END AS scientific_name,
        CASE
            WHEN is_nullish_str(display_name) THEN NULL
            ELSE display_name
        END AS display_name,
        CASE
            WHEN is_nullish_str(type) OR (type IS NULL) THEN NULL
            ELSE lower(type)
        END AS type,
        CASE
            WHEN is_nullish_str(topology) OR (topology IS NULL) THEN NULL
            ELSE lower(topology)
        END AS topology,
        sequence AS amino_acid_sequence,
        weight
    FROM
        bindingdb.polymer
    WHERE
        component_id IS NULL
    """
)
target.fetchdf()
id | scientific_name | display_name | type | topology | amino_acid_sequence | weight | |
---|---|---|---|---|---|---|---|
0 | 1 | Human herpesvirus 1 | Thymidine kinase | enzyme | linear | MASYPCHQHASAFDQAARSRGHNNRRTALRPRRQQEATEVRPEQKM... | 40978.57 |
1 | 2 | Streptomyces avidinii | Streptavadin(N23A) | protein | linear | DPSKDSKAQV SAAEAGITGT WYAQLGSTFI VTAGADGALT GT... | 16447.70 |
2 | 3 | Streptomyces avidinii | Streptavadin(Y43A) | protein | linear | DPSKDSKAQV SAAEAGITGT WYNQLGSTFI VTAGADGALT GT... | 16398.63 |
3 | 4 | Streptomyces avidinii | Streptavadin(S27A) | protein | linear | DPSKDSKAQV SAAEAGITGT WYNQLGATFI VTAGADGALT GT... | 16474.72 |
4 | 6 | Streptomyces avidinii | Streptavidin (Y43F) | protein | linear | DPSKDSKAQV SAAEAGITGT WYNQLGSTFI VTAGADGALT GT... | 16474.72 |
... | ... | ... | ... | ... | ... | ... | ... |
5565 | 50008001 | None | Penicillin-binding protein 2B | protein | None | MRKFNSHSIPIRLNLLFSIVILLFMTIIGRLLYMQVLNKDFYEKKL... | 73859.38 |
5566 | 50008002 | None | DNA gyrase subunit A | protein | None | MSDLAREITPVNIEEELKSSYLDYAMSVIVGRALPDVRDGLKPVHR... | 97006.20 |
5567 | 50008003 | None | 1,3-beta-glucan synthase | protein | None | MSYNDNNNNHYYDPNQQGGVPPQQGDGYYQQQYDDMSQQQQYDNMG... | 218939.56 |
5568 | 50008005 | None | 1,3-beta-glucan synthase | protein | None | MSYNNNGQQMSDQGYYDNNQGYYQPEDQQNGQAMYGDEGYYDPNIS... | 213936.94 |
5569 | 50008006 | None | Platelet glycoprotein 4 | protein | None | MGCDRNCGLIAGAVIGAVLAVFGGILMPVGDLLIQKTIKKQVVLEE... | 53057.78 |
5570 rows × 7 columns
Write
conn.sql(
    f"COPY (SELECT * FROM target) TO '{processed_data_path / 'target.parquet'}' (FORMAT 'parquet')"
)
DuckDB impressions
I’ve been following the DuckDB hype but haven’t had a chance to get firsthand experience until now. The entire BindingDB SQL database is only 6.5 GB, an easily manageable size even for Pandas. This isn’t what DuckDB was built for – it’s clear overkill. That said, it was a good opportunity to get familiar with the framework. While the data is too small to give DuckDB an obvious leg up, the framework does make lazily loading tables from the source MySQL database easy.

There’d be little reason to use another tool if it weren’t for DuckDB’s API design. As is, transformations that would be concise in Pandas, Polars, or even Spark’s DataFrame API are verbose in DuckDB – FunctionExpression, CaseExpression, list_aggregate, etc. Also, I prefer DataFrame APIs over SQL for their composability. DuckDB has implemented a Spark DataFrame API, but it’s experimental. The native “Expression” and “Relational” APIs support composition and piping but are idiosyncratic. Learning these could be justified for larger datasets, but it wasn’t worth the squeeze here.

Based on benchmarks (Database-like ops benchmark, TPC-H Benchmarks at Scale), Polars is comparable to DuckDB for datasets on the 50–100 GB scale. Considering code writing/reading efficiency as well as runtime efficiency, Polars is the framework of choice for problems of this scale. Larger datasets or high-frequency pipelines would benefit from DuckDB’s additional speed.