Skip to content

GSOC'25 - MDEV-36100 Automatic embedding generation and "expensive" functions #4096

Draft
apostolis1 wants to merge 55 commits into
MariaDB:mainfrom
apostolis1:MDEV-36100-GSOC
Draft

GSOC'25 - MDEV-36100 Automatic embedding generation and "expensive" functions #4096
apostolis1 wants to merge 55 commits into
MariaDB:mainfrom
apostolis1:MDEV-36100-GSOC

Conversation

@apostolis1

@apostolis1 apostolis1 commented Jun 9, 2025

Copy link
Copy Markdown
Contributor

This DRAFT PR is part of the GSOC'25 project "MDEV-36100 Generate vector embeddings automatically on INSERT".
This is a DRAFT PR, meant to be used for mentor feedback and will be updated as the project progresses.

There are two main aspects on this PR:

  1. Plugin to automatically generate vector emebeddings
  2. Avoidance of recalculation of stored expensive virtual columns
  • The Jira issue number for this PR is: MDEV-36100

Description

TODO: This will be updated

Release Notes

The plugin uses the following session variables:

  1. generate_embedding_openai_host: The API endpoint the request will be sent to
  2. generate_embedding_openai_api_key: The API key used for the request, provided via HTTP Bearer authentication

and provides the following status variables to help system admins keep track of requests:

  1. generate_embedding_openai_successful_http_requests
  2. generate_embedding_openai_total_http_requests

How can this PR be tested?

This PR creates a new testing suite, gen_embedding,which includes tests both for the plugin and the expensive functionality.
Expensive tests have the prefix expensive_, to distinguish them from the plugin tests.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

Future Work

In terms of the plugin, the status variables are not being reset by FLUSH STATUS, it remains to be investigated why that is and whether there is special support needed from the plugin.

As far as the stored expensive virtual columns are concerned:

  • Currently, changes in the virtual columns data type in ALTER statements are not detected.
    So, something like the following incorrectly does not recompute the virtual column:
create table t (
    doc text,
    embedding vector(1536) as (generate_embedding_openai(doc, "text-embedding-ada-002")) stored
) engine=MyISAM;
alter table t modify embedding vector(2000);
  • We would also like to avoid recomputing stored expensive virtual columns in replication setups whenever possible.
    Initially, we would want to do this for Row-Based Logging and then for Mixed Logging.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

Comment thread plugin/gen_embedding/plugin.cc Outdated
@svoj svoj added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Jun 9, 2025
Comment thread plugin/gen_embedding/plugin.cc Outdated
Comment thread plugin/gen_embedding/plugin.cc
Comment thread plugin/gen_embedding/plugin.cc Outdated
Comment thread plugin/gen_embedding/plugin.cc Outdated
@vuvova vuvova added the GSoC label Jul 16, 2025
Comment thread plugin/gen_embedding/CMakeLists.txt Outdated
Comment thread plugin/gen_embedding/CMakeLists.txt Outdated
Comment thread plugin/gen_embedding/mysql-test/gen_embedding/minimal_socket_server.py Outdated
Comment thread plugin/gen_embedding/mysql-test/gen_embedding/minimal_socket_server.py Outdated
Comment thread plugin/gen_embedding/mysql-test/gen_embedding/suite.pm Outdated
Comment thread plugin/gen_embedding/plugin.cc Outdated
Comment thread plugin/gen_embedding/plugin.cc Outdated
Comment thread plugin/gen_embedding/plugin.cc Outdated
Comment thread plugin/gen_embedding/plugin.cc Outdated
Comment thread plugin/gen_embedding/plugin.cc Outdated
Comment thread plugin/gen_embedding/plugin.cc
Comment thread plugin/gen_embedding/plugin.cc Outdated
Comment thread sql/item.cc Outdated
Comment thread sql/item.cc Outdated
@CLAassistant

CLAassistant commented Aug 26, 2025

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@apostolis1 apostolis1 force-pushed the MDEV-36100-GSOC branch 2 times, most recently from 97878cc to cd9296f Compare August 27, 2025 15:37
Comment thread sql/sql_table.cc
/*
If there is no change in the dependencies and the vcol expression and type (see TODO below), we can copy directly
*/
if (!to->check_dependencies_in_write_set(*ptr) && def->field->vcol_info && (*ptr)->vcol_info->is_equal(def->field->vcol_info, true))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check does not detect changes in the column's data type, for example changes like
alter table t modify embedding vector(2000);
What would be a good way to detect those?

@apostolis1

Copy link
Copy Markdown
Contributor Author

For anyone interested, I have created this gist as well with some additional information on this work.

@vuvova vuvova self-requested a review April 9, 2026 11:21
@vuvova

vuvova commented May 27, 2026

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new MariaDB plugin, GENERATE_EMBEDDING_OPENAI, which utilizes libcurl to fetch text embeddings from OpenAI's API, alongside core server optimizations to avoid recomputing expensive virtual stored columns during ALTER TABLE operations when dependencies remain unchanged. Feedback on the changes highlights critical safety issues in the plugin, such as potential null-pointer dereferences when initializing curl or handling unconfigured variables, and an uninitialized variable that could cause undefined behavior. Additionally, the global counters require atomic operations to prevent race conditions, and the global signature change of Item::eq should be reverted to avoid a major ABI break. Finally, optimizations are suggested for json_path setup efficiency, and a fix is proposed for the ALTER TABLE logic to ensure vector dimensions are verified before skipping recomputation.

long http_response_code;
size_t max_chars;
int strLen;
std::string response, model_name, authorization= std::string("Authorization: Bearer ") + THDVAR(current_thd, api_key);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

If api_key or host is NULL, constructing a std::string from it or passing it to curl_easy_setopt will cause undefined behavior or a server crash. We should check both variables and handle them gracefully before proceeding. Note that empty strings should be explicitly allowed as they may represent the default state or be considered valid.

    const char *key_str= THDVAR(current_thd, api_key);
    const char *host_str= THDVAR(current_thd, host);
    if (!key_str)
    { 
      my_printf_error(1, "GENERATE_EMBEDDING_OPENAI: api_key is not set", ME_WARNING);
      null_value= true;
      return 1;
    }
    if (!host_str)
    {
      my_printf_error(1, "GENERATE_EMBEDDING_OPENAI: host is not set", ME_WARNING);
      null_value= true;
      return 1;
    }
    std::string response, model_name, authorization= std::string("Authorization: Bearer ") + key_str;
References
  1. Validation logic for configuration variables should explicitly allow empty strings if they are considered valid or represent the default state.

slist1= curl_slist_append(slist1, authorization.c_str());
slist1= curl_slist_append(slist1, "Content-Type: application/json; charset=utf-8");
hnd= curl_easy_init();
curl_easy_setopt(hnd, CURLOPT_BUFFERSIZE, 102400L);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

If curl_easy_init() fails and returns NULL, the subsequent curl_easy_setopt calls will dereference a null pointer and crash the server. We must check if hnd is NULL before using it.

    hnd= curl_easy_init();
    if (!hnd)
    {
      my_printf_error(1, "GENERATE_EMBEDDING_OPENAI: curl_easy_init() failed", ME_ERROR_LOG | ME_WARNING);
      null_value= true;
      goto cleanup;
    }

Comment thread plugin/gen_embedding/plugin.cc Outdated
{
String *js= &api_response;
json_engine_t je;
json_path_t p;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The local variable p is completely uninitialized when passed to json_get_path_start, which expects an already initialized/parsed path. This will read garbage and cause undefined behavior or crashes. It should be initialized by copying json_path->p.

  json_path_t p= json_path->p;

Comment on lines +29 to +30
ulonglong curl_requests= 0; /* Number of total curl requests */
ulonglong successful_curl_requests= 0; /* Number of successful curl requests */

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since MariaDB is multi-threaded, these global counters are subject to race conditions and lost updates when multiple threads execute the function concurrently. We should use atomic operations to prevent race conditions.

#include <atomic>
std::atomic<ulonglong> curl_requests{0}; /* Number of total curl requests */
std::atomic<ulonglong> successful_curl_requests{0}; /* Number of successful curl requests */

Comment thread sql/sql_table.cc
/*
If there is no change in the dependencies and the vcol expression and type (see TODO below), we can copy directly
*/
if (!to->check_dependencies_in_write_set(*ptr) && def->field->vcol_info && (*ptr)->vcol_info->is_equal(def->field->vcol_info, true))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

As noted in the TODO comment, we must detect changes in the field type/length (such as changing a vector dimension from 1536 to 1537). Since Virtual_column_info::is_equal only compares the expression and type handler, it does NOT detect changes in the field length/dimension. To prevent copying a mismatched vector directly without recomputation, we must also verify that (*ptr)->field_length == def->field->field_length.

        if (!to->check_dependencies_in_write_set(*ptr) && 
            def->field->vcol_info && 
            (*ptr)->field_length == def->field->field_length &&
            (*ptr)->vcol_info->is_equal(def->field->vcol_info, true))

return 1;
}

model_name= (std::string) args[1]->val_str()->c_ptr();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The model pointer has already been retrieved and checked for NULL on line 101. Calling args[1]->val_str() again is redundant, inefficient, and potentially unsafe if the item is not constant. We should reuse the existing model pointer.

    model_name= std::string(model->ptr(), model->length());

Comment on lines +244 to +246
MEM_ROOT *root= thd->mem_root;
json_path= (json_path_with_flags *) alloc_root(root, sizeof(json_path_with_flags));
// We want to store max_dimensions floats, each 4 bytes

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Setting up the json_path on every call to read_json is highly inefficient. Since JSON_EMBEDDING_PATH is a constant, we can set up json_path once in fix_length_and_dec on the statement mem_root and reuse it.

    MEM_ROOT *root= thd->mem_root;
    json_path= (json_path_with_flags *) alloc_root(root, sizeof(json_path_with_flags));
    json_path->p.types_used= JSON_PATH_KEY_NULL;
    const uchar *s_p= (const uchar *) JSON_EMBEDDING_PATH;
    if (json_path_setup(&json_path->p, &my_charset_utf8mb4_general_ci, s_p,
                        s_p + strlen((const char *) s_p)))
      return true;

Comment on lines +308 to +312
json_path->p.types_used= JSON_PATH_KEY_NULL;
const uchar *s_p= (const uchar *) JSON_EMBEDDING_PATH;
if (json_path_setup(&json_path->p,&my_charset_utf8mb4_general_ci, s_p,
s_p + strlen((const char *) s_p)))
goto return_null;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since the json_path is now set up once in fix_length_and_dec, we can completely remove the redundant setup from read_json.

Comment thread sql/item.h
Eq_config(bool binary_cmp, bool omit_table_names= false)
: binary_cmp(binary_cmp), omit_table_names(omit_table_names) {}
};
virtual bool eq(const Item *, const Eq_config &config) const;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Changing the signature of Item::eq globally is a major ABI break that will affect all third-party storage engines and plugins overriding eq. Instead of changing the signature of Item::eq, we can preserve the original is_equivalent method in Virtual_column_info which clones the expression and renames the table/db names using rename_table_processor (which was deleted in this PR). This avoids any ABI break and keeps the change localized.

Comment on lines +27 to +30
request_line = request.splitlines()[0]
method, path, _ = request_line.split()
if path == "/success" and method in ("GET", "POST"):
parsed_input = json.loads(request.splitlines()[-1])["input"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using request.splitlines()[-1] to extract the JSON body is fragile and will fail if the request has trailing whitespace/newlines or if the body itself contains literal newlines. Instead, use request.partition("\r\n\r\n") to reliably separate the headers from the body.

Suggested change
request_line = request.splitlines()[0]
method, path, _ = request_line.split()
if path == "/success" and method in ("GET", "POST"):
parsed_input = json.loads(request.splitlines()[-1])["input"]
headers, _, body = request.partition("\r\n\r\n")
request_line = headers.splitlines()[0]
method, path, _ = request_line.split()
if path == "/success" and method in ("GET", "POST"):
parsed_input = json.loads(body)["input"]

@svoj svoj left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review spans only plugin/gen_embedding changes.

Coding style is inconsistent: spacing, lines over 80 characters, trailing spaces, variable names.

I'd move tests from plugin/gen_embedding/mysql-test/gen_embedding/t/ plugin/gen_embedding/mysql-test/gen_embedding/r/ to plugin/gen_embedding/mysql-test/gen_embedding.

I'd move the whole plugin either to plugin/ai/generate_embedding_openai or plugin/func_generate_embedding_openai.


decimals= 0;
MEM_ROOT *root= thd->mem_root;
json_path= (json_path_with_flags *) alloc_root(root, sizeof(json_path_with_flags));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either use thd->alloc<json_path_with_flags>(1) or thd_alloc(thd, sizeof(json_path_with_flags)).

The vector being parsed is a JSON array of numbers
The output vector is in binary format, will be stored in buf by this function
Returns 0 on success, 1 on error
*/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function comment should be moved before function itself.

/*
Make the request to OpenAI API to generate the embedding
Store the result in api_response
*/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function comment should be moved before function itself.

curl_easy_cleanup(hnd);
hnd= NULL;
curl_slist_free_all(slist1);
slist1= NULL;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resetting hnd and slist1 is redundant. cleanup label duplicates preceding code, this can be avoided.

return 0;

json_path->p.types_used= JSON_PATH_KEY_NULL;
const uchar *s_p= (const uchar *) JSON_EMBEDDING_PATH;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant variable.

post_fields.length(post_fields.length()+strLen); // We need to manually update the length
post_fields.append(STRING_WITH_LEN("\", \"model\": \""));
post_fields.append(model->c_ptr(), model->length());
post_fields.append(STRING_WITH_LEN("\",\"encoding_format\": \"float\"}"));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use base64 instead of float: 2-2.5x less data to transfer and much cheaper to parse.

json_get_path_start(&je, js->charset(),(const uchar *) js->ptr(),
(const uchar *) js->ptr() + js->length(), &p);

while (json_get_path_next(&je, &p) == 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid this loop unless we really want to handle multiple embeddings at once. We probably don't within the scope of this SQL function.

MEM_ROOT *root= thd->mem_root;
json_path= (json_path_with_flags *) alloc_root(root, sizeof(json_path_with_flags));
// We want to store max_dimensions floats, each 4 bytes
fix_length_and_charset(max_dimensions * 4, &my_charset_bin);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not max_dimensions * sizeof(float)?


LEX_CSTRING func_name_cstring() const override
{
static LEX_CSTRING name= "GENERATE_EMBEDDING_OPENAI"_LEX_CSTRING;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that most modern LLMs expose REST APIs, I’d prefer a generic function such as AI_EMBEDDING(data[, model]) that abstracts away provider-specific differences. This would avoid maintaining multiple plugins with nearly identical implementations.

static MYSQL_THDVAR_STR(api_key,
PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_MEMALLOC,
"OpenAI API key",
NULL, NULL, "");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that OpenAI key is exposed via session variable needs to be reevaluated.

  1. value may shortly become visible to SHOW PROCESSLIST
  2. value may be written to query log
  3. unsure if it matters, but DBA can't set per-server key without exposing it to users
  4. are there other ways for the value to become unintentionally visible?

apostolis1 and others added 29 commits June 21, 2026 23:49
json_escape can do the conversion internally, there is no need for it to be done separately
This ensures that calls to curl_easy_cleanup and curl_slist_free_all succeed
This means that the post_fields buffer can be reused for different rows, and only allocate memory when necessary. It also avoids manual memory management.
The print statements in python need to be changed to flush the output
This is the initial implementation to avoid recomputing expensive virtual stored columns on UPDATE statements. This commit includes:
- Implementation of processors and methods for walking through expensive stored virtual columns and their dependencies.
- Relevant test cases using the gen_embedding plugin. Those tests need to be run with the --force-restart flag for now, due to status variables not being reset.
…e write_set.

Has some pending issues with alter statements that change the expression and type of vcols.
…ate_embedding_openai_successful_http_requests' variable

This allows them to be executed no matter the initial value of the Generate_embedding_openai_successful_http_requests status variable. It also means there is no need to reset the variable.
@svoj svoj force-pushed the MDEV-36100-GSOC branch from 22a5718 to 276c164 Compare June 22, 2026 10:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. GSoC

Development

Successfully merging this pull request may close these issues.

5 participants