GSOC'25 - MDEV-36100 Automatic embedding generation and "expensive" functions #4096
GSOC'25 - MDEV-36100 Automatic embedding generation and "expensive" functions #4096apostolis1 wants to merge 55 commits into
Conversation
97878cc to
cd9296f
Compare
| /* | ||
| If there is no change in the dependencies and the vcol expression and type (see TODO below), we can copy directly | ||
| */ | ||
| if (!to->check_dependencies_in_write_set(*ptr) && def->field->vcol_info && (*ptr)->vcol_info->is_equal(def->field->vcol_info, true)) |
There was a problem hiding this comment.
This check does not detect changes in the column's data type, for example changes like
alter table t modify embedding vector(2000);
What would be a good way to detect those?
4154943 to
80cad15
Compare
|
For anyone interested, I have created this gist as well with some additional information on this work. |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new MariaDB plugin, GENERATE_EMBEDDING_OPENAI, which utilizes libcurl to fetch text embeddings from OpenAI's API, alongside core server optimizations to avoid recomputing expensive virtual stored columns during ALTER TABLE operations when dependencies remain unchanged. Feedback on the changes highlights critical safety issues in the plugin, such as potential null-pointer dereferences when initializing curl or handling unconfigured variables, and an uninitialized variable that could cause undefined behavior. Additionally, the global counters require atomic operations to prevent race conditions, and the global signature change of Item::eq should be reverted to avoid a major ABI break. Finally, optimizations are suggested for json_path setup efficiency, and a fix is proposed for the ALTER TABLE logic to ensure vector dimensions are verified before skipping recomputation.
| long http_response_code; | ||
| size_t max_chars; | ||
| int strLen; | ||
| std::string response, model_name, authorization= std::string("Authorization: Bearer ") + THDVAR(current_thd, api_key); |
There was a problem hiding this comment.
If api_key or host is NULL, constructing a std::string from it or passing it to curl_easy_setopt will cause undefined behavior or a server crash. We should check both variables and handle them gracefully before proceeding. Note that empty strings should be explicitly allowed as they may represent the default state or be considered valid.
const char *key_str= THDVAR(current_thd, api_key);
const char *host_str= THDVAR(current_thd, host);
if (!key_str)
{
my_printf_error(1, "GENERATE_EMBEDDING_OPENAI: api_key is not set", ME_WARNING);
null_value= true;
return 1;
}
if (!host_str)
{
my_printf_error(1, "GENERATE_EMBEDDING_OPENAI: host is not set", ME_WARNING);
null_value= true;
return 1;
}
std::string response, model_name, authorization= std::string("Authorization: Bearer ") + key_str;References
- Validation logic for configuration variables should explicitly allow empty strings if they are considered valid or represent the default state.
| slist1= curl_slist_append(slist1, authorization.c_str()); | ||
| slist1= curl_slist_append(slist1, "Content-Type: application/json; charset=utf-8"); | ||
| hnd= curl_easy_init(); | ||
| curl_easy_setopt(hnd, CURLOPT_BUFFERSIZE, 102400L); |
There was a problem hiding this comment.
If curl_easy_init() fails and returns NULL, the subsequent curl_easy_setopt calls will dereference a null pointer and crash the server. We must check if hnd is NULL before using it.
hnd= curl_easy_init();
if (!hnd)
{
my_printf_error(1, "GENERATE_EMBEDDING_OPENAI: curl_easy_init() failed", ME_ERROR_LOG | ME_WARNING);
null_value= true;
goto cleanup;
}| { | ||
| String *js= &api_response; | ||
| json_engine_t je; | ||
| json_path_t p; |
There was a problem hiding this comment.
| ulonglong curl_requests= 0; /* Number of total curl requests */ | ||
| ulonglong successful_curl_requests= 0; /* Number of successful curl requests */ |
There was a problem hiding this comment.
Since MariaDB is multi-threaded, these global counters are subject to race conditions and lost updates when multiple threads execute the function concurrently. We should use atomic operations to prevent race conditions.
#include <atomic>
std::atomic<ulonglong> curl_requests{0}; /* Number of total curl requests */
std::atomic<ulonglong> successful_curl_requests{0}; /* Number of successful curl requests */| /* | ||
| If there is no change in the dependencies and the vcol expression and type (see TODO below), we can copy directly | ||
| */ | ||
| if (!to->check_dependencies_in_write_set(*ptr) && def->field->vcol_info && (*ptr)->vcol_info->is_equal(def->field->vcol_info, true)) |
There was a problem hiding this comment.
As noted in the TODO comment, we must detect changes in the field type/length (such as changing a vector dimension from 1536 to 1537). Since Virtual_column_info::is_equal only compares the expression and type handler, it does NOT detect changes in the field length/dimension. To prevent copying a mismatched vector directly without recomputation, we must also verify that (*ptr)->field_length == def->field->field_length.
if (!to->check_dependencies_in_write_set(*ptr) &&
def->field->vcol_info &&
(*ptr)->field_length == def->field->field_length &&
(*ptr)->vcol_info->is_equal(def->field->vcol_info, true))| return 1; | ||
| } | ||
|
|
||
| model_name= (std::string) args[1]->val_str()->c_ptr(); |
There was a problem hiding this comment.
| MEM_ROOT *root= thd->mem_root; | ||
| json_path= (json_path_with_flags *) alloc_root(root, sizeof(json_path_with_flags)); | ||
| // We want to store max_dimensions floats, each 4 bytes |
There was a problem hiding this comment.
Setting up the json_path on every call to read_json is highly inefficient. Since JSON_EMBEDDING_PATH is a constant, we can set up json_path once in fix_length_and_dec on the statement mem_root and reuse it.
MEM_ROOT *root= thd->mem_root;
json_path= (json_path_with_flags *) alloc_root(root, sizeof(json_path_with_flags));
json_path->p.types_used= JSON_PATH_KEY_NULL;
const uchar *s_p= (const uchar *) JSON_EMBEDDING_PATH;
if (json_path_setup(&json_path->p, &my_charset_utf8mb4_general_ci, s_p,
s_p + strlen((const char *) s_p)))
return true;| json_path->p.types_used= JSON_PATH_KEY_NULL; | ||
| const uchar *s_p= (const uchar *) JSON_EMBEDDING_PATH; | ||
| if (json_path_setup(&json_path->p,&my_charset_utf8mb4_general_ci, s_p, | ||
| s_p + strlen((const char *) s_p))) | ||
| goto return_null; |
| Eq_config(bool binary_cmp, bool omit_table_names= false) | ||
| : binary_cmp(binary_cmp), omit_table_names(omit_table_names) {} | ||
| }; | ||
| virtual bool eq(const Item *, const Eq_config &config) const; |
There was a problem hiding this comment.
Changing the signature of Item::eq globally is a major ABI break that will affect all third-party storage engines and plugins overriding eq. Instead of changing the signature of Item::eq, we can preserve the original is_equivalent method in Virtual_column_info which clones the expression and renames the table/db names using rename_table_processor (which was deleted in this PR). This avoids any ABI break and keeps the change localized.
| request_line = request.splitlines()[0] | ||
| method, path, _ = request_line.split() | ||
| if path == "/success" and method in ("GET", "POST"): | ||
| parsed_input = json.loads(request.splitlines()[-1])["input"] |
There was a problem hiding this comment.
Using request.splitlines()[-1] to extract the JSON body is fragile and will fail if the request has trailing whitespace/newlines or if the body itself contains literal newlines. Instead, use request.partition("\r\n\r\n") to reliably separate the headers from the body.
| request_line = request.splitlines()[0] | |
| method, path, _ = request_line.split() | |
| if path == "/success" and method in ("GET", "POST"): | |
| parsed_input = json.loads(request.splitlines()[-1])["input"] | |
| headers, _, body = request.partition("\r\n\r\n") | |
| request_line = headers.splitlines()[0] | |
| method, path, _ = request_line.split() | |
| if path == "/success" and method in ("GET", "POST"): | |
| parsed_input = json.loads(body)["input"] |
svoj
left a comment
There was a problem hiding this comment.
This review spans only plugin/gen_embedding changes.
Coding style is inconsistent: spacing, lines over 80 characters, trailing spaces, variable names.
I'd move tests from plugin/gen_embedding/mysql-test/gen_embedding/t/ plugin/gen_embedding/mysql-test/gen_embedding/r/ to plugin/gen_embedding/mysql-test/gen_embedding.
I'd move the whole plugin either to plugin/ai/generate_embedding_openai or plugin/func_generate_embedding_openai.
|
|
||
| decimals= 0; | ||
| MEM_ROOT *root= thd->mem_root; | ||
| json_path= (json_path_with_flags *) alloc_root(root, sizeof(json_path_with_flags)); |
There was a problem hiding this comment.
Either use thd->alloc<json_path_with_flags>(1) or thd_alloc(thd, sizeof(json_path_with_flags)).
| The vector being parsed is a JSON array of numbers | ||
| The output vector is in binary format, will be stored in buf by this function | ||
| Returns 0 on success, 1 on error | ||
| */ |
There was a problem hiding this comment.
Function comment should be moved before function itself.
| /* | ||
| Make the request to OpenAI API to generate the embedding | ||
| Store the result in api_response | ||
| */ |
There was a problem hiding this comment.
Function comment should be moved before function itself.
| curl_easy_cleanup(hnd); | ||
| hnd= NULL; | ||
| curl_slist_free_all(slist1); | ||
| slist1= NULL; |
There was a problem hiding this comment.
Resetting hnd and slist1 is redundant. cleanup label duplicates preceding code, this can be avoided.
| return 0; | ||
|
|
||
| json_path->p.types_used= JSON_PATH_KEY_NULL; | ||
| const uchar *s_p= (const uchar *) JSON_EMBEDDING_PATH; |
| post_fields.length(post_fields.length()+strLen); // We need to manually update the length | ||
| post_fields.append(STRING_WITH_LEN("\", \"model\": \"")); | ||
| post_fields.append(model->c_ptr(), model->length()); | ||
| post_fields.append(STRING_WITH_LEN("\",\"encoding_format\": \"float\"}")); |
There was a problem hiding this comment.
I'd rather use base64 instead of float: 2-2.5x less data to transfer and much cheaper to parse.
| json_get_path_start(&je, js->charset(),(const uchar *) js->ptr(), | ||
| (const uchar *) js->ptr() + js->length(), &p); | ||
|
|
||
| while (json_get_path_next(&je, &p) == 0) |
There was a problem hiding this comment.
I'd avoid this loop unless we really want to handle multiple embeddings at once. We probably don't within the scope of this SQL function.
| MEM_ROOT *root= thd->mem_root; | ||
| json_path= (json_path_with_flags *) alloc_root(root, sizeof(json_path_with_flags)); | ||
| // We want to store max_dimensions floats, each 4 bytes | ||
| fix_length_and_charset(max_dimensions * 4, &my_charset_bin); |
There was a problem hiding this comment.
Why not max_dimensions * sizeof(float)?
|
|
||
| LEX_CSTRING func_name_cstring() const override | ||
| { | ||
| static LEX_CSTRING name= "GENERATE_EMBEDDING_OPENAI"_LEX_CSTRING; |
There was a problem hiding this comment.
Given that most modern LLMs expose REST APIs, I’d prefer a generic function such as AI_EMBEDDING(data[, model]) that abstracts away provider-specific differences. This would avoid maintaining multiple plugins with nearly identical implementations.
| static MYSQL_THDVAR_STR(api_key, | ||
| PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_MEMALLOC, | ||
| "OpenAI API key", | ||
| NULL, NULL, ""); |
There was a problem hiding this comment.
The fact that OpenAI key is exposed via session variable needs to be reevaluated.
- value may shortly become visible to
SHOW PROCESSLIST - value may be written to query log
- unsure if it matters, but DBA can't set per-server key without exposing it to users
- are there other ways for the value to become unintentionally visible?
…o consideration. The curl post works with an internal testing mock server, not with the actual openai server, as no authorization is used.
…or testing the plugin
json_escape can do the conversion internally, there is no need for it to be done separately
This ensures that calls to curl_easy_cleanup and curl_slist_free_all succeed
This means that the post_fields buffer can be reused for different rows, and only allocate memory when necessary. It also avoids manual memory management.
The print statements in python need to be changed to flush the output
This is the initial implementation to avoid recomputing expensive virtual stored columns on UPDATE statements. This commit includes: - Implementation of processors and methods for walking through expensive stored virtual columns and their dependencies. - Relevant test cases using the gen_embedding plugin. Those tests need to be run with the --force-restart flag for now, due to status variables not being reset.
…or sharing purposes
…e write_set. Has some pending issues with alter statements that change the expression and type of vcols.
…e bool and type uint'
…ate_embedding_openai_successful_http_requests' variable This allows them to be executed no matter the initial value of the Generate_embedding_openai_successful_http_requests status variable. It also means there is no need to reset the variable.
This DRAFT PR is part of the GSOC'25 project "MDEV-36100 Generate vector embeddings automatically on INSERT".
This is a DRAFT PR, meant to be used for mentor feedback and will be updated as the project progresses.
There are two main aspects on this PR:
Description
TODO: This will be updated
Release Notes
The plugin uses the following session variables:
generate_embedding_openai_host: The API endpoint the request will be sent togenerate_embedding_openai_api_key: The API key used for the request, provided via HTTP Bearer authenticationand provides the following status variables to help system admins keep track of requests:
generate_embedding_openai_successful_http_requestsgenerate_embedding_openai_total_http_requestsHow can this PR be tested?
This PR creates a new testing suite,
gen_embedding,which includes tests both for the plugin and the expensive functionality.Expensive tests have the prefix
expensive_, to distinguish them from the plugin tests.Basing the PR against the correct MariaDB version
mainbranch.Future Work
In terms of the plugin, the status variables are not being reset by
FLUSH STATUS, it remains to be investigated why that is and whether there is special support needed from the plugin.As far as the stored expensive virtual columns are concerned:
ALTERstatements are not detected.So, something like the following incorrectly does not recompute the virtual column:
Initially, we would want to do this for Row-Based Logging and then for Mixed Logging.
PR quality check