[feat](fe) Support named bloom filter indexes in DDL and schema change #64652

Open
hoshinojyunn wants to merge 1 commit into apache:master from hoshinojyunn:master

@hoshinojyunn hoshinojyunn commented Jun 20, 2026

Contributor

Assumption / semantic boundary:
A named bloom filter index and a property-managed bloom filter index cannot be defined on the same column. The two metadata forms are managed separately, and users must use DROP INDEX for named bloom filter indexes and ALTER TABLE SET("bloom_filter_columns" = ...) for property-managed bloom filter indexes.

Problem Summary:
Previously Doris only supported bloom filter indexes through table properties such as "bloom_filter_columns" and "bloom_filter_fpp". That made bloom filter indexes behave differently from other index types and prevented users from managing them with named INDEX / CREATE INDEX / DROP INDEX syntax.

This change adds named USING BLOOMFILTER syntax in the Nereids parser and FE DDL pipeline, keeps the existing table-property bloom filter behavior for legacy metadata, and enforces that legacy bloom filter columns and named bloom filter indexes cannot be defined on the same column. The schema change path now distinguishes legacy bloom filter management from named bloom filter management, while FE->BE materialization continues to apply the table-level bloom filter fpp to both forms.

The patch also fixes bloom filter materialization on shadow columns and ensures tablet metadata marks named bloom filter columns correctly on the BE side. FE unit tests and bloom filter regression tests are added to cover parser analysis, semantic validation, schema change checks, FE->BE task generation, and named bloom filter DDL behavior.

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Assumption / semantic boundary:

A named bloom filter index and a property-managed bloom filter index cannot be
defined on the same column. They are tracked as two separate metadata forms.
DROP INDEX only applies to named bloom filter indexes, while
ALTER TABLE SET("bloom_filter_columns" = ...) only applies to
property-managed bloom filter indexes.

Bloom filter indexes in Doris were historically managed only by table properties:

  • "bloom_filter_columns"
  • "bloom_filter_fpp"

That created three problems:

  1. Bloom filter indexes behaved differently from other index types and could not use
    named INDEX / CREATE INDEX / DROP INDEX syntax.
  2. FE schema change logic only tracked the legacy property-based bloom filter
    definition, which made the metadata model awkward once named bloom filter indexes
    were introduced.
  3. FE->BE materialization needed to recognize both legacy bloom filter columns and
    named bloom filter indexes, including schema change shadow columns.

This PR introduces named bloom filter indexes with USING BLOOMFILTER, supports both
inline table-definition syntax and standalone CREATE INDEX, and keeps compatibility
with legacy table-property bloom filters. To avoid semantic ambiguity, a column cannot
be managed by both legacy bloom filter properties and a named bloom filter index at
the same time.
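
The mutual-exclusion rule above can be sketched as a small check. This is an illustration only, assuming both metadata sources are available as collections of column names and that column names compare case-insensitively; the class `BloomFilterConflictSketch` and its method names are hypothetical and are not taken from this patch.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not the Doris implementation.
final class BloomFilterConflictSketch {

    // Returns the lower-cased names of columns that appear in BOTH sources.
    static Set<String> findConflicts(Collection<String> legacyBfColumns,
                                     Collection<String> namedBfColumns) {
        Set<String> legacy = new TreeSet<>();
        if (legacyBfColumns != null) {
            for (String c : legacyBfColumns) {
                legacy.add(c.toLowerCase(Locale.ROOT));
            }
        }
        Set<String> conflicts = new TreeSet<>();
        if (namedBfColumns != null) {
            for (String c : namedBfColumns) {
                String key = c.toLowerCase(Locale.ROOT);
                if (legacy.contains(key)) {
                    conflicts.add(key);
                }
            }
        }
        return conflicts;
    }

    // Rejects a definition in which any column is managed by both forms.
    static void validate(Collection<String> legacyBfColumns,
                         Collection<String> namedBfColumns) {
        Set<String> conflicts = findConflicts(legacyBfColumns, namedBfColumns);
        if (!conflicts.isEmpty()) {
            throw new IllegalArgumentException(
                    "columns managed by both legacy and named bloom filters: " + conflicts);
        }
    }
}
```

A caller would run `validate` on both the create-table and the alter-table paths, since the description states that the conflict is rejected in both.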

Main changes

  1. Parser and analysis

    • Add BLOOMFILTER keyword in the Nereids lexer/parser.
    • Parse USING BLOOMFILTER in table index definitions and CREATE INDEX.
    • Extend IndexDefinition semantic checks for bloom filter indexes.
    • Reject index-level properties on bloom filter indexes and keep bloom filter fpp
      as a table-level property.
  2. FE DDL and schema change semantics

    • Add helpers to extract named bloom filter columns from index metadata.
    • Keep legacy property-managed bloom filter columns and named bloom filter indexes
      as separate metadata sources.
    • Reject defining both forms on the same column during create/alter.
    • Preserve legacy ALTER TABLE SET("bloom_filter_columns" = ...) semantics for
      property-managed bloom filter creation and deletion.
    • Restrict DROP INDEX to named bloom filter indexes and return a clearer error if
      the target refers to a legacy property-managed bloom filter column.
  3. FE->BE materialization

    • Materialize bloom filter flags when either legacy or named bloom filter metadata
      applies to a column.
    • Reuse the table-level bloom filter fpp for both forms.
    • Normalize shadow column names before bloom filter matching so schema-changed
      columns inherit the expected bloom filter metadata.
    • Mark named bloom filter columns correctly in tablet metadata on the BE side.
  4. Tests

    • Add FE parser/analyzer/DDL/materialization unit tests.
    • Add regression coverage for named bloom filter DDL and schema change behavior.
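
The FE->BE materialization rule listed under item 3 (flag a column when either metadata source applies, after normalizing shadow column names) might look roughly like the sketch below. The prefix constant, class name, and method names are assumptions made for illustration; in particular, the `__doris_shadow_` prefix is assumed here rather than quoted from the patch.

```java
import java.util.Collection;
import java.util.Locale;

// Hypothetical sketch; names and the prefix value are assumptions.
final class BloomFilterFlagSketch {

    // Assumed prefix carried by schema-change shadow columns.
    static final String SHADOW_PREFIX = "__doris_shadow_";

    // Strips the assumed shadow prefix and lower-cases the name.
    static String normalize(String columnName) {
        String name = columnName.startsWith(SHADOW_PREFIX)
                ? columnName.substring(SHADOW_PREFIX.length())
                : columnName;
        return name.toLowerCase(Locale.ROOT);
    }

    static boolean containsIgnoreCase(Collection<String> names, String target) {
        if (names == null) {
            return false;
        }
        for (String n : names) {
            if (n.equalsIgnoreCase(target)) {
                return true;
            }
        }
        return false;
    }

    // A column gets the bloom filter flag when either source names it.
    static boolean isBloomFilterColumn(String columnName,
                                       Collection<String> legacyBfColumns,
                                       Collection<String> namedBfColumns) {
        String normalized = normalize(columnName);
        return containsIgnoreCase(legacyBfColumns, normalized)
                || containsIgnoreCase(namedBfColumns, normalized);
    }
}
```

Normalizing before matching is what lets a shadow copy of `v1` inherit the flag that was declared on `v1` itself.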

Supported examples

  1. Create a named bloom filter index in CREATE TABLE

    CREATE TABLE docs (
        id BIGINT,
        content TEXT,
        INDEX idx_content (content) USING BLOOMFILTER
    ) ENGINE=OLAP
    DUPLICATE KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 1
    PROPERTIES (
        "replication_num" = "1"
    );
  2. Create a named bloom filter index on an existing table (use either statement)

    CREATE INDEX idx_content ON docs(content) USING BLOOMFILTER;
    ALTER TABLE docs ADD INDEX idx_content(content) USING BLOOMFILTER;
  3. Drop a named bloom filter index (use either statement)

    DROP INDEX idx_content ON docs;
    ALTER TABLE docs DROP INDEX idx_content;
  4. Semantic boundary

    • A column can be managed by either a named bloom filter index or a
      property-managed bloom filter index, but not both.
    • Use DROP INDEX to remove named bloom filter indexes.
    • Use ALTER TABLE SET("bloom_filter_columns" = ...) to add or remove
      property-managed bloom filter indexes.

Release note

Users can now create and manage bloom filter indexes with named index syntax while
continuing to use legacy table-property bloom filter definitions. Doris rejects
conflicting legacy and named bloom filter definitions on the same column.

Test Execution

  1. Build

    • Command: ./build.sh --be --fe
    • Result: success, Successfully build Doris
  2. FE unit tests

    • Command:
      ./run-fe-ut.sh --run org.apache.doris.catalog.ColumnBloomFilterMaterializationTest,org.apache.doris.catalog.CreateTableWithBloomFilterIndexTest,org.apache.doris.cloud.datasource.CloudInternalCatalogBloomFilterMaterializationTest,org.apache.doris.nereids.parser.NereidsParserTest,org.apache.doris.nereids.trees.plans.commands.IndexDefinitionTest,org.apache.doris.task.AgentTaskTest,org.apache.doris.alter.SchemaChangeHandlerTest
    • Result: Tests run: 191, Failures: 0, Errors: 0, Skipped: 0
  3. BE unit tests: bloom filter focused

    • Command:
      ./run-be-ut.sh --run --filter='*BloomFilter*:*bloom_filter*:*test_write_bf_with_finalize'
    • Result: Running 90 tests from 12 test suites, 90 tests passed
  4. BE unit tests: tablet metadata / schema / protobuf conversion

    • Command:
      ./run-be-ut.sh --run --filter='TabletSchemaTest.*:TabletMetaTest.*:PbConvert.*:TabletIndexTest.*:TabletSchemaIndexTest.*'
    • Result: Running 48 tests from 5 test suites, 48 tests passed
  5. Regression tests

    • Command:
      ./run-regression-test.sh --run -d bloom_filter_p0 -s test_bloom_filter
    • Result: Test 1 suites, failed 0 suites, fatal 0 scripts, skipped 0 scripts
    • Coverage note: verified legacy bloom filter DDL plus ALTER TABLE SET("bloom_filter_fpp"=...)
      by asserting BE BloomFilterIndexWriter::create receives 0.03, then 0.02 during schema
      change rewrite, then 0.03 again for newly inserted rowsets after the fpp restore.
    • Command:
      ./run-regression-test.sh --run -d bloom_filter_p0 -s test_bloom_filter_named_index
    • Result: Test 1 suites, failed 0 suites, fatal 0 scripts, skipped 0 scripts
    • Coverage note: verified named USING BLOOMFILTER inline DDL, standalone CREATE INDEX,
      DROP INDEX, conflict checks with legacy bloom_filter_columns, and table-level
      bloom_filter_fpp propagation to named bloom filter indexes.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hoshinojyunn

Contributor Author

run buildall

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.44% (21369/39252)
Line Coverage 38.08% (204441/536821)
Region Coverage 34.07% (160362/470727)
Branch Coverage 35.08% (70222/200197)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.07% (28388/38326)
Line Coverage 58.03% (309688/533669)
Region Coverage 54.78% (259025/472870)
Branch Coverage 56.08% (112410/200452)

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 62.82% (98/156) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor
TPC-H: Total hot run time: 29361 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5a1b9a52952c829d0d33b45b6c092303a8fa86f3, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17803	4034	4022	4022
q2	2035	310	190	190
q3	10316	1434	835	835
q4	4680	467	344	344
q5	7482	869	578	578
q6	179	167	136	136
q7	757	856	632	632
q8	9330	1696	1637	1637
q9	5841	4510	4502	4502
q10	6755	1792	1541	1541
q11	443	273	241	241
q12	624	421	299	299
q13	18182	3391	2755	2755
q14	266	261	245	245
q15	q16	792	786	704	704
q17	1019	968	985	968
q18	7275	5826	5653	5653
q19	1320	1252	1045	1045
q20	489	390	263	263
q21	5885	2668	2463	2463
q22	439	364	308	308
Total cold run time: 101912 ms
Total hot run time: 29361 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4314	4239	4223	4223
q2	353	364	232	232
q3	4574	4966	4421	4421
q4	2065	2155	1395	1395
q5	4436	4267	4272	4267
q6	235	177	128	128
q7	1749	1735	1917	1735
q8	2560	2206	2196	2196
q9	8147	8213	7912	7912
q10	4814	4775	4338	4338
q11	577	413	385	385
q12	748	797	539	539
q13	3284	3579	2990	2990
q14	304	308	303	303
q15	q16	739	735	635	635
q17	1321	1342	1428	1342
q18	7758	7463	7229	7229
q19	1144	1081	1104	1081
q20	2234	2190	1936	1936
q21	5215	4607	4479	4479
q22	511	445	390	390
Total cold run time: 57082 ms
Total hot run time: 52156 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 175787 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 5a1b9a52952c829d0d33b45b6c092303a8fa86f3, data reload: false

query5	4337	635	482	482
query6	432	182	167	167
query7	4951	586	306	306
query8	392	209	188	188
query9	8737	4020	4037	4020
query10	432	313	249	249
query11	5908	2338	2148	2148
query12	156	108	100	100
query13	1250	596	445	445
query14	6363	5418	5060	5060
query14_1	4421	4358	4397	4358
query15	209	200	185	185
query16	1058	512	478	478
query17	1143	712	595	595
query18	2559	486	354	354
query19	202	184	147	147
query20	113	116	108	108
query21	218	141	119	119
query22	13676	13544	13449	13449
query23	17332	16526	16154	16154
query23_1	16214	16286	16245	16245
query24	7459	1807	1294	1294
query24_1	1330	1339	1301	1301
query25	557	428	370	370
query26	1297	305	158	158
query27	2616	579	339	339
query28	4365	2027	2009	2009
query29	1041	598	473	473
query30	311	241	197	197
query31	1121	1074	956	956
query32	104	57	55	55
query33	517	320	250	250
query34	1174	1182	652	652
query35	745	782	678	678
query36	1380	1393	1233	1233
query37	150	101	90	90
query38	3184	3164	3062	3062
query39	923	910	896	896
query39_1	885	907	870	870
query40	219	119	100	100
query41	65	61	61	61
query42	96	95	96	95
query43	320	322	280	280
query44	1440	766	757	757
query45	198	191	182	182
query46	1065	1248	770	770
query47	2349	2380	2232	2232
query48	415	378	293	293
query49	622	456	344	344
query50	972	377	260	260
query51	4425	4340	4267	4267
query52	88	89	76	76
query53	242	268	200	200
query54	265	226	191	191
query55	76	74	70	70
query56	234	217	206	206
query57	1418	1431	1319	1319
query58	254	210	213	210
query59	1573	1654	1389	1389
query60	279	252	237	237
query61	216	142	147	142
query62	688	659	561	561
query63	232	187	190	187
query64	2502	759	596	596
query65	4872	4798	4775	4775
query66	1754	448	335	335
query67	29724	29767	29460	29460
query68	3216	1585	1016	1016
query69	411	295	265	265
query70	1058	968	955	955
query71	282	232	216	216
query72	2824	2617	2283	2283
query73	789	766	426	426
query74	5126	4977	4776	4776
query75	2618	2605	2237	2237
query76	2342	1179	805	805
query77	357	372	296	296
query78	12373	12489	11975	11975
query79	1427	1209	782	782
query80	609	523	413	413
query81	457	281	246	246
query82	594	156	126	126
query83	365	292	256	256
query84	266	153	126	126
query85	941	575	481	481
query86	360	302	291	291
query87	3395	3343	3196	3196
query88	3706	2775	2760	2760
query89	417	380	335	335
query90	1996	176	181	176
query91	171	156	128	128
query92	60	61	55	55
query93	1448	1510	919	919
query94	545	335	324	324
query95	673	463	335	335
query96	1034	794	331	331
query97	2721	2689	2564	2564
query98	214	206	194	194
query99	1148	1147	1024	1024
Total cold run time: 259886 ms
Total hot run time: 175787 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.23 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5a1b9a52952c829d0d33b45b6c092303a8fa86f3, data reload: false

query1	0.01	0.01	0.00
query2	0.09	0.05	0.05
query3	0.26	0.14	0.12
query4	1.61	0.14	0.14
query5	0.24	0.23	0.22
query6	1.22	1.08	1.08
query7	0.04	0.01	0.01
query8	0.06	0.04	0.04
query9	0.37	0.34	0.31
query10	0.57	0.54	0.54
query11	0.20	0.15	0.15
query12	0.19	0.15	0.14
query13	0.48	0.46	0.48
query14	1.02	1.01	1.01
query15	0.60	0.59	0.60
query16	0.32	0.32	0.32
query17	1.11	1.08	1.06
query18	0.21	0.21	0.22
query19	2.03	2.00	1.95
query20	0.02	0.01	0.02
query21	15.44	0.23	0.14
query22	4.74	0.06	0.05
query23	16.13	0.32	0.13
query24	2.97	0.42	0.33
query25	0.11	0.04	0.05
query26	0.71	0.22	0.15
query27	0.03	0.05	0.03
query28	3.53	0.94	0.53
query29	12.52	4.33	3.47
query30	0.28	0.15	0.16
query31	2.77	0.60	0.31
query32	3.22	0.60	0.49
query33	3.24	3.23	3.22
query34	15.54	4.25	3.49
query35	3.51	3.50	3.54
query36	0.56	0.43	0.43
query37	0.09	0.07	0.06
query38	0.04	0.04	0.03
query39	0.04	0.02	0.02
query40	0.18	0.16	0.16
query41	0.08	0.03	0.02
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.46 s
Total hot run time: 25.23 s

@hoshinojyunn hoshinojyunn force-pushed the master branch 2 times, most recently from 39129a2 to 0e0cf67 on June 21, 2026 06:44

@hoshinojyunn

Contributor Author

run buildall

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.46% (21378/39252)
Line Coverage 38.07% (204375/536821)
Region Coverage 34.06% (160313/470727)
Branch Coverage 35.09% (70243/200197)

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 74.36% (116/156) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.12% (28409/38326)
Line Coverage 58.05% (309814/533669)
Region Coverage 54.96% (259904/472870)
Branch Coverage 56.23% (112705/200452)

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 76.28% (119/156) 🎉
Increment coverage report
Complete coverage report

@hoshinojyunn hoshinojyunn force-pushed the master branch 2 times, most recently from 80d0150 to 1be7525 on June 21, 2026 11:01

@hoshinojyunn

Contributor Author

run buildall

@yx-keith

Contributor

/review

@github-actions github-actions Bot left a comment

Contributor

I found three issues that should be fixed before merging.

Checkpoint conclusions:

  • Task goal and implementation: the named BLOOMFILTER syntax is wired through parser/catalog/schema-change paths, but materialization is incomplete for non-base materialized indexes.
  • Tests: good coverage for parser, FE validation, BE tablet metadata, and regression DDL; missing rollup/schema-change materialization and backup/restore signature coverage for named-only bloom filters.
  • Compatibility and persistence: named-only bloom_filter_fpp is not reflected in the table signature used by restore schema comparison.
  • Parallel paths: local/cloud schema-change and create-table/add-partition paths need the same effective bloom-filter-column handling.
  • Performance/concurrency/lifecycle: no separate blocking issue found in locking, lifecycle, or performance-sensitive code.
  • User focus: no additional user-provided focus points were present.

tColumns = (List<TColumn>) tCols;
} else {
tColumns = new ArrayList<>();
Set<String> namedBfColumns = Index.extractBloomFilterColumns(indexes);

Contributor

Named bloom filter columns are derived only from indexes, but several existing non-base-index callers still pass null here. For example, SchemaChangeJobV2.createShadowIndexReplica() only passes indexes for the base index, while addShadowIndexToCatalog() writes the new index list into every changed shadow index meta when indexChange is true. ALTER TABLE ... ADD INDEX bf(v1) USING BLOOMFILTER on a table with a rollup containing v1 will therefore create the rollup shadow tablets without is_bloom_filter_column or bloom_filter_fpp, but the catalog then advertises the bloom filter index for that rollup. The same base-only indexes pattern exists in the cloud schema-change path and initial partition creation. Please pass the effective named bloom-filter column set to every materialized index schema that contains those columns, or fold named bloom-filter columns into the bfColumns argument, and add a rollup/schema-change test.

Contributor Author

Reasonable. The real bug is in schema change shadow-tablet materialization for non-base indexes.

Changed:

  • Added Index.collectBloomFilterColumns(...) and Index.filterBloomFilterColumnsBySchema(...)
  • In SchemaChangeJobV2 and CloudSchemaChangeJobV2, materialize per-shadow-schema shadowBfColumns and pass them down instead of relying on tabletIndexes != null
  • Added FE UTs for local/cloud schema change rollup shadow replicas and a rollup regression case
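
A rough sketch of the per-shadow-schema filtering idea described in this reply, assuming the effective bloom filter columns (legacy plus named) and each shadow schema are available as lists of column names. The class `ShadowBfColumnsSketch` and its method are hypothetical; the actual signatures of `Index.collectBloomFilterColumns(...)` and `Index.filterBloomFilterColumnsBySchema(...)` are not shown in this thread and may differ.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the helpers added by the patch.
final class ShadowBfColumnsSketch {

    // Keeps only the bloom filter columns that exist in the given schema,
    // returned with the schema's own spelling and in schema order.
    static Set<String> filterBySchema(Collection<String> effectiveBfColumns,
                                      List<String> schemaColumnNames) {
        Set<String> result = new LinkedHashSet<>();
        if (effectiveBfColumns == null) {
            return result;
        }
        for (String column : schemaColumnNames) {
            for (String bf : effectiveBfColumns) {
                if (bf.equalsIgnoreCase(column)) {
                    result.add(column);
                    break;
                }
            }
        }
        return result;
    }
}
```

With a base schema of `k1, v1, v2` and a rollup of `k1, v1`, a named index on `v1` yields `v1` for both schemas, so the rollup shadow tablets receive the same flag as the base index instead of depending on whether `indexes` was passed.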

bfFpp = PropertyAnalyzer.analyzeBloomFilterFpp(properties);
if (bfColumns != null && bfFpp == 0) {
boolean hasLegacyBf = bfColumns != null;
boolean hasNamedBf = !Index.extractBloomFilterColumns(createTableInfo.getIndexes()).isEmpty();

Contributor

This makes bfFpp meaningful for named-only bloom filters, but OlapTable.getSignature() still appends bfFpp only when the legacy bfColumns set is non-empty. RestoreJob uses that signature to decide whether an existing table has the same schema before restore, so two tables with the same named BLOOMFILTER index but different PROPERTIES("bloom_filter_fpp"=...) compare equal and restore can keep tablets built with the wrong FPP. Please include named bloom filters in the signature condition, for example by using the effective getCopiedBfColumns()/has-any-BF logic, and add coverage for named-only FPP in the signature/restore path.

Contributor Author

Reasonable. Named-only BLOOMFILTER tables should also make bfFpp part of the table signature.

Changed:

  • Updated OlapTable.getSignature(...) to use getCopiedBfColumns() instead of the legacy-only member field
  • Added OlapTableTest.testNamedBloomFilterFppAffectsSignature
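
To illustrate why the signature condition matters, here is a simplified sketch of only the bloom filter portion of a table signature, assuming the effective column set is the union of legacy and named columns. `SignatureSketch` and `bfSignature` are hypothetical names; the real `OlapTable.getSignature(...)` covers much more than this fragment.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the bloom filter part of a schema signature.
final class SignatureSketch {

    static String bfSignature(Set<String> legacyBfColumns,
                              Set<String> namedBfColumns,
                              double bfFpp) {
        // Union of both sources, sorted so the result is deterministic.
        Set<String> effective = new TreeSet<>();
        if (legacyBfColumns != null) {
            effective.addAll(legacyBfColumns);
        }
        if (namedBfColumns != null) {
            effective.addAll(namedBfColumns);
        }
        StringBuilder sb = new StringBuilder();
        // The FPP is part of the signature whenever ANY bloom filter exists,
        // not only when the legacy property is set.
        if (!effective.isEmpty()) {
            for (String column : effective) {
                sb.append(column);
            }
            sb.append(bfFpp);
        }
        return sb.toString();
    }
}
```

Under this rule, two named-only tables that differ only in `bloom_filter_fpp` produce different fragments, which is the property the restore comparison needs.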

} catch (Exception ex) {
throw new AnalysisException("invalid ngram bf index params:" + ex.getMessage(), ex);
}
} else if (indexType == IndexType.BLOOMFILTER) {

Contributor

This BLOOMFILTER branch runs only after the generic BITMAP/INVERTED/BLOOMFILTER/NGRAM_BF checks above, including the inverted-index V1 VARIANT rejection. As a result, a named BLOOMFILTER on a VARIANT column is rejected when inverted_index_storage_format resolves to V1/DEFAULT under those rules, even though the legacy bloom_filter_columns analyzer accepts VARIANT through Column.isSupportBloomFilter() and the new tests only cover the null format. Please gate the V1/VARIANT restriction to INVERTED or otherwise skip it for BLOOMFILTER, and add V1/default-format coverage.

Contributor Author

Reasonable. BLOOMFILTER should not inherit the inverted-index-only VARIANT + V1/DEFAULT rejection.

Changed:

  • Narrowed the VARIANT + V1/DEFAULT restriction in both IndexDefinition.checkColumn(...) overloads so it does not apply to BLOOMFILTER
  • Kept NGRAM_BF behavior unchanged
  • Added VARIANT + BLOOMFILTER + V1/DEFAULT coverage for both overloads

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 75.64% (118/156) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.10% (28398/38326)
Line Coverage 57.98% (309410/533663)
Region Coverage 54.81% (259196/472861)
Branch Coverage 56.16% (112567/200450)

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 82.69% (129/156) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor
TPC-H: Total hot run time: 29246 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1be7525c64dd915ef0b0204cdf98429ee373ea0a, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17642	4091	4008	4008
q2	2015	327	186	186
q3	10318	1472	816	816
q4	4681	469	355	355
q5	7503	912	577	577
q6	184	171	138	138
q7	766	823	617	617
q8	9322	1705	1580	1580
q9	5908	4524	4448	4448
q10	6778	1787	1544	1544
q11	444	277	249	249
q12	625	442	306	306
q13	18132	3536	2770	2770
q14	267	260	244	244
q15	q16	790	774	704	704
q17	999	980	1040	980
q18	6900	5799	5658	5658
q19	1291	1333	1089	1089
q20	513	417	265	265
q21	5921	2726	2407	2407
q22	439	368	305	305
Total cold run time: 101438 ms
Total hot run time: 29246 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4430	4317	4391	4317
q2	340	359	243	243
q3	4654	4946	4395	4395
q4	2078	2168	1378	1378
q5	4493	4319	4324	4319
q6	224	175	126	126
q7	1788	2034	1853	1853
q8	2655	2326	2303	2303
q9	8207	8317	8084	8084
q10	4781	4842	4285	4285
q11	587	431	402	402
q12	771	773	543	543
q13	3354	3714	3046	3046
q14	299	299	283	283
q15	q16	733	757	686	686
q17	1372	1349	1357	1349
q18	8015	7333	7401	7333
q19	1159	1147	1091	1091
q20	2231	2209	1927	1927
q21	5393	4725	4540	4540
q22	516	456	394	394
Total cold run time: 58080 ms
Total hot run time: 52897 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 175546 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1be7525c64dd915ef0b0204cdf98429ee373ea0a, data reload: false

query5	4337	621	476	476
query6	434	201	175	175
query7	4890	592	295	295
query8	361	220	198	198
query9	8740	4100	4099	4099
query10	459	302	253	253
query11	5758	2333	2128	2128
query12	157	103	101	101
query13	1261	607	428	428
query14	6380	5419	5041	5041
query14_1	4414	4487	4518	4487
query15	206	195	170	170
query16	984	452	435	435
query17	935	688	542	542
query18	2426	465	338	338
query19	186	181	143	143
query20	107	104	101	101
query21	216	132	115	115
query22	13693	13558	13353	13353
query23	17401	16539	16083	16083
query23_1	16276	16265	16188	16188
query24	7548	1807	1331	1331
query24_1	1347	1346	1347	1346
query25	562	472	390	390
query26	1313	337	170	170
query27	2692	564	358	358
query28	4505	2054	2034	2034
query29	1101	640	502	502
query30	315	240	196	196
query31	1143	1078	966	966
query32	113	65	59	59
query33	535	323	262	262
query34	1178	1155	681	681
query35	755	793	676	676
query36	1359	1372	1271	1271
query37	154	105	91	91
query38	3222	3125	3043	3043
query39	933	914	893	893
query39_1	883	882	874	874
query40	221	127	104	104
query41	68	65	67	65
query42	99	97	97	97
query43	328	325	300	300
query44	1504	776	787	776
query45	198	191	177	177
query46	1091	1208	746	746
query47	2346	2330	2205	2205
query48	441	437	299	299
query49	633	482	360	360
query50	1031	369	278	278
query51	4325	4332	4205	4205
query52	88	92	79	79
query53	258	267	205	205
query54	287	227	218	218
query55	81	79	73	73
query56	237	227	235	227
query57	1390	1406	1310	1310
query58	255	221	222	221
query59	1608	1672	1443	1443
query60	291	245	242	242
query61	174	188	148	148
query62	720	657	589	589
query63	228	188	200	188
query64	2513	751	589	589
query65	4868	4792	4774	4774
query66	1761	463	328	328
query67	29160	29792	29634	29634
query68	3252	1503	1019	1019
query69	398	298	268	268
query70	1017	996	956	956
query71	289	231	210	210
query72	2952	2603	2325	2325
query73	879	808	447	447
query74	5132	5006	4741	4741
query75	2642	2597	2244	2244
query76	2316	1221	794	794
query77	366	387	281	281
query78	12455	12521	11858	11858
query79	1389	1196	717	717
query80	1268	488	385	385
query81	524	280	235	235
query82	634	157	120	120
query83	352	270	246	246
query84	312	146	116	116
query85	916	526	404	404
query86	424	303	291	291
query87	3402	3367	3209	3209
query88	3766	2810	2816	2810
query89	439	380	328	328
query90	1872	172	181	172
query91	172	166	133	133
query92	92	57	58	57
query93	1582	1431	942	942
query94	710	344	305	305
query95	686	465	350	350
query96	1058	794	353	353
query97	2711	2671	2517	2517
query98	213	202	200	200
query99	1170	1153	1028	1028
Total cold run time: 261249 ms
Total hot run time: 175546 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.24 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 1be7525c64dd915ef0b0204cdf98429ee373ea0a, data reload: false

query1	0.01	0.01	0.00
query2	0.10	0.06	0.05
query3	0.25	0.14	0.13
query4	1.63	0.14	0.13
query5	0.24	0.22	0.21
query6	1.26	1.08	1.09
query7	0.04	0.01	0.00
query8	0.05	0.04	0.03
query9	0.38	0.31	0.31
query10	0.57	0.57	0.57
query11	0.19	0.14	0.14
query12	0.18	0.14	0.15
query13	0.48	0.47	0.50
query14	1.02	1.02	1.01
query15	0.63	0.60	0.60
query16	0.31	0.32	0.30
query17	1.10	1.15	1.11
query18	0.23	0.21	0.21
query19	2.07	1.87	1.95
query20	0.02	0.01	0.01
query21	15.45	0.22	0.14
query22	4.85	0.05	0.05
query23	16.15	0.31	0.12
query24	2.92	0.42	0.32
query25	0.12	0.05	0.05
query26	0.71	0.20	0.15
query27	0.04	0.04	0.03
query28	3.61	0.87	0.53
query29	12.46	4.24	3.46
query30	0.28	0.15	0.15
query31	2.78	0.61	0.32
query32	3.22	0.60	0.49
query33	3.12	3.24	3.27
query34	15.54	4.20	3.53
query35	3.56	3.50	3.49
query36	0.56	0.44	0.43
query37	0.09	0.07	0.06
query38	0.06	0.03	0.03
query39	0.03	0.03	0.03
query40	0.18	0.17	0.16
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.03	0.04
Total cold run time: 96.67 s
Total hot run time: 25.24 s

DCHECK_EQ(index.columns.size(), 1);
if (iequal(tcolumn.column_name, index.columns[0])) {
column->set_is_bf_column(true);
has_bf_columns = true;

Member

Why is this line being added?

Contributor Author

has_bf_columns is used further down in this function to set the bloom filter false positive probability (FPP). In the current design, the FPP of a named BF index comes from the bloom_filter_fpp table property. Therefore, when scanning the indexes in tablet_meta, if an index of type BLOOMFILTER exists, has_bf_columns is set to true so that bloom_filter_fpp is set correctly later on. (A better approach would be to handle NGRAM_BF and BLOOMFILTER separately.)

partitionId, shadowTablet,
tbl.getPartitionInfo().getTabletType(partitionId),
shadowSchemaHash, originKeysType, shadowShortKeyColumnCount, bfColumns,
shadowSchemaHash, originKeysType, shadowShortKeyColumnCount,

Member

This is only a formatting change; please do not change this line.

Contributor Author

fixed

}
}
if (found == null) {
if (containsIgnoreCase(olapTable.getCopiedLegacyBfColumns(), indexName)) {

Member

Why prevent dropping the BF index?

Contributor Author

This check detects whether the user's DROP INDEX statement targets a bloom filter column that is defined in the table properties. It prevents DROP INDEX from being used in place of ALTER TABLE SET ("bloom_filter_columns" = ...) and tells the user the correct way to remove that bloom filter.
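
The decision being described could be sketched as follows, assuming the named index names and the legacy bloom filter column names are available as collections. The class, enum, and method names are hypothetical and only mirror the `found == null` / `containsIgnoreCase(...)` shape of the snippet under review.

```java
import java.util.Collection;

// Hypothetical sketch of how DROP INDEX might classify its target.
final class DropIndexSketch {

    enum Outcome { DROP_NAMED_INDEX, REJECT_LEGACY_BF_COLUMN, INDEX_NOT_FOUND }

    static boolean containsIgnoreCase(Collection<String> names, String target) {
        if (names == null) {
            return false;
        }
        for (String n : names) {
            if (n.equalsIgnoreCase(target)) {
                return true;
            }
        }
        return false;
    }

    static Outcome resolve(String indexName,
                           Collection<String> namedIndexNames,
                           Collection<String> legacyBfColumns) {
        if (containsIgnoreCase(namedIndexNames, indexName)) {
            return Outcome.DROP_NAMED_INDEX;
        }
        // No index with that name: if it matches a legacy bloom filter column,
        // report a targeted error that points to ALTER TABLE SET instead.
        if (containsIgnoreCase(legacyBfColumns, indexName)) {
            return Outcome.REJECT_LEGACY_BF_COLUMN;
        }
        return Outcome.INDEX_NOT_FOUND;
    }
}
```

The legacy branch does not block any valid drop; it only replaces a generic "index not found" error with a more specific hint.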

@yx-keith yx-keith left a comment

Contributor

  1. Named BF is not materialized on non-base (rollup/MV) indexes — correctness bug
    SchemaChangeJobV2 passes indexes only for the base index:

// SchemaChangeJobV2#runPendingJob
List tabletIndexes = originIndexId == tbl.getBaseIndexId() ? indexes : null;
Since named BF columns are derived only from indexes, and the bfColumns stored on the schema-change job is legacy-only (in createJob, bfColumns ends up as originalLegacyBloomFilterColumns), a rollup/MV that contains the column gets neither signal. So ALTER TABLE ... ADD INDEX bf(v1) USING BLOOMFILTER on a table whose rollup contains v1 creates rollup shadow tablets without is_bf_column / bloom_filter_fpp, while the catalog still advertises the BF index for that rollup. The same base-only indexes pattern exists on the cloud schema-change and initial create-table / add-partition paths.

Suggested fix: fold the effective named BF columns into the bfColumns argument for every materialized index that contains them (or pass the relevant indexes to each), and add a rollup/schema-change materialization test.
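
One way to read the first option in the suggested fix, as a rough sketch: compute the BF columns per materialized index from both sources instead of passing the legacy set alone. `EffectiveBfColumnsSketch` and `effectiveBfColumns` are hypothetical names, and column sets are modeled as plain strings rather than Doris `Column` / `Index` objects.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; it models the idea only, not the SchemaChangeJobV2 code.
final class EffectiveBfColumnsSketch {

    // For one materialized index (base, rollup, or MV), return the columns of that
    // index that should be materialized as bloom filter columns: every schema column
    // that is either a legacy (property-managed) or a named BF column.
    static Set<String> effectiveBfColumns(Set<String> legacyBfColumns,
                                          Set<String> namedBfColumns,
                                          List<String> indexSchemaColumns) {
        Set<String> result = new LinkedHashSet<>();
        for (String column : indexSchemaColumns) {
            if (containsIgnoreCase(legacyBfColumns, column)
                    || containsIgnoreCase(namedBfColumns, column)) {
                result.add(column);
            }
        }
        return result;
    }

    private static boolean containsIgnoreCase(Set<String> names, String target) {
        for (String name : names) {
            if (name.equalsIgnoreCase(target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> legacy = Set.of();      // named-only table
        Set<String> named = Set.of("v1");   // ADD INDEX bf(v1) USING BLOOMFILTER
        // Base index and a rollup that both contain v1, plus a rollup that does not.
        System.out.println(effectiveBfColumns(legacy, named, List.of("k1", "v1", "v2")));
        System.out.println(effectiveBfColumns(legacy, named, List.of("k1", "v1")));
        System.out.println(effectiveBfColumns(legacy, named, List.of("k1", "v2")));
    }
}
```

With this shape the rollup that contains `v1` receives the same BF signal as the base index, and a rollup without `v1` receives an empty set.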

  2. getSignature() omits named-only bfFpp: restore can keep wrong-FPP tablets

// OlapTable#getSignature
if (bfColumns != null && !bfColumns.isEmpty()) { // legacy field only
    for (String bfCol : bfColumns) { sb.append(bfCol); }
    sb.append(bfFpp);
}
For a named-only table the legacy bfColumns field is empty, so bfFpp is excluded from the signature. RestoreJob uses this signature to decide schema equality, so two tables with the same named USING BLOOMFILTER index but different PROPERTIES("bloom_filter_fpp"=...) compare equal and restore can retain tablets built with the wrong FPP.

Suggested fix: include named BF in the signature condition (e.g. use the effective getCopiedBfColumns() / has-any-BF logic), and add named-only FPP coverage on the signature/restore path.
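
A small sketch of what the signature condition could look like when it is driven by the effective BF columns. `BfSignatureSketch` and `bfSignaturePart` are hypothetical, the sorted iteration order is a choice made here for determinism, and only the BF portion of the signature is modeled.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the BF portion of a table signature; not OlapTable#getSignature.
final class BfSignatureSketch {

    // Appends the BF columns and the FPP whenever the table has any BF column,
    // whether it is property-managed (legacy) or comes from a named index.
    static String bfSignaturePart(Set<String> legacyBfColumns,
                                  Set<String> namedBfColumns,
                                  double bfFpp) {
        Set<String> effective = new TreeSet<>(legacyBfColumns); // sorted for a stable result
        effective.addAll(namedBfColumns);
        StringBuilder sb = new StringBuilder();
        if (!effective.isEmpty()) {
            for (String bfCol : effective) {
                sb.append(bfCol);
            }
            sb.append(bfFpp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> legacy = Set.of();
        Set<String> named = Set.of("v1");
        // Two named-only tables that differ only in bloom_filter_fpp.
        String a = bfSignaturePart(legacy, named, 0.01);
        String b = bfSignaturePart(legacy, named, 0.05);
        System.out.println(a.equals(b));
    }
}
```

Under the legacy-only condition both calls in `main` would contribute nothing to the signature; here the FPP takes part as soon as any BF column exists, so the two tables no longer compare equal.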

  3. VARIANT + inverted-index-V1 rule wrongly rejects named BLOOMFILTER
    In IndexDefinition.checkColumn, the V1/DEFAULT VARIANT rejection lives inside the shared BITMAP/INVERTED/BLOOMFILTER/NGRAM_BF block and is not gated to INVERTED:

boolean notSupportInvertedIndexForVariant =
        (invertedIndexFileStorageFormat == V1 || invertedIndexFileStorageFormat == DEFAULT)
        && (Config.isCloudMode() || !Config.enable_inverted_index_v1_for_variant);
if (colType.isVariantType() && notSupportInvertedIndexForVariant) { throw ...; }
So a named BLOOMFILTER on a VARIANT column is rejected when the format resolves to V1/DEFAULT, even though the legacy bloom_filter_columns analyzer accepts VARIANT via Column.isSupportBloomFilter(). The tests pass null for the format, which skips this branch and hides the inconsistency.

Suggested fix: gate the V1/VARIANT restriction to INVERTED (or skip it for BLOOMFILTER), and add V1/DEFAULT-format coverage.
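
The gating could look roughly like the sketch below. It models only this one rule with hypothetical enums and a hypothetical method name (`v1RuleRejectsVariant`); the two boolean parameters stand in for `Config.isCloudMode()` and `Config.enable_inverted_index_v1_for_variant`, and other checks in `IndexDefinition.checkColumn` may still reject a column for unrelated reasons.

```java
// Hypothetical model of the V1/DEFAULT VARIANT rule; not IndexDefinition.checkColumn.
final class VariantIndexRuleSketch {

    enum IndexType { BITMAP, INVERTED, BLOOMFILTER, NGRAM_BF }

    enum StorageFormat { DEFAULT, V1, V2 }

    // Returns true when the inverted-index V1 rule should reject an index on a
    // VARIANT column. The rule is applied to INVERTED indexes only, so a named
    // BLOOMFILTER index is never rejected by it.
    static boolean v1RuleRejectsVariant(IndexType indexType,
                                        StorageFormat format,
                                        boolean cloudMode,
                                        boolean enableV1ForVariant) {
        if (indexType != IndexType.INVERTED) {
            return false;
        }
        // A null format skips the rule, as in the existing tests mentioned above.
        boolean v1OrDefault = format == StorageFormat.V1 || format == StorageFormat.DEFAULT;
        return v1OrDefault && (cloudMode || !enableV1ForVariant);
    }

    public static void main(String[] args) {
        // Same format and configuration, different index types.
        System.out.println(v1RuleRejectsVariant(IndexType.INVERTED, StorageFormat.V1, false, false));
        System.out.println(v1RuleRejectsVariant(IndexType.BLOOMFILTER, StorageFormat.V1, false, false));
    }
}
```

A test that passes `StorageFormat.V1` or `StorageFormat.DEFAULT` explicitly, rather than null, exercises the branch that the current tests skip.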

Smaller points

- CloudSchemaChangeJobV2.java: the change there is whitespace-only; please revert it to keep the diff surgical.
- tablet_meta.cpp: has_bf_columns = true is correct (it gates set_bf_fpp), but it would read cleaner to handle NGRAM_BF and BLOOMFILTER separately rather than sharing the flag.

@hoshinojyunn

Contributor Author

run buildall

4 participants