feat: integrate openms-insight into quantms#28
Conversation
Implement 5 new Streamlit analysis pipeline pages powered by the openms-insight engine to handle downstream proteomics workflows: - Data Filtering (filtering.py): * Support Low Abundance, Low Repeatability, and Low Variance filters using Polars LazyFrame. - Missing Value Imputation (imputation.py): * Connect upstream filtered datasets using session state fallback. * Provide group-aware MAR (mean/median) and technical dropout MNAR (row/global minimum) methods. - Data Normalization & Scaling (normalization.py): * Build a 3-step preprocessing chain: Mathematical transformation, column-wise sample normalization (PQn, Quantile, Reference Feature), and row-wise scaling. - Statistical Inference (statistical.py): * Dynamically route test methods (limma-like, Welch, ANOVA) based on detected group counts. * Integrate multiple testing corrections (BH, Bonferroni). - GO Enrichment Analysis (enrichment.py): * Fetch foreground lists dynamically via P-value and Log2FC constraints. * Query MyGene.info API and render interactive Plotly charts across BP, CC, and MF tabs.
📝 WalkthroughWalkthroughThe PR changes abundance loading to return a pivot table and group map, adds filtering/imputation/normalization/statistical/enrichment pages, and rewires heatmap/PCA/volcano results to updated data sources and OpenMS Insight components. The results navigation also replaces Proteomics LFQ with Pathway Analysis. ChangesAbundance preprocessing, analysis, and results
Sequence Diagram(s)sequenceDiagram
participant FilteringPage as content/filtering.py
participant ImputationPage as content/imputation.py
participant NormalizationPage as content/normalization.py
participant StatisticalPage as content/statistical.py
participant EnrichmentPage as content/enrichment.py
participant SessionState as st.session_state
participant EnrichmentEngine as calculate_go_enrichment
FilteringPage->>SessionState: store filtered_df
ImputationPage->>SessionState: store imputed_df
NormalizationPage->>SessionState: store normalized_df
StatisticalPage->>SessionState: store statistics_df
EnrichmentPage->>SessionState: read statistics_df
EnrichmentPage->>EnrichmentEngine: calculate_go_enrichment(...)
EnrichmentEngine-->>EnrichmentPage: status and GO results
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 OpenGrep (1.23.0)content/results_pca.py┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.16][ERROR]: unable to find a config; path content/results_volcano.py┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.17][ERROR]: unable to find a config; path quantms_volcano_plot/manifest.json┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.12][ERROR]: unable to find a config; path
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (5)
content/imputation.py (1)
65-134: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse an
@st.fragmentfor the imputation controls/results section.This page adds interactive widgets plus button-driven rendering without fragment scoping, so each change reruns the full page. As per coding guidelines,
**/*.py:Use@st.fragmentdecorator for interactive UI updates without full page reloads.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/imputation.py` around lines 65 - 134, The imputation controls/results block in content/imputation.py should be scoped into a Streamlit fragment so widget changes and the Apply Imputation button do not rerun the whole page. Wrap the interactive section that builds metadata_pl, renders the selectbox/radio controls, and handles the imputation execution around impute_mar and impute_smallest_value with an `@st.fragment-decorated` function, then call that fragment from the page so the UI updates stay localized.Source: Coding guidelines
content/normalization.py (1)
79-176: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPut the normalization UI into an
@st.fragment.This workflow is entirely widget-driven, but it is still executed at page scope, so every interaction reruns the whole page. As per coding guidelines,
**/*.py:Use@st.fragmentdecorator for interactive UI updates without full page reloads.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/normalization.py` around lines 79 - 176, The normalization widget workflow in the current page-scope block should be moved into an `@st.fragment` so interactions only rerun that UI segment. Wrap the UI and button-driven pipeline logic around the selectboxes, st.text_input, and the "Apply Normalization Pipelines" flow in a fragment function, then call that fragment from the page body so the rest of the page is not re-executed on every widget change.Source: Coding guidelines
content/statistical.py (1)
85-158: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winExtract the statistical controls/results into an
@st.fragment.This page adds interactive analysis widgets without fragment scoping, so the entire page reruns on every control change. As per coding guidelines,
**/*.py:Use@st.fragmentdecorator for interactive UI updates without full page reloads.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/statistical.py` around lines 85 - 158, The statistical controls and results block in the main Streamlit flow reruns the entire page on every widget change; move the interactive UI and execution logic into an `@st.fragment` so updates are scoped locally. Wrap the sections that build metadata_pl, render the selectboxes, and handle the Run Statistical Analysis button/results inside a fragment function, keeping the existing calculate_statistical_tests and adjust_fdr_lazy calls together there.Source: Coding guidelines
content/filtering.py (1)
58-163: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMove the interactive filter block into an
@st.fragment.The controls and result view are rendered at module scope, so every widget interaction reruns the whole page instead of just this analysis section. As per coding guidelines,
**/*.py:Use@st.fragmentdecorator for interactive UI updates without full page reloads.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/filtering.py` around lines 58 - 163, The filter UI and result rendering in the filtering section are still running at module scope, so every widget change reruns the whole page; wrap this interactive block in an `@st.fragment` to keep updates scoped. Move the controls, Apply Filter button, and filtered result display into a fragment function in content/filtering.py, and keep the existing filter_method, filter_low_abundance, filter_low_repeatability, filter_low_variance, and filtered_df logic inside that fragment so the behavior stays the same but only this section refreshes.Source: Coding guidelines
content/enrichment.py (1)
50-140: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse an
@st.fragmentfor the enrichment controls and result tabs.This page is another interactive analysis surface implemented at top level, so the whole page reruns on every control change. As per coding guidelines,
**/*.py:Use@st.fragmentdecorator for interactive UI updates without full page reloads.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/enrichment.py` around lines 50 - 140, The enrichment controls and result tabs are implemented directly in the top-level Streamlit flow, causing full-page reruns on every widget change. Refactor the UI block around the threshold inputs, the run button, and the tabbed GO results into a dedicated function and decorate it with `@st.fragment` so only that section updates interactively. Keep the existing logic in calculate_go_enrichment, the target_p_col/p_label setup, and the tabs rendering inside the fragment.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@content/enrichment.py`:
- Around line 38-39: The fallback navigation in the `st.page_link` call
currently points to the wrong page module, which can break the prerequisite flow
if that target does not exist. Update the link in `content/enrichment.py` to use
the same registered statistics page path that `app.py` registers for the
statistics screen, keeping the label and icon unchanged.
- Around line 45-54: The enrichment setup currently validates only ProteinName,
but the later call path also depends on log2FC and a valid p-value column. In
content/enrichment.py, extend the existing preflight checks near target_p_col
selection to verify final_statistics_report contains log2FC plus either p-adj or
p-value before reaching calculate_go_enrichment. If any required column is
missing, show a clear st.error and st.stop early so the page fails fast with a
useful message.
In `@content/filtering.py`:
- Around line 142-145: The filter update in filtering.py should also invalidate
dependent analysis state so downstream pages stop using stale data. In the
filtering flow where filtered_df is written in the filtering logic, clear or
remove the existing imputed_df, normalized_df, and statistics_df entries from
st.session_state whenever the filter result changes. Make the change in the same
place that updates filtered_df so normalization and statistical paths that read
from imputed_df/normalized_df in content/normalization.py and
content/statistical.py will recompute from the new filtered data instead of
reusing old cached values.
In `@content/imputation.py`:
- Around line 124-128: After re-imputing in the imputation flow, clear any
downstream derived session-state artifacts that depend on the old data, not just
overwrite imputed_df. Update the logic around imputed_lazy.collect().to_pandas()
and the st.session_state assignments so normalized_df and statistics_df are
removed or reset whenever imputed_df is refreshed, ensuring later steps like
statistical and enrichment processing do not consume stale outputs.
In `@content/normalization.py`:
- Around line 166-170: The normalization step updates normalized_df but leaves
any existing statistics_df in session state, so downstream enrichment can use
stale statistics after a new normalization run. Update the normalization flow in
the final checkpointing block that sets st.session_state["normalized_df"] to
also clear or invalidate st.session_state["statistics_df"] whenever the
normalized data changes, so content/enrichment.py will not reuse mismatched
statistics.
In `@content/statistical.py`:
- Around line 103-157: The `anova` option in the `selected_method` flow is
incompatible with the downstream `log2FC`-based schema used after
`calculate_statistical_tests` and by consumers like `content/enrichment.py`.
Update the 3+ group branch in `content/statistical.py` so `method_options` only
includes methods that still produce a single per-row `log2FC`/p-value contract,
or change the downstream contract and documentation if ANOVA must remain. Make
sure the selected method path, result table messaging, and any dependent
filtering logic stay consistent with the same output columns.
- Around line 100-115: The statistical test selector in the group_count == 2
branch currently exposes "paired" even though no pair metadata is captured, so
remove that option from method_options in the selectbox flow and keep only
unpaired methods like "limma_like" and "welch"; update any related help_text or
validation in content/statistical.py so the UI no longer suggests paired
analysis until explicit pair assignment support exists.
In `@src/common/results_helpers.py`:
- Around line 234-246: The sample grouping logic in the results helper is
truncating to only the first two groups, which drops any additional groups
before downstream processing. Update the grouping and sample selection in the
results-building flow around the unique symbols df, sample_group_df,
group1_samples, group2_samples, and all_samples so it preserves every group
present in the data instead of hard-coding groups[0] and groups[1]. Make sure
the pivot/multi-group path receives the full set of samples and groups, with no
silent filtering when more than two groups exist.
---
Nitpick comments:
In `@content/enrichment.py`:
- Around line 50-140: The enrichment controls and result tabs are implemented
directly in the top-level Streamlit flow, causing full-page reruns on every
widget change. Refactor the UI block around the threshold inputs, the run
button, and the tabbed GO results into a dedicated function and decorate it with
`@st.fragment` so only that section updates interactively. Keep the existing logic
in calculate_go_enrichment, the target_p_col/p_label setup, and the tabs
rendering inside the fragment.
In `@content/filtering.py`:
- Around line 58-163: The filter UI and result rendering in the filtering
section are still running at module scope, so every widget change reruns the
whole page; wrap this interactive block in an `@st.fragment` to keep updates
scoped. Move the controls, Apply Filter button, and filtered result display into
a fragment function in content/filtering.py, and keep the existing
filter_method, filter_low_abundance, filter_low_repeatability,
filter_low_variance, and filtered_df logic inside that fragment so the behavior
stays the same but only this section refreshes.
In `@content/imputation.py`:
- Around line 65-134: The imputation controls/results block in
content/imputation.py should be scoped into a Streamlit fragment so widget
changes and the Apply Imputation button do not rerun the whole page. Wrap the
interactive section that builds metadata_pl, renders the selectbox/radio
controls, and handles the imputation execution around impute_mar and
impute_smallest_value with an `@st.fragment-decorated` function, then call that
fragment from the page so the UI updates stay localized.
In `@content/normalization.py`:
- Around line 79-176: The normalization widget workflow in the current
page-scope block should be moved into an `@st.fragment` so interactions only rerun
that UI segment. Wrap the UI and button-driven pipeline logic around the
selectboxes, st.text_input, and the "Apply Normalization Pipelines" flow in a
fragment function, then call that fragment from the page body so the rest of the
page is not re-executed on every widget change.
In `@content/statistical.py`:
- Around line 85-158: The statistical controls and results block in the main
Streamlit flow reruns the entire page on every widget change; move the
interactive UI and execution logic into an `@st.fragment` so updates are scoped
locally. Wrap the sections that build metadata_pl, render the selectboxes, and
handle the Run Statistical Analysis button/results inside a fragment function,
keeping the existing calculate_statistical_tests and adjust_fdr_lazy calls
together there.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3f986db5-e0e1-48bf-91e8-54d28dc2a8c8
📒 Files selected for processing (8)
app.pycontent/enrichment.pycontent/filtering.pycontent/imputation.pycontent/normalization.pycontent/results_abundance.pycontent/statistical.pysrc/common/results_helpers.py
| st.page_link( | ||
| "content/results_statistics.py", label="Go to Statistical Inference", icon="🔬" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Point this fallback link at the registered statistics page.
app.py registers content/statistical.py, but this button sends users to content/results_statistics.py. If that file is absent, the prerequisite flow dead-ends here.
Proposed fix
st.page_link(
- "content/results_statistics.py", label="Go to Statistical Inference", icon="🔬"
+ "content/statistical.py", label="Go to Statistical Inference", icon="🔬"
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| st.page_link( | |
| "content/results_statistics.py", label="Go to Statistical Inference", icon="🔬" | |
| st.page_link( | |
| "content/statistical.py", label="Go to Statistical Inference", icon="🔬" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@content/enrichment.py` around lines 38 - 39, The fallback navigation in the
`st.page_link` call currently points to the wrong page module, which can break
the prerequisite flow if that target does not exist. Update the link in
`content/enrichment.py` to use the same registered statistics page path that
`app.py` registers for the statistics screen, keeping the label and icon
unchanged.
| id_col = "ProteinName" | ||
| if id_col not in final_statistics_report.columns: | ||
| st.error(f"❌ Structural Error: Column '{id_col}' is missing from the active matrix context.") | ||
| st.stop() | ||
|
|
||
| # --- SECTION 1: Parameter Setup & Dynamic Cutoff Labels --- | ||
| st.subheader("Configure Enrichment Thresholds") | ||
|
|
||
| # Check if target p-value should be adjusted or raw based on previous selections (Fallback safely to 'p-adj') | ||
| target_p_col = "p-adj" if "p-adj" in final_statistics_report.columns else "p-value" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Validate every column the enrichment call depends on.
This only checks ProteinName, but the execution path also requires log2FC and one of p-adj / p-value. Right now the page can pass an incompatible statistics table into calculate_go_enrichment and fail much later with a less useful error.
Proposed fix
-# Identify target identifier columns dynamically
-id_col = "ProteinName"
-if id_col not in final_statistics_report.columns:
- st.error(f"❌ Structural Error: Column '{id_col}' is missing from the active matrix context.")
- st.stop()
+required_cols = {"ProteinName", "log2FC"}
+missing_cols = required_cols - set(final_statistics_report.columns)
+if missing_cols:
+ st.error(
+ f"❌ Structural Error: Missing required columns: {', '.join(sorted(missing_cols))}."
+ )
+ st.stop()
+
+id_col = "ProteinName"
# --- SECTION 1: Parameter Setup & Dynamic Cutoff Labels ---
st.subheader("Configure Enrichment Thresholds")
-# Check if target p-value should be adjusted or raw based on previous selections (Fallback safely to 'p-adj')
-target_p_col = "p-adj" if "p-adj" in final_statistics_report.columns else "p-value"
+if "p-adj" in final_statistics_report.columns:
+ target_p_col = "p-adj"
+elif "p-value" in final_statistics_report.columns:
+ target_p_col = "p-value"
+else:
+ st.error("❌ Structural Error: Missing both 'p-adj' and 'p-value'.")
+ st.stop()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| id_col = "ProteinName" | |
| if id_col not in final_statistics_report.columns: | |
| st.error(f"❌ Structural Error: Column '{id_col}' is missing from the active matrix context.") | |
| st.stop() | |
| # --- SECTION 1: Parameter Setup & Dynamic Cutoff Labels --- | |
| st.subheader("Configure Enrichment Thresholds") | |
| # Check if target p-value should be adjusted or raw based on previous selections (Fallback safely to 'p-adj') | |
| target_p_col = "p-adj" if "p-adj" in final_statistics_report.columns else "p-value" | |
| required_cols = {"ProteinName", "log2FC"} | |
| missing_cols = required_cols - set(final_statistics_report.columns) | |
| if missing_cols: | |
| st.error( | |
| f"❌ Structural Error: Missing required columns: {', '.join(sorted(missing_cols))}." | |
| ) | |
| st.stop() | |
| id_col = "ProteinName" | |
| # --- SECTION 1: Parameter Setup & Dynamic Cutoff Labels --- | |
| st.subheader("Configure Enrichment Thresholds") | |
| if "p-adj" in final_statistics_report.columns: | |
| target_p_col = "p-adj" | |
| elif "p-value" in final_statistics_report.columns: | |
| target_p_col = "p-value" | |
| else: | |
| st.error("❌ Structural Error: Missing both 'p-adj' and 'p-value'.") | |
| st.stop() |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@content/enrichment.py` around lines 45 - 54, The enrichment setup currently
validates only ProteinName, but the later call path also depends on log2FC and a
valid p-value column. In content/enrichment.py, extend the existing preflight
checks near target_p_col selection to verify final_statistics_report contains
log2FC plus either p-adj or p-value before reaching calculate_go_enrichment. If
any required column is missing, show a clear st.error and st.stop early so the
page fails fast with a useful message.
| # Collect the evaluated lazy graph and convert back to Pandas for visualization | ||
| filtered_df = filtered_lazy.collect().to_pandas() | ||
| st.session_state["filtered_df"] = filtered_df | ||
|
|
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Invalidate downstream analysis state when the filter output changes.
After Line 144 writes a new filtered_df, the old imputed_df, normalized_df, and statistics_df remain live. content/normalization.py prefers imputed_df over filtered_df, and content/statistical.py prefers normalized_df/imputed_df, so later pages can keep running on stale pre-filter data.
Proposed fix
# Collect the evaluated lazy graph and convert back to Pandas for visualization
filtered_df = filtered_lazy.collect().to_pandas()
+ for key in ("imputed_df", "normalized_df", "statistics_df"):
+ st.session_state.pop(key, None)
st.session_state["filtered_df"] = filtered_df📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Collect the evaluated lazy graph and convert back to Pandas for visualization | |
| filtered_df = filtered_lazy.collect().to_pandas() | |
| st.session_state["filtered_df"] = filtered_df | |
| # Collect the evaluated lazy graph and convert back to Pandas for visualization | |
| filtered_df = filtered_lazy.collect().to_pandas() | |
| for key in ("imputed_df", "normalized_df", "statistics_df"): | |
| st.session_state.pop(key, None) | |
| st.session_state["filtered_df"] = filtered_df |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@content/filtering.py` around lines 142 - 145, The filter update in
filtering.py should also invalidate dependent analysis state so downstream pages
stop using stale data. In the filtering flow where filtered_df is written in the
filtering logic, clear or remove the existing imputed_df, normalized_df, and
statistics_df entries from st.session_state whenever the filter result changes.
Make the change in the same place that updates filtered_df so normalization and
statistical paths that read from imputed_df/normalized_df in
content/normalization.py and content/statistical.py will recompute from the new
filtered data instead of reusing old cached values.
| # Resolve lazy graph optimization tree and push to display data frame structure | ||
| imputed_df = imputed_lazy.collect().to_pandas() | ||
|
|
||
| # 💾 Save current output into Session State for down-stream processing (Normalization, Statistics) | ||
| st.session_state["imputed_df"] = imputed_df |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Clear derived normalization/statistics outputs after re-imputation.
When Line 128 overwrites imputed_df, the old normalized_df and statistics_df are left intact. content/statistical.py will keep preferring normalized_df, and content/enrichment.py can still read stale statistics_df, so the pipeline becomes internally inconsistent.
Proposed fix
# Resolve lazy graph optimization tree and push to display data frame structure
imputed_df = imputed_lazy.collect().to_pandas()
# 💾 Save current output into Session State for down-stream processing (Normalization, Statistics)
+ for key in ("normalized_df", "statistics_df"):
+ st.session_state.pop(key, None)
st.session_state["imputed_df"] = imputed_df📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Resolve lazy graph optimization tree and push to display data frame structure | |
| imputed_df = imputed_lazy.collect().to_pandas() | |
| # 💾 Save current output into Session State for down-stream processing (Normalization, Statistics) | |
| st.session_state["imputed_df"] = imputed_df | |
| # Resolve lazy graph optimization tree and push to display data frame structure | |
| imputed_df = imputed_lazy.collect().to_pandas() | |
| # 💾 Save current output into Session State for down-stream processing (Normalization, Statistics) | |
| for key in ("normalized_df", "statistics_df"): | |
| st.session_state.pop(key, None) | |
| st.session_state["imputed_df"] = imputed_df |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@content/imputation.py` around lines 124 - 128, After re-imputing in the
imputation flow, clear any downstream derived session-state artifacts that
depend on the old data, not just overwrite imputed_df. Update the logic around
imputed_lazy.collect().to_pandas() and the st.session_state assignments so
normalized_df and statistics_df are removed or reset whenever imputed_df is
refreshed, ensuring later steps like statistical and enrichment processing do
not consume stale outputs.
| # Finalize and collect pipeline query graph optimizations | ||
| normalized_df = processing_lazy.collect().to_pandas() | ||
|
|
||
| # 💾 Save processing checkpoint inside Session State for Downstream (Statistics Block) | ||
| st.session_state["normalized_df"] = normalized_df |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Invalidate stale statistics after normalization changes.
Line 170 updates normalized_df, but any existing statistics_df survives. Since content/enrichment.py reads statistics_df directly from session state, users can see enrichment results for a different normalization run until they manually rerun statistics.
Proposed fix
# Finalize and collect pipeline query graph optimizations
normalized_df = processing_lazy.collect().to_pandas()
# 💾 Save processing checkpoint inside Session State for Downstream (Statistics Block)
+ st.session_state.pop("statistics_df", None)
st.session_state["normalized_df"] = normalized_df📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Finalize and collect pipeline query graph optimizations | |
| normalized_df = processing_lazy.collect().to_pandas() | |
| # 💾 Save processing checkpoint inside Session State for Downstream (Statistics Block) | |
| st.session_state["normalized_df"] = normalized_df | |
| # Finalize and collect pipeline query graph optimizations | |
| normalized_df = processing_lazy.collect().to_pandas() | |
| # 💾 Save processing checkpoint inside Session State for Downstream (Statistics Block) | |
| st.session_state.pop("statistics_df", None) | |
| st.session_state["normalized_df"] = normalized_df |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@content/normalization.py` around lines 166 - 170, The normalization step
updates normalized_df but leaves any existing statistics_df in session state, so
downstream enrichment can use stale statistics after a new normalization run.
Update the normalization flow in the final checkpointing block that sets
st.session_state["normalized_df"] to also clear or invalidate
st.session_state["statistics_df"] whenever the normalized data changes, so
content/enrichment.py will not reuse mismatched statistics.
| if group_count == 2: | ||
| method_options = ["limma_like", "welch", "paired"] | ||
| help_text = "'limma_like' uses Empirical Bayes variance shrinking. 'welch' is for unequal variances. 'paired' is for dependent samples." | ||
| elif group_count >= 3: | ||
| method_options = ["limma_like", "anova"] | ||
| help_text = "'limma_like' supports multi-group design matrices. 'anova' computes standard row-wise One-way ANOVA." | ||
| else: | ||
| st.error("❌ Statistical testing requires at least 2 unique sample groups.") | ||
| st.stop() | ||
|
|
||
| selected_method = st.selectbox( | ||
| "Select Statistical Test", | ||
| options=method_options, | ||
| index=0, | ||
| help=help_text | ||
| ) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Remove the paired-test option until pair metadata exists.
Nothing in this flow captures pair assignments, so paired can only pair samples by incidental column order. That makes the result statistically wrong for ordinary unpaired cohorts.
Proposed fix
if group_count == 2:
- method_options = ["limma_like", "welch", "paired"]
- help_text = "'limma_like' uses Empirical Bayes variance shrinking. 'welch' is for unequal variances. 'paired' is for dependent samples."
+ method_options = ["limma_like", "welch"]
+ help_text = "'limma_like' uses Empirical Bayes variance shrinking. 'welch' is for unequal variances."📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if group_count == 2: | |
| method_options = ["limma_like", "welch", "paired"] | |
| help_text = "'limma_like' uses Empirical Bayes variance shrinking. 'welch' is for unequal variances. 'paired' is for dependent samples." | |
| elif group_count >= 3: | |
| method_options = ["limma_like", "anova"] | |
| help_text = "'limma_like' supports multi-group design matrices. 'anova' computes standard row-wise One-way ANOVA." | |
| else: | |
| st.error("❌ Statistical testing requires at least 2 unique sample groups.") | |
| st.stop() | |
| selected_method = st.selectbox( | |
| "Select Statistical Test", | |
| options=method_options, | |
| index=0, | |
| help=help_text | |
| ) | |
| if group_count == 2: | |
| method_options = ["limma_like", "welch"] | |
| help_text = "'limma_like' uses Empirical Bayes variance shrinking. 'welch' is for unequal variances." | |
| elif group_count >= 3: | |
| method_options = ["limma_like", "anova"] | |
| help_text = "'limma_like' supports multi-group design matrices. 'anova' computes standard row-wise One-way ANOVA." | |
| else: | |
| st.error("❌ Statistical testing requires at least 2 unique sample groups.") | |
| st.stop() | |
| selected_method = st.selectbox( | |
| "Select Statistical Test", | |
| options=method_options, | |
| index=0, | |
| help=help_text | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@content/statistical.py` around lines 100 - 115, The statistical test selector
in the group_count == 2 branch currently exposes "paired" even though no pair
metadata is captured, so remove that option from method_options in the selectbox
flow and keep only unpaired methods like "limma_like" and "welch"; update any
related help_text or validation in content/statistical.py so the UI no longer
suggests paired analysis until explicit pair assignment support exists.
| elif group_count >= 3: | ||
| method_options = ["limma_like", "anova"] | ||
| help_text = "'limma_like' supports multi-group design matrices. 'anova' computes standard row-wise One-way ANOVA." | ||
| else: | ||
| st.error("❌ Statistical testing requires at least 2 unique sample groups.") | ||
| st.stop() | ||
|
|
||
| selected_method = st.selectbox( | ||
| "Select Statistical Test", | ||
| options=method_options, | ||
| index=0, | ||
| help=help_text | ||
| ) | ||
|
|
||
| with col2: | ||
| st.markdown("### 🛡️ 2. Multiple Testing Correction (FDR)") | ||
| selected_fdr = st.selectbox( | ||
| "Select FDR Adjustment Strategy", | ||
| options=["BH", "Bonferroni", "None"], | ||
| index=0, | ||
| help="'BH' (Benjamini-Hochberg) controls False Discovery Rate. 'Bonferroni' is strict Family-Wise Error Rate control." | ||
| ) | ||
|
|
||
| # --- SECTION 3: Statistical Query Execution --- | ||
| st.markdown("<br>", unsafe_allow_html=True) | ||
| if st.button("Run Statistical Analysis", type="primary"): | ||
|
|
||
| # Convert active pandas dataframe into polars lazyframe graph | ||
| stats_lazy = pl.from_pandas(base_df).lazy() | ||
|
|
||
| try: | ||
| # Execute Chain 1: Calculate core statistics (Adds log2FC, stat, p-value) | ||
| stats_lazy = calculate_statistical_tests( | ||
| quantification_data=stats_lazy, | ||
| metadata=metadata_pl, | ||
| method=selected_method | ||
| ) | ||
|
|
||
| # Execute Chain 2: Adjust Multiple Testing (Adds p-adj) | ||
| stats_lazy = adjust_fdr_lazy( | ||
| quantification_data=stats_lazy, | ||
| strategy=selected_fdr | ||
| ) | ||
|
|
||
| # Resolve lazy graph optimization tree and bring back to pandas memory | ||
| statistics_df = stats_lazy.collect().to_pandas() | ||
|
|
||
| # 💾 Save processing checkpoint inside Session State for Downstream (e.g., Volcano plot, Volcano/Heatmap UI) | ||
| st.session_state["statistics_df"] = statistics_df | ||
|
|
||
| st.success(f"Successfully calculated **{selected_method}** test with **{selected_fdr}** FDR correction!") | ||
|
|
||
| # Display the finalized statistics table view | ||
| st.subheader("Statistical Analysis Results") | ||
| st.markdown(f"Generated framework containing columns: `ProteinName`, `log2FC`, `stat`, `p-value`, `p-adj`, `PeptideSequence`") |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
The ANOVA path does not match the downstream log2FC contract.
For 3+ groups, this page allows anova, but Line 157 documents a log2FC output and content/enrichment.py filters on |log2FC|. A one-way ANOVA does not produce a single fold-change per protein, so this branch is incompatible with the current downstream schema.
🧰 Tools
🪛 Ruff (0.15.18)
[error] 157-157: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@content/statistical.py` around lines 103 - 157, The `anova` option in the
`selected_method` flow is incompatible with the downstream `log2FC`-based schema
used after `calculate_statistical_tests` and by consumers like
`content/enrichment.py`. Update the 3+ group branch in `content/statistical.py`
so `method_options` only includes methods that still produce a single per-row
`log2FC`/p-value contract, or change the downstream contract and documentation
if ANOVA must remain. Make sure the selected method path, result table
messaging, and any dependent filtering logic stay consistent with the same
output columns.
| groups = sorted(df["Group"].unique()) | ||
|
|
||
| if len(groups) < 2: | ||
| return None | ||
|
|
||
| group1, group2 = groups[:2] | ||
|
|
||
| # Compute statistics per protein | ||
| stats_rows = [] | ||
| for protein, protein_df in df.groupby("ProteinName"): | ||
| g1_vals = protein_df[protein_df["Group"] == group1]["Intensity"].values | ||
| g2_vals = protein_df[protein_df["Group"] == group2]["Intensity"].values | ||
|
|
||
| if len(g1_vals) < 2 or len(g2_vals) < 2: | ||
| pval = np.nan | ||
| else: | ||
| _, pval = ttest_ind(g1_vals, g2_vals, equal_var=False) | ||
|
|
||
| mean_g1 = np.mean(g1_vals) if len(g1_vals) > 0 else np.nan | ||
| mean_g2 = np.mean(g2_vals) if len(g2_vals) > 0 else np.nan | ||
|
|
||
| log2fc = np.log2(mean_g2 / mean_g1) if mean_g1 > 0 else np.nan | ||
|
|
||
| stats_rows.append({ | ||
| "ProteinName": protein, | ||
| "log2FC": log2fc, | ||
| "p-value": pval, | ||
| }) | ||
|
|
||
| stats_df = pd.DataFrame(stats_rows) | ||
|
|
||
| if not stats_df.empty: | ||
| mask = stats_df["p-value"].notna() | ||
| if mask.any(): | ||
| _, p_adj, _, _ = multipletests(stats_df.loc[mask, "p-value"], method="fdr_bh") | ||
| stats_df.loc[mask, "p-adj"] = p_adj | ||
| else: | ||
| stats_df["p-adj"] = np.nan | ||
|
|
||
| # Order samples by group (group2 first, then group1) | ||
| # 3. Define ordering and build sample arrays | ||
| sample_group_df = df[["Sample", "Group"]].drop_duplicates() | ||
| group2_samples = sample_group_df[sample_group_df["Group"] == group2]["Sample"].tolist() | ||
| group1_samples = sample_group_df[sample_group_df["Group"] == group1]["Sample"].tolist() | ||
| all_samples = group2_samples + group1_samples | ||
|
|
||
| # Build pivot table | ||
| group1_samples = sample_group_df[sample_group_df["Group"] == groups[0]][ | ||
| "Sample" | ||
| ].tolist() | ||
| group2_samples = sample_group_df[sample_group_df["Group"] == groups[1]][ | ||
| "Sample" | ||
| ].tolist() | ||
| all_samples = group1_samples + group2_samples |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Don't truncate the pivot table to the first two groups.
all_samples is built from groups[0] and groups[1] only, so any third-or-later group is dropped from pivot_df. That silently removes data before it reaches the multi-group paths introduced in this PR.
Proposed fix
- groups = sorted(df["Group"].unique())
+ groups = sorted(df["Group"].unique())
if len(groups) < 2:
return None
# 3. Define ordering and build sample arrays
sample_group_df = df[["Sample", "Group"]].drop_duplicates()
- group1_samples = sample_group_df[sample_group_df["Group"] == groups[0]][
- "Sample"
- ].tolist()
- group2_samples = sample_group_df[sample_group_df["Group"] == groups[1]][
- "Sample"
- ].tolist()
- all_samples = group1_samples + group2_samples
+ all_samples = []
+ for group in groups:
+ all_samples.extend(
+ sample_group_df.loc[sample_group_df["Group"] == group, "Sample"].tolist()
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| groups = sorted(df["Group"].unique()) | |
| if len(groups) < 2: | |
| return None | |
| group1, group2 = groups[:2] | |
| # Compute statistics per protein | |
| stats_rows = [] | |
| for protein, protein_df in df.groupby("ProteinName"): | |
| g1_vals = protein_df[protein_df["Group"] == group1]["Intensity"].values | |
| g2_vals = protein_df[protein_df["Group"] == group2]["Intensity"].values | |
| if len(g1_vals) < 2 or len(g2_vals) < 2: | |
| pval = np.nan | |
| else: | |
| _, pval = ttest_ind(g1_vals, g2_vals, equal_var=False) | |
| mean_g1 = np.mean(g1_vals) if len(g1_vals) > 0 else np.nan | |
| mean_g2 = np.mean(g2_vals) if len(g2_vals) > 0 else np.nan | |
| log2fc = np.log2(mean_g2 / mean_g1) if mean_g1 > 0 else np.nan | |
| stats_rows.append({ | |
| "ProteinName": protein, | |
| "log2FC": log2fc, | |
| "p-value": pval, | |
| }) | |
| stats_df = pd.DataFrame(stats_rows) | |
| if not stats_df.empty: | |
| mask = stats_df["p-value"].notna() | |
| if mask.any(): | |
| _, p_adj, _, _ = multipletests(stats_df.loc[mask, "p-value"], method="fdr_bh") | |
| stats_df.loc[mask, "p-adj"] = p_adj | |
| else: | |
| stats_df["p-adj"] = np.nan | |
| # Order samples by group (group2 first, then group1) | |
| # 3. Define ordering and build sample arrays | |
| sample_group_df = df[["Sample", "Group"]].drop_duplicates() | |
| group2_samples = sample_group_df[sample_group_df["Group"] == group2]["Sample"].tolist() | |
| group1_samples = sample_group_df[sample_group_df["Group"] == group1]["Sample"].tolist() | |
| all_samples = group2_samples + group1_samples | |
| # Build pivot table | |
| group1_samples = sample_group_df[sample_group_df["Group"] == groups[0]][ | |
| "Sample" | |
| ].tolist() | |
| group2_samples = sample_group_df[sample_group_df["Group"] == groups[1]][ | |
| "Sample" | |
| ].tolist() | |
| all_samples = group1_samples + group2_samples | |
| groups = sorted(df["Group"].unique()) | |
| if len(groups) < 2: | |
| return None | |
| # 3. Define ordering and build sample arrays | |
| sample_group_df = df[["Sample", "Group"]].drop_duplicates() | |
| all_samples = [] | |
| for group in groups: | |
| all_samples.extend( | |
| sample_group_df.loc[sample_group_df["Group"] == group, "Sample"].tolist() | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/common/results_helpers.py` around lines 234 - 246, The sample grouping
logic in the results helper is truncating to only the first two groups, which
drops any additional groups before downstream processing. Update the grouping
and sample selection in the results-building flow around the unique symbols df,
sample_group_df, group1_samples, group2_samples, and all_samples so it preserves
every group present in the data instead of hard-coding groups[0] and groups[1].
Make sure the pivot/multi-group path receives the full set of samples and
groups, with no silent filtering when more than two groups exist.
There was a problem hiding this comment.
🧹 Nitpick comments (4)
requirements.txt (1)
152-153: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDrop the duplicate
polarsrequirement.
polarsis already pinned with a floor at Line 144. Keeping a second bare entry here is redundant and makes future version updates easier to miss.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@requirements.txt` around lines 152 - 153, Remove the redundant bare polars entry from the requirements list, since polars is already specified earlier with a version floor. Keep the pinned occurrence only and update the dependency block so the requirements remain singular and easier to maintain.content/results_volcano.py (2)
44-84: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winWrap the threshold controls and render path in
@st.fragment.Both sliders currently rerun the full page and rebuild the component on every change. A fragment keeps those interactions scoped to the plot.
As per coding guidelines,
**/*.py: Use@st.fragmentdecorator for interactive UI updates without full page reloads.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/results_volcano.py` around lines 44 - 84, The threshold sliders and VolcanoPlot render path are triggering full page reruns instead of a scoped update. Move the UI block in results_volcano.py into a Streamlit fragment by wrapping the section that creates fc_thresh, p_thresh, VolcanoPlot, and the call through volcano_plot_component in a function decorated with `@st.fragment`, so slider changes only refresh the plot area and not the entire page.Source: Coding guidelines
61-73: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winSet an adjusted-p-value Y label explicitly.
This plot is driven by
p-adj, but the generatedquantms_volcano_plot/manifest.jsonstill labels the Y axis as-log10(p-value). That will misstate what the chart is showing unless you override the label here.Possible fix
volcano_plot_component = VolcanoPlot( cache_id="quantms_volcano_plot", data=volcano_pl_lazy, log2fc_column="log2FC", pvalue_column="p-adj", label_column="ProteinName", + y_label="-log10(adjusted p-value)", up_color="`#E74C3C`", down_color="`#3498DB`", ns_color="`#95A5A6`", show_threshold_lines=True, threshold_line_style="dash", )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/results_volcano.py` around lines 61 - 73, The VolcanoPlot setup is using p-adj as the p-value source but still leaves the generated Y-axis label misleading. Update the VolcanoPlot constructor in results_volcano.py to explicitly set the Y-axis label for adjusted p-values, using the component’s label override option rather than the default -log10(p-value) text. Keep the existing cache_id, pvalue_column, and other settings unchanged.content/results_heatmap.py (1)
45-97: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winWrap the slider-driven render path in
@st.fragment.Changing
top_nreruns the full page, including data prep and component setup. Moving this block into a fragment keeps the update local and matches the repo pattern.As per coding guidelines,
**/*.py: Use@st.fragmentdecorator for interactive UI updates without full page reloads.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@content/results_heatmap.py` around lines 45 - 97, The `top_n` slider-driven heatmap render path in `results_heatmap.py` is causing a full-page rerun instead of a local UI update. Wrap the block that builds `var_series`, `top_proteins`, `heatmap_z`, `melted_df`, and renders `Heatmap`/`state_manager` in an `@st.fragment`-decorated function so the `st.slider` interaction updates only that section. Keep the existing logic and symbols like `top_n`, `Heatmap`, and `state_manager`, but move them into the fragment to match the repo’s interactive update pattern.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@content/results_heatmap.py`:
- Around line 45-97: The `top_n` slider-driven heatmap render path in
`results_heatmap.py` is causing a full-page rerun instead of a local UI update.
Wrap the block that builds `var_series`, `top_proteins`, `heatmap_z`,
`melted_df`, and renders `Heatmap`/`state_manager` in an
`@st.fragment`-decorated function so the `st.slider` interaction updates only
that section. Keep the existing logic and symbols like `top_n`, `Heatmap`, and
`state_manager`, but move them into the fragment to match the repo’s interactive
update pattern.
In `@content/results_volcano.py`:
- Around line 44-84: The threshold sliders and VolcanoPlot render path are
triggering full page reruns instead of a scoped update. Move the UI block in
results_volcano.py into a Streamlit fragment by wrapping the section that
creates fc_thresh, p_thresh, VolcanoPlot, and the call through
volcano_plot_component in a function decorated with `@st.fragment`, so slider
changes only refresh the plot area and not the entire page.
- Around line 61-73: The VolcanoPlot setup is using p-adj as the p-value source
but still leaves the generated Y-axis label misleading. Update the VolcanoPlot
constructor in results_volcano.py to explicitly set the Y-axis label for
adjusted p-values, using the component’s label override option rather than the
default -log10(p-value) text. Keep the existing cache_id, pvalue_column, and
other settings unchanged.
In `@requirements.txt`:
- Around line 152-153: Remove the redundant bare polars entry from the
requirements list, since polars is already specified earlier with a version
floor. Keep the pinned occurrence only and update the dependency block so the
requirements remain singular and easier to maintain.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ffb4a024-5385-44da-b83e-14d74207c3d2
⛔ Files ignored due to path filters (3)
quantms_protein_heatmap/preprocessed/level_0.parquetis excluded by!**/*.parquetquantms_protein_heatmap/preprocessed/level_1.parquetis excluded by!**/*.parquetquantms_volcano_plot/preprocessed/volcanoData.parquetis excluded by!**/*.parquet
📒 Files selected for processing (8)
app.pycontent/results_heatmap.pycontent/results_pca.pycontent/results_volcano.pyquantms_protein_heatmap/manifest.jsonquantms_volcano_plot/manifest.jsonrequirements.txtsrc/common/results_helpers.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/common/results_helpers.py
Implement 5 new Streamlit analysis pipeline pages powered by the openms-insight engine to handle downstream proteomics workflows:
Data Filtering (filtering.py):
Missing Value Imputation (imputation.py):
Data Normalization & Scaling (normalization.py):
Statistical Inference (statistical.py):
GO Enrichment Analysis (enrichment.py):
Summary by CodeRabbit
polarsto required dependencies.