Commit 3e61c8e

Authored by mdr223, tli2, chjuncn, vitaglianog, and sivaprasadsudhir
Add Support For Azure OpenAI Models; Deprecate Llama 3.2 3B (#293)
* Fix broken dependencies (#227)
* Move DataRecord Internal Fields to Have Leading Underscore (#229)
* update README
* 1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py (#78)
* Support add_columns in Dataset; support demo in df-newinterface.py. Currently we have to do:

      records, _ = qr3.run()
      outputDf = DataRecord.to_df(records)

  I'll try to make qr3.run().to_df() work in another PR.
* ruff check --fix
* Support run().to_df(). Update run() to return a DataRecordCollection, so that it will be easier for users to support more features for run() output. We support to_df() in this change. I'll send out following commits to update the other demos.
* ruff check --fix
* fix typo in DataRecordCollection
* Update records.py
* fix tiny bug in mab processor. The code will run into an issue if we don't return any stats for this function in:

      max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
      if (
          not prev_logical_op_is_filter
          or (
              prev_logical_op_is_filter
              and max_quality_record_set.record_op_stats[0].passed_operator
          )
      )

* update record.to_df interface to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None), which is consistent with the other functions in this class.
* Update demo for the new execute() output format
* better way to get plan from output.run()
* fix getting plan from DataRecordCollection. People used to get the plan from execute() of the streaming processor, which is not a good practice. I updated plan_str to plan_stats, and they need to get the physical plan from the processor. Consider better ways to provide the executed physical plan to DataRecordCollection, possibly from stats.
* Update df-newinterface.py
* update code based on comments from Matt: 1. add cardinality param in add_columns; 2. remove extra testdata files; 3. add __iter__ in DataRecordCollection to help iterate over streaming output.
* see if copilot just saved me 20 minutes
* fix package name
* use sed to get version from pyproject.toml
* bump project version; keep docs behind to test ci pipeline
* bumping docs version to match code version
* use new __iter__ method in demos where possible
* add type hint for output of __iter__; use __iter__ in unit tests
* Update download-testdata.sh (#89): added enron-tiny.csv
* Clean up the retrieve API (#79)
* Clean up the retrieve operator interface
* fix comments
* Update to the new to_df() API
* Code update for #84 (#101)
* Create chat.rst (#96)
* Create chat.rst
* Update pyproject.toml: hotfix for chat
* Update conf.py: hotfix for chat.rst
* code update for #84. This implementation basically resolves #84. One implementation detail differs from #84:

      .add_columns(
          cols=[
              {"name": "sender", "type": "string", "udf": compute_sender},
              ...
          ]
      )

  If add_columns() used cols, udf, and types as params, it would make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.
* changed types to make use of Python type system; updated use of types in tests; updated docs and README
* update test to match no longer allowing None default

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Skip an operator if this is a duplicate op instead of raise error (#102)
* Create chat.rst (#96)
* Create chat.rst
* Update pyproject.toml: hotfix for chat
* Update conf.py: hotfix for chat.rst
* Skip an operator when it doesn't need any logical op instead of raising an error.

  # Final Effects
  1. Dataset() init only has one responsibility: wrap a datasource into a Dataset. I think this is a better interface.
  2. No extra convert() will be added to the plan.
  3. When users add the same op multiple times, e.g. dataset.convert(File).convert(File), the system will just dedup the same op instead of raising an error.

  # Issue
  Currently, Dataset(src, schema) initiation has 2 responsibilities: 1. read the source; 2. convert the source to the schema. When we use a default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like: 1. Read the source into the schema that the DataSource provides. This schema is derived by the system, so the users don't know it (and don't need to know). 2. Convert the source schema to DefaultSchema. So every time, the system will make one more convert call to convert SourceSchema to DefaultSchema, which is definitely wrong.

  # Solution
  1. We use the schema from the Datasource if it exists, which is reasonable.
  2. If we do 1, then we'll get a dataset node with no actual op, as its input_schema == output_schema, so I updated a line in the optimizer to just skip the node if it doesn't do anything instead of raising an error.

  # Real Examples
  ## Before
  Generated plan:
  0. MarshalAndScanDataOp -> PDFFile
  1. PDFFile -> LLMConvertBonded -> DefaultSchema
     (contents, filename, text_conte) -> (value)
     Model: Model.GPT_4o
     Prompt Strategy: PromptStrategy.COT_QA
  2. DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper
     (value) -> (contents, filename, paper_auth)
     Prompt Strategy: None
     Proposer Models: [GPT_4o]
     Temperatures: [0.0]
     Aggregator Model: Model.GPT_4o
     Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
     Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation
  3. ScientificPaper -> LLMFilter -> ScientificPaper
     (contents, filename, paper_auth) -> (contents, filename, paper_auth)
     Model: Model.GPT_4o
     Filter: The paper mentions phosphorylation of Exo1
  4. ScientificPaper -> MixtureOfAgentsConvert -> Reference
     (contents, filename, paper_auth) -> (reference_first_author, refere)
     Prompt Strategy: None
     Proposer Models: [GPT_4o]
     Temperatures: [0.8]
     Aggregator Model: Model.GPT_4o
     Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
     Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

  ## After
  Generated plan:
  0. MarshalAndScanDataOp -> PDFFile
  1. PDFFile -> LLMConvertBonded -> ScientificPaper
     (contents, filename, text_conte) -> (contents, filename, paper_auth)
     Model: Model.GPT_4o
     Prompt Strategy: PromptStrategy.COT_QA
  2. ScientificPaper -> LLMFilter -> ScientificPaper
     (contents, filename, paper_auth) -> (contents, filename, paper_auth)
     Model: Model.GPT_4o
     Filter: The paper mentions phosphorylation of Exo1
  3. ScientificPaper -> MixtureOfAgentsConvert -> Reference
     (contents, filename, paper_auth) -> (reference_first_author, refere)
     Prompt Strategy: None
     Proposer Models: [GPT_4o]
     Temperatures: [0.8]
     Aggregator Model: Model.GPT_4o
     Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
     Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

* make equality check for new field names a bit more explicit
* fix fixture usage
* update all plans within the code base to explicitly convert when needed; removed unnecessary schemas for reading from datasource

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Refactor demos to use .sem_add_columns or .add_columns instead of convert(), remove Schema from demos when possible. (#104)
* Create chat.rst (#96)
* Create chat.rst
* Update pyproject.toml: hotfix for chat
* Update conf.py: hotfix for chat.rst
* code update for #84. This implementation basically resolves #84. One implementation detail differs from #84:

      .add_columns(
          cols=[
              {"name": "sender", "type": "string", "udf": compute_sender},
              ...
          ]
      )

  If add_columns() used cols, udf, and types as params, it would make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.
* use field_values instead of field_types, since field_values have the actual key-value pairs, while field_types just contain fields and their types. records[0].schema is the schema of the output, which doesn't mean we already populated the schema into the record.
* Remove .convert() and use .sem_add_columns or .add_columns instead. This change is based on #101 and #102; please review them first, then this change. 1. This refactors all demos to use .sem_add_columns or .add_columns, and removes .convert(). 2. Remove Schema from demos, except demos using ValidationDataSource and dataset.retrieve() that need schema for now. We can refactor these cases later.
* ruff check --fix
* fix unittest
* demos fixed and unit tests running
* fix add_columns --> sem_add_columns in demo
* update quickstart to reflect code changes; shorten text as much as possible
* passing unit tests
* remove convert() everywhere
* fixes to correct errors in demos; update quickstart and docs

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Simplify Datasource (#103)

  ## Summary of PR changes

  **Note 1:** I did not change anything related to val_datasource (including tangential functions like Dataset._set_data_source()) as that will all be modified in a subsequent PR to reflect our discussion re: validation data.

  **Note 2:** I have completely commented out datamanager.py and config.py; for now I am willing to leave the code around in case we desperately need it for PalimpChat. However, my hope is that PalimpChat can be tweaked to work without the data manager and those files can be deleted before merging dev into main.

  **Note 3:** Despite the branch name, fixing the progress managers will be part of a separate PR.
  - Collapsed all four `DataSource` classes down to a single `DataReader` class
  - Limit the number of methods the user needs to implement to just `__len__()` and `__getitem__()`
  - (Switched from using `get_item()` --> `__getitem__()` in `DataReader`)
  - Provided `DataReader` directly to scan operators (also renamed `DataSourcePhysicalOp` --> `ScanPhysicalOp`)
  - Removed `DataDirectory()` from `src/` entirely; this included commenting out things which made use of the cache (e.g. caching computed `DataRecords` and codegen examples)
  - Got rid of `dataset_id` everywhere (which tracks with the previous bullet)
  - Removed the `Config` class, which was a relic of a bygone era (and also intertwined with the `DataDirectory()`)
  - Updated all demos to use `import palimpzest as pz` to make the import statement(s) more welcoming
  - Fixed one bug resulting from converts now producing union schemas. Instead of including the `output_schema` in an operator's `get_id_params()`, we simply report the `generated_fields`.
  - Changed `source_id` --> `source_idx` everywhere (this eliminated some weird renaming logic)
  - Finally, I added a large set of documentation for the DataSource class(es)

* Multi-LLM Refinement Pipeline for Query Output Validation (#118)
* Multi-LLM Refinement Pipeline for Query Output Validation (#92)

  ## Summary of PR
  This PR contains the work to add a new `CriticConvert` physical operator to PZ. At a high level, this operator runs a bonded convert, and then asks a critic model if the answer produced by the bonded convert can be improved upon. The original output and the critique are then fed into a refinement model, which produces the improved output. The work to implement this includes:
  1. Defining the physical operator in `src/palimpzest/query/operators/critique_and_refine_convert.py`
  2. Adding an implementation rule for this physical operator in `src/palimpzest/query/optimizer/rules.py`
  3. Adding boolean flag(s) to enable allowing / disallowing this physical optimization
  4. Adding base prompts for the critique and refinement generations

  One other change which this work spawned was an attempt to improve the management and construction of our prompts -- and to decouple this logic from the `BaseGenerator` class. On the management side, I split our single `prompts.py` file into a set of files. On the construction side, I created a `PromptFactory` class which templates prompts based on the `prompt_strategy` and input record. The `PromptFactory` is not a perfect solution, but I think it is a step in the right direction. Finally, I fixed an error which previously filtered out `RAGConvert` operators from being considered by the `Optimizer`, and I made 2-3 more miscellaneous small tweaks.

---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* MkDocs Site for Palimpzest API Documentation (#116)

  ## Summary of PR Changes
  1. Changed `docs` to use [MkDocs](https://www.mkdocs.org/) instead of Sphinx
  2. Created initial `Getting Started` content
  3. Created placeholders for `User Guide` content (to follow in a subsequent PR)
  4. Added autogenerated docs for our most user-facing code (we will need to add docstrings to our code in a subsequent PR)
  5. Made small tweaks to `src/` to allow users to specify policy using kwargs in `.run()`
  6. Renamed the `testdata/enron-tiny/` files so that they're not so damn weird

---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* remove registration of sources from CI; only check version bump if there is a code change
* remove filter for only checking version bump when src files changed
* Rename `nocache` --> `cache` everywhere (#128)
* first commit
* Removed myenv
* added to git ignore
* addressed the comments in review
* flip one minor comment
* minor spacing fix
* fix spaces in a few more spots

---------

Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* adding citation (and making 'others' explicit) (#136)
* Make Generator thread-safe (#139)
* fix moa prompt
* fix moa prompt aggregator
* update version
* make generator thread-safe
* update generator to return messages
* address comments
* Begin Process of Improving Index Abstraction(s) in PZ (#138)
* quick and dirty implementation which tracks retrieve costs
* bug fixes and currently unused index code
* add default search func which I forgot to implement and add chromadb to pyproject.toml
* leaving TODO
* hotfix to add cost for retrieve operation
* another hotfix to add ragatouille dependency
* Add logger for PZ (#134)
* add logger for PZ: 1. when verbose=True, we save all logs to log_file and print them on console; 2. when verbose=False, we only save ERROR+ logs to file and print ERROR+. I just added logging where I think it might be important for the execution; we can always add/remove more or less. Also, I might update the logging messages based on my later annotation work, but this PR should set up the logging mechanism for now.
* ruff fix
* update code based on comments: 1. not logging output_records; 2. not logging plan_stats; 3. move the log files to ".pz_logs"

---------

Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* fix merge bug (#141)
* ruff fix
* update log dir and fix tiny bug
* fix merge bug
* Use a singleton API client for operators (#140)
* fix moa prompt
* fix moa prompt aggregator
* update version
* make generator thread-safe
* update generator to return messages
* address comments
* create a singleton API client
* fix linting
* fix logging in generators
* also create parent dir. if missing
* CUAD benchmark (#143)
* fix moa prompt
* fix moa prompt aggregator
* update version
* make generator thread-safe
* update generator to return messages
* address comments
* create a singleton API client
* fix linting
* fix logging in generators
* fix CUAD benchmark
* fix type
* minor fixes
* Limit the Scope of Logging within the Optimizer (#144)
* making it possible to set log level based on env. variable; adding time limit on seven filters test
* deleting instead of commenting out
* Remove Conventional LLM Convert; Update Bonded LLM Convert retry logic (#145)
* use NullHandler in __init__ and let application control logging config (#146)
* use NullHandler in __init__ and let application control logging config
* ruff fix
* Fix Progress Manager and Simplify `execute_plan` methods (#148)
* modifying ProgressManager class to allow for dynamically adding tasks
* beginning to use new progress manager
* initial rewrite of execute_plan methods with new progress manager
* unit tests passing
* trim a few lines
* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR
* enable final operator to show progress in parallel
* address comments
* The great deletion (#149)
* Adding Preliminary Work on Abacus and MAB Sentinel Execution (#147)
* updating models to avoid llama3
* fix parsing bugs and some generation errors
* don't require json for proposer and code synth generations; fix prompt format instruction for proposers
* fix typo/bug
* fix bugs in generator prep for field_answers; fix bug in filter impl.; other improvements
* adding new file for abacus workload
* fix len
* fix errors with dataset copy; prompt construction; and more
* remove JSON instruction from MOA proposer
* fixed bugs in optimizer configuration, llama 3.3 generation, and filter generation
* clean up demos; fix missing base prompt from map
* add one more missing base prompt
* prepare demo for full run; get embedding cost info from RAGConvert; use reasoning output from Critique
* add script to generate text-embedding-3-small reaction embeddings
* write to .chroma
* run full scale generation
* compute embeddings slowly and add progress bar
* add sleep
* fix import
* add total iters
* create embeddings before ingesting
* fix index start and finish
* load embeddings and insert directly
* make chroma use cosine sim.; finish initial search fcn. for biodex workload; naming tweak in rag convert
* capturing gen stats in Retrieve
* added UDF map operator; rewrote biodex pipeline to match docetl impl.; switched to using __name__ for functions instead of str()
* add optimizations back in
* write data to csv in demo
* limit to same model choice(s) as docetl and lotus
* fix punctuation error(s)
* try run without filter
* remove unused demo file
* remove print
* remove prints
* remove costed_phys_op_ids which were used for debugging
* try slightly diff. approach
* remove temp changes while branch is in PR review
* remove depends_on for map
* fix iteration bug in sentinel processors
* one more hotfix
* fix more errors w/SentinelPlanStats and sentinel processors
* remove logger lib to reduce confusion (#159)
* Update research.md (#160): AISD @ NAACL 2025
* Add Pneuma-Palimpzest Integration Demo (#158)
* Add Pneuma demo
* Remove dataset semantic column addition
* Fix progress managers episode 2 attack of the clones (#156)
* modifying ProgressManager class to allow for dynamically adding tasks
* beginning to use new progress manager
* initial rewrite of execute_plan methods with new progress manager
* unit tests passing
* trim a few lines
* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR
* enable final operator to show progress in parallel
* initial work to refactor sentinel processors
* passing unit tests
* checking in minor changes
* remove use of setup_logger inside library
* stuff seems to be working
* big print
* turn off rag for test
* try debugging exception
* checking in code before changes to scoring
* finished initial refactoring of mab sentinel execution strategy
* get random sampling execution working with changes
* passing unit tests
* nosentinel progress looks good
* eyeball test is working for progress bars
* remove the old gods
* revert small change
* pull up progress manager logic in parallel execution
* catch errors in generating embeddings
* fix comments
* Merging in Changes for Sentinel Progress Bars; Split Convert (off by default); `demos/enron-demo.py`; and MMQA Benchmark (#163)
* modifying ProgressManager class to allow for dynamically adding tasks
* beginning to use new progress manager
* initial rewrite of execute_plan methods with new progress manager
* unit tests passing
* trim a few lines
* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR
* enable final operator to show progress in parallel
* initial work to refactor sentinel processors
* passing unit tests
* checking in minor changes
* remove use of setup_logger inside library
* stuff seems to be working
* big print
* turn off rag for test
* try debugging exception
* checking in code before changes to scoring
* finished initial refactoring of mab sentinel execution strategy
* get random sampling execution working with changes
* passing unit tests
* nosentinel progress looks good
* eyeball test is working for progress bars
* remove the old gods
* revert small change
* pull up progress manager logic in parallel execution
* adding prints to generator; turn progress off in favor of verbose for now
* catch errors in generating embeddings
* inspect frontier updates
* remove args.workload
* fix num_inputs in selectivity computation
* pdb in score
* fixed score fn issue
* use execution cache to avoid unnecessary computation; use sentinel stats for updating frontier
* fix progress counter
* debug
* fix empty stats
* only count stats from newly computed results
* fix tuple unpacking
* only update sample counts for llm ops
* de-dup duplicate record
* ugh
* dont forget to increment
* plz
* more plz
* increment
* recycle ops back onto reservoir so they may be reconsidered in the future
* remove pdb
* add progress to script args
* try without rag
* use term recall
* just check in on term recall
* make it easier to turn off progress
* remove pdb
* try to get re-rank to keep all inputs
* try to generate more reactions
* track total LLM calls
* 10x parallelism
* try retrieve directly on fulltext
* up max workers
* adding enron-demo w/optimization
* remove config option
* adding recall and precision to output
* allow operators to be recycled back onto frontier
* revert to using reactions instead of fulltext for similarity
* better cycling of off-frontier operators
* safety check on reservoir ops
* remove pdb
* fixing 5 results per query
* investigate sampling behavior
* check on seeds
* remove pdb
* test SplitConvert
* debug chunking
* fix bug in rag and split convert
* run with chunks
* test chunking logic
* fix chunking logic
* sum list
* remove split merge for now
* minor fixes to CUAD script
* add embedding scripts for mmqa tables and image titles
* address issue with empty titles and title collisions
* prepare script for using clip embeddings for images
* fix bug
* get full space of possible extensions
* debug
* weird bug fix?
* more debug
* fix idiotic mistake
* handle corrupted images and minor things
* add another corrupted image
* another one
* anotha
* more bad images
* last disallow file
* prepare cuad for runs
* specify execution strategy
* up samples
* add sentinel execution strategy to output name
* adding plan str and more stats
* specify no prior
* verbose=False
* fix comment; comment out prints
* make split merge optional for now
* addressing comments
* applying syntax changes to pneuma demo and supporting strings within retrieve
* bump version; fix lint; fix docs
* more docs tweaks; tweaking dependencies
* fix install issues
* one more version fix
* one more version fix
* one more version fix
* one more version fix
* last try
* change runner python version
* actually changing runner python version
* increase time limit for runners
* increase time limit for runners
* Merge in Changes From Final Abacus Work (WIP) (#173)
* modifying ProgressManager class to allow for dynamically adding tasks
* beginning to use new progress manager
* initial rewrite of execute_plan methods with new progress manager
* unit tests passing
* trim a few lines
* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR
* enable final operator to show progress in parallel
* initial work to refactor sentinel processors
* passing unit tests
* checking in minor changes
* remove use of setup_logger inside library
* stuff seems to be working
* big print
* turn off rag for test
* try debugging exception
* checking in code before changes to scoring
* finished initial refactoring of mab sentinel execution strategy
* get random sampling execution working with changes
* passing unit tests
* nosentinel progress looks good
* eyeball test is working for progress bars
* remove the old gods
* revert small change
* pull up progress manager logic in parallel execution
* adding prints to generator; turn progress off in favor of verbose for now
* catch errors in generating embeddings
* inspect frontier updates
* remove args.workload
* fix num_inputs in selectivity computation
* pdb in score
* fixed score fn issue
* use execution cache to avoid unnecessary computation; use sentinel stats for updating frontier
* fix progress counter
* debug
* fix empty stats
* only count stats from newly computed results
* fix tuple unpacking
* only update sample counts for llm ops
* de-dup duplicate record
* ugh
* dont forget to increment
* plz
* more plz
* increment
* recycle ops back onto reservoir so they may be reconsidered in the future
* remove pdb
* add progress to script args
* try without rag
* use term recall
* just check in on term recall
* make it easier to turn off progress
* remove pdb
* try to get re-rank to keep all inputs
* try to generate more reactions
* track total LLM calls
* 10x parallelism
* try retrieve directly on fulltext
* up max workers
* adding enron-demo w/optimization
* remove config option
* adding recall and precision to output
* allow operators to be recycled back onto frontier
* revert to using reactions instead of fulltext for similarity
* better cycling of off-frontier operators
* safety check on reservoir ops
* remove pdb
* fixing 5 results per query
* investigate sampling behavior
* check on seeds
* remove pdb
* test SplitConvert
* debug chunking
* fix bug in rag and split convert
* run with chunks
* test chunking logic
* fix chunking logic
* sum list
* remove split merge for now
* minor fixes to CUAD script
* add embedding scripts for mmqa tables and image titles
* address issue with empty titles and title collisions
* prepare script for using clip embeddings for images
* fix bug
* get full space of possible extensions
* debug
* weird bug fix?
* more debug
* fix idiotic mistake
* handle corrupted images and minor things
* add another corrupted image
* another one
* anotha
* more bad images
* last disallow file
* prepare cuad for runs
* specify execution strategy
* up samples
* add sentinel execution strategy to output name
* adding plan str and more stats
* specify no prior
* verbose=False
* fix comment; comment out prints
* make split merge optional for now
* addressing comments
* applying syntax changes to pneuma demo and supporting strings within retrieve
* add prints
* debug sample sets
* checking in code before tweaks to mab
* state of repo after running final Abacus experiments
* revert to opt-profiling-data
* removing print statement
* remove prints
* final fixes
* removing ragatouille dependency
* fix ruff lint checks
* bump version
* passing tests locally
* remove pdb
* fix complaint about match
* Move Abacus Research Scripts into Separate Folder (#175)
* re-organizing abacus research-related scripts
* fix model selection and other tweaks
* add data download script
* bump version
* remove scripts from root
* removing python files which were merged back in from main
* Fixed Issue(s) with Aggregate Operator Computation for Movie Queries (WIP) (#182)
* queries 1-4 working for movies
* removing RandomSampling
* Create `Context` Class + `compute` and `search` operators (#186)
* checking in changes
* refactored Dataset
* checking in
* checking in
* checking in
* queries extract final answer now
* checking in changes w/search operator
* adding changes to agents
* add isinstance checks to all executors
* removing script
* remove tools; include in future PR
* Remove `pz.Schema` in Favor of Using `pydantic.BaseModel` (#188)
* made changes throughout codebase and updated unit tests
* checking in; debugging failure with image use case
* simple demo / paper demos working
* eliminate caching features (#195)
* removing all code synthesis (#198)
* removing all code synthesis
* remove unused import
* Using LiteLLM to Manage Generator Clients / Completion APIs (#200)
* use LiteLLM for generators
* remove unused function; add TODO
* Added Anthropic Support; Simplified Rules; Removed Redundant Model Helpers (#202)
* changes after simplifying rules
* passing unit tests; removed unnecessary model helpers
* simplified primitives slightly
* fixing the assertion which used FieldInfo instead of FieldInfo.annotation (#204)
* add support for o4-mini, gemini-2.5-pro, gemini-2.0-flash, llama-4-maverick (#205)
* Adding Semantic Join Operator (#206)
* initial changes to support validator class; fixed bug in generator for images
* adding validator based optimization
* validator agent example working
* using o1 model; made validation more efficient
* added initial nested loops join implementation
* passing tests
* unit tests passing
* unit tests passing
* enron-demo.py working
* join demos in place
* parallel join and other bugfixes (#207)
* audio-demo (#208)
* remove pdb
* adding option to only use gemini models in audio demo
* adding parallelism; fixed bug w/unique_logical_op_id (#209)
* fixed issue which removed pipelined execution of operators in parallel setting (#210)
* Movie bugfixes (#211)
* fixed error in cost computation for gemini models; tested join on movie queries
* make join count monotonic
* removing progress bar updates for join for now
* adding reasoning effort (#212)
* made progress manager more efficient; made join op calculations accurate (#213)
* make groupby ignore None values
* make it possible to specify schema for MemoryDataset; reasoning model fixes
* adding audio-only match in substitution (#214)
* quick fix for audio prompt missing in MoA
* support passing in gemini/vertex credentials path; fix minor bugs in audio generation (#216)
* adding Distinct operator to PZ (#217)
* masking filepaths for sembench; fix audio pricing (#218)
* make GroupBySig a pz. import
* remove email demo
* reproduce abacus results
* add notes about deprecation to scripts for generating priors
* remove unsupported demos
* sem_add_columns -> sem_map
* Dev staging (#220)
* edit cuad abacus scripts to use local data
* edit cuad abacus scripts to use local data
* edit cuad abacus scripts to use local data
* fix: cuad data loader doesn't work via huggingface anymore (#215)
* edit cuad abacus scripts to use local data
* edit cuad abacus scripts to use local data
* edit cuad abacus scripts to use local data

---------

Co-authored-by: mdr223 <mdrusso@mit.edu>

---------

Co-authored-by: Shreya Shankar <ss.shankar505@gmail.com>

* adding early support for vllm models
* changes to appease linter
* remove models now that we have access to gpt-5
* only perform time check on local; CI runners are slow
* Support google api and desc (#222)
* support shreya models and re-support desc
* adding gpt-5-nano to gpt-5 models
* bump version
* fixed merge error
* fixing bug where id column in schema overrides DataRecord.id

---------

Co-authored-by: Jun <130543538+chjuncn@users.noreply.github.com>
Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Sivaprasad Sudhir <sivaprasad2626@gmail.com>
Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>
Co-authored-by: Bari Bo LeBari <143016395+lilbarbar@users.noreply.github.com>
Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Tranway1 <tranway@qq.com>
Co-authored-by: Luthfi Balaka <luthfibalaka@gmail.com>
Co-authored-by: Shreya Shankar <ss.shankar505@gmail.com>

* Add Optimizations for Filter and Join Operators (#230)
* rename files to reflect that they will contain filter and map physical operators
* passing map unit tests
* passing filter tests
* finished tests
* adding tests for joins and initial embedding join
* adding vllm test
* fixed embedding join
* filter for filepaths instead of assert
* add embedding cost
* fixed full hashes bug with deep copy
* bump version
* undo linting change
* Reorder bug (#232)
* fixing map/filter/join tests for CI which doesn't have GEMINI access; adding test for real estate bug
* added exploration to re-order converts
* separate lack of gemini from ci tests
* Data Record Refactor (#233)
* Refactor DataRecord to hold data in the BaseModel member instead of separately.
* Some type fixes
* local unit tests passing
* enforce data record id uses list of schema fields
* remove unused code from copy
* use function instead of class internals

---------

Co-authored-by: Tianyu Li <litianyu@mit.edu>

* Updating Website to Use Docusaurus (#234)
* adding docusaurus website; still haven't updated doc content and home page
* fix links at bottom of page
* updated pages for website; docs are still not auto-rendered
* updating ci pipelines
* update path to package
* update node version
* update package
* fix build commands
* fix trigger
* fix runner and import
* fix some DataRecord inits
* switch to running llms w/separate flag b/c one test can fail due to bad generation
* changes to be more flexible on types for abacus scripts
* guessing at fix for build path
* removing old website
* remove commented ci code
* remove mkdocs from pyproject
* remove prints
* fix location of CNAME file
* Opt fixes (#236)
* fixed errors in optimizer
* added palimpchat page
* passing unit tests
* also relax types on train datasets
* bump version
* try lowercasing c
* fixed route
* eliminate slowdown from stringifying sentinel plan(s)
* bump version
* allow enron demo to swap filters w/convert
* remove print statements in validator and fix bug introduced for bytes fields
* bump version
* adding min and max
* fixing assertion error
* fix no reasoning prompt templating issue(s)
* add semantic aggregation operator
* bump version
* fix mock call in unit test
* add google analytics tracking
* Updated Website User Guide(s); Renamed `retrieve()` --> `sem_topk()` (#244)
* checking in in-flight changes
* adding code for unmatched records in left/right/outer joins
* optimization stuck
* new mmqa script is functional
* minor bugfixes
* fix naive estimates with new operators
* updated website user guides; renamed retrieve --> top-k
* fix defaults for join op
* bumping version
* fix documentation links
* Add Cost-Based Sample Budget; Fix RAGConvert/Filter for `str | Any` Types (#247)
* checking in in-flight changes
* adding code for unmatched records in left/right/outer joins
* optimization stuck
* new mmqa script is functional
* minor bugfixes
* fix naive estimates with new operators
* updated website user guides; renamed retrieve --> top-k
* add cost-based sample budget; fix rag convert and filter for str | Any fields
* Fix missing comma causing vLLM completions to break (#246)
* bumping version
* Final Changes from Revision for Abacus (#250)
* checking in in-flight changes
* adding code for unmatched records in left/right/outer joins
* optimization stuck
* new mmqa script is functional
* minor bugfixes
* fix naive estimates with new operators
* updated website user guides; renamed retrieve --> top-k
* add cost-based sample budget; fix rag convert and filter for str | Any fields
* pushing local mmqa experiment
* try n=20
* preparing final runs for table 2
* fix thread safety issue w/EmbeddingJoin
* adding full ablation study
* bugfixes in operators
* adding final revision work from local
* updated readme
* adding changes from berners-lee
* remove comments
* fix linting and bump version
* Blebari task 131 (#241)
* .
* .
* minor tweaks
* add embedding costs to RecordOpStats
* minor tweaks
* change comment

---------

Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-128-127.dyn.mit.edu>
Co-authored-by: Bari LeBari <barilebari@Baris-MacBook-Pro.local>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* adding real-estate-eval-100 to download script
* adding real-estate-demo
* jczhang add model checks (#254)
* adding checks that user has support for models they need
* check if available models is empty
* trying to resolve dependency
* bump version
* gemini studio api issue (#257)
* recreating the issue
* fixing model provider for google AI studio
* add try-except back

---------

Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* bump version
* fix model check
* Fix no reasoning (#270)
* enforce that setting reasoning effort to None turns off reasoning prompts; fix config copy error
* bump version
* update constants to reflect the cached-input token costs
* update GenerationStats
* update GenerationStats to include cache token/cost
* fix typo
* update stats in GenerationStats
* prompt caching implementation
* split cache tokens into read and creation
* restructure prompt caching into PromptCacheManager class
* update CacheManager class
* caching demo
* add claude sonnet 4.0 (temporary)
* fix pretty print error for anthropic
* propagate cache-related stats end-to-end
* fix bug for gemini model
* claude-3-7 deprecated
* fix formatting issues
* fix formatting issues
* fixing comments
* update token/cost logic to be disjoint for input and cache
* update demo
* Generalize Support for LiteLLM Models #265 (#272)
* model_info (Model -> ConfiguredModel in constants) - 265
* predictor function for unknown spec
* update full list of API keys
* add gemini3 and gpt5.2 to constants
* return models based on opt obj when models is None
* reorganize functions in model info/helper
* add tests and update model references and imports
* move validation from config to query processor
* add json file for model score/latency and update predictor function
* update model references and imports
* update dependencies and related test cases
* update Model to have both string and enum
* model_info -> model_helper
* update model usage in query config
* rollback import changes for CuratedModel -> Model
* ModelProvider class
* update all switch cases to ModelProvider when applicable
* reverted CuratedModel changes
* add test cases
* add additional test cases
* fix formatting issues
* add prompt caching stats for #262
* restructure Model class
* fix Model enum issue
* add sorting logic to model class
* use singular json file for info fetching
* expand model list and update curated_model_info file
* restructure model info fetching, update Model class and test cases
* script to update pz_models_information and update get_optimal_models
* is_deepseek_model
* add audio cache read/creation
* remove claude sonnet 3.5 (retired)
* add deepseek-chat
* add .json files to pyproject.toml so they are packaged too
* revert uvicorn dependency
* some small tweaks
* passing tests

---------

Co-authored-by: joycequ <joycequ@mit.edu>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* fixed model function calls
* clean up duplicate code to help with summing field stats
* update fields for classes in models.py, update usage in generators.py
* add test generation file
* add test generation file
* generator messages
* update anthropic stats
* update input/cache token stats
* remove generator messages from github repo
* update generator test cases and implement initial gemini wrapper class
* delete output audio tokens and update gemini client class
* ruff lint for test cases
* fix gemini reasoning effort bug
* fix cost and image issues
* incorporate all pr comments
* make anthropic version more flexible
* Revert "make anthropic version more flexible" (this reverts commit 8eeed67)
* floatify everything
* all but two tests passing
* bump version and relax tests
* Local Model Execution (vLLM) #266 (#282)
* local vllm execution implementation
* update vllm local specs (predictors)
* more robust detection of local model capabilities
* fix formatting
* test script formatting update
* adding placeholder for vllm cache tokens
* remove prints
* remove print
* reverted type
* fix type annotation
* tests passing

---------

Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Allowing providers other than OpenAI for embeddings (#283)
* Removing hard-coded TEXT_EMBEDDING_3_SMALL in RAG and JOIN operators
* remove whitespace
* fixed embedding access in RAGFilter
* fix id/op_params for RAG ops and EmbeddingJoin; update rules to enforce CLIP cannot be used for text-only
* fix value
* unit tests passing

---------

Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* fixed issue #286 and bumped version
* fix linter errors
* quest evals
* adding support for azure openai models (#292) (see the usage sketch after this changelog)
* adding support for azure openai models
* added warning in comment
* bump version
* fix typos
* fix linter error

---------

Co-authored-by: Tianyu Li <litianyu@mit.edu>
Co-authored-by: Jun <130543538+chjuncn@users.noreply.github.com>
Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Sivaprasad Sudhir <sivaprasad2626@gmail.com>
Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>
Co-authored-by: Bari Bo LeBari <143016395+lilbarbar@users.noreply.github.com>
Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Tranway1 <tranway@qq.com>
Co-authored-by: Luthfi Balaka <luthfibalaka@gmail.com>
Co-authored-by: Shreya Shankar <ss.shankar505@gmail.com>
Co-authored-by: Griffin Roupe <31631417+frostyfan109@users.noreply.github.com>
Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-128-127.dyn.mit.edu>
Co-authored-by: Bari LeBari <barilebari@Baris-MacBook-Pro.local>
Co-authored-by: Jerry Zhang <122544742+xqlcn@users.noreply.github.com>
Co-authored-by: joycequ <joycequ2016@gmail.com>
Co-authored-by: joycequu <65379523+joycequu@users.noreply.github.com>
Co-authored-by: joycequ <joycequ@mit.edu>
Co-authored-by: SoTrx <11771975+SoTrx@users.noreply.github.com>
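The headline feature of this commit is Azure OpenAI support. A minimal, untested sketch of what selecting the new model might look like: Model.AZURE_GPT_4o and the AZURE_* environment variables come from this commit's diffs, while the palimpzest.constants import path and the available_models kwarg on QueryProcessorConfig are assumptions that may differ from the released API.

    import os
    import palimpzest as pz
    from palimpzest.constants import Model  # import path is an assumption

    # Credentials read by the Azure code paths added in this commit
    os.environ["AZURE_API_KEY"] = "<your-azure-openai-key>"
    os.environ["AZURE_API_BASE"] = "https://<your-resource>.openai.azure.com"
    os.environ["AZURE_API_VERSION"] = "2024-12-01-preview"

    # Tiny in-memory dataset, mirroring the style of evals/quest/eval.py below
    dataset = pz.MemoryDataset(
        id="azure-demo",
        vals=[{"text": "hello there"}],
        schema=[{"name": "text", "type": str, "desc": "raw text"}],
    )
    plan = dataset.sem_filter("The text is a greeting.", depends_on=["text"])

    # available_models is a guess at how to pin the optimizer to the Azure model
    config = pz.QueryProcessorConfig(available_models=[Model.AZURE_GPT_4o])
    output = plan.run(config)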
Parent: 738b698

17 files changed

Lines changed: 504 additions & 26 deletions

evals/quest/eval.py

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@

import argparse
import copy
import json
import os
import random
import time

import palimpzest as pz


def prepare_docs_for_query(items: list, gt_docs: list) -> list:
    # Shuffle a copy of the corpus, keep every ground-truth doc, then pad
    # with other docs until we have 1,000 candidates (or run out).
    items = copy.copy(items)
    random.shuffle(items)
    final_items = [doc for doc in items if doc["title"] in gt_docs]
    while len(final_items) < 1000 and len(items) > 0:
        item = items.pop(0)
        if item not in final_items:
            final_items.append(item)
    return final_items


# Returns (predicted titles, runtime in seconds, cost in dollars).
def palimpzest_run_query(query: dict, documents: list) -> tuple[list[str], float, float]:
    gt_docs = query["docs"]
    items = prepare_docs_for_query(documents, gt_docs)

    schema = [
        {"name": "title", "type": str, "desc": "Document title"},
        {"name": "text", "type": str, "desc": "Document content"},
    ]

    dataset = pz.MemoryDataset(
        id="quest-docs",
        vals=items,
        schema=schema,
    )

    query_text = query["query"]
    plan = dataset.sem_filter(
        f'This document is relevant to the entity-seeking query: "{query_text}". '
        "Return True if the document helps answer the query, False otherwise.",
        depends_on=["text"],
    ).project(["title"])

    config = pz.QueryProcessorConfig(
        policy=pz.MaxQuality(),
        execution_strategy="parallel",
        progress=True,
    )
    output = plan.run(config)
    execution_stats = output.execution_stats
    time_secs = execution_stats.total_execution_time if execution_stats else 0.0
    cost = execution_stats.total_execution_cost if execution_stats else 0.0
    return [record["title"] for record in output], time_secs, cost


def main():
    parser = argparse.ArgumentParser(description="Evaluate Palimpzest on QUEST")
    parser.add_argument(
        "--domain",
        type=str,
        required=True,
        choices=["films", "books"],
        help="The domain to evaluate.",
    )
    parser.add_argument(
        "--queries",
        type=str,
        required=True,
        help="Path to the file containing the queries (e.g. test.jsonl).",
    )
    parser.add_argument(
        "--documents",
        type=str,
        default="data/documents.jsonl",
        help="Path to documents.jsonl (QUEST format: title, text per line).",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Limit number of queries to evaluate (for debugging).",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed for document shuffling.",
    )
    args = parser.parse_args()

    random.seed(args.seed)

    if not os.path.exists(args.documents):
        raise FileNotFoundError(f"Documents file not found: {args.documents}")
    with open(args.documents) as f:
        documents = [json.loads(line) for line in f]

    queries = []
    with open(args.queries) as f:
        for line in f:
            d = json.loads(line)
            if d["metadata"]["domain"] == args.domain:
                queries.append(d)

    if args.limit:
        queries = queries[: args.limit]

    results = []
    for i, query in enumerate(queries):
        print(f"[{i + 1}/{len(queries)}] Executing query: {query['query']}")
        pred_docs, cur_time, cur_cost = palimpzest_run_query(query, documents)

        # Score predicted titles against the ground-truth document set
        gt_docs = query["docs"]
        preds = set(pred_docs)
        labels = set(gt_docs)

        tp = sum(1 for pred in preds if pred in labels)
        fp = len(preds) - tp
        fn = sum(1 for label in labels if label not in preds)

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        result = {
            "query": query["query"],
            "predicted_docs": pred_docs,
            "ground_truth_docs": gt_docs,
            "precision": precision,
            "recall": recall,
            "f1_score": f1,
            "time": cur_time,
            "cost": cur_cost,
        }
        results.append(result)

    ts = int(time.time())
    out_path = f"results_{args.domain}_{ts}.json"
    with open(out_path, "w") as f:
        json.dump(results, f, indent=4)
    print(f"\nResults saved to {out_path}")

    n = len(results)
    avg_precision = sum(r["precision"] for r in results) / n
    avg_recall = sum(r["recall"] for r in results) / n
    avg_f1 = sum(r["f1_score"] for r in results) / n
    avg_time = sum(r["time"] for r in results) / n
    avg_cost = sum(r["cost"] for r in results) / n

    print(f"Average Precision: {avg_precision:.4f}")
    print(f"Average Recall: {avg_recall:.4f}")
    print(f"Average F1 Score: {avg_f1:.4f}")
    print(f"Average Time: {avg_time:.4f}s")
    print(f"Average Cost: ${avg_cost:.4f}")


if __name__ == "__main__":
    main()
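For reference, the new eval script is driven entirely by its CLI flags; a typical (hypothetical) invocation would be `python evals/quest/eval.py --domain films --queries data/test.jsonl --limit 5`, which evaluates the first five film-domain queries and writes per-query precision/recall/F1 (plus running time and cost) to a timestamped results_films_<ts>.json file, printing corpus-level averages at the end.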

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "palimpzest"
-version = "1.4.0"
+version = "1.5.0"
 description = "Palimpzest is a system which enables anyone to process AI-powered analytical queries simply by defining them in a declarative language"
 readme = "README.md"
 requires-python = ">=3.12"

scripts/capture_litellm_stats.py

Lines changed: 7 additions & 2 deletions
@@ -18,6 +18,7 @@
 - Google/Gemini: gemini-2.5-flash (all seven modality combinations)
 - OpenAI: gpt-4o-2024-08-06 (text, image, text+image)
 - OpenAI: gpt-4o-audio-preview (text+audio, audio)
+- Azure: gpt-4o via Azure OpenAI (text, image, text+image)

 Output files are saved to: scripts/litellm_stats/
 """
@@ -168,6 +169,10 @@ def get_captured_data(self) -> dict[str, Any]:
             "text-image-audio",
         ],
     },
+    "azure": {
+        "model": Model.AZURE_GPT_4o,
+        "supported_modalities": ["text-only", "image-only", "text-image"],
+    },
 }


@@ -305,7 +310,7 @@ def call_litellm_api(

     # Apply provider-specific caching configuration
     # Messages from generator_messages already have cache_control markers for Anthropic
-    if model.is_provider_openai() and cache_key:
+    if (model.is_provider_openai() or model.is_provider_azure()) and cache_key:
         # OpenAI: Use prompt_cache_key for sticky routing to the same cache shard
         # https://platform.openai.com/docs/guides/prompt-caching
         completion_kwargs["extra_body"] = {"prompt_cache_key": cache_key}
@@ -395,7 +400,7 @@ def capture_stats_for_provider(
     """
     # Generate a unique cache key for OpenAI (ensures both requests hit the same cache shard)
     # Reference: capture_provider_stats.py and PromptManager.__init__
-    openai_cache_key = f"pz-test-{uuid.uuid4().hex[:12]}" if provider in ("openai", "openai-audio") else None
+    openai_cache_key = f"pz-test-{uuid.uuid4().hex[:12]}" if provider in ("openai", "openai-audio", "azure") else None

     print("    First request...")
     first_stats = call_litellm_api(messages, model, provider, cache_key=openai_cache_key)
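For context, the patched call_litellm_api path for an Azure model boils down to something like the following sketch. litellm.completion and its extra_body passthrough are real LiteLLM features, but the azure/ deployment name here is illustrative and this exact call has not been run; LiteLLM also expects the AZURE_API_KEY, AZURE_API_BASE, and AZURE_API_VERSION environment variables to be set.

    import litellm

    # Route the request to an Azure OpenAI deployment via LiteLLM; extra_body is
    # forwarded to the underlying OpenAI-compatible API, so prompt_cache_key gives
    # sticky routing to the same prompt-cache shard across repeated requests.
    response = litellm.completion(
        model="azure/gpt-4o-2024-08-06",  # hypothetical deployment name
        messages=[{"role": "user", "content": "ping"}],
        temperature=0.0,
        extra_body={"prompt_cache_key": "pz-test-demo"},
    )
    print(response.usage)  # repeat calls on a long, identical prompt should show cached tokens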

scripts/capture_provider_stats.py

Lines changed: 82 additions & 2 deletions
@@ -17,6 +17,7 @@
 - Google/Vertex AI: gemini-2.5-flash (all seven modality combinations)
 - OpenAI: gpt-4o-2024-08-06 (text, image, text+image)
 - OpenAI: gpt-4o-audio-preview (text+audio, audio)
+- Azure: gpt-4o-2024-08-06 via Azure OpenAI (text, image, text+image)

 Output files are saved to: tests/pytest/scripts/provider_stats/
 """
@@ -108,6 +109,10 @@ def detect_image_media_type(base64_data: str) -> str:
             "text-image-audio",
         ],
     },
+    "azure": {
+        "model": "gpt-4o-2024-08-06",
+        "supported_modalities": ["text-only", "image-only", "text-image"],
+    },
 }


@@ -437,6 +442,77 @@ def call_openai_api(messages: list[dict], model: str, cache_key: str | None = None) -> dict[str, Any]:
     }


+# NOTE: this function was generated speculatively and has not been tested, so it may have errors
+def call_azure_api(messages: list[dict], model: str, cache_key: str | None = None) -> dict[str, Any]:
+    """
+    Call Azure OpenAI API directly and return usage statistics.
+
+    Uses the same message format as OpenAI, but routes through Azure endpoints.
+
+    Args:
+        messages: List of message dicts
+        model: Model name (deployment name)
+        cache_key: Optional prompt_cache_key for sticky routing to same cache shard
+
+    Returns dict with:
+    - completion_tokens
+    - prompt_tokens
+    - prompt_tokens_details (cached_tokens, text_tokens, image_tokens, audio_tokens)
+    - total_tokens
+    """
+    import openai
+
+    api_key = os.environ.get("AZURE_API_KEY") or os.environ.get("AZURE_OPENAI_API_KEY")
+    azure_endpoint = os.environ.get("AZURE_API_BASE")
+    api_version = os.environ.get("AZURE_API_VERSION", "2024-12-01-preview")
+
+    if not api_key:
+        raise ValueError("AZURE_API_KEY or AZURE_OPENAI_API_KEY must be set")
+    if not azure_endpoint:
+        raise ValueError("AZURE_API_BASE must be set")
+
+    client = openai.AzureOpenAI(
+        api_key=api_key,
+        azure_endpoint=azure_endpoint,
+        api_version=api_version,
+    )
+
+    openai_messages = transform_messages_for_openai(messages)
+
+    kwargs = {"model": model, "messages": openai_messages, "temperature": 0.0}
+
+    # Add prompt_cache_key for caching (ensures requests route to same cache shard)
+    if cache_key:
+        kwargs["extra_body"] = {"prompt_cache_key": cache_key}
+
+    response = client.chat.completions.create(**kwargs)
+
+    # Extract complete usage stats
+    usage_dict = {}
+    if response.usage:
+        usage_dict = response.usage.model_dump()
+
+    # Get response text safely
+    try:
+        response_text = response.choices[0].message.content[:200] if response.choices and response.choices[0].message.content else None
+    except Exception:
+        response_text = None
+
+    # Serialize the full response
+    try:
+        raw_response = response.model_dump()
+    except Exception:
+        raw_response = str(response)
+
+    return {
+        "provider": "azure",
+        "model": model,
+        "usage": usage_dict,
+        "response_content": response_text,
+        "raw_response": raw_response,
+    }
+
+
 def call_anthropic_api(messages: list[dict], model: str) -> dict[str, Any]:
     """
     Call Anthropic API directly and return usage statistics.
@@ -602,12 +678,14 @@ def capture_stats_for_provider(
     - first_request: stats from first request
     - second_request: stats from second request (should show cache hits)
     """
-    # Generate a unique cache key for OpenAI (ensures both requests hit the same cache shard)
-    openai_cache_key = f"pz-test-{uuid.uuid4().hex[:12]}" if provider in ("openai", "openai-audio") else None
+    # Generate a unique cache key for OpenAI/Azure (ensures both requests hit the same cache shard)
+    openai_cache_key = f"pz-test-{uuid.uuid4().hex[:12]}" if provider in ("openai", "openai-audio", "azure") else None

     print("    First request...")
     if provider == "openai" or provider == "openai-audio":
         first_stats = call_openai_api(messages, model, cache_key=openai_cache_key)
+    elif provider == "azure":
+        first_stats = call_azure_api(messages, model, cache_key=openai_cache_key)
     elif provider == "anthropic":
         first_stats = call_anthropic_api(messages, model)
     elif provider == "gemini":
@@ -625,6 +703,8 @@ def capture_stats_for_provider(
     print("    Second request (should show cache hits)...")
     if provider == "openai" or provider == "openai-audio":
         second_stats = call_openai_api(messages, model, cache_key=openai_cache_key)
+    elif provider == "azure":
+        second_stats = call_azure_api(messages, model, cache_key=openai_cache_key)
     elif provider == "anthropic":
         second_stats = call_anthropic_api(messages, model)
     elif provider == "gemini":
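A hypothetical smoke test for call_azure_api, assuming the AZURE_* variables point at a live deployment and that transform_messages_for_openai passes standard chat messages through unchanged; like the function itself (per its own NOTE), this has not been executed.

    import os

    # Credentials consumed by call_azure_api (values are placeholders)
    os.environ.setdefault("AZURE_API_KEY", "<your-azure-openai-key>")
    os.environ.setdefault("AZURE_API_BASE", "https://<your-resource>.openai.azure.com")

    messages = [{"role": "user", "content": "Say hello."}]
    first = call_azure_api(messages, model="gpt-4o-2024-08-06", cache_key="pz-test-demo")
    second = call_azure_api(messages, model="gpt-4o-2024-08-06", cache_key="pz-test-demo")

    # With prompt caching working, the repeat request may report cached tokens
    # once the prompt is long enough to be cacheable.
    print(first["usage"].get("prompt_tokens_details"))
    print(second["usage"].get("prompt_tokens_details"))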

scripts/generate_test_messages.py

Lines changed: 5 additions & 0 deletions
@@ -12,6 +12,7 @@
 - OpenAI-Audio: audio-only, text-audio
 - Gemini: all 7 modality combinations
 - Vertex AI: all 7 modality combinations
+- Azure: text-only, image-only, text-image

 Output files are saved to: tests/pytest/data/generator_messages/
 Format: {modality}_{provider}.json (e.g., text-only_anthropic.json)
@@ -244,6 +245,10 @@ class OutputSchema(BaseModel):
             "text-image", "text-audio", "image-audio", "text-image-audio",
         ],
     },
+    "azure": {
+        "model": Model.AZURE_GPT_4o,
+        "supported_modalities": ["text-only", "image-only", "text-image"],
+    },
 }

scripts/update_model_info.py

Lines changed: 2 additions & 1 deletion
@@ -68,6 +68,7 @@
 # API key environment variable mapping
 API_KEY_MAPPING = {
     "openai": "OPENAI_API_KEY",
+    "azure": "AZURE_API_KEY",
     "anthropic": "ANTHROPIC_API_KEY",
     "vertex_ai": "GOOGLE_APPLICATION_CREDENTIALS",
     "gemini": "GEMINI_API_KEY",
@@ -126,7 +127,7 @@ def extract_provider(model_id: str) -> str:
     model_lower = model_id.lower()

     # OpenAI
-    if any(x in model_lower for x in ["gpt", "o1-", "o3-", "dall-e", "whisper"]):
+    if any(x in model_lower for x in ["gpt", "o1-", "o3-", "o4-", "dall-e", "whisper"]):
         return "openai"

     # Anthropic
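As a quick sanity check of the broadened detection (assuming the OpenAI branch shown above is the first one that can match these IDs):

    # Expected behavior of the patched extract_provider; "o4-mini" previously
    # fell through because only "o1-" and "o3-" were listed.
    assert extract_provider("o4-mini") == "openai"
    assert extract_provider("gpt-4o-2024-08-06") == "openai"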
