SciER Evaluation for Morphik

This directory contains scripts for evaluating different language models' entity and relation extraction capabilities against the SciER (Scientific Entity Recognition) dataset.

Overview

The evaluation workflow is split into two parts:

  1. Graph Creation: Generate a knowledge graph using the model configured in morphik.toml
  2. Graph Evaluation: Evaluate the created graph against the ground truth annotations

This separation allows you to test different models by changing the configuration in morphik.toml between runs.

The evaluation uses the SciER test set, which can be found at https://github.com/edzq/SciER/blob/main/SciER/LLM/test.jsonl.
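
To get a quick look at the data before running anything, you can pretty-print the first record. The raw URL below is inferred from the repository link above and may need adjusting if the upstream layout changes:

curl -s https://raw.githubusercontent.com/edzq/SciER/main/SciER/LLM/test.jsonl | head -n 1 | python -m json.tool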

Setup

  1. Make sure you have a local Morphik server running
  2. Set up your OpenAI API key for evaluation:
    export OPENAI_API_KEY=your_key_here
    

Running the Evaluation

Step 1: Configure Your Model

Edit morphik.toml to specify the model you want to test:

[graph]
model = "openai_gpt4o" # Reference to a key in registered_models
enable_entity_resolution = true

Step 2: Create a Knowledge Graph

Run the graph creation script:

python scier_evaluation.py --model-name gpt4o

The script will output a graph name like scier_gpt4o_12345678 when complete.

Step 3: Evaluate the Knowledge Graph

Evaluate the created graph:

python evaluate_result.py --graph-name scier_gpt4o_12345678
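
Steps 2 and 3 can also be chained in one shell snippet. This is only a sketch: it assumes the graph name is the last scier_-prefixed token that scier_evaluation.py prints, as in the example output above:

GRAPH_NAME=$(python scier_evaluation.py --model-name gpt4o | grep -oE 'scier_[A-Za-z0-9_]+' | tail -n 1)
python evaluate_result.py --graph-name "$GRAPH_NAME"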

Step 4: Test Different Models

To compare models:

  1. Change the model in morphik.toml
  2. Repeat steps 2-3 with a different --model-name
  3. Compare the resulting metrics and visualizations

Command Arguments

scier_evaluation.py

  • --model-name: Name to identify this model in results (default: "default_model")
  • --limit: Maximum number of documents to process (default: 57)
  • --run-id: Unique identifier for the run (default: auto-generated)
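
For example, a shorter run over the first 10 documents with an explicit run identifier (the values here are purely illustrative):

python scier_evaluation.py --model-name gpt4o --limit 10 --run-id trial_1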

evaluate_result.py

  • --graph-name: Name of the graph to evaluate (required)
  • --model-name: Name for the evaluation results (default: "existing_model_openai")
  • --similarity-threshold: Threshold for semantic similarity matching (default: 0.70)
  • --embedding-model: OpenAI embedding model to use (default: "text-embedding-3-small")
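
For example, an evaluation with every option spelled out, using the graph name printed in Step 2:

python evaluate_result.py --graph-name scier_gpt4o_12345678 --model-name gpt4o --similarity-threshold 0.70 --embedding-model text-embedding-3-small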

Results

The evaluation generates:

  • CSV files with precision, recall, and F1 metrics
  • Visualizations comparing entity and relation extraction performance
  • Entity and relation count comparisons

Results are saved in a directory with the format: scier_results_{model_name}_{run_id}/
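
To skim the metrics from every run at once, you can print the first few rows of each CSV; this sketch assumes the CSV files sit directly inside each results directory:

for f in scier_results_*/*.csv; do
  echo "== $f =="
  head -n 5 "$f"
done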

Tips for Model Comparison

  • Use descriptive model names in the --model-name parameter
  • Keep the same similarity threshold across evaluations
  • Compare models using the generated visualization charts
  • Look at both entity and relation extraction metrics