mirror of https://github.com/james-m-jordan/morphik-core.git synced 2025-05-09 19:32:38 +00:00

add support for PostgreSQL and pgvector (#15 )

Co-authored-by: Adityavardhan Agrawal <aa729@cornell.edu>

2025-01-04 08:14:52 -05:00

.github/workflows

update requirements and add pre-commit logic

2024-12-26 22:26:16 +05:30

core

add support for PostgreSQL and pgvector (#15 )

2025-01-04 08:14:52 -05:00

sanity_checks

add a video parser + formatting changes (#4 )

2024-12-26 11:34:24 -05:00

sdks/python

Add open telemetry and shell (#5 )

2024-12-30 23:52:25 -05:00

__init__.py

basic ingestion using unstructured and k-nearest retrieval works

2024-11-14 23:18:37 -05:00

.dockerignore

deploy to fly

2024-11-17 16:51:18 -05:00

.env.example

Add PostgreSQL support (#13 )

2025-01-04 08:11:09 -05:00

.gitignore

Add local file system for storage (#10 )

2024-12-31 06:25:51 -05:00

.pre-commit-config.yaml

change black formatter to support 100 line-length

2024-12-29 12:46:17 +05:30

config.toml

add support for PostgreSQL and pgvector (#15 )

2025-01-04 08:14:52 -05:00

databridge.toml

add support for PostgreSQL and pgvector (#15 )

2025-01-04 08:14:52 -05:00

dockerfile

sdk and querying in api works

2024-12-03 21:46:25 -05:00

fly.toml

add s3 uploading and fly deploy

2024-11-18 10:45:07 -05:00

generate_local_uri.py

add a video parser + formatting changes (#4 )

2024-12-26 11:34:24 -05:00

LICENSE

system changes

2024-11-22 20:58:17 -05:00

printer.py

reformat files

2024-12-29 12:48:41 +05:30

pytest.ini

add pytest.ini and init in tests

2024-11-23 13:56:45 -05:00

quick_setup.py

add support for PostgreSQL and pgvector (#15 )

2025-01-04 08:14:52 -05:00

README.md

add setup script, update readme accordingly

2024-12-23 13:22:21 -05:00

requirements.txt

add support for PostgreSQL and pgvector (#15 )

2025-01-04 08:14:52 -05:00

shell.py

use local unstructured by default (#12 )

2025-01-01 09:18:23 -05:00

start_server.py

bug fixes and end-to-end testing

2024-12-17 21:40:38 -05:00

README.md

DataBridge

DataBridge is an extensible, open-source document processing and retrieval system designed for building document-based applications. It provides a modular architecture for integrating document parsing, embedding generation, and vector search capabilities.

Features
Starting the Server
Quick Start
Architecture
- Current Integrations
- Adding New Components
API Documentation
- Key Endpoints
License
Contributing

Features

🔌 Extensible Architecture: Modular design for easy component extension or replacement
🔍 Vector Search: Semantic search capabilities
🔐 Authentication: JWT-based auth with developer and end-user access modes
📊 Components: Document Parsing (Unstructured API), Vector Store (MongoDB Atlas), Embedding Model (OpenAI), Storage (AWS S3)
🚀 Python SDK: Simple client SDK for quick integration
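
To make the vector search feature more concrete, here is a minimal sketch of k-nearest retrieval by cosine similarity over toy embeddings. It illustrates the general technique only and is not DataBridge's implementation: the helper names `cosine_similarity` and `top_k` and the hand-written three-dimensional vectors are assumptions made for illustration. In a real deployment the vectors would come from the embedding model and the search would run inside the vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the ids of the k chunks most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in chunks.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy, hand-written embeddings (hypothetical; real ones come from an embedding model)
chunks = {
    "pricing": [1.0, 0.0, 0.0],
    "feedback": [0.0, 1.0, 0.1],
    "roadmap": [0.0, 0.2, 1.0],
}
print(top_k([0.0, 0.9, 0.2], chunks))  # ['feedback', 'roadmap']
```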

Starting the Server

Clone the repository:

git clone https://github.com/databridge-org/databridge-core.git

Set up your Python environment (Python 3.12 is supported; other versions may work):

cd databridge-core
python -m venv .venv
source .venv/bin/activate

Install the required dependencies:

pip install -r requirements.txt

Create a .env file in the project directory, using .env.example as a reference, and set your environment variables in it:

cp .env.example .env

Run the quick setup script to create the database, S3 bucket, and vector index:

python quick_setup.py

Generate a local URI:

python generate_local_uri.py

Copy the output and save it for use with the client SDK.

Start the server:

python start_server.py

Tip: Visit http://localhost:8000/docs for the complete OpenAPI documentation.
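
If you would rather confirm from a script that the server is reachable, the following sketch requests the docs page mentioned above using only the standard library. The helper names `docs_url` and `server_is_up` are hypothetical, and the sketch assumes the default address http://localhost:8000 and that the docs page answers with HTTP 200 when the server is healthy.

```python
import urllib.error
import urllib.request


def docs_url(base_url="http://localhost:8000"):
    """Build the URL of the OpenAPI docs page for a given server address."""
    return base_url.rstrip("/") + "/docs"


def server_is_up(base_url="http://localhost:8000", timeout=2.0):
    """Return True if the docs page answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(docs_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


if __name__ == "__main__":
    print("server reachable:", server_is_up())
```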

Quick Start

Ensure the server is running, then use the SDK to ingest and query documents.

Install the SDK:

pip install databridge-client

Use the SDK:

import asyncio
from databridge import DataBridge

async def main():
    # Initialize the client with the URI printed by generate_local_uri.py
    db = DataBridge("your_databridge_uri_here", is_local=True)
    files = [
        "annual_report_2022.pdf",
        "marketing_strategy.docx",
        "product_launch_presentation.pptx",
        "company_logo.png",
    ]

    # Ingest each file, optionally attaching metadata
    for file in files:
        await db.ingest_file(
            file=file,
            file_name=file,
            metadata={"category": "Company Related"},
        )

    # Query the ingested documents, restricted to the same metadata category
    results = await db.query(
        query="What did our target market say about our product?",
        return_type="chunks",
        filters={"category": "Company Related"},
    )

    print(results)

asyncio.run(main())

For other examples, check out our documentation!

Architecture

DataBridge uses a modular architecture with the following base components that can be extended or replaced:

Current Integrations

Document Parser: Unstructured API integration for intelligent document processing
- Extend BaseParser to add new parsing capabilities
Vector Store: MongoDB Atlas Vector Search integration
- Extend BaseVectorStore to add new vector stores
Embedding Model: OpenAI embeddings integration
- Extend BaseEmbeddingModel to add new embedding models
Storage: AWS S3 integration
- Storage utilities can be modified in utils/
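
As a rough illustration of the extension pattern described above, the sketch below subclasses a parser base class. The real `BaseParser` interface lives in core/ and is not shown in this README, so the `BaseParser` defined here is only a stand-in with an assumed single method, `split_text`; `ParagraphParser` is likewise hypothetical. Check the actual base class for its required methods before adapting this.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseParser(ABC):
    """Stand-in for the project's parser base class (assumed interface)."""

    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        """Split raw text into chunks."""


class ParagraphParser(BaseParser):
    """Hypothetical parser that treats blank-line-separated paragraphs as chunks."""

    def split_text(self, text: str) -> List[str]:
        paragraphs = (p.strip() for p in text.split("\n\n"))
        return [p for p in paragraphs if p]  # drop empty fragments


parser = ParagraphParser()
print(parser.split_text("First paragraph.\n\nSecond paragraph.\n\n"))
# ['First paragraph.', 'Second paragraph.']
```

Keeping the base class abstract means an implementation that forgets a required method fails when it is instantiated rather than later, during ingestion.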

Adding New Components

Implement the relevant base class from core/
Register your implementation in the service configuration
Update environment variables if needed
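
One way to picture the registration step is a small registry that maps a provider name from configuration to an implementation class. This is a sketch under assumptions, not the project's actual wiring: `VECTOR_STORES`, `register_vector_store`, `build_vector_store`, `InMemoryVectorStore`, and the `{"vector_store": {"provider": ...}}` config shape are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: provider name -> class (or factory) that builds the component
VECTOR_STORES: Dict[str, Callable[[], object]] = {}


def register_vector_store(name: str):
    """Class decorator that records an implementation under a provider name."""
    def decorator(cls):
        VECTOR_STORES[name] = cls
        return cls
    return decorator


@register_vector_store("in_memory")
class InMemoryVectorStore:
    """Toy implementation used only to demonstrate registration."""

    def __init__(self):
        self.vectors = {}


def build_vector_store(config: dict):
    """Instantiate the vector store named in the (assumed) config layout."""
    provider = config["vector_store"]["provider"]
    if provider not in VECTOR_STORES:
        raise ValueError(f"Unknown vector store provider: {provider!r}")
    return VECTOR_STORES[provider]()


store = build_vector_store({"vector_store": {"provider": "in_memory"}})
print(type(store).__name__)  # InMemoryVectorStore
```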

API Documentation

Once the server is running, visit http://localhost:8000/docs for the complete OpenAPI documentation.

Key Endpoints

POST /ingest: Ingest new documents
POST /query: Query documents using semantic search
GET /documents: List all documents
GET /document/{doc_id}: Get specific document details
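
For callers not using the SDK, the sketch below builds requests for two of the endpoints above with the standard library, without sending them. The request body fields mirror the SDK's `query` arguments and the `Authorization: Bearer` header is a guess based on the JWT-based auth mentioned under Features; both are assumptions, so confirm the exact schema at http://localhost:8000/docs. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"


def build_query_request(query, token, filters=None, return_type="chunks"):
    """Build (but do not send) a POST /query request. Body shape is assumed."""
    body = {"query": query, "return_type": return_type, "filters": filters or {}}
    return urllib.request.Request(
        url=f"{BASE_URL}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


def build_document_request(doc_id, token):
    """Build (but do not send) a GET /document/{doc_id} request."""
    safe_id = urllib.parse.quote(str(doc_id), safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}/document/{safe_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


req = build_query_request("What did our target market say?", token="YOUR_TOKEN")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/query
```

Passing a request to `urllib.request.urlopen` would send it; building it separately makes the URL, headers, and body easy to inspect first.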

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

We welcome contributions! Please open an issue or submit a pull request.

Built with ❤️ by DataBridge.
