
Co-authored-by: Adityavardhan Agrawal <aa729@cornell.edu>
DataBridge
DataBridge is an extensible, open-source document processing and retrieval system designed for building document-based applications. It provides a modular architecture for integrating document parsing, embedding generation, and vector search capabilities.
Features
- 🔌 Extensible Architecture: Modular design for easy component extension or replacement
- 🔍 Vector Search: Semantic search capabilities
- 🔐 Authentication: JWT-based auth with developer and end-user access modes
- 📊 Components: Document Parsing (Unstructured API), Vector Store (MongoDB Atlas), Embedding Model (OpenAI), Storage (AWS S3)
- 🚀 Python SDK: Simple client SDK for quick integration
Starting the Server
- Clone the repository:
git clone https://github.com/databridge-org/databridge-core.git
- Set up your Python environment (Python 3.12 is supported; other versions may work):
cd databridge-core
python -m venv .venv
source .venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
- Set up your environment variables by creating a .env file in the project directory, using the .env.example file as a reference:
cp .env.example .env
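After copying the example file, it can help to confirm that no variable was dropped while editing. The following is a minimal sketch, not part of DataBridge: the helper names `env_keys` and `missing_env_keys` are hypothetical, and the parser assumes simple `KEY=VALUE` lines (optionally prefixed with `export`).

```python
from pathlib import Path


def env_keys(path):
    """Return the set of variable names defined in a dotenv-style file."""
    keys = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_env_keys(example_path=".env.example", env_path=".env"):
    """List keys present in the example file but absent from the real one."""
    return sorted(env_keys(example_path) - env_keys(env_path))
```

Calling `missing_env_keys()` from the project directory returns an empty list when every key in .env.example also appears in .env.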
- Run the quick setup script to create the database, S3 bucket, and vector index:
python quick_setup.py
- Generate a local URI:
python generate_local_uri.py
Copy the output and save it for use with the client SDK.
- Start the server:
python start_server.py
Tip: Visit http://localhost:8000/docs for the complete OpenAPI documentation.
Quick Start
Ensure the server is running, then use the SDK to ingest and query documents.
- Install the SDK:
pip install databridge-client
- Use the SDK:
import asyncio
from databridge import DataBridge

async def main():
    # Initialize client
    db = DataBridge("your_databridge_uri_here", is_local=True)

    files = ["annual_report_2022.pdf", "marketing_strategy.docx", "product_launch_presentation.pptx", "company_logo.png"]
    for file in files:
        await db.ingest_file(
            file=file,
            file_name=file,
            metadata={"category": "Company Related"}  # Optionally add any metadata
        )

    # Query documents
    results = await db.query(
        query="What did our target market say about our product?",
        return_type="chunks",
        filters={"category": "Company Related"}
    )
    print(results)

asyncio.run(main())
For other examples, check out our documentation!
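The Quick Start example ingests files one at a time. With many files, a bounded-concurrency helper may be faster. This is only a sketch: `ingest_all` and its `limit` parameter are hypothetical names, and it assumes the SDK tolerates several `ingest_file` calls in flight at once, which the documentation above does not confirm.

```python
import asyncio


async def ingest_all(ingest, files, metadata=None, limit=4):
    """Ingest files concurrently, running at most `limit` uploads at a time.

    `ingest` is any coroutine function that accepts file, file_name and
    metadata keyword arguments (for example, db.ingest_file).
    """
    semaphore = asyncio.Semaphore(limit)

    async def ingest_one(path):
        async with semaphore:
            return await ingest(file=path, file_name=path, metadata=metadata or {})

    # return_exceptions=True keeps one failed upload from cancelling the rest;
    # failures show up as exception objects in the result mapping.
    results = await asyncio.gather(
        *(ingest_one(path) for path in files), return_exceptions=True
    )
    return dict(zip(files, results))
```

Inside `main()`, this would be called as `await ingest_all(db.ingest_file, files, {"category": "Company Related"})`; the returned dictionary maps each file name to its result or to the exception it raised.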
Architecture
DataBridge uses a modular architecture with the following base components that can be extended or replaced:
Current Integrations
- Document Parser: Unstructured API integration for intelligent document processing
  - Extend BaseParser to add new parsing capabilities
- Vector Store: MongoDB Atlas Vector Search integration
  - Extend BaseVectorStore to add new vector stores
- Embedding Model: OpenAI embeddings integration
  - Extend BaseEmbeddingModel to add new embedding models
- Storage: AWS S3 integration
  - Storage utilities can be modified in utils/
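To illustrate how an embedding model and a vector store work together, here is a toy, self-contained sketch of semantic search by cosine similarity. It is conceptual only and is not DataBridge code: the word-count `embed` function stands in for a real embedding model, and the in-memory list stands in for a vector store.

```python
import math


def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, chunks, vocabulary, k=1):
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vector = embed(query, vocabulary)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk, vocabulary)),
        reverse=True,
    )
    return ranked[:k]


vocabulary = ["revenue", "market", "logo", "launch"]
chunks = [
    "revenue grew while the market stayed flat",
    "the logo uses two colours",
    "the launch is planned for spring",
]
top = search("when is the launch", chunks, vocabulary)
print(top)
```

A production system replaces the word counts with learned embeddings and the linear scan with an indexed vector search, but the ranking idea is the same.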
Adding New Components
- Implement the relevant base class from core/
- Register your implementation in the service configuration
- Update environment variables if needed
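The steps above can be sketched as follows. The real base-class interfaces are not shown in this README, so everything here is an assumption: `BaseParser` is redefined as a stand-in with a hypothetical `parse` method, and `PARSERS` and `build_parser` are hypothetical placeholders for the service configuration. Check the classes in core/ for the actual method names and signatures.

```python
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Stand-in for the real base class in core/; the actual interface may differ."""

    @abstractmethod
    def parse(self, content: bytes) -> list[str]:
        """Split raw document bytes into text chunks."""


class PlainTextParser(BaseParser):
    """Hypothetical parser that splits UTF-8 text on blank lines."""

    def parse(self, content: bytes) -> list[str]:
        text = content.decode("utf-8")
        return [block.strip() for block in text.split("\n\n") if block.strip()]


# Hypothetical registry standing in for "the service configuration".
PARSERS: dict[str, type[BaseParser]] = {"plain_text": PlainTextParser}


def build_parser(name: str) -> BaseParser:
    """Look up a parser by its configured name and instantiate it."""
    try:
        return PARSERS[name]()
    except KeyError:
        raise ValueError(f"Unknown parser: {name!r}") from None
```

Selecting the implementation by name, for instance from an environment variable, keeps the rest of the service unaware of which concrete parser is in use.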
API Documentation
Once the server is running, visit http://localhost:8000/docs for the complete OpenAPI documentation.
Key Endpoints
- POST /ingest: Ingest new documents
- POST /query: Query documents using semantic search
- GET /documents: List all documents
- GET /document/{doc_id}: Get specific document details
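For callers who prefer raw HTTP over the SDK, the endpoints above could be reached with the standard library, as in the sketch below. Several details are assumptions rather than documented behaviour: that the JWT is sent as a Bearer token, that request bodies are JSON, and that the `/query` payload fields mirror the SDK's `query`, `return_type`, and `filters` arguments. The helpers `build_request` and `send` and the token placeholder are hypothetical; confirm the real schemas at http://localhost:8000/docs.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the docs link


def build_request(method, path, token, payload=None):
    """Build (but do not send) an HTTP request for a DataBridge endpoint.

    Assumptions: the token goes in a Bearer header and bodies are JSON.
    """
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Payload field names are assumed to match the SDK's query() arguments.
query_request = build_request(
    "POST",
    "/query",
    token="your_token_here",
    payload={
        "query": "What did our target market say about our product?",
        "return_type": "chunks",
        "filters": {"category": "Company Related"},
    },
)
list_request = build_request("GET", "/documents", token="your_token_here")
```

With the server running and a valid token, `send(query_request)` would perform the call; building and sending are kept separate so the request can be inspected first.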
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
We welcome contributions! Please open an issue or submit a pull request.
Built with ❤️ by DataBridge.