DataBridge
DataBridge is an extensible, open-source document processing and retrieval system designed for building document-based applications. It provides a modular architecture for integrating document parsing, embedding generation, and vector search capabilities.
Features
- 🔌 Extensible Architecture: Built with modularity in mind - easily extend or replace any component:
  - Document Parsing: Currently integrated with Unstructured API
  - Vector Store: Currently using MongoDB Atlas
  - Embedding Model: Currently using OpenAI
  - Storage: Currently using AWS S3
- 🔍 Vector Search: Semantic search capabilities
- 🔐 Authentication: JWT-based auth with developer and end-user access modes
- 📊 Metadata: Rich metadata filtering and organization
- 🚀 Python SDK: Simple client SDK for quick integration
Quick Start
- Install the SDK:
pip install databridge-client
- Set up your environment variables:
MONGODB_URI=your_mongodb_connection_string
OPENAI_API_KEY=your_openai_api_key
UNSTRUCTURED_API_KEY=your_unstructured_api_key
JWT_SECRET_KEY=your_jwt_secret
AWS_ACCESS_KEY=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
- Start the server:
python start_server.py
- Use the SDK:
import asyncio

from databridge import DataBridge


async def main():
    # Initialize client
    db = DataBridge("databridge://owner_id:auth_token@your-domain.com")

    # Ingest a document
    doc_id = await db.ingest_document(
        content="Your document content",
        metadata={"title": "My Document"},
    )

    # Query documents
    results = await db.query(
        query="What is...",
        k=4,  # Number of results
    )

    await db.close()


asyncio.run(main())
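The client above is configured with a single connection string. As a rough illustration of what that string carries, the sketch below splits it with the standard library's `urllib.parse`. The helper name `parse_databridge_uri` and the dictionary keys it returns are assumptions made for this example; they are not part of the SDK, which may parse the URI differently.

```python
from urllib.parse import urlsplit


def parse_databridge_uri(uri: str) -> dict:
    """Split a databridge:// URI into its parts (hypothetical helper, not an SDK function)."""
    parts = urlsplit(uri)
    if parts.scheme != "databridge":
        raise ValueError(f"expected a databridge:// URI, got scheme {parts.scheme!r}")
    if not (parts.username and parts.password and parts.hostname):
        raise ValueError("URI must look like databridge://owner_id:auth_token@host")
    return {
        "owner_id": parts.username,    # the part before the colon
        "auth_token": parts.password,  # the part between the colon and the @
        "host": parts.hostname,        # the server the client connects to
    }


info = parse_databridge_uri("databridge://owner_id:auth_token@your-domain.com")
print(info["owner_id"], info["host"])
```

Checking the string this way before constructing the client can surface a malformed URI early, instead of as a connection error later.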
Architecture
DataBridge uses a modular architecture with the following base components that can be extended or replaced:
Current Integrations
- Document Parser: Unstructured API integration for intelligent document processing
  - Extend BaseParser to add new parsing capabilities
- Vector Store: MongoDB Atlas Vector Search integration
  - Extend BaseVectorStore to add new vector stores
- Embedding Model: OpenAI embeddings integration
  - Extend BaseEmbeddingModel to add new embedding models
- Storage: AWS S3 integration
  - Storage utilities can be modified in utils/
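To make the "extend a base class" pattern concrete, here is a minimal sketch of a custom parser. The `BaseParser` defined below is only a stand-in written for this example, with an assumed single `parse` method; the real base class in the project may have a different interface (for instance, async methods or richer return types). `ParagraphParser` is likewise hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseParser(ABC):
    """Stand-in for the project's BaseParser; the real interface may differ."""

    @abstractmethod
    def parse(self, content: str) -> List[str]:
        """Split raw document content into chunks."""


class ParagraphParser(BaseParser):
    """Hypothetical parser that treats blank-line-separated paragraphs as chunks."""

    def parse(self, content: str) -> List[str]:
        # Split on blank lines, trim whitespace, and drop empty blocks.
        paragraphs = (block.strip() for block in content.split("\n\n"))
        return [p for p in paragraphs if p]


chunks = ParagraphParser().parse("First paragraph.\n\nSecond paragraph.\n\n\n")
print(chunks)
```

The same shape would apply to BaseVectorStore and BaseEmbeddingModel: subclass, implement the abstract methods, and leave the rest of the pipeline unchanged.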
Adding New Components
- Implement the relevant base class from core/
- Register your implementation in the service configuration
- Update environment variables if needed
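The three steps above can be sketched roughly as follows, using a vector store as the example. Everything here is illustrative: the `BaseVectorStore` stand-in and its `add`/`search` methods, the `VECTOR_STORES` registry, the `build_vector_store` function, and the `VECTOR_STORE` environment variable are all assumptions, since the project's actual service configuration is not shown here.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Mapping, Tuple, Type


class BaseVectorStore(ABC):
    """Stand-in for the project's BaseVectorStore; the real interface may differ."""

    @abstractmethod
    def add(self, doc_id: str, vector: List[float]) -> None: ...

    @abstractmethod
    def search(self, vector: List[float], k: int) -> List[Tuple[str, float]]: ...


class InMemoryVectorStore(BaseVectorStore):
    """Step 1: implement the base class (here with brute-force cosine similarity)."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, doc_id: str, vector: List[float]) -> None:
        self._vectors[doc_id] = vector

    def search(self, vector: List[float], k: int) -> List[Tuple[str, float]]:
        def cosine(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = [(doc_id, cosine(vector, v)) for doc_id, v in self._vectors.items()]
        # Highest similarity first, truncated to the k best matches.
        return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Step 2: register the implementation under a name (hypothetical registry).
VECTOR_STORES: Dict[str, Type[BaseVectorStore]] = {"memory": InMemoryVectorStore}


def build_vector_store(env: Mapping[str, str]) -> BaseVectorStore:
    # Step 3: pick the implementation from an environment variable.
    # VECTOR_STORE is an assumed variable name, not one the project defines.
    name = env.get("VECTOR_STORE", "memory")
    return VECTOR_STORES[name]()


store = build_vector_store({"VECTOR_STORE": "memory"})
store.add("doc-a", [1.0, 0.0])
store.add("doc-b", [0.0, 1.0])
print(store.search([0.9, 0.1], k=1))
```

Passing the environment in as a mapping (rather than reading `os.environ` inside the function) keeps the selection logic easy to test; in a running server one would pass `os.environ`.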
API Documentation
Once the server is running, visit http://localhost:8000/docs for the complete OpenAPI documentation.
Key Endpoints
- POST /ingest: Ingest new documents
- POST /query: Query documents using semantic search
- GET /documents: List all documents
- GET /document/{doc_id}: Get specific document details
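For callers that do not use the SDK, a request to the query endpoint might be built along these lines. This is a sketch under assumptions: the JSON field names (`query`, `k`) simply mirror the SDK call shown earlier, and the `Authorization: Bearer` header is a guess based on the JWT-based auth mentioned under Features. The OpenAPI page at /docs is the authoritative source for the actual request schema.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # same host and port as the /docs URL above


def build_query_request(query: str, k: int, token: str) -> Request:
    """Build (but do not send) a POST /query request.

    The JSON field names and the Bearer header are assumptions for illustration.
    """
    body = json.dumps({"query": query, "k": k}).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/query",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


req = build_query_request("What is...", k=4, token="your_jwt_token")
print(req.get_method(), req.full_url)
```

With a server running, the request could then be sent with `urllib.request.urlopen(req)` and the response body decoded as JSON.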
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
We welcome contributions! Please open an issue or submit a pull request.
Built with ❤️ by DataBridge.