mirror of
https://github.com/james-m-jordan/morphik-core.git
synced 2025-05-09 19:32:38 +00:00
update readme
This commit is contained in:
parent
2a4fd05096
commit
aca3437c90
176
README.md
@@ -1,139 +1,69 @@
|
|||||||
# DataBridge
|
# DataBridge Core
|
||||||
|
|
||||||
DataBridge is an extensible, open-source document processing and retrieval system designed for building document-based applications. It provides a modular architecture for integrating document parsing, embedding generation, and vector search capabilities.
|
DataBridge is a powerful document processing and retrieval system designed for building intelligent document-based applications. It provides a robust foundation for semantic search, document processing, and AI-powered document interactions.
|
||||||
|
|
||||||
## Table of Contents
|
## Core Features
|
||||||
- [Features](#features)
|
|
||||||
- [Starting the Server](#starting-the-server)
|
|
||||||
- [Quick Start](#quick-start)
|
|
||||||
- [Architecture](#architecture)
|
|
||||||
- [Current Integrations](#current-integrations)
|
|
||||||
- [Adding New Components](#adding-new-components)
|
|
||||||
- [API Documentation](#api-documentation)
|
|
||||||
- [Key Endpoints](#key-endpoints)
|
|
||||||
- [License](#license)
|
|
||||||
- [Contributing](#contributing)
|
|
||||||
|
|
||||||
## Features
|
- 🔍 **Semantic Search & Retrieval**
|
||||||
|
- Intelligent chunk-based document splitting
|
||||||
|
- Two-stage ranking with vector similarity and neural reranking
|
||||||
|
- Advanced filtering and metadata support
|
||||||
|
- Configurable similarity thresholds and result limits
|
||||||
|
|
||||||
- 🔌 **Extensible Architecture**: Modular design for easy component extension or replacement
|
- 📄 **Document Processing**
|
||||||
- 🔍 **Vector Search**: Semantic search capabilities
|
- Support for PDFs, Word documents, text files, and more
|
||||||
- 🔐 **Authentication**: JWT-based auth with developer and end-user access modes
|
- Intelligent text extraction with structure preservation
|
||||||
- 📊 **Components**: Document Parsing (Unstructured API), Vector Store (MongoDB Atlas), Embedding Model (OpenAI), Storage (AWS S3)
|
- Video content parsing with transcription and metadata extraction
|
||||||
- 🚀 **Python SDK**: Simple client SDK for quick integration
|
- Automatic chunk generation and embedding
|
||||||
|
- Metadata and access control management
|
||||||
|
|
||||||
## Starting the Server
|
- 🔌 **Extensible Architecture**
|
||||||
|
- Modular design with swappable components
|
||||||
|
- Support for custom parsers and embedding models
|
||||||
|
- Flexible storage backends (S3, local, etc.)
|
||||||
|
- Vector store integrations (PostgreSQL with pgvector)
|
||||||
|
|
||||||
1. Clone the repository:
|
- 🔐 **Security & Access Control**
|
||||||
```bash
|
- Fine-grained document access control
|
||||||
git clone https://github.com/databridge-org/databridge-core.git
|
- Reader/Writer/Admin permission levels
|
||||||
```
|
- JWT-based authentication
|
||||||
|
- API key management
|
||||||
|
|
||||||
2. Setup your python environment (Python 3.12 supported, but other versions may work):
|
- 💻 **Deployment Options**
|
||||||
```bash
|
- Full local deployment support with Ollama for embeddings
|
||||||
cd databridge-core
|
- Cloud deployment with managed services
|
||||||
python -m venv .venv
|
- Hybrid deployment options
|
||||||
source .venv/bin/activate
|
- Docker container support
|
||||||
```
|
|
||||||
|
|
||||||
3. Install the required dependencies:
|
## Key Endpoints
|
||||||
```bash
|
|
||||||
pip install -r requirements.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
4. Set up your environment variables, using the `.env.example` file as a reference, and creating a `.env` file in the project directory:
|
- **Document Operations**
|
||||||
|
- `POST /ingest/text`: Ingest text content
|
||||||
|
- `POST /ingest/file`: Ingest file (PDF, DOCX, video, etc.)
|
||||||
|
- `GET /documents`: List all documents
|
||||||
|
- `GET /documents/{doc_id}`: Get document details
|
||||||
|
- `DELETE /documents/{doc_id}`: Delete a document
|
||||||
|
|
||||||
```bash
|
- **Search & Retrieval**
|
||||||
cp .env.example .env
|
- `POST /retrieve/chunks`: Search document chunks
|
||||||
```
|
- `POST /retrieve/docs`: Search complete documents
|
||||||
|
- `POST /query`: Generate completions using context
|
||||||
|
- `GET /documents/{doc_id}/chunks`: Get document chunks
|
||||||
|
|
||||||
5. Run the quick setup script to create the database, s3 bucket, and vector index:
|
- **System Operations**
|
||||||
```bash
|
- `GET /health`: System health check
|
||||||
python quick_setup.py
|
- `GET /usage/stats`: Get usage statistics
|
||||||
```
|
- `GET /usage/recent`: Get recent operations
|
||||||
|
- `POST /api-keys`: Generate API keys
|
||||||
|
|
||||||
6. Generate a local URI:
|
## Documentation
|
||||||
```bash
|
|
||||||
python generate_local_uri.py
|
|
||||||
```
|
|
||||||
Copy the output and save it for use with the client SDK.
|
|
||||||
|
|
||||||
7. Start the server:
|
For detailed information about installation, usage, and development:
|
||||||
```bash
|
|
||||||
python start_server.py
|
|
||||||
```
|
|
||||||
*Tip*: Visit `http://localhost:8000/docs` for the complete OpenAPI documentation.
|
|
||||||
|
|
||||||
## Quick Start
|
- [Installation Guide](https://databridge.gitbook.io/databridge-docs/getting-started/installation)
|
||||||
|
- [Quick Start Guide](https://databridge.gitbook.io/databridge-docs/getting-started/quickstart)
|
||||||
Ensure the server is running, then use the SDK to ingest and query documents.
|
- [API Reference](https://databridge.gitbook.io/databridge-docs/api-reference/overview)
|
||||||
|
- [Architecture Overview](https://databridge.gitbook.io/databridge-docs/architecture/overview)
|
||||||
1. Install the SDK:
```bash
pip install databridge-client
```
2. Use the SDK:
```python
import asyncio
from databridge import DataBridge

async def main():
    # Initialize client
    db = DataBridge("your_databridge_uri_here", is_local=True)
    files = ["annual_report_2022.pdf", "marketing_strategy.docx" ,"product_launch_presentation.pptx", "company_logo.png"]

    for file in files:
        await db.ingest_file(
            file=file,
            file_name=file,
            metadata={"category": "Company Related"} # Optionally add any metadata
        )

    # Query documents
    results = await db.query(
        query="What did our target market say about our product?",
        return_type="chunks",
        filters={"category": "Company Related"}
    )

    print(results)

asyncio.run(main())
```
|
|
||||||
For other examples <!-- -like how to make xyz in 10 lines of code- --> checkout our [documentation](https://databridge.gitbook.io/databridge-docs)!
|
|
||||||
|
|
||||||
## Architecture

DataBridge uses a modular architecture with the following base components that can be extended or replaced:

### Current Integrations

- **Document Parser**: Unstructured API integration for intelligent document processing
  - Extend `BaseParser` to add new parsing capabilities
- **Vector Store**: MongoDB Atlas Vector Search integration
  - Extend `BaseVectorStore` to add new vector stores
- **Embedding Model**: OpenAI embeddings integration
  - Extend `BaseEmbeddingModel` to add new embedding models
- **Storage**: AWS S3 integration
  - Storage utilities can be modified in `utils/`

### Adding New Components

1. Implement the relevant base class from `core/`
2. Register your implementation in the service configuration
3. Update environment variables if needed
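
The extension steps listed above can be sketched roughly as follows. The README does not show the real `BaseParser` interface or how the service configuration registers components, so the `BaseParser` class, its `parse` method, the `PARSERS` registry, and `get_parser` below are all hypothetical stand-ins that only illustrate the subclass-and-register pattern.

```python
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Hypothetical stand-in for the parser base class in `core/`."""

    @abstractmethod
    def parse(self, content: str) -> list[str]:
        """Split raw content into text chunks."""


class ParagraphParser(BaseParser):
    """Example implementation: one chunk per non-empty paragraph."""

    def parse(self, content: str) -> list[str]:
        return [p.strip() for p in content.split("\n\n") if p.strip()]


# Hypothetical registry standing in for "the service configuration".
PARSERS: dict[str, type[BaseParser]] = {"paragraph": ParagraphParser}


def get_parser(name: str) -> BaseParser:
    """Look up a registered parser class by name and instantiate it."""
    return PARSERS[name]()


chunks = get_parser("paragraph").parse("First paragraph.\n\nSecond paragraph.\n")
print(chunks)
```

Keeping the lookup behind a name means a new implementation can be swapped in by changing one configuration value, which is presumably what the environment-variable step is for.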
|
|
||||||
## API Documentation

Once the server is running, visit `http://localhost:8000/docs` for the complete OpenAPI documentation.

### Key Endpoints

- `POST /ingest`: Ingest new documents
- `POST /query`: Query documents using semantic search
- `GET /documents`: List all documents
- `GET /document/{doc_id}`: Get specific document details
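
As a rough illustration of calling one of the endpoints listed above without the SDK, the sketch below builds (but does not send) a `POST /query` request using only the standard library. The JSON field names (`query`, `filters`) mirror the SDK call in the Quick Start, and the bearer-token header follows from the JWT-based auth mentioned in the feature list; both are assumptions, so check the OpenAPI page at `/docs` for the actual request schema.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # local server address quoted in the README


def build_query_request(query, token, filters=None):
    """Build (but do not send) a POST /query request.

    The body field names and the Authorization header format are
    assumptions, not confirmed by the README.
    """
    body = {"query": query, "filters": filters or {}}
    return Request(
        url=f"{BASE_URL}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = build_query_request(
    "What did our target market say about our product?",
    token="<jwt>",
    filters={"category": "Company Related"},
)
print(req.get_method(), req.full_url)
```

With the server running, the request could be sent with `urllib.request.urlopen(req)` and the response body decoded as JSON.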
|
|
||||||
## License
|
## License
|
||||||
|
|
||||||
@@ -145,4 +75,4 @@ We welcome contributions! Please open an issue or submit a pull request.
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
Built with ❤️ by DataBridge.
|
Built with ❤️ by DataBridge
|
||||||
|