Architecture
Data Model Rules
- Scholar tracking is user-scoped. Each user manages their own list of tracked scholars. Validate mapping/join tables; never assume global links between users and Scholar IDs.
- Publications are global, deduplicated records. Deduplicate via Scholar cluster ID and normalized fingerprinting prior to database insertion.
- State lives on the link. Read/unread, favorites, and visibility state exist exclusively on the scholar-publication link table, not the global publication table.
Domain Service Boundaries
All business logic resides in app/services/<domain>/. Flat files in the app/services/ root are strictly prohibited.
Ingestion (app/services/ingestion/)
Run orchestration, continuation queue, scheduler integration, and scrape safety policy enforcement. The ScholarIngestionService drives the primary data acquisition loop.
Key modules:
application.py- Main ingestion orchestratorscheduler.py- Background tick loop, queue batch processingconstants.py- Safety policy constants and floor valuesfingerprints.py- Publication fingerprinting for deduplicationtypes.py- Ingestion result types and state enums
Scholar Parsing (app/services/scholar/)
Fail-fast Google Scholar HTML parsing and source fetch adapters. Handles paginated HTML feeds, extracts publication blocks via regex and DOM invariants (e.g., gsc_vcd_cib).
Key modules:
parser.py- HTML parser for publication extractionparser_utils.py- Parsing helpers and DOM selectorssource.py- HTTP fetch adapters with browser headersprofile_rows.py- Profile metadata extractionauthor_rows.py- Author citation row parsingstate_detection.py- Blocked/CAPTCHA/rate-limit detection
Scholar Management (app/services/scholars/)
Scholar CRUD, profile image management, and name-search controls with rate limiting.
Key modules:
application.py- Scholar lifecycle operationsuploads.py- Image upload handlingsearch_hints.py- Name search with caching and cooldownsvalidators.py- Input validation for scholar creation
Publications (app/services/publications/)
Listing, read-state management, favorite toggles, enrichment scheduling, deduplication, and PDF queue management.
Key modules:
application.py- Publication service facadelisting.py- Filtered listing with pagination (modes: all/unread/latest)queries.py- Database query builderscounts.py- Aggregation counts for dashboarddedup.py- Duplicate detection and mergingenrichment.py- Identifier and metadata enrichment orchestrationpdf_queue.py- PDF resolution queue policypdf_resolution_pipeline.py- Multi-source PDF resolution (Unpaywall, arXiv)types.py- Publication DTOs and response types
Publication Identifiers (app/services/publication_identifiers/)
Multi-identifier resolution engine. A single publication can have multiple identifiers (DOI, arXiv ID, PMID, PMCID). This decoupled approach replaced the earlier hardcoded DOI-only model.
Key modules:
application.py- Identifier gathering orchestrationnormalize.py- Identifier normalization and validation
arXiv (app/services/arxiv/)
Typed API client with global DB-backed throttle, query cache, and in-flight request coalescing.
Key modules:
client.py- HTTP client for arXiv export APIgateway.py- Gateway with advisory lock, caching, and coalescingcache.py- Query cache with TTL and max-entry pruningrate_limit.py- Global rate limiter viaarxiv_runtime_statetableguards.py- Load-shedding guards (skip when DOI/arXiv evidence exists)parser.py- Atom XML response parser
Safety features:
- Requests are globally serialized via a PostgreSQL advisory lock and shared runtime row (
arxiv_runtime_state) - Identical request payloads are fingerprinted and cached in
arxiv_query_cache_entrieswith TTL + max-entry pruning - Concurrent identical misses are coalesced in-process (one outbound call serves all waiters)
Crossref (app/services/crossref/)
DOI lookup via Crossref REST API with bounded pacing and configurable batch limits.
Unpaywall (app/services/unpaywall/)
Open-access PDF resolution via Unpaywall API, with HTML-based PDF link discovery as a fallback.
Key modules:
application.py- Unpaywall service facadepdf_discovery.py- HTML page scraping for PDF link candidates
OpenAlex (app/services/openalex/)
Metadata matching via OpenAlex API for supplementary identifier resolution.
Key modules:
client.py- OpenAlex API clientmatching.py- Fuzzy title/author matching
Runs (app/services/runs/)
Run history tracking and continuation queue operations.
Key modules:
application.py- Run lifecycle managementqueue_service.py- Continuation queue operations (retry, drop, clear)queue_queries.py- Queue item queries
Portability (app/services/portability/)
Import/export workflows for scholar data with full publication state preservation.
Key modules:
application.py- Import/export orchestrationexporting.py- Scholar export serializationpublication_import.py- Publication import with deduplicationscholar_import.py- Scholar import with link reconstructionnormalize.py- Payload normalization and validation
Database Operations (app/services/dbops/)
Integrity checking, link repair, and near-duplicate repair operations exposed via admin API and CLI scripts.
Key modules:
__init__.py-collect_integrity_report,run_publication_link_repairnear_duplicate_repair.py- Near-duplicate publication detection and merging
Data Integration Flow
graph TD
UI[Frontend Vue App] --> API[FastAPI Backend]
API --> Scheduler[Background Async Scheduler]
Scheduler -->|1. Fetch HTML| Scholar[Google Scholar HTML Parser]
Scholar -->|2. Extract Metadata| IdentifierModule[Identifier Gathering]
IdentifierModule -->|Search arXiv API| API1(arXiv)
IdentifierModule -->|Search Crossref API| API2(Crossref)
IdentifierModule -->|3. Save Identifiers| DB[(PostgreSQL)]
Scheduler -->|4. Resolve PDF| PDFPipeline[PDF Resolution Pipeline]
DB --> |Identified DOIs| PDFPipeline
PDFPipeline -->|Search Open Access APIs| Unpaywall(Unpaywall API)
Unpaywall --> |Acquire PDF URL| PDFWorker[PDF Download Worker]
PDFWorker --> |Store PDF Metadata| DB
API Layer
Routes live in app/api/routers/. All responses under /api/v1 use a strict envelope format. See API Reference for the full contract.
Middleware Stack
Applied in app/main.py:
- CSRF Protection - Token-based CSRF validation
- Session Middleware - Cookie-based session management (itsdangerous signing)
- Request Logging - Structured request/response logging with configurable skip paths
- Security Headers - Configurable HTTP security headers and CSP