====== Paperless-ngx Document Management System (Docker) ======
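* I had long been looking for a system that can quickly search the text of files on a file server by keyword. I recently came across Paperless-ngx, which also provides OCR and can extract the text of PDFs produced by a scanner, so it fits my intended use case well.
* Installation environment: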
* VM : 4 vCores / 8G RAM / 32G(SSD)+500G(HDD)
* OS : [[tech/alpine_docker|Alpine3 + Docker Compose]]
* Layout : the 500G disk is mounted at /data and used for data storage
===== Installation =====
- Download docker-compose.env and docker-compose.yml
wget https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/dev/docker/compose/docker-compose.env -O docker-compose.env
wget https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/dev/docker/compose/docker-compose.postgres.yml -O docker-compose.yml
- Edit docker-compose.env
vi docker-compose.env
* Install the additional Traditional Chinese OCR language packs (package-style names, written with hyphens)
PAPERLESS_OCR_LANGUAGES=chi-tra chi-tra-vert
* Set the site URL, e.g. docs.my.ichiayi.com
PAPERLESS_URL=https://docs.my.ichiayi.com
* Set the time zone
PAPERLESS_TIME_ZONE=Asia/Taipei
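* Set the default OCR language to Traditional Chinese + English (Tesseract codes, written with underscores and joined by +)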
PAPERLESS_OCR_LANGUAGE=chi_tra+eng
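The four settings above can also be applied without opening an editor. The following is a minimal sketch and not part of the Paperless-ngx tooling: set_env is a hypothetical helper name, and it assumes each setting sits on its own line as KEY=value or as a commented-out #KEY=value default.

```shell
# set_env FILE KEY VALUE
# Replace the first "KEY=..." or "#KEY=..." line in FILE with "KEY=VALUE",
# drop any further lines for the same key, and append the setting if the
# key is not present at all. (Hypothetical helper, for illustration only.)
set_env() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        index($0, k "=") == 1 || index($0, "#" k "=") == 1 {
            if (!done) { print k "=" v; done = 1 }
            next
        }
        { print }
        END { if (!done) print k "=" v }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through redirection so the original file permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For example, set_env docker-compose.env PAPERLESS_TIME_ZONE Asia/Taipei would apply the time zone setting; a value containing spaces, such as the language pack list, needs to be quoted.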
- Set up a reverse proxy (optional), e.g. docs.my.ichiayi.com -> http 172.16.0.220 8000
- Edit docker-compose.yml to support Office formats, increase the timeout, and store the data under /data
vi docker-compose.yml
services:
  broker:
    container_name: broker
    image: docker.io/library/redis:7
    restart: unless-stopped
    volumes:
      - redisdata:/data
  db:
    container_name: db
    image: docker.io/library/postgres:15
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless
  webserver:
    container_name: webserver
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
      - gotenberg
      - tika
    ports:
      - "8000:8000"
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume
    env_file: docker-compose.env
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
  gotenberg:
    container_name: gotenberg
    image: docker.io/gotenberg/gotenberg:7.10
    restart: unless-stopped
    # The gotenberg chromium route is used to convert .eml files. We do not
    # want to allow external content like tracking pixels or even javascript.
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
      - "--uno-listener-start-timeout=90s"
      - "--api-timeout=900s"
  tika:
    container_name: tika
    image: ghcr.io/paperless-ngx/tika:latest
    restart: unless-stopped
volumes:
  data:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/data/web-data'
  media:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/data/web-media'
  pgdata:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/data/db-data'
  redisdata:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/data/broker-data'
- Create the data directories under /data
mkdir -p /data/web-data
mkdir -p /data/web-media
mkdir -p /data/db-data
mkdir -p /data/broker-data
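The named volumes in docker-compose.yml use the local driver with bind options, so each device path has to exist before the containers start. As a hedged alternative to typing the paths by hand, the sketch below reads them from the compose file. list_bind_devices and ensure_bind_dirs are hypothetical helper names, and the parsing assumes one device: entry per line with no spaces in the path.

```shell
# list_bind_devices FILE
# Print the path of every "device:" entry found in FILE, with surrounding
# single or double quotes removed. (Hypothetical helper; naive line-based
# parsing, not a full YAML parser.)
list_bind_devices() {
    awk '$1 == "device:" { gsub(/["\047]/, "", $2); print $2 }' "$1"
}

# ensure_bind_dirs FILE
# Create every directory reported by list_bind_devices that is missing.
ensure_bind_dirs() {
    list_bind_devices "$1" | while IFS= read -r dir; do
        [ -d "$dir" ] || mkdir -p "$dir"
    done
}
```

Running ensure_bind_dirs docker-compose.yml in the directory that holds the compose file should then have the same effect as the four mkdir commands, and it keeps working if a device path is changed later.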
- Pull the Docker images for the first time
docker compose pull
- Create the first Paperless administrator account
docker compose run --rm webserver createsuperuser
- Start the Paperless services
docker compose up -d
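After docker compose up -d returns, the web interface may still need some time before it answers requests. A minimal sketch for waiting on it follows; wait_for_http is a hypothetical helper name, and wget is assumed to be available since it is already used for the downloads on this page.

```shell
# wait_for_http URL [TIMEOUT_SECONDS]
# Poll URL every 2 seconds until a request succeeds. Returns 0 on success,
# or 1 once roughly TIMEOUT_SECONDS (default 120) of waiting have passed.
# The time spent inside each wget call is not counted.
# (Hypothetical helper, for illustration only.)
wait_for_http() {
    url=$1
    timeout=${2:-120}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if wget -q -T 5 -O /dev/null "$url" 2>/dev/null; then
            return 0
        fi
        sleep 2
        elapsed=$((elapsed + 2))
    done
    return 1
}
```

With the 8000:8000 port mapping from docker-compose.yml, a call such as wait_for_http http://localhost:8000/ 180 on the Docker host would be one way to use it; the URL and the 180 second limit are assumptions, not values from the Paperless-ngx documentation.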
===== References =====
* https://docs.paperless-ngx.com/setup/
* https://docs.paperless-ngx.com/configuration/#PAPERLESS_OCR_LANGUAGE
* [[https://github.com/paperless-ngx/paperless-ngx/blob/main/docker/compose/docker-compose.sqlite-tika.yml | To support Office formats, the gotenberg and tika services need to be added to docker-compose.yml]]
* [[https://github.com/paperless-ngx/paperless-ngx/discussions/4627 | If a 503 error appears while a document is being analyzed, raise the gotenberg timeout and allocate more CPU and RAM]]
{{tag>docs 檔案管理 ocr docker}}