Robert Labudda 33ca3655f4 when converting bytes to str, assume utf8 2 weeks ago

filecabinet

filecabinet is a minimal document management system for your computer. It has metadata per document and supports fulltext search in various document types.

This readme explains the simple, single-user, local deployment scenario. If you want to use this for multiple users, please read doc/multiuser.md.

Here’s what the web interface looks like:

Installing

To install filecabinet, you should clone the repository, install all requirements, and then filecabinet itself:

$ git clone https://git.spacepanda.se/bold-kitty/filecabinet.git
$ cd filecabinet
$ pip install --user -r requirements.txt
$ pip install --user .

The following additional requirements are optional:

In case you prefer the web interface over the command line shell, you will need this:

$ pip install --user -r requirements-web.txt

If you want to use PDF metadata extraction, you should also install these requirements:

$ pip install --user -r requirements-pdf.txt

For office file metadata (and fulltext search), these requirements are also necessary (and it’s a good idea to have OpenOffice installed):

$ pip install --user -r requirements-doc.txt

If you have scanned documents and want optical character recognition, you will need to install tesseract and this:

$ pip install --user -r requirements-ocr.txt

Quick start

After the installation, you can copy the example.conf to the user configuration directory (usually that’s ~/.config/) and name it filecabinet.conf.

Then you should edit the file to define a cabinet, the folder where your documents will be stored, for example ~/Documents/cabinet:

# ~/.config/filecabinet.conf

[cabinet1]
name = My File Cabinet
path = ~/Documents/cabinet
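
To illustrate how such a file is laid out, here is a minimal sketch that reads it with Python's standard configparser. This is not filecabinet's own loading code: the function name read_cabinets and the rule that every section with a path key describes a cabinet are assumptions made for the example.

```python
import configparser
from pathlib import Path


def read_cabinets(text):
    """Return {section: (name, expanded path)} for every section with a path.

    Assumption: any section that has a 'path' key describes a cabinet.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    cabinets = {}
    for section in parser.sections():
        if parser.has_option(section, "path"):
            name = parser.get(section, "name", fallback=section)
            # Expand "~" so the path can be used directly
            path = Path(parser.get(section, "path")).expanduser()
            cabinets[section] = (name, path)
    return cabinets


EXAMPLE = """
[cabinet1]
name = My File Cabinet
path = ~/Documents/cabinet
"""

if __name__ == "__main__":
    for section, (name, path) in read_cabinets(EXAMPLE).items():
        print(section, name, path)
```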

Now it’s time to add some documents to the cabinet:

$ filecabinet add Documents/my-document.pdf

To inspect all files, you can either start the web interface:

$ filecabinet web --browser

The --browser option makes sure that the website is opened immediately in your web browser.

Or you can use the shell, if you so prefer:

$ filecabinet shell

Try help to see the available commands, and see the section The Shell below for more details.

Adding Documents

In order to add documents to the filecabinet, you have to copy them into your cabinet’s incoming folder.

Once they are in that folder, you have to tell filecabinet that there are new files to pick up. You can do that either in the shell with the pickup command or on the command line with

$ filecabinet pickup

If configured, filecabinet will run optical character recognition (OCR) on pictures and PDFs. It also uses other tricks to extract as much metadata (and the full text) as it can.
Then the document is copied into the cabinet folder and marked as new.
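
The copy step can also be scripted. The following is a minimal sketch, not part of filecabinet: the helper name drop_into_incoming is made up, and creating the incoming folder when it is missing is a convenience of the example rather than documented behaviour.

```python
import shutil
from pathlib import Path


def drop_into_incoming(cabinet, files):
    """Copy files into <cabinet>/incoming and return the new paths."""
    incoming = Path(cabinet).expanduser() / "incoming"
    # Assumption: it is fine to create the folder if it does not exist yet
    incoming.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in files:
        source = Path(source)
        target = incoming / source.name
        shutil.copy2(source, target)  # copy2 also keeps timestamps
        copied.append(target)
    return copied
```

After calling something like drop_into_incoming("~/Documents/cabinet", ["scan.pdf"]), the files still have to be picked up with filecabinet pickup.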

Alternatively, you can add documents directly with the add command:

$ filecabinet add that-file.pdf the-other-file.doc

To indicate that all files belong to the same document, use the --same-document option (or -s for short):

$ filecabinet add -s page1.pdf page2.txt

Searching

Both the web interface and the shell support the same search terms and mechanisms listed here.

Tag searches are case-insensitive and use tag: or #. For example, if you're looking for a document that's tagged with banana, you can search for it with #banana or tag:banana.

To find new documents, search for :new:yes or :new:y. If you only want to find documents that are not new, search for :new:no. Unless specified, a search will ignore whether or not a document is new.

You can search for documents by date range using :before: and :after: with dates in the form yyyy-mm-dd. These dates are exclusive. If you are looking for a document with a date between February 14 and 21 in the year 2018, you can search like this: :after:2018-02-13 :before:2018-02-22.
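
The exclusive bounds can be sketched in a few lines of Python. This only illustrates the comparison described above and is not filecabinet's implementation; the function name in_range is made up for the example.

```python
from datetime import date


def in_range(day, after=None, before=None):
    """True if 'day' lies strictly between 'after' and 'before'.

    Both bounds are yyyy-mm-dd strings and are exclusive.
    """
    if after is not None and day <= date.fromisoformat(after):
        return False
    if before is not None and day >= date.fromisoformat(before):
        return False
    return True


if __name__ == "__main__":
    # February 14 and 21 are inside the range, 13 and 22 are not
    for d in (13, 14, 21, 22):
        print(d, in_range(date(2018, 2, d), "2018-02-13", "2018-02-22"))
```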

By default documents that are deleted are ignored in searches or listings. You can search through deleted documents by searching for :deleted:yes.

You can search for any metadata value, like title, author, or language, by searching with the metadata name and a colon like title:gravity.

Everything else that does not match the special search terms will be used in the fulltext search.

Every search term is a case-insensitive regular expression. So you can search for title:(gr|shm)avity.

If you want to search for terms that contain whitespace, you can use quotes: title:"brain surgery".
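
As a rough illustration of these rules, the sketch below splits a query into terms while keeping quoted phrases together and then matches one name:pattern term as a case-insensitive regular expression. It is an assumption about how such matching could work, not filecabinet's actual parser; split_query and field_matches are hypothetical names. Note that shlex also treats backslashes as escapes, so patterns containing them would need extra care.

```python
import re
import shlex


def split_query(query):
    """Split a query on whitespace while keeping quoted phrases together."""
    return shlex.split(query)


def field_matches(term, metadata):
    """Check one 'name:pattern' term against a metadata dict."""
    name, _, pattern = term.partition(":")
    value = metadata.get(name, "")
    # The pattern is a regular expression and matching ignores case
    return re.search(pattern, value, re.IGNORECASE) is not None


if __name__ == "__main__":
    doc = {"title": "Brain Surgery for Beginners", "author": "Gumby"}
    for term in split_query('title:"brain surgery" author:gumby'):
        print(term, field_matches(term, doc))
```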

Examples:

The title contains "brain", the author is "Gumby", and the date is some time before August 2015: title:brain author:gumby :before:2015-08-01

Looking for a newly added document with the title "The Larch": title:larch :new:yes

The Shell

This is what the shell looks like:

The shell has only a minimal built-in help. Try entering help!

Opening Documents

To open documents from the shell with the open command, you have to configure a program to open files with. A strong recommendation is rifle from the ranger file manager:

# ~/.config/filecabinet.conf

[Shell]
document_opener = rifle

Editing Metadata

From within the shell you can edit the metadata of documents with the edit command. This will, unless configured otherwise, try to use your configured text editors (see environment variables $VISUAL and $EDITOR).
You can override that behaviour by specifying your own editor in the configuration file:

# ~/.config/filecabinet.conf

[Shell]
document_editor = nano

If you decide to use a graphical editor, make sure it does not return until you are done editing. gedit should do that by default, but Sublime Text, for example, must be started with the --wait flag and kate with the --block flag:

[Shell]
document_editor = kate --block
# document_editor = subl3 --wait
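
The lookup described above might be sketched as follows. This is not taken from filecabinet's source: the order $VISUAL before $EDITOR follows the common Unix convention, and the final fallback to vi as well as the name pick_editor are assumptions made for the example.

```python
import os
import shlex


def pick_editor(configured=None, environ=None, default="vi"):
    """Return the editor command as an argument list.

    Order: configured value, then $VISUAL, then $EDITOR, then a default.
    The default 'vi' is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    command = (
        configured
        or environ.get("VISUAL")
        or environ.get("EDITOR")
        or default
    )
    # Split so that flags such as "kate --block" become separate arguments
    return shlex.split(command)


if __name__ == "__main__":
    print(pick_editor("kate --block"))
```

Passing the resulting list plus a file name to subprocess.run waits for the editor to exit, which is why graphical editors need their blocking flag.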

Searching

The shell allows searching with the list or find command and some search terms.

> list author:gumby

OCR

filecabinet can use Tesseract OCR to do character recognition on pictures and scanned PDFs, so you can search the text of images.

In order for that to work, you have to install Tesseract and some language packages, depending on the languages of the documents you wish to scan.

As the last step you should enable OCR in your configuration file:

# ~/.config/filecabinet.conf
[OCR]
enabled = yes
languages = eng, fra

Make sure you have the corresponding language data packages installed! Otherwise filecabinet will just die.
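
A quick check before enabling OCR could look like the sketch below. It is hypothetical helper code, not part of filecabinet: it parses a languages value like the one above and reports which codes are missing from a set of installed languages that you supply, for example taken from the output of tesseract --list-langs. The eng+fra form is the syntax of Tesseract's own -l option; how filecabinet passes the languages on internally is not covered here.

```python
def parse_languages(value):
    """Turn 'eng, fra' into ['eng', 'fra']."""
    return [code.strip() for code in value.split(",") if code.strip()]


def missing_languages(value, installed):
    """Return the configured language codes that are not installed."""
    return [code for code in parse_languages(value) if code not in installed]


def tesseract_lang_argument(value):
    """Join the codes the way Tesseract's -l option expects them."""
    return "+".join(parse_languages(value))


if __name__ == "__main__":
    installed = {"eng", "osd"}  # example data, not read from the system
    print(missing_languages("eng, fra", installed))
    print(tesseract_lang_argument("eng, fra"))
```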

Cabinet Directory Structure

Assuming a cabinet is set up at ~/cabinet, the directory structure is:

~/cabinet
 |
 +-- incoming
 |
 +-- documents
      |
      +-- <partial document id>
           |
           +-- <full document id>
                |
                +-- document.yaml
                |
                +-- <version number>
                     |
                     +-- version.yaml
                     |
                     +-- <part id>.<ext>
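
Given this layout, the documents of a cabinet can be located by looking for document.yaml two levels below the documents folder. The sketch below does only that; it is not part of filecabinet, the function name find_documents is made up, and it makes no assumption about how the partial document id is derived from the full one.

```python
from pathlib import Path


def find_documents(cabinet):
    """Return the folder of every document found in the cabinet.

    A document folder is any documents/<partial id>/<full id>/ directory
    that contains a document.yaml file.
    """
    root = Path(cabinet).expanduser() / "documents"
    return sorted(path.parent for path in root.glob("*/*/document.yaml"))


if __name__ == "__main__":
    for folder in find_documents("~/cabinet"):
        print(folder.name)
```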