External-filters Repository
Share your favorite external-filters.xml!
Structure of external-filters.xml
External filters are managed and configured by a system-wide configuration file, external-filters.xml. This file is installed in sysconfdir/beagle (sysconfdir is usually /etc, /usr/etc or /usr/local/etc). The structure of the file is:
<?xml version="1.0" encoding="utf-8"?>
<external-filters>
  <!-- Add filters here -->
</external-filters>
To add a filter based on an external command, add its details inside the external-filters tag. Each filter is described by a filter tag that looks like the following:
<filter>
  <mimetype>mimetype</mimetype>
  <extension>extension</extension>
  <command>command</command>
  <arguments>arguments</arguments>
</filter>
where,
- mimetype - The mime type handled by this filter. You may have 0 or more of these for any filter. E.g. text/plain.
- extension - The file extension handled by this filter. You may have 0 or more of these for any filter. E.g. .txt (notice the dot in extension).
- command - The filename of the command to run. Do not put any command line arguments in this. This item is required. E.g. cat.
- arguments - Any arguments to pass into the given command. The special token "%s" means the filename to be passed in. This item is required. E.g. %s.
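Before adding a filter to external-filters.xml, it can help to run the command by hand and look at the text it produces. The sketch below is a hypothetical helper (try_filter is not part of Beagle) that substitutes the file name for each stand-alone %s token and runs the command. It assumes that the arguments are split on whitespace and that the filter's text is read from standard output; how Beagle itself tokenises the arguments may differ.

```shell
#!/bin/sh
# try_filter: hypothetical helper, not part of Beagle.
# Usage: try_filter COMMAND "ARGUMENTS" FILE
# Runs COMMAND with ARGUMENTS, replacing each stand-alone %s token by FILE.
try_filter() {
    command_name=$1
    arguments=$2
    file=$3
    set -f    # disable globbing while the arguments string is split
    set --    # start with an empty argument list
    for word in $arguments; do
        if [ "$word" = "%s" ]; then
            set -- "$@" "$file"    # keep the file name as a single argument
        else
            set -- "$@" "$word"
        fi
    done
    set +f
    "$command_name" "$@"
}
```

For example, try_filter untex "-gascii %s" paper.tex runs the TeX sample filter on a file named paper.tex (a placeholder name) and prints the extracted text to the terminal.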
Some sample filters are given below:
- TeX filter
- DVI filter
- Filter for Abiword supported formats
- Postscript filter
- GZipped postscript filter
- Djvu filter
- FB2 filter
- Rar filter
- Lyx filter
Simple TeX filter
- Author: Stephan Hegel
- Description: untex to remove LaTeX commands from input
- Dependencies: untex
<filter>
  <mimetype>text/x-tex</mimetype>
  <extension>.tex</extension>
  <command>untex</command>
  <arguments>-gascii %s</arguments>
</filter>
Simple DVI filter
- Author: Dav
- Description: dvi to text using the "-q" option of dvi2tty
- Dependencies: dvi2tty
<filter>
  <mimetype>application/x-dvi</mimetype>
  <extension>.dvi</extension>
  <command>dvi2tty</command>
  <arguments>-q %s</arguments>
</filter>
Abiword Filter
- Author: John Stowers
- Description: Indexes any document which abiword can open (msword, abw, odf, etc)
- Dependencies: abiword, the abiconvert script
- Note: For an unknown reason the external filter script is not loaded on Ubuntu, so I have only tested this as far as I can. Can someone else confirm whether this works and post any comments to Ubuntu bug 39839? Feel free to change this as you see fit.
- Edit: Place the external-filters.xml file in /usr/etc/beagle (if beagle is installed in /usr/lib/beagle) or /usr/local/etc/beagle (if beagle is installed in /usr/local/lib/beagle).
Place this script in your path and name it abiconvert. It converts any document that abiword can open to raw text.
Conversion Script
#!/bin/bash
# Check a file to convert was supplied
if [ $# -eq 0 ]
then
    echo "Usage: `basename $0` file_to_convert"
    exit 1
fi
# Create a temporary text file that abiword will place the converted document into
tempfile=`tempfile --suffix=.txt`
# Convert the document (abiword always returns 0 whether this was successful or not)
abiword --to="$tempfile" "$1"
# If the tempfile contains some data then abiword must have converted something
if [ -s "$tempfile" ]
then
    cat "$tempfile"
fi
# Tidy up
rm "$tempfile"
Filter Blob
Place this in the external-filters.xml file. Add additional mimetypes to add support for other abiword compatible formats.
<filter>
  <mimetype>application/msword</mimetype>
  <mimetype>application/x-mswrite</mimetype>
  <extension>.doc</extension>
  <command>abiconvert</command>
  <arguments> %s</arguments>
</filter>
Simple Postscript filter
- Author: Ben Lee
- Description: ps2ascii to extract text from postscript
- Dependencies: ps2ascii
<filter>
  <mimetype>application/postscript</mimetype>
  <extension>.ps</extension>
  <extension>.ai</extension>
  <extension>.eps</extension>
  <command>ps2ascii</command>
  <arguments>%s</arguments>
</filter>
Simple Gzipped Postscript filter
- Author: Juergen Rinas
- Description: gzip and ps2ascii to extract text from gzipped postscript
- Dependencies: gzip, ps2ascii
<filter>
  <mimetype>application/x-gzpostscript</mimetype>
  <extension>.ps.gz</extension>
  <command>gzpostscriptfilter</command>
  <arguments>%s</arguments>
</filter>
Conversion Script:
#! /bin/sh
# filename: gzpostscriptfilter
gzip -dc "$1" | ps2ascii
Simple Djvu filter
- Author: Ben Lee
- Description: djvutxt to extract text from Djvu files
- Dependencies: djvutxt
<filter>
  <mimetype>image/vnd.djvu</mimetype>
  <extension>.djvu</extension>
  <extension>.djv</extension>
  <command>djvutxt</command>
  <arguments>%s</arguments>
</filter>
Simple FB2 filter
- Author: Penkov Vladimir
- Description: Extracts text from FictionBook2, a popular Russian e-book format
- Dependencies: perl 5.8, unzip
Save this script as /usr/local/bin/fb2tty.pl and make it executable with chmod +x /usr/local/bin/fb2tty.pl.
#!/usr/bin/perl
use strict;
use warnings;
use Encode 'decode';

my $path = $ARGV[0];
die "Usage: $0 file.fb2[.zip]\n" unless defined $path;

# Do we need to use unzip? Zipped books are read through "unzip -p".
# The list form of open keeps the file name away from the shell.
my $fh;
if ($path =~ m/\.zip$/) {
    open($fh, '-|', 'unzip', '-p', $path) or die "Couldn't run unzip on $path: $!";
}
else {
    open($fh, '<', $path) or die "Couldn't open file $path: $!";
}
binmode($fh);
my $text = do { local $/; <$fh> };   # read the whole book at once
close($fh);
$text = '' unless defined $text;

# Get the file encoding from the XML declaration (defaults to UTF-8)
my $enc = ($text =~ /<\?xml.*?encoding="(.*?)".*?\?>/) ? $1 : 'utf8';

# Parse source; the /s flag lets tags and binary data span several lines
$text = decode($enc, $text);
$text =~ s/<binary.*?>.*?<\/binary>//gs; # remove binary data
$text =~ s/<.*?>/ /gs;                   # remove all XML tags
binmode(STDOUT, ':encoding(UTF-8)');
print $text;
<filter>
  <extension>.fb2</extension>
  <extension>.fb2.zip</extension>
  <command>fb2tty.pl</command>
  <arguments>%s</arguments>
</filter>
Rar filter
- Author: Debajyoti Bera
- Description: extracts file names from rar archives
- Dependencies: rar
<filter>
  <mimetype>application/x-rar</mimetype>
  <extension>.rar</extension>
  <command>rar</command>
  <arguments>lb %s</arguments>
</filter>
Lyx Filter
- Author: James Wilson
- Description: extracts text from Lyx files.
<filter>
  <mimetype>application/x-lyx</mimetype>
  <extension>.lyx</extension>
  <command>cat</command>
  <arguments>%s</arguments>
</filter>
Another Lyx Filter
- Author: Nick Daly
- Description: Extracts text from Lyx files, adapted from the Abiword filter by John Stowers.
Place the conversion script in your path and call it "lyx2stdout". Don't forget to make it executable.
Conversion Script
#!/bin/bash
# Check a file to convert was supplied
if [ $# -eq 0 ]
then
    echo "Usage: `basename $0` file_to_convert"
    exit 1
fi
# lyx will store the result of the output to this file
tempfile="${1%lyx}txt"
# Convert the document
lyx -e text "$1"
# If the tempfile contains some data then lyx must have converted something
if [ -s "$tempfile" ]
then
    cat "$tempfile"
fi
# Tidy up
rm "$tempfile"
Filter
<filter>
  <mimetype>application/x-lyx</mimetype>
  <extension>.lyx</extension>
  <command>lyx2stdout</command>
  <arguments>%s</arguments>
</filter>
Lua Filter
- Author: Gabriel Z. M. Ramos
- Description: Extracts text from Lua files
<filter>
  <mimetype>text/plain</mimetype>
  <mimetype>application/x-lua</mimetype>
  <mimetype>text/x-lua</mimetype>
  <extension>.lua</extension>
  <command>cat</command>
  <arguments>%s</arguments>
</filter>
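Most of the filters above depend on an external command being installed. A minimal sketch for checking this before editing external-filters.xml is shown here; missing_commands is a hypothetical helper name, and it only tests whether each command can be found on the current PATH, not whether Beagle itself will find it.

```shell
#!/bin/sh
# missing_commands: hypothetical helper, not part of Beagle.
# Prints the name of every given command that cannot be found on PATH.
missing_commands() {
    for name in "$@"; do
        command -v "$name" >/dev/null 2>&1 || printf '%s\n' "$name"
    done
}
```

For example, missing_commands untex dvi2tty ps2ascii djvutxt unzip rar lyx abiword lists whichever of those tools still need to be installed, and prints nothing if all of them are present.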
