The cache generator
This tool scans your wiki's data/pages paths for attachments. If it finds a new PDF under some page's path, it creates a searchcache directory for that page and writes a text file containing every word in the PDF, sorted and deduplicated:
#!/bin/sh
#
# Copyright (c) 2003 Thomas Renard <CyBaer42@web.de>
# All rights reserved, see COPYING for details.
#
# This script extracts word lists from attachments
#
# $Id$

WIKIROOT=/your/wiki/root/here
CACHEREF=$WIKIROOT/cacheref

for i in `find $WIKIROOT/data/pages/*/attachments/ -newer $CACHEREF -and -type f -print 2>/dev/null`
do
    file $i | grep PDF >/dev/null
    if [ "$?" == "0" ]
    then
        j=`echo $i|sed "s/attachments.*/searchcache/g"`
        k=`echo $i|sed "s/attachments/searchcache/g"`
        mkdir -p $j
        pstotext $i | sed "s/[[:space:]]/\n/g" | \
            sed "s/[\"\':,!0><=\.\;\^\*\|\-\+]//g" | \
            tr A-Z a-z | \
            sort | uniq >$k
    fi
done
touch $CACHEREF
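To see what the normalization part of the script produces, here is the same sed/tr/sort chain fed with a made-up sample sentence instead of pstotext output (GNU sed is assumed, since the first sed relies on \n in the replacement):

```shell
# Split on whitespace, strip the listed punctuation, lowercase,
# then sort and deduplicate -- exactly the pipeline from the script.
echo "The Cache, the cache: a TOOL" \
    | sed "s/[[:space:]]/\n/g" \
    | sed "s/[\"\':,!0><=\.\;\^\*\|\-\+]//g" \
    | tr A-Z a-z \
    | sort | uniq
# Output:
# a
# cache
# the
# tool
```

Note that both occurrences of "cache" collapse into one line, which is what keeps the word lists small.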
This script should be run via cron. The last sed filters out some characters I did not want to keep; maybe this can be made a little smoother in future releases. $CACHEREF is a timestamp file used to check whether an attachment has changed since the last run of this script.
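For example, a crontab entry like the following would run the generator once a night; the install path and the schedule are placeholders, not part of the original script:

```shell
# crontab fragment: run the search-cache generator daily at 03:15
# (illustrative path -- adjust to wherever you saved the script)
15 3 * * * /usr/local/bin/moin-searchcache.sh
```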
With the next release I will try to render M$ Word documents via wvText. It works the same as the PDF handling, except that wvText is used instead of pstotext and Word documents are detected in the output of file $i.
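One way that branching could be sketched is a small helper that picks the extractor from the output of file; the helper name and the "Microsoft Word" match string are assumptions here, not taken from the script above:

```shell
# Hypothetical helper: choose an extraction tool based on what file(1)
# reports for an attachment. Only the string matching is exercised here;
# the wvText branch itself is untested.
choose_extractor() {
    case "$1" in
        *PDF*)              echo pstotext ;;
        *"Microsoft Word"*) echo wvText ;;
        *)                  echo none ;;
    esac
}

choose_extractor "report.doc: Microsoft Word document"   # -> wvText
choose_extractor "paper.pdf: PDF document, version 1.3"  # -> pstotext
```

One caveat to keep in mind: if I remember the wv tools correctly, wvText writes its result to an output file rather than to stdout, so the pipeline would need a temporary file instead of a straight pipe.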
Remarks and Questions
Wouldn't it be easier and more convenient if this script were only run when a PDF file has been attached?