The cache generator
This tool scans your wiki's data/pages paths for attachments. If it finds a new PDF under some page's path, it creates a searchcache directory for that page and writes a text file containing every word in the PDF, sorted and deduplicated:
#!/bin/sh
#
# Copyright (c) 2003 Thomas Renard <CyBaer42@web.de>
# All rights reserved, see COPYING for details.
#
# This script extracts word lists from attachments
#
# $Id$

WIKIROOT=/your/wiki/root/here
CACHEREF=$WIKIROOT/cacheref

for i in `find $WIKIROOT/data/pages/*/attachments/ -newer $CACHEREF -and -type f -print 2>/dev/null`
do
    file $i | grep PDF >/dev/null
    if [ "$?" == "0" ]
    then
        j=`echo $i|sed "s/attachments.*/searchcache/g"`
        k=`echo $i|sed "s/attachments/searchcache/g"`
        mkdir -p $j
        pstotext $i | sed "s/[[:space:]]/\n/g" | \
            sed "s/[\"\':,!0><=\.\;\^\*\|\-\+]//g" | \
            tr A-Z a-z | \
            sort | uniq >$k
    fi
done
touch $CACHEREF
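To see what the normalization part of the script produces, here is the same sed/tr/sort chain fed with a made-up sample sentence instead of pstotext output (GNU sed is assumed, since the first sed relies on \n in the replacement):

```shell
# Split on whitespace, strip the listed punctuation, lowercase,
# then sort and deduplicate -- exactly the pipeline from the script.
echo "The Cache, the cache: a TOOL" \
    | sed "s/[[:space:]]/\n/g" \
    | sed "s/[\"\':,!0><=\.\;\^\*\|\-\+]//g" \
    | tr A-Z a-z \
    | sort | uniq
# Output:
# a
# cache
# the
# tool
```

Note that both occurrences of "cache" collapse into one line, which is what keeps the word lists small.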
This script should be run via cron. The last sed filters out some characters I did not want to keep; maybe this can be made a little smoother in future releases. $CACHEREF is a timestamp file used to check whether an attachment has changed since the last run of this script.
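For example, a crontab entry like the following would run the generator once a night; the install path and the schedule are placeholders, not part of the original script:

```shell
# crontab fragment: run the search-cache generator daily at 03:15
# (illustrative path -- adjust to wherever you saved the script)
15 3 * * * /usr/local/bin/moin-searchcache.sh
```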
With the next release I will try to render M$ Word documents via wvText. It works the same as the PDF handling, except that wvText is used instead of pstotext and Word documents are detected in the output of file $i.
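One way that branching could be sketched is a small helper that picks the extractor from the output of file; the helper name and the "Microsoft Word" match string are assumptions here, not taken from the script above:

```shell
# Hypothetical helper: choose an extraction tool based on what file(1)
# reports for an attachment. Only the string matching is exercised here;
# the wvText branch itself is untested.
choose_extractor() {
    case "$1" in
        *PDF*)              echo pstotext ;;
        *"Microsoft Word"*) echo wvText ;;
        *)                  echo none ;;
    esac
}

choose_extractor "report.doc: Microsoft Word document"   # -> wvText
choose_extractor "paper.pdf: PDF document, version 1.3"  # -> pstotext
```

One caveat to keep in mind: if I remember the wv tools correctly, wvText writes its result to an output file rather than to stdout, so the pipeline would need a temporary file instead of a straight pipe.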
Remarks and Questions
Wouldn't it be easier and more convenient if this script were only run when a PDF file has been attached?