About AR.Lucene
AR.Lucene is a simple, free utility from Applied Relevance that allows you to explore Lucene.Net indexes. It provides functionality similar to Luke and Lucli, but it works with Lucene.Net only.
AR.Lucene interacts with a Lucene index. You can:
- List all documents
- Search for documents
- List all fields
- Display search results
- Show count of documents
- Find and delete duplicate documents
AR.Lucene is easy to install and use (for a Windows command line utility, anyway). All input is through command-line arguments and output is plain text to the console, making it something of a RESTful command line utility.
AR.Lucene is very basic, but if there is enough interest, new features will be added over time.
System Requirements
AR.Lucene is a Microsoft .Net 2.0 application that will run on modern Microsoft operating systems.
It has only been tested with Lucene.Net 2.3.1. It is possible, but not guaranteed, that it will work with newer indexes or with indexes created by the Java version of Lucene.
Installation
First, download the AR.Lucene executable arlucene.exe. AR.Lucene is NOT an installer - it is a command line executable without a Windows interface. You should download it to your local hard drive and run it from the command prompt. If you try to run it directly from your browser, you will get an error message.
AR.Lucene is a stand-alone Windows executable. To install it, simply download it and put the executable in a convenient place on your hard drive. For best results, place it somewhere on your search path. For information on setting your system path, see http://support.microsoft.com/kb/310519.
Usage
First, you need a Lucene index. AR.Lucene cannot yet add documents to a Lucene index, so you will have to build one some other way for now. For best results, the index should have been created with a current version of Lucene.Net.
Help
To see the options AR.Lucene currently supports, add -help or -? to the command line:
>arlucene -?
ARLucene version 1.0.0.0
Copyright (c) 2009 Applied Relevance.
Options:
  -?, -help            Displays this help text
  -e, -erasedups       Erase duplicate entries in the index.
  -f, -fields          Display the specified field values. You may specify
                       multiple -f options to include multiple fields.
  -m, -maxdocs         Maximum number of documents to return from the query.
                       Default is 10. 0 means return all matching documents
Commands:
  -a, -alldocuments    List all documents. If you really want all
                       documents make sure to set MaxDocs to 0.
  -d, -duplicates      List duplicate documents based on the given -f field
  -l, -listfields      List all field names in the index.
  -q, -query           The query string to search for.
Required:
  -c, -collection      The path to the Lucene index directory to open.
Specifying the Index Path
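All commands except -help require a path to a valid Lucene index. Use the -c option to specify the path to the index (collection). The -c option alone is not enough: you must also supply one of the commands (-a, -d, -l or -q), otherwise AR.Lucene reports an error: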
> arlucene -c c:\ardata\arindex
Errors:
* At least one of the option "a", "d", "l", "q" must be specified
Listing Fields in the index
To list available fields, use the -l (listfields) command:
>arlucene -c c:\ardata\arindex -l
Opening collection c:\ardata\arindex
DC.FORMAT
DC.TYPE
DC.TYPE.INTERACTIVITYLEVEL
DC.IMAGESEARCH.CODE
DC.VODCAST.URL
urn:schemas.microsoft.com:fulltextqueryinfo:xmlfilter:rss/version
DC.SUBJECT.MISSION
DC.SUBJECT
DC.IDENTIFIER
DC.AUDIENCE.LEVEL
DC.RIGHTS
DC.IMGGALLERY.TNURL
DC.IMAGESEARCH.IMAGE_URL
DC.VODCAST.LABEL
LANGUAGE
DC.PODCAST.LABEL
Path
Querying the Index
To search the index, use the -q query command.
>arlucene -c c:\ardata\arindex -q nasa
Opening collection c:\ardata\arindex
Rank: 1
Score: 0.2328081
ID: 1339
AUTHOR: 65001
CMS DOCUMENT ID: 87028
CONTENT-TYPE: text/html; charset=UTF-8
DC.DATE.MODIFIED: 2008-02-12
<EOD>
...
Rank: 10
Score: 0.2055809
ID: 1412
AUTHOR: 65001
CMS DOCUMENT ID: 72082
CONTENT-TYPE: text/html; charset=UTF-8
DC.AUDIENCE: General Public, Informal Education, Press and Media, Parents, Students, Teachers
DC.CONTRIBUTOR: Brian Dunbar
<EOD>
Your search for body:nasa found 4557 documents in 31 ms.
Retrieved top 10 documents
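Because the results are plain text, they are easy to post-process in a script. The following is a minimal Python sketch, not part of AR.Lucene, that splits captured -q output into one dictionary per document. It assumes the layout shown above: every record starts with a "Rank:" line, ends with <EOD>, and holds one "NAME: value" pair per line. The function name parse_results is hypothetical, and multi-line field values are not handled.

```python
def parse_results(text):
    """Split captured arlucene -q output into one dict per document."""
    records = []
    current = None  # None means we are between records
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Rank: "):
            # Assumption: a "Rank:" line opens each record.
            current = {}
        if current is None:
            continue  # skip banner and summary lines
        if line == "<EOD>":
            # <EOD> closes the current record.
            records.append(current)
            current = None
        elif ": " in line:
            # Split on the first ": " only, so values such as
            # "text/html; charset=UTF-8" stay intact.
            name, value = line.split(": ", 1)
            current[name] = value
    return records


sample = r"""Opening collection c:\ardata\arindex
Rank: 1
Score: 0.2328081
ID: 1339
AUTHOR: 65001
CONTENT-TYPE: text/html; charset=UTF-8
<EOD>
Rank: 10
Score: 0.2055809
ID: 1412
DC.CONTRIBUTOR: Brian Dunbar
<EOD>
Your search for body:nasa found 4557 documents in 31 ms.
Retrieved top 10 documents
"""

docs = parse_results(sample)
for doc in docs:
    print(doc["Rank"], doc["ID"], doc["Score"])
```

All values are kept as strings; convert Score or ID yourself if you need numbers.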
Setting the number of documents to retrieve
It can take a long time to retrieve a large result set. If your query returns thousands of documents, you could wait for hours for all the results to be streamed to your terminal window. The -m maxdocs option allows you to set the maximum number of documents to retrieve from the search results. The default value is 10. Specify a value of 0 to bring back all results.
Specifying Fields to retrieve
By default, the -q command returns all fields for the first 10 documents. You can specify which fields to return in the results with the -f (field) option. To return more than one field, add one -f option per field to the command line.
>arlucene -c c:\ardata\arindex -q data -m 3 -f TITLE -f AUTHOR -f DC.SUBJECT
Opening collection c:\ardata\arindex
Rank: 1
Score: 0.2035873
ID: 452
TITLE: NASA - News - Highlights
AUTHOR: 65001
DC.SUBJECT: news, events
<EOD>
Rank: 2
Score: 0.2035873
ID: 567
TITLE: NASA - Missions Index Page
AUTHOR: 65001
DC.SUBJECT: NULL
<EOD>
Rank: 3
Score: 0.2035873
ID: 578
TITLE: NASA - About NASA
AUTHOR: 65001
DC.SUBJECT: NULL
<EOD>
Your search for body:data found 4032 documents in 15 ms.
Retrieved top 3 documents
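If you call AR.Lucene from a script, the -m and repeated -f options can be assembled programmatically. The Python sketch below is one way to do that; the helper name build_query_args is hypothetical, and it uses only the options documented above (-c, -q, -m, -f).

```python
def build_query_args(index_path, query, max_docs=None, fields=()):
    """Return the argument list for an arlucene -q search."""
    args = ["arlucene", "-c", index_path, "-q", query]
    if max_docs is not None:
        # 0 asks for every matching document; leaving -m out
        # keeps the documented default of 10.
        args += ["-m", str(max_docs)]
    for name in fields:
        # One -f option per field, as in the example above.
        args += ["-f", name]
    return args


args = build_query_args(
    r"c:\ardata\arindex", "data",
    max_docs=3, fields=["TITLE", "AUTHOR", "DC.SUBJECT"],
)
print(" ".join(args))
# prints: arlucene -c c:\ardata\arindex -q data -m 3 -f TITLE -f AUTHOR -f DC.SUBJECT
```

On a machine where arlucene.exe is on the search path, the list can be passed to Python's subprocess.run(args, capture_output=True, text=True) to capture the console output.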
Find duplicate documents
AR.Lucene can find and optionally delete duplicate documents based on any field in the index. The default field name is "Path". The -d (duplicates) option looks for documents with the same value for the given field. You can specify the field to compare with the -f field option.
If you specify more than one field to compare, AR.Lucene will use the first one only.
>arlucene -c c:\ardata\arindex -d
Opening collection c:\ardata\arindex
http://www.nasa.gov/sitemap/sitemap_nasa.html (2)
http://www.nasa.gov/audience/forkids/kidsclub/flash/index.html (2)
...
C:\play\M Divisions Orientation Manual.pdf (8)
C:\play\mng_rc2_tp_email_jun2006.pdf (17)
C:\play\DMSDR1S-#3158827-v11-M Divisions Orientation Manual.pdf (4)
C:\play\powerpoint.ppt (6)
C:\play\word.doc (2)
Found 4961 documents
Found 212 duplicates.
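The duplicate listing can also be summarized with a short script. This Python sketch assumes each duplicate line has the form shown above, a field value followed by a space and the number of copies in parentheses. The name parse_duplicates is hypothetical, and the count of "copies beyond the first" is computed by the script; it is not claimed to match the totals AR.Lucene prints.

```python
import re

# Assumed line shape: "<field value> (<number of copies>)"
DUP_LINE = re.compile(r"^(?P<value>.+) \((?P<count>\d+)\)$")


def parse_duplicates(text):
    """Map each duplicated field value to its number of copies."""
    counts = {}
    for raw in text.splitlines():
        match = DUP_LINE.match(raw.strip())
        if match:
            counts[match.group("value")] = int(match.group("count"))
    return counts


sample = r"""Opening collection c:\ardata\arindex
http://www.nasa.gov/sitemap/sitemap_nasa.html (2)
http://www.nasa.gov/audience/forkids/kidsclub/flash/index.html (2)
...
C:\play\M Divisions Orientation Manual.pdf (8)
C:\play\mng_rc2_tp_email_jun2006.pdf (17)
C:\play\DMSDR1S-#3158827-v11-M Divisions Orientation Manual.pdf (4)
C:\play\powerpoint.ppt (6)
C:\play\word.doc (2)
Found 4961 documents
Found 212 duplicates.
"""

counts = parse_duplicates(sample)
# Copies beyond the first for each value in this excerpt only.
extra_copies = sum(n - 1 for n in counts.values())
print(len(counts), "duplicated values,", extra_copies, "extra copies")
```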
Deleting Duplicate Documents
Duplicates located with the -d option can be deleted from the index with the -e erase option. There is no guarantee about the order in which the extra documents will be deleted.
WARNING: Duplicate documents are deleted with extreme prejudice. You are given no warning and no opportunity to cancel the process. You probably want to make a backup of the index before using the -e option. This is the only warning you will receive.
>arlucene -c c:\ardata\arindex -d -e
Opening collection c:\ardata\arindex
http://www.nasa.gov/sitemap/sitemap_nasa.html (2)
http://www.nasa.gov/audience/forkids/kidsclub/flash/index.html (2)
http://www.nasa.gov/centers/kennedy/multimedia/index.html (2)
http://www.nasa.gov/centers/glenn/technology/index.html (2)
http://www.nasa.gov/centers/kennedy/stationpayloads/index.html (2)
...
Deleted document 4955
Deleted document 4956
Deleted document 4958
Deleted document 4959
Deleted document 4960
Found 4961 documents
Found 212 duplicates.
Deleted 249 documents.
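Since -e cannot be undone, it is worth copying the index directory before running it, as the warning above suggests. The Python sketch below shows one way to take a timestamped copy. It is not part of AR.Lucene; the helper name backup_index and the file name "segments" are placeholders for illustration, and it assumes no other process is writing to the index while the copy runs.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_index(index_dir, backup_root):
    """Copy the index directory to a timestamped folder and return its path."""
    source = Path(index_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree creates missing parent folders and raises an error
    # if the target folder already exists.
    shutil.copytree(source, target)
    return target


# Demonstration on a throwaway directory instead of a real index.
with tempfile.TemporaryDirectory() as scratch:
    index = Path(scratch) / "arindex"
    index.mkdir()
    (index / "segments").write_text("placeholder")  # stand-in index file
    copy = backup_index(index, Path(scratch) / "backups")
    copied_names = sorted(p.name for p in copy.iterdir())

print(copied_names)
```

For a real index you would call something like backup_index(r"c:\ardata\arindex", r"c:\ardata\backups") on Windows before running arlucene with -d -e.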






