indexed_text - rapid retrieval of text by line number in any order from a file
indexed_text rapidly retrieves a specified line, or a contiguous range of lines, from a preindexed text file. It may also be used to construct the index. This provides a lightweight form of data set retrieval by fixed integer keys from very large text files. Memory mapping is employed, so the accessed file must fit into the available memory; otherwise a fatal error will occur.
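The retrieval idea described above can be sketched in Python. This is only an illustration of line lookup through a memory-mapped file and a table of line offsets, not the program's actual implementation; the helper names build_offsets, get_lines and demo are hypothetical.

```python
import mmap
import os
import tempfile


def build_offsets(buf):
    """Return the byte offset of each line start, plus one end-of-file entry."""
    offsets = [0]
    pos = buf.find(b"\n")
    while pos != -1:
        offsets.append(pos + 1)
        pos = buf.find(b"\n", pos + 1)
    if offsets[-1] != len(buf):
        # The last line has no trailing newline; close it at end of file.
        offsets.append(len(buf))
    return offsets


def get_lines(buf, offsets, start, count=1):
    """Return `count` lines beginning at 1-based line `start`, as bytes."""
    first = start - 1
    last = first + count
    if first < 0 or last > len(offsets) - 1:
        raise IndexError("range extends outside of the text file")
    return buf[offsets[first]:offsets[last]]


def demo():
    """Write a small file, map it read-only, and fetch lines out of order."""
    text = b"".join(b"line %d\n" % i for i in range(1, 6))
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        with open(path, "rb") as fh:
            with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                offsets = build_offsets(mm)
                return get_lines(mm, offsets, 4), get_lines(mm, offsets, 2, 2)
    finally:
        os.remove(path)
```

Once the offsets are known, each lookup is a single slice of the mapped file, which is why the lines can be requested in any order.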
indexed_text may be obtained as part of the drm_tools package from: http://sourceforge.net/projects/drmtools/
-in <FILE> Text file to process. Must be a real file, not a pipe.
-out <FILE> Where to write text from -get. If omitted, or if FILE is -, emit to stdout.
-index Create an index file FILE.ldx. May not be combined with -get.
-get <LIST> Read line numbers to emit from LIST, one line number per input line. "-" reads from stdin. The default mode is equivalent to -get "-".
-n <NUMBER> Number of lines to emit for each -get. Default is 1. This count may also be specified by a second number in LIST. A value of 0 emits all lines from the starting position specified in the LIST file to the end of the text file. Negative values count backwards, emitting the starting line and the lines preceding it. Both 1 and -1 emit just one line. Ranges may not extend outside of the text file.
-bs <N> Set the size of the output buffer in bytes. Default is 65536.
-h -help --help -? --?? Print the help message. (Default - do not print help message.)
-hformat Print a description of the binary files produced by this program. (Default - do not print description.)
-hexamples Print examples. (Default - do not print examples.)
-i Emit version, copyright, license and contact information. (Default - do not emit information.)
Binary file format of the index file produced by this program:
NAME.ldx  line index.
    lines              unsigned 8 byte integer. The total number of lines in the source file.
    bytes              unsigned 8 byte integer. The total number of bytes in the source file.
    los                unsigned byte. Line offset size in bytes, value 2, 4, or 8.
    padding            7 reserved bytes.
    lines + 1 records  Each record contains the corresponding line offset stored in los bytes. The extra record holds an offset 1 byte past the end of the last byte in the file.
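The layout above can be sketched with Python's struct module. Several things here are assumptions rather than facts from this page: the byte order is not stated (little-endian is assumed), the final offset is taken to equal the file size, and picking the smallest offset size that fits is a guess at how los is chosen. The names pack_index and parse_index are hypothetical.

```python
import struct

# Header: lines (8 bytes), bytes (8 bytes), los (1 byte), 7 reserved bytes.
# ASSUMPTION: little-endian; the byte order is not documented here.
HEADER = struct.Struct("<QQB7x")
RECORD_CODES = {2: "H", 4: "I", 8: "Q"}


def pack_index(offsets):
    """Serialize lines + 1 offsets into the described header and records."""
    total_bytes = offsets[-1]  # ASSUMPTION: the extra record equals the file size
    lines = len(offsets) - 1
    # ASSUMPTION: use the smallest offset size that can hold every offset.
    if total_bytes < 2 ** 16:
        los = 2
    elif total_bytes < 2 ** 32:
        los = 4
    else:
        los = 8
    fmt = "<%d%s" % (len(offsets), RECORD_CODES[los])
    return HEADER.pack(lines, total_bytes, los) + struct.pack(fmt, *offsets)


def parse_index(blob):
    """Return (lines, bytes, los, offsets) decoded from an index image."""
    lines, total_bytes, los = HEADER.unpack_from(blob, 0)
    if los not in RECORD_CODES:
        raise ValueError("line offset size must be 2, 4, or 8")
    fmt = "<%d%s" % (lines + 1, RECORD_CODES[los])
    offsets = struct.unpack_from(fmt, blob, HEADER.size)
    return lines, total_bytes, los, list(offsets)
```

With this layout the header occupies 24 bytes, so the offset record for a 1-based line number n starts at byte 24 + (n - 1) * los of the index file.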
LIST files are text made up of any of these types of records:
    ! anything              a comment
    Number Number anything  start, count, a comment
    Number (not a number)   start, a comment
    Number                  start
    (blank line)            ignored
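The record types above could be parsed along these lines in Python. This is a minimal sketch, not the program's parser: the function name parse_list is hypothetical, and details such as whether "!" must appear in the first column, or exactly which strings count as numbers, are assumptions.

```python
def parse_list(text, default_count=1):
    """Return a list of (start, count) pairs from LIST text.

    default_count stands in for the value given with -n.
    """
    records = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # blank line: ignored
        if fields[0].startswith("!"):
            continue  # comment record
        start = int(fields[0])
        count = default_count
        if len(fields) > 1:
            try:
                count = int(fields[1])
            except ValueError:
                pass  # second field is not a number, so it is a comment
        records.append((start, count))
    return records
```

For example, the text "25", "12 2", "100 -2 note" on successive lines would yield (25, 1), (12, 2) and (100, -2) with the default count of 1.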
% indexed_text -in afile -index
    Create an index afile.ldx for text file afile.
% indexed_text -in afile -n 1 <<'EOF'
25
12 2
100 -2
1000
EOF
    If the index file afile.ldx exists, emit lines 25, 12, 13, 99, 100, and 1000 from afile to stdout.
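The way start and count pairs map to emitted line numbers in this example can be sketched in Python. The function name expand is hypothetical, and emitting a negative range in ascending order is an assumption taken from the order in which the example lists its lines (99, then 100).

```python
def expand(start, count, total_lines):
    """Return the 1-based line numbers selected by one (start, count) record."""
    if count == 0:
        first, last = start, total_lines          # to the end of the file
    elif count > 0:
        first, last = start, start + count - 1    # start and following lines
    else:
        first, last = start + count + 1, start    # start and preceding lines
    if first < 1 or last > total_lines or first > last:
        raise ValueError("range extends outside of the text file")
    return list(range(first, last + 1))


def expand_all(records, total_lines):
    """Concatenate the selections of several records, in request order."""
    selected = []
    for start, count in records:
        selected.extend(expand(start, count, total_lines))
    return selected
```

Applied to the records (25, 1), (12, 2), (100, -2) and (1000, 1) against a file of at least 1000 lines, this gives lines 25, 12, 13, 99, 100 and 1000, matching the example.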
GNU General Public License 2
Copyright (C) 2016 David Mathog and Caltech.
David Mathog, Biology Division, Caltech <email@example.com>
drm_tools    indexed_text (1)    1.0.1 JAN 21 2016