binorder - sort fixed length records in one file, merge presorted
records from multiple files, generate test data, search sorted
files by key, and other fixed length record operations.
binorder collates or rearranges fixed length binary records. One or more keys define the collating order. Operating modes are sort, merge, cut, test data generation, search, combine, filter to unique or duplicated keys, or filter to a key and the number of times it was observed. If the quantity of data is small enough to fit entirely into memory it may be collated using just the sort mode. Larger quantities of data may be collated in two steps: use the sort mode repeatedly to collate as much data as will fit in memory and store the result in a file; use the merge mode to combine the intermediate result files into a single final output file. The keys used must be the same at both steps. Data must be stored in big endian order (most significant byte first) before it is compared.
binorder may be obtained as part of the drm_tools package from: http://sourceforge.net/projects/drmtools/
On many operating systems binorder must be compiled with large file support if it is to read or write files above a few gigabytes in size. If compiled with the gcc compiler add the command line switches -D_LARGE_FILE_SOURCE -D_FILE_OFFSET_BITS=64 to include large file support. The -i option will show whether or not large files are supported.
Specify a single mode option, one of:
-sort Read the entire input file into memory, collate records into the order defined by the keys, and write the specified range to the output. The command line parameter -x must also be specified if data is read from stdin.
-bsort Read the input file into memory one block at a time, collate records into the order defined by the keys, and write each block to a different output file. The command line parameter -x sets the block size (in words) and must also be specfied. -out must be specified and must contain a template for the block file names like: example_%d . The template must result in a unique name for block numbers 1 to N, where N depends on the quantity of input and the block size. May not be combined with -b or -e.
-merge Read the pre-sorted records from the input files, collate them into the order defined by the keys, and write the specified range to the output. May not be combined with -b or -e.
-cut Read the input file and pass records -b through -e to the output. Fields defined by keys are passed from the input to the output, in the order of the keys. If a key has r or l qualiers then bytes in that field are reversed. If -swap is specified then all the bytes in all fields are reversed whether or not r or l qualiers were specified. If no key is explicitly defined then an implicit key comprising the entire record is used.
-get <LIST> Retrieve record(s) from the input file using the record indices in LIST. LIST contains unsigned integers in the native binary format and may contain other information. Use -sr to specify LIST’s record size and -swap if the indices’ byte order is reversed. The records in the input file may be in any order. If an index matches the -miss parameter the record returned is all zero bytes. If keys are specified the first will be used as the record index. Any additional keys will be copied from the LIST file to the output in the order specified before the retrieved record. The -gr option allows record ranges, instead of single records, to be retrieved. May not be combined with -b or -e.
-search <RECORDS> Find record(s) in the input file whose search key matches those read from the RECORDS file. Returns for each search key some combination of: keys, the position of the first matching record, the number of matches. Positions and matches are stored in the native uint64_t format. For misses the count is returned as zero and the position is set to 2^64-1 unless -miss specifies a different value. The method employs a binary search, so the input file records must have already been sorted into ascending order. The keys in RECORDS may be in any order. Requires -sr, -scond and two keys - the first is the reference key location in the input file record, the second in the location of the search key in the RECORDS file record. Keys beyond the 2nd define other fields in the RECORDS file records which may be passed through to the output with the proper -scond parameter. May not be combined with -b or -e.
-ssearch <RECORDS> Find record(s) in the input file whose search key matches those read from the RECORDS file. Same results as -search but much faster for a large number of queries. The number of records must be specified with -nr. Search records are tagged with their initial position, sorted, and then the input file is searched by recursively dividing this list. This reduces the range of each binary search. May not be combined with -b or -e.
-msearch <RECORDS> Find record(s) in the input file whose search key matches those read from the RECORDS file. Same results as -search and -ssearch but much faster when the number of queries is about the same as the number of targets. The number of records must be specified with -nr. RECORDS and FILE must be presorted by key. Multiple query records may match one target record. Method is essentially a merge. May not be combined with -b or -e.
-generate <N> Generate N binary records filled with random bytes and write the specified range to the output. The random number generator is initialized with the current time unless the -s parameter is used to specify a seed.
-combine Combine records from two input files having the same number of records. Records are interleaved and emitted, like: A1, B1, A2, B2, ... , AN, BN. This effectively creates records A1+B1, A2+B2, ... ,AN+BN. -r and -sr set the record sizes of the first and second input files, respectively.
-unique Filter a presorted input file retaining only those records which differ from the preceding record in at least one key. This collapses runs of similar records into single representatives.
-duplicate Filter a presorted input file retaining only those records which are the same as the preceding record in all keys. If -dmin is set then emit one copy of the record followed by the count of the number observed. The default count size of 2 bytes may be changed with -dc. If -doff is set emit the offset to the start of the run before the count.
-preindex <N> Create FILE.idx based on the presorted input FILE and comparisons of the first N bits of the single key. N will be increased to 16 if a smaller value is specified. FILE.idx contains 2 + 2^N records of size M, where M is the smallest number of bytes which can index FILE. The first record in FILE.idx holds: N, M, each 1 byte, plus M-2 bytes of padding if needed. Records 1 through 1+2^N hold the last observed record in FILE ( as 1:N ) having a prefix corresponding to a key I ( as 0:N-1 ). Index values for keys falling before the first record in FILE have value 0. The range of records in FILE corresponding to a key I is index[1+I] to index[2+I]-1 , inclusive. That range may be empty, in which case no records matching that prefix key exist in FILE. All values in FILE.idx have the CPU’s native endianness. If the largest key in FILE is known specifying -im will use less memory.
-r <N> The number of bytes in each input file record. This parameter is mandatory and must precede all instances of -k.
-in <FILE1[,FILE2[,FILE3...]]> File to read fixed length binary records from. "-" or not specified [Default] reads from stdin. Only -merge accepts multiple inputs, and -search requires an actual file, not stdin or a FIFO.
-out <FILE> Write output to FILE. When combined with -bsort FILE must be a template for the output file names, like "example%d.dat". (Default - write to stdout).
-sr <N> The record size in bytes of the RECORDS or LIST file. Only required and mandatory with -search, -ssearch or -get. For LIST it may only be 2, 4, or 8.
-nr <N> The number of records in the RECORDS file. Required with -ssearch only. Maximum value is 2147483647.
-gr <N> Get range, modifies the action of -get. The index of the first record in a range of records is retrieved from the range of bytes indicated by the first key. The next N bytes, starting immediately after the key hold a native unsigned integer with the number of records to retrieve. N must be 1,2,4, or 8. So -k 1,4 -gr 2 extracts the first index from bytes 1,4 and the count from bytes 5,6. If the count is 0 and the index matches -miss then the retrieved part of the output record is all zero bytes, if the index has another value no record at all is emitted. If the count is K and the index matches -miss then K output records are emitted with the retrieved part filled with zero bytes, if the index has value S then records S to S+K-1 are retrieved and emitted. Default is single record -get, the count is 1 and the key only specifies the record index to retrieve.
-useindex Use FILE.idx to speed up searches. FILE.idx must have been created with -preindex on FILE using the same primary search key.
-im <N> Provide the largest key to be indexed with -preindex. When this is known (much) less memory may be used, possibly allowing indexing with more bits without running out of memory.
-scond <N> What to emit with -search or -ssearch. N is a bit mask, [Default = 0].
4 start values
8 count values (follows start, if also emitted)
16 search keys (precedes any values. The three key options are mutually exclusive)
32 all keys other than the search key
64 all keys
128 all targets are unique (if true, runs are faster)
256 all queries are unique (if true, runs are faster)
-rlock <LOCKFILE> Only read from input(s) when able to gain exclusive lock on LOCKFILE. The lock is released after the read completes. LOCKFILE need not be an input file. The order in which a lock is granted to waiting processes may not correspond to the order in which the locks were requested. (Default - read any time).
-srlock <LOCKFILE> Only read from the -ssearch RECORDS file when able to gain exclusive lock on LOCKFILE. The lock is released after the read completes. LOCKFILE need not be an input file. The order in which a lock is granted to waiting processes may not correspond to the order in which the locks were requested. (Default - read any time).
-wlock <LOCKFILE> Only write to output when able to gain exclusive lock on LOCKFILE. Lock is only attempted once input or generated data is available. The lock is released after the write completes. LOCKFILE need not be the output file. The order in which a lock is granted to waiting processes may not correspond to the order in which the locks were requested. (Default - write any time)
-x <N> When sorting from stdin or a FIFO N is the maximum number of records which will be processed (-sort) or processed in a block (-bsort). When files are sorted and this parameter is not specified the value of N is the size of the file.
-k < S,E[r][s][l][a][c]> Defines a "key", which is a field of contiguous bytes S (start position, numbered from 1) through E (end position) in a record. Key comparisons treat each key’s byte range as an integer. Default integer byte order is big endian, little endian with an l qualifier. Default sort order is ascending, descending with an r qualifier. Default integers are unsigned, signed with an s qualifier. The a qualifier specifies a field which is summed in -duplicate -dmin. The sum has the same endianness as the original field. The c qualifier specifies a field which becomes the count in -duplicate -dmin, overriding -dc. Search keys must be unsigned and the same endianness as the field. Qualifiers r,s,l,a,c are not case sensitive. Keys are applied in command line order. If no keys are specified the default key of 1,N is used, where N is the record length. Keys may not overlap, may not extend outside of a record, and must have a start value less than or equal to the end value.
-swap With -cut reverse the order of the bytes in any specified keys. With -search or -ssearch reverse the order of the bytes in the start and count values. This will convert between big and little endian data representations.
-kbmask <N> Mask the least significant byte of the first key, if that key is not signed. N is an integer between 0 and 255. Any bit that is set is ignored in sort, merge, or search operations. The default value is 0 - all bits in the least significant byte are compared.
-miss <INDEX> When a -search or -ssearch key does not find any matching records the unsigned 64 bit integer INDEX is returned for the start position. In a -get if the key matches INDEX all of its values are set to 0. If -miss is not specified it defaults to 2^64 - 1.
-sspre If set then the ssearch records must have already been sorted into ascending order. This eliminates the need to sort and unsort them and may slightly speed up processing. Default is that the search keys will be sorted into the correct order after they are read, and the results "unsorted" into their original order on output.
-nomap If set then -search, -ssearch and -get do not try to memory map the input file, which is the default action. This might be employed when it is known ahead of time that there is not enough memory in the system to allow a large file to be mapped. If -nomap is not specified a mapping will be attempted and the program will revert to normal file IO if it fails.
-b <N> Begin output at record N. Unless specified output begins with the first record.
-e <N> End output at record N. Unless specified output ends with the last record.
-s <Seed> Specify an integer Seed for the random number generator. If not specified the current time in seconds is used.
-c <Text> Add a comment to the command line, does not affect run. Text is also emitted to stderr on program exit.
-dmin <N> With -duplicate, emit one copy followed by a count when there are at least N duplicates. Minimum value is 1 (count all types). The count value written has the CPU’s endianness.
-dc <N> N is the number of bytes to hold the count for -dmin -duplicate. Allowed values are 1,2,4,8, default is 2. A key with the c qualifier overrides this value and that field becomes the count.
-doff <N> With -duplicate -dmin emit the offset in records to the start of the run of duplicates. N is the number of bytes to hold the offset, allowed values are 1,2,4,8. The offset is emitted after the record and before the count.
-h -help --help -? --?? Print the help message. (Default - do not print help message.)
-hexamples Print examples. (Default - do not print examples.)
-i Emit version, copyright, license, CPU endianness, and contact information.( Default - do not emit information.)
% binorder -h List the the command line options.
% binorder -sort - -x 1000 -r 6 -kbmask 3 > test.bin Read the first 1000 6 byte records from stdin, sort them
into ascending order treating bytes 1 to 6 as unsigned
integers, while ignoring the least significant 2 bits,
and then write them to the output file.
% binorder -generate 1234 -s 7654321 -out test.bin -r 6 Generate 1234 6 byte records filled with random data.
% binorder -sort -in test.bin -out testB.bin -r 6 -b 100 -e 200 Sort the file with the default key and write records 100-200
into the output file, discarding the rest.
% binorder -sort -in data.bin -out data.out -r 8 -k 1,4 -k 5,6r -k 7,8 Sort the 8 byte long records on 3 separate keys, with the
% binorder -sort -in data1.bin -out data1.out -r 8 -k 1,4 -k 5,6r; binorder -sort -in data2.bin -out data2.out -r 8 -k 1,4 -k 5,6r; binorder -sort -in data3.bin -out data3.out -r 8 -k 1,4 -k 5,6r; binorder -merge -in data1.out,data2.out,data3.out -out final.out -r 8 -k 1,4 -k 5,6r Sort three files with the same keys, then merge the resulting files.
This method is used to sort large data sets which will fit on disk,
but not entirely into memory.
% binorder -sort -in data1.bin -out data1.bin -r 8 -rlock rlock -wlock wlock; binorder -sort -in data2.bin -out data2.bin -r 8 -rlock rlock -wlock wlock; binorder -sort -in data3.bin -out data3.bin -r 8 -rlock rlock -wlock wlock The files are read into memory one at a time, the sorting of the contents in
each process begins when its read completes, and the results are written back to
storage one at a time. This avoids simultaneous reads or simultaneous writes, which
may improve throughput between memory and storage. The order in which locks
are granted may not correspond to the order in which the locks were requested.
If the same file is used for both types of lock only one process at a time will
be allowed to read its data in or write its results out.
% binorder -search keys.bin -in sorted.bin -r 10 -sr 6 -scond 27 -out result.bin -k 2,7 -k 1,6 Search the records in sorted.bin for keys in bytes 2-7 which match the key in
bytes 1-6 of each record in keys.bin. The input file must have already been
sorted into ascending order on its key field. Emit to result.bin the search key
(6 bytes) and the number of matching records (an unsigned long long of 8 bytes).
% binorder -ssearch keys.bin -in sorted.bin -r 10 -sr 6 -nr 123456 -scond 25 -out result.bin -k 2,7 -k 1,6 Similar result to preceding search but the method differs. This
sorts the 123456 records into ascending order by key, performs a combined
binary sort on the keys and input, then unsorts the results back into the
original order. This is much faster for large search lists.
In addition, only matching keys are emitted, mismatches are dropped.
% binorder -in anyorder.bin -get records.bin -r 15 -sr 4 -out result.bin Retrieve those records listed in records.bin from anyorder.bin and
write them to result.bin. records.bin contains 4 byte unsigned record
numbers, with 0 being the first record in the input file. The input file
need not be sorted because retrieval is by record number.
% mdump -in test.bin -dtype c -interval 6 -show 7 -format ’%2.2x’ When working with binorder it is sometimes convenient to convert the binary data into a readable format with mdump. Offsets in bytes (decimal values) followed by 6 byte record in hex.
% mdump -in test.bin -dtype c -interval 6 -show 9 -format ’%2.2x’ Offsets in records (decimal values) followed by 6 byte record in hex.
% binformat -in test.bin -format ’%2.2%6xc g -l ^ug’ Employ binformat to view more complex binary records as text. Here
each record consists of a a 6 byte hex key, a 64 bit signed int,
a 32 bit int (which is omitted), and a 64 bit unsigned int (emitted
with its byte order inverted).
Disk Access Random access with -search and -get have been observed to interact poorly with some file systems, resulting in very slow runs. If the file being accessed in this manner is smaller than system memory lock it into file cache with a program like vmtouch. Alternatively, place the file on a solid state drive.
mdump(1), binformat(1), vmtouch(1)
GNU General Public License 2
Copyright (C) 2015 David Mathog and Caltech.
David Mathog, Biology Division, Caltech <email@example.com>
|drm_tools||binorder (1)||0.0.13 NOV 06 2015|