extract - extract and format data from text files
Synopsis
Description
Operator Order
Options (alphabetically)
Options (by function)
Definitions
String Syntax
Patterns
Replacements
Operations
Math Expressions
Examples
See Also
License
Copyright
Acknowledgements
Authors
extract [options...]
extract reads a text file (or files) and extracts a range of rows and columns (character positions), optionally reformats this data, and then outputs it. extract can also process tokens, either delimited or defined by patterns, instead of or in addition to character columns. Complex logic and data manipulations may be defined with a scripting language that supports both text and numeric operations. The intent is that extract be able to handle increasingly difficult tasks through the use of more advanced features, without requiring that those same advanced features be employed for simple tasks. The EXAMPLES section shows the easier end of this spectrum, with the methods for the more difficult end of the spectrum described in the other sections of this document.
extract may be obtained as part of the drm_tools package from: http://sourceforge.net/projects/drmtools/
There are many extract command line options, but only those whose default values are not appropriate for a particular text modification need be specified, subject to the caveat that at least one command line option must be given. The order in which operations are executed, and the command line options that affect those operations, are:
process extra command line arguments, define variables: cmd,eoc,v
emit help or other information: i,h,help,?,hfmt,hmath,hpat,hvar,hexamples
set parse options for making tokens: s,dl,dq,dqs,po,xc
set buffer sizes: wl,xc,xe
convert scripting statements to runnable form: sect,op,psN,pmN,v,dbga
open input and output files, use binary output: in,out,b
emit output file prefix string: filebol
run Before script section: sect,op,psN,pmN,v,dbgm,dbgp,dbgs
Begin processing loop, for each line in the input file(s):
read input files, template match two files: in,indl,eqlen,template
handle embedded null characters: hnr,hnd,hns,hnsubs
handle Carriage Return at end of input lines: crok
merge/unmerge input lines: merge,unmerge,mdl
make tokens for input: mt
run Main script section: sect,op,psN,pmN,v,dbgm,dbgp,dbgs
(Note: the Main script may disable or modify everything after it down to the After script.)
select rows [unconditionally: sr,er,nr]
conditionally: if,ifonly,ifn,ifterm,ifnorestart,all
select columns (fields) [explicitly: fmt]
implicitly: sc,ec,nc,is,rm
emit output line prefix string: bol,ifbol,iftermbol
emit input line count: n
emit input line length: ll
emit input line token count: lt
Output processing on each selected field:
emit debugging information: dbg,dbgv
sort multitoken fields: sortas,sortac,sortan,sortds,sortdc,sortdn
substitute empty fields: rs
substitute characters: rcds,rcdc,rcss
substitute text: rtds,rtdc,rtss
pattern match/substitute text: ps
add backslashes: bs,ba,b2,ecc
format numerics: fff,fffe,ffd,ffu,ffo,ffx,fp,nz
pad or set field width: pd,fw
trim whitespace: trl,trr,trb,trc
justify: jl,jc,jr
adjust case: cu,cl,cf
add delimiters (tokens only): dv,dl,d-,dt
emit output line
emit output line suffix string: eol,ifeol,iftermeol
emit output file suffix string: fileeol
run After script section: sect,op,psN,pmN,v,dbgm,dbgp,dbgs
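For reference, a minimal invocation exercising only the unconditional row and column selection steps above might look like the following (untested sketch; the file name is hypothetical):
extract -in report.txt -sr 2 -nr 3 -sc 5 -ec 12
This should select input rows 2 through 4 and emit character columns 5 through 12 of each.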
-all Emit, unprocessed, the text rows outside of the range specified with -sr , -er , -nr. (Default is not to emit these rows.)
-b Binary output mode. The default output is text. On some operating systems these are the same, and on others not. The primary difference in most cases is the handling of ’\n’, the end of line character. In text mode (the default) this character is expanded to the local end of line sequence. On Unix/Linux systems this is again ’\n’ and there is no difference between the two modes, but on Windows this sequence is ’\r\n’. If -b is specified then extract tries to use binary output, where the end of line character has no special meaning. This should always succeed if -out filename is used, but may fail when the output is to stdout.
-bol <bolstring> When set the prefix <bolstring> is emitted before any output for each input row. Specifically, there will be one prefix string emitted for each input row even if the rest of the output row is empty. <bolstring> may be an empty string. Note that the prefix precedes any line numbers triggered by -n. (Default is an empty string.)
-bs -ba -b2 Add backslashes (unix escape characters) before any character (other than alphabetic, numeric, underscore, period, or slash characters), before all characters, or before all but the first character. If -ecc is also used the specified character is used instead of backslash. Note that this only applies within a field, so that, for instance, if the program is running in token mode a token range [1,3] would apply the backslashes between characters within each token but not between tokens. To work around that limitation use [dv\:1,3]. (Default is not to add backslashes.)
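For instance, a sketch of the expected behavior (untested):
echo 'a b$c' | extract -bs
should emit a\ b\$c, since the space and the $ are escaped while the letters are not.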
-cmd cmdfile Once all command line options are consumed read more options from cmdfile. Input is read until the end of the input or an -eoc. Use -cmd - to read commands from stdin, then after -eoc the following lines are treated as input. Options are delimited by spaces, tabs and End of Lines. Strings containing these characters may be double quoted. Single quotes have no special meaning. Special first characters in an input line: _ (underscore) Remainder of line is the next option. It is read verbatim and so need not be quoted. # The line is a comment - it is ignored. Special first characters in a token: ## The rest of the line is a comment and is ignored. Example:
-mt -dl " \t"
# this is a comment and is ignored
## as is this
-fmt "[mt:1,4]" ##the text before this is script, but this comment is ignored
# The next line is full of double quotes and spaces
# but is pulled in verbatim because of the leading underscore.
_This would be a "Mess\34 to quote [1,4]
-eoc
-cols <format> Deprecated synonym for -fmt.
-crok Retain a Carriage Return character which appears before the End of Line character. (Default is to delete it.)
-cu -cl -cf In selected characters/tokens change case to upper, lower, or first letter upper and rest lower. (Default is to leave case unmodified.)
-dbg Emit state and parsing information as each input line is processed. Only a developer modifying the program’s code is likely to find this useful. (Default is not to emit this information.)
-dbga Emit autolabel information. Use this to find problems in scripts with automatic labels: c,{,}{, and }
-dbgm Emit a representation of a script showing which TARGET is to be tested and which operations are to be run based on the results of that test.
-dbgp Emit raw substrings on pattern matches. This is useful for working out regular expressions or sequential pattern matching logic. For instance, to see what a regular expression produces from a test input use something like this:
echo match_this | extract -pm p:A_Regular_Expression -dbgp
-dbgs Emit call and stack information while a script runs. Used for debugging flow problems in a script.
-dbgv Emit information for all 26 STRING variables. This is done once in the read loop, before the -fmt is executed. For debugging in scripts use instead -op ~VLIST , which shows the information for just the variables in VLIST at a particular place in the script.
-dl <delimiter_string> Change the delimiters used to define tokens. Typically <delimiter_string> must be quoted or escaped on the command line so that the shell does not interpret it. (Default string contains the characters space, colon, and tab )
-dt When tokens are emitted followed by delimiters, use as that delimiter the one which defined the end of the current token. (Default). See also -d- and -dv.
-dq -dqs While parsing tokens ignore delimiters within double quotes. -dq returns the token with the surrounding double quotes, -dqs returns the token without the quotes. (Default is to recognize delimiters no matter where they occur.)
-dv <delimit_character> When tokens are emitted followed by delimiters use -dv <delimit_character> . (Default is -dt ).
-d- Do not emit a delimiter following a token. This is most often used in combination with the -s, -pd, -fw, and -j* switches. (Default is -dt , see also -dv ).
-desc Process (Do) ESCapes at output, converts \\ to \ and so forth. The last character in a string cannot act as an escape even if it is a backslash. Differs from STRING SYNTAX substitution in that [[ and ]] are not special. (Default is that backslash is not a special character.)
-ec <end_column> The last character column to select. (Default is -1, the last column.)
-ecc <escape_character> When set the escape character for the -bs,ba,b2 commands becomes <escape_character> . This may be used to separate character based columns with delimiters so that the result can be read into a spreadsheet easily. (The default escape character is a backslash.)
-eoc Terminate input from a -cmd file.
-eol <eolstring> When set the output from each input row is terminated with <eolstring>. Specifically, there will be one <eolstring> emitted for each input row even if the rest of the output row is empty. <eolstring> may be an empty string. This may be used to compress multiple input lines into a single output line. Typically \n would be injected into the output through -if and/or -fmt and a comma, space or colon would be used for <eolstring>. (Default value of <eolstring> is \n )
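For example, a sketch (untested, file name hypothetical) that joins all input lines with a comma and a space instead of a newline:
extract -in list.txt -eol ', '
Each input row is still processed separately; only its terminator changes.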
-eqlen When reading from multiple input files require that they all have exactly the same number of lines. (Default is to read as many lines as are present in each.)
-er <end_row> The last text row to process. (Default is the last row in the file.)
-esc Respect escapes (backslashes) when parsing text into tokens. If the input is "foo\ bar two" the default -mt parsing will produce three tokens "foo\", "bar", and "two". With -esc it will produce two tokens "foo\ bar" and "two". Note that the escape character is not removed, use -desc for that. (Default is that backslash is not a special character.)
-ffe -fff -ffd -ffu -ffo -ffx Format a text fragment assuming it contains a numeric representation. If the fragment cannot be converted into a valid number a fatal error will result. Also sets the default format for numeric variables. Numeric values are formatted starting from a double precision floating point representation. That type is used directly for the exponent -ffe and floating -fff formats, is first converted to a signed integer for the -ffd format, or converted to an unsigned integer for the -ffu, -ffo, or -ffx (decimal, octal, or hexadecimal) formats. Integer conversions truncate at the upper or lower limits if the initial value is out of that integer type’s range. The precision is set by -fp and is the number of digits after the decimal point for a floating point value, or the number of digits shown for an integer value. For integer values this is how leading zero characters are determined. The field width is set by -fw or -pd. If the resulting formatted number will not fit into the designated width the output will be expanded to fit, so be sure to leave enough space for the largest possible number. Formatting examples (for the value 123.4567 with precision 6): 1.234567e+02 (-ffe), 123.456700 (-fff), 000123 (-ffd), 000123 (-ffu), 000173 (-ffo), 00007b (-ffx). (Default is to format fields as text.)
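For instance, a sketch of the expected behavior (untested) using the same value as the examples above:
echo 123.4567 | extract -ffd
should emit 000123 with the default precision of 6, while replacing -ffd with -ffx in the same pipeline should emit 00007b.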
-filebol <STRING> Writes <STRING> before the stream of data to the output file. See also -fileeol. (Default is not to write a string before the data stream.)
-fileeol <STRING> Writes <STRING> after the stream of data to the output file. See also -filebol. ( Default is not to send a string after the data stream.)
-fmt <format> Specify in great detail the format of the output line including the selection of multiple columns from each input line. <format> consists of any combination of STRINGs and FIELDs. (Default is to select a single column, which may be the entire input line.) When -fmt is specified the other command line options specify the default values for all column fields. Multiple column fields (indicated by [] brackets within <format> ) may be specified. Text strings containing any symbol, including escaped characters, may be introduced between column fields. See String Syntax for examples. [ and ] must be escaped in a string or they will be interpreted as the limits of a column field. Column fields contain zero or more options delimited by colons ( : ) followed by a mandatory range value. Characters [ and ] are not allowed within a column field but all other characters are, and escapes may be used to include colons. Arbitrary combinations of text strings and column fields may be employed, freely mixing token and character mode columns, and emitting columns in any order, including emitting a single column multiple times. Typically <format> must be quoted or escaped on the command line so that the shell does not mangle it before passing it into the program. When one or more consecutive FIELDs operate in token mode delimiters are emitted (subject to -dt, -dv, etc.) after each token until interrupted by a STRING, character or numeric FIELD, or the end of the line. A short example follows the list of field options and range sources below.
The options for a column field are: + = as_set, match command line specifications; p = default, match program defaults (overrides -pd , -jl , -cu , etc.); - = disable options. If employed as a single character it applies to all settings and must be the first option within a column field. As a suffix these may be applied singly to each of the -fmt options.
mt/mc/m-/mp/m+ token mode/character mode/disable/default/as_set. Also sets the delimit state in some instances to match the command line, but this may be overridden again by a subsequent :d*: clause in the same column field. (overrides -mt , -mc )
jl/jr/jc/j-/jp/j+ justify left/right/center/disable/default/as_set (overrides -j* )
trl/trr/trb/trc/tr-/trp/tr+ trim left/right/both/compress/disable/default/as_set (overrides -tr* )
cu/cl/cf/c-/cp/c+ case upper/lower/first/disable/default/as_set (overrides -c* )
bs/ba/b2/b-/bp/b+ backslashes apply(as needed)/all/all but first/disable/default/as_set (overrides -bs )
eccCHAR/eccp/ecc+ escape character is CHAR /default/as_set (overrides -ecc )
dt/dvN/d-/dp/d+ emit actual token delimiter / char N / disable / default / as_set. Restriction: the delimit character N must be escaped if it is a colon or a backslash, ie \: and \\. (overrides -d* )
de/de-/dep/de+ Process escapes/disable/default/as_set. (overrides -desc )
pd###/pd-/pdp/pd+ pad with ### spaces/disable/default/as_set (overrides -pd and -fw )
fw###/fw-/fwp/fw+ field width ### spaces/disable/default/as_set (overrides -pd and -fw )
fp###/fp-/fpp/fp+ floating point/integer precision ### spaces/disable/default/as_set (overrides -fp )
fff/ffe/ffd/ffu/ffo/ffx/ff-/ffp/ff+ floating point format to float/exponent/int/unsigned int/octal int/hex int/default/as_set (overrides -ffe, -fff, -ffd, -ffu, -ffo, and -ffx )
nz/nz-/nzp/nz+ print -0 as -0 for fff and ffe, otherwise print it as 0, enable/disable/default/as_set
psP/ps-/psp/ps+ Pattern substitution at output. The match/substitution is written directly to the output buffer so that the input buffer is not modified in any way. P = predefined pattern {0-9}. Pattern substitution is: From pattern P/disable/default/as_set Examples: ps or ps0 use pattern 0, ps3 uses pattern 3. (overrides -ps).
rsSTR/rs-/rsp/rs+ replacement string is STR /disable/default/as_set. Restriction: STR may not contain a colon. (overrides -rs )
rcdsSTR/rcdcSTR/rcd-/rcdp/rcd+ rcds string is STR /case insensitive STR /disable/default/as_set. Restriction: STR may not contain a colon. (overrides -rcds )
rcssSTR/rcs-/rcsp/rcs+ rcss string is STR /disable/default/as_set. Restriction: STR may not contain a colon. (overrides -rcss )
rtdsSTR/rtdcSTR/rtd-/rtdp/rtd+ rtds string is STR /case insensitive STR /disable/default/as_set. Restriction: STR may not contain a colon. (overrides -rtds )
rtssSTR/rtsvN/rts-/rtsp/rts+ rtss string is STR /variable N/disable/default/as_set. Restriction: STR may not contain a colon. (overrides -rtss )
sortas/sortds/sortac/sortdc/sortan/sortdn/sort-/sortp/sort+ sort methods ... /disable/default/as_set. (overrides -sortas etc. )
map$M/ump$M Map/UnMaP input token positions to output positions. Numeric variable $M holds a MAP. For map the token at output position i is that from input position $M[i]. For ump the token at output position $M[i] is that from input position i. There is no corresponding command line option. Overrides Range, setting it to [1,N,1], where N is the MAP size. A single field may not combine map,ump and sort modifiers.
Range The region to process, for instance [1,5] is the first 5 columns or tokens, depending on mode. See DEFINITIONS for the range syntax.
The default source is the input line, but variables may be used instead by specifying one of the following within the [] range specifier:
vABC...Z Variable values, in the order listed. Character or token mode. A variable may appear more than once on the list. The values are selected by the range value, the specified formatting applied, and then the values are concatenated.
=ABC...Z Variable token values interleaved, missing filled from -rs. TOKEN MODE only.
@ABC...Z Variable token values interleaved by token. TOKEN MODE only. If a variable’s Nth token does not exist it is replaced by its first/last token for positive/negative range increments.
v$VLIST, =$VLIST, or @$VLIST use numeric variables instead. Numeric and string variables cannot be combined in a single [] field.
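As promised above, a minimal -fmt sketch (untested; the input text is arbitrary):
echo 'alpha beta gamma' | extract -fmt '[mt:3] / [mt:1]'
The mt option puts each field in token mode, so this should emit gamma / alpha, that is, the third token, a literal " / ", and then the first token.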
-fp <precision> The precision for floating point formats. See -fff etc. (Default precision is 6.)
-fw <number_of_characters> <number_of_characters> specifies the field width. The input field is either padded or truncated as required. See also -pd. (Default is 0 - no change to field sizes.)
-h -help --help -? --?? Print the help message. (Default is not to print help message.)
-hexamples Print examples. (Default is not to print examples.)
-hfmt Print detailed -fmt help. (Default is not to print the -fmt help message.)
-hmath Print detailed information on math expression. (Default is not to print this help message.)
-hnd If embedded null characters are encountered in the input they are deleted. hnd is an acronym for "Handle Nulls Delete". See also -hnr,-hns,-hnsubs. (Default is -hnr )
-hnr If embedded null characters are encountered in the input they are retained. However, the appearance of such a null character is a fatal event since a string containing them cannot be further processed. hnr is an acronym for "Handle Nulls Retain". See also -hnd,-hns,-hnsubs. (Default)
-hns If embedded null characters are encountered in the input they are substituted with \255. hns is an acronym for "Handle Nulls Substitute". See also -hnr,-hnd,-hnsubs. (Default is -hnr. )
-hnsubs <CHAR> If embedded null characters are encountered in the input they are substituted with <CHAR>. hnsubs is an acronym for "Handle Nulls Substitute". See also -hnr,-hnd,-hns. (Default is -hnr. )
-hpat Print detailed pattern mode help. (Default is not to print this help.)
-hvar Print detailed variable usage help. (Default is not to print this help.)
-i Emit version, copyright, license and contact information. (Default is not to emit information.)
-if <tag> Conditionally operate on an input line. The syntax for <tag> is [!][^]string[$] , where: string is any text which may contain tab and numeric escapes as for -dl ; ^ string is located at the front of a line ; $ string is located at the end of a line ; ! invert logic - operate when string is not found. If neither ^ nor $ is present string may appear anywhere in a line. These special characters must be escaped when they are part of the string part of the expression: ^ , $ , ! , and \. Lines containing the <tag> are processed, other lines are just echoed to the output. Use ^$ to match an empty string and !^$ to match all nonempty strings. (An empty string is one containing no characters.) Command line interpreters may interfere with some of the special characters. If that occurs use decimal representations: \33 for ! , \94 for ^ , \36 for $. See also -ifonly. (Default is to process all lines within the specified row range.)
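For instance, a sketch (untested, file name hypothetical) that emits only the lines containing ERROR and discards the rest:
extract -in run.log -if ERROR -ifonly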
-ifbol When set those rows in an if block are emitted without the BOL string prefixed. This is used primarily to mark all rows other than those in the if block with a prefix tag. (Default is to emit the BOL string.)
-ifeol When set those rows in an if block are emitted without an EOL character. This may be used to compress multiple input lines within an if block into a single output line. (Default is to emit the EOL string.)
-ifn <N> Extends the condition set by -if for <N> more lines. May not be combined with -ifterm. (Default is not to extend the conditional processing.)
-ifonly When set only those rows satisfying -if and -ifn are emitted. (Default is to emit other lines unchanged.)
-ifnorestart Normally within an -if block each line is tested to see if it matches the -if <tag> and if it does the block is extended. This happens when either -ifn <N> or -ifterm <endtag> is also specified. If -ifnorestart is specified under these conditions lines within an existing -if block are not tested and so the block will not be "restarted". (Default is to restart.)
-ifterm <endtag> Extends the condition set by -if through the first line containing the <endtag>. The rules for processing the <endtag> are the same as for the -if <tag>. May not be combined with -ifn. When the tags are chosen so that the beginning -if and terminating -ifterm are not the same line use -iftermeol <STRING> to finish off the end of the if block. When these tags are the same the <endtag> really indicates the input line following the preceding if block. In this case use -iftermbol <STRING> to write a string between the two if blocks and do not use -iftermeol. (Default is not to extend conditional processing.)
-iftermbol <STRING> Writes <STRING> before the first character in the last line of an -if block. That line is determined by either -ifn <N> or -ifterm <endtag> or the end of the file. Primarily this is useful when -ifterm <endtag> and -if <tag> are the same and a separator needs to be written between consecutive if blocks. Only one <STRING> is written for each if block terminator no matter how many input lines the block contains. (Default value is an empty string.)
-iftermeol <STRING> Writes <STRING> after the last character in an -if block. The end of the block is determined from -ifn <N> , or -ifterm <endtag> , or if neither of these are specified, the first line not matching -if <tag> , or the end of the file. Only one <STRING> is written for each if block terminator no matter how many input lines the block contains. (Default value is an empty string.)
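As a sketch of how these options combine (untested, file name and tags hypothetical), the following should emit only the blocks running from a line starting with BEGIN through the next line starting with END:
extract -in notes.txt -if '^BEGIN' -ifterm '^END' -ifonly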
-in file1[,file2,file3,..fileN] Read input from one or more specified files in a comma delimited list. When reading from more than one file the lines from each are concatenated into a single input line in the order shown. Use -indl to delimit the substrings. The special file name - corresponds to stdin. Only a single input file may be read from stdin. See also -eqlen. The -h option displays the maximum number of input files. (Default is to read from stdin.)
-indl <StreamDelimit> When reading from more than one input file the string <StreamDelimit> is placed between each substring in the resultant final input string. (Default is an empty string - input strings are directly concatenated.)
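For example, a sketch (untested, file names hypothetical) that pastes the corresponding lines of two files together with a colon and a space between them:
extract -in first.txt,second.txt -indl ': '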
-is Modify the indicated character or token range "in situ" and emit them and the unmodified surrounding region. This option may not be used with -rm or -fmt. (Default is to emit only the selected character/token range.)
-jl -jc -jr Justify field left, center, or right. (Default is not to change justification.)
-ll Prefix each line of output with "line_length:". The line length is the number of characters in the final input line after reading a line from all input files and inserting delimiters. (Default is not to emit line lengths.)
-lt Prefix each line of output with "token_count:". The token count is the number of tokens in the final input line after reading a line from all input files and inserting delimiters. This value will be zero unless -mt is specified, or mt is used in a -fmt field. (Default is not to emit line token counts.)
-mc Process lines as character columns. See also -mt. (Default.)
-merge <N> Examine the <N> first characters in consecutive rows. If they are the same emit the <N> character prefix once and the remainder of each matching row in sequence as one new row. Use -mdl to place delimiters between these fragments. The comparison is case sensitive. Prefix based merging follows merging from multiple input files and precedes any if contingent operations. See also -unmerge. (Default is not to merge based on common prefix.)
-mdl <MergeDelimit> When -merge is set and consecutive rows are being concatenated introduce the string <MergeDelimit> between the fragments from each row. (Default is an empty string - input strings are directly concatenated.)
-mt Process lines as tokens. In this mode -sc , -ec , and -nc values refer to token numbers. If a single token is emitted then no delimiter is emitted with it. However, two or more tokens are emitted as:
token1 delim1 token2 delim2 token3 ... tokenN
Where: delim1 is the first delimiter following token1. Note that no terminal delimiter is added after the last token. This mode is appropriate when delimiters are white space. Add -s when every delimiter indicates a token and empty tokens are allowed. For instance, when reading spreadsheet data. See also -dl. (Default is -mc. )
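For instance, a sketch of the expected behavior (untested):
echo 'one two three' | extract -mt -sc 2 -ec 3
should emit two three, that is, tokens 2 and 3 separated by the delimiter which followed token 2.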
-n Prefix each line of output with: "line_number:". The line number is that line’s position in the input file. (Default is not to number input lines.)
-nc <number_of_columns> Number of columns to process starting from sc. Do not specify both -nc and -ec. (Default is to process all columns.)
-nr <number_of_rows> Number of text rows to process starting from sr. Do not specify both -nr and -er. (Default is to process all rows.)
-nz Print -0 as -0 in ffe and fff formats (Default = print as 0).
-opN <OPERATIONS> Run the OPERATIONS, all of which must be in the True branch. It is an error if any are present in the (unreachable) False branch. If N is specified the TARGET is a variable A-Z, otherwise the TARGET is the input. An unlimited number of these may be used on the command line via the -cmd, -eoc mechanism and incorporated into scripts. See DEFINITIONS for the syntax for OPERATIONS.
-out file0[,file1,...file9] Open up to 10 output files. Use "-" to direct one stream to stdout. Only scripts are able to direct output to streams 1-9. (Default is to write everything to stdout.)
-pd <number_of_characters> Specifies the <number_of_characters> (spaces) to be added to the right side of the field. When fields are processed they are padded, then justified, then the character cases adjusted. See also -fw. (Default is 0 - no padding.)
-pmN <PATTERN> <OPERATIONS> Pattern match. If N is specified the TARGET is a variable A-Z, otherwise the TARGET is the input. An unlimited number of these may be used on the command line via the -cmd, -eoc mechanism and incorporated into scripts. If PATTERN matches then the True branch of OPERATIONS executes, otherwise, the False branch executes. See DEFINITIONS for the syntax for PATTERN and OPERATIONS.
-poN <parse_options> Create up to 10 sets of parse options (for N = 0-9), which determine the rules for finding tokens within a string. The first is applied to the input line if tokens are needed. All may be used to parse strings stored in variables by using the po#VLIST operation in an -op/-pm/-pmN statement (see OPERATIONS). The first one may be set with -dq/-dqs/-esc/-s/-dl/-mdl on the command line. It may also be set using <parse_options>, which is a colon delimited string of parsing specifiers. The other parse option groups must be entered using colon delimited specifiers. Multiple -po options must be entered sequentially in ascending order. -po is synonymous with -po0. The syntax for the specifiers is:
dq,dqs,dq-,dqp,dq+ modifies parsing of double quoted text (like -dq, -dqs ): respect, respect & strip, disable, default, as_set
sd,sd-,sdp,sd+ modifies parsing of delimiter runs (like -s ): token for: each, run, default, as_set
esc,esc-,escp,esc+ modifies parsing of escape sequences (like -esc ): process escapes: yes, no, default, as_set
dlvSTR,dlp,dl+ sets delimiter string (like -dl ): becomes STR, default, as_set. (There is no dl- because tokens cannot be parsed without delimiters.)
mdlvSTR,mdlp,mdl+ sets merge delimiter string (like -mdl ): becomes STR, default, as_set. mdl- is forbidden.
Example: -po3 ’dq:dlv\t’
Parse options for group 3 are: respect double quotes and the only delimiters are tabs.
-ps# <PATTERN> <REPLACE> Pattern match and then substitute during output. The TARGET is the input line. Up to 10 of these may be used on the command line but they must be numbered sequentially starting from 0. -ps is equivalent to -ps0. If no -fmt is present the substitutions will be attempted in the order specified. If these numbered -ps statements are referenced in a -fmt field ([]) they may be in any order. See DEFINITIONS, PATTERNS, and REPLACEMENTS for syntax information.
-psN <PATTERN> <REPLACE> Pattern match and then substitute in a script. N specifies that the TARGET is a variable A-Z. An unlimited number of these may be used on the command line via the -cmd, -eoc mechanism and incorporated into scripts. See DEFINITIONS, PATTERNS, and REPLACEMENTS for syntax information.
-rcdc <RCDS_STRING> Case insensitive form of -rcds
-rcds <RCDS_STRING> Remove from the output any characters found in the string <RCDS_STRING>. If that string begins with ! only those characters which match will be retained. This option may be combined with -rcss to induce substitution instead of deletion. rcds is an acronym for "Replace Character Delete String". (Default is to emit all characters without filtering.)
-rcss <RCSS_STRING> When a character matches one in <RCDS_STRING> it is substituted with the character at the same position in <RCSS_STRING>. These two strings must be the same length. When substituting, a ! in <RCDS_STRING> has no special meaning. rcss is an acronym for "Replace Character Substitute String". (Default is to emit all characters without filtering.)
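For instance, a sketch of the expected behavior (untested):
echo abcabc | extract -rcds ab -rcss AB
should emit ABcABc, with each a replaced by A and each b by B.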
-rm Remove the selected character columns/tokens instead of emitting them. This option may not be used with -is or -fmt. (Default is to emit only the selected character/token range.)
-rs <replacement_string> <replacement_string> substitutes for empty fields. Typically employed to insert NA or 0 in a tab delimited file which left unspecified values as empty fields. Note, a colon ( : ) is used to delimit fields filled with <replacement_string>. Use -dv to change this. (Default is to leave empty fields empty.)
-rtdc <RTDC_STRING> Case insensitive form of -rtds
-rtds <RTDS_STRING> Remove from the input string the text contained in <RTDS_STRING>. Multiple instances, if present, will be removed. This option may be combined with -rtss to induce substitution instead of deletion. rtds is an acronym for "Replace Text Delete String". (Default is to emit all text without replacement.)
-rtss <RTSS_STRING> When a part of a line of text matches <RTDS_STRING> it is substituted with <RTSS_STRING>. These two strings need not be the same length. rtss is an acronym for "Replace Text Substitute String". (Default is to emit all text without replacement.)
-rtsvN When a part of a line of text matches <RTDS_STRING> it is substituted with <variable N>. These two strings need not be the same length. rtsv is an acronym for "Replace Text Substitute Variable". (Default is to emit all text without replacement.)
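For instance, a sketch of the expected -rtds/-rtss behavior (untested):
echo 'foo bar foo' | extract -rtds foo -rtss baz
should emit baz bar baz.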
-s Emit a token for each delimiter encountered. When -s is specified tokens may consist of empty strings. This mode is for use with delimited data as from a spreadsheet. (Default is to emit one token for each run of delimiters.)
-sc <start_column> The first character column to select. Columns are numbered from 1. Negative values are allowed and represent columns measured from the end of the line, where -1 is the last column. (Default is 1, the first column.)
-sect <section> Defines a section of a script stored in a -cmd file. section may be Common, Before, Main, or After (not case sensitive). Script lines placed in Common only execute when called from another section - this is where functions should be placed that are used in all other sections. Any function defined in Common must be referenced from every other defined section or it triggers a "not used" error. The three other sections execute at different times. Before executes once before the program enters the input loop, Main executes once for each line during the input loop, and After executes once after the loop. This allows for set up, run, and tear down sections in a script. If no sections are defined all -pm, -op, and -psN operations are placed in Main.
-sr <start_row> The first text row (line of text) to process. Rows are numbered from 1. (Default is 1, the first row.)
-sortac -sortas -sortan -sortdc -sortds -sortdn Sort tokens within an output field before any other formatting. Requires -mt on the command line or mt in the -fmt [] field to generate the tokens. Only fields with more than one token are sorted. The a or d following -sort specifies Ascending or Descending order. The s, c, or n after that specifies the type of token to sort: case sensitive string, case insensitive string, or numbers. Tokens with the same collating values, which are numeric (1e-1 and 0.1) or case insensitive strings (x and X), will sort in arbitrary order. A long string that begins with all the characters of a short string has a larger collating value than the short string. (Default is no sort.)
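For instance, a sketch of the expected behavior (untested):
echo 'pear apple banana' | extract -mt -sortas
should emit apple banana pear.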
-template <N> Template match two files. This is used to fill in the holes in a column of a table if all of the rows are known. Use -in <template,file> to specify which is the <template> (the first) and which is the <file> to compare to it (the second.) The contents of the two files must be in the same order (for instance, sorted, but any order is ok). The <file> may contain a subset of the rows present in the <template>. It may not contain any rows not present in the <template>. Compare the first <N> characters in a case sensitive manner and if they are the same pass the row from the <file> into the program. If they are different this indicates a "hole" in the file. Instead, pass the first <N> characters from the <template> followed by the string specified by -indl. Normally this would be set to something like "NA", to indicate the presence of the hole. When -indl is not specified on the command line, then the entire template line will be used. -template is incompatible with -merge. It may be used with -eqlen to verify that all expected rows are present. It is strongly suggested that the data in the first <N> columns of both files be justified and padded with spaces - otherwise "AB" will not match "AB data" for <N> = 4. When a template is compared to a file the first blank line in each will act as an end of file. (Default is no template processing.)
-trl -trr -trb -trc Trim out whitespace (spaces and tabs) in the field on the left, right, or both sides. Internal whitespace is not affected. -trc eliminates white space on both ends and compresses runs of internal whitespace to a single space. (Default is to leave the whitespace as is.)
-unmerge <N> Take a line consisting of multiple tokens and treat it as several lines, each beginning with the same first token, and containing sequential groups of <N> tokens, until all are consumed. Token delimiters are from -mdl , or if that isn’t specified, -dl. Character column data must be converted to token (delimited) data before it may be processed with -unmerge. Each line number emitted by -n when -unmerge is active derives from the original input line. If an input line is unmerged into four lines each will have the same line number. See also -merge. (Default is not to unmerge.)
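For instance, a sketch of the expected behavior (untested):
echo 'id a 1 b 2' | extract -mt -unmerge 2
should emit two lines along the lines of id a 1 and id b 2, each beginning with the shared first token.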
-vN <String> Create a DEFINED variable or initialize a SET variable. N is a single letter A-Z. If String contains a FIELD the former occurs, otherwise the latter. See DEFINITIONS for more information.
-wl <widest_line> Widest input line in characters. (Default is 16000 characters.)
-xc <maXimum_Columns> Initial maximum number of column fields ([]) in -fmt. More space is automatically allocated as needed. (Default is 128 fields.)
-xe <maXimum_dEscriptions> Initial maximum number of static descriptions (not []) in -fmt. More space is automatically allocated as needed. (Default is 128 descriptions.)
Processing modes: mc,mt
Input/Output: in,out,cmd,eoc,b
Unconditional row/column limits: sc,ec,nc,sr,er,nr,all,rm,s
Unconditional begin/end strings: filebol,fileeol,eol,bol
Input processing: merge,unmerge,template,indl,mdl,eqlen,crok,hnr,hnd,hns,hnsubs
Conditional output: if,ifnorestart,ifn,ifterm,ifonly,iftermeol,iftermbol,ifeol,ifbol
Delimiter based parsing of input into tokens: dl,s,dq,dqs,esc,po
Pattern based token generation, complex logic, scripting : v,sect,pm,op,ps
Output field processing: fmt, dt,dv,d-, pd,fw,fp, fff,ffe,ffd,ffu,ffo,ffx, nz, jl,jr,jc, trl,trr,trb,trc, cu,cf,cl, bs,ba,b2, ecc, sortas,sortds,sortac,sortdc,sortan,sortdn, rs,rcds,rcdc,rcss,rtds,rtdc,rtss,rtsv, ps
Output line processing: n,ll,lt
Data size allocation: xc,xe,wl
Debugging: dbg,dbgv,dbgp,dbgm,dbgs,dbga
Help and information: h,hfmt,hmath,hpat,hvar,hexamples,i
These terms are used throughout this document:
COLUMN NUMBER is a signed integer which specifies a particular character in the TARGET. If the COLUMN NUMBER is a positive number N then it is the Nth column. If the COLUMN NUMBER is a negative number N and there are M characters total, then it is the M + N + 1 character. That is, -1 is the last character, 1 is the first character.
FIELD is a colon delimited list of formatting options followed by a RANGE all of which is enclosed in ’[]’. FIELDS specify a range of columns or tokens to extract as well as some processing options to be applied to this data. These are used by -fmt and -vN as well as some of the scripting operations. See -fmt for the list of options which may be included in a FIELD.
INPUT BUFFER holds the line(s) read and possibly merged from the input file(s) on each cycle of the main loop.
PATTERN is defined by a single specifier and determines what is to be matched within a TARGET, or which properties of a STRING or NUMERIC variable are to be tested, in a -pm/-pmN/-psN command. See PATTERNS for more information.
RANGE is the mandatory final entry in a FIELD. The RANGE consists of up to 3 integers [Column/Start, End, Increment] in one of these forms: [] [,] [c] [s,e] [s,] [,e] [s,,i] [,e,i] [,,i]. Defaults for implied range values are: [First, Last, 1]. In token mode Start and End are TOKEN NUMBERS, in column mode they are COLUMN NUMBERS. If Start and End have the same sign then it is an error for Start > End. For mixed signs the range may be empty for some line lengths, but this is not an error. (Example: [3,-3] for lines <5 characters long.) The increment value may be anything other than zero. Increments only function in token mode - in character mode they are ignored. The range [,,-1] emits all tokens in a line in reverse order. The map and ump field modifiers (see -fmt) override the explicit RANGE and replace it with [1,N], where N is the number of terms in the MAP. Alternatively, a range may be specified by the contents of a NUMERIC variable as [NUMERIC variable, index]: [$V] [$V,idx]. In the NUMERIC variable forms a c/s,e,i triplet is retrieved from the 3 elements starting at the index. If the index is omitted retrieval starts at the first element.
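As an illustration of RANGE behavior (an untested sketch):
echo 'a b c d e' | extract -mt -fmt '[2,4]'
should emit b c d, while the range [,,-1] in the same field, as noted above, should emit e d c b a.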
MAP is an array of TOKEN NUMBERS stored in a NUMERIC VARIABLE that is used to map/unmap input positions to output positions. A MAP is valid for a map operation if it contains any combination of the positions {1,2...N}. A MAP is valid for an unmap (ump) operation if it contains any permutation of those positions. A MAP may contain any combination of negative and positive TOKEN NUMBERS that satisfies these positional constraints. Maps may be generated by using the functions idx, six, tix, and tcx in Math Expressions.
REPLACE is one or more text strings from one or more sources. These are used to replace pattern matches sequentially in a -ps/-psN operation. See REPLACEMENTS for more information.
RESULT is the final number calculated by Math Expressions. It may be tested with the -pm ’?’ operator or assigned to the SWITCH INDEX with the ’ir’ operator.
ROW NUMBER is an unsigned integer indicating a particular input line. The first line is 1.
SUBSTRING LIST holds the results of match operations. The list consists of zero or more (start,end) pairs which indicate which strings in the target were matched. There is only one such list. PATTERN match operations may clear it or append to it. The substrings corresponding to the list entries may be appended to one or more VARIABLEs using OPERATIONS. The list values may also be stored in a NUMERIC variable with the sls() MATH EXPRESSION function.
STRING is a simple text string. It may contain escaped characters.
SWITCH INDEX holds an integer which is read by the sw (switch) operator, which jumps to the appropriate case, and then clears the index. There is only one such index and its value is set by certain OPERATIONS.
TARGET is the INPUT BUFFER for -op, -pm and -ps ; STRING VARIABLE N for -opN, -pmN and -psN; or an entry in a VLIST.
TOKEN is a substring in a TARGET. When tokens are created using -po in a TARGET the process is conservative. That is, the token information is added but the original string in the TARGET is unchanged. This is why -dqs appears to remove surrounding double quotes, but -esc leaves escape characters. TOKENs can be zero length or empty strings. They also have a delimiter value. When tokens are generated by -po, (which is done automatically if -mt is specified on the command line or there is an mt in any FIELD ) this is the character immediately following the token. When TOKENs are created from a SUBSTRING LIST following pattern matches, or when the last character in the token is the final character in the TARGET, the delimiter is arbitrarily set to "\0", which effectively means "undefined".
TOKEN NUMBER is a signed integer which specifies a particular token in the TARGET. If the TOKEN NUMBER is a positive number N then it is the Nth token. If the TOKEN NUMBER is a negative number N and there are M tokens total, then it is the M + N + 1 token. That is, -1 is the last token, 1 is the first token.
TOKEN LIST is a list of TOKEN NUMBERs generated by the PATTERN test t: and used by the corresponding REPLACE operator t:. Each TOKEN NUMBER is from the matching query TOKEN. The SUBSTRING LIST is also produced. The token list indicates which queries matched (in order) while the substring list indicates the string which they matched in the target. A token list may be stored in a NUMERIC variable with the tls() MATH EXPRESSION function.
VARIABLE is one storage area from one of two sets of 26 storage areas named A-Z and $A-$Z (not case sensitive). The former contains STRING variables and the latter NUMERIC variables. Both STRING and NUMERIC variables may be used in a -fmt or -vN statement via the vN, =N, and @N FIELD options. To indicate the use of NUMERIC variables these are written instead as v$N, =$N, and @$N. See -fmt for more information. Only NUMERIC variables may be used to the left of an assignment in a math expression.
STRING VARIABLES may be undefined or hold a text string and a TOKEN representation of that string. STRING variables may be in one of three states: CLEAR = undefined, EMPTY = defined but holding only an empty string, or SET = holding some characters. There are in addition two types of STRING variables: DEFINED and SET. A DEFINED variable is very much like a -fmt statement. It contains a description of the TARGET(s) including one or more fields. The definition is created with a -v statement, and data is entered into it with a vN operation in an -op or -pm statement. A SET variable may be initialized with a -v statement that contains no FIELDs. It may be cleared or appended to by various OPERATIONS. (Note: there is no explicit set operation; to set a variable, clear it and then append.)
NUMERIC VARIABLES contain one or more double precision numbers. Initially they consist of a single element with a value preset to zero. Additional elements may be added so that each NUMERIC variable may be used as an array. RANGE syntax is similar to that for STRING variables, with the first element being 1 and the last being -1. In math expressions $N[] is a synonym for $N[1,-1], both mean "the entire array".
VLIST is a list of single letter VARIABLES, like ’ABC’, it must have at least one entry. In some contexts, it may include ’-’, which means, again depending on the context, the INPUT BUFFER, the RESULT, or "none". In most cases entries may be repeated, like ’ABBA’.
OPERATIONS is a list of actions which are performed conditionally (-pm) or unconditionally (-op). Scripts are constructed from a series of statements, each of which contains one or more operations. See the OPERATIONS section for more information.
Text strings which appear in the -fmt, -v, -rs, -dl, or -dv options are subject to the following substitutions:
\\ -> \
\n -> LF character
\r -> CR character
\t -> tab
[[ -> [
\[ -> [
]] -> ]
\] -> ]
\12 -> character whose value is 12 (values 1-255 only)
\1200 -> 1200 (because number was not in the allowed range)
\anything_else -> anything_else
When \ is the last character on a line it does not escape the line terminator and it is emitted. So -fmt ’[1] \’ will emit lines ending with \.
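For instance, a sketch (untested) that uses the escaped bracket forms to wrap each whole input line in literal brackets:
echo hello | extract -fmt '\[[1,-1]\]'
should emit [hello].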
The pattern specifier indicates which string or regular expression in PCRE syntax is the query for a match or substitution. (For PCRE information see http://www.pcre.org/pcre.txt) The results of a match are placed on the SUBSTRING LIST and correspond to different parts of, or repeats of, a PATTERN. For PCRE pattern "(too)l(box)" matching the string "toolbox" would make the list: "toolbox", "toolbox", "too", and "box", where the substrings are not strictly sequential. Results from other pattern types will always be sequential. Pattern specifiers may also define other types of tests which do not result in changes to the SUBSTRING LIST.
Specifiers have the format BASE[MODIFIERS]:[WHAT] which describes what they are trying to match or test in the TARGET. MODIFIERS follow the BASE and precede the colon delimiter. The WHAT string cannot be blank.
BASE:WHAT that (may) create or modify the SUBSTRING LIST:
p:RE (PCRE regular expression) RE
s:STRING (string) STRING
v:VLIST (string(s)) any of the listed variable(s)
n#:VLIST (string(s)) token # in any of the listed variable(s). # of 0 uses RESULT for the TOKEN NUMBER (see Math Expressions).
t:VLIST (string(s)) any token(s) in the listed variable(s)
BASE:WHAT that do not create or modify the SUBSTRING LIST:
m: always true. (Normally one would use -op instead of -pm ’m:’)
?:VLIST Test for existence of TARGET (input, variable, or token). True if all in list are defined.
?CMP:VLIST Test the value or status of NUMERIC variables in VLIST. True if all CMP comparisons are true. CMP values are listed in MODIFIERS below.
i#: True if input stream # is open, False otherwise. Streams are 1->N, N from the number of entries with -in. # of 0 uses RESULT for the stream number (see Math Expressions). These are equivalent: i: and i1:.
MODIFIERS:
c case invariant (default is case sensitive)
g global (repeated search).
/#[/#...] (PCRE only.) only keep the substring(s) indicated by the number. For -[v]ps and complex patterns /1 will often be needed so that the outer expression is substituted. Numbers must be in the range 1-255 and be in increasing order.
r remainder (search resumes in same target after previous match)
^ (not for PCRE) starts at first character
$ (not for PCRE) ends at last character
MODIFIERS that determine the action when there are multiple search strings:
q seQuential search. Each string searches the remainder of the target. True if all strings are found.
f test until First string matches.
< test all, nearest to the start of the target matches [DEFAULT].
> test all, nearest to the end of the target matches.
Modifiers for Existence tests:
t test tokens (instead of variables)
# Without t, minimum variable size. If 0, RESULT specifies. With t, token number to test. If 0, RESULT specifies.
MODIFIERS (CMP operations) for math tests, one and only one must be specified:
? Value is normal (not NaN or +/-inf)
= Value is zero.
< Value is < zero.
> Value is > zero.
# Element number to test. If 0, RESULT specifies.
PATTERN Examples:
s:Fred  Fred (anywhere in the string)
sc^$:Fred  A string containing only Fred (or fred or FRED etc.)
v:ABC  Matches ...<vA>.. or ..<vB>.. or ..<vC>..
vq:ABC  Matches ...<vA>...<vB>...<vC>..
vqr:ABC  Matches ...<PREVIOUSMATCH>...<vA>...<vB>...<vC>.. but not ...<vA>...<PREVIOUSMATCH>...<vB>...<vC>..
p:Fred  Fred
p:(?i)Fred  Fred, fred, FRED etc. Better to use the "c" modifier.
p:^Fred$  A line containing just Fred and no other characters
p:(?i)(Fred).*(Ginger)  Matches: ... fred ... ginger ... This creates a list of 3 matches: the whole pattern, fred, and ginger, which may be assigned to variables using OPERATIONS.
?:-  Always true (input buffer always exists).
?e:-  True if the input buffer is empty.
?t3:-  True if -mt and >=3 tokens were parsed from the input buffer.
?:ABC  True if A,B, and C are all defined (not cleared).
?t-4:ABC  True if TOKEN NUMBER -4 exists in A,B,and C.
?te3:ABC  True if TOKEN NUMBER 3 exists and is empty in A,B,and C.
?<:ABC  True if all elements in NUMERIC variables A,B,C are negative.
?>3:ABC  True if element 3 in NUMERIC variables A,B,C is positive.
??:ABC  True if all elements in NUMERIC variables A,B,C are normal numbers.
?>:-  True if RESULT is greater than zero. [May be used to test for "token 3 of A is at least 5 characters long" by preceding with -op "? tln(a[3])-4". (see Math Expressions).]
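To experiment with a pattern interactively, the -dbgp technique shown under that option can be reused; a sketch (untested):
echo toolbox | extract -pm 'p:(too)l(box)' -dbgp
This should display the raw substrings placed on the SUBSTRING LIST by the match.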
Replacement specifications describe the source of the strings which are used to replace each match in a -ps/-psN command. If all of the strings in the specification are used replacement starts over again with the first one. The syntax is BASE:WHAT as shown below.
s:STRING replace with (string) STRING
v:VLIST replace with (string(s)) from the variable(s)
n#:VLIST replace with (string(s)) in token # from the variable(s). # of 0 uses RESULT for the TOKEN NUMBER (see Math Expressions).
t:VLIST replace with (string(s)) in all token(s) from the variable. If a TOKEN NUMBER LIST was generated by the t: PATTERN specifier replacement will be in the order specified by that list. This provides a way to do multiple defined replacements with one operation. Side effect: the TOKEN NUMBER LIST is consumed.
REPLACEMENTS Examples:
-psA "sc:Fred" "v:BC"  If variable A is "Fred1fred2fRed3" and variables B,C are respectively "one" and "two", then after the operation the value of variable A will be "one1two2one3" while the values of variables B,C will be unchanged.
-psA "sc:Fred" "t:B"  If variable A is "Fred1fred2fRed3FRED4" and variable B consists of 3 tokens "one","two", and "three", then after the operation the value of variable A will be "one1two2three3one4" while the values of variable B will be unchanged.
[Substitutions using the four different multiple string query modes when variable C is "12331233", variable A has tokens {1,2,3}, and variable B has tokens {A,B,C}.]
-psC "tQg:A" "t:B"  C becomes "ABC3ABC3".
-psC "tFg:A" "t:B"  C becomes "A233ABCC".
-psC "t<g:A" "t:B"  C becomes "ABCCABCC". This is the default mode.
-psC "t>g:A" "t:B"  C becomes "12C31BCC".
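A simple command line sketch of output substitution (untested):
echo 'Fred and Ginger' | extract -ps 's:Fred' 's:Barney'
should emit Barney and Ginger.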
Operations are presented in a semicolon separated list containing one or more of the action elements listed below. Elements may be preceded by spaces but must be immediately followed by a delimiter (";" or end of token). Whenever a -pm or -op is executed its operations are run before the next pattern match is attempted. OPERATIONS are written as a single list, but that list is logically divided into True and False sections by a "!" operation, with each part of the list executing conditionally on the match or test. Unless otherwise specified, in this section all variables are STRING variables.
logic control:
! Elements in an operations list before this execute when the match or test is true, after it, when it is false. One or the other set may be empty, which means do nothing for that condition. May not be used with an -op.
pattern match interpretation:
%COND COND is any combination of {BHIA}. These modify the SUBSTRING LIST and affect how the list entries are assigned to tokens by the ">" operator. Default is to assign the list as is. By specifying Before (first hit), Hit, Interior (between hits), After (last hit) these other substrings are derived and added to the substring list. If a specified interpolated substring does not exist, for instance, no Before because the first hit is at the first character, an empty token is created. Use with care for PCRE matches, which may not be sequential! Should only be used once for each set of operations! ALL is equivalent to BHIA. Defaults to H (use pattern matches as is).
variable and input/output (all except v must be applied to SET variables [see DEFINITIONS]):
xVLIST Clear the listed Set variables (they may not be a TARGET again until set).
x#VLIST Remove from TOKEN NUMBER # to the last token. Token number at or before the first token clears all. Token number after the last token clears none. # of 0 uses RESULT for the TOKEN NUMBER (see Math Expressions). These are equivalent: -x1A and -xA.
vVLIST Apply the -vN definitions of the variables to the TARGET.
vVLIST<N Like the preceding, but default input is from vN. VLIST should not include vN. (These are equivalent: -pmA "m:" "vB" and -pm "m:" "vB<A".)
|VLIST Append a new line (\n) to any variables in VLIST that are not empty. Also change the last token’s delimiter value to 0.
>VLIST Append to one or more SET variables from the SUBSTRING LIST, using one SUBSTRING LIST entry for each variable in the VLIST. Variables are A->Z or -, - means ignore that part of the match. Variables are processed in the order listed. If the number of substrings is longer than the list the list is processed again. The SUBSTRING LIST is not consumed. Examples, for 7 entries: >ABC appends 1,4,7 to A, 2,5 to B, 3,6 to C. >A-A appends 1,3,4,6,7 to A. >A appends 1,2,3,4,5,6,7 to A.
+N Append the entire TARGET to variable N (extends the string, adds token[s] corresponding to the extended region). For any of the append operations, if TARGET is the input line an -mt or [:mt:] is needed or no tokens will have been generated.
+N<=STR Append STR to vN (extends the string, adds one token). STR may include \n and similar escape sequences.
+N<[FMT] Append [FMT] to variable N (extends the string, adds tokens). See -fmt for [] options. Example: +A<[mt:1,3] adds 3 tokens.
+N<VLIST Append all of each variable in VLIST to vN. A "-" in VLIST appends from the input. (These are equivalent: -opA "+B" and -op "+B<A".)
+N<# Append TOKEN NUMBER # from TARGET to N. # of 0 uses RESULT for the TOKEN NUMBER (see Math Expressions).
+N<#VLIST Like +N<# but TARGET is VLIST. A "-" for input is allowed in VLIST. (These are equivalent: -opA "+B<1" and -op "+B<1A".)
?EXPR Run the math expression EXPR. See MATH EXPRESSIONS for syntax.
mVLIST If a variable is SET, merge all tokens into one. Otherwise, do nothing. (This is the inverse of the po operation, see below.)
m#VLIST If a variable is set, merge from TOKEN NUMBER # to the end. Token number at or before the first token merges all. Token number at or after the last token merges none. # of 0 uses RESULT for the TOKEN NUMBER (see Math Expressions). These are equivalent: -m1A and -mA.
># Read the next line from input stream # into the INPUT BUFFER. # is 1->N, where N is the number of files in the -in statement. Safe in -sect Before and After, but not recommended for -sect Main, where there will be complex interactions with the built in read loop. # of 0 uses RESULT for the input stream (see Math Expressions). These are equivalent: > and >1.
<#VLIST Emit each variable in VLIST to the # output stream. Output streams match -out entries, counting from 0. The number may be omitted for the default (0) output stream. A "-" in VLIST emits the input.
<#=STR Emit STR to the # output stream.
<#[FMT] Emit FMT to the # output stream. See -fmt for [] options.
~VLIST Dump each STRING variable in VLIST (for debugging).
~$VLIST Dump each NUMERIC variable in VLIST (for debugging).
po#VLIST Make tokens for each variable in VLIST using the parse option set # (0-9) created with -po. poA is equivalent to po0A.
Conditional output control (These apply only in -sect Main): [ -fmt is applied when -if is true, see also -ifonly ]
if1 Equivalent to a true -if statement (without -ifn or -ifterm).
ifc Equivalent to a true -if statement (with -ifn).
if+ Equivalent to a true -if statement (with -ifterm).
if- Equivalent to a true -ifterm statement.
ifr Equivalent to an -ifnorestart statement. if+- is allowed. Also terminates ifc. if[c+]r is like -if -ifnorestart.
n# Equivalent to -ifN #.
afmt# Make -fmt# active for this input line. # is 0->9.
nfmt# Make -fmt# active next, after completing the current input.
hide Suppress -fmt and default output (overrides -if logic).
show Allow -fmt and default output (accept -if logic). [Default]
[hide and show "stick" once set, affecting all subsequent output.]
eof Close the input. (No more input results in no more output.)
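A small sketch of eof (data.txt is a placeholder; it assumes, as in the examples further on, that the s: pattern is a plain string match): stop reading at a marker line so that nothing beyond it is emitted:
% extract -in data.txt -sect Main -pm ’s:__END__’ ’eof’
Lines before the marker are echoed normally; once the input is closed no further lines are read, so none are emitted. Whether the marker line itself is still echoed depends on where in the per-line cycle the operation takes effect, so test before relying on either behavior.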
flow control (through the -op/pm/pmN/psN list): [By default statements execute sequentially in the order entered. Other patterns may be achieved using these operations. Automatic labels provide a way to specify jump targets without having to separately name each statement. However, getting there still requires using an explicit jump operation: "^" or "sw". Labels alone have no effect on flow control; use a ret or a brk to prevent flow from passing down to the next line.]
=LBL Label this -op/pm/pmN/psN command as LBL. The label may appear anywhere in the OPERATIONS list; however, the script is easiest to read if labels are first in the OPERATIONS list. LBL may not begin with digits, contain spaces or semicolons, or be a valid automatic label [ {,}{,},c,c#, or c#-# ]. Labels are not case sensitive. The special label MAIN may mark the script line where execution begins. If MAIN is not employed execution starts on the first line of the script.
&LBL Call the function at LBL. Recursion is allowed, but note that all variables have global scope. On returning from the call the remainder of the statement’s operations are processed.
{, }{, } Generate automatic nestable labels. May be used to create if/then/else, if/elseif/else, while and other logical structures without requiring explicit labels for each jump target. Braces must appear in an operation before any jumps that reference positions in this structure, else that jump will reference the enclosing structure. A "{" need not be a jump target, but all "}{" and "}" must be. Use -dbga to resolve related syntax issues.
c, c#, c#-# Generate (automatic) nestable case labels for a switch statement. The number of case labels must match the switch specification, "swC", see below. If present, each # is an integer from 0 -> C-1. c25 is a single case, c2-5 is a range of 4 cases. Without explicit numbering the "c" labels are numbered sequentially from 0. The numeric and nonnumeric forms may not be mixed in a single switch statement. Multiple case labels may be applied to the same statement. Case labels may be applied to -pm[N] statements.
im Set the SWITCH INDEX to the token number of the last "t:N" match (1->N). If the last test failed, the SWITCH INDEX is set to 0.
ip Set the SWITCH INDEX to the character position of the first match (1->N). If the last test failed, the SWITCH INDEX is set to 0.
i# Shift the SWITCH INDEX 1 bit left and set (#=1) or clear (#=0) the low bit.
ir Set the SWITCH INDEX with RESULT from the last math expression.
[The following terminate processing of a statement’s OPERATIONS.]
^LBL Jump to the -op/pm/pmN/psN with the corresponding label. Reserved labels are "{[{...]", "}{", and "}[}...]", which jump to the corresponding automatic label in this (single brace) or an enclosing (multiple braces) structure. That is, ^}} jumps to the terminating brace of the structure enclosing this one.
^# Skip forward # statements. ^0 is a one line loop, and ^1 is pointless, since the next line would execute next anyway.
swC Switch. Jump to the statement ([automatically] labeled as "cX"), where X is the value of the SWITCH INDEX, then clear that index. The number of cases is set by C. The value of X must be in the range 0 -> C-1.
ret Return from a function call.
brk Skip the remainder of the -op/pm/pmN/psN statements. The action that results depends on the section: in Main it is another read cycle, in Before it is a transition to the next section, and in After the program exits.
exit Exit the program with SUCCESS status.
fail Exit the program with FAILURE status.
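A minimal sketch of &LBL, ret, and the MAIN entry label (the label name Emit and the message text are arbitrary), assuming, as described above, that execution starts at the statement labeled MAIN rather than at the first statement:
-op "=Emit; <=hello\n; ret"
-op "=MAIN; &Emit; &Emit"
Execution begins at MAIN, which calls Emit twice, so "hello" is written twice for each pass through the section (once per input line if these statements are in -sect Main); the function body is only reached through the calls, since the entry point skips over it.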
abbreviations: In a -cmd file (only) an abbreviated syntax may be used:
Long   short
-pm    m
-pmN   mN
-psN   sN
-op    o
-opN   oN
Labels may be placed in a separate token at the beginning of the line - see the examples. It is legal to have some labels in separate tokens and others in the OPERATIONS string for the same statement. Terminal "}"s may be used without a following argument. Because tokens are delimited by spaces and EOL, "c4 c1;c3 (new line)c2" is equivalent to "c4;c1;c3;c2". A ’};’ may appear on an otherwise blank line to close a set of braces. The semicolon prevents the label from joining with the next statement, which would be invalid if it resulted in ’}’ and ’{’ appearing together. A trailing ’}’ at the end of a section does not require a semicolon, nor does one which precedes a statement with no brace labels.
Examples:
-pm "s:Match?" "+A; !;x1A"
If match/test is true, append TARGET to vA. If false, delete most recent token from vA.
-pm "s:Match?" "+A;po1A;+B<3A"
If match/test is true, append TARGET to vA, make tokens using po1, append the 3rd token which results to vB. If false, do nothing.
-pm "s:Match?" "if+; !; brk"
If match/test is true, set the IF state to "true until terminated", unless it is already true, in which case do nothing. If false, stop processing -op/pm/pmN/psN logic for this input line.
-pm "s:Match?" "=Label; &Func1;!;<= Some failure\n;fail"
Identify this line as "Label". If match/test is true call Func1. If false emit a failure message and exit.
-pm "s:Match?" "{; !; ^}}"
Generate an automatic label for this line. If match/test is true fall through, else jump outward in the nested braces one level to the enclosing final brace.
-pm "s:Fred" "i1; !;i0"
-pm "s:Ginger" "i1; !;i0"
-pm "s:Dance" "i1; !;i0"
-op "{; sw8"
-op "c; <=some action\n; ^}" ##case 0
-op "c;c;c; <=some action\n; ^}" ##cases 1,2,3
-op "c; <=some action\n; ^}" ##case 4
-op "c; <=some action\n; ^}" ##case 5
-op "c;c; <=some action\n; ^}" ##cases 6,7
-op "}"
Construct a switch index based on the results of three match/tests. An 8 case switch construct with single line cases; the labels are automatic but the cases are commented. The (automatic) labels are on the left side, the jumps (sw and ^) on the right.
-op "{; sw4"
-op "c0;c3; <=some action for 0,3\n; ^}"
-op "c1;c2; <=some action for 1,2\n; ^}"
-op "}"
A 4 case switch construct with single line cases; the explicit case labels are not in sequential order.
{ o "sw4"
c0;c3 o " <=some action for 0,3\n; ^}"
c1 c2 o " <=some action for 1,2\n; ^}"
};
The same 4 case switch construct using the abbreviated syntax. Note the semicolon following the close brace, which precludes collisions with a following line - because a label like ’};{;’ is forbidden.
-sect BEFORE o "hide; +A<=abcdefghijklmnopqrstuvwxyz1234567890"
-sect MAIN o "xB;+B<[mc:1]" ## put the first character in B
mA "vc:B" "" ## find B in A, case invariant
{ o "ip; sw37" ## position->index, then switch
c0 o "<=*; ^}"
c1-26 o "<=A; ^}"
c27-36 o "<=9; ^}"
}
-sect AFTER o "<=\n"
Classify the first letter of each input line as: A=alphabet, 9=number, *=other. Note the case ranges.
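A hedged sketch of how such a script might be run (script.cmd and data.txt are placeholder names; the assumption that -cmd takes the script file name as its argument should be checked against the -cmd option description):
% extract -cmd script.cmd -in data.txt
where script.cmd contains statements written in the abbreviated syntax shown above, one statement per line.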
Math expressions use an algebraic syntax to operate on either NUMERIC ($N) or STRING (N) variables. The latter are allowed when they contain only purely numeric text like ’12.34’; functions which merely measure a string’s properties do not have that constraint. When a STRING variable contains text that cannot be converted to a valid double precision number it is converted instead to ’Not a Number’, which prints as nan. The test -pm ’??’ may be used to detect this condition. These math expressions are also available in the separate program dmath.
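An illustrative fragment (hedged: it assumes the ’??’ test is true when the most recent expression produced nan, and that vA already holds the text to be converted):
-op ’? $N = A[1]’
-pm ’??’ ’<=warning: not a number\n’
If the first token of vA cannot be converted to a number, $N becomes nan and the second statement emits a warning; otherwise it does nothing.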
RESULT: is the final number calculated in an expression. It may be tested with the -pm ’?’ operator. RESULT may be loaded as $-, but $- cannot be assigned.
Variables: may either be NUMERIC ($A) or STRING (A, if holding purely numeric text like ’12.34’). Cells/Tokens may be indexed as 1->N (from start) or -1->-N (from end).
Scalar and Array math: Unless otherwise noted operators and functions will work with either scalar or array operands. If array operands are used the result will also be an array, with intermediate values stored in the leftmost array at each operation. In scalar math the RETURN value is meaningful, but it is not in array math. Except for a straight assignment, only NUMERIC variables may be used in Array math.
Operands:
12e-1,120,0xF0,0o77,0b1010 numbers in float, integer, hexadecimal, octal, or binary formats
$A NUMERIC variable ($A is the same as $A[1])
$A[12],$A[3,4],$A[] One element, range of elements, all elements
A All of STRING variable A
A[2],A[3,4],A[] One token, range of tokens, all tokens
Assignment:
$A[6] = 1+$B[$C[3]] to an element
$A[1,2]= $B[3,4] to a range of elements (number of elements must match)
$A[] = 3 to an entire array
Only NUMERIC variables may appear to the left of an ’=’ operator. Expressions may contain 0 or 1 ’=’ assignments.
Operators: val1 OP val2
+ addition
- subtraction
* multiplication
/ division
% remainder
^ power (val1 ^ val2)
? compare (returns 1,0,-1 if val1 >,==,< val2)
Functions(val):
log base 10 log
ln natural log
e10 10^val
ee e^val
chs change sign
abs absolute value
rnd round to nearest int
lid round away from zero to next integer
trc round towards zero to next integer
sin,asin sine, arc sine (angle in radians)
cos,acos cosine, arc cosine
tan,atan tangent, arc tangent
d2r degrees to radians
r2d radians to degrees
sinh,cosh,tanh hyperbolic sine, cosine, tangent
not bitwise not (unsigned integer)
Functions(val1,val2):
max maximum
min minimum
and,or,xor bitwise and, or, xor (unsigned integer)
shl,shr bitwise shift left/right (unsigned integer)
Functions($A[range]) [scalar results only]:
len number of cells
sum sum of cells
sm2 sum of squares of cells
inv invert order of cells in range. Returns 0.
del delete cells in range. Returns elements remaining. (If all deleted, variable is reset to one element with value zero.)
idx replace elements with their array positions (1-N). Returns 0.
srt sort elements into ascending order. Returns 0.
six replace elements with the positions they would occupy if sorted into ascending order. I.e. {5,10,-21} -> {2,3,1}. Returns 0.
nml test for normal numbers. 0=all elements normal, 1=at least one infinite, 2=at least one NaN, 3=some infinite and some NaN
tls (re)dim the variable and store the token list from the last t: pattern match. Returns the size of the TOKEN NUMBER LIST. The TOKEN NUMBER LIST is consumed.
sls (re)dim the variable and store the SUBSTRING LIST (start/end pairs) from the last pattern match. Returns the number of matches in the SUBSTRING LIST (1/2 the list size). The SUBSTRING LIST is not altered.
Functions($A[],value) [scalar results only]:
dim (Re)size $A to value entries. New elements = 0.0. Returns 0.
Functions($A[],val1,val2,...) [scalar results only]:
cat Add values as new elements to $A, returns new length.
ini (Re)initialize $A with values as elements, returns (new) length.
Functions($A[],$B[]) [array results only] Rearrange array contents. RETURN value is not meaningful.
map $A[i] = $A[$B[i]] for all i in the range.
ump $A[$B[i]] = $A[i] for all i in the range (unmap).
Functions(A[range]) [scalar results only]:
tok the number of tokens.
tln the sum of the token lengths.
sln the length of the entire string (range values are ignored).
Functions($A[range],B[range]) [scalar results only]:
tix make a MAP in the indicated range in $A that corresponds to the TOKENS in the indicated range in STRING variable B sorted into ascending order. Case sensitive. The resulting MAP will have all positive values. Returns 0.
tcx case invariant form of tix.
Operator Precedence: ^ > */% > +- > ? > (), > =
Examples:
-op ’? (5/ee(5)) + $A[-1]’
RESULT=(5 divided by e^5) + contents of last element in array $A
-op ’? $A = log(3+$B[6])’
RESULT=(expression), and is also stored in $A[1]
-op ’?$A[-1] = A[-2]’
RESULT=(2nd to last token from A), also stored in the last cell of $A. If that token could not be converted to a valid number the RESULT and stored value are nan (not a number).
-op ’? dim($A[],5); ? idx($A[]); ? sum(log($A[]))’
RESULT=sum of the logs of 1->5. log(1)->log(5) are stored in $A elements 1->5.
-op ’? $B=1; ? $A=10’
-pm ’?>:A’ ’? $B=$B*$A; ? $A=$A-1; ^0;’
RESULT=10 factorial. Also stored in $B. Note the conditional single line loop, which jumps to itself (^0) while $A is greater than zero.
-op ’log(max($A[1,4]*2.1,$B[3,6]))’
RESULT=(not meaningful). Multiply all elements in array $A[1,4] by 2.1 and store in place. Take the maximum with the corresponding elements (1->3, 2->4..4->6) of $B and store back into $A. Then take the log of each element in the range and store that in place too. Only elements 1->4 of $A are modified.
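As a further hedged sketch combining the string and math operations above (table.txt is a placeholder, and ~$S is a debugging dump, so treat the output format as approximate), the second whitespace delimited column of a file might be summed like this:
% extract -mt -in table.txt -sect Before -op ’hide; ? $S=0’ -sect Main -op ’+A<2; ? $S = $S + A[-1]’ -sect After -op ’~$S’
Each pass appends token 2 of the input line to vA and adds the newest token, A[-1], to $S. Note that vA grows by one token per line; prepend an xA operation in the Main section if that is a concern.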
% extract -h List the command line options.
% cat file | extract -sr 1 Echo all text from stdin to stdout. (Specifying any one command line option with its default value will do the same.)
% extract -sc 50 <infile.txt >outfile.txt Extract characters 50 to end of row for every line in infile.txt and write them to outfile.txt.
% extract -sr 4 -sc 5 -ec 10 <infile.txt >outfile.txt Extract characters 5-10 from rows 4 to end of infile.txt and write them to outfile.txt.
% extract -sc 5 -nc 10 <infile.txt >outfile.txt Extract characters 5-14 from all rows in infile.txt and write them to outfile.txt.
% extract -sc 2 -ec 3 -mt -dl ’:,;’ <infile.txt >outfile.txt Extract the 2nd and 3rd tokens delimited by one or more :,; characters from each row in infile.txt and write them to outfile.txt.
% extract -sr 4 -er 40 -sc 2 -ec 3 -mt -dl ’:,;’ -s -all -rm <infile.txt >outfile.txt Process infile.txt as follows:
1. Emit verbatim rows 1 through 3.
2. For rows 4 through 40 emit the 1st, and 4th through Nth tokens delimited by a single :,; character.
3. Emit verbatim rows 41 to the final row in the file.
% ( cd / ; du -k ) | extract -fmt ’[jr:fw14:1] [2]’ -mt Lists the size of all directories on a Unix system with the size field right justified so that the columns all line up.
% ls -al | extract -fmt ’[mc:1,32][fw14:jr:5] [6] [fw2:7] [jr:fw5:8] [9]’ -mt -dl ’ ’ Straighten the columns in a directory listing on a Unix system with large files.
% extract -b -fmt ’[,-2]’ <infile.txt Converts a Windows CRLF text file to a Unix LF text file. Will always work on a Unix system. Will usually work on a Windows system but may fail if the build does not support the -b switch.
% extract -fmt ’foo[cu:jl:fw20:3,5]blah[-:mc:10,30]er[1]’ -mt -fw 30 <infile.txt Process each line of infile.txt as follows:
1. Emit "foo".
2. Emit tokens 3,4, and 5 upper cased in a 20 character field, left justified.
3. Emit "blah".
4. Emit characters 10 through 30.
5. Emit "er".
6. Emit column 1 in a field of width 30.
% extract <infile.txt >outfile.txt -if ’^>’ -fmt ’>SPECIAL [1,]’ Lines beginning with > are emitted with the modification shown. All other lines are echoed unchanged.
% extract -mt -dv ’\t’ -fmt ’[1,5]\n[[WOW!]][6]’ <infile.txt Emit the first five tokens separated by tabs and then on the next line emit [WOW!] followed immediately by the sixth token.
% extract -eol ’,’ -if Teacher -fmt ’\n[1,]’ -fileeol ’\n’ <infile.txt If the infile consists of "Teacher name" lines each followed by many lines of student names, the output will consist of one blank line (assuming the first input line has "Teacher" in it) followed by lines like: "Teacher name, student1,student2,...studentn,".
% extract -mt -if Teacher -ifterm Teacher -iftermbol \n -ifeol -fmt ’[2],’ <infile.txt If the infile consists of many instances of a "Teacher: Name" line followed by N "Student: Name" lines the output will consist of several lines like: "Teacher name, student1,student2,...studentn,".
% extract -indl ’,’ -in file1,file2,- Merge the contents of file1, file2, and stdin, placing a comma between the parts of the line from each file.
% extract -mdl ’,’ -merge 4 -in file1,file2,- As above but also merge consecutive rows which begin with the same 4 character prefix. If three such rows were "foo 1", "foo 2", and "foo 3" the single output row would be "foo 1, 2, 3".
% extract -rcds ’\r\12’ -in file1 Remove carriage returns and linefeeds from the file and emit to stdout.
% extract -rcds ’Tt’ -rcss ’Uu’ -in file1 Substitute characters T->U and t->u and emit to stdout.
% extract -rtds ’Thomas’ -rtss ’Tom’ -in file1 Substitute string Tom for Thomas and emit to stdout.
% extract -merge 5 -mdl ’,’ -in file1 If file1 contained the lines "abcd_ 1", "abcde 2", "abcde 3", "abcdf 4" the output would be "abcd_ 1", "abcde 2,3", "abcdf 4"
% extract -unmerge 2 -in file1 If file1 contained the line: "blah a b c d e" the output would be: "blah ab", "blah cd", "blah e".
% extract -in template,file -indl ’ MISS’ -template 3 -out fout If template contains "120","121","122" and file contains "120 fred","122 mary" write "120 fred","121 MISS","122 mary" to fout.
% find . | extract -fmt ’extract -in [1,] -out foo.tmp -rtds /usr/bin/perl -rtss /usr/bin/perl5 ; mv foo.tmp [1,]’ | execinput Use extract recursively as a stream editor. For each input file found by find the first extract prepares a command line where a second instance of extract converts each instance of /usr/bin/perl to /usr/bin/perl5. The final execinput executes these command lines one at a time. (Note that the output first goes to a temporary file and is then copied back over the original input file.)
% extract -nr 1 -sc 3 -all -in unicode.txt -hnd Delete embedded null characters from 16 bit unicode text. If the -hnd was omitted there would be a fatal error when the first null character was encountered during the reading of this file. Also deletes the first two characters of the first line only, which comprise the unicode Byte Order Mark.
% extract -b2 -ecc ’,’ -in data.txt -out comma_delimited.txt Place a comma between every character in data.txt. The result may be read into a spreadsheet with one character per cell.
% echo ’z Y x 1 3e-1 -123.45’ | extract -mt -fmt ’[sortac:1,3] [sortdn:4,]’ Emits "x Y z 1 3e-1 -123.45"
% echo ’z Y x 1 3e-1 -123.45’ | extract -mt -fmt ’[sortac:1] [sortdn:6]’ Emits "z -123.45" because neither field contains more than one token, so no sort will occur.
GNU General Public License 2
Copyright (C) 2011 David Mathog and Caltech.
This program was inspired by Pat Rankin’s EXTRACT utility for VMS.
David Mathog, Biology Division, Caltech <mathog@caltech.edu>
drm_tools | extract (1) | 1.1.11 Jul 14 2014 |