=pod

=head1 NAME

B<rwscan> - Detect scanning activity in a SiLK dataset

=head1 SYNOPSIS

  rwscan [--scan-model=MODEL] [--trw-sip-set=SETFILE]
        [--output-file=OUTFILE] [--no-titles] [--no-columns]
        [--column-separator=CHAR] [{--delimited | --delimited=CHAR}]
        [--model-fields] [--scandb]
        [--threads=THREADS] [--queue-depth=DEPTH]
        [--verbose-progress=CIDR] [--verbose-flows] [FILES...]

=head1 DESCRIPTION

B<rwscan> performs scan detection analysis on SiLK flow records.
Input data can be read from a pipe or from the files listed on the
command line.  Input data should be pre-sorted with B<rwsort> by sip,
proto, and dip.

=head1 OPTIONS

Option names may be abbreviated if the abbreviation is unique or is an
exact match for an option.  A parameter to an option may be specified
as B<--arg>=I<param> or S<B<--arg> I<param>>, though the first form is
required for options that take optional parameters.

=over 4

=item B<--scan-model>=I<MODEL>

Select a specific scan detection model.  If not specified, the default
value for I<MODEL> is C<0>.  See the L</METHOD OF OPERATION> section
for more details.

=over 4

=item C<0>

Use the Threshold Random Walk (TRW) and Bayesian Logistic Regression
(BLR) scan detection models in series.

=item C<1>

Use only the TRW scan detection model.

=item C<2>

Use only the BLR scan detection model.

=back

=item B<--trw-sip-set>=I<SETFILE>

Specify an IPset file containing B<all> valid internal IP addresses.
This switch is required when using the TRW scan detection model,
because the TRW model needs the list of targeted IPs (that is, the
addresses against which scanning activity is to be detected).  This
switch is ignored when the TRW model is not used.

=item B<--output-file>=I<OUTFILE>

Specify the output file that scan records will be written to.  If not
specified, the scan records are written to C<stdout>.

=item B<--no-titles>

Turn off column titles.  By default, titles are printed.

=item B<--no-columns>

Disable fixed-width columnar output.

=item B<--column-separator>=I<CHAR>

Use the specified character between columns.  When this switch is not
specified, the default of 'B<|>' is used.

=item B<--delimited>

=item B<--delimited>=I<CHAR>

Run as if B<--no-columns> B<--column-sep>=I<CHAR> had been specified.
That is, disable fixed-width column output; if character I<CHAR> is
provided, it is used as the delimiter between columns instead of the
default 'B<|>'.

=item B<--model-fields>

Show scan model detail fields.  This switch controls whether
additional informational fields about the scan detection models are
printed.

=item B<--scandb>

Produce output suitable for loading into a database.  Sample database
schemas are given below under L</EXAMPLES>.  This option is equivalent
to C<--no-titles --no-columns --model-fields>.

=item B<--threads>=I<THREADS>

Specify the number of worker threads to create for scan detection
processing.  By default, one thread is used; changing this number to
match the number of available CPUs will often yield a large
performance improvement.

=item B<--queue-depth>=I<DEPTH>

Specify the depth of the work queue.  By default, the work queue is
the same size as the number of worker threads.  Normally, the default
is fine.

=item B<--verbose-progress>=I<CIDR>

Report progress as B<rwscan> processes input data.  The I<CIDR>
argument should be an integer giving the prefix length of the netblock
covered by each line of progress.  For example,
B<--verbose-progress>=I<8> would print a progress message for each /8
network processed.

=item B<--verbose-flows>

Print very verbose information for each flow.  This switch is
primarily useful for debugging.

=back

=head1 METHOD OF OPERATION

B<rwscan>'s default behavior is to consult two scan detection models
to determine whether a source is a scanner.  The primary model used is
the Threshold Random Walk (TRW) model.  The TRW algorithm takes
advantage of the tendency of scanners to attempt to contact a large
number of IPs that do not exist on the target network.
By keeping track of the number of "hits" (successful connections) and
"misses" (attempts to connect to IP addresses that are not active on
the target network), TRW can detect scanners quickly and with a high
degree of accuracy.  Sequential hypothesis testing is used to analyze
the probability that a source is a scanner as each flow record is
processed.  Once the scan probability exceeds a configured maximum,
the source is flagged as a scanner, and no further analysis of traffic
from that host is necessary.

The TRW model is not 100% accurate, however, and it only finds scans
in TCP flow data.  When the TRW model is inconclusive, a secondary
model called BLR is invoked.  BLR stands for "Bayesian Logistic
Regression."  Unlike TRW, the BLR approach must analyze all traffic
from a given source IP to determine whether that IP is a scanner.
Because of this, BLR is much slower than TRW.  However, the BLR model
has been shown to detect scans that are not detected by the TRW model,
particularly scans in UDP and ICMP data, and vertical TCP scans, which
focus on finding services on a single host.  It does this by
calculating metrics from the flow data for each source and using
those metrics to arrive at an overall likelihood that the flow data
represents scanning activity.
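The sequential hypothesis test that the TRW model relies on can be
illustrated with a short sketch.  The Python below is not B<rwscan>'s
implementation: the function name C<classify>, the hit probabilities,
and the error rates are all assumptions chosen for illustration, and
the sketch tracks a log-likelihood ratio rather than the scan
probability that B<rwscan> works with.

```python
import math

# Illustrative parameters only; these are NOT rwscan's configured values.
THETA_BENIGN = 0.8    # assumed P(hit | source is benign)
THETA_SCANNER = 0.2   # assumed P(hit | source is a scanner)
ALPHA = 0.01          # assumed tolerated false-positive rate
BETA = 0.99           # assumed desired detection rate

# Decision thresholds on the log-likelihood ratio.
LOG_UPPER = math.log(BETA / ALPHA)              # at or above: scanner
LOG_LOWER = math.log((1 - BETA) / (1 - ALPHA))  # at or below: benign


def classify(outcomes):
    """Walk over hit/miss outcomes (True = hit) for a single source.

    Returns a (verdict, outcomes_consumed) pair, where verdict is
    "scanner", "benign", or "unknown" (no threshold was crossed).
    """
    log_ratio = 0.0
    seen = 0
    for hit in outcomes:
        seen += 1
        if hit:
            # A hit is more likely from a benign source: ratio moves down.
            log_ratio += math.log(THETA_SCANNER / THETA_BENIGN)
        else:
            # A miss is more likely from a scanner: ratio moves up.
            log_ratio += math.log((1 - THETA_SCANNER) / (1 - THETA_BENIGN))
        if log_ratio >= LOG_UPPER:
            return "scanner", seen   # stop early; no more data needed
        if log_ratio <= LOG_LOWER:
            return "benign", seen
    return "unknown", seen
```

With these assumed parameters, four consecutive misses are enough to
flag a source as a scanner, four consecutive hits are enough to mark it
benign, and a source that alternates hits and misses stays unknown.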
The metrics BLR uses for detecting scans in TCP flow data are:

=over 4

=item *

the ratio of flows with no ACK bit set to all flows

=item *

the ratio of flows with fewer than three packets to all flows

=item *

the average number of source ports per destination IP address

=item *

the ratio of the number of flows that have an average of 60
bytes/packet or greater to all flows

=item *

the ratio of the number of unique destination IP addresses to the
total number of flows

=item *

the ratio of the number of flows where the flag combination indicates
backscatter to all flows

=back

The metrics BLR uses for detecting scans in UDP flow data are:

=over 4

=item *

the ratio of flows with fewer than three packets to all flows

=item *

the maximum run length of IP addresses per /24 subnet

=item *

the maximum number of unique low-numbered (less than 1024) destination
ports contacted on any one host

=item *

the maximum number of consecutive low-numbered destination ports
contacted on any one host

=item *

the average number of unique source ports per destination IP address

=item *

the ratio of flows with 60 or more bytes/packet to all flows

=item *

the ratio of unique source ports (both low and high) to the number of
flows

=back

The metrics BLR uses for detecting scans in ICMP flow data are:

=over 4

=item *

the maximum number of consecutive /24 subnets that were contacted

=item *

the maximum run length of IP addresses per /24 subnet

=item *

the maximum number of IP addresses contacted in any one /24 subnet

=item *

the total number of IP addresses contacted

=item *

the ratio of ICMP echo requests to all ICMP flows

=back

Because the TRW model has a lower false positive rate than the BLR
model, any source identified as a scanner by TRW will be identified as
a scanner by the hybrid model without consulting BLR.  BLR is only
invoked in the following cases:

=over 4

=item *

The traffic being analyzed is UDP or ICMP traffic, which B<rwscan>'s
implementation of TRW cannot process.

=item *

The TRW model has identified the source as benign.  This occurs when
the scan probability drops below a configured minimum during
sequential hypothesis testing.

=item *

The TRW model has identified the source as unknown.  This occurs when
the scan probability neither drops below the minimum threshold nor
exceeds the maximum threshold during sequential hypothesis testing.

=back

In situations where the use of one model is preferred, the other model
can be disabled using the B<--scan-model> switch.  Doing so may affect
the performance, the accuracy, or both.

=head1 EXAMPLES

Basic usage requires only input and output file arguments:

  $ rwscan -o scans.dat data.rw

Typically, though, data will be piped into B<rwscan> from B<rwfilter>
and B<rwsort>, e.g.:

  $ rwfilter --start=2004/12/29:00 --type=in,inweb --all-dest=stdout \
        | rwsort --fields=sip,proto,dip \
        | rwscan --trw-sip-set=sip.set --scan-model=0 \
              --output-file=scans.dat

=head2 Sample Schema for Oracle

  CREATE TABLE scans (
    id          integer unsigned    not null unique,
    sip         integer unsigned    not null,
    proto       tinyint unsigned    not null,
    stime       datetime            not null,
    etime       datetime            not null,
    flows       integer unsigned    not null,
    packets     integer unsigned    not null,
    bytes       integer unsigned    not null,
    scan_model  integer unsigned    not null,
    scan_prob   float unsigned      not null,
    primary key (id)
  );

=head2 Sample Schema for Postgres

  CREATE SEQUENCE scans_id_seq;

  CREATE TABLE scans (
    id          BIGINT      NOT NULL    DEFAULT nextval('scans_id_seq'),
    sip         BIGINT      NOT NULL,
    proto       SMALLINT    NOT NULL,
    stime       TIMESTAMP without time zone NOT NULL,
    etime       TIMESTAMP without time zone NOT NULL,
    flows       BIGINT      NOT NULL,
    packets     BIGINT      NOT NULL,
    bytes       BIGINT      NOT NULL,
    scan_model  INTEGER     NOT NULL,
    scan_prob   FLOAT       NOT NULL,
    PRIMARY KEY (id)
  );

=head2 Sample Schema for MySQL

  CREATE TABLE scans (
    id          integer unsigned    not null auto_increment,
    sip         integer unsigned    not null,
    proto       tinyint unsigned    not null,
    stime       datetime            not null,
    etime       datetime            not null,
    flows       integer unsigned    not null,
    packets     integer unsigned    not null,
    bytes       integer unsigned    not null,
    scan_model  integer unsigned    not null,
    scan_prob   float unsigned      not null,
    primary key (id),
    INDEX (stime),
    INDEX (etime)
  ) TYPE=InnoDB;

=head1 SEE ALSO

B, B, B, B

=cut

$SiLK: rwscan.pod 7187 2007-05-16 17:35:54Z mthomas $

Local Variables:
mode:text
indent-tabs-mode:nil
End: