SYNOPSIS

loccount [-cegHijklnsuvVW] [-d nnn] [-t regexp] [-x regexp] file-or-dir…

DESCRIPTION

This program counts physical source lines of code (SLOC) and logical lines of code (LLOC) in one or more files or directories given on the command line.

A line of code is counted in SLOC if it includes non-whitespace characters outside the scope of a comment. LLOC is counted by tallying SLOCs with statement-terminating punctuation.
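
The two definitions above can be illustrated with a minimal Python sketch for a C-like language. This is not loccount's own implementation. The function name count_sloc_lloc is hypothetical, and the sketch assumes only /* ... */ block comments, // winged comments, and double-quoted strings that do not span lines.

```python
def count_sloc_lloc(text):
    """Return (sloc, lloc) for C-like source text.

    Illustrative assumptions, not loccount's code: only /* ... */ and //
    comments are recognized, and strings are double-quoted and end on
    the line where they begin.
    """
    sloc = lloc = 0
    in_block = False              # inside a /* ... */ comment
    for line in text.splitlines():
        has_code = False          # non-whitespace seen outside a comment
        has_term = False          # ';' seen outside comments and strings
        in_string = False
        i = 0
        while i < len(line):
            ch = line[i]
            two = line[i:i + 2]
            if in_block:
                if two == "*/":
                    in_block = False
                    i += 2
                    continue
            elif in_string:
                if ch == "\\":
                    i += 2        # skip the escaped character
                    continue
                if ch == '"':
                    in_string = False
            elif two == "/*":
                in_block = True
                i += 2
                continue
            elif two == "//":
                break             # rest of the line is a comment
            elif ch == '"':
                in_string = True
                has_code = True
            elif not ch.isspace():
                has_code = True
                if ch == ";":
                    has_term = True
            i += 1
        sloc += has_code
        lloc += has_code and has_term
    return sloc, lloc
```

Note that a line such as a C for-loop header contains several semicolons but is tallied once, since the sketch counts lines carrying a terminator rather than terminators themselves.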

Note: this definition of SLOC is identical to what sloccount(1) calls SLOC and what scc(1) calls "CLOC".

LLOC reporting is not available in all supported languages, as the concept may not fit the language’s syntax (e.g. the Lisp family) or its line-termination rules would require full parsing. In these cases LLOC will always be reported as 0. On the other hand, LLOC reporting is reliably consistent in languages with C-like statement termination by semicolon.

These definitions are simplistic and arguably lead to undercounting if LLOC is being used as a complexity measure; the author considers it a particular problem that most C macro definitions won’t be counted. However, they have the advantage that they improve comparability of results across broad swathes of different languages.

Certain kinds of syntactic errors in source code - notably unbalanced comment and string literal delimiters - make this program likely to produce wrong counts and spurious errors.

It is advisable to run "make clean" or equivalent in your source directory before running this program, though it knows how to detect some common kinds of generated files (such as yacc and lex output and manual pages or HTML generated by asciidoc) and will ignore them. You can explicitly tag a file as generated by putting the string "GENERATED" somewhere in the first few lines.
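
The explicit tagging rule can be sketched as below. The function name tagged_generated is hypothetical, and the default of 5 lines is an assumed stand-in for "the first few lines"; the actual window loccount inspects is not stated here.

```python
from itertools import islice

def tagged_generated(lines, max_lines=5):
    """True if the string GENERATED occurs in the first max_lines lines.

    max_lines=5 is an assumption; the manual says only "the first few
    lines". `lines` may be any iterable of strings, such as an open file.
    """
    return any("GENERATED" in line for line in islice(lines, max_lines))
```

For a file on disk, something like `with open(path, errors="replace") as f: tagged_generated(f)` reads no more than the inspected lines.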

Optionally, this program can perform a cost-to-replicate estimation using the COCOMO I and (if LLOC count is nonzero) COCOMO II models. It uses the "organic" profile of COCOMO, which is generally appropriate for open-source projects.
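
As a rough illustration of the COCOMO I estimate, the sketch below uses the textbook Basic COCOMO coefficients for the "organic" profile (effort = 2.4 × KSLOC^1.05 person-months, schedule = 2.5 × effort^0.38 months). That loccount uses exactly these constants is an assumption, the function name cocomo_organic is hypothetical, and the conversion to a cost figure is omitted because the base salary is not given here. COCOMO II is not covered.

```python
def cocomo_organic(sloc, eaf=1.0):
    """Return (effort in person-months, schedule in months).

    Uses the textbook Basic COCOMO "organic" coefficients; whether
    loccount uses these exact values is an assumption. eaf is the
    effort adjustment factor, left at 1.0.
    """
    ksloc = sloc / 1000.0
    effort = 2.4 * eaf * ksloc ** 1.05      # person-months
    schedule = 2.5 * effort ** 0.38         # calendar months
    return effort, schedule
```

Because the effort exponent is greater than 1, ten times the code is estimated to take somewhat more than ten times the effort.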

SLOC/LLOC figures should be used with caution. While they do predict project costs and defect incidence reasonably well, they are not appropriate for use as 'productivity' measures; good code is often less bulky than bad code. Comparing SLOC across languages from different families (for example, Algol-descended vs. Lisp-descended) is also dubious, as these can have greatly differing complexity per line.

With these qualifications, SLOC/LLOC does have some other uses. It is quite effective for tracking changes in complexity and attack surface as a codebase evolves over time.

All languages in common use on Unix-like operating systems are supported. For a full list of supported languages, run "loccount -s"; "loccount -l" lists languages for which LLOC computation is available.

The program also emits counts for build recipes - Makefiles, autoconf specifications, scons recipes, and waf scripts. Generated Makefiles are recognized and ignored. An installed copy of waf and any waf build directory is ignored, but a wscript file is not.

Counts for many other DSLs are also reported, including the configuration languages JSON, YAML, TOML, and INI.

The program emits counts for well-known documentation markups as well, including man-page, asciidoc, Markdown, TeX, reStructuredText, and others. There is no equivalent of LLOC for these. The -n option disables this feature.

PostScript is a special case. It is usually generated from some other markup and thus not source code, but not always. This program looks for "!PS-Adobe" early in the file as an indication that it was generated, and ignores such files.

Languages are recognized by file extension or filename pattern; executable filenames without an extension are mined for #! lines identifying an interpreter. Some languages in groups that share file extensions are recognized by matching content patterns inside the file. Files with a shared extension that don’t match content keywords may be assigned to a language if it’s the only language with content matches in the repository.
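
The #! mining step might look roughly like the following sketch. The function name interpreter_from_hashbang is hypothetical, and the handling of /usr/bin/env is an assumption about what a reasonable implementation would do, not a description of loccount's code.

```python
import os

def interpreter_from_hashbang(first_line):
    """Return the interpreter's base name from a #! line, or None.

    Assumption: "#!/usr/bin/env NAME" names the interpreter in its
    second word; env options such as -S are not handled.
    """
    if not first_line.startswith("#!"):
        return None
    words = first_line[2:].split()
    if not words:
        return None
    if os.path.basename(words[0]) == "env" and len(words) > 1:
        return os.path.basename(words[1])
    return os.path.basename(words[0])
```

The returned name would then be looked up in a table mapping interpreters to languages, in the same way an extension is.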

Files that cannot be classified by file extension (and possibly content) are skipped, but a list of files skipped in this way is available with the -u option.

When an argument is inside a recognized version-control working copy, loccount normally asks the VCS for its versioned file set and counts only those files beneath that argument. At present, Git is queried with git ls-files -z, Mercurial with hg files -0, Perforce with the union of p4 have and p4 opened, and Subversion with svn status -v --xml. This skips unversioned files by default. The -k option disables this behavior and reverts to a regular recursive directory walk.
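
For the Git case, the query can be approximated with the sketch below. The helper names split_nul and git_versioned_files are hypothetical; only the git ls-files -z command comes from the text above. The -z flag makes Git terminate each path with a NUL byte, so paths containing spaces or newlines survive intact.

```python
import subprocess

def split_nul(output):
    """Split NUL-terminated path records (bytes) into a list of str."""
    return [p.decode("utf-8", "surrogateescape")
            for p in output.split(b"\0") if p]

def git_versioned_files(directory):
    """Ask Git for the versioned files beneath directory.

    Paths are returned relative to directory. Raises CalledProcessError
    if directory is not inside a Git working copy.
    """
    result = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=directory, check=True, stdout=subprocess.PIPE,
    )
    return split_nul(result.stdout)
```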

Some file types are identified and silently skipped without being reported by -u; these include symlinks, .o, .a, and .so object files, various kinds of image and audio files, and the .pyc/.pyo files produced by the Python interpreter. All files and directories named with a leading dot are also silently skipped (in particular, this ignores metadata associated with version-control systems). Finally, files that can’t be otherwise classified and contain a NUL in their first 1024 bytes of data are ignored.
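
The final NUL test amounts to a simple binary-file heuristic, sketched here. The function name looks_binary is hypothetical; the 1024-byte window is the one stated above.

```python
def looks_binary(data, window=1024):
    """True if a NUL byte occurs in the first `window` bytes of data."""
    return b"\0" in data[:window]
```

In practice the caller would read at most 1024 bytes from the file, for example with `open(path, "rb").read(1024)`, and pass them in.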

When using the regular recursive walker rather than a VCS-provided versioned-file list, loccount reads .gitignore and .ignore files and interprets them in the least surprising way, skipping the files they match.

LIMITATIONS

There are some sources of error and confusion that no amount of clever code in this program can abolish.

One has to do with comment nesting in Pascal. ISO 7185:1990, the standard for the language, specifies that comments do not nest; however, important historical and current Pascal compilers support comment nesting. This program assumes that if a block comment start is within the scope of a block comment, the programmer is working with such a compiler and did that deliberately.

Python detection is slightly flaky. Anything with a .py extension will be classified simply as "Python", not distinguishing between Python 2 and Python 3. Python files without an extension will be correctly detected only when they have a hashbang line containing "python", "python2" or "python3".

There is a conflict among Objective-C, MATLAB, MUMPS, and nroff/troff over the extensions .m and .mm; this may lead to misidentification of files with these extensions. To avoid problems, ensure that every MATLAB file contains at least one %-led winged comment or %{-led block comment.

What is reported as "ML" includes its dialects Caml and Ocaml, which are not readily distinguishable, but unlikely to be mixed in the same source tree. Standard ML and Concurrent ML have distinguishing file extensions and can therefore be reported separately (as "SML" and "CML" respectively).

The syntax of Algol 60 was not carefully specified. Variants exist in which keywords are distinguished from variable and function names either by being uppercase or by being quoted like strings. This program assumes an Algol dialect with all-caps unquoted keywords. The sticking point here is that COMMENT (uppercase, no quotes) is used to recognize comments.

This program assumes that Lisp and Scheme interpret backslash as C does, that is as an escape for a following string delimiter. While this is true in Common Lisp, Scheme, Emacs Lisp, and Guile, it may not be true in other, older Lisp dialects.

Manual pages sometimes have idiosyncratic extensions (that is, other than ".man" or a single section digit) which this program will not recognize. Older manual pages sometimes abuse nroff to achieve commenting in ways this program does not recognize, resulting in some overcounting of source lines.

ECMAScript6/es6 files with a .js extension will be reported as Javascript.

Some languages derived from Visual Basic that are declared with their own file extensions (such as FreeBasic, Yabasic, nuBasic) will accept a .bas extension but have VBA- or C-style native comment syntax; in these cases SLOC counts may be misreported high.

X11-Basic is not distinguishable by file extension or required keywords from a generic Basic, but uses ! as a winged-comment leader as well as ignoring REM lines. It will be reported as Basic and ! lines will be counted as code.

OPTIONS

-?

Display usage summary and quit.

-c

Report COCOMO cost estimates. Use the coefficients for the "organic" project type, which fits most open-source projects. An EAF of 1.0 is assumed.

-d n

Set debug level. At > 0, displays various progress messages. Mainly of interest to loccount developers.

-e

Show the association between languages and file extensions.

-g

List files normally excluded by the autogeneration filter; do not emit line counts.

-H

With -t, emit HTML instead of plain text. This option requires -t.

-i

Report file path, SLOC, LLOC, and type for each individual path. Because type names may contain spaces, fields are separated by ":". For files that can’t be classified, only the path is reported.

-j

Dump SLOC and LLOC counts as self-describing JSON records for postprocessing.

-k

Keep the regular recursive directory traversal even inside a recognized version-control working copy. Without this option, loccount uses the repository’s own file list and counts only versioned files.

-l

List languages for which we can report LLOC and exit. Combine with -i to list languages one per line.

-n

Do not tally documentation SLOC.

-s

List languages for which we can report SLOC and exit. Includes all languages that can report LLOC. Combine with -i to list languages one per line.

-t regexp

Describe all languages whose primary name or alias strings match the specified Go regular expression. Use "^…​$" if you want an exact match rather than a substring match. This option exits after emitting the matching language reports. By default the report is plain text; combine with -H for HTML.

-u

List paths of files that could not be classified into a known source type or as autogenerated.

-v

Dump lexer state transitions; debugging option.

-W

Emit a standalone HTML page describing all known languages and exit. Entries are sorted by name rather than by YAML order. Pipe this output through the companion webify script to add a top control panel and a names-only initial view for interactive browsing, with click-to-expand entries and a global full-detail toggle.

-x regexp

Ignore paths matching the specified Go regular expression.

-V

Show program version and exit.

Arguments following options may be either directories or files. Directories are recursed into when -k is given or when they are not inside a recognized version-control working copy. Otherwise, loccount iterates over the versioned files beneath each argument using the repository’s own listing command. The report is generated on all paths specified on the command line.

EXIT VALUES

Normally 0. The exit value is 1 in -s or -e mode if a non-duplication check on file extensions or hashbangs fails.

HISTORY AND COMPATIBILITY

The algorithms in this code originated with David A. Wheeler’s sloccount utility, version 2.26 of 2004. It is, however, faster than sloccount, and handles many languages that sloccount does not.

Generally it will produce identical SLOC figures to sloccount for a language supported by both tools; the differences in whole-tree reports will mainly be due to better detection of some files sloccount left unclassified. Notably, for individual C and Perl files you can expect both tools to produce identical SLOC. However, Python counts are different, because sloccount does not recognize and ignore single-quote multi-line literals.

A few of sloccount’s tests have been simplified in cases where the complexity came from a rare or edge case that the author judges to have become extinct since 2004.

The reporting formats of loccount 2.x are substantially different from those in the 1.x versions due to absence of any LLOC fields in 1.x.

The base salary used for cost estimation will differ between these tools depending on time of last release.

BUGS

Eiffel indexing comments are counted as code, not text. (This is arguably a feature.)

LLOC counts in languages that use a semicolon as an Algol68-like statement separator, rather than a terminator, will be a bit low. This group includes Algol68, Pascal, Modula3, and Oberon. In practice, Pascal allows empty statements and programmers thus can and do write as though semicolon is a statement terminator, removing this source of error.

Dylan LLOC will be a bit high due to its use of semicolon as a terminator for classes and methods as well as statements.

If a Factor program defines words containing embedded ! or ", loccount will be confused.

Fantom documentation comments (led with **) are counted as code.

Comment detection in Forth can be confused by tabs or unusual whitespace following a \ or (, or by strings containing unbalanced parens.

User-facing comment lines in Pkl are counted as code.

The older fixed filenames "BUILD" and "WORKSPACE" are not recognized as Bazel files; the newer equivalents with the ".bazel" extension are.

REPORTING BUGS

Report bugs to Eric S. Raymond <esr@thyrsus.com>.

SEE ALSO

sloccount(1), scc(1)