
The module documentation is available here.
The latest version is available at LuaForge.
It is often said of Lua that it does not include batteries. That is because the goal of Lua is to produce a lean expressive language that will be used on all sorts of machines, (some of which don't even have hierarchical filesystems). The Lua language is the equivalent of an operating system kernel; the creators of Lua do not see it as their responsibility to create a full software ecosystem around the language. That is the role of the community.
A principle of software design is to recognize common patterns and reuse them. If you find yourself writing things like io.write(string.format('the answer is %d ',42)) more than a few times, then it becomes useful just to define a function printf. This is good, not just because repeated code is harder to maintain, but because such code is easier to read, once people understand your libraries.
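As a minimal sketch of that idea in plain Lua (an illustration only, not Penlight's source; the library's own version is utils.printf), such a function might look like this:

```lua
-- printf: write formatted output to standard output.
-- Illustrative only; Penlight ships its own version in pl.utils.
local function printf(fmt, ...)
    io.write(string.format(fmt, ...))
end

printf('the answer is %d\n', 42)
```

Once the helper exists, every call site says what it means instead of repeating the io.write/string.format pairing.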
Penlight captures many such code patterns, so that the intent of your code becomes clearer. For instance, a Lua idiom to copy a table is {unpack(t)}, but this will only work for 'small' tables (for a given value of 'small') so it is not very robust. Also, the intent is not clear. So tablex.deepcopy is provided, which will also copy nested tables and associated metatables, so it can be used to clone complex objects.
The default error handling policy is to return nil,message if there is a problem. There are some exceptions; functions like input.fields default to shutting down the program immediately with a useful message. This is more appropriate behaviour for a script than providing a stack trace. (However, this default can be changed.) The lexer functions always throw errors, to simplify coding, and so should be wrapped in pcall. Consistent error checking is still a bit lacking, but random stack trace crashes should be considered a bug in the library.
If you are used to Python conventions, please note that all indices consistently start at 1.
The Lua function table.foreach has been deprecated in favour of the for in statement, but such an operation becomes particularly useful with the higher-order function support in Penlight. Note that tablex.foreach reverses the order, so that the function is passed the value and then the key. Although perverse, this matches the intended use better.
The only important external dependency of Penlight is LuaFileSystem (lfs), and if you want dir.copyfile to work properly on Windows, you will need alien as well.
Some of the examples in this guide were created using ilua, which doesn't require '=' to print out expressions, and will attempt to print out table results as nicely as possible. This is also available under Lua for Windows, as a library, so the command lua -lilua -s will work (the s option switches off 'strict' variable checking, which is annoying and conflicts with the use of _DEBUG in some of these libraries).
It was realized a long time ago that large programs needed a way to keep names distinct by putting them into tables (Lua), namespaces (C++) or modules (Python). It is obviously impossible to run a company where everyone is called 'Bruce', except in Monty Python skits. These 'namespace clashes' are more of a problem in a simple language like Lua than in C++, because C++ does more complicated lookup over 'injected namespaces'. However, in a small group of friends, 'Bruce' is usually unique, so in particular situations it's useful to drop the formality and not use last names. It depends entirely on what kind of program you are writing, whether it is a ten line script or a ten thousand line program.
So the Penlight library provides the formal way and the informal way, without imposing any preference. You can do it formally like:
require 'pl.utils'
pl.utils.printf("%s\n","hello, world!")
or informally like:
require 'pl'
utils.printf("%s\n","That feels better")
require 'pl' also brings in all the separate Penlight modules, without needing to require them each individually.
This is also commonly done like so, especially when writing modules:
local utils = require 'pl.utils'
utils.printf("The answer is %d\n",42)
Penlight will not bring functions into the global table, or clobber standard tables like 'io'. require('pl') will bring tables like 'utils', 'tablex', etc. into the global table.
The exception is that require('pl') will put the pl.string methods into the standard string table. The reason is that saying s:strip() is very convenient. A more explicit way of doing this is:
require('pl.string').import()
A more delicate operation is importing tables into the local environment. This is convenient when the context makes the meaning of a name very clear:
> require 'pl'
> utils.import(math)
> = sin(1.2)
0.93203908596723
utils.import can also be passed a module name, which is first required and then imported. If used in a module, import will bring the symbols into the module context.
The function printf discussed earlier is included in pl.utils because it makes properly formatted output easier. (There is an equivalent fprintf which also takes a file object parameter, just like the C function.)
Another set of functions which are simple but useful are readfile and writefile. Generally it isn't a good idea to pull a large file into a string in one operation, since such files can usually be dealt with more efficiently. But small files litter our hard drives, and they can be efficiently processed as strings. For example, this little script converts a file into upper case:
require 'pl'
utils.writefile('out.txt',utils.readfile('in.txt'):upper())
Since these functions work with standard input and output if not passed explicit filenames, you can do this kind of thing from the command-line:
c:\test> lua -lpl -e "utils.writefile(utils.readfile():upper())" < in.txt > out.txt
One of the elegant things about Lua is that tables do the job of both lists and dicts (as called in Python) or vectors and maps, (as called in C++), and they do it efficiently. However, if we are dealing with 'tables with numerical indices' we may as well call them lists and look for operations which particularly make sense for lists. The Penlight List class was originally written by Nick Trout for Lua 5.0, and translated to 5.1 and extended by myself. It seemed that borrowing from Python was a good idea, and this eventually grew into Penlight. (see pl.list)
Here is an example showing List in action; it redefines __tostring, so that it can print itself out more sensibly:
> l = List()
> l:append(10)
> l:append(20)
> = l
{10,20}
> l:extend {30,40}
> = l
{10,20,30,40}
> l:insert(1,5)
> = l
{5,10,20,30,40}
> = l:pop()
40
> = l
{5,10,20,30}
> = l:index(30)
4
> = l:contains(30)
true
> = l:reverse() ---> note: doesn't make a copy!
> = l
{30,20,10,5}
A particular feature of Python lists is slicing. This is fully supported in this version of List, except we use 1-based indexing. So List.slice works rather like string.sub:
> l = List {10,20,30,40}
> = l:slice(1,1)
{10}
> = l:slice(2,2)
{20}
> = l:slice(2,3)
{20,30}
> = l:slice(2,-2)
{20,30}
> = l(2,-2) --> can use call notation for slices!
{20,30}
> = l:slice_assign(2,2,{21,22,23})
> l
{10,21,22,23,30,40}
> = l:chop(1,1)
> l
{21,22,23,30,40}
Functions like slice_assign and chop modify the list; the first is equivalent to Python's l[i1:i2] = seq and the second to del l[i1:i2].
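The 1-based, inclusive indexing with negative indices counting from the end can be sketched in plain Lua. This is only an illustration of the convention on an ordinary table; the helper name slice here is a stand-in, not the code behind List.slice:

```lua
-- slice(t, first, last): copy of t[first..last], 1-based and inclusive.
-- Negative indices count from the end, as with string.sub.
-- An illustrative sketch, not Penlight's implementation.
local function slice(t, first, last)
    local n = #t
    first = first or 1
    last = last or n
    if first < 0 then first = n + first + 1 end
    if last < 0 then last = n + last + 1 end
    if first < 1 then first = 1 end
    if last > n then last = n end
    local res = {}
    for i = first, last do
        res[#res + 1] = t[i]
    end
    return res
end

local l = {10, 20, 30, 40}
print(table.concat(slice(l, 2, 3), ','))   --> 20,30
print(table.concat(slice(l, 2, -2), ','))  --> 20,30
```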
List objects are ultimately just Lua 'list-like' tables, but they have extra operations defined on them, such as equality and concatenation. For regular tables, equality is only true if the two tables are identical objects, whereas two lists are equal if they have the same contents.
> l1 = List {1,2,3}
> l2 = List {1,2,3}
> = l1 == l2
true
> = l1..l2
{1,2,3,1,2,3}
The List constructor can be passed a function. If so, it's assumed that this is an iterator function that can be repeatedly called to generate a sequence. One such function is io.lines; the following short, intense little script counts the number of lines in standard input:
-- linecount.lua
require 'pl'
ls = List(io.lines())
print(#ls)
pl.list.iter captures what List considers a sequence. In particular, it can also iterate over all 'characters' in a string:
> for ch in pl.list.iter 'help' do io.write(ch,' ') end
h e l p >
There are a number of operations that go beyond the Python implementation. For instance, you can partition a list into a table of sublists using a function. In the simplest form, you use a predicate (a function returning a boolean value) to partition the list into two lists, one of elements matching and another of elements not matching. But you can use any function; if we use type then the keys will be the standard Lua type names.
> ls = List{1,2,3,4}
> ops = require 'pl.operator'
> ls:partition(function(x) return x > 2 end)
{false={1,2},true={3,4}}
> ls = List{'one',math.sin,List{1},10,20,List{1,2}}
> ls:partition(type)
{function={function: 00369110},string={one},number={10,20},table={{1},{1,2}}}
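For readers who want to see the idea behind partitioning spelled out, here is a rough sketch in plain Lua that works on ordinary list-like tables. It is an assumption about how such a function could be written, not the List method itself:

```lua
-- partition(t, fn): group the elements of list t by the value of fn(element).
-- Sketch only; Penlight's own implementation may differ in detail.
local function partition(t, fn)
    local res = {}
    for _, v in ipairs(t) do
        local key = fn(v)
        if key ~= nil then   -- nil cannot be used as a table key
            local group = res[key]
            if not group then
                group = {}
                res[key] = group
            end
            group[#group + 1] = v
        end
    end
    return res
end

local parts = partition({1, 2, 3, 4}, function(x) return x > 2 end)
print(table.concat(parts[false], ','))  --> 1,2
print(table.concat(parts[true], ','))   --> 3,4
```

Because booleans are valid table keys in Lua, a predicate naturally gives the two groups false and true, while a function like type gives one group per type name.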
Some notes on terminology: Lua tables are usually list-like (like an array) or map-like (like an associative array or dict); they can of course have a list-like and a map-like part. Some of the table operations only make sense for list-like tables, and some only for map-like tables.
The functions provided in table provide all the basic manipulations on Lua tables, but as we saw with the List class, it is useful to build higher-level operations on top of those functions. For instance, to copy a table involves this kind of loop:
local res = {}
for k,v in pairs(T) do
res[k] = v
end
The tablex module (see pl.tablex) provides deepcopy, which goes further than a simple loop in two ways: first, it gives the copy the same metatable as the original (so it can copy objects like List above), and second, any nested tables will also be copied, to arbitrary depth.
In a similar spirit, deepcompare will take two tables and return true only if they have exactly the same values and structure.
> t1 = {1,{2,3},4}
> t2 = deepcopy(t1)
> = t1 == t2
false
> = deepcompare(t1,t2)
true
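To make the difference from the simple loop concrete, here is a rough plain-Lua sketch of what a deep copy and a deep comparison involve. These are simplified stand-ins written for this guide, not the tablex source, and neither guards against cyclic tables:

```lua
-- Illustrative sketches; tablex.deepcopy and tablex.deepcompare are the
-- real thing. Neither sketch handles tables that refer to themselves.
local function deepcopy(t)
    if type(t) ~= 'table' then return t end
    local res = {}
    for k, v in pairs(t) do
        res[k] = deepcopy(v)                    -- nested tables are copied too
    end
    return setmetatable(res, getmetatable(t))   -- share the original's metatable
end

local function deepcompare(a, b)
    if type(a) ~= 'table' or type(b) ~= 'table' then
        return a == b
    end
    for k, v in pairs(a) do
        if not deepcompare(v, b[k]) then return false end
    end
    for k in pairs(b) do
        if a[k] == nil then return false end    -- b has a key that a lacks
    end
    return true
end

local t1 = {1, {2, 3}, 4}
local t2 = deepcopy(t1)
print(t1 == t2)              --> false
print(deepcompare(t1, t2))   --> true
```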
find will return the index of a given value in a list-like table. This is a direct linear search, so it can slow down code that depends on it; note that a function can be provided that defines equality for the search. If efficiency is required, consider using an index map. index_map will return a table where the keys are the original values of the list, and the associated values are the indices. (It is almost exactly the representation needed for a set.)
> t = {'one','two','three'}
> = tablex.find(t,'two')
2
> = tablex.find(t,'four')
nil
> il = tablex.index_map(t)
> il['two']
2
> il.two
2
A version of index_map called set is also provided, where the values are just true. This is useful because two such sets can be compared for equality using deepcompare:
> deepcompare(set {1,2,3},set {2,1,3})
true
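Both representations are just a single pass over the list. A minimal sketch, assuming ordinary list-like tables and written here for illustration rather than copied from tablex:

```lua
-- index_map: value -> index; set: value -> true.
-- Illustrative sketches of the two representations described above.
local function index_map(t)
    local res = {}
    for i, v in ipairs(t) do res[v] = i end
    return res
end

local function set(t)
    local res = {}
    for _, v in ipairs(t) do res[v] = true end
    return res
end

local il = index_map {'one', 'two', 'three'}
print(il.two)            --> 2
print(set({1, 2, 3})[2]) --> true
```

After the one-off cost of building the map, each lookup is a plain table access rather than a linear search.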
find_if will search a table using a function. The optional third argument is a value which will be passed as a second argument to the function. pl.operator provides the Lua operators conveniently wrapped as functions, so the basic comparison functions are available:
> ops = require 'pl.operator'
> = tablex.find_if({10,20,30,40},ops.gt,20)
3 true
Note that find_if will also return the actual value returned by the function, which of course is usually just true for a boolean function, but any value which is not nil and not false can be usefully passed back.
deepcompare does a thorough recursive comparison, but otherwise uses the default equality operator. compare allows you to specify exactly what function to use when comparing two list-like tables, and compare_no_order is true if they contain exactly the same elements. Do note that the latter does not need an explicit comparison function - in this case the implementation is actually to compare the two sets, as above:
> compare_no_order({1,2,3},{2,1,3})
true
> compare_no_order({1,2,3},{2,1,3},'==')
true
(Note the special string '==' above; instead of saying ops.gt or ops.eq we can use the strings '>' or '==' respectively.)
There are several ways to merge tables in PL. If they are list-like, then see the operations defined by pl.list.List, like concatenation. If they are map-like, then tablex.merge provides two basic operations. If the third arg is false, then the result only contains the keys that are in common between the two tables, and if true, then the result contains all the keys of both tables. These are in fact generalized set union and intersection operations:
> S1 = {john=27,jane=31,mary=24}
> S2 = {jane=31,jones=50}
> tablex.merge(S1,S2,false)
{jane=31}
> tablex.merge(S1,S2,true)
{mary=24,jane=31,john=27,jones=50}
When working with tables, you will often find yourself writing loops like in the first example. Loops are second nature to programmers, but they are often not the most elegant and self-describing way of expressing an operation. Consider the map function, which creates a new table by applying a function to each element of the original:
> = map(math.sin,{1,2,3,4})
{ 0.84, 0.91, 0.14, -0.76}
> = map(function(x) return x*x end,{1,2,3,4})
{1,4,9,16}
map saves you from writing a loop, and the resulting code is often clearer, as well as being shorter. This is not to say that 'loops are bad' (although you will hear that from some extremists), just that it's good to capture standard patterns. Then the loops you do write will stand out and acquire more significance.
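The loop that map hides is short. As a sketch (a simplified stand-in, not the tablex implementation, and without support for operator strings like '*'):

```lua
-- map(fn, t): new table holding fn(v) for every value v in t.
-- Simplified illustration; keys are preserved.
local function map(fn, t)
    local res = {}
    for k, v in pairs(t) do
        res[k] = fn(v)
    end
    return res
end

local squares = map(function(x) return x * x end, {1, 2, 3, 4})
print(table.concat(squares, ','))   --> 1,4,9,16
```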
pairmap is interesting, because the function works with both the key and the value.
> t = {fred=10,bonzo=20,alice=4}
> = pairmap(function(k,v) return v end, t)
{4,10,20}
> = pairmap(function(k,v) return k end, t)
{'alice','fred','bonzo'}
(These are common enough operations that the first is defined as values and the second as keys.) If the function returns two values, then the second value is considered to be the new key:
> = pairmap(function(k,v) return v+10,k:upper() end, t)
{BONZO=30,FRED=20,ALICE=14}
map2 applies a function to two tables:
> map2(ops.add,{1,2},{10,20})
{11,22}
> map2('*',{1,2},{10,20})
{10,40}
The various map operations generate tables; reduce applies a function of two arguments over a table and returns the result as a scalar:
> reduce ('+',{1,2,3})
6
> reduce ('..',{'one','two','three'})
'onetwothree'
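Written out by hand, reduce is a left-to-right fold. The following sketch takes a real function rather than an operator string and is meant only to show the pattern, not to reproduce tablex.reduce:

```lua
-- reduce(fn, t): combine the elements of list t pairwise, left to right.
-- Illustrative sketch; returns nil for an empty list.
local function reduce(fn, t)
    local acc = t[1]
    for i = 2, #t do
        acc = fn(acc, t[i])
    end
    return acc
end

print(reduce(function(a, b) return a + b end, {1, 2, 3}))                  --> 6
print(reduce(function(a, b) return a .. b end, {'one', 'two', 'three'}))  --> onetwothree
```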
Finally, zip sews different tables together:
> = zip({1,2,3},{10,20,30})
{{1,10},{2,20},{3,30}}
Two-dimensional tables are of course easy to represent in Lua, for instance {{1,2},{3,4}} where we store rows as subtables and index like so: A[row][col]. This is the common representation used by matrix libraries like LuaMatrix. pl.array does not provide matrix operations, since that is the job for a specialized library, but rather provides generalizations of the higher-level operations provided by pl.tablex for one-dimensional arrays.
array.iter is a useful generalization of ipairs. (The extra parameter determines whether you want the indices as well.)
> array = require 'pl.array'
> a = {{1,2},{3,4}}
> for i,j,v in array.iter(a,true) do print(i,j,v) end
1 1 1
1 2 2
2 1 3
2 2 4
Bear in mind that you can always convert an arbitrary 2D array into a 'list of lists' with List(tablex.map(List,a)).
array.map will apply a function over all elements (notice that extra arguments can be provided, so the operation is in effect function(x) return x-1 end).
> array.map('-',a,1)
{{0,1},{2,3}}
2D arrays are stored as an array of rows, but columns can be extracted:
> array.column(a,1)
{1,3}
There are three equivalents to tablex.reduce. You can either reduce along the rows (which is the most efficient) or reduce along the columns. Either one will give you a 1D array. And reduce2 will apply two operations: the first one reduces the rows, and the second reduces the result.
> array.reduce_rows('+',a)
{3,7}
> array.reduce_cols('+',a)
{4,6}
> -- same as tablex.reduce('*',array.reduce_rows('+',a))
> array.reduce2('*','+',a)
21
tablex.map2 applies an operation to two tables, giving another table. array.map2 does this for 2D arrays. Note that you have to provide the rank of the arrays involved, since it's hard to always correctly deduce this from the data:
> b = {{10,20},{30,40}}
> array.map2('+',2,2,a,b) -- two 2D arrays
{{11,22},{33,44}}
> array.map2('+',1,2,{10,100},a) -- 1D, 2D
{{11,102},{13,104}}
> array.map2('*',2,1,a,{1,-1}) -- 2D, 1D
{{1,-2},{3,-4}}
Of course, you are not limited to simple arithmetic. Say we have a 2D array of strings, and wish to print it out with proper right justification. The first step is to create all the string lengths by mapping string.len over the array, the second is to reduce this along the columns using math.max to get maximum column widths, and last, apply string.rjust with these widths.
maxlens = reduce_cols(math.max,map('#',lines))
lines = map2(string.rjust,2,1,lines,maxlens)
There is product which returns the Cartesian product of two 1D arrays. The result is a 2D array formed from applying the function to all possible pairs from the two arrays.
> array.product('{}',{1,2},{'a','b'})
{{{1,'b'},{2,'a'}},{{1,'a'},{2,'b'}}}
These are convenient borrowings from Python, as described in 3.6.1 of the Python reference, but note that indices in Lua always begin at one. There are methods like s:isalpha() and s:isdigit(), which return true if s is only composed of letters or digits respectively. s:startswith() and s:endswith() are convenient ways to find substrings. (endswith works as in Python 2.5, so that f:endswith {'.bat','.exe','.cmd'} will be true for any filename which ends with these extensions.) There are justify methods and whitespace trimming functions like strip.
Most of these can be fairly easily implemented using the Lua string library, which is more general and powerful. But they are convenient operations to have easily at hand. (see pl.string)
String templates are another borrowing from Python, as described in 4.1.6.
local Template = require ('pl.string').Template
t = Template('${here} is the $answer')
print(t:substitute {here = 'Lua', answer = 'best'})
Lua string pattern matching is very powerful, and usually you will not need a traditional regular expression library. Even so, sometimes Lua code ends up looking like Perl, which happens because string patterns are not always the easiest things to read, especially for the casual reader. Here is a program which needs to understand three distinct date formats:
-- parsing dates using Lua string patterns
months={Jan=1,Feb=2,Mar=3,Apr=4,May=5,Jun=6,
Jul=7,Aug=8,Sep=9,Oct=10,Nov=11,Dec=12}
function check_and_process(d,m,y)
d = tonumber(d)
m = tonumber(m)
y = tonumber(y)
....
end
for line in f:lines() do
-- ordinary (English) date format
local d,m,y = line:match('(%d+)/(%d+)/(%d+)')
if d then
check_and_process(d,m,y)
else -- ISO date??
y,m,d = line:match('(%d+)%-(%d+)%-(%d+)')
if y then
check_and_process(d,m,y)
else -- <day> <month-name> <year>?
d,mm,y = line:match('(%d+)%s+(%a+)%s+(%d+)')
m = months[mm]
check_and_process(d,m,y)
end
end
end
These aren't particularly difficult patterns, but already typical issues are appearing, such as having to escape '-'. Also, string.match returns its captures, so that we're forced to use a slightly awkward nested if-statement.
Verification issues will further cloud the picture, since regular expression people try to enforce constraints (like year cannot be more than four digits) using regular expressions, on the usual grounds that one shouldn't stop using a hammer when one is enjoying oneself.
pl.sip provides a simple, intuitive way to detect patterns in strings and extract relevant parts.
> sip = require 'pl.sip'
> write = require('pl.pretty').write
> function pprint(t) print(write(t)) end
> res = {}
> c = sip.compile 'ref=$S{file}:$d{line}'
> = c('ref=hello.c:10',res)
true
> pprint(res)
{
line = 10,
file = "hello.c"
}
> c('ref=long name, no line',res)
false
sip.compile creates a pattern matcher function, which is given a string and a table. If it matches the string, then true is returned and the table is populated according to the named fields in the pattern.
Here is another version of the date parser:
-- using SIP patterns
function check(t)
check_and_process(t.day,t.month,t.year)
end
shortdate = sip.compile('$d{day}/$d{month}/$d{year}')
longdate = sip.compile('$d{day} $v{mon} $d{year}')
isodate = sip.compile('$d{year}-$d{month}-$d{day}')
for line in f:lines() do
local res = {}
if shortdate(line,res) then
check(res)
elseif isodate(line,res) then
check(res)
elseif longdate(line,res) then
res.month = months[res.mon]
check(res)
end
end
SIP patterns start with '$', then a one-letter type, and then an optional variable in curly braces.
Type  Meaning
v     variable, or identifier
i     possibly signed integer
f     floating-point number
r     'rest of line'
q     quoted string (either ' or ")
p     a path name
(     anything inside (...)
[     anything inside [...]
{     anything inside {...}
<     anything inside <...>
----  ---------------------------------
S     non-space
d     digits
...
If a type is not one of v,i,f,r or q, then it's assumed to be one of the standard Lua character classes. Any spaces you leave in your pattern will match any number of spaces. And any 'magic' string characters will be escaped.
SIP captures (like $v{mon}) do not have to be named. You can use just $v, but you have to be consistent; if a pattern contains unnamed captures, then all captures must be unnamed. In this case, the result table is a simple list of values.
sip.match is a useful shortcut if you like your matches to be 'in place'. (It caches the result, so it is not much slower than explicitly using sip.compile.)
> sip.match('($q{first},$q{second})','("john","smith")',res)
true
> res
{second='smith',first='john'}
> res = {}
> sip.match('($q,$q)','("jan","smit")',res) -- unnamed captures
true
> res
{'jan','smit'}
> sip.match('($q,$q)','("jan", "smit")',res)
false ---> oops!
> sip.match('( $q , $q )','("jan", "smit")',res)
true
As a general rule, allow for whitespace in your patterns.
Finally, putting a ' $' at the end of a pattern means 'capture the rest of the line, starting at the first non-space'.
> sip.match('( $q , $q ) $','("jan", "smit") and a string',res)
true
> res
{'jan','smit','and a string'}
> res = {}
> sip.match('( $q{first} , $q{last} ) $','("jan", "smit") and a string',res)
true
> res
{first='jan',rest='and a string',last='smit'}
Programs should not depend on quirks of your operating system. They will be harder to read, and need to be ported for other systems. The worst of course is hardcoding paths like 'c:\' in programs, and wondering why Vista complains so much. But even something like dir..'\\'..file is a problem, since Unix can't understand backslashes in this way. dir..'/'..file is usually portable, but it's best to put this all into a simple function, path.join. If you consistently use path.join, then it's much easier to write cross-platform code, since it handles the directory separator for you.
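To show why a small function is better than string concatenation at every call site, here is a hedged sketch of a join helper in plain Lua. It is not path.join itself; it assumes only that package.config begins with the directory separator of the running Lua (documented from Lua 5.2, also present in 5.1):

```lua
-- The first character of package.config is the directory separator
-- this Lua was built with: a backslash on Windows, '/' elsewhere.
local sep = package.config:sub(1, 1)

-- join(p1, p2): hypothetical helper, not Penlight's path.join.
local function join(p1, p2)
    if p1 == '' then return p2 end
    local last = p1:sub(-1)
    if last == '/' or last == sep then
        return p1 .. p2        -- p1 already ends with a separator
    end
    return p1 .. sep .. p2
end

print(join('fred', 'alice.txt'))  -- fred\alice.txt on Windows, fred/alice.txt on Unix
```

The caller never mentions a separator, so the same script runs unchanged on both kinds of system.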
pl.path provides the same functionality as Python's os.path module (11.1).
> p = 'c:\\bonzo\\DOG.txt'
> = path.normcase (p)
c:\bonzo\dog.txt
> = path.splitext (p)
c:\bonzo\DOG .txt
> = path.extension (p)
.txt
> = path.basename (p)
DOG.txt
> = path.exists(p)
false
> = path.join ('fred','alice.txt')
fred\alice.txt
> = path.exists 'pretty.lua'
true
> = path.getsize 'pretty.lua'
2125
> = path.isfile 'pretty.lua'
true
> = path.isdir 'pretty.lua'
false
It is becoming increasingly important for all programmers, not just on Unix, to only write to where they are allowed to write. path.expanduser will expand '~' (tilde) into the home directory. Depending on your OS, this will be a guaranteed place where you can create files:
> = path.expanduser '~/mydata.txt'
'C:\Documents and Settings\SJDonova/mydata.txt'
> = path.expanduser '~/mydata.txt'
/home/sdonovan/mydata.txt
Under Windows, os.tmpname returns a path which leads to your drive root full of temporary files. (And increasingly, you do not have access to this root folder.) This is corrected by path.tmpname, which uses the environment variable TMP:
> os.tmpname() -- not a good place to put temporary files!
'\s25g.'
> path.tmpname()
'C:\DOCUME~1\SJDonova\LOCALS~1\Temp\s25g.1'
A useful extra function is pl.path.package_path(), which will tell you the path of a particular Lua module. So on my system, package_path('pl.path') returns 'C:\Program Files\Lua\5.1\lualibs\pl\path.lua', and package_path('lfs') returns 'C:\Program Files\Lua\5.1\clibs\lfs.dll'.
pl.dir provides some useful functions for working with directories. fnmatch will match a filename against a shell pattern, and filter will return any files in the supplied list which match the given pattern; these correspond to the functions in the Python fnmatch module. getdirectories will return all directories contained in a directory, and getfiles will return all files in a directory which match a shell pattern. (These functions return the files as a table, unlike lfs.dir which returns an iterator.)
Copying files is surprisingly tricky. dir.copyfile and dir.movefile attempt to use the best implementation possible. On Windows, they link to the API functions CopyFile and MoveFile, but only if the alien package is installed (this is true for Lua for Windows). Otherwise, the system copy command is used.
dir.makepath can create a full path, creating subdirectories as necessary; rmtree is the Nuclear Option of file deleting functions, since it will recursively clear out and delete all directories found beginning at a path (there is a similar function with this name in the Python shutil module).
> = dir.makepath 't\\temp\\bonzo'
> = path.isdir 't\\temp\\bonzo'
true
> = dir.rmtree 't'
dir.rmtree depends on dir.walk, which is a powerful tool for scanning a whole directory tree. Here is the implementation of dir.rmtree:
--- remove a whole directory tree.
-- @param path A directory path
function dir.rmtree(fullpath)
for root,dirs,files in dir.walk(fullpath) do
for i,f in ipairs(files) do
os.remove(path.join(root,f))
end
lfs.rmdir(root)
end
end
dir.clonetree clones directory trees. The first argument is a path that must exist, and the second path is the path to be cloned. (Note that this path cannot be inside the first path, since this leads to madness.) By default, it will then just recreate the directory structure. You can in addition provide a function, which will be applied for all files found.
-- make a copy of my libs folder
require 'pl'
p1 = [[d:\dev\lua\libs]]
p2 = [[D:\dev\lua\libs\..\tests]]
dir.clonetree(p1,p2,dir.copyfile)
A more sophisticated version, which only copies files which have been modified:
-- p1 and p2 as before, or from arg[1] and arg[2]
dir.clonetree(p1,p2,function(f1,f2)
local res
local t1,t2 = path.getmtime(f1),path.getmtime(f2)
if t1 > t2 then
res = dir.copyfile(f1,f2)
end
return res -- indicates successful operation
end)
dir.clonetree uses path.common_prefix. With p1 and p2 defined above, the common path is 'd:\dev\lua'. So 'd:\dev\lua\libs\testfunc.lua' is copied to 'd:\dev\lua\tests\testfunc.lua', etc.
If you need to find the common path of a list of files, then tablex.reduce will do the job:
> p3 = [[d:\dev]]
> = tablex.reduce(path.common_prefix,{p1,p2,p3})
'd:\dev'
The first thing to consider is this: do you actually need to write a custom file reader? And if the answer is yes, the next question is: can you write the reader in as clear a way as possible? Correctness, Robustness, and Speed; pick the first two and the third can be sorted out later, if necessary.
A common sort of data file is the configuration file format commonly used on Unix systems. This format is often called a property file in the Java world.
# Read timeout in seconds
read.timeout=10
# Write timeout in seconds
write.timeout=10
Here is a simple Lua implementation:
-- property file parsing with Lua string patterns
props = {}
for line in io.lines() do
if line:find('#',1,true) ~= 1 and not line:find('^%s*$') then
local var,value = line:match('([^=]+)=(.*)')
props[var] = value
end
end
Very compact, but it suffers from a disease similar to that of equivalent Perl programs: it uses odd string patterns which are 'lexically noisy'. Noisy code like this slows the casual reader down. (For an even more direct way of doing this, see the next section, 'Reading Configuration Files')
Another implementation, using the Penlight libraries:
-- property file parsing with extended string functions
require 'pl'
props = {}
for line in io.lines() do
if not line:startswith('#') and not line:is_blank() then
local var,value = line:splitv('=')
props[var] = value
end
end
This is more self-documenting; it is generally better to make the code express the intention, rather than having to scatter comments everywhere - comments are necessary, of course, but mostly to give the higher view of your intention that cannot be expressed in code. It is slightly slower, true, but in practice the speed of this script is determined by i/o, so further optimization is unnecessary.
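For experimenting without a file on standard input, the same parsing can be wrapped in a function that takes the text as a string. This is a sketch using only the standard string library; parse_props is a hypothetical helper name, not part of Penlight:

```lua
-- parse_props(text): hypothetical helper that parses property-file text
-- held in a string, using only the standard string library.
local function parse_props(text)
    local props = {}
    for line in text:gmatch('[^\r\n]+') do
        -- skip comment lines and lines containing only whitespace
        if line:sub(1, 1) ~= '#' and not line:match('^%s*$') then
            local var, value = line:match('^([^=]+)=(.*)$')
            if var then               -- ignore lines without an '='
                props[var] = value
            end
        end
    end
    return props
end

local sample = [[
# Read timeout in seconds
read.timeout=10
# Write timeout in seconds
write.timeout=10
]]

local props = parse_props(sample)
print(props['read.timeout'])    --> 10
print(props['write.timeout'])   --> 10
```

Note that the values come back as strings; converting them with tonumber is left to the caller.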
Text data is sometimes unstructured, for example a file containing words. The pl.input module has a number of functions which make processing such files easier. For example, a script to count the number of words in standard input (see pl.input.words):
-- countwords.lua
require 'pl'
local k = 0
for w in input.words(io.stdin) do
k = k + 1
end
print('count',k)
Or this script to calculate the average of a set of numbers (see pl.input.numbers):
-- average.lua
require 'pl'
local k = 0
local sum = 0
for n in input.numbers(io.stdin) do
sum = sum + n
k = k + 1
end
print('average',sum/k)
These scripts can be improved further by eliminating loops. In the last case, there is a perfectly good function seq.sum which can already take a sequence of numbers and return both the total and the count for us:
-- average2.lua
require 'pl'
local total,n = seq.sum(input.numbers())
print('average',total/n)
A further simplification here is that if numbers or words are not passed an argument, they will grab their input from standard input. The first script can be rewritten:
-- countwords2.lua
require 'pl'
print('count',seq.count(input.words()))
A useful feature of a sequence generator like numbers is that it can read from a string source. Here is a script to calculate the sums of the numbers on each line in a file:
-- sums.lua
require 'pl'
for line in io.lines() do
print(seq.sum(input.numbers(line)))
end
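The two ideas used in these scripts, a generator that reads numbers from a string and a sum that returns both the total and the count, can be sketched in plain Lua. The names numbers and sum below are stand-ins written for illustration; in particular, silently skipping non-numeric tokens is a choice made for this sketch, not a statement about input.numbers:

```lua
-- numbers(s): iterator over the numeric tokens of a string.
-- Stand-in for illustration; non-numeric tokens are skipped here.
local function numbers(s)
    local next_token = s:gmatch('%S+')
    return function()
        local tok = next_token()
        while tok do
            local n = tonumber(tok)
            if n then return n end
            tok = next_token()
        end
    end
end

-- sum(iter): total and count of the values produced by an iterator.
local function sum(iter)
    local total, n = 0, 0
    for x in iter do
        total = total + x
        n = n + 1
    end
    return total, n
end

local total, n = sum(numbers('10 20 30 40'))
print(total, n)
print('average', total / n)
```

Returning the count alongside the total is what lets the average be computed without a second pass over the data.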
It is very common to find data in columnar form, either space or comma-separated, perhaps with an initial set of column headers. Here is a typical example:
EventID Magnitude LocationX LocationY LocationZ
981124001 2.0 18988.4 10047.1 4149.7
981125001 0.8 19104.0 9970.4 5088.7
981127003 0.5 19012.5 9946.9 3831.2
...
input.fields is designed to extract several columns, given some delimiter (default to whitespace). Here is a script to calculate the average X location of all the events:
-- avg-x.lua
require 'pl'
io.read() -- skip the header line
local sum,count = seq.sum(input.fields {3})
print(sum/count)
input.fields is passed either a field count, or a list of column indices, starting at one as usual. So in this case we're only interested in column 3. If you pass it a field count, then you get every field up to that count:
for id,mag,locX,locY,locZ in input.fields (5) do
....
end
input.fields by default tries to convert each field to a number. It will skip lines which clearly don't match the pattern, but will abort the script if there are any fields which cannot be converted to numbers.
The second parameter is a delimiter, by default spaces. ' ' is understood to mean 'any number of spaces', i.e. '%s+'. Any Lua string pattern can be used.
The third parameter is a data source, by default standard input (see pl.input.create_getter). It assumes that the data source has a read method which brings in the next line, i.e. it is a 'file-like' object. As a special case, a string will be split into its lines:
> for x,y in input.fields(2,' ','10 20\n30 40\n') do print(x,y) end
10 20
30 40
Note the default behaviour for bad fields, which is to show the offending line number:
> for x,y in input.fields(2,' ','10 20\n30 40x\n') do print(x,y) end
10 20
line 2: cannot convert '40x' to number
This behaviour of input.fields is appropriate for a script which you want to fail immediately with an appropriate user error message if conversion fails. The fourth optional parameter is an options table: {no_fail=true} means that conversion is attempted but if it fails it just returns the string, rather as AWK would operate. You are then responsible for checking the type of the returned field. {no_convert=true} switches off conversion altogether and all fields are returned as strings.
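The two conversion policies can be sketched in plain Lua. This is only an illustration of the behaviour described above, not Penlight's implementation; split_fields and the way it reads its options table are hypothetical names chosen for the example.

```lua
-- Hypothetical helper, NOT part of Penlight: split one line on whitespace
-- and convert each field, mimicking the default and no_fail policies.
local function split_fields(line, opts)
  opts = opts or {}
  local fields = {}
  for word in line:gmatch('%S+') do
    local num = tonumber(word)
    if num ~= nil then
      fields[#fields + 1] = num
    elseif opts.no_fail then
      fields[#fields + 1] = word   -- keep the raw string, as AWK would
    else
      -- level 0: report the message without a source position
      error("cannot convert '" .. word .. "' to number", 0)
    end
  end
  return fields
end

local fields = split_fields('30 40x', {no_fail = true})
print(type(fields[1]), type(fields[2]))   -- number, then string
```

With no_fail set, the caller has to check the type of each field, which is the trade-off the text mentions.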
Sometimes it is useful to bring a whole dataset into memory, for operations such as extracting columns. Penlight provides a flexible reader specifically for reading this kind of data (see pl.data.read). Given a file looking like this:
x,y
10,20
2,5
40,50
Then data.read will create a table like this, with each row represented by a sublist:
> t = data.read 'test.txt'
> t
{{10,20},{2,5},{40,50},
column_by_name=function: 00435A50,
column_names=function: 00426B30,select=function: 00452CE0,
fieldnames={'x','y'},delim=','}
You can now analyze this returned table using the supplied methods. For instance, the method column_by_name returns a table of all the values of that column.
-- testdata.lua
require 'pl'
d = data.read('fev.txt')
for _,name in ipairs(d.fieldnames) do
local col = d:column_by_name(name)
if type(col[1]) == 'number' then
local total,n = seq.sum(col)
utils.printf("Average for %s is %f\n",name,total/n)
end
end
data.read tries to be clever when given data; by default it expects a first line of column names, unless any of them are numbers. It tries to deduce the column delimiter by looking at the first line. Sometimes it guesses wrong; these things can be specified explicitly.
d = data.read('xyz.txt',{fieldnames=true,delim=' '})
A very powerful feature is a way to execute SQL-like queries on such data:
-- queries on tabular data
require 'pl'
local d = data.read('xyz.txt')
local q = d:select('x,y,z where x > 3 and z < 2 sort by y')
for x,y,z in q do
print(x,y,z)
end
Please note that the format of queries is restricted to the following syntax:
<fieldlist> [ 'where' <lua-condn> [ 'sort by' <field>] ]
I've always been an admirer of the AWK programming language; with filter (see pl.data.filter) you can get Lua programs which are just as compact:
-- printxy.lua
require 'pl'
data.filter 'x,y where x > 3'
Finally, for the curious, the global variable _DEBUG can be used to print out the actual iterator function which a query generates and dynamically compiles. By using code generation, we can get pretty much optimal performance out of arbitrary queries.
> lua -lpl -e "_DEBUG=true" -e "data.filter 'x,y where x > 4 sort by x'" < test.txt
return function (t)
local i = 0
local v
local ls = {}
for i,v in ipairs(t) do
if v[1] > 4 then
ls[#ls+1] = v
end
end
table.sort(ls,function(v1,v2)
return v1[1] < v2[1]
end)
local n = #ls
return function()
i = i + 1
v = ls[i]
if i > n then return end
return v[1],v[2]
end
end
10,20
40,50
The config module provides a simple way to convert several kinds of configuration files into a Lua table. Consider the simple example:
# test.config
# Read timeout in seconds
read.timeout=10
# Write timeout in seconds
write.timeout=5
#acceptable ports
ports = 1002,1003,1004
This can be easily brought in using config.read and the result shown with pretty.write (see pl.pretty.write):
-- readconfig.lua
local config = require 'pl.config'
local pretty= require 'pl.pretty'
local t = config.read(arg[1])
print(pretty.write(t))
and the output of lua readconfig.lua test.config is:
{
ports = {
1002,
1003,
1004
},
write_timeout = 5,
read_timeout = 10
}
That is, config.read() will bring in all key/value pairs, ignore # comments, and ensure that the key names are proper Lua identifiers by replacing non-identifier characters with '_'. If the values are numbers, then they will be converted. (So the value of t.write_timeout is the number 5). In addition, any values which are separated by commas will be converted likewise into an array.
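The rules in that paragraph (skip comments, turn keys into identifiers, convert numbers, split on commas) can be sketched in plain Lua. This is a simplified illustration under those stated rules, not the code config.read actually uses; parse_config and convert are hypothetical names.

```lua
-- Hypothetical sketch, NOT Penlight's implementation of config.read.
local function convert(s)
  return tonumber(s) or s          -- numbers are converted, the rest stay strings
end

local function parse_config(text)
  local t = {}
  for line in text:gmatch('[^\n]+') do
    if not line:match('^%s*#') then                 -- ignore '#' comments
      local key, value = line:match('^%s*(.-)%s*=%s*(.-)%s*$')
      if key then
        key = key:gsub('[^%w_]', '_')               -- make a valid identifier
        if value:find(',', 1, true) then            -- comma list becomes an array
          local list = {}
          for item in value:gmatch('[^,]+') do
            list[#list + 1] = convert(item:match('^%s*(.-)%s*$'))
          end
          t[key] = list
        else
          t[key] = convert(value)
        end
      end
    end
  end
  return t
end

local t = parse_config('# comment\nread.timeout=10\nports = 1002,1003,1004\n')
print(t.read_timeout, #t.ports)
```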
Any line can be continued with a backslash. So this will all be considered one line:
names=one,two,three, \
four,five,six,seven, \
eight,nine,ten
Windows-style INI files are also supported. The section structure of INI files translates naturally to nested tables in Lua:
; test.ini
[timeouts]
read=10 ; Read timeout in seconds
write=5 ; Write timeout in seconds
[portinfo]
ports = 1002,1003,1004
The output is:
{
portinfo = {
ports = {
1002,
1003,
1004
}
},
timeouts = {
write = 5,
read = 10
}
}
You can now refer to the write timeout as t.timeouts.write.
As a final example of the flexibility of config.read, if passed this simple comma-delimited file
one,two,three
10,20,30
40,50,60
1,2,3
it will produce the following table:
{
{ "one", "two", "three" },
{ 10, 20, 30 },
{ 40, 50, 60 },
{ 1, 2, 3 }
}
config.read isn't designed to read all CSV files in general, but intended to support some Unix configuration files not structured as key-value pairs, such as '/etc/passwd'.
This function is intended to be a Swiss Army Knife of configuration readers, but it does have to make assumptions, and you may not like them. So there is an optional extra parameter which allows some control, which is a table that may have the following fields:
{
variablilize = true,
convert_numbers = true,
trim_space = true,
list_delim = ','
}
variablilize is the option that converted write.timeout in the first example to the valid Lua identifier write_timeout. If convert_numbers is true, then an attempt is made to convert any string that starts like a number. trim_space ensures that there is no starting or trailing whitespace with values, and list_delim is the character that will be used to decide whether to split a value up into a list (it may be a Lua string pattern such as '%s+'.)
For instance, the password file in Unix is colon-delimited:
t = config.read('/etc/passwd',{list_delim=':'})
This produces the following output on my system (only last two lines shown):
{
...
{
"user",
"x",
"1000",
"1000",
"user,,,",
"/home/user",
"/bin/bash"
},
{
"sdonovan",
"x",
"1001",
"1001",
"steve donovan,28,,",
"/home/sdonovan",
"/bin/bash"
}
}
You can get this into a more sensible format, where the usernames are the keys, with:
t = tablex.pairmap(t,function(k,v) return v,v[1] end)
and you get:
{ ...
sdonovan = {
"sdonovan",
"x",
"1001",
"1001",
"steve donovan,28,,",
"/home/sdonovan",
"/bin/bash"
}
...
}
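For readers unfamiliar with pairmap, the re-keying step amounts to the following plain-Lua loop. The helper name index_by_first and the shortened sample rows are made up for the illustration.

```lua
-- Plain-Lua equivalent of the re-keying step (hypothetical helper name).
local function index_by_first(rows)
  local res = {}
  for _, row in ipairs(rows) do
    res[row[1]] = row     -- key is the first field, value is the whole row
  end
  return res
end

-- shortened sample rows, for illustration only
local rows = {
  {'user', 'x', '1000'},
  {'sdonovan', 'x', '1001'},
}
local by_name = index_by_first(rows)
print(by_name.sdonovan[3])
```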
Although Lua's string pattern matching is very powerful, there are times when something more powerful is needed. pl.lexer.scan provides a lexical scanner which tokenizes a string, classifying tokens into numbers, strings, etc.
> lua -lpl
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
> tok = lexer.scan 'alpha = sin(1.5)'
> = tok()
iden alpha
> = tok()
= =
> = tok()
iden sin
> = tok()
( (
> = tok()
number 1.5
> = tok()
) )
> = tok()
The scanner is a function, which is repeatedly called and returns the type and value of the token. Recognized types are iden,string,number,space,comment and keyword, and everything else is represented by itself. Note that by default the scanner will skip any 'space' tokens.
'comment' and 'keyword' aren't applicable to the plain scanner, which is not language-specific, but a scanner which understands Lua is available:
> for t,v in lexer.lua 'for i=1,n do' do print(t,v) end
keyword for
space
iden i
= =
number 1
, ,
iden n
space
keyword do
A lexical scanner is useful where you have highly-structured data which is not nicely delimited by newlines. For example, here is a snippet of an in-house file format which it was my task to maintain:
points (818344.1,-20389.7,-0.1),(818337.9,-20389.3,-0.1),(818332.5,-20387.8,-0.1)
,(818327.4,-20388,-0.1),(818322,-20387.7,-0.1),(818316.3,-20388.6,-0.1)
,(818309.7,-20389.4,-0.1),(818303.5,-20390.6,-0.1),(818295.8,-20388.3,-0.1)
,(818290.5,-20386.9,-0.1),(818285.2,-20386.1,-0.1),(818279.3,-20383.6,-0.1)
,(818274,-20381.2,-0.1),(818274,-20380.7,-0.1);
Here is code to extract the points using pl.lexer:
-- assume 's' contains the text above...
local expecting = lexer.expecting
local append = table.insert
local tok = lexer.scan(s)
local points = {}
local t,v = tok() -- should be 'points'
while t ~= ';' do
local c = {}
t,v = tok() -- should be '('
t,v = tok()
c.x = v
expecting(tok,',')
t,v = tok()
c.y = v
expecting(tok,',')
t,v = tok()
c.z = v
expecting(tok,')')
t,v = tok() -- either ',' or ';'
append(points,c)
end
The expecting function grabs the next token and if the type doesn't match, it throws an error. (pl.lexer, unlike other PL libraries, raises errors if something goes wrong, so you should wrap your code in pcall to catch the error gracefully.)
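Wrapping error-raising code in pcall looks like this in plain Lua. The parse_point function below is a hypothetical stand-in for a routine that raises errors the way the text says the lexer functions do; it is not part of Penlight.

```lua
-- Hypothetical stand-in for a parsing routine that raises errors.
local function parse_point(s)
  local x, y = s:match('^%((%-?[%d%.]+),(%-?[%d%.]+)%)$')
  if not x then
    error("expecting '(x,y)' but got '" .. s .. "'", 0)
  end
  return {x = tonumber(x), y = tonumber(y)}
end

-- pcall turns a raised error into an ordinary (false, message) result
local ok, res = pcall(parse_point, '(818344.1,-20389.7)')
print(ok, res.x, res.y)
local ok2, err = pcall(parse_point, '818344.1;-20389.7')
print(ok2, err)
```

The caller can then decide whether to report err to the user or to recover, instead of letting the script die with a stack trace.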
The ultimate highly-structured data is, of course, program source. Here is a snippet from 'text-lexer.lua':
-- uses asserteq from pl.test
lines = [[
for k,v in pairs(t) do
if type(k) == 'number' then
print(v) -- array-like case
else
print(k,v)
end
end
]]
ls = List()
for tp,val in lexer.lua(lines,{space=true,comments=true}) do
assert(tp ~= 'space' and tp ~= 'comment')
if tp == 'keyword' then ls:append(val) end
end
asserteq(ls,List{'for','in','do','if','then','else','end','end'})
pl.lexer.lua does not by default exclude spaces and comments, but the second argument is an exception list.
Here is a useful little utility that identifies all common global variables present in a Lua module:
-- testglobal.lua
require 'pl'
local txt = utils.readfile(arg[1])
local globals = List()
for t,v in lexer.lua(txt) do
if t == 'iden' and _G[v] then
globals:append(v)
end
end
print(pretty.write(seq.count_map(globals)))
Rather than dumping the whole list, with its duplicates, we pass it through seq.count_map which turns the list into a table where the keys are the values, and the associated values are the number of times those values occur in the sequence. Typical output looks like this:
{
type = 2,
pairs = 2,
table = 2,
print = 3,
tostring = 2,
require = 1,
ipairs = 4
}
You could further pass this through tablex.keys to get a unique list of symbols. This can be useful when writing 'strict' Lua modules, where all global symbols must be defined as locals at the top of the file.
For a more detailed use of lexer.scan, please look at 'testxml.lua' in the examples directory.
A Lua iterator (in its simplest form) is a function which can be repeatedly called to return a set of one or more values. The for ... in statement understands these iterators, and loops until the function returns nil. There are standard sequence adapters for tables in Lua, ipairs and pairs, and io.lines returns an iterator over all the lines in a file. In the Penlight libraries, such iterators are also called sequences. A sequence of single values (say from io.lines) is called single-valued, whereas the sequence defined by pairs is double-valued.
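As a minimal plain-Lua illustration of these terms, here is one single-valued and one double-valued sequence written by hand. The function names squares and enumerate are invented for the example.

```lua
-- A single-valued sequence: each call returns the next square, then nil.
local function squares(n)
  local i = 0
  return function()
    i = i + 1
    if i <= n then return i * i end
  end
end

-- A double-valued sequence, like pairs: index and value together.
local function enumerate(t)
  local i = 0
  return function()
    i = i + 1
    if t[i] ~= nil then return i, t[i] end
  end
end

for sq in squares(3) do print(sq) end
for i, v in enumerate {'a', 'b'} do print(i, v) end
```

Both are closures that keep their position in an upvalue, and the for ... in loop stops as soon as the first returned value is nil.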
pl.seq provides a number of useful iterators, and some functions which operate on sequences. At first sight this example looks like an attempt to write Python in Lua, (with the sequence being inclusive):
> for i in seq.range(1,4) do print(i) end
1
2
3
4
But range is actually equivalent to Python's xrange, since it generates a sequence, not a list. To get a list, use seq.copy(seq.range(1,10)), which takes any single-value sequence and makes a table from the result. seq.list is like ipairs except that it does not give you the index, just the value.
> for x in seq.list {1,2,3} do print(x) end
1
2
3
seq.printall is useful for printing out sequences, and provides some finer control over formatting, such as a delimiter, the number of fields per line, and a format string to use (see string.format):
> seq.printall(seq.random(10))
0.0012512588885159 0.56358531449324 0.19330423902097 ....
> seq.printall(seq.random(10),',',4,'%4.2f')
0.17,0.86,0.71,0.51
0.30,0.01,0.09,0.36
0.15,0.17,
filter will filter a sequence using a boolean function (often called a predicate). For instance, this code only prints lines in a file which are composed of digits:
for l in seq.filter(io.lines(file),pl.string.isdigit) do print(l) end
We've already encountered seq.sum when discussing input.numbers. This can also be expressed with seq.reduce:
> seq.reduce(function(x,y) return x + y end,seq.list{1,2,3,4})
10
seq.reduce applies a binary function in a recursive fashion, so that:
reduce(op,{1,2,3}) => op(1,reduce(op,{2,3})) => op(1,op(2,3))
It's now possible to easily generate other cumulative operations; the standard operations declared in pl.operator are useful here:
> ops = require 'pl.operator'
> -- can also say '*' instead of ops.mul
> seq.reduce(ops.mul,input.numbers '1 2 3 4')
24
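The recursive expansion of reduce can be written out directly in plain Lua. This sketch works on a list rather than a sequence and is only meant to mirror the expansion shown in the text; seq.reduce itself may well be implemented as a loop, and for associative operations such as addition and multiplication the grouping makes no difference to the result.

```lua
-- Recursive reduce over a list, following the expansion in the text.
-- Hypothetical sketch, not Penlight's seq.reduce.
local function reduce(op, list, i)
  i = i or 1
  if i >= #list then return list[i] end     -- last element (or nil if empty)
  return op(list[i], reduce(op, list, i + 1))
end

local function add(x, y) return x + y end
local function mul(x, y) return x * y end

print(reduce(add, {1, 2, 3, 4}))   -- 10
print(reduce(mul, {1, 2, 3, 4}))   -- 24
```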
There are functions to extract statistics from a sequence of numbers:
> l1 = List {10,20,30}
> l2 = List {1,2,3}
> = seq.minmax(l1)
10 30
> = seq.sum(l1)
60 3
It is common to get sequences where values are repeated, say the words in a file. count_map will take such a sequence and count the values, returning a table where the keys are the unique values, and the value associated with each key is the number of times they occurred:
> t = seq.count_map {'one','fred','two','one','two','two'}
> t
{one=2,fred=1,two=3}
This will also work on numerical sequences, but you cannot expect the result to be a proper list, i.e. having no 'holes'. Instead, you always need to use pairs to iterate over the result:
> t = seq.count_map {1,2,4,2,2,3,4,2,6}
> for k,v in pairs(t) do print(k,v) end
1 1
2 4
3 1
4 2
6 1
unique uses count_map to return a list of the unique values, that is, just the keys of the resulting table.
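Counting and de-duplicating take only a few lines of plain Lua. The versions below are simplified sketches that accept lists only; they are not the Penlight functions of the same name, which also accept sequences.

```lua
-- Simplified list-only sketches of count_map and unique.
local function count_map(list)
  local counts = {}
  for _, v in ipairs(list) do
    counts[v] = (counts[v] or 0) + 1
  end
  return counts
end

local function unique(list)
  local keys = {}
  for k in pairs(count_map(list)) do keys[#keys + 1] = k end
  return keys   -- order is unspecified, as with any pairs traversal
end

local t = count_map {'one', 'fred', 'two', 'one', 'two', 'two'}
print(t.one, t.fred, t.two)   -- 2 1 3
```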
last turns a single-valued sequence into a double-valued sequence with the current value and the last value:
> for current,last in seq.last {10,20,30,40} do print (current,last) end
20 10
30 20
40 30
This makes it easy to do things like identify repeated lines in a file, or construct differences between values. filter can handle double-valued sequences as well, so one could filter such a sequence to only return cases where the current value is less than the last value by using operator.lt.
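The idea of pairing each value with its predecessor, and then keeping only the decreasing steps, can be sketched without Penlight. The adapter name with_last is hypothetical and works on a list only.

```lua
-- Hypothetical sketch of a 'last'-style adapter over a list.
local function with_last(list)
  local i = 1
  return function()
    i = i + 1
    if list[i] ~= nil then return list[i], list[i - 1] end
  end
end

-- collect the values that are smaller than their predecessor
local drops = {}
for current, last in with_last {10, 20, 15, 40, 5} do
  if current < last then drops[#drops + 1] = current end
end
print(table.concat(drops, ','))   -- 15,5
```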
Finally, sequences can be combined, either by 'zipping' them or by concatenating them.
> for x,y in seq.zip(l1,l2) do print(x,y) end
10 1
20 2
30 3
> for x in seq.splice(l1,l2) do print(x) end
10
20
30
1
2
3
List comprehensions are a compact way to create tables by specifying their elements. In Python, you can say this:
ls = [x for x in range(5)] # == [0,1,2,3,4]
In Lua, using pl.comprehension:
> C = require('pl.comprehension').new()
> C ('x for x=1,10') ()
{1,2,3,4,5,6,7,8,9,10}
C is a function which compiles a list comprehension string into a function. In this case, the function has no arguments. The parentheses are redundant for a function taking a string argument, so this works as well:
> C 'x^2 for x=1,4' ()
{1,4,9,16}
> C '{x,x^2} for x=1,4' ()
{{1,1},{2,4},{3,9},{4,16}}
Note that the expression can be any function of the variable x!
The basic syntax so far is <expr> for <set>, where <set> can be anything that the Lua for statement understands. <set> can also just be the variable, in which case the values will come from the argument of the comprehension. Here I'm emphasizing that a comprehension is a function which can take a list argument:
> C '2*x for x' {1,2,3}
{2,4,6}
> dbl = C '2*x for x'
> dbl {10,20,30}
{20,40,60}
Here is a somewhat more explicit way of saying the same thing; _1 is a placeholder referring to the first argument passed to the comprehension.
> C '2*x for _,x in pairs(_1)' {10,20,30}
{20,40,60}
This extended syntax is useful when you wish to collect the result of some iterator, such as io.lines. This comprehension creates a function which creates a table of all the lines in a file:
> f = io.open('array.lua')
> lines = C 'line for line in _1:lines()' (f)
> #lines
118
There are a number of functions that may be applied to the result of a comprehension:
> C 'min(x for x)' {1,44,0}
0
> C 'max(x for x)' {1,44,0}
44
> C 'sum(x for x)' {1,44,0}
45
(These are equivalent to a reduce operation on a list.)
After the for part, there may be a condition, which filters the output. This comprehension collects the even numbers from a list:
> C 'x for x if x % 2 == 0' {1,2,3,4,5}
{2,4}
There may be a number of for parts:
> C '{x,y} for x = 1,2 for y = 1,2' ()
{{1,1},{1,2},{2,1},{2,2}}
> C '{x,y} for x for y' ({1,2},{10,20})
{{1,10},{1,20},{2,10},{2,20}}
These comprehensions are useful when dealing with functions of more than one variable, and are not so easily achieved with the other Penlight functional forms.
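For comparison, here is what the two-argument comprehension '{x,y} for x for y' computes when written as explicit nested loops. This is a hand-written equivalent with an invented function name, not the code that pl.comprehension generates.

```lua
-- What '{x,y} for x for y' computes, as explicit nested loops.
local function pairs_of(xs, ys)
  local res = {}
  for _, x in ipairs(xs) do
    for _, y in ipairs(ys) do
      res[#res + 1] = {x, y}
    end
  end
  return res
end

local res = pairs_of({1, 2}, {10, 20})
print(#res, res[2][1], res[2][2])   -- 4 1 20
```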
Lua functions may be treated like any other value, although of course you cannot multiply or add them. One operation that makes sense is function composition, which chains function calls (so (f * g)(x) is f(g(x)).)
> func = require 'pl.func'
> printf = func.compose(io.write,string.format)
> printf("hello %s\n",'world')
hello world
true
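Composition itself is a one-line closure in plain Lua. The sketch below is a minimal two-function version written for illustration; it is not the source of func.compose.

```lua
-- Minimal sketch of function composition: compose(f, g)(...) is f(g(...)).
local function compose(f, g)
  return function(...) return f(g(...)) end
end

local shout = compose(string.upper, string.format)
print(shout('hello %s', 'world'))   -- HELLO WORLD
```

All the arguments go to the inner function, and whatever it returns is passed on to the outer one.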
Many functions require you to pass a function as an argument, say to apply to all values of a sequence or as a callback. Usually this function is required to have a particular number of arguments, often one (in the case of the map functions) or two (for comparison functions.) But often useful functions have the wrong number of arguments. For instance, operator.add simply adds its two arguments, but can't be passed to tablex.map, which expects to pass only one value to its function. So there is a need to construct a function of one argument from one of two arguments, binding the extra argument to a given value.
Currying takes a function of n arguments and returns a function of n-1 arguments where the first argument is bound to some value:
> p2 = func.curry(print,'start>')
> p2('hello',2)
start> hello 2
The module pl.operator contains all the Lua operators expressed as functions, much as the Python module of the same name.
> ops = require 'pl.operator'
> tablex.filter({1,-2,10,-1,2},func.curry(ops.gt,0))
{-2,-1}
> tablex.filter({1,-2,10,-1,2},func.curry(ops.le,0))
{1,10,2}
This unfortunately reads backwards, because curry is always binding the first argument!
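A plain-Lua sketch makes the "backwards" reading concrete. The curry and gt functions here are written out by hand for the example and are not taken from pl.func or pl.operator.

```lua
-- Minimal sketch of currying: bind the FIRST argument of f.
local function curry(f, first)
  return function(...) return f(first, ...) end
end

local function gt(a, b) return a > b end

-- is_negative(x) is gt(0, x), i.e. 0 > x, not x > 0
local is_negative = curry(gt, 0)
print(is_negative(-2), is_negative(10))   -- true false
```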
Currying is a specialized form of function binding. Here is another way to say the print example:
> p2 = func.bind(print,'start>',func._1,func._2)
> p2('hello',2)
start> hello 2
where _1 and _2 are placeholder variables, corresponding to the first and second argument respectively.
Having func all over the place is distracting, so it's useful to pull all of pl.func into the local context. Here is the filter example, this time the right way around:
> utils.import 'pl.func'
> tablex.filter({1,-2,10,-1,2},bind(ops.gt,_1,0))
{1,10,2}
tablex.merge does a general merge of two tables. This example shows the usefulness of binding the last argument of a function.
> S1 = {john=27,jane=31,mary=24}
> S2 = {jane=31,jones=50}
> intersection = bind(tablex.merge,_1,_2,false)
> union = bind(tablex.merge,_1,_2,true)
> intersection(S1,S2)
{jane=31}
> union(S1,S2)
{mary=24,jane=31,john=27,jones=50}
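The effect of the final boolean argument can be sketched in plain Lua. This version is written from the behaviour shown in the transcript above, and the parameter name dup is an assumption; it is not the Penlight source.

```lua
-- Hypothetical sketch of a merge controlled by a boolean flag:
-- false keeps only keys present in both tables, true keeps all keys.
local function merge(t1, t2, dup)
  local res = {}
  for k, v in pairs(t1) do
    if dup or t2[k] ~= nil then res[k] = v end
  end
  if dup then
    for k, v in pairs(t2) do res[k] = v end
  end
  return res
end

local S1 = {john = 27, jane = 31, mary = 24}
local S2 = {jane = 31, jones = 50}
local both = merge(S1, S2, false)
local all = merge(S1, S2, true)
print(both.jane, both.john, all.jones)
```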
When using bind to curry print, we got a function of precisely two arguments, whereas we really want our function to use varargs like print. This is the role of _0:
> _DEBUG = true
> p = bind(print,'start>',_0)
return function (fn,_v1)
return function(...) return fn(_v1,...) end
end
> p(1,2,3,4,5)
start> 1 2 3 4 5
I've turned on the global _DEBUG flag, so that the function generated is printed out. It is actually a function which generates the required function; the first call binds the value of _v1 to 'start>'.
A common pattern in Penlight is a function which applies another function to all elements in a table or a sequence, such as tablex.map or seq.filter. Lua does anonymous functions well, although they can be a bit tedious to type:
> tablex.map(function(x) return x*x end,{1,2,3,4})
{1,4,9,16}
pl.func allows you to define placeholder expressions, which can cut down on the typing required, and also make your intent clearer. First, we bring contents of pl.func into our context, and then supply an expression using placeholder variables, such as _1,_2,etc. (C++ programmers will recognize this from the Boost libraries.)
> utils.import 'pl.func'
> tablex.map(_1*_1,{1,2,3,4})
{1,4,9,16}
Functions of up to 5 arguments can be generated.
> tablex.map2(_1+_2,{1,2,3},{10,20,30})
{11,22,33}
These expressions can use arbitrary functions, although they must first be registered with the functional library. pl.func.register brings in a single function, and pl.func.import brings in a whole table of functions, such as math.
> sin = register(math.sin)
> tablex.map(sin(_1),{1,2,3,4})
{0.8414709848079,0.90929742682568,0.14112000805987,-0.75680249530793}
> import 'math'
> tablex.map(cos(2*_1),{1,2,3,4})
{-0.41614683654714,-0.65364362086361,0.96017028665037,-0.14550003380861}
A common operation is calling a method of a set of objects:
> tablex.map(_1:sub(1,1),{'one','four','x'})
{'o','f','x'}
> tablex.map(_1:at(1),{'one','four','x'})
{'o','f','x'}
There are some restrictions on what operators can be used in PEs. For instance, because the __len metamethod cannot be overridden by plain Lua tables, we need to define a special function to express #_1:
> tablex.map(Len(_1),{'one','four','x'})
{3,4,1}
Likewise for comparison operators, which cannot be overloaded for different types, and thus also have to be expressed as a special function:
> tablex.filter(Gt(_1,0),{1,-1,2,4,-3})
{1,2,4}
It is useful to express the fact that a function returns multiple values. For instance, tablex.pairmap expects a function that will be called with the key and the value, and returns the new value and the key, in that order.
> tablex.pairmap(Args(_2,_1:upper()),{fred=1,alice=2})
{ALICE=2,FRED=1}
PEs cannot contain nil values, since PE function arguments are represented as an array. Instead, a special value called Nil is provided. So say _1:f(Nil,1) instead of _1:f(nil,1).
A placeholder expression cannot be automatically used as a Lua function. The technical reason is that the call operator must be overloaded to construct function calls like _1(1). If you want to force a PE to return a function, use pl.func.I.
> tablex.map(_1(10),{I(2*_1),I(_1*_1),I(_1+2)})
{20,100,12}
Here we make a table of functions taking a single argument, and then call them all with a value of 10.
There are some performance considerations to using placeholder expressions. Instantiating a PE requires building and compiling a new function, which is not such a fast operation. So to get best performance, factor out PEs from loops like this:
local fn = I(_1:f() + _2:g())
for i = 1,n do
res[i] = tablex.map2(fn,first[i],second[i])
end
The best way to test a library is to apply it to a problem one finds useful and interesting. As someone who moves between Windows and Linux frequently, the Windows console comes across as a very poor cousin to bash. luash is not however a reinvention of a Unix shell but an exploration of a few key ideas. The first is that we already have a good scripting language (Lua), so there is no need to produce a new 'language'. The second is that commands can be Lua functions, and so adding new commands becomes easy. Thirdly, commands do not have to communicate purely through input and output, in text, which is of course the innovation of Microsoft's new PowerShell. So commands may generate structured data, which can be manipulated by other commands.
C:\libs>luash
C:\lang\lua\projects\libs
$> cd $LUA_DEV
C:\Program Files\Lua\5.1
$> cd
1 C:\lang\lua\projects\libs
2 C:\Program Files\Lua\5.1
$> cd $1
1 = C:\lang\lua\projects\libs
C:\lang\lua\projects\libs
$>
Environment variables are expanded with $, like the Unix shells. The cd command without parameters dumps a list of directories which have been visited, and $n is a reference to the last list of directories or files.
$> list *.c
1 copy-tokens.c
2 lanes.c
3 old-tokens.c
4 tokens.c
5 tokens2.c
$> head -n3 $4
4 = tokens.c
/* TOKENS.C
*/
#include <ctype.h>
The list command does a directory listing, and also sets the context for any further $n expansions.
$> list -l *.c
1 copy-tokens.c 9K 03/30/09 19:03:46
2 lanes.c 0K 11/11/07 09:31:59
3 old-tokens.c 13K 11/14/07 18:49:06
4 tokens.c 9K 11/14/07 19:33:39
5 tokens2.c 9K 03/30/09 19:02:00
$> show context
context has 5 items and fields file size date
$> show file size where size > 10K
1 old-tokens.c 13K
$> echo $files
copy-tokens.c lanes.c old-tokens.c tokens.c tokens2.c
The -l flag gives an extended directory listing. The output of list is not really the text you see - that is generated by the default dumper. Rather, it sets the context to be a dataset with three fields file,size and date. This context can then be manipulated by the show command. (Note the special variable files which also retrieves values from the context.)
$> ls is list -l
$> ls t*.lua
There were 38 items
edit result? [true]
$>
Aliases can be created, which can save typing on common commands. Note that if a listing is greater than 25 lines, luash will ask you if you wish to see the result in your favourite editor, using the special alias 'edit', which is set to Notepad by default.
Naturally, Notepad is not my favourite editor, so I have the alias 'edit is metapad' in my .luashrc file, which sits in my home directory. luash also keeps the profile path in a variable:
$> echo ~/.luashrc
"C:\Documents and Settings\steve/.luashrc"
$> echo $LUASHRC
"C:\Documents and Settings\steve\.luashrc"
The rest of this section will show how Penlight was used to create luash.
Before a line is parsed, variable expansion takes place. First search the environment, then look at any luash variables. Quote anything with spaces, and return the empty string if not found (like sh). If the variable is a number, then we use get_value_from_context to look up that index in the current context.
local function expandvars (v)
local subst,status
if v:isdigit() then
status,subst = get_value_from_context(v,'file')
if not status then
status,subst = get_value_from_context(v,'path')
end
if not status then errorsh('no file or path') end
print(v..' = '..subst)
else
subst = os.getenv(v) or variables[v]
if type(subst) == 'function' then
subst,status = subst()
if status then variables[v] = subst end
return subst
end
end
return quote_if_necessary(subst)
end
If a variable is a function, then it is called for the string value. This is how the files variable is implemented.
Note how I am freely using extended string methods like isdigit in preference to the usual Lua style (v:find '^%d+$') because the intention is more obvious.
Parsing unstructured text and still writing readable code is always a challenge. But the default lexer.scan chops things up too fine for our purposes. Shell commands are delimited by space, although they may be quoted (this alone rules out simple string splitting.) Due to the insanity of paths containing spaces on Windows, we also have to cope with things like "C:\Program Files\Lua\5.1"\lua\pl which result from variable expansion. This suggests a customized scanner which just handles these cases:
local yield = coroutine.yield
local matches = {
-- not interested in space...
{'^%s+',function(t) return end},
-- common pattern with expanded path with spaces next to filename
{ '^"[^"]+"%S+', function(t) return yield('string',t:sub(2):gsub('"','')) end},
-- a double-quoted string
{'^"[^"]+"',function(t) return yield('string',t:sub(2,-2)) end},
-- otherwise, just a token
{'^%S+',function(t) return yield('token',t) end},
}
local function scanner (line)
return lexer.scan(line,matches,{})
end
The match items will be applied in order, so put the most specific rules first, and finally catch anything left with '%S+'.
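The "first matching rule wins" idea can be tried out without Penlight. The sketch below is a plain-Lua approximation that returns a list of token strings instead of yielding (type, value) pairs; the names rules and tokenize are invented, and it makes no claim about how lexer.scan is implemented.

```lua
-- Hypothetical sketch: try each rule in order, first match wins.
local rules = {
  -- whitespace is skipped (no action)
  {'^%s+', nil},
  -- quoted path glued to an unquoted suffix: drop the quotes
  {'^"[^"]+"%S+', function(s) return (s:gsub('"', '')) end},
  -- plain double-quoted string: strip the surrounding quotes
  {'^"[^"]+"', function(s) return s:sub(2, -2) end},
  -- anything else is a token as-is
  {'^%S+', function(s) return s end},
}

local function tokenize(line)
  local tokens, pos = {}, 1
  while pos <= #line do
    local matched = false
    for _, rule in ipairs(rules) do
      -- '^' anchors the pattern at pos when an init position is given
      local s, e = line:find(rule[1], pos)
      if s then
        if rule[2] then tokens[#tokens + 1] = rule[2](line:sub(s, e)) end
        pos = e + 1
        matched = true
        break
      end
    end
    if not matched then break end
  end
  return tokens
end

local toks = tokenize('cd "C:\\Program Files\\Lua\\5.1"\\lua\\pl')
print(#toks, toks[2])
```

If the generic '^%S+' rule were listed first, the quoted path would be cut at its first space, which is why the specific rules come before it.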
We have to look one token ahead, to cope with the = and is commands. If it isn't one of these, then the token must be inserted back into the stream, so that the next call of tok() will return it. We have to be careful not to prematurely close the stream when it's empty, so lexer provides a hack variable:
t,cmd = tok()
if not lexer.finished then
tn,nxt = tok()
end
We do not want the stream to be closed in the case of only one parameter, since that parameter may be an alias, and then we will need to insert tokens into the stream again. Note that seq.copy2 will convert the output of a scanner into a token list, which is a list of (type,value) pairs:
if alias[cmd] then
local subst = expand_string(alias[cmd])
if tn then
subst = substitute_alias_parameters(tok)
end
local tokens = seq.copy2(scanner(subst))
if tn then
append(tokens,{tn,nxt})
end
lexer.insert(tok,tokens)
_,cmd = tok()
...
(This is less than elegant, since one always has to consider the 'lookahead' token. No doubt a better solution can be found.)
The value of the alias must go through the same variable expansion described in the last section. For this, we need some way of quoting $ and ~ so that they aren't immediately expanded. The convention in luash is that if there is a space after these characters, then this is understood to mean the character itself. For instance:
pwd is echo $ pwd
pwd is already a special variable we can use, but without the space $pwd is evaluated immediately. Using show alias we can see that the value of the alias is echo $pwd, which will be properly expanded when the alias is actually used.
A command like list -l does not print out a listing itself; it sets the context to be a dataset with fieldnames file,size,date. This is a dataset in the sense that pl.data understands it (see pl.data); it is a table of rows, plus at least the fieldnames field.
The function dump_data actually prints out the values. By default it will not print out more than 25 items (you can change this with the variable maxdump); if interactive it will then ask you if you wish to view the results in an editor. The heart of this function is very short and shows the power of array operations:
-- convert our dataset into strings using the column formatters
local outs = array.map2('()',1,2,d.formatters,d,flags)
-- get the maximum column widths
local maxlens = array.reduce_cols(math.max,array.map('#',outs))
-- can now right justify each line appropriately
for i,row in ipairs(outs) do
row = tablex.map2(justify,row,maxlens)
outf:write(('%02d '):format(i)..concat(row),'\n')
end
d.formatters is a list of display formatter functions, and we have to call these functions, passing the corresponding column. Here is the default formatter for the 'file' column:
function formatters.file (val,flags)
if flags.n then
val = path.basename(val)
end
return val
end
So, any command outputting data containing a 'file' column can be passed a flag -n for forcing the file to be displayed without the directory part.
Once a context is established, a number of operations become possible. The pseudo-variable files is actually a function:
function variables.files ()
local slot = get_context_slot("file",true)
local res = array.column(context,slot)
res = tablex.imap(quote_if_necessary,res)
return concat(res,' ')
end
get_context_slot simply returns the index of the given fieldname in the current dataset. (If not found, it throws an error.) Now remember that the data is stored as rows, so to get all files we need to extract that particular index from each of the rows. pl.operator provides the useful array function that we can use with tablex.imap to extract a column from a matrix using row storage. The files may contain spaces, so there's a further use of imap to apply a conditional quoting operation. The final result is then in a suitable form to pass to any other command.
$> # suppress output, just set context
$> list -q *.c
$> wc $files
373 1042 9538 copy-tokens.c
5 8 65 hello.c
0 0 0 lanes.c
532 1518 13573 old-tokens.c
373 1042 9538 tokens.c
373 1042 9538 tokens2.c
1656 4652 42252 total
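The column extraction and conditional quoting behind $files can be sketched in plain Lua. The helpers below are hypothetical re-creations for illustration: quote_if_necessary is luash's name, but its body here is a guess, and column stands in for the array.column call.

```lua
-- Extract one column from data stored as rows.
local function column(rows, idx)
  local res = {}
  for i, row in ipairs(rows) do res[i] = row[idx] end
  return res
end

-- Hypothetical re-creation of the luash helper: quote only if there is whitespace.
local function quote_if_necessary(s)
  if s:find('%s') then return '"' .. s .. '"' end
  return s
end

-- made-up context with fields file,size
local context = { {'tokens.c', 9}, {'my file.c', 13} }
local names = column(context, 1)
for i, name in ipairs(names) do names[i] = quote_if_necessary(name) end
print(table.concat(names, ' '))   -- tokens.c "my file.c"
```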
The most versatile command in luash is show. Without any arguments, it just dumps out the current context. If given fields and a condition, it uses the pl.data machinery to filter the context. First a query expression is built up:
where_idx = args:index 'where'
...
fieldlist = args:slice(1,where_idx-1)
condn = concat(args:slice(where_idx+1),' ')
condn = condn:gsub('[%d%.]+[KM]',function(s)
local tp = s:at(-1) --last char in string
local fact = 1024
s = s:sub(1,-2)
if tp == 'M' then fact = 1024*fact end
return tostring(tonumber(s)*fact)
end)
Q = concat(fieldlist,',')..' where '..condn
Lua commands receive their args argument as a List (see pl.list), so extra methods like slice are available. (The gsub replaces numbers of the form 24K and 1.5M with the plain numbers they stand for, here 24576 and 1572864.)
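The suffix expansion can be tried in isolation. The following is a minimal sketch in plain Lua, not the luash source: expand_sizes is a hypothetical name, the pattern uses captures rather than s:at, and the result is rounded down to a whole number so that it prints the same way on different Lua versions.

```lua
-- Replace numbers such as 24K or 1.5M in a condition string with plain
-- numbers (K = 1024, M = 1024*1024). Hypothetical helper, not part of luash.
local function expand_sizes(condn)
  local factors = {K = 1024, M = 1024 * 1024}
  -- The outer parentheses drop gsub's second result (the match count).
  return (condn:gsub('(%d+%.?%d*)([KM])', function(num, suffix)
    local n = math.floor(tonumber(num) * factors[suffix])
    return string.format('%d', n)
  end))
end

print(expand_sizes('length > 24K and length < 1.5M'))
```

Note that a Lua character class such as [KM] lists its members directly; a | inside the brackets would be matched as a literal character rather than acting as alternation.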
local query,err = context:select(Q)
...
local res = seq.copy_tuples(query)
res.fieldnames = fieldlist
prepare_data(res)
dump_data(res,flags)
show_context = res
A query over the given fields of the dataset is created, and the results are copied into a new dataset (this is a useful pattern when dealing with pl.data). By default the results will be shown (but -q will always override this) and the show context is set. We do this because you may want to try another query on the same original context; the command accept explicitly makes the show context into the main context.
It is straightforward to write commands in Lua once you understand the conventions. Any command must be in the commands table, and you may optionally put some help in the command_help table. Your function will be called using pcall, so don't worry too much about error handling at first.
-- dostring.lua
command_help.dostring = "Evaluate a Lua expression with access to globals"
function commands.dostring (args,flags)
print(loadstring('return '..args[1])())
end
This can be loaded into luash with the load command:
$> load dostring
$> dostring 10+20
30
$> dostring variables.pwd
D:\dev\lua\libs
Here luash is exposing some of its internals; variables is the global table containing luash variables. (At this early stage of luash's development, I am not too worried about this exposure, taking the view that the best way of finding out what can safely be exposed is to experiment.)
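To make the calling convention concrete, here is a small sketch of how a shell could look up a command and run it under pcall. It is an illustration only: run_command and the two sample commands are hypothetical, and luash's real dispatcher may well differ.

```lua
-- Stand-ins for the tables that luash provides to command files.
local commands, command_help = {}, {}

command_help.shout = 'print the arguments in upper case'
function commands.shout(args, flags)
  return table.concat(args, ' '):upper()
end

-- A command that fails, to show what pcall does with the error.
function commands.broken(args, flags)
  error('something went wrong')
end

-- Hypothetical dispatcher: returns the command's result, or nil plus a message.
local function run_command(name, args, flags)
  local cmd = commands[name]
  if not cmd then
    return nil, 'unknown command: ' .. name
  end
  local ok, res = pcall(cmd, args, flags or {})
  if not ok then
    return nil, 'error in ' .. name .. ': ' .. tostring(res)
  end
  return res
end

print(run_command('shout', {'hello', 'luash'}))
print(run_command('broken', {}))
print(run_command('nosuch', {}))
```

Because the error raised inside commands.broken is caught by pcall, the shell can report it and carry on, which is why a new command does not need elaborate error handling of its own.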
The next example shows how you can create and set the context using a Lua command. I've always admired the Unix locate command, and tend to use a poor man's version: use dir /S /B *.* > contents.txt in the root of a drive to build up an index of all files on the disk, and then just use grep to find patterns in it. It is not difficult to make this operation play nicely with luash; if a command returns a dataset (with the field fieldnames set) then luash will make it the current context:
-- locate.lua
require 'pl'
if not command_help then
print "must load this using luash"
return
end
command_help.locate = 'search file index for pattern'
function commands.locate (args,flags)
local f = io.popen('grep '..args[1]..' d:\\contents.txt','r')
local d = {fieldnames = 'file'}
local maxn = variables.locate_max or 500
local i = 1
for line in f:lines() do
if i > maxn then
print('first '..maxn..' items found')
break
end
d[i] = {line}
i = i + 1
end
f:close() -- release the pipe to grep
return d
end
Now luash will cope with any fieldnames, but file is special. In particular, file references like $1 and the files variable will now work as expected:
$> locate grep
1 D:\utils\bin\egrep.exe
2 D:\utils\bin\fgrep.exe
3 D:\utils\bin\grep.exe
$> echo $files
D:\utils\bin\egrep.exe D:\utils\bin\fgrep.exe D:\utils\bin\grep.exe
$> fileinfo $1
1 = D:\utils\bin\egrep.exe
1 D:\utils\bin\egrep.exe 81K 02/15/05 19:38:06
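The session above relies on $1 and $files being replaced before a command runs. One way such an expansion could work is sketched below in plain Lua; expand is a hypothetical function written for this illustration and is not taken from luash (it also skips the quoting of names that contain spaces).

```lua
-- Expand $N (the file in row N) and $files (all files) in a command line.
-- `context` is a list of rows and `slot` is the index of the 'file' field.
local function expand(line, context, slot)
  line = line:gsub('%$(%d+)', function(n)
    local row = context[tonumber(n)]
    if not row then
      error('no such item in context: ' .. n)
    end
    return row[slot]
  end)
  -- A function is used as the replacement so that any % characters in
  -- the file names are not treated as gsub escapes.
  line = line:gsub('%$files', function()
    local names = {}
    for i, row in ipairs(context) do
      names[i] = row[slot]
    end
    return table.concat(names, ' ')
  end)
  return line
end

local context = {
  {'D:\\utils\\bin\\egrep.exe'},
  {'D:\\utils\\bin\\grep.exe'},
}
print(expand('fileinfo $1', context, 1))
print(expand('echo $files', context, 1))
```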
The power comes when you make searches that will generate a lot of output:
$> locate_max = 2500
$> locate -q .lua
context contains 771 items
$> show -n file where file:find 'sip' and file:find 'lua$'
1 sipscan.lua
2 old-sip-test.lua
3 sip-test.lua
4 sipscan.lua
5 siptest1.lua
6 siptest2.lua
7 siptest3.lua
8 siptest4.lua
9 sip.lua
10 test-sip.lua
11 old-sip-test.lua
12 sip-test.lua
13 sipscan.lua
14 siptest1.lua
15 siptest2.lua
16 siptest3.lua
17 siptest4.lua
18 sip.lua
19 test-sip.lua
It is in fact possible to read any dataset with this little command:
-- readdata.lua
require 'pl'
command_help.readdata = 'read dataset into context'
function commands.readdata (args,flags)
return data.read(args[1])
end
Given an input file test.txt like this:
file,length
fred,1
bonzo,2
alice,1
Then we can read it in:
$> load readdata
$> readdata test.txt
1 fred 1
2 bonzo 2
3 alice 1
$> show context
context has 3 items and fields file length
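To see the shape of the dataset that such a command hands back, here is a plain-Lua stand-in that parses the same comma-separated text. It only illustrates the rows-plus-fieldnames layout; read_csv is a hypothetical helper that handles far fewer cases than data.read, and storing fieldnames as a list is an assumption of this sketch.

```lua
-- Parse simple comma-separated text: the first line names the fields,
-- each later line becomes one row. Numeric fields are converted to numbers.
local function read_csv(text)
  local d = {fieldnames = {}}
  local first = true
  for line in text:gmatch('[^\r\n]+') do
    local row = {}
    for field in line:gmatch('[^,]+') do
      if first then
        d.fieldnames[#d.fieldnames + 1] = field
      else
        row[#row + 1] = tonumber(field) or field
      end
    end
    if first then
      first = false
    else
      d[#d + 1] = row
    end
  end
  return d
end

local d = read_csv('file,length\nfred,1\nbonzo,2\nalice,1\n')
print(#d, table.concat(d.fieldnames, ' '))
for i, row in ipairs(d) do
  print(i, row[1], row[2])
end
```

Since the first field is called file, a dataset like this would get the same special treatment from $1 and $files as the output of locate.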