The World Wide Web Consortium (W3C) offers a number of free tools to help with the correct generation and processing of HTML and XML files. The HTML-XML-utils package is a set of simple utilities for manipulating HTML and XML files from the command line. It is available for many Linux distributions and can be useful for anyone who has to process HTML or XML files on a regular basis.
To install the package on Ubuntu, use:
sudo apt-get install html-xml-utils
There are 31 tools in this package. Here is a summary of what they can do:
- cexport – create header file of exported declarations from a C file
- hxaddid – add IDs to selected elements
- hxcite – replace bibliographic references by hyperlinks
- hxcite-mkbib – expand references and create bibliography
- hxcopy – copy an HTML file while preserving relative links
- hxcount – count elements and attributes in HTML or XML files
- hxextract – extract selected elements
- hxclean – apply heuristics to correct an HTML file
- hxprune – remove marked elements from an HTML file
- hxincl – expand included HTML or XML files
- hxindex – create an alphabetically sorted index
- hxmkbib – create bibliography from a template
- hxmultitoc – create a table of contents for a set of HTML files
- hxname2id – move some ID= or NAME= from A elements to their parents
- hxnormalize – pretty-print an HTML file
- hxnum – number section headings in an HTML file
- hxpipe – convert XML to a format easier to parse with Perl or AWK
- hxprintlinks – number links and add table of URLs at end of an HTML file
- hxremove – remove selected elements from an XML file
- hxtabletrans – transpose an HTML or XHTML table
- hxtoc – insert a table of contents in an HTML file
- hxuncdata – replace CDATA sections by character entities
- hxunent – replace HTML predefined character entities by UTF-8
- hxunpipe – convert output of hxpipe back to XML format
- hxunxmlns – replace “global names” by XML Namespace prefixes
- hxwls – list links in an HTML file
- hxxmlns – replace XML Namespace prefixes by “global names”
- asc2xml, xml2asc – convert between UTF-8 and entities
- hxref – generate cross-references
- hxselect – extract elements that match a (CSS) selector
To introduce you to the power of this tool set, here are some examples of how to use a few of the commands.
The hxnormalize command reformats an HTML file so that it is nicely laid out and easy to read. To test this command, we will create an ugly HTML file. Select and copy the following lines and paste them directly into a terminal window.
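The exact sample markup is not reproduced here, so the lines below are a stand-in sketch: a hypothetical one-line page with the closing </p> and </li> tags left out. The page text and the w3.org link are assumptions chosen purely for illustration.

```shell
# Write a deliberately untidy, single-line HTML file (hypothetical sample content).
# The closing </p> and </li> tags are omitted on purpose.
cat > test.html << 'EOF'
<html><head><title>Test page</title></head><body><h1>Test</h1><p>First paragraph<p>Second paragraph with <a href="https://www.w3.org/">a link</a><ul><li>One<li>Two</ul></body></html>
EOF
```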
This will create a file called test.html. The HTML is missing some of the closing tags and is all written in one line. The hxnormalize command will reformat the file and write the pretty version to the standard output (stdout). Here is how you run the command:
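A minimal sketch of the invocation, assuming test.html is in the current directory. The guard lines are additions for illustration: they create a tiny stand-in file if none exists and skip the call when the tool is not installed.

```shell
# Create a tiny stand-in file if test.html is not already present (hypothetical content).
[ -f test.html ] || printf '<p>First paragraph<p>Second paragraph\n' > test.html

# Pretty-print the file to stdout; -e inserts the missing closing tags.
if command -v hxnormalize >/dev/null 2>&1; then
    hxnormalize -e test.html
else
    echo "hxnormalize not found; install the HTML-XML-utils package first" >&2
fi
```

The input file itself is left unchanged, because the reformatted markup only goes to stdout.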
The “-e” flag tells hxnormalize to insert any missing closing tags.
You can also run the command against a web page by replacing “test.html” with a URL, for example:
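As a sketch of the URL form: the address below is only an assumption for illustration, so substitute any page you want to inspect. The call needs network access, and whether https addresses work may depend on the installed version of the tools.

```shell
# Hypothetical example URL; replace it with the page you want to tidy.
url="https://www.w3.org/"

# Fetch and pretty-print the page, showing only the first 20 lines of output.
if command -v hxnormalize >/dev/null 2>&1; then
    hxnormalize -e "$url" | head -n 20
fi
```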
The hxwls command parses a local HTML file or a web page and lists the links it finds in the HTML. For example:
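A small sketch using a local file, so that it works without network access. The file name links.html and its two links are hypothetical.

```shell
# Create a small stand-in page containing two links (hypothetical content).
printf '<p><a href="https://www.w3.org/">W3C</a> <a href="about.html">About</a>\n' > links.html

# List the links found in the file, one per line.
if command -v hxwls >/dev/null 2>&1; then
    hxwls links.html
fi

# For a website, pass its URL instead of a file name, for example:
#   hxwls https://www.maketecheasier.com/
```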
Here are the first few lines of output for the Make Tech Easier website:
The hxtabletrans command transposes a table so that rows become columns and columns become rows.
Let’s create an HTML file with a simple table. Select and copy the following lines, and then paste them directly into a terminal window.
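The original table markup is not reproduced here, so this is a stand-in sketch built from the two names used in this example. The two-column layout (first name, surname) is an assumption.

```shell
# Write a simple two-row table, one person per row (hypothetical layout).
cat > table.html << 'EOF'
<table>
<tr><td>Jill</td><td>Smith</td></tr>
<tr><td>Eve</td><td>Jackson</td></tr>
</table>
EOF
```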
The result is a file called table.html. In a web browser, the table shows Jill Smith and Eve Jackson on separate rows, one person per row.
If you run the hxtabletrans command, it will write the transposed table to the standard output. The results can be redirected to another file like this:
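A minimal sketch of the redirect, assuming table.html is in the current directory. The guard lines are additions for illustration: they create a stand-in table if the file is missing and skip the call when the tool is not installed.

```shell
# Create a stand-in table if table.html is not already present (hypothetical content).
[ -f table.html ] || printf '<table><tr><td>Jill</td><td>Smith</td></tr><tr><td>Eve</td><td>Jackson</td></tr></table>\n' > table.html

# Transpose the table and save the result in a new file.
if command -v hxtabletrans >/dev/null 2>&1; then
    hxtabletrans table.html > table2.html
fi
```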
The new file, table2.html, shows Jill Smith and Eve Jackson in columns rather than in rows as in the original, so each person now occupies one column of the table.
Most of the commands are used in a similar way to the examples above: you specify a file or URL to process, and the output is written to stdout. Try experimenting with the different commands, as you might find them useful.
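Because the output goes to stdout, the commands can also be chained with pipes. The sketch below is one possible combination rather than a recipe from the package itself: it converts a small, hypothetical list.html to well-formed XML with hxnormalize -x and then extracts the list items with hxselect.

```shell
# Create a small stand-in file with unclosed list items (hypothetical content).
printf '<ul><li>One<li>Two</ul>\n' > list.html

# Convert to well-formed XML, then print the content of each li element on its own line.
if command -v hxnormalize >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
    hxnormalize -x list.html | hxselect -c -s '\n' 'li'
fi
```

Running hxnormalize -x first matters here because hxselect expects well-formed XML as its input.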
If you have any questions about the HTML-XML utilities then please feel free to ask them in the comments below and we will see if we can help.