How to Manipulate HTML and XML Files from the Command Line

The World Wide Web Consortium (W3C) offers a number of free tools to help generate and process HTML and XML files correctly. Its HTML-XML-utils package is a set of simple utilities for manipulating HTML and XML files from the command line. It is available for many Linux distributions and is useful for anyone who processes HTML or XML files on a regular basis.

To install the package on Ubuntu, use:

sudo apt-get install html-xml-utils

There are 31 tools in this package. Here is a summary of what they can do:

  • cexport – create a header file of exported declarations from a C file
  • hxaddid – add IDs to selected elements
  • hxcite – replace bibliographic references by hyperlinks
  • hxcite-mkbib – expand references and create bibliography
  • hxcopy – copy an HTML file while preserving relative links
  • hxcount – count elements and attributes in HTML or XML files
  • hxextract – extract selected elements
  • hxclean – apply heuristics to correct an HTML file
  • hxprune – remove marked elements from an HTML file
  • hxincl – expand included HTML or XML files
  • hxindex – create an alphabetically sorted index
  • hxmkbib – create bibliography from a template
  • hxmultitoc – create a table of contents for a set of HTML files
  • hxname2id – move some ID= or NAME= from A elements to their parents
  • hxnormalize – pretty-print an HTML file
  • hxnum – number section headings in an HTML file
  • hxpipe – convert XML to a format easier to parse with Perl or AWK
  • hxprintlinks – number links and add a table of URLs at the end of an HTML file
  • hxremove – remove selected elements from an XML file
  • hxtabletrans – transpose an HTML or XHTML table
  • hxtoc – insert a table of contents in an HTML file
  • hxuncdata – replace CDATA sections by character entities
  • hxunent – replace HTML predefined character entities with UTF-8 characters
  • hxunpipe – convert output of hxpipe back to XML format
  • hxunxmlns – replace “global names” by XML Namespace prefixes
  • hxwls – list links in an HTML file
  • hxxmlns – replace XML Namespace prefixes by “global names”
  • asc2xml, xml2asc – convert between UTF-8 and entities
  • hxref – generate cross-references
  • hxselect – extract elements that match a (CSS) selector

To introduce you to the power of this tool set, here are some examples of how to use a few of the commands.

The “hxnormalize” command reformats an HTML file so that it is consistently indented and easy to read. To test this command, we will create an ugly HTML file. Select and copy the following lines and paste them directly into a terminal window.

cat > test.html << __EOF__
<html><body><p>hello</html>
__EOF__

This will create a file called test.html. The HTML is missing some of its closing tags and is written all on one line. The hxnormalize command will reformat the file and write the pretty version to the standard output (stdout). Here is how you run the command:

hxnormalize -e test.html

The “-e” flag tells hxnormalize to insert any missing closing tags.

You can also run the command against a web page by replacing “test.html” with a URL, for example:

hxnormalize http://www.example.com
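
hxnormalize has a few other formatting options besides “-e”. The sketch below is only an illustration: the output name test-clean.html is made up for this example, and the flag descriptions in the comments follow the hxnormalize man page as we read it, so check “man hxnormalize” on your system before relying on them.

```shell
# Recreate the sample file so this snippet can be run on its own.
cat > test.html << __EOF__
<html><body><p>hello</html>
__EOF__

# -x      use XML conventions (this also adds the missing end tags)
# -i 4    indent each nesting level by four spaces
# -l 100  wrap lines at 100 characters
# test-clean.html is a hypothetical output name chosen for this example.
hxnormalize -x -i 4 -l 100 test.html > test-clean.html

# Show the reformatted result.
cat test-clean.html
```

Redirecting to a new file leaves the original untouched, which is handy while you are still experimenting with the options.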

The hxwls command will parse a local HTML file or a website, and list the links within the HTML. For example:

hxwls http://www.example.com

[Screenshot: the first few lines of hxwls output for the Make Tech Easier website]
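
If you would rather try hxwls without touching the network, a small local file works just as well. This is a minimal sketch: links.html and the two URLs in it are made up for the example, and the description of the “-l” option is taken from the hxwls man page as we understand it.

```shell
# Create a small sample page with two absolute links (hypothetical content).
cat > links.html << __EOF__
<html><body>
<p><a href="https://www.w3.org/">W3C</a>
<p><a href="https://www.example.com/about.html">About</a>
</body></html>
__EOF__

# List the links, one URL per line.
hxwls links.html

# Long listing: per the man page, this adds the element name
# (and the rel attribute, if any) next to each URL.
hxwls -l links.html

# Because the output is plain text, it combines well with other tools,
# for example to count the links found.
hxwls links.html | wc -l
```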

The hxtabletrans command changes a table so that rows become columns and columns become rows.

Let’s create an HTML file with a simple table. Select and copy the following lines, and then paste them directly into a terminal window.

cat > table.html << __EOF__
<table>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
__EOF__

The result is a file called table.html. In a web browser the table would look something like this:

Jill    Smith      50
Eve     Jackson    94

If you run the hxtabletrans command, then it will write the transposed table to the standard output. The results can be redirected to another file like this:

hxtabletrans table.html > table2.html

The new file, table2.html, will show Jill Smith and Eve Jackson in columns, rather than in rows as in the original. The resulting table will be something like this:

Jill     Eve
Smith    Jackson
50       94

Most of the commands are used in a similar way to the examples above, i.e. you specify a file or URL to process and the output is written to stdout. Try experimenting with the different commands, as you might find them useful.
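
Because the tools write to stdout, they can also be chained together with pipes. The following is a rough sketch rather than a recipe: it assumes, as the hxselect man page describes, that hxselect reads well-formed XML from standard input and that “-c” prints only the content of the matched element. That is why the file is passed through “hxnormalize -x” first.

```shell
# Recreate the sample file so this snippet can be run on its own.
cat > test.html << __EOF__
<html><body><p>hello</html>
__EOF__

# Step 1: hxnormalize -x turns the sloppy HTML into well-formed XML.
# Step 2: hxselect -c 'p' prints only the content of the <p> element.
hxnormalize -x test.html | hxselect -c 'p'

# hxselect does not necessarily end its output with a newline,
# so add one to keep the terminal tidy.
echo
```

The same pattern should work with a URL in place of test.html, since hxnormalize accepts either.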

If you have any questions about the HTML-XML utilities then please feel free to ask them in the comments below and we will see if we can help.