Saturday, April 30, 2011

Rebuilding Radio NZ - Part 4: Content Extraction & Recipes

The next group of posts will deal with the migration of content. In each I’ll show how we were managing the particular content type in Matrix, the design of the content type in ELF, how we migrated the content, and how we manage the content now.

Getting the content out

There were two options for getting the content out of Matrix.

The first was a custom script that we could use for exporting the whole site. The difficulty was defining everything I would need up-front for Squiz to code against - there were many types of content and different requirements, most not known. The other issue was cost - a script to extract just news was estimated to take about 30 hours to write.

The second option was to set up pages that display groups of single assets in XML format. An example of this was audio assets: these are self-contained and include all the data required for importing into ELF. Where this was not possible, screen-scraping would be used. More on this in a future post.
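For self-contained assets, the consuming side of such a feed needs only the standard library. Here is a minimal sketch; the `<asset>`, `<title>` and `<url>` element names are assumptions for illustration, not the real Matrix output format:

```ruby
require "rexml/document"

# Hypothetical sketch: read an exported XML listing of audio assets and
# return the fields an import script would need. Element names are
# invented for this example.
def parse_exported_assets(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements("//asset").map do |asset|
    { title: asset.elements["title"]&.text,
      url:   asset.elements["url"]&.text }
  end
end
```

Each hash can then be handed to whatever creates the corresponding asset on the ELF side.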

The DIY approach has worked out simpler and cheaper, and I have been able to adjust the export and import to suit each kind of content, building on code from the previous phase.

Recipes

The recipes section was chosen to go first because the section was completely self-contained apart from some in-bound links from programme pages.

Matrix recipes

At the original launch in 2005 we had high expectations for our recipes section - we wanted to divide content into sections based on ingredients and style of cooking. This proved to be more difficult than expected. At right is what the tree in Matrix looked like.

The recipes home page had seasonal ingredients at the bottom, the right-hand section featured recipes based on the season or special events (Christmas, Thanksgiving, Easter, etc.), and visitors could search or browse by recipe title.

Managing the content was simplified by putting recipes into lettered folders, but the complex asset structure made it hard to see at a glance which recipes were in which section. Another problem was that the URL structure had an extra (redundant) segment containing the first letter of the recipe.

When tagging was added we tried it, but this required linking every recipe to a pre-named tag and creating new tags on the fly. It would have meant a complete reworking of the section, and all-in-all was too unwieldy to use, even for a section that gets only 3 to 5 new recipes a week.

Recipes took about 5-10 minutes to format, link into the correct folders and set to a future status that matched the broadcast time.

Pasting recipe content into the WYSIWYG from email and Word documents was patchy. Often the markup would contain code that could not be removed with the built-in code cleaner. We are very fussy about the quality of our markup, so we developed a separate pre-parser to deal with markup issues. The parser has an FCKeditor and a drop-down to select the type of content. It was able to remove extraneous markup and ensure that the XHTML returned was valid, and could also do basic formatting for some types of content.
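The pre-parser itself was never published, but the kind of clean-up it performed can be sketched in a few lines of Ruby. The patterns below are illustrative only - the real parser was far more thorough:

```ruby
# Rough sketch of Word/email markup clean-up, in the spirit of our
# pre-parser. The tag and attribute patterns here are examples, not
# the actual rules we used.
def clean_pasted_markup(html)
  html
    .gsub(%r{</?(?:o:p|span|font)[^>]*>}i, "") # drop Word/Outlook-style tags
    .gsub(/\s(?:class|style)="[^"]*"/i, "")    # strip inline presentation attributes
    .gsub(/&nbsp;/, " ")                       # normalise non-breaking spaces
end
```

Anything that survives clean-up can then be validated before it is accepted as XHTML.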

Even a two-step cut and paste process was faster than hand editing code in the Matrix WYSIWYG (or any editor).

Design

Designing recipes was pretty simple. Recipes have a title, body and broadcast date. They have a chef, tags and are broadcast on a particular programme.

In Rails terms (edited for brevity):

class Recipe < ActiveRecord::Base
  has_many :chefs
  belongs_to :programme
  acts_as_taggable_on :tags
end

The programme model contained basic information about the programme such as name and webpath, just to get us started.

Having a chef association and tags allows us to provide navigation by tag and by chef. Since adding both features visitor engagement in that part of the site has increased.

Content Migration

Importing the content was tricky. Each recipe had chef and programme information in the HTML. The import script had to find this information and make the necessary associations.
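As a rough illustration of that contextual extraction - the marker phrases and method name below are invented, and the real recipe HTML was much less consistent:

```ruby
# Hypothetical sketch of pulling chef and programme names out of legacy
# recipe HTML. "Recipe from" and "Broadcast on" are made-up marker
# phrases for this example.
def extract_credits(html)
  { chef:      html[/Recipe from ([^<]+)/, 1],
    programme: html[/Broadcast on ([^<]+)/, 1] }
end
```

When either value comes back nil, the recipe can be flagged for manual attention instead of being imported with missing associations.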

A rake task was written to parse the content and create recipe assets in ELF. I have posted the code as a gist on GitHub for reference purposes. Note that I was learning Ruby at the time, so it is fairly rough and ready.

As the import script was being written I had it output the recipes where it could NOT extract this information. These turned out not to be formatted in the standard way, and were edited so that they could be imported.

Tags were manually added to each recipe.

ELF recipes management

In ELF we wanted a data entry screen designed specifically for recipes. This would need to allow for tagging and specifying a chef and broadcast time. And here it is:



The edit screen is simple to use. The tag list offers auto-completion to avoid duplication, and 'Add chef' allows new chefs to be added without going to a separate screen. A recipe can be added in under 5 minutes.

The WYSIWYG is based on CKEditor, which has powerful built-in routines to clean markup pasted from email and MS Word.

The recipe footers, which contain seasonal ingredients, and the sidebar with special features both have their own manager. This allows the content to be reused and updated each year.

Now that tagging has been simplified, the seasonal ingredient lists (bottom of page) link to the relevant tags. The system allows free-tagging, so any new tag is available immediately. Page impressions in the recipes section are double what they were at this time last year, driven by people browsing content by chef and by tag.

An image uploader is built-in, so pictures can be uploaded and added right inside the WYSIWYG.

Handling old URLs

Legacy URLs are passed to the search page, which attempts to extract the recipe title to use as the basis of a search. Try this broken URL, for example. In most cases this gives the visitor the recipe they want.
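The title extraction can be sketched like this, assuming a hypothetical helper that turns the last segment of an old lettered-folder URL into search terms:

```ruby
# Rough sketch of deriving search terms from a legacy recipe URL.
# The method name and URL layout are illustrative; the real search
# fallback lives in the search controller.
def search_terms_from_legacy_url(path)
  slug = path.split("/").last.to_s
  slug = slug.sub(/\.\w+\z/, "")     # drop any file extension
  slug.tr("-_", "  ").squeeze(" ").strip
end
```

The resulting string is handed to the site search, which usually surfaces the migrated recipe as the top result.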

The new recipes section was soft-launched last year, and has streamlined the entry of recipes and improved the user experience.

It also gave me some confidence that we were on the right path.

In the next post I'll cover the evolution of our news section from a basic service offering only 20-30 stories at a time, to the current version with categories and sophisticated remote management.

Saturday, April 23, 2011

Rebuilding Radio NZ - Part 3: Groundwork

In part 3 of this series I'll be covering setting up our new app, and looking at some of the design considerations. I have bumped recipes to next week.

If you are looking for advice on which CMS to get, or not get, this is the wrong place. This series looks at how we at Radio NZ are solving our particular business problems. Your mileage can and will vary. You have been warned.

Note: I use the term asset to refer to an instance of a piece of content.

Foundations


When we started building ELF, Rails 3 was at the release-candidate stage. The first decision was which version to use: the very stable 2.x branch, or the new-with-cool-features-we-could-use 3.0 branch.

We chose stable because many of the plugins we intended to use were not yet compatible with 3.0, and we did not want to be working around bugs while starting a new application. Rails 3 was also new to our contractor.

The first discussion with contractors Able Tech revolved around what the app was going to be able to do long term, and what core functionality would be required system wide. This would be built first, and everything else would be added on top. One of these features was user authentication for the administration section of the site.

The design of the app had to make maintenance and future development simple, because this is likely something we will be using for at least 5 years. While code written in Rails is usually self-documenting, I was keen to have files well commented so that any future developers could understand why things were done a certain way.

One early decision that was later abandoned was the use of a general content subclass. Many of our planned content types shared some attributes in common - title, webpath, broadcast time, body-copy and so on.

The first few content types built in ELF used this subclass, but it was later abandoned because of the performance impact of the extra joins (and the work required to optimise the DB), and for ease of maintenance. Having to remember where each attribute is delegated to is bad enough when you are working regularly on a new app, let alone in a year's time.

This approach brought back memories of Matrix's EAV database schema (also known as Open Schema). With EAV there is no direct relationship between your data models and the tables in the database. EAV makes performance tuning for specific use-cases virtually impossible. It does make development easier though, because you do not have to make changes to the database schema as you add new content over time, and it can be more space efficient if the data is quite sparse.

This article is an excellent overview but in summary:
A major downside of EAV is its lower efficiency when retrieving data in bulk in comparison to conventional structure. In EAV model the entity data is more fragmented and so selecting an entire entity record requires multiple table joins.
We went with the standard out-of-the-box AREL layer for ELF. The approach taken now is for each model to have all the fields it needs, and for common functionality (as new models are added) to be extracted into Modules. For example, there is a module that handles all the database scopes for selecting assets based on the broadcast time.
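ELF's real modules define ActiveRecord scopes, but the extraction idea can be sketched outside Rails with a plain Ruby mixin. Every name below (Broadcastable, broadcast_on?, upcoming?) is an illustrative assumption:

```ruby
require "date"

# Standalone sketch of extracting shared broadcast-time behaviour into a
# module that any model with a broadcast_at attribute can mix in.
module Broadcastable
  # True if the item was broadcast on the given calendar date.
  def broadcast_on?(date)
    broadcast_at.to_date == date
  end

  # True if the broadcast time is still in the future.
  def upcoming?(now = Time.now)
    broadcast_at > now
  end
end
```

In the Rails version the same module would contribute scopes (class-level queries) rather than instance predicates, but the maintenance benefit is the same: one definition shared by every broadcastable model.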

Another big area was the migration of content. This task was going to be mine; I had the best understanding of the content, and of how to extract it from Matrix.

We made the decision to build the application in small pieces (read: agile), and move content over when each piece was ready.

The system would need a set of Rake tasks for cleaning up exported Matrix content and importing it. These tasks would need to extract contextual information automatically from the HTML, as often there was no metadata.

The test framework built into Rails would allow us to write tests to ensure the handling of imported data was consistent and reliable.

The gradual migration of content meant a certain amount of sharing between Matrix and ELF - stylesheets, javascript and some images. It would require very careful planning of each phase to ensure the change-over between apps was seamless to site visitors.

As more content was moved we would need to use XML feeds to share data between systems when it was needed in both (more on this in later posts).

Nginx is running in front of Matrix and serves our static assets, so this would be used to divert requests to the new application, allowing us to pick and choose which app did what.

As an aside, ELF is now serving the stylesheet for both applications, while Matrix is still serving the javascript. The choice is driven by convenience, and by which app is driving the most changes in each file.

Broadcast Timestamps

One critical design feature was the use of broadcast related time-stamps.

Matrix only gives you control over the 'created at' and 'published at' times. We'd used both as the broadcast time in different places on the site, for different but valid reasons.

Station highlights use the created time, this being set after the item is created and edited. It means the time the item is published does not matter.

Audio items use published time, as these sometimes have a future status change so we needed to use a time that was updated by the system based on the item going live.

These differences created some management issues. If an audio item has to be temporarily removed from public view, and later restored, the listed broadcast time is wrong and has to be reset manually.

Likewise, if you forgot to set the created at time for a highlight, it would not list at all because only future highlights are shown. The site is so big that you can often forget which of the two attributes is the broadcast time.

In ELF we have two attributes to get around this problem.

The published_at attribute serves two purposes. It can be used to sort, and it controls visibility. When published_at is not set, the item is not visible. This gives us two states: 'Live' and 'Under Construction'.

The broadcast_at attribute contains the date and time the item was (or will be) broadcast. It is never changed by the app, although it can be changed manually if required.
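The two-timestamp design can be sketched with plain Ruby. The Recipe struct below stands in for an ELF model - only the attribute names come from the description above, everything else is illustrative:

```ruby
require "time"

# Minimal sketch of the published_at / broadcast_at design.
Recipe = Struct.new(:title, :published_at, :broadcast_at) do
  # An item with no published_at is 'Under Construction' and hidden.
  def live?
    !published_at.nil?
  end
end

recipes = [
  Recipe.new("Banana cake", Time.parse("2011-04-29 10:00"), Time.parse("2011-04-30 09:00")),
  Recipe.new("Draft pie",   nil,                            Time.parse("2011-05-07 09:00"))
]

# published_at controls both visibility and sort order; broadcast_at is
# free to record when the item actually went (or will go) to air.
visible = recipes.select(&:live?).sort_by(&:published_at)
```

Separating the two means an item can be pulled and restored (by clearing and resetting published_at) without ever corrupting the broadcast time.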

Keeping things DRY

Don't repeat yourself, they say. We wanted to maximise the advantages of Rails' MVC (Model View Controller) structure and DRY coding practices to avoid repetition and improve the maintainability of our HTML code.

Code Deployment

Deploying new versions of Matrix was hard. This has been improved recently with script-based upgrades.

This is something that is highly optimised in Rails already, where most people seem to use Capistrano. In recent weeks I have been deploying new code to the live server several times a day.
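For reference, a minimal Capistrano 2-era recipe looks something like this - every name and path below is a placeholder, not our actual configuration:

```ruby
# config/deploy.rb - hypothetical placeholder values throughout
set :application, "elf"
set :scm,         :git
set :repository,  "git@example.com:radionz/elf.git"
set :deploy_to,   "/var/www/elf"

role :web, "www.example.com"
role :app, "www.example.com"
role :db,  "www.example.com", :primary => true
```

With a recipe like this in place, pushing new code to the live server is a single `cap deploy` from the project root.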

Sensible URLs

Most of the site already had a good URL schema. A few places, like news, were problematic, and these needed to be revamped.

Revision control

Code was going to be worked on by at least 3 people. We needed a system that allowed this and easy branching. Git. No contest, IMHO.

The migration began with the recipes section, which was chosen to go first because it was largely stand-alone. The next post will cover this in detail.

Saturday, April 9, 2011

Rebuilding Radio NZ - Part 2: The Birth of ELF

In this second part I'll be talking about the birth of ELF.

Warning

Even though it's a drag, I'm repeating this disclaimer.

There is no such thing as instant pudding. You cannot copy what someone else does and get the same result.

This post is about a specific site with its own special functional requirements and traffic loads. radionz.co.nz is a public broadcaster's website that includes news, audio and content related to on-air programmes. Traffic loads are very peaky (and high).

This series of posts should NOT be taken as advice for or against any particular system. It deals with our specific pain-points and how we are solving them.

You should do your own research and assessment before choosing any CMS or development framework. A good starting point is Strategic Content Management on A List Apart.

Some Management Theory

The manager is responsible for the system in which his staff work. By system, I mean all aspects of the job that contribute to whatever you are producing. The system includes workspaces, office layout, tools, technology, processes and procedures to name a few components.

It is the manager's responsibility to improve the system. In doing so he must understand the difference between problems which are part of the system (built in), and those that are outliers (from outside).

For example, for knowledge workers their computer is part of the system. No one can be productive if their computer keeps failing, is underpowered or does not have the software they need to do their job.

A one-off power cut that stops people working for a day is probably an outlier that needs special attention. (Or may need no attention at all).

The system itself (and everything in it) needs to be designed and maintained. There is nothing worse than a free-running system where components essentially design themselves and become sub-optimised, failing to work together as a whole. It is very common for processes to become run-down over time and no longer be fit-for-purpose.

The aim is to have stable, predictable processes where you can be sure that content moving through the system meets quality expectations when it is finally published. Efficiency, and replicability are just two aspects of the equation.

The tools that are used to produce and manage web content play a critical role in the system, and one of my roles is to make sure the tools do not get in the way of creating our content.

It is from that base that we considered the suitability of our current CMS tools.

Cracks in the walls

The Radio NZ website was built from scratch - when we started we had no existing processes to support publishing large amounts of web content, and no web infrastructure. We designed new publishing processes and chose our tools (Matrix and a number of custom scripts) based on those processes (I'll be documenting these in later posts).

These processes have been improved iteratively over time. Some of these changes were facilitated by new features in Matrix, others from internal rearrangement. As well as process improvement, we continued to add new content and functionality to the site.

But from late 2009 we found it increasingly difficult to innovate. The modular approach to building sites in Matrix - the very paradigm that got us off the ground so fast - was slowing us down.

Matrix makes The Hard Things simple. Start a new site, set up a home page and a 404 page; all done in 5 minutes. Change content on an About Us page; 1 minute. Set up a form for people to submit queries; done in 10. Display the same content in three different places, auto-generate menu structures; more complex, but still relatively fast to implement.

But for us, some Simple Things were getting harder to do. We were having to create increasingly complex configurations to optimise the display of our content, and create new ways of viewing it. (Examples of this in subsequent posts).

This was largely because our content was stored as individual assets, rather than as modelled data. Each asset knows nothing about any other asset. For example, an audio asset does not know what programme it was broadcast on. A programme asset (a programme's home page) does not know who the hosts of the programme are. And so on.

Some of the asset structures required to support certain features took huge amounts of work to implement.

On top of this was system performance. We are a media site with fast-changing content and high performance demands, and I think we were the only Matrix customer using the system in this way.

Many pages (like our old home page) were built from many pieces, putting a high load on the system when they had to be rebuilt and cached. With frequent publishing we had to expire the cache as often as 10 times an hour.

In order to deal with our high traffic load it was suggested that a custom caching regime be considered. This would allow us to publish updates 5-10 times an hour, and for the content to be recached more efficiently.

We had already made changes to the operation of the cache (see this old post), and they'd been running for several years, so I had a very good understanding of how this part of the system worked. It was unlikely that these new changes would be of use to other Matrix users and would not become a part of the core product; if implemented, they would be our responsibility to maintain.

The cost of working around these two problems (asset modelling and caching) - problems that may not exist with other systems - was deemed too high. Sadly, Matrix was no longer a good fit for our content or our traffic profile. It was time to consider alternatives.

The decision to change was entirely pragmatic and based on changing business requirements. It was a difficult decision to make, especially after a long history with one product.

ELF is born

Looking at our content, and the sort of features we wanted, it was pretty obvious that a lot of custom code would have to be written.

Very few of our pages are the standard 'edit, upload a photo, update the title' type of content. With this in mind I thought it better to have complete control over all the software, rather than bolt 95% of what we wanted onto an existing product.

Rails looked like a good platform to model and deliver content like ours, and had an excellent local (Wellington) community. There are many development houses and government agencies working with Rails.

So Ruby on Rails it was.

An additional factor was the use of the framework on our company intranet. We had developed a number of powerful modules that could be leveraged for the public website. (In practice, I think we saved about 6 weeks time by recycling existing code).

The name ELF was chosen after a brain-storming session. ELF stands for Eight Legged Freak (i.e. a spider). It was chosen because a spider lives on the web, and because an Elf has 'magical powers' that benefit its users.

In my next post I'll talk about planning the migration of content and the first section we built and made public: Recipes.

Rebuilding Radio NZ - Part 1

This is the first in a series of posts explaining how (and why) we are rebuilding www.radionz.co.nz. I'll be examining the technology behind it and looking at some of the difficult choices we've made along the way.

Warning

This post is about a specific site with its own special functional requirements and traffic loads. radionz.co.nz is a public broadcaster's website that includes news, audio and content related to on-air programmes. Traffic loads are very peaky (and high).

This series of posts should NOT be taken as advice for or against any particular system. It deals with our specific pain-points and how we are solving them.

You should do your own research and assessment before choosing any CMS or development framework. A good starting point is Strategic Content Management on A List Apart.

Beginnings

Since October 2005 the site has been running on MySource Matrix (now Squiz Matrix). We started the build of the site around Easter that year, meaning we've used the system for six years - not a bad life for any piece of software. We know it very well.

Matrix was chosen after an exhaustive process in which we evaluated dozens of web CMSs and issued a Request for Proposal (we got over 40 responses). The project took a year to complete as we had no existing infrastructure or business processes to support the content we wanted to publish.

We ran the site in-house for 3 months to bed in new publishing processes and iron out any bugs.

The primary reasons we chose Matrix were its depth and breadth of functionality and the ability to build sites without needing a programming resource.

The previous version of the site was based on a custom built CMS (PHP), and the requirement to have access to a programmer was a constraint we wanted to avoid for the next version of the site. The only custom work was a Matrix asset to support audio content.

With Matrix we were able to quickly build almost anything we could conceive and have it live quite quickly. We also had the ability to try stuff out, modify it, and then release, all without a code editor in sight.

In five years we grew the site from having a rolling one-week back-catalogue of audio content for some programmes, to having over three years of back-content for most programmes. Traffic increased 10-fold.

In late 2009 we started to experience some pain, and by early 2010 we made the decision to move on from Matrix. This was not a decision made quickly or lightly, and it was based on deeply pragmatic reasons that I'll explain in future posts.

The first major reason was increasing difficulty in managing our content - at the time we had about 5,500 individual programme pages (today it's about 7,000). Moving around the site between pages, and the time required to update content was limiting the amount of content we could publish and causing frustration for editors.

The second was performance. We have a lot of content that is updated frequently - a fast moving news story could easily be updated 10 times in a morning - and the system was not able to cope with our requirement to refresh the caches for these pages for every update (more detail on this in part 3), at least not with the hardware we had at our disposal.

You could say that we outgrew the system. Not because there was necessarily anything wrong with it per se, but because our operation had grown in a direction where the system was no longer a good fit. Pragmatic, as I said.

Today (April 2011) our site is running partly in Matrix and partly in the new system (called ELF). This has presented some challenges in integration (keeping the experience seamless for visitors), in the migration of content, and in the training of content editors. I'll cover all this in a later post.

In the next post I'll talk about the birth of ELF.