At globo.com my team is responsible for live video streaming infrastructure, and we have a huge infrastructure with lots of projects.

One of our projects is PremiereFC’s website, where we live stream Brazilian soccer games.

PremiereFC is written in Python + Tornado + MongoDB; the project is not complex and we have great test suites. It seems perfect.

We did not touch the project for months, and some months ago we needed to adapt it to embrace new championships and other stuff. My colleague Flávio Ribeiro started running the current tests against our QA1 environment. To his surprise some tests were broken! Can you believe it? Nobody touched the project and BOOM, some tests are failing.

He started digging into the tests to find out what happened, and unfortunately he could not easily understand what was happening.

WHY? Our conclusion is that we have lots and lots of test helpers…

Note: I started to draft this post at the end of 2012, but I forgot to publish it.

Testing and Test Helpers

We do love Test-Driven Development, and we have lots of unit, functional, and integration tests. As we do not like repeating ourselves, creating abstractions around our repeated tests seemed obvious, and we started doing it at the beginning of PremiereFC.

We have a test to ensure our FAQ is only visible to authorized users:

def test_should_access_faq():
    headers, html = get_authorized('/ajuda')
    assert "200" == headers['status']

The test name is just kind of generic and there is a get_authorized() test helper. The assertion is based on HTTP status code. Seems ok, let’s move on.

Do not repeat yourself - DRY

Why do we have a get_authorized() test helper? Because we have other tests that share common steps, and we like the idea of not repeating ourselves. Using that test helper we can create new tiny tests that do a lot and we do not need to care about lots of details. That idea is really great when you are writing new tests and they don’t fail or break.

Let’s see what get_authorized() is:

def get_authorized(url, user_name='PFCGuy', session_key=None):
    cookies = set_user_in_cookie(user_name)
    return get_html_with_cookie(url, cookies)

Hmm, it uses two other test helpers: set_user_in_cookie() and get_html_with_cookie(), and it seems to use "PFCGuy" as the user trying to see that page. It also accepts a session_key argument that it never uses. Keep going.

def set_user_in_cookie(user_name):
    pfc_cookie = serializer.serialize({'authorized': True, 'user_name': user_name})
    cookies = {
        settings.PFC_COOKIE: pfc_cookie
    }
    return cookies

Okay, it just creates a dictionary with a serialized entry.

We are not done yet, and we already had to stack so much stuff in our heads that I can barely remember what I was trying to understand in the first place: what my test is doing. If our brain is limited to dealing with about 7 different things at a time (based on The Magical Number Seven, Plus or Minus Two paper), we are in trouble, because we already stacked 1) test_should_access_faq() 2) get_authorized() and 3) set_user_in_cookie().

Now that we know that we first need to set a cookie before requesting the FAQ page, we can move on to get_html_with_cookie():

def get_html_with_cookie(url, cookie_value):
    return get_with_cookie(url, cookie_value, get_html)

Hmm, it uses another helper and there is a magic get_html thing. What is that? Using my text editor’s search I found that it is yet another helper. Oh god…

def get_html(url, follow_redirects=False, cookies=None, headers=None):
    headers, body = get(url, follow_redirects, headers, cookies=cookies)
    if not body:
        pytest.fail("Response could not be parsed as html. "
                    "Status: %r. Body: %r" % (headers['status'], body))
    else:
        return headers, html.fromstring(body)

Oh, get_html() is just a wrapper to yet another f*cking helper:

def get(url, follow_redirects=False, headers=None, host=settings.SITE_URL, cookies=None):
    if url.startswith('/'):
        url = host + url
    resp = requests.get(url, headers=headers, allow_redirects=follow_redirects, cookies=cookies)
    headers = resp.headers
    content = resp.text or ''
    headers['status'] = str(resp.status_code)
    return headers, content

Now that we got here we can go back and look at get_with_cookie(), which is called inside get_html_with_cookie().

def get_with_cookie(url, cookies, get_function=get, follow_redirects=False, **kwargs):
    headers, body = get_function(url,
                                 follow_redirects=follow_redirects,
                                 cookies=cookies,
                                 **kwargs)
    return headers, body

It just calls that get_html callback, named get_function here, using the arguments we already had.

I am pretty sure you are lost here, because we, who wrote the code in the past, got lost multiple times. And as I am writing this post I got lost again.

Let’s see what our mental stack is:

  1. test_should_access_faq()
  2. get_authorized()
  3. set_user_in_cookie()
  4. get_html_with_cookie()
  5. get_html()
  6. get()
  7. get_with_cookie()

SEVEN? That magical number?! Really? My mind can’t stack that much and not get lost. When we reached get_html() I was already lost.

DO REPEAT YOURSELF! (when testing)

Our conclusion is that when you are writing tests, YOU SHOULD repeat yourself, and make the test very clear.

No matter how much similar test code you have, repeat it and make yourself as clear as possible.

We discussed these ideas with our colleague Juarez Bochi, and he mentioned the beginning of the Structure and Interpretation of Computer Programs book, where the authors describe three mechanisms every powerful language has:

  • primitive expressions, which represent the simplest entities the language is concerned with,
  • means of combination, by which compound elements are built from simpler ones, and
  • means of abstraction, by which compound elements can be named and manipulated as units.

And after mentioning that he asked us: “Are you saying that we should not combine abstraction?”

My conclusion is that we should apply these abstraction ideas to our tests, but not as extensively as we should in our business code. We must be very careful not to fall into those matryoshka traps.

Flávio and I tried to rewrite that single test as clearly as possible. The result:

def test_authorized_user_should_access_faq():
    user_name = "PFCGuy"
    url = settings.SITE_URL + "/ajuda"
    pfc_cookie = serializer.serialize({'authorized': True, 'user_name': user_name})
    cookies = { settings.PFC_COOKIE: pfc_cookie }

    resp = requests.get(url, cookies=cookies)

    assert 200 == resp.status_code
    assert resp.text

Now we have one thing to stack in our head, nothing else. If we have a new test that shares that authorization logic, we are going to repeat ourselves, and we are ok with it. The benefit is immediate: if the test breaks, we know where to set breakpoints and debug. And everyone who looks at that test knows exactly what happened.

If we compare these two approaches we realize the first (with lots of helpers) has business logic spread all over the code, while the second is self-contained. We like self-contained tests; they are easier to maintain.

After this story we discussed some guidelines in order to avoid that kind of trap.

Guideline

We are not settled yet, but we are discussing a set of rules for testing:

  • If your helper calls another helper, stop right now. Do not do that.
  • If your helper hides business logic, stop it. That should be clear in your test.
  • If your helper has default arguments, you are probably hiding important information and hurting the next developer who touches that code. Think twice before doing it.
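
To make those rules a bit more concrete, here is a minimal sketch of the kind of helper we could probably live with. It is hypothetical and not code from PremiereFC: make_auth_cookie is a made-up name, the "pfc_cookie" cookie name is invented, and json.dumps only stands in for our serializer.serialize.

```python
import json


def make_auth_cookie(cookie_name, user_name, authorized):
    """Build the cookies dict for one request.

    Hypothetical helper: it calls no other helper, has no default
    arguments, and decides nothing on its own. json.dumps is only a
    stand-in for the project's real serializer.
    """
    payload = json.dumps({"authorized": authorized, "user_name": user_name})
    return {cookie_name: payload}


# Usage inside a test: every important value is visible at the call site.
cookies = make_auth_cookie("pfc_cookie", "PFCGuy", True)
```

The helper saves a little typing, but the test still says out loud which user it is and whether that user is authorized.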

Next steps

We have lots of great discussions here in my team and we are always getting better. That’s why I love these guys. Unfortunately we lost Igor Sobreira (he moved to Hawaii), a great developer who always made our discussions richer. And I am about to leave the team (that’s for another post).

Some ideas we have discussed over and over again are: how to write and maintain documentation, and how to do environment-dependent configuration. These subjects need a special post, but let me just introduce them.

Documentation

My team works physically in the same place, one next to the other, and most of what we write is not open source. We work in the same place and during the same hours, and because of that we are not forced to write better documentation. Our open source projects have way better documentation, due to the nature of the projects themselves (people around the world will try to use and contribute to them); they are not only for ourselves. We need to learn how to improve on it.

Configuration

We have projects where there is a settings/ directory and there are prod.py, qa1.py, dev.py, and these files contain environment-specific configuration. It sucks because you must remember to change all files if one configuration changes and they tend to mix code and configuration. If you use Django you are probably used to this workflow, but think twice the next time you touch a settings file.

That practice of keeping settings in different files is bad because someone needs to manage them, and it is possible to add code to one file and forget to add it to another. There is a possibility that you have all tests passing and everything working fine in one environment, but it may break when you deploy to production because of a missing line in prod.py.

Twelve Factor App has a section on Configs, and they recommend using environment variables for configuration. We like that idea and in some projects we have a hybrid approach.
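
As a rough illustration of the environment-variable idea (a sketch, not what we actually run), the snippet below reads every setting from a mapping such as os.environ and fails loudly when one is missing. REQUIRED_SETTINGS and load_settings are hypothetical names, and the values are made up.

```python
import os

# Hypothetical setting names; a real project would list its own.
REQUIRED_SETTINGS = ("SITE_URL", "PFC_COOKIE")


def load_settings(environ):
    """Read every required setting from a mapping such as os.environ.

    Raises RuntimeError naming all missing variables, so a forgotten
    setting fails at startup instead of after a deploy.
    """
    missing = [name for name in REQUIRED_SETTINGS if name not in environ]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_SETTINGS}


# In the application this would be load_settings(os.environ); a plain
# dict is used here so the example runs anywhere.
settings = load_settings({"SITE_URL": "http://localhost:8000", "PFC_COOKIE": "pfc"})
```

With this shape there is a single list of settings for all environments, and a forgotten value shows up as an error message instead of a missing line in one of several files.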

I dropped out of my bachelor’s degree in 2010 while I was in the second semester; my classes were boring, some teachers were not up to date with what was happening in the industry (some not even in academia), and most of the students were not interested in learning; they just wanted a high enough score to go on to the next semester.

That was not right. Maybe my university or my course was not good enough. Or maybe the way the university teaches is not as good as it could be.

Every person learns at a different pace, and in a class with 30 or more students someone will be left behind, and others will get bored because they learn at a faster pace.

All text that follows is based on my experiences.

The Start of a Revolution

In 2002 MIT started MIT OpenCourseWare, an initiative to put all the material from their undergraduate and graduate courses online for everyone; no other well-known institution had anything like that, so they were the pioneers. After some time other universities started to do the same, and projects like CS50 from Harvard were born. I’ve watched CS50 videos and they are gold material.

Institutions started to put recorded video classes, pdfs, textbooks, and assignments online, but it was not enough. Something was missing.

Khan Academy And Short Videos

Salman Khan started to record videos to help his cousin with math, and uploaded them to YouTube. In a short amount of time he was recording more videos to help more people in the family. And then friends. It was so quick and enlightening that Khan founded Khan Academy, a non-profit organization with a noble mission: change education around the world.

Khan Academy’s video format is very nice: all videos are about 10 minutes long - they break a subject into many short videos. It is much easier to grasp the content this way.

Students can pause, go back to some point they did not understand very well, replay, watch later, read what other students are saying in the comments, practice with many online exercises, and do all this any time.

Khan Academy started with Math, but now it is about Math, Computing, Biology, Physics, History, and much more. There are more than 3,100 videos on many subjects.

Khan Academy Video On Probability

Watch Salman Khan at TED: http://www.ted.com/talks/salman_khan_let_s_use_video_to_reinvent_education.html

AI Class and the Concept of a Class

Last year I saw an announcement of a free Introduction to Artificial Intelligence course offered by Stanford, with Peter Norvig and Sebastian Thrun as teachers. I signed up and waited for the starting date.

I was shocked by the first videos. They were different from everything I had seen. There were about 160,000 (one hundred sixty thousand) students enrolled! And the videos were made using only pencil and paper.

AI Class video on Bayes Network

AI Class was somehow like Khan Academy: short videos and a kind of whiteboard. But they had something different: the concept of a class. The videos were broken into units, each unit with many videos (each video about 4 minutes long) and a homework with a deadline. Later a midterm and a final exam were set.

They had the great idea to set up a forum where students could discuss the content. Every time I had questions I went to the forum and could find answers - in a class with many thousands of students there will always be someone with the same questions you have.

Another great idea in AI Class was quizzes along the units. The quizzes are meant to help students follow the content, and it does not matter if anyone gets them wrong, because the key idea is to make students think for a while about what they have learned.

It was an amazing experience for me, and at the same time there were other courses being offered by other Stanford professors: Machine Learning by professor Andrew Ng, and Introduction to Databases by professor Jennifer Widom - I did not sign up for any of them.

All these courses were betas, so professors could see what works well and what does not. After AI Class, Sebastian Thrun and some fellows founded Udacity, and Andrew Ng and Daphne Koller founded Coursera. Both offer high-level online courses for free.

I took only AI Class last year, and I do not know how the other courses were - I heard some people saying that the ML class was too slide-oriented.

Udacity

Sebastian Thrun and his fellows want to change how online classes work with very audacious goals in their courses. Their teaching method is the same used in AI Class: whiteboard with quizzes along the videos.

They started with CS101: Building a Search Engine, and CS373: Programming a Robotic Car. Both courses were 7 weeks long.

CS101 is aimed at people who know nothing about computing, and they are promising these students will build a search engine in 7 weeks?! Very audacious.

CS373 is aimed at people with basic programming knowledge, and they are promising these students are going to program the basics of a self-driving car in 7 weeks?! Really audacious.

Udacity has a web programming environment and a studio to record their classes with a high tech whiteboard. The video quality is very good.

Udacity class on Robot motion

Another cool thing in Udacity is that the teachers are always motivating and challenging the students. They say phrases like “I would be blown away if you got it right,” “There is no problem if you get it wrong.” You can be sure it challenges people.

I did the final exam of both courses last week and I am very proud of myself and feel like I am part of the future of education.

There will be more cool classes starting April 16. Check them out at udacity.com.

Coursera

I signed up for some classes from Coursera: Software as a Service, Design and Analysis of Algorithms I, Information Theory, and others.

But I was a little bit disappointed because there was a big delay before the classes started. Some have not even started yet. I took the SaaS class and I am in the last week of the DAA class.

Software as a Service is a hot topic, and the teachers were well-known people in the software industry: Armando Fox and David Patterson. Yes, in the Industry.

I got more disappointed when the class started, because I was expecting something like Khan Academy or Udacity, where the videos are made for an online audience. But in the SaaS course the videos were recordings from Berkeley classes, and too slide-oriented. At least they had an online judge and good content.

They did provide a VirtualBox VM with a self-contained Ubuntu that had everything you needed. So people would only need VirtualBox to do the assignments.

There were a few quizzes along the videos, but they were very confusing to me because they used to ask in the negative (e.g. “What is NOT something?”, “Which statement is NOT true about something?”).

My expectations for the course were all about creating services and developing communication between them, but the course was not about that. The course was not about SaaS at all.

The course title is “Engineering for Software as a Service,” and the professors wrote the textbook “Engineering Long-Lasting Software: An Agile Approach Using Cloud Computing and SaaS.” The course was about good development process and practices, but that was not clear to me or to most of the students.

A nice experience anyway. The content was not new to me, because I had worked in a research group with Agile methods and BDD/TDD, and I have worked as a software developer since 2008. I got 100% in all programming assignments :-)

I am in the last week of Design and Analysis of Algorithms, taught by Tim Roughgarden. The classes have been very theoretical, and that is exactly what I wanted. Unlike in the SaaS course, Tim remade the videos for the online class.

Despite Tim’s horrible handwriting, he is an excellent educator. What I have been learning the most about is proof by induction and running time analysis. These are all topics I had seen before, when I used to do programming contests - but not as in depth as in the course. For those who want to see more practical algorithm examples I recommend reading Programming Challenges.

Coursera offers great content by great teachers, but they are not as interactive as Udacity. It feels like they are only publishing material online, as MIT OCW does.

MITx

MIT realized OpenCourseWare was not enough and launched MITx in March this year. I did not sign up for its first class, Circuits & Electronics (6.002x), so I cannot say anything about the course.

Most of what I know about MITx comes from http://www.insidehighered.com/news/2012/04/06/how-could-mitx-change-mit and http://www.aiqus.com/questions/39268/mitx-circuits-and-electronics-is-it-practical-or-theoretical.

What’s next?

I do not know what the future of education is, but it seems universities will need to change their education system to accommodate initiatives like Khan Academy, Udacity, Coursera, and MITx. Maybe institutions will use an exam (oral, written, or practical) to certify people in the near future - maybe something like LPI certificates.

The other day I was talking at ##udacity-373 @ freenode with a bunch of guys about undergraduate education, and I said I do not know if this idea of online education fits law or medicine. There is a part I cannot forget:

-- What is the name of the bottom student in med school?
-- What?
-- Doctor.

What he meant by that is that a med degree tells you nothing about how good a doctor is. And it also means that online universities can offer better content than physical ones.

<hltbra> I know it is a long discussion but I would like to know your opinion about the future of "formal" education
<hltbra> do you have any strong opinions about it?
<eghm> "education bubble"
<hltbra> eghm: what about lawyers, doctors, engineers?
<@gundega> hltbra: I think that formal education institutions will have to really step up their game and add more value for being actually on campus
<eghm> you know what they call the person who graduates at the bottom of their class in med school?
<@gundega> doctors could have chemistry taught online by the best teachers and go to campus for some practical exercises only
<hltbra> eghm: no. what they call?
<eghm> Doctor
<@gundega> there are a lot of possibilities how formal schools can use the new online schools

Extracted and adapted from http://www.elitter.net/~amberj/irclogs/udacity-cs373/year-12/month-04/%23%23udacity-cs373-day03.txt

My eyes are wide open to see what is going to change in universities’ education systems, and I am really enjoying learning so much from the best teachers in their fields.

The Practice of Programming cover

Last year I read the famous The Practice of Programming book, by Brian Kernighan and Rob Pike, and it was a kind of déjà vu; it is a Clean Code book from the 90s!

I decided to get my paperback and write some thoughts about the first chapter, Style.

1.1 Names

Code is read much more often than it is written, and we must choose the best names we can for our variables and functions.

Sometimes we become obsessed with self-explanatory names, and we add noise for the reader without realizing it.

From the book:

for (theElementIndex = 0 ; theElementIndex < numberOfElements ; theElementIndex++)
    elementArray[theElementIndex] = theElementIndex;

This example may look insane and you may think you would never do this. But do you remember something like:

for (index = 0 ; index < numberOfElements ; index++)
    elements[index] = index;

The word index is just noise. Its scope is so small it doesn’t deserve that much information. The book suggests just using i as the variable name.

Verbose isn’t the same as clear.

Programmers are often encouraged to use long variable names regardless of context. That is a mistake: clarity is often achieved through brevity.

Use active names for functions. Functions that return boolean values are an exception to this rule.

if (checkoctal(c))
    ...

vs

if (isoctal(c))
   ...

As silly as it may look, it causes trouble. I was reading some code the other day and I had to read the implementation details to be sure the function didn’t change any state. I wouldn’t have needed to do that if the author had been aware of this hint.

1.2 Expressions and Statements

We are in 2012 and we still make indentation mistakes. Every time I see a for loop like the following I need to reread the lines around it about 3 times to be sure it means “no body here”:

for (n++ ; n < 100 ; field[n++] = '\0');

For God’s sake, add a body to it!

for (n++ ; n < 100 ; n++)
    field[n] = '\0';

Prefer using the natural form for expressions. Avoid Yoda conditions. If you speak the expression aloud and it sounds odd, change it.

Most of us can’t memorize operator precedence, so try to avoid trouble by adding parentheses to doubtful expressions and extracting subexpressions to variables.
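
Here is a small illustration of that advice. It is my own sketch, written in Python rather than the book’s C, and the function and variable names are mine.

```python
def is_leap_year(year):
    # One-line version, readable only if you remember that
    # `and` binds tighter than `or`:
    #   year % 4 == 0 and year % 100 != 0 or year % 400 == 0
    # Same logic with named subexpressions and explicit parentheses:
    divisible_by_4 = year % 4 == 0
    is_century = year % 100 == 0
    divisible_by_400 = year % 400 == 0
    return (divisible_by_4 and not is_century) or divisible_by_400
```

Both forms compute the same result; the second one just does not ask the reader to replay the precedence table in their head.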

1.3 Consistency and Idioms

The project’s consistency is more important than your own, because it makes life easier for those who follow.

Every programmer has their own style, and we create lots of idioms along our projects. Do not change any project’s coding style just because you don’t like it; preserve the coding style you have found.

1.4 Function Macros

I used to think I was really smart when I wrote inline functions using C preprocessor macros. I felt really good doing that. But the authors did me a favor by writing:

Avoid function macros.

Function macros are not a good idea, since what they do is replace code. They can hide serious flaws:

#define isupper(c) ((c) >= 'A' && (c) <= 'Z')

It may look correct, but the problem is that c is evaluated twice inside the macro, and if isupper is used as

while (isupper(c = getchar()))
    ...

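getchar is called a second time whenever the first character read is >= 'A', so the first character is silently thrown away and the second one is tested against 'Z', and it makes you waste hours debugging it.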

1.5 Magic Numbers

As a guideline, any number other than 0 or 1 is likely to be magic and should have a name of its own.

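There is a trick in this section to calculate the number of elements of an array in C:

#define NELEMS(array) (sizeof(array) / sizeof(array[0]))

And with this definition you have an array size calculator in C! Just keep in mind that sizeof is computed at compile time, so it only works on real arrays, not on pointers.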

1.6 Comments

Don’t belabor the obvious. This rule applies to all code like:

/* return SUCCESS */
return SUCCESS;

zerocount++; /* Increment zero entry count */

Sometimes we spend hours trying to improve comments, but what we may not realize is that we have bad code, and no matter how good the comment is, the code is not going to become better.

Don’t comment bad code. Rewrite it.

Conclusions

These were only some ideas behind the first chapter, and they are amazing!

The next chapters talk about complexity, how to grow arrays, how to implement hash tables, debugging, how they used to test code in the old times, performance hints and much more.

I loved the book. Go on and get your copy.

PS.: In 1989 Rob Pike wrote Notes on Programming in C, which is worth reading too.

Updates: 23/02/2012 - Fixed typos; Added book cover image