September 5, 2014 – “Nightlies weren’t updating, here’s why” edition.

Another technical post from me. (yay?)

Some of you may have noticed that over the past few days the nightlies weren’t updating. There were two separate issues, and we managed to resolve both of them. Both took a bit of research, and the second one meant going back a long way (over a month!) to find the commit where stuff stopped working. It also serves as a nice example of how particularly pernicious software bugs form.

If you rewind your brains to September 1st you’ll remember reading this passage:

Kyren’s been working on nitty gritty technical stuff, addressing some threading issues we’re running into with C++11 pointers.

Specifically, what this was referring to was the fact that, while shared_ptr’s reference counting (the control block) is thread safe, concurrent assignment to the same shared_ptr object is not. We had been operating under the assumption that it was thread safe in this form, and the bug affected two areas of code (leading to very intermittent problems).

The specific areas of code affected were the MaterialDatabase and the LiquidsDatabase. We hunted down the affected areas, replaced the code with std::atomic_load and std::atomic_store, and all was well with the universe.
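
To make the problem concrete, here’s a minimal sketch of the pattern involved; the class and function names are illustrative rather than our actual code. Copying or reassigning the same shared_ptr object from two threads at once is a data race, but the free functions std::atomic_load and std::atomic_store make those particular operations safe:

#include <memory>
#include <utility>

template<typename Database>
class SharedDatabase {
public:
  // Safe to call from any thread, even while another thread is storing.
  std::shared_ptr<Database> load() const {
    return std::atomic_load(&m_database);
  }

  // Safe to call from any thread, even while other threads are loading.
  void store(std::shared_ptr<Database> database) {
    std::atomic_store(&m_database, std::move(database));
  }

private:
  // A plain "return m_database;" or "m_database = database;" here would be a
  // data race whenever readers and writers overlap.
  std::shared_ptr<Database> m_database;
};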

Or all was well, specifically, on Kyren’s computer, running OSX 10.9 and Clang 3.4. After pushing, she found out via our automated build system that gcc didn’t want to compile it. After some research, she found that gcc, in fact, did not support atomic_store or atomic_load for shared_ptr. So cue adding a new define and a short header file to our core library to handle the case of a compiler not having that feature implemented.
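
The shape of that workaround is roughly as follows; this is a simplified sketch, and the feature define and function names here are made up for illustration, not the ones in the real header:

#include <memory>
#include <mutex>
#include <utility>

namespace Star {

#ifdef STAR_NO_ATOMIC_SHARED_PTR  // hypothetical feature define

inline std::mutex& sharedPtrMutex() {
  // A single global lock is a blunt instrument, but it is a correct stopgap
  // for standard libraries missing the atomic shared_ptr overloads.
  static std::mutex m;
  return m;
}

template<typename T>
std::shared_ptr<T> atomicLoad(std::shared_ptr<T> const* p) {
  std::lock_guard<std::mutex> lock(sharedPtrMutex());
  return *p;
}

template<typename T>
void atomicStore(std::shared_ptr<T>* p, std::shared_ptr<T> v) {
  std::lock_guard<std::mutex> lock(sharedPtrMutex());
  *p = std::move(v);
}

#else

template<typename T>
std::shared_ptr<T> atomicLoad(std::shared_ptr<T> const* p) {
  return std::atomic_load(p);
}

template<typename T>
void atomicStore(std::shared_ptr<T>* p, std::shared_ptr<T> v) {
  std::atomic_store(p, std::move(v));
}

#endif

}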

Fast forward a day or two, and someone poked us about the nightly not updating. Quite a few people, actually. We thought, “Well, that’s weird,” because our automated build system automatically tells us when a git commit fails an automatic compile and test, and we didn’t get anything about that.

So we check the status of the test builders, and sure enough, everything is green. BUT, the OSX Release builder failed. Why? Link errors according to the log. It can’t find the atomic operations.

Long story short, we were targeting 10.7 on the release builder, but not on the test builder. The test builder was using defaults and targeting the installed OS, 10.9. The version of libc++ that comes with the latest version of Xcode that can target 10.7 also does not support atomic_store and atomic_load. Easy enough fix, but obnoxious. While we’re in the area, we clean the asset database to trigger a full rebuild, because it’s been a few months since the last one. Turns out that triggered the second problem.

The code that generated the asset pak file was silently failing. Actually, the failure wasn’t completely silent, but because the file was still created, and because the script that controls it didn’t check the program’s return value, the build system thought everything was A-OK. So we wound up creating and uploading a 30 KB asset pak file (instead of an 800 MB one; how embarrassing).

We, of course, didn’t notice at all until someone else alerted us to it. Quite a few people, actually (again).

The exception the asset packer is throwing is:

Exception caught: (EofException) Error, unexpected end of file found 1 0 0

Which seems super weird, because when we have an existing pak file everything works; the only time it goes wrong is when we’re creating a new one. And why would we be reading from the file in that case? Just weird all around.

Why is this suddenly failing? This is more or less how I approached the problem, step by step.

First stop, let’s take a look at our backtrace:

$ gdb ./asset_packer
(gdb) set args -d -i ../assets/pak_ignore.config ../assets/packed assets.pak
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) r
Starting program: /home/omnipotententity/work/starbound/dist/asset_packer -d -i ../assets/pak_ignore.config ../assets/packed assets.pak
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Catchpoint 1 (exception thrown), 0x00007ffff7916a30 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0 0x00007ffff7916a30 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x000000000044a2a8 in Star::Buffer::readPartial (this=0x7fffffffd6b8, data=0x7fffffffd57f "", len=1) at /home/omnipotententity/work/starbound/source/core/StarBuffer.cpp:76
#2 0x0000000000451a58 in Star::IODevice::readFull (this=0x7fffffffd6b8, data=0x7fffffffd57f "", len=1) at /home/omnipotententity/work/starbound/source/core/StarIODevice.cpp:125
#3 0x000000000044e083 in readData (len=1, data=0x7fffffffd57f "", this=0x7fffffffd690) at /home/omnipotententity/work/starbound/source/core/StarDataStream.cpp:71
#4 operator>> (d=@0x7fffffffd57f: 0 '\000', this=0x7fffffffd690) at /home/omnipotententity/work/starbound/source/core/StarDataStream.cpp:165
#5 read<unsigned char> (this=0x7fffffffd690) at /home/omnipotententity/work/starbound/source/core/StarDataStream.hpp:339
#6 operator() (__closure=<optimized out>) at /home/omnipotententity/work/starbound/source/core/StarDataStream.cpp:241
#7 operator* (this=<optimized out>) at /home/omnipotententity/work/starbound/source/core/StarAlgorithm.hpp:381
#8 readVlqU<Star::FunctionInputIterator<Star::DataStream::readVlqU(uint64_t&)::__lambda18> > (in=..., x=@0x7fffffffd5b8: 0) at /home/omnipotententity/work/starbound/source/core/StarVlqEncoding.hpp:37
#9 Star::DataStream::readVlqU (this=0x7fffffffd690, i=@0x7fffffffd5b8: 0) at /home/omnipotententity/work/starbound/source/core/StarDataStream.cpp:241
#10 0x000000000044e211 in Star::DataStream::readVlqU (this=this@entry=0x7fffffffd690) at /home/omnipotententity/work/starbound/source/core/StarDataStream.cpp:272
#11 0x0000000000441707 in Star::DataStream::readMapContainer<Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > >, void Star::DataStream::readMapContainer<Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > > >(Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > >&)::{lambda(Star::DataStream&, Star::String&, Star::ByteArray&)#1}>(Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > >&, void Star::DataStream::readMapContainer<Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > > >(Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > >&)::{lambda(Star::DataStream&, Star::String&, Star::ByteArray&)#1}) (this=this@entry=0x7fffffffd690, map=..., function=function@entry=...) at /home/omnipotententity/work/starbound/source/core/StarDataStream.hpp:490
#12 0x0000000000441942 in readMapContainer<Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > > > (container=..., this=0x7fffffffd690) at /home/omnipotententity/work/starbound/source/core/StarDataStream.hpp:523
#13 operator>><Star::ByteArray> (map=..., ds=...) at /home/omnipotententity/work/starbound/source/core/StarDataStreamExtra.hpp:155
#14 read<Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > > > (data=...,
this=0x7fffffffd690) at /home/omnipotententity/work/starbound/source/core/StarDataStream.hpp:345
#15 Star::DataStreamBuffer::deserialize<Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > > > (t=..., data=...) at /home/omnipotententity/work/starbound/source/core/StarDataStream.hpp:567
#16 0x0000000000441b05 in Star::DataStreamBuffer::deserialize<Star::Map<Star::String, Star::ByteArray, std::unordered_map<Star::String, Star::ByteArray, Star::hash<Star::String, void>, std::equal_to<Star::String>, std::allocator<std::pair<Star::String const, Star::ByteArray> > > > > (data=...) at /home/omnipotententity/work/starbound/source/core/StarDataStream.hpp:598
#17 0x000000000043efba in Star::AssetsDatabaseBackend::build (assetsFolder=..., databaseFilename=..., ignoreFiles=..., skipFiles=..., digest=digest@entry=true) at /home/omnipotententity/work/starbound/source/game/StarAssetsDatabaseBackend.cpp:17
#18 0x0000000000437cce in main (argc=<optimized out>, argv=<optimized out>) at /home/omnipotententity/work/starbound/source/utility/asset_packer.cpp:90
(gdb)

So, walking backwards through the frames, let’s try to find out what’s going wrong. The top is just the internal throw; it’s not going to be that. After that we have the buffer readPartial, which is what actually threw the exception. But what could have caused it? Well, frame 10 has readVlqU, which is called from readMapContainer. The readMapContainer source looks like this:

template<typename Container, typename ReadFunction>
void DataStream::readMapContainer(Container& map, ReadFunction function) {
  map.clear();
  size_t size = readVlqU();
  for (size_t i = 0; i < size; ++i) {
    typename Container::key_type key;
    typename Container::mapped_type mapped;
    function(*this, key, mapped);
    map[key] = mapped;
  }
}

That seems reasonable. We read in the size and iterate through the entries. Per the backtrace, the exception comes out of readVlqU reading its very first byte, before the loop even starts. So it’s failing on the very first read, almost as if the buffer is empty.
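
For context on why an empty buffer dies right there: readVlqU decodes a variable-length unsigned integer, and the first thing it has to do is pull a byte from the stream. A rough sketch of that style of decoder, as an illustration of the idea rather than the exact Star implementation:

#include <cstdint>
#include <stdexcept>
#include <vector>

// VLQ-style decode: each byte carries 7 bits of payload, and the high bit
// says whether more bytes follow. The very first step is reading a byte, so
// a zero-length buffer fails immediately with an end-of-file style error.
uint64_t readVlqU(std::vector<uint8_t> const& buffer, size_t& pos) {
  uint64_t result = 0;
  while (true) {
    if (pos >= buffer.size())
      throw std::runtime_error("unexpected end of buffer");
    uint8_t byte = buffer[pos++];
    result = (result << 7) | (byte & 0x7f);
    if ((byte & 0x80) == 0)
      return result;
  }
}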

So, if it’s failing the read, then maybe there’s something wrong with the write function?

template<typename Container, typename WriteFunction>
void DataStream::writeMapContainer(Container& map, WriteFunction function) {
  writeVlqU(map.size());
  for (auto const& elem : map)
    function(*this, elem.first, elem.second);
}

Well, no, this seems perfectly fine: it always writes the size, so even an empty map serializes to at least one byte (a VLQ-encoded 0), never to an empty buffer.

Well, this seems like a dead end… the next port of call is “How long has this been a problem, and what triggered it?” So I start going back through commits, roughly doubling the distance and then bisecting. Each of these steps takes roughly five minutes, some more, some less.

$ git checkout '@{Sep 1}' 2>&1 >/dev/null && ../scripts/linux/setup.sh 20 2>&1 >/dev/null && rm assets.pak && ./asset_packer -d -i ../assets/pak_ignore.config ../assets/packed assets.pak 2>&1 >/dev/null && echo "Success!" || echo "Fail!"
Fail!
$ ^Sep 1^Aug 23
Fail!
$ ^Aug 23^Aug 15
Fail!
$ ^Aug 15^Aug 1
Fail!
$ ^Aug 1^Jul 1
Success!
$ ^Jul 1^Jul 15
Success!
$ ^Jul 15^Jul 23
Success!
$ ^Jul 23^Jul 27
Success!
$ ^Jul 27^Jul 30
Fail!
$ ^Jul 30^Jul 29
Fail!
$ ^Jul 29^Jul 28
Success!

So now… we know that the issue was caused sometime between Jul 28 and Jul 29, so let’s look at that diff.

$ git diff '@{Jul 28}' '@{Jul 29}'

It turns out that in one of these commits, some code that calls readMapContainer and writeMapContainer was changed in our DataStreamBuffer class, but only on the deserialization or reading side:

template<typename T, typename ReadFunction>
void DataStreamBuffer::deserializeMapContainer(T& t, ByteArray data, ReadFunction readFunction) {
  if (!data.empty()) {
    DataStreamBuffer ds(move(data));
    ds.readMapContainer(t, readFunction);
  }
}

Was changed to:

template<typename T, typename ReadFunction>
void DataStreamBuffer::deserializeMapContainer(T& t, ByteArray data, ReadFunction readFunction) {
  DataStreamBuffer ds(move(data));
  ds.readMapContainer(t, readFunction);
}

And if we double-check our gdb backtrace way up there, we do, indeed, go through this function. So this is what broke it. Why was it changed?

Well, if we look at it, the removed code was a check for an empty buffer: an empty buffer used to be treated the same as an empty map, rather than requiring a buffer that contained an encoded 0 (for the size). We had judged treating an empty buffer as actual data to be kludgy and not really in the spirit of good programming, so we made that a failure condition. So now we know how it broke. But there’s still the question of why. Why was it working before, and why did it suddenly stop?

Well, let’s follow the backtrace farther up. Frame 17 is where the call originated; let’s take a look:

void AssetsDatabaseBackend::build(String const& assetsFolder, String const& databaseFilename, StringList const& ignoreFiles, StringList const& skipFiles, bool digest) {
  SimpleSha256Database db("Assets2");
  db.setBlockSize(2048);
  db.setAutoCommit(false);
  db.setIODevice(File::open(databaseFilename, File::ReadWrite));
  db.open();
  // First, load the existing index for the database, if it exists.
  IndexMap oldIndex = DataStreamBuffer::deserialize<IndexMap>(db.find("_index").value());

So it looks like we initialize the database, then check to see if it already has an existing index of keys to pull from, and if it does, farther down the function we add stuff directly to the old index rather than regenerating it from scratch. This is how we originally fixed the issue where every update, no matter how small, was several hundred megabytes: by reusing the index key. So what happens if the index key is missing?

Well, db.find returns a Maybe<ByteArray>, which either holds a ByteArray or holds nothing at all, to represent a missing value (in C parlance, roughly a pointer that may be null). Calling .value() on a Maybe returns whatever is in it, or, if nothing is in it, the fallback parameter we passed in; if no parameter is passed, it returns a default-constructed value. So in this case, when it’s a brand new database, db.find returns an empty Maybe, which has value() called on it, which returns an empty ByteArray (eureka!)
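
For readers who want to see the shape of that behavior, here’s a minimal sketch of a Maybe along the lines described above; it’s an illustration, not the actual Star::Maybe implementation:

#include <utility>

template<typename T>
class Maybe {
public:
  Maybe() : m_set(false), m_value() {}
  Maybe(T value) : m_set(true), m_value(std::move(value)) {}

  explicit operator bool() const { return m_set; }

  // With no argument, an empty Maybe falls back to a default-constructed T,
  // which is exactly how an empty Maybe<ByteArray> quietly becomes an empty
  // ByteArray here.
  T value(T fallback = T()) const { return m_set ? m_value : fallback; }

  T const& operator*() const { return m_value; }

private:
  bool m_set;
  T m_value;
};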

So that’s where our empty ByteArray was coming from, and why the change broke this code.

So to fix it, we just make the code in this area more explicit:

// First, load the existing index for the database, if it exists.
Maybe<ByteArray> oldIndexByteArray = db.find("_index");
IndexMap oldIndex;
if (oldIndexByteArray) {
  oldIndex = DataStreamBuffer::deserialize<IndexMap>(*oldIndexByteArray);
}

And we deftly avoid having an empty byte array sent through deserialize. And the nightlies work again.