Chaordic Code Monkeys (Jekyll, feed generated 2017-12-12T14:37:38-02:00) - http://monkeys.chaordic.com.br/
<p>Cross-Origin Resource Sharing, or CORS for short, is a mechanism that uses additional HTTP headers to allow <em>cross-origin requests</em> (which happen when the sender and the receiver are hosted on different protocols, domains or ports), letting an application grant another application, on a different origin, access to specific resources.</p>
<p><img src="/images/cors_error.png" alt="" class="text-center" /></p>
<p>There is quite a deep background behind it, so let’s go back to the beginning to see why such a thing should exist.</p>
<h1 id="cookie-based-auth">Cookie-Based Auth</h1>
<p>Nowadays it is quite common to type your credentials only once and remain authenticated until you explicitly log out of a service. We usually manage this by storing a token of the user’s identity in the user’s browser, inside a <em>cookie</em>. The content of this cookie is then sent on every request associated with the given site.</p>
<p>And that’s it! Now every request carries the user’s credentials, removing the need to ask for them again throughout the cookie’s lifetime. AJAX requests may also include these credentials… and <em>that’s</em> something we should be aware of.</p>
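<p>The mechanics can be sketched as a tiny parser for the <code class="highlighter-rouge">Cookie</code> header the browser attaches to each request (the header value and token name below are hypothetical; this is an illustrative sketch, not a full RFC 6265 parser):</p>

```javascript
// Minimal sketch: parse a Cookie request header into a name -> value map.
// Illustrative only; not a complete RFC 6265 implementation.
function parseCookies(header) {
  const jar = {};
  for (const pair of header.split(";")) {
    const idx = pair.indexOf("=");
    if (idx === -1) continue;
    jar[pair.slice(0, idx).trim()] = pair.slice(idx + 1).trim();
  }
  return jar;
}

// A hypothetical session token the server set after login:
const cookies = parseCookies("session_token=abc123; theme=dark");
console.log(cookies.session_token); // "abc123"
```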
<h2 id="cross-site-request-forgery-csrf">Cross-Site Request Forgery (CSRF)</h2>
<p>Back then, the possibility of sending these credentials via AJAX created a vast amount of devious possibilities. Let’s create a hypothetical situation.</p>
<p>Suppose you logged into your bank website, which uses cookie-based auth. After logging in, a token with your credentials would be stored in your browser. You did some transactions, went back to reading an interesting article on some random blog, and felt like joining a discussion in the comments section.</p>
<p>What you didn’t know is that the “Comment” button, besides submitting your comment, fires an AJAX request to every bank website it can think of (yours included) saying: “<strong>as the currently logged in user</strong>, transfer 200$ to account XXX”.</p>
<p>Sick, right? Such an exploit is called <em>cross-site request forgery</em> and, back then, in some specific cases, it would not only work but would also make all these requests appear to have been made by the logged-in user, since the attacker used that user’s credentials.</p>
<p>Nasty, isn’t it? But fear not! People at Netscape, back in 1995, thought about that and created what we call <strong>Same-Origin Policy</strong>.</p>
<h1 id="same-origin-policy">Same-Origin Policy</h1>
<p>This policy, implemented by most modern browsers in different ways, restricts the interaction of a document with resources from different origins. This is critical to isolate malicious scripts.</p>
<p>As I said in the introduction, a given document has the same origin as a given resource if they share the same <strong>protocol</strong>, <strong>domain</strong> and <strong>port</strong>. If any of these parameters diverge, we have a <em>cross-origin request</em>, which is restricted by this policy.</p>
<p>With that, the malicious AJAX request from our hypothetical blog, which is hosted on a different origin from our hypothetical bank, would trigger an error.</p>
<p>As I said before, the Same-Origin Policy implementation varies between browsers, but, essentially, it controls the interaction between two origins and categorizes it as:</p>
<ul>
<li><em>Cross-origin <strong>writes</strong></em> (typically allowed)</li>
<li><em>Cross-origin <strong>embedding</strong></em> (such as <code class="highlighter-rouge"><img></code>, <code class="highlighter-rouge"><link></code> or <code class="highlighter-rouge"><script></code> and some other tags… also typically allowed)</li>
<li>or <em>Cross-origin <strong>reads</strong></em> (typically <strong>not</strong> allowed - even though you <em>can</em> read the embedded resource information).</li>
</ul>
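<p>The three-part origin comparison can be expressed directly with the <code class="highlighter-rouge">URL</code> API (an illustrative helper; real browsers also handle special cases, such as <code class="highlighter-rouge">document.domain</code>, that this sketch ignores):</p>

```javascript
// Two URLs share an origin iff protocol, host and port all match.
function sameOrigin(a, b) {
  const ua = new URL(a), ub = new URL(b);
  return ua.protocol === ub.protocol &&
         ua.hostname === ub.hostname &&
         ua.port === ub.port;
}

console.log(sameOrigin("http://app.example.com/x", "http://app.example.com/y")); // true
console.log(sameOrigin("http://app.example.com", "https://app.example.com"));    // false (protocol)
console.log(sameOrigin("http://app.example.com", "http://api.example.com"));     // false (domain)
console.log(sameOrigin("http://app.example.com", "http://app.example.com:8080"));// false (port)
```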
<p>This allowance for <em>cross-origin embedding</em> is crucial, because website composition relies on external resources and assets. But over time, applications started to rely more and more on external APIs. These requests wouldn’t pass the Same-Origin Policy due to their cross-origin nature, so people came up with a workaround: using <em>cross-origin embedding</em> as a vehicle to bypass the Same-Origin Policy, through a mechanism called <em>JSON with Padding</em>, or <em>JSONP</em> for short.</p>
<h2 id="jsonp">JSONP</h2>
<p>Essentially, JSONP takes advantage of the fact that the HTML <code class="highlighter-rouge"><script></code> tag is allowed to execute content retrieved from a cross-origin request (categorized as embedding, which explains why it is allowed). How so? Let’s go by example!</p>
<p>Let’s say our <code class="highlighter-rouge">app</code> is hosted on <code class="highlighter-rouge">http://app.example.com</code> and we have an <code class="highlighter-rouge">endpoint</code> <code class="highlighter-rouge">http://api.example.com/user</code> that returns the following JSON data:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Party Parrot"</span><span class="p">,</span><span class="w">
</span><span class="s2">"quote"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Party or die!"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Notice that a request from <code class="highlighter-rouge">app</code> to <code class="highlighter-rouge">endpoint</code> is clearly cross-origin, since the hosts differ. So, with the knowledge gathered so far, we know the Same-Origin Policy wouldn’t allow us to make an AJAX request to <code class="highlighter-rouge">endpoint</code>; we need a workaround.</p>
<p>What if we do that?</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><script </span><span class="na">src=</span><span class="s">"http://api.example.com/user"</span><span class="nt">></script></span>
</code></pre></div></div>
<p>As is the usual behavior of the <code class="highlighter-rouge"><script></code> tag, the browser will request the content of the <code class="highlighter-rouge">src</code>, download it and evaluate it. When trying to evaluate our JSON, it will either interpret it as a block and fire a syntax error, or interpret it as an object literal. Either way, we don’t get access to its content in a way that can be worked with.</p>
<p>To get around this problem, we can use the JSONP technique: the server wraps the JSON content in JavaScript code. Usually it wraps the data in a function call, with the name of the function provided, by convention, as a named query parameter. Like this:</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><script </span><span class="na">src=</span><span class="s">"http://api.example.com/user?callback=doSomething"</span><span class="nt">></script></span>
</code></pre></div></div>
<p>Will return:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">doSomething</span><span class="p">({</span> <span class="s2">"name"</span><span class="p">:</span> <span class="s2">"Party Parrot"</span><span class="p">,</span> <span class="s2">"quote"</span><span class="p">:</span> <span class="s2">"Party or die!"</span> <span class="p">});</span>
</code></pre></div></div>
<p>And, as always, the script tag will evaluate this code. So… to complete this, all we have to do is to declare our handler beforehand:</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><script></span>
<span class="kd">function</span> <span class="nx">doSomething</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">"Our parrot data"</span><span class="p">,</span> <span class="nx">data</span><span class="p">);</span>
<span class="p">}</span>
<span class="nt"></script></span>
<span class="nt"><script </span><span class="na">src=</span><span class="s">"http://api.example.com/user?callback=doSomething"</span><span class="nt">></script></span>
</code></pre></div></div>
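<p>In practice the pattern above is usually wrapped in a small helper that builds the callback URL, registers a global handler and injects the tag. The <code class="highlighter-rouge">loadJsonp</code> helper below is a hypothetical, browser-only sketch; the URL-building part is the portable core:</p>

```javascript
// Build the JSONP URL: the callback name travels as a query parameter.
function jsonpUrl(endpoint, callbackName) {
  const url = new URL(endpoint);
  url.searchParams.set("callback", callbackName);
  return url.toString();
}

// Browser-only sketch (hypothetical helper): register a global handler
// under a unique name and inject the <script> tag that triggers the call.
function loadJsonp(endpoint, handler) {
  const name = "jsonp_cb_" + Date.now();
  window[name] = (data) => { handler(data); delete window[name]; };
  const script = document.createElement("script");
  script.src = jsonpUrl(endpoint, name);
  document.head.appendChild(script);
}

console.log(jsonpUrl("http://api.example.com/user", "doSomething"));
// "http://api.example.com/user?callback=doSomething"
```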
<p>And there you go! We just made a cross-origin request, bypassing the Same-Origin Policy. But doing things this way, as you can probably imagine (I hope), has some <em>serious</em> security implications (which I will not cover here). For this reason, people from the W3C proposed a protocol called <strong>Cross-Origin Resource Sharing</strong>, or <strong>CORS</strong> for short.</p>
<h2 id="cross-origin-resource-sharing">Cross-Origin Resource Sharing</h2>
<p>CORS consists of a set of additional headers that indicate whether a response’s content can be shared or not. To illustrate this, consider that all the requests mentioned are <em>AJAX</em> (<code class="highlighter-rouge">XMLHttpRequest</code>).</p>
<p>With CORS, the browser classifies requests into two cases: <strong>Simple</strong> and <strong>Preflighted</strong> requests.</p>
<p>A request is classified as a <strong>Simple request</strong> when it could also be produced by an HTML <code class="highlighter-rouge">form</code>: a <code class="highlighter-rouge">GET</code>, <code class="highlighter-rouge">HEAD</code> or <code class="highlighter-rouge">POST</code>. The latter applies only when the content type is <code class="highlighter-rouge">text/plain</code>, <code class="highlighter-rouge">application/x-www-form-urlencoded</code> or <code class="highlighter-rouge">multipart/form-data</code>.</p>
<p>Requests that cannot be classified as <em>Simple</em> are <strong>Preflighted requests</strong>. These trigger a <strong>CORS-preflight request</strong>, which checks whether the CORS protocol is understood by the server. This preflight request uses the <code class="highlighter-rouge">OPTIONS</code> method.</p>
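<p>The classification can be sketched as a predicate (a simplified sketch: the full rules in the Fetch standard also constrain which other request headers may be set):</p>

```javascript
const SIMPLE_METHODS = ["GET", "HEAD", "POST"];
const SIMPLE_CONTENT_TYPES = [
  "text/plain",
  "application/x-www-form-urlencoded",
  "multipart/form-data",
];

// Simplified sketch of the "simple request" test from the CORS protocol.
function isSimpleRequest(method, contentType) {
  if (!SIMPLE_METHODS.includes(method.toUpperCase())) return false;
  if (contentType === undefined) return true;
  // Strip parameters such as "; charset=utf-8" before comparing.
  const essence = contentType.split(";")[0].trim().toLowerCase();
  return SIMPLE_CONTENT_TYPES.includes(essence);
}

console.log(isSimpleRequest("GET"));                                       // true
console.log(isSimpleRequest("POST", "application/x-www-form-urlencoded")); // true
console.log(isSimpleRequest("POST", "application/json"));                  // false -> preflight
console.log(isSimpleRequest("PUT", "text/plain"));                         // false -> preflight
```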
<p><img src="/images/cors_flow.png" alt="" class="center-image" /></p>
<p>In summary, the basic mechanism of CORS is to include the following request headers, which indicate what is being requested:</p>
<ul>
<li><code class="highlighter-rouge">Origin</code>: contains the request’s origin</li>
<li><code class="highlighter-rouge">Access-Control-Request-Method</code>: indicates, on the preflight, the method of the actual CORS request</li>
<li><code class="highlighter-rouge">Access-Control-Request-Headers</code>: indicates, on the preflight, the headers of the actual CORS request</li>
</ul>
<p>The server then includes the following headers in the response, indicating what is allowed:</p>
<ul>
<li><code class="highlighter-rouge">Access-Control-Allow-Origin</code>: indicates which origins can access the response content</li>
<li><code class="highlighter-rouge">Access-Control-Allow-Credentials</code>: indicates whether credentials (such as the browser’s cookies) may be included</li>
<li><code class="highlighter-rouge">Access-Control-Allow-Methods</code>: indicates which methods are supported</li>
<li><code class="highlighter-rouge">Access-Control-Allow-Headers</code>: indicates which headers are supported</li>
<li><code class="highlighter-rouge">Access-Control-Expose-Headers</code>: indicates which response headers can be exposed to the page</li>
<li><code class="highlighter-rouge">Access-Control-Max-Age</code>: indicates how long the preflight result (the allowed <code class="highlighter-rouge">Methods</code> and <code class="highlighter-rouge">Headers</code>) can be cached</li>
</ul>
<p>So, if we send this example of a <em>Simple Request</em> from a fictional origin <code class="highlighter-rouge">http://parrot.com</code>…</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GET /party HTTP/1.1
Host: api.parrot.com
Origin: http://parrot.com
...
</code></pre></div></div>
<p>…the response will be sent to the user whether it carries the CORS-specific headers or not; the browser only blocks access to the content when they are missing. It should look something like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-type: application/json
...
(response json here)
</code></pre></div></div>
<p>Simple, right? Cool…</p>
<p>Now, a <code class="highlighter-rouge">POST</code> with a JSON body sent to a different host is a <em>Preflighted request</em>, so it triggers an <code class="highlighter-rouge">OPTIONS</code> preflight like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OPTIONS /party HTTP/1.1
Host: api.parrot.com
Origin: http://parrot.com
Access-Control-Request-Method: POST
Access-Control-Request-Headers: content-type,accept
...
</code></pre></div></div>
<p>And the preflight response should be:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HTTP/1.1 200 OK
Access-Control-Allow-Origin: http://parrot.com
Access-Control-Allow-Methods: POST, GET, PUT
Access-Control-Allow-Headers: content-type,accept
...
</code></pre></div></div>
<p>And, since the <code class="highlighter-rouge">OPTIONS</code> preflight succeeded, the browser will then send the actual request:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>POST /party HTTP/1.1
Host: api.parrot.com
Origin: http://parrot.com
...
(request JSON here)
</code></pre></div></div>
<p>And the corresponding response will be:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
...
</code></pre></div></div>
<p>And that’s pretty much how it works. The implementation differs a little from browser to browser, but they generally operate in a similar way. For example, some browsers add these additional headers to Simple CORS requests and others don’t. There’s even <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1277496">a Firefox issue</a> related to this subject.</p>
<h3 id="sources">Sources:</h3>
<ul>
<li><a href="https://fetch.spec.whatwg.org/#http-cors-protocol">https://fetch.spec.whatwg.org/#http-cors-protocol</a></li>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS">https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS</a></li>
<li><a href="https://spring.io/understanding/CORS">https://spring.io/understanding/CORS</a></li>
<li><a href="https://www.html5rocks.com/en/tutorials/cors/">https://www.html5rocks.com/en/tutorials/cors/</a></li>
<li><a href="http://restlet.com/company/blog/2015/12/15/understanding-and-using-cors/">http://restlet.com/company/blog/2015/12/15/understanding-and-using-cors/</a></li>
</ul>
<p><a href="http://monkeys.chaordic.com.br/operation/2017/12/12/lets-talk-about-cors.html">Let's talk about CORS</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on December 12, 2017.</p>
<p>GitFlow is a branching model created by Vincent Driessen in 2010 (<a href="http://nvie.com/posts/a-successful-git-branching-model/">original article</a>). Since it was published, many companies have tested and implemented it, which gives us plenty of reviews on how well (or not) it works. After some discussions within our team, we decided not to go with GitFlow, but to use a simpler model instead, together with a tightly defined workflow. Some of the reasons discussed for not going with GitFlow are the same ones written in <a href="http://endoflineblog.com/gitflow-considered-harmful">this blog post</a>.</p>
<h2 id="the-feature-branch-model">The Feature Branch Model</h2>
<p>Compared to GitFlow, it is easier to implement and does not require any plugins
to be properly used. The step-by-step of this model would be:</p>
<ol>
<li>Create a branch from the master (feature-x), which is where the feature will be
developed: <code class="highlighter-rouge">git checkout -b feature-x</code></li>
<li>Push the branch to the remote: <code class="highlighter-rouge">git push -u origin feature-x</code>. With the branch in the remote repo, a pull request should be opened for it (<a href="https://help.github.com/articles/creating-a-pull-request/">How to open it in GitHub</a>). A pull request is where all modifications are made available to the other members, who will be able to review them</li>
<li>Fix the reviewed code and wait for approval. If a new release on the master
generates a conflict, a best practice would be to rebase it (instead of merging)</li>
<li>(optional) If a rebase is needed: checkout to master <code class="highlighter-rouge">git checkout master</code>, pull
the changes <code class="highlighter-rouge">git pull</code>, go back to the feature branch <code class="highlighter-rouge">git checkout feature-x</code>,
do the rebase <code class="highlighter-rouge">git rebase master</code> and then sync the rebased branch <code class="highlighter-rouge">git push
--force-with-lease</code> . <a href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing">A good tutorial about merging x rebasing is available on
this Atlassian
article</a>.</li>
<li>If there are no conflicts and it was approved ⇒ <strong>squash + merge</strong></li>
</ol>
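<p>The steps above can be rehearsed end to end using a local bare repository as a stand-in for the remote (paths, branch and commit names here are illustrative):</p>

```shell
# Sketch of the feature-branch flow against a throwaway local "origin".
set -e
tmp=$(mktemp -d)
git init --bare -q "$tmp/origin.git"
git init -q "$tmp/app"
cd "$tmp/app"
git config user.email dev@example.com
git config user.name dev
git symbolic-ref HEAD refs/heads/master       # pin the default branch name
git remote add origin "$tmp/origin.git"
git commit -q --allow-empty -m "initial commit"
git push -q -u origin master
git checkout -q -b feature-x                  # 1. branch off master
git commit -q --allow-empty -m "wip: part 1"
git commit -q --allow-empty -m "wip: part 2"
git push -q -u origin feature-x               # 2. publish; open the PR from this branch
git rebase -q master                          # 4. replay on top of master (a no-op here)
git checkout -q master                        # 5. squash + merge after approval
git merge --squash feature-x
git commit -q --allow-empty -m "feature-x: squashed feature commit"
git log --oneline
```

Note the squashed result is a single commit on master, regardless of how many work-in-progress commits the feature branch accumulated.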
<p><a href="https://www.atlassian.com/git/tutorials/comparing-workflows#feature-branch-workflow">This Atlassian article has a more detailed view of the feature branch model</a></p>
<h3 id="why-squash--merge-instead-of-just-merge">Why Squash + Merge instead of just Merge?</h3>
<p>Squash and merge is made up of two processes: the squash, which compacts all commits into one big commit/patch, and then the merge itself. After squashing and merging, you will have only one commit in the target branch (usually master) containing all your modifications. This enables two things:</p>
<ol>
<li>It is easier to move this feature, as the whole patch/feature will be on one
commit hash</li>
<li>The target branch will be cleaner, less messy and more readable — without those
67 commits you have made to finish the feature.</li>
</ol>
<p>There is more information about why devs prefer squash and merge, instead of only merging, in <a href="https://softwareengineering.stackexchange.com/questions/263164/why-squash-git-commits-for-pull-requests">this article</a>.</p>
<h2 id="managing-release-versions-with-git-tags">Managing release versions with git tags</h2>
<p>In the feature branch model, a merge is considered a new version release. To track each release version, tags can be used. These will serve as references to choose which version should be deployed to the servers.</p>
<p>To manage these tags/releases, a good practice is the use of <a href="http://semver.org/">semantic versioning</a>:</p>
<blockquote>
<p>Given a version number <strong>MAJOR.MINOR.PATCH</strong>, increment the:</p>
<ol>
<li>MAJOR version when you make incompatible API changes,</li>
<li>MINOR version when you add functionality in a backwards-compatible manner, and</li>
<li>PATCH version when you make backwards-compatible bug fixes.</li>
</ol>
</blockquote>
<p>The process of creating releases can be automated using <code class="highlighter-rouge">grunt-release</code> or <code class="highlighter-rouge">gulp-release-tasks</code>. But, following the steps below, it can easily be done by hand:</p>
<ol>
<li>Checkout to the master branch: <code class="highlighter-rouge">git checkout master</code></li>
<li>Pull changes from the remote <code class="highlighter-rouge">git pull</code></li>
<li>Get the most recent tag using <code class="highlighter-rouge">git describe --abbrev=0</code> (let’s say it returns
<code class="highlighter-rouge">v0.1.0</code>)</li>
<li>Create a tag using <code class="highlighter-rouge">git tag -a <version></code>⇒<code class="highlighter-rouge">git tag -a v0.2.0</code></li>
<li>Push the modifications and the tag: <code class="highlighter-rouge">git push origin v0.2.0 --follow-tags</code></li>
<li>Done!</li>
</ol>
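<p>The same tagging loop can be tried in a scratch repository (the version numbers and commit messages are illustrative):</p>

```shell
# Sketch: cutting releases with annotated tags in a throwaway repo.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/app"
cd "$tmp/app"
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "work for the first release"
git tag -a v0.1.0 -m "release v0.1.0"
git commit -q --allow-empty -m "work for the next release"
git describe --abbrev=0         # prints the most recent reachable tag: v0.1.0
git tag -a v0.2.0 -m "release v0.2.0"
git tag --list                  # both releases are now tracked
```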
<h2 id="deploying">Deploying</h2>
<p>In many PaaS, such as AWS Beanstalk or Heroku, a remote repository is set up where, when changes are pushed (e.g. <code class="highlighter-rouge">git push heroku master</code>), a deploy is triggered using the latest commits on master. In these cases, a simple force push using the release tag will deploy the desired version: <code class="highlighter-rouge">git push -f <deploy/env-remote> v0.2.0^{}:master</code>. Easy, eh?</p>
<blockquote>
<p>NOTE: At Chaordic New Offers Team, a grunt script was developed where we publish
which tag should be deployed: <code class="highlighter-rouge">grunt deploy:<version>:<env>:all</code></p>
</blockquote>
<h2 id="what-happens-if-a-hot-fix-is-needed">What happens if a hot-fix is needed?</h2>
<p>At some point, an issue will be raised and the production version will need a
hot-fix <em>ASAP</em>. A feature branch can’t just be opened to develop a fix, as the
<code class="highlighter-rouge">master</code> will probably be ahead of the production version. In this case, the fix
needs to be done directly on the production version:</p>
<ol>
<li>Checkout to the production version tag <code class="highlighter-rouge">git checkout v0.10.0</code></li>
<li>Create a new branch from this tag <code class="highlighter-rouge">git checkout -b hotfix-v0.10.1-weirdbehavior</code></li>
<li>Create the fix and commit it</li>
<li>Create a tag for this new release <code class="highlighter-rouge">git tag -a v0.10.1</code> (notice the SEMVER
pattern)</li>
<li>Push the branch and tag to remote <code class="highlighter-rouge">git push -u origin
hotfix-v0.10.1-weirdbehavior --follow-tags</code></li>
<li>Deploy the tag <code class="highlighter-rouge">v0.10.1</code> to the production environment</li>
<li>A pull request should be opened, as the fix should also be applied to the master afterwards</li>
</ol>
<p>If more patches are needed, this process can be repeated on the same version,
incrementing only the patch version.</p>
<p><span class="figcaption_hack">Software hot-fixing is way easier, don’t you think?</span></p>
<h3 id="what-about-applying-it-to-other-environments">What about applying it to other environments?</h3>
<p>This patch should probably be applied to other environments as well, which can be done through <code class="highlighter-rouge">git cherry-pick <commit-hash></code>. It basically applies the chosen commit on top of the current HEAD.</p>
<ol>
<li>Checkout to the environment version tag <code class="highlighter-rouge">git checkout v0.13.0</code></li>
<li>Create a new branch for the patch <code class="highlighter-rouge">git checkout -b hotfix-v0.13.1</code></li>
<li>Do a <code class="highlighter-rouge">git cherry-pick v0.10.1</code> or a <code class="highlighter-rouge">git cherry-pick <commit-hash></code> to apply the
desired commit</li>
<li><code class="highlighter-rouge">git tag -a v0.13.1</code> and <code class="highlighter-rouge">git push origin v0.13.1</code> (push just the tag)</li>
<li>Deploy it</li>
</ol>
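<p>Here is a self-contained rehearsal of that cherry-pick flow (branch names, tags and file contents are illustrative; in a real repo the two release lines already exist):</p>

```shell
# Sketch: carry a hotfix commit from one release line onto another.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/app"
cd "$tmp/app"
git config user.email dev@example.com
git config user.name dev
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "base"
git checkout -q -b hotfix-v0.10.1              # production line gets the fix first
echo "the fix" > fix.txt
git add fix.txt
git commit -q -m "hotfix: weird behavior"
git tag -a v0.10.1 -m "production hotfix"
git checkout -q master                         # the newer line, still without the fix
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "new feature"
git tag -a v0.13.0 -m "staging release"
git checkout -q -b hotfix-v0.13.1 v0.13.0      # 1-2. branch from the environment tag
git cherry-pick v0.10.1                        # 3. apply the hotfix commit here
git tag -a v0.13.1 -m "staging hotfix"         # 4. tag the patched release
git log --oneline
```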
<h3 id="what-if-i-want-to-get-a-modification-from-master-and-sent-to-one-of-the">What if I want to get a modification from master and send it to one of the environments?</h3>
<p>It is very similar to the one above: a <code class="highlighter-rouge">git cherry-pick</code> should be done using a commit hash from the master since, after the <strong>squash + merge</strong> of a pull request, a new commit is generated with all the changes (a big patch of commits condensed into one).</p>
<h2 id="just-keep-in-mind">Just keep in mind…</h2>
<p>The gap between the environments’ versions should be as small as possible. Otherwise, some issues may appear:</p>
<ul>
<li>If production is on <code class="highlighter-rouge">v0.1.10</code>, the latest release is <code class="highlighter-rouge">v0.10</code>, but version <code class="highlighter-rouge">v0.3</code> will be deployed: the team members will have to check whether some of the production patches are still required and then apply them, one by one.</li>
<li>If some feature was only finished in <code class="highlighter-rouge">v0.10.0</code>, and it is required for the roll-out, but <code class="highlighter-rouge">v0.7.0</code> is still not well tested: the release should be held until <code class="highlighter-rouge">v0.7.0</code> has been tested</li>
</ul>
<p>Usually, these version gaps occur when the development capacity is higher than the testing capacity (developers vs. testers ratio).</p>
<h2 id="conclusion">Conclusion</h2>
<p>The model is still being tested but, until now, it has been working well. The only drawbacks faced were the ones pointed out in the section above.</p>
<hr />
<p><a href="https://hackernoon.com/still-using-gitflow-what-about-a-simpler-alternative-74aa9a46b9a3">This post was originally published on HackerNoon, under the title “Still using
GitFlow? What about a simpler alternative?”</a></p>
<p><a href="http://monkeys.chaordic.com.br/git/2017/09/22/still-using-gitflow-what-about-a-simpler-alternative.html">Still using GitFlow? What about a simpler alternative?</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on September 22, 2017.</p>
<p>Chaordic is the leader in big data based personalization solutions for e-commerce in Brazil. <a href="https://www.chaordic.com.br/en/cases/">The largest online retailers</a> in the country, such as Saraiva, Walmart and Centauro use our solutions to suggest personalized shopping recommendations to their users. Chaordic’s Data Platform provides a common data layer and services shared among multiple product offerings.</p>
<p>Three years ago, <a href="https://www.chaordic.com.br/">Chaordic</a> was experiencing exponential growth. From the beginning, our business was based on gathering as much relevant information as we could collect. I joined the company when the core database solution - MySQL - was struggling to keep up with the pace of growth. At that time, it was not so clear which solution would take the lead in the hyped NoSQL front. Although today we use a hybrid data architecture, <a href="http://cassandra.apache.org/">Cassandra</a> was a key technology in enabling sustained growth for Chaordic.</p>
<h2 id="phase-1-finding-the-sweet-spot">Phase 1: Finding the Sweet Spot</h2>
<p>Some modern data store solutions favor being easy to use, others provide an integrated environment with MapReduce or Search, and others invest in performance and data consistency (see <a href="https://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a>). So finding the sweet spot for a technology matching our use case was the first step.
As we considered the challenges we were facing at that moment, many things became clearer:</p>
<ul>
<li>Vertically scaling the database had hit the ceiling: we had the largest AWS instance available at the time and it was not enough for our master-slave deployment.</li>
<li>We wanted to avoid manual sharding of data and the complexity of rebalancing cluster data.</li>
<li>Growing and doubling our capacity should be something straightforward and quick.</li>
<li>The data store should fit well to our write intensive pattern.</li>
<li>Downtimes, even scheduled, should be avoided and challenging SLAs adopted.</li>
<li>Proprietary data storages and provider-specific technologies should also be avoided.</li>
<li>Schema changes needed to be less painful: long maintenance operations (e.g., ALTER TABLE) were already limiting us.</li>
</ul>
<p>As we pondered these factors and benchmarked against the other players at the time, the choice of Cassandra was natural for our scenario. The balance of linear scalability, write optimizations, ease of administration, tunable consistency and a growing community made the difference. The momentum behind Cassandra at the time (including <a href="http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html">Netflix’s early adoption</a>) and, of course, our natural instinct to try new stuff were also important.</p>
<h2 id="phase-2-learning-by-doing">Phase 2: Learning by Doing</h2>
<p>We decided the most logical way was to migrate incrementally from MySQL. There were many uncertainties in the process, different ways of modeling, and so much to learn. We began by setting up a six-node, single-zone Cassandra cluster and migrating the most performance-sensitive entities first.
Our strategy included a way to move the migration forward, as well as a way to roll back if things went wrong; this proved to be very effective, as we were still learning the best ways to represent our data in Cassandra. Our platform was already designed around REST APIs, so internal clients would not be affected by the migration. Our tiered architecture only demanded implementing a new Cassandra Data Access Object for each entity and, in some cases, adapting the business logic.</p>
<p>Understanding <a href="http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling">how to model</a> efficiently in Cassandra was a step-by-step process. One of the common mistakes was causing load hotspots in the cluster. Cassandra balances data distribution based on <a href="http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeHashing_c.html">consistent key hashing</a>. So, by making a poor choice of partition keys - i.e. a single key for all the products of your biggest client - the load would be concentrated in the same replica group and those particular nodes would be saturated. Primary keys in Cassandra should be chosen considering a partitioning that balances data distribution and the clustering of columns. Clustering keys, on the other hand, enable the retrieval of rows very efficiently and are also used to index and sort data.</p>
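<p>The hotspot effect can be illustrated with a toy sketch (the string hash and the six-node ring below are simplifications for illustration; Cassandra actually uses a Murmur3 partitioner over a token ring):</p>

```javascript
// Toy sketch: hash partition keys onto a 6-node "ring" and count rows per node.
// Illustrative only -- not Cassandra's real Murmur3-based partitioning.
function hashCode(s) {
  let h = 0;
  for (const c of s) h = (h * 31 + c.charCodeAt(0)) | 0; // 32-bit string hash
  return Math.abs(h);
}

function distribute(keys, nodes = 6) {
  const counts = new Array(nodes).fill(0);
  for (const k of keys) counts[hashCode(k) % nodes]++;
  return counts;
}

const products = Array.from({ length: 600 }, (_, i) => `product-${i}`);

// Poor choice: one partition key for all of the biggest client's products.
const hot = distribute(products.map(() => "big-client"));
// Better: a composite key (client + product) spreads the load over the ring.
const spread = distribute(products.map((p) => `big-client:${p}`));

console.log(Math.max(...hot));    // 600 -> every row lands on a single node
console.log(Math.max(...spread)); // far smaller: rows spread across the nodes
```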
<p>Cassandra’s well-known limitation with secondary indexes is something we learned the hard way. If you read the docs, the benefits may catch your attention. They work pretty well with low-cardinality data, but in many situations this won’t be the case. When we migrated one of our main entities to take advantage of them, to create a list of products by client, performance decreased and we had to roll back. The solution was to create a new table (or a <a href="http://exponential.io/blog/2015/01/06/data-modeling-basics-materialized-views/">Materialized View</a>) for each desired query and manage updates in the application layer. The good news is that <a href="http://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views">Cassandra 3.0 will support this built in</a>.</p>
<p>Another mistake we made was using the same table to store different entities, back when there was no table schema. In the 0.8 days there was a mythical recommendation to avoid using too many tables, so we reused some tables to store rows in different storage formats, in <a href="http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows">particular wide and skinny partitions</a>. This ended up messing with page caching and other storage optimizations. Modeling data in CQL as of today already protects you from this bad practice, but pre-CQL this was not so clear. The advice is that mixing different entities or read/write patterns in the same table is generally not a good idea.</p>
<h2 id="phase-3-optimizing-for-operations">Phase 3: Optimizing for Operations</h2>
<p>If your cluster size is evolving quickly, it is very important to master Cassandra operations: automating the process of node replacement and addition will save you a lot of time. We started with Puppet and then moved to Chef to automate node provisioning. When the cluster gets reasonably big, it is also imperative to automate rolling restarts for configuration tuning and <a href="http://monkeys.chaordic.com.br/operation/zero-downtime-cassandra-upgrade/">version upgrades</a>.</p>
<p>Understanding Cassandra specificities such as repairs, hinted handoff and cleanups is also key to maintaining data consistency. For example, we once got zombie data in our cluster because deletions (see <a href="https://www.google.com/search?q=tombstones+cassandra+deletes">tombstones</a>) were not replicated while a node was down for longer than <a href="http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html">gc_grace_seconds</a>, and the data eventually reappeared; a <a href="http://www.datastax.com/dev/blog/repair-in-cassandra">repair</a> operation could have fixed that.</p>
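<p>A toy model of that zombie-data scenario (the timestamps and gc_grace value are made up, and real Cassandra semantics are more involved) shows how a purged tombstone lets deleted data resurface:</p>

```python
# Toy model: replica "b" is down when a delete happens, the tombstone
# on "a" is purged once gc_grace_seconds elapses, and a later repair
# resurrects the deleted row.
GC_GRACE = 10  # seconds; illustrative, not a recommended setting

class Replica:
    def __init__(self):
        self.data = {}        # key -> value
        self.tombstones = {}  # key -> deletion timestamp

    def write(self, key, value):
        self.data[key] = value

    def delete(self, key, now):
        self.data.pop(key, None)
        self.tombstones[key] = now

    def compact(self, now):
        # Compaction purges tombstones older than gc_grace_seconds.
        self.tombstones = {k: t for k, t in self.tombstones.items()
                           if now - t < GC_GRACE}

def repair(target, peer):
    # Naive repair: without the tombstone, the deleted value looks like
    # data the target simply never received, so it gets copied back.
    for k, v in peer.data.items():
        if k not in target.data and k not in target.tombstones:
            target.data[k] = v

a, b = Replica(), Replica()
a.write("user:1", "active"); b.write("user:1", "active")
a.delete("user:1", now=0)       # b is down and never sees the delete
a.compact(now=GC_GRACE + 1)     # tombstone purged after gc_grace
repair(a, b)                    # b comes back; the zombie returns
print(a.data)                   # {'user:1': 'active'}
```

<p>Running a repair before the tombstone expires (i.e. within gc_grace_seconds) would have propagated the deletion to the downed replica instead.</p>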
<p>As we grew in developing new products, we had to decide between staying with a monolithic cluster or deploying a new one. We decided that the latter was the better option to enable experimentation and modularity. We then rolled out a new cluster on a newer version of Cassandra (1.2 at the time), enabling <a href="http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2">virtual nodes</a>. We started with spinning-disk nodes, but quickly migrated to new SSD-backed instances. This last step cut write latency by 50% and improved read latency by 80%, even using fewer machines to compensate for costs! Better yet, it allowed us to take Memcached out of the stack, reducing costs and complexity, especially in multi-region deploys (see <a href="http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html">benchmark here</a>).</p>
<p>Another critical aspect of performance in Cassandra is choosing the right compaction strategy. Since Cassandra’s writes to disk are immutable (SSTables), data updates accumulate in new SSTable files. This gives excellent write performance, but reading data that is not sequentially stored on disk implies multiple disk seeks and a performance penalty. To overcome that, the compaction process merges and combines data, evicts tombstones and consolidates SSTables into new merged files. We started with the default SizeTieredCompactionStrategy (STCS), which works pretty well for write-intensive workloads. For cases where <a href="http://www.datastax.com/dev/blog/when-to-use-leveled-compaction">reads are more frequent and data is frequently updated</a>, we migrated to Leveled Compaction (LCS). In many cases, this helped remove tombstones faster and reduce the number of SSTables per read, hence improving read latency (see below).</p>
<table class="image">
<caption align="bottom"><i>SSTables per read by node (SizeTiered vs Leveled Compaction Strategy)</i></caption>
<tr><td align="center">
<a href="/images/2016-01-20-transitioning-from-mysql-to-cassandra-at-chaordic/SSTablesPerRed.png" target="_blank">
<img src="/images/2016-01-20-transitioning-from-mysql-to-cassandra-at-chaordic/SSTablesPerRed.png" alt="SSTables per read by node (SizeTiered vs Leveled Compaction Strategy)" align="middle" />
</a>
</td></tr>
</table>
<p><br /></p>
<p>Configuring compression also has a huge impact on storage space. For archival or infrequently read data, Deflate compression saved us 60-70% of disk space. For some use cases (see <a href="http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_when_compress_c.html">here</a>), compression may even improve read performance, due to reduced IO time.</p>
<p>For monitoring, it doesn’t matter which tool you use (Librato, OpsCenter, Grafana, etc.), as long as you are comfortable with it and do the homework of understanding and watching key metrics. Cassandra is built with extensive instrumentation based on JMX. In our experience, sticking with fewer time-based metrics, coherently organized in dashboards where you can easily spot anomalies, is much more important than having all the metrics. A selection of key metrics will include the more general ones (latencies, transactions per second), as well as system metrics (load, network I/O, disk, memory), JVM metrics (heap usage, GC times), and deeper Cassandra specifics. For compactions, look at PendingTasks and TotalCompactionsCompleted; Read/WriteTimeouts and Dropped Mutations will normally anticipate capacity issues; Exceptions.count helps spot general errors; and TotalHints and RepairedBackground will help diagnose inconsistencies.</p>
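<p>In the spirit of watching a few key metrics rather than all of them, even a simple threshold check over recent samples can flag a compaction backlog before it becomes a capacity issue. The threshold and window below are invented for illustration, not recommended values:</p>

```python
# Flag a node when its pending-compactions metric stays elevated.
def pending_compactions_alert(samples, threshold=30, window=3):
    """Alert if the last `window` samples all exceed `threshold`."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

healthy = [2, 5, 3, 8, 4, 6]          # normal, bursty but recovering
backlogged = [5, 12, 31, 44, 58, 71]  # monotonically growing backlog

print(pending_compactions_alert(healthy))     # False
print(pending_compactions_alert(backlogged))  # True
```

<p>Requiring several consecutive bad samples, rather than alerting on a single spike, is what keeps a check like this from paging on normal compaction bursts.</p>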
<h2 id="keeping-the-pace">Keeping the Pace</h2>
<p>Thanks to the vibrant community and DataStax’s commitment to frequent releases, even minor Cassandra upgrades may surprise you with performance improvements, features and configuration tweaks. So we advise keeping the cluster on a recent stable version. A rule of thumb for picking a stable version to upgrade to is to watch the Cassandra version shipped with the latest DataStax Enterprise.</p>
<p>This post gave an overview of how the Data Platform team at Chaordic adopted Cassandra and supported the company’s growth. Overcoming the initial barrier of adopting a non-relational database may not be easy, but it is well worth the investment to gain a robust and scalable solution.</p>
<p><a href="http://monkeys.chaordic.com.br/operation/2016/01/20/transitioning-from-mysql-to-cassandra-at-chaordic.html">Transitioning from MySQL to Cassandra at Chaordic</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on January 20, 2016.</p>
<p>Scaling a cloud operation requires a great number of design choices, sound engineering practices, and picking the right set of technologies. Critical to staying competitive while scaling is also being cost-efficient. Especially for complex cloud-based offerings, infrastructure costs can represent a significant part of a company’s expenses. At <a href="https://www.chaordic.com.br/">Chaordic</a>, we achieved a threefold improvement in cost-efficiency metrics in the last two years, and this post describes some of the techniques we employed to succeed.</p>
<p>To begin, actually knowing your costs is the basis for intelligent decision making. While this is somewhat obvious, the ability to break down your costs by services and modules is of paramount importance to understand bottlenecks, spikes, trends, and variable and fixed costs. In our case, a first analysis showed that the most expensive resource in the platform was the distributed database operation. In this phase, you can make use of your cloud provider’s (in our case, Amazon Web Services) ability to tag most resources and report based on those tags. We tagged our machines according to their roles, teams, environments and some other aspects to deepen our analysis.</p>
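<p>Once resources are tagged, the breakdown itself boils down to grouping billing rows by tag. A minimal sketch, with hypothetical billing records shaped like (service, role tag, cost):</p>

```python
# Aggregate spend by role tag to find the cost bottleneck.
from collections import defaultdict

billing_rows = [
    {"service": "ec2", "role": "cassandra", "cost": 420.0},
    {"service": "ec2", "role": "cassandra", "cost": 380.0},
    {"service": "ec2", "role": "api",       "cost": 150.0},
    {"service": "s3",  "role": "archive",   "cost": 55.0},
]

by_role = defaultdict(float)
for row in billing_rows:
    by_role[row["role"]] += row["cost"]

# Rank roles by spend; here the database dominates, as in our analysis.
ranking = sorted(by_role.items(), key=lambda kv: kv[1], reverse=True)
print(ranking[0])  # ('cassandra', 800.0)
```

<p>In practice the input would be the detailed AWS billing CSV rather than hand-written dicts, but the grouping logic is the same.</p>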
<p>AWS provides many options to help you reduce costs. The trade-offs of reducing costs while keeping quality of service must be well understood. For example, once we were confident in a specific <a href="http://aws.amazon.com/ec2/instance-types/">instance type</a> for Cassandra, we chose to commit to one-year reserved instance contracts. This decision produced yearly savings of 56% to 77%, a significant reduction suitable for long- or always-running services. For short-running or non-critical tasks, you can rely on <a href="http://aws.amazon.com/ec2/purchasing-options/spot-instances/">Spot instances</a> to reduce machine-hour costs even further. The trade-off here is that these instances may be shut down at any time by AWS. We developed a Python agent - <a href="https://github.com/chaordic/tiopatinhas">Uncle Scrooge</a> - to manage spot usage with load balancers and compensate for the absence of spots with on-demand instances. This way we could combine the quality of service of on-demand instances with the low cost of spots.</p>
<table class="image">
<caption align="bottom"><i>Data Platform Overview</i></caption>
<tr><td align="center">
<a href="/images/2015-06-11-optimizing-for-cost/data-platform-overview-na.png" target="_blank">
<img src="/images/2015-06-11-optimizing-for-cost/data-platform-overview-na.png" alt="Data Platform Overview" align="middle" />
</a>
</td></tr>
</table>
<p><br /></p>
<p>Architecture and design decisions also play a key role in cost optimization. In our case, we decided to go even further in reducing database costs by archiving data based on usage patterns. AWS provides a very cheap and reliable data store through S3, so instead of keeping infrequently accessed data in the more expensive database instances, we chose to archive it periodically in S3. Moreover, we decided to completely bypass the database by employing a reliable data-archiving solution based on Kafka and <a href="https://github.com/pinterest/secor">Secor</a>. This gave us a further 30-40% cost reduction in data storage. If you want to go a little further, moving data from S3 to Glacier gives you another significant cost-saving step. In this case, the downside is that you need to wait a few hours to recover your data.</p>
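<p>The usage-based archiving decision can be sketched in a few lines. The 90-day threshold and the access ages below are made up for illustration; the real pipeline moved data through Kafka and Secor rather than row by row:</p>

```python
# Split rows between the database (hot) and S3 (cold) by last access.
ARCHIVE_AFTER_DAYS = 90  # illustrative cutoff

rows = [
    {"id": "u1", "days_since_access": 3},
    {"id": "u2", "days_since_access": 120},
    {"id": "u3", "days_since_access": 400},
]

keep_in_db = [r["id"] for r in rows
              if r["days_since_access"] <= ARCHIVE_AFTER_DAYS]
archive_to_s3 = [r["id"] for r in rows
                 if r["days_since_access"] > ARCHIVE_AFTER_DAYS]

print(keep_in_db, archive_to_s3)  # ['u1'] ['u2', 'u3']
```

<p>The same split also tells you which archived partitions are candidates for a further move from S3 to Glacier.</p>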
<p>While you can rely on external tools to help you manage your costs (see <a href="https://cloudability.com/">Cloudability</a>), AWS provides basic reports, detailed CSVs of resource usage, and a tool to estimate your costs and help you make design decisions - the <a href="http://calculator.s3.amazonaws.com/index.html">Simple Monthly Calculator</a>. You should also look for community tools that help you evaluate resource usage - like <a href="https://github.com/Doist/check-reserved-instances">check-reserved-instances</a> or the more advanced <a href="https://github.com/Netflix/ice">Netflix Ice</a>.</p>
<p>Periodically reviewing service providers also proves worthwhile. For instance, in performance monitoring, although New Relic offered us a great service, the pricing plans did not fit our needs as we grew. So we moved to the fast-moving <a href="https://www.dripstat.com/">DripStat</a>, which provides a more flexible billing model, achieving almost 10x in cost savings.</p>
<p>We have highlighted here some alternatives for reducing costs in the cloud, and we hope to have inspired you to look into these opportunities. On AWS, we encourage you to start by looking at instance reservations to quickly achieve significant results. Apart from delivering greater efficiency, reallocating cost savings to new investments can be a motivation to start.</p>
<p><a href="http://monkeys.chaordic.com.br/operation/2015/06/11/optimizing-for-cost-in-the-cloud-and-aws-at-chaordic.html">Optimizing for Cost in the Cloud and AWS</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on June 11, 2015.</p>
<p><a href="http://spark.apache.org/">Apache Spark</a> is a cluster computing framework designed to provide ease of use, fast processing and general-purpose pipelines compared to traditional systems like <a href="https://hadoop.apache.org/">Apache Hadoop</a>. Here at Chaordic we are using it to build scalable products, like Personalized Emails, replacing a stack comprising Hadoop jobs, databases and services like Amazon DynamoDB, MySQL, SQS, Redis and internal systems built on Jetty.</p>
<p>Even though Spark is easy to use and very powerful, we have learned better ways to use it and condensed them into an open-source project called Ignition, consisting of:</p>
<ul>
<li><a href="https://github.com/chaordic/ignition-core">Ignition-Core</a>, a library with Spark and Scala goodies, including a command-line cluster management tool</li>
<li><a href="https://github.com/chaordic/ignition-template">Ignition-Template</a>, an SBT template for Spark and Scala that includes Ignition-Core, gives some examples and suggests some best practices for working with Spark</li>
</ul>
<h1 id="starting-the-ignition">Starting the Ignition!</h1>
<p>This tutorial will first show how to run a Spark job locally, and then for real in the cloud (AWS).</p>
<h2 id="downloading-and-running-a-local-job-setup">Downloading and running a local job setup</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1<br data-jekyll-commonmark-ghpages="" />2<br data-jekyll-commonmark-ghpages="" />3<br data-jekyll-commonmark-ghpages="" /></pre></td><td class="code"><pre>git clone <span class="nt">--recursive</span> https://github.com/chaordic/ignition-template<br data-jekyll-commonmark-ghpages="" /><span class="nb">cd </span>ignition-template<br data-jekyll-commonmark-ghpages="" />./sbt <span class="s1">'run WordCountSetup'</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Note: it may take a while to download all dependencies for the first time. Be patient.</p>
<p>This job will download files from an S3 bucket containing some <a href="http://www.gutenberg.org/">Project Gutenberg</a> books and output the top 1000 words in them. It’s something like 400MB of data, which is small enough to run locally but large enough to bore you if your internet connection or computer is slow =)</p>
<h2 id="running-a-job-setup-in-an-aws-cluster">Running a job setup in an AWS cluster</h2>
<h3 id="pre-requisites">Pre-requisites</h3>
<p>You will need:</p>
<ul>
<li>An AWS account with Access Key ID and Secret Access Key, see this <a href="http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html">doc</a></li>
<li>In each region (this tutorial assumes us-east-1, N. Virginia) that you’ll run the cluster, you need an <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair">ec2 key pair</a>. We recommend following the convention and calling the key id “ignition_key” and saving the key file in ~/.ssh/ignition_key.pem with the correct permissions.</li>
<li>Python and PIP</li>
</ul>
<p>For instance, on an Ubuntu system, you can do this setup with the following shell commands:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1<br data-jekyll-commonmark-ghpages="" />2<br data-jekyll-commonmark-ghpages="" />3<br data-jekyll-commonmark-ghpages="" />4<br data-jekyll-commonmark-ghpages="" />5<br data-jekyll-commonmark-ghpages="" /></pre></td><td class="code"><pre><span class="nb">export </span><span class="nv">AWS_ACCESS_KEY_ID</span><span class="o">=</span><your key id><br data-jekyll-commonmark-ghpages="" /><span class="nb">export </span><span class="nv">AWS_SECRET_ACCESS_KEY</span><span class="o">=</span><your secret access key><br data-jekyll-commonmark-ghpages="" />chmod 400 ~/.ssh/ignition_key.pem<br data-jekyll-commonmark-ghpages="" /><span class="nb">sudo </span>apt-get install python-pip<br data-jekyll-commonmark-ghpages="" /><span class="nb">sudo </span>pip install <span class="nt">-r</span> ignition-template/core/tools/requirements.txt</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="running-it">Running it!</h3>
<p>Disclaimer: you need to know what you are doing. Running this example will incur charges. If you cancel the setup during the spot request phase (e.g. your internet connection fails and the script aborts), requests may be left behind and will need to be cancelled manually. The price you set below is what Amazon may charge per spot machine, plus an on-demand master. If you change the Amazon region, you may also face data transfer charges. Make sure you can and want to afford it.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1<br data-jekyll-commonmark-ghpages="" />2<br data-jekyll-commonmark-ghpages="" /></pre></td><td class="code"><pre><span class="nb">cd </span>ignition-template<br data-jekyll-commonmark-ghpages="" />core/tools/cluster.py launch my-spark-cluster 2 <span class="nt">--spot-price</span> 0.1 <span class="nt">--instance-type</span> r3.xlarge </pre></td></tr></tbody></table></code></pre></figure>
<p>This will launch a cluster consisting of an on-demand master (m3.xlarge by default) and 2 slaves at a spot price of USD 0.10 in the us-east-1b AZ. If you can’t get machines, check the price on the Amazon Console and perhaps change the availability zone (e.g. <code class="highlighter-rouge">-z us-east-1c</code>)</p>
<p>Then you can run the Word Count example:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1<br data-jekyll-commonmark-ghpages="" /></pre></td><td class="code"><pre>core/tools/cluster.py <span class="nb">jobs </span>run my-spark-cluster WordCountSetup 20G </pre></td></tr></tbody></table></code></pre></figure>
<p>To check job progress, see the web interface at the master IP address, port 8080.</p>
<p>To get the master IP address, use:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1<br data-jekyll-commonmark-ghpages="" /></pre></td><td class="code"><pre>core/tools/cluster.py get-master my-spark-cluster </pre></td></tr></tbody></table></code></pre></figure>
<p>Running in the cluster should be much faster than running locally. The last parameter is the amount of memory for this job. As this machine has 30G, 20G for the job is a reasonable value. If you use bigger machines, an r3.2xlarge for instance, we recommend using more workers per machine (e.g. <code class="highlighter-rouge">--worker-instances=2</code>) so that one machine is split into multiple workers.</p>
<p>After playing with the cluster, you must destroy it:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1<br data-jekyll-commonmark-ghpages="" /></pre></td><td class="code"><pre>core/tools/cluster.py destroy my-spark-cluster </pre></td></tr></tbody></table></code></pre></figure>
<p>Note: destroy defaults to region us-east-1; if you launched your cluster in another region, you need to specify it with the <code class="highlighter-rouge">--region</code> parameter.</p>
<p>For more options and to see the defaults, use the -h option with each command or subcommand:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1<br data-jekyll-commonmark-ghpages="" />2<br data-jekyll-commonmark-ghpages="" />3<br data-jekyll-commonmark-ghpages="" />4<br data-jekyll-commonmark-ghpages="" /></pre></td><td class="code"><pre>core/tools/cluster.py <span class="nt">-h</span><br data-jekyll-commonmark-ghpages="" />core/tools/cluster.py launch <span class="nt">-h</span><br data-jekyll-commonmark-ghpages="" />core/tools/cluster.py <span class="nb">jobs</span> <span class="nt">-h</span><br data-jekyll-commonmark-ghpages="" />core/tools/cluster.py <span class="nb">jobs </span>run <span class="nt">-h</span></pre></td></tr></tbody></table></code></pre></figure>
<h1 id="next-steps">Next steps</h1>
<p>See the provided README file for more information and explore the source code!</p>
<p>There are still more examples and documentation to come. Stay tuned!</p>
<p><a href="http://monkeys.chaordic.com.br/2015/03/22/start-using-spark-with-ignition.html">Start using Spark with Ignition!</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on March 22, 2015.</p>
<p>Last weekend we had the biggest event in Brazilian e-commerce, Black Friday. While you were having fun hunting for deals, we at Chaordic were also having fun keeping the gears turning to provide the best shopping experience to consumers, connecting them with the products that interest them most.</p>
<p>On Friday alone we recorded more than 750 million interactions with our services from 12.5 million users. That is roughly 4 times what we measured on the Friday of the previous week. Even so, our servers responded within 55 milliseconds for practically the whole event. The robustness of our infrastructure lets us collect and analyze terabytes of data in a few hours… and that is what we did after Black Friday. Want to know what those analyses found?</p>
<p>Which kind of product do you think was the big hit of this edition of the event? We analyzed the descriptions of the purchased items and extracted the most popular terms. The best sellers were sneakers. Check out the word cloud below.</p>
<p><img src="/images/2014-12-02-saldo-black-friday-na-chaordic/buyorders-tags.png" alt="" title="Most frequent words among the best-selling products" /></p>
<p>This edition of Black Friday was a total success, both for retail and for Chaordic’s personalization services. We counted 1 million orders with an average of 1.8 products per cart. At least one recommended product was present in 15% of the orders. Moreover, in those orders containing products we recommended, these accounted for 55% of all the products in the shopping cart on average!</p>
<p>Carts with our recommendations had on average 75% more products than carts without them, which underlines the effectiveness and importance of seeking first-class personalization solutions. Our showcases were important pieces in promoting the best possible shopping experience to the customers of our dozens of partners, and they were clicked 4.7 million times.</p>
<p>We recorded a flood of extremely attractive deals. Once again our partners brought very nice surprises to Black Friday 2014. In total, there were 880 thousand products with discounts of up to 20%, 825 thousand with discounts between 20 and 40%, and 550 thousand between 40 and 80%. The figure below shows the proportion of products in each discount range.</p>
<p align="center">
<img width="70%" src="/images/2014-12-02-saldo-black-friday-na-chaordic/hist-discount.png" alt="" title="Discount distribution" />
</p>
<p>Among the products with the biggest discounts, the most frequent colors were (in descending order): black, white, blue, pink and red. The discount champions were DVDs, with an average markdown of 74%. At the bottom, with the smallest discounts, were books. See the ranking of the deal champions:</p>
<ol>
<li>DVDs</li>
<li>Blouses</li>
<li>Flats</li>
<li>Dresses</li>
<li>Cameras</li>
<li>Pants</li>
<li>Sneakers</li>
<li>Sandals</li>
<li>Shirts</li>
<li>Watches</li>
</ol>
<p>Throughout the whole day our infrastructure proved robust and met our partners’ expectations. We received a traffic peak in the night from Thursday to Friday of approximately 730 thousand requests per minute (rpm). In the figure below you can see that during the early hours traffic decreases and starts climbing again after 5 a.m. as people wake up. You can also notice a slight increase after 6 p.m., when people are presumably getting home from work and take the chance to check a few last deals.</p>
<p><img src="/images/2014-12-02-saldo-black-friday-na-chaordic/throughput.png" alt="" title="System throughput during Black Friday" /></p>
<p>Around 11 p.m. on Thursday, close to our peak of requests per minute, an overload on the cache servers caused a slight drop in the number of requests served. The incident was quickly handled and system throughput returned to its normal growth. The success of the whole “Black Friday operation” at Chaordic was widely celebrated and left us with many lessons. Here’s to more Black Fridays!</p>
<p><a href="http://monkeys.chaordic.com.br/analysis/2014/12/02/saldo-black-friday-na-chaordic.html">Saldão Black Friday 2014</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on December 02, 2014.</p>
<p>Chaordic’s personalization showcases are integrated into the largest e-commerce sites in the country, which in practice means that around 40% of Brazilian e-commerce orders pass through our systems.</p>
<p>Serving this enormous number of requests with quality is a big challenge, especially in Black Friday season. In 2011 Chaordic reached a peak of 70 thousand requests per minute (RPM) during Black Friday. In 2012 it was about 200 thousand RPM, and in 2013, 313 thousand RPM (more than 5 thousand requests per second). The figure below shows the RPM peak in the last 3 editions of Black Friday (in thousands of RPM).</p>
<p><a href="/images/blackfriday_rpm.png" target="_blank">
<img src="/images/blackfriday_rpm.png" alt="RPM peak in the last 3 editions of Black Friday" title="RPM peak in the last 3 editions of Black Friday" />
</a></p>
<p>The anticipation for Black Friday each year is so great that we even run a betting pool on who can guess the maximum RPM peak. This year we are expecting peaks of up to 650k RPM (more than 10 thousand per second), and response time must not exceed 70ms! =]</p>
<h2 id="como-responder-rapidamente">How do we respond quickly?</h2>
<p>The figure below shows a brief summary of the architecture we built to serve all this demand with quality.</p>
<p><a href="/images/blackfriday_arch.png" target="_blank">
<img src="/images/blackfriday_arch.png" alt="Chaordic's Platform architecture" title="Chaordic's Platform architecture" />
</a></p>
<p>Requests first reach Amazon’s ELB (Elastic Load Balancer), which distributes them to the OnSite Server machines, a system written in Node.js and optimized for high parallel-processing capacity. If the request is for loading recommendations, a lookup in the Redis cache is made, ensuring low response times.</p>
<p>If the information is not in Redis, the OnSite Server fires a request to Platform, a highly available system that uses the Cassandra database. We currently have 48 servers in our cluster, one of the largest clusters of this technology in Latin America, according to information gathered at the last <a href="/events/chaordic-at-cassandra-summit-2014/">Cassandra Summit</a>.</p>
<p>Platform, in turn, tries to fetch the information from its own Redis cache and, if it is not there yet, queries Cassandra. The caching scheme we use is quite efficient: only 10% of the read requests we receive ever reach the database. This is the main factor behind the speed of our systems.</p>
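<p>This read path is a two-level cache-aside lookup. A minimal sketch, with plain dicts standing in for the Redis caches and Cassandra (keys and values are made up):</p>

```python
# Two-level cache-aside: edge Redis, then Platform's Redis, then the
# database; each miss populates the caches on the way back.
edge_cache = {}      # OnSite Server's Redis
platform_cache = {}  # Platform's Redis
cassandra = {"rec:42": ["p1", "p7", "p9"]}  # source of truth

db_reads = 0

def get_recommendations(key):
    global db_reads
    if key in edge_cache:
        return edge_cache[key]
    if key in platform_cache:
        edge_cache[key] = platform_cache[key]
        return edge_cache[key]
    db_reads += 1                    # only cache misses reach Cassandra
    value = cassandra[key]
    platform_cache[key] = value
    edge_cache[key] = value
    return value

for _ in range(10):
    get_recommendations("rec:42")
print(db_reads)  # 1
```

<p>Ten identical requests cost a single database read; in production, TTLs and invalidation on writes would keep the caches from serving stale recommendations.</p>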
<h2 id="e-como-garantir-alta-disponibilidade">And how do we guarantee high availability?</h2>
<p>Chaordic’s systems are distributed across different Amazon Availability Zones (AZs). In other words, if one AZ happens to suffer some kind of problem, the servers in another AZ can pick up the slack. Beyond that, Cassandra itself guarantees high availability, with its fully decentralized architecture (no single point of failure) and multi-datacenter data replication strategy.</p>
<p>Another important component of this architecture is automatic machine scaling. If there is a significant increase in traffic, new machines are launched automatically to sustain performance. The same happens when a server fails: it is automatically replaced by a healthy one.</p>
<p>Besides the components mentioned above, we handle streaming with Kafka, and some entities are stored and indexed in Elasticsearch, but that is a topic for another post.</p>
<h2 id="monitoramento">Monitoring</h2>
<p>Even with a robust architecture, small problems happen, and we need to act before these small problems become big ones. =]</p>
<p>We invest heavily in monitoring and alerting for our infrastructure. In <a href="http://librato.com">Librato</a> alone, more than 40 thousand metrics are collected per minute. We have already published a Monkeys post on <a href="/monitoring/metrics-monitoring-and-real-time-analysis/">metrics monitoring and real-time analysis</a>.</p>
<p>Black Friday season is a war operation. All Chaordic folks keep a close eye on the graphs, alerts and system logs.</p>
<p><a href="/images/blackfriday_army.png" target="_blank">
<img src="/images/blackfriday_army.png" alt="The Chaordic army on Black Friday." title="The Chaordic army on Black Friday." />
</a></p>
<p>Bring on the next one!</p>
<p><a href="http://monkeys.chaordic.com.br/operation/2014/11/05/black-friday-2014.html">Aguentando porrada na Black Friday 2014</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on November 05, 2014.</p>
<p>At Chaordic we have been using <a href="http://cassandra.apache.org/">Apache Cassandra</a> to store data at scale since 2012, when we faced exponential growth and migrated from MySQL. Since then Cassandra has been a key technology here, allowing us to scale from a few hundred million to tens of billions of requests per month, and growing… In this post we will share some of our experiences at the Cassandra Summit, held in San Francisco from September 10th to 13th this year.</p>
<table class="image">
<caption align="bottom"><i>well, erm... that's me at #CassandraSummit 2014</i></caption>
<tr><td align="center">
<a href="/images/paulo.jpg" target="_blank">
<img src="/images/paulo.jpg" alt="well, erm... that's me at #CassandraSummit 2014" width="60%" align="middle" />
</a>
</td></tr>
</table>
<p><br />
The conference had over 2000 participants from around the globe and awesome talks from leading companies in many fields sharing their experiences with Cassandra. Cassandra is a fully distributed database, and its conference could be no different: the keynotes were broadcast to over 20 locations worldwide, including Chaordic headquarters in Florianópolis, where the rest of the team watched the main announcements of the Summit.</p>
<table class="image">
<caption align="bottom"><i>#CassandraSummit keynotes @ Chaordic Florianópolis</i></caption>
<tr><td align="center">
<a href="/images/CassandraSummit.jpg" target="_blank">
<img src="/images/CassandraSummit.jpg" alt="#CassandraSummit keynotes @ Chaordic Florianópolis" align="middle" />
</a>
</td></tr>
</table>
<p><br />
One of the Summit highlights was the official announcement of Cassandra 2.1, with really neat features like <a href="http://www.datastax.com/dev/blog/more-efficient-repairs">incremental repair</a> (a major pain point since the beginning of the project), <a href="http://www.datastax.com/dev/blog/compaction-improvements-in-cassandra-21">incremental compactions</a> and a <a href="http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster">performance boost</a> of over 100% for CQL over Thrift access. After this, we’re definitely including the migration of our Thrift data model to CQL and an upgrade to C* 2.1 in our roadmap. We hope to present these experiences and improvements at the next Cassandra Summit in 2015. ;)</p>
<table class="image">
<caption align="bottom"><i>Jonathan Ellis, Apache Cassandra chair and DataStax CTO, announcing Cassandra 2.1</i></caption>
<tr><td align="center">
<a href="/images/cassandra21.jpg" target="_blank">
<img src="/images/cassandra21.jpg" alt="Jonathan Ellis, Apache Cassandra chair and DataStax CTO, announcing Cassandra 2.1" align="middle" />
</a>
</td></tr>
</table>
<p><br /></p>
<p>A very cool moment of the Summit was during Aaron Morton’s talk “Lesser Known Features of Cassandra 2.0 and 2.1”, when he mentioned one of our contributions to the Cassandra codebase, a flag to enable more aggressive tombstone compactions (<a href="https://issues.apache.org/jira/browse/CASSANDRA-6563">CASSANDRA-6563</a>). It was awesome to have our patch mentioned on the main stage of the Cassandra Summit! :-)</p>
<table class="image">
<caption align="bottom"><i>Aaron Morton presenting one of the features we added to Cassandra</i></caption>
<tr><td align="center">
<a href="/images/aaron.jpg" target="_blank">
<img src="/images/aaron.jpg" alt="Aaron Morton presenting one of the features we added to Cassandra" align="middle" />
</a>
</td></tr>
</table>
<p><br /></p>
<p>Overall the conference was great, and I was quite impressed with the recent wide-scale adoption of Cassandra throughout the industry and how fast the system is evolving to become a first-class database, even competing with Oracle in more traditional markets, like banking and government.</p>
<table class="image">
<caption align="bottom"><i>Next stop: Cassandra Bootcamp (photo: Eiti Kimura)</i></caption>
<tr><td align="center">
<a href="/images/bootcamp.jpg" target="_blank">
<img src="/images/bootcamp.jpg" alt="Next stop: Cassandra Bootcamp (photo: Eiti Kimura)" width="60%" align="middle" />
</a>
</td></tr>
</table>
<p><br /></p>
<p>After the main conference, some of us joined Apache Cassandra committers for the next two days for an intensive Bootcamp to learn and hack the Cassandra internals. On the first day, we had workshops on “hairy” Cassandra topics like compaction and storage engine internals and CQL query parsing. After that, we did a few challenging exercises on the Cassandra codebase, like implementing our own compaction strategy.</p>
<table class="image">
<caption align="bottom"><i>Hacking Cassandra (photo: Eiti Kimura)</i></caption>
<tr><td align="center">
<a href="/images/hacking.jpg" target="_blank">
<img src="/images/hacking.jpg" alt="Hacking Cassandra (photo: Eiti Kimura)" align="middle" />
</a>
</td></tr>
</table>
<p><br /></p>
<p>On the second day, we had a free day of hacking Cassandra on any ticket of our choice, with the support of Cassandra developers. I chose to fix the repair procedure within a single datacenter, an issue we had faced at Chaordic, so I thought it would be interesting to work on that. By the end of the day, after a delicious lunch and a few beers, I was able to complete the patch, which was reviewed and committed by Yuki Morishita. If you want to see more details about the issue, you can find it here: <a href="https://issues.apache.org/jira/browse/CASSANDRA-7450">CASSANDRA-7450</a>.</p>
<table class="image">
<caption align="bottom"><i>Cassandra Bootcamp participants (photo: Eiti Kimura)</i></caption>
<tr><td align="center">
<a href="/images/participants.jpg" target="_blank">
<img src="/images/participants.jpg" alt="Cassandra Bootcamp participants (photo: Eiti Kimura)" align="middle" />
</a>
</td></tr>
</table>
<p><br /></p>
<p>It was a unique experience to get up close and personal with the Cassandra team and deep dive into the Cassandra codebase. Furthermore, it was a great opportunity to share experiences with Cassandra professionals from other companies all over the world. We were really happy and proud to be part of that. Many thanks to DataStax and the Cassandra team for organizing this gathering!</p>
<p><br /></p>
<p>Would you like to work with Cassandra and other big data technologies at Chaordic, the largest recommendations provider in South America? We’re <a href="http://chaordic.recruiterbox.com">hiring</a>! :)</p>
<p><a href="http://monkeys.chaordic.com.br/events/2014/10/03/chaordic-at-cassandra-summit-2014.html">Chaordic @ Cassandra Summit 2014</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on October 03, 2014.</p>
<p>A challenge when working with large-scale systems is visualizing their behavior in real time. For that, a monitoring dashboard showing the main metrics, so that action can be taken when something is not going well, is essential. Here at Chaordic, the systems are monitored 24/7 to guarantee the best experience with our recommendation showcases.</p>
<p>One of the metrics we monitor is system throughput, which tells us how many requests are being served at a given moment. In our case, this metric is related to the number of showcases displayed to users in our partners’ online stores.</p>
<p>Under normal conditions, throughput lets us see, for example, whether the system is handling increased demand well, or whether some problem caused a sudden drop in the number of requests being served. During the World Cup, however, it also lets us visualize the impact of the opening ceremony and of Brazil’s first match on the number of requests we are serving.</p>
<p>The annotated graph below shows our system’s throughput between 14:30, one hour before the World Cup opening, and 20:00, one hour after Brazil’s match:</p>
<p><a href="/images/abertura.png" target="_blank">
<img src="/images/abertura.png" alt="Number of requests over time" title="Number of requests over time" />
</a></p>
<p>From the graph, we can identify some interesting facts:</p>
<ul>
<li>There was a 40% drop in the number of requests between 14:30 and 15:30, the official opening time, indicating that many people stopped what they were doing to watch the World Cup opening ceremony.</li>
<li>After the opening ceremony ended, the number of requests grew again, but did not return to the previous level, probably because people were on the move or getting ready to watch the match.</li>
<li>At 17:00, there was another sharp drop, due to the start of the match.</li>
<li>During the match, the impact of each of the 4 goals on system throughput is clearly visible:
<ul>
<li>The own goal was the least impactful, i.e. Brazilian internet users did not stop interacting with our showcases after the first goal conceded.</li>
<li>The biggest throughput drop came with the 2nd goal: since it was a penalty, it generated greater anticipation and attention, so people focused on the TV so as not to miss it.</li>
</ul>
</li>
<li>A curious fact is that during half-time the number of requests increased, but it did not return to the original level at the start of the second half. Perhaps the match wasn’t very interesting, and the less excited viewers stopped paying attention to it and went browsing the Web.</li>
<li>At the end of the match, the number of requests increased again, returning to normal.</li>
</ul>
<p>Did you find this analysis cool, and would you like to build scalable systems that impact the lives of millions of people? <a href="https://chaordic.recruiterbox.com/">We’re hiring!</a> :]</p>
<p><a href="http://monkeys.chaordic.com.br/monitoring/2014/06/13/interacoes-com-as-vitrines-da-chaordic-na-abertura-da-copa.html">Interações com as vitrines da Chaordic na abertura da Copa</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on June 13, 2014.</p>
<p>Two critical requirements when offering a recommendation service to some of the largest Brazilian web stores are availability and scalability. After all, if our service becomes unavailable, the store customers won’t be able to see our recommendations when they shop online, upsetting both the customers and the partners that rely on our service. We take this requirement very seriously at Chaordic, using the latest technologies to ensure our service will be available 24/7/365, including during peak periods such as Black Friday.</p>
<p>One of the systems we use for increased availability and scalability is <a href="https://cassandra.apache.org">Apache Cassandra</a>. Cassandra is an open source distributed storage solution that provides several mechanisms for data distribution and fault tolerance. The Cassandra community is very active, and new releases come out pretty much every month with bugfixes and awesome new features. It becomes quite hard to keep the pace and upgrade as soon as new releases become available.</p>
<p>A big challenge when you have to upgrade a production distributed system is to make sure your service remains available during the upgrade. Luckily, Cassandra makes it really simple to upgrade a production cluster without impacting application performance. We’ve done upgrades a few times already without “major” problems. Thor save us on the next upgrade coming soon! ;]</p>
<p>DataStax provides great <a href="http://www.datastax.com/documentation/cassandra/1.2/cassandra/upgrade/upgradeC_c.html">upgrade instructions</a>, but we will share some additional tips that may help devops engineers perform a zero-downtime Cassandra upgrade. Even though these tips relate to the upgrade from version 1.1 to 1.2, most of them should be valid for more recent upgrades.</p>
<h1 id="planing-the-upgrade">Planning the Upgrade</h1>
<ul>
<li>
<p>Familiarize yourself with the upgrade process, by reading the <a href="https://github.com/apache/cassandra/blob/trunk/NEWS.txt">NEWS.txt</a>, <a href="https://github.com/apache/cassandra/blob/trunk/CHANGES.txt">CHANGES.txt</a> and the DataStax <a href="http://www.datastax.com/documentation/cassandra/1.2/cassandra/upgrade/upgradeC_c.html">upgrade documentation</a>.</p>
</li>
<li>
<p>Create a simple document with the upgrade plan. This document will guide you through the process, allowing you to know which steps to take and to monitor the progress of the upgrade.</p>
</li>
<li>
<p>Define in which order the machines will be upgraded. If your cluster doesn’t use VNodes, a good practice is to upgrade at most one machine per replication group at a time, ensuring no replicas will be lost during the upgrade.</p>
</li>
<li>
<p>Define what time of day the upgrade will be done, and estimate how long it will take. It is advisable to do the upgrade when the system is <strong>NOT</strong> under heavy load, typically late at night.</p>
</li>
</ul>
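<p>The “at most one machine per replication group at a time” rule can be sketched as a simple schedule. This is only an illustration: the <em>upgrade_rounds</em> helper and the node names are hypothetical, not part of any Cassandra tooling.</p>

```python
from itertools import zip_longest

def upgrade_rounds(replication_groups):
    """Schedule at most one node per replication group per round,
    so no two replicas of the same data are ever down at once."""
    rounds = []
    for batch in zip_longest(*replication_groups):
        # drop the padding added for groups smaller than the largest one
        rounds.append([node for node in batch if node is not None])
    return rounds

# Hypothetical 6-node cluster with two replication groups of 3 nodes each
groups = [["node1", "node2", "node3"], ["node4", "node5", "node6"]]
for i, batch in enumerate(upgrade_rounds(groups), 1):
    print("round %d: upgrade %s while the others keep serving" % (i, batch))
```

<p>Each round takes down at most one replica of any key, so quorum reads and writes should keep succeeding throughout the upgrade.</p>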
<h1 id="before-the-upgrade">Before the Upgrade</h1>
<ul>
<p />
<li>Make sure all nodes are on the same schema, by checking the output of the following command on <em>cassandra-cli</em> of all nodes:</li>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">describe cluster<span class="p">;</span></code></pre></figure>
<p />
<li>Back up the cluster schema in case of catastrophe, which is unlikely to happen, but better safe than sorry. You can easily back up the schema (based on thrift) using the following command:</li>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">echo</span> <span class="nt">-e</span> <span class="s2">"show schema;</span><span class="se">\n</span><span class="s2">"</span> | cassandra-cli <span class="nt">-h</span> <seed_node> <span class="o">></span> <span class="s2">"schema-</span><span class="sb">`</span>date +%m-%d-%Y<span class="sb">`</span><span class="s2">.cdl"</span></code></pre></figure>
<p />
<li>Back up the ring tokens as well, just in case:</li>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">nodetool ring <span class="o">></span> ring-<span class="sb">`</span>date +%m-%d-%Y<span class="sb">`</span>.txt</code></pre></figure>
<p />
<li>Compare the old and new <i>cassandra.yaml</i> files, carefully reviewing attribute changes between versions. For example, the default partitioner changed in Cassandra 1.2, but it's incompatible with the old partitioner, so you need to make sure the old partitioner is specified in the configuration file. The same goes for initial_token and other custom attributes.</li>
<p />
<li>Prepare an upgrade script with the main upgrade steps, and distribute it to the cluster machines. You don't want to think too much in the middle of an upgrade. Below is a sample upgrade script:</li>
<script src="https://gist.github.com/10500609.js?file=cassandra_upgrade.sh"> </script>
<p />
<li>Prepare a rollback script in case things go wrong and you need to rollback fast. Below is a sample rollback script:</li>
<script src="https://gist.github.com/10500655.js?file=cassandra_rollback.sh"> </script>
<p />
<li>When you're ready to start, install the new Cassandra version binaries on the cluster nodes according to the <a href="http://www.datastax.com/documentation/cassandra/1.2/cassandra/upgrade/upgradeC_c.html">official upgrade instructions</a>.</li>
</ul>
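<p>Part of the <i>cassandra.yaml</i> review can be automated. Below is a minimal sketch, assuming the 1.1 to 1.2 upgrade, where the default partitioner changed to Murmur3Partitioner; the <em>config_diff</em> helper is ours, not a Cassandra tool:</p>

```python
import difflib

def config_diff(old_lines, new_lines):
    """Return only the changed attribute lines between two config files."""
    diff = difflib.unified_diff(old_lines, new_lines, lineterm="")
    return [l for l in diff
            if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]

# Toy excerpts of an old and a new cassandra.yaml
old = ["partitioner: org.apache.cassandra.dht.RandomPartitioner",
       "initial_token: 0"]
new = ["partitioner: org.apache.cassandra.dht.Murmur3Partitioner",
       "initial_token: 0"]
for line in config_diff(old, new):
    print(line)
```

<p>Here the diff flags the partitioner change, reminding you to keep the old partitioner in the new configuration file, since the two are incompatible.</p>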
<h1 id="during-the-upgrade">During the Upgrade</h1>
<ul>
<li>
<p>Shut down all scheduled maintenance tasks during the upgrade, such as repairs, cleanups, or upgradesstables.</p>
</li>
<li>
<p>Run the upgrade script (cassandra_upgrade.sh) on each of your nodes according to the schedule defined in the upgrade plan.</p>
</li>
<li>
<p>Check the cassandra logs for errors or warnings and take the necessary action (i.e. rollback or wait) if they seem critical. Some exceptions are not critical, which is why it’s important to be familiar with the <a href="https://github.com/apache/cassandra/blob/trunk/NEWS.txt">NEWS.txt</a> and <a href="https://github.com/apache/cassandra/blob/trunk/CHANGES.txt">CHANGES.txt</a> to know what has changed between versions.</p>
</li>
<li>
<p>In case something goes wrong and you need to rollback to the previous version, execute the rollback script (cassandra_rollback.sh) on the nodes that were already upgraded.</p>
</li>
</ul>
<h1 id="after-the-upgrade">After the Upgrade</h1>
<ul>
<li>
<p>Run <i>nodetool upgradesstables</i> on all nodes to upgrade the SSTables to the newest version. Since this can be a heavyweight operation, be careful not to overload your cluster.</p>
</li>
<li>
<p>Verify your application metrics (throughput, response time, exceptions, etc) to confirm all the metrics are stabilized (or hopefully got better) after the upgrade.</p>
</li>
<li>
<p>Enjoy your newly upgraded Cassandra cluster! ;-)</p>
</li>
</ul>
<p><em>Special thanks to Charles Brophy and Robert Coli from the Cassandra mailing list for sharing some of the tips described here.</em></p>
<p><a href="http://monkeys.chaordic.com.br/operation/2014/04/11/zero-downtime-cassandra-upgrade.html">Zero-downtime Cassandra upgrade</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on April 11, 2014.</p>
<p>About two years ago, we published <a href="https://github.com/chaordic/tiopatinhas">tiopatinhas</a>, “An AWS Autoscaling companion that saves money by using instances from the Spot Market”®. In short, tiopatinhas replaces half of your autoscaling on-demand instances with instances from the spot market. That allows you to save over 33% of EC2 costs, depending on the instance type.</p>
<p><img src="/images/tiopatinhas.jpg" alt="Tio Patinhas" title="Copyright Walt Disney Productions" /></p>
<p>Now, diving into the details: <em>tiopatinhas</em> is a python script that connects to an autoscaling group and monitors CloudWatch data and spot market information. Based on that information, <em>tiopatinhas</em> decides when to launch or terminate an instance via autoscaling. If it detects that your application needs to be scaled, it checks whether the next instance should be spot or on-demand: in the first case, it bids on the market, launches the instance and attaches it to the autoscaling group and ELB. Otherwise, autoscaling takes over and does the job.</p>
<p>Tiopatinhas is smart: when the spot market crashes, it replaces its spot instances with on-demand ones so that your application does not need to wait for the autoscaling triggers to scale out, which could take a while depending on the cooldown period of your policies. Also, if the market is “flapping”, <em>tiopatinhas</em> is aware of the one-hour billing cycles of AWS, so it won’t waste your money trying to replace instances in the middle of an hour. Moreover, <em>tiopatinhas</em> is fail-safe: if the <em>tiopatinhas</em> process crashes, it will leave some instances in the autoscaling group, but since it doesn’t change any of your autoscaling rules, there is no risk of it crashing your entire application environment.</p>
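<p>The billing-cycle awareness can be illustrated with a toy sketch. This is not the actual <em>tiopatinhas</em> code (see the repository for that); the function names and the minute-50 threshold here are made up:</p>

```python
from datetime import datetime

def minutes_into_billing_hour(launch_time, now):
    """EC2 was billed by the full hour at the time, so an instance
    launched at 10:07 starts a new paid hour at 11:07, 12:07, ..."""
    elapsed = (now - launch_time).total_seconds() / 60
    return elapsed % 60

def worth_replacing(launch_time, now, threshold=50):
    """Only consider swapping an instance near the end of its billing
    hour, so an almost-full paid hour is not thrown away."""
    return minutes_into_billing_hour(launch_time, now) >= threshold

launched = datetime(2014, 3, 26, 10, 7)
print(worth_replacing(launched, datetime(2014, 3, 26, 10, 20)))  # → False (minute 13)
print(worth_replacing(launched, datetime(2014, 3, 26, 11, 2)))   # → True (minute 55)
```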
<p>The graph below compares the number of on-demand (green) vs spot (yellow) instances in a live autoscaling group. It’s possible to see that there is roughly the same number of spot and on-demand instances (ignore the spikes =]). So in this example, about 50% of the autoscaling instances are running at lower spot market prices ($$$$).</p>
<p><img src="/images/SpotVsOnDemand.png" alt="On-demand vs spot instances over time" title="Green line: on-demand instances. Yellow line: spot instances." /></p>
<p>In some recent changes, we added the option to use different instance types, useful for when the spot market is too volatile for one specific instance type. If you use, for example, a c1.xlarge instance and its spot market is crashing often, you can use a larger instance with a spot price still lower than the on-demand one. We also have plans to support a heterogeneous cluster, with many different instance types, to avoid having half of your instances crash at once in case of a failure. Stay tuned :)</p>
<p>We currently run <em>tiopatinhas</em> in many production systems at Chaordic. To get started, just run:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python tp.py -g <AutoScalingGroupName>
</code></pre></div></div>
<p>Does this sound interesting and you think it could be fun to work with us? Please have a look on our <a href="https://chaordic.recruiterbox.com/">careers</a> page.</p>
<p><a href="http://monkeys.chaordic.com.br/operation/2014/03/26/saving-money-on-the-cloud-with-tio-patinhas.html">Saving money on the cloud with Tio Patinhas</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on March 26, 2014.</p>
<p>Every developer needs to know what is happening inside the application. We are no different, and what we do to achieve this is trace a lot of our data, from low-level system stuff to many high-level, application-specific metrics.</p>
<p>In order to collect application data and plot it, since we use autoscaling, we need to aggregate it over time inside the application, send it to a collector, and aggregate it again across instances. For that second round of aggregation we use Etsy’s <a href="https://github.com/etsy/statsd/">statsd</a> with a few <a href="https://github.com/chaordic/statsd/">modifications of our own</a> to compose metrics. Those modifications allow us to create compound metrics based on the raw data we collect: e.g. from the raw clicks and pageviews, we calculate the CTR. Since statsd just collects the data and aggregates it, we had to devise our own composition engine with custom rules to provide such information. You can get our changes on GitHub; feel free to modify and contribute back :)</p>
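<p>The composition idea can be shown with a toy sketch. The rule format and function name here are invented for illustration; the real engine lives in our statsd fork:</p>

```python
def compose_metrics(raw, rules):
    """Derive compound metrics from raw counters: each rule maps a new
    metric name to a function of the raw values."""
    return {name: fn(raw) for name, fn in rules.items()}

raw = {"clicks": 42, "pageviews": 1000}
rules = {
    # click-through rate: clicks per pageview
    "ctr": lambda m: m["clicks"] / m["pageviews"] if m["pageviews"] else 0.0,
}
print(compose_metrics(raw, rules))  # → {'ctr': 0.042}
```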
<p>Measuring data is an exercise in parsimony: when you first get to it, you feel compelled to track each and every piece of data you can. But, as <a href="https://github.com/matheusrossato">matheusrossato</a> says: “not everything you can measure is important, and not everything that is important can be measured”. For what it’s worth, the first time we started measuring data on one of our products, we also sent data to <a href="https://metrics.librato.com/features">Librato</a> for plotting, and our bill was about 5 times today’s value, with less than half of the useful data we have today. Our mistake was that we were basically measuring useless data, just because we could.</p>
<p>With that in mind, there is also the matter of combining and processing raw data to plot something else, as in the CTR example above. This is where statsd came in for us. While you can get very good insights just with raw data, there are many times when it is just not enough to show you something. What kind of data needs to be combined or post-processed? That is up to you and your experience with what you are doing :) When you have no information on that, try experimenting with combining some of the data you already have, a few at a time, keeping what makes sense and discarding what does not. Also, if you don’t have some kind of data tracked and need it twice, track it! Try to keep a library that makes it easy to add more data to your monitoring system.</p>
<p>If you are using librato, use their notifications api to plot data about deploys, changes in infrastructure or application settings, it helps a lot to quickly identify why a metric has freaked out.</p>
<p>Last but not least, learn and improve over what you get from the monitoring. While it may sound obvious, all that data you gather can easily be a waste of money if you don’t pay attention to it and listen to what it has to say.</p>
<p><a href="http://monkeys.chaordic.com.br/monitoring/2013/12/26/metrics-monitoring-and-real-time-analysis.html">Metrics monitoring and real time analysis</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on December 26, 2013.</p>
<p>Cross-Origin Resource Sharing (CORS) is a <a href="http://www.w3.org/TR/cors/">specification</a> that enables client-side cross-origin HTTP requests. This is particularly useful for javascript web applications, since most modern browsers do not allow client-side RPCs to domains other than the origin domain. In short, a server wishing to enable CORS should add the <strong>Access-Control-Allow-Origin</strong> header to its responses, specifying the allowed origin, or the wildcard <strong>*</strong> to allow cross-origin requests from any domain.</p>
<!--more-->
<p><a href="http://enable-cors.org">Enable-cors.org</a> provides a <a href="http://enable-cors.org/server.html">list</a> of server configuration files to enable CORS in different servers. However, the configuration file provided for <a href="http://enable-cors.org/server_nginx.html">Nginx</a> does not work out of the box for an HTTPS proxy server configuration. Luckily we found an elegant configuration to enable CORS on an HTTPS Nginx proxy <a href="http://blog.themillhousegroup.com/2013/05/nginx-as-cors-enabled-https-proxy.html">here</a>.</p>
<p>The limitation with the previous solution is that it uses the wildcard <strong>*</strong>, allowing any website to make cross-origin requests to your server, which raises some <a href="http://yossi-yakubov.blogspot.co.uk/2011/09/bypassing-token-protection-against-csrf.html">security concerns</a>. Unfortunately, the CORS specification only allows a fixed list of URLs to be specified in the <strong>Access-Control-Allow-Origin</strong> header, preventing the server from allowing cross-domain requests from a dynamic set of URLs, such as any subdomain of *.mckinsey.com.</p>
<p>In order to overcome this limitation we created a Nginx configuration based on the previous solutions, which enables CORS on an HTTPS Nginx proxy only for a set of allowed URLs matched by a regex. The idea is that the Nginx server compares the HTTP <em>Origin</em> header with the given regex, and responds with <strong>‘Access-Control-Allow-Origin: $http_origin’</strong> when $http_origin matches the defined regex. For instance, in order to allow CORS requests from any subdomain of *.mckinsey.com, the regex is: <em>^https?://.*\.mckinsey\.com(:[0-9]+)?$</em></p>
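<p>The matching behaviour can be sanity-checked outside Nginx. Nginx uses PCRE, but this particular pattern behaves the same under Python’s <em>re</em> module (the <em>allow_origin</em> helper is just for illustration):</p>

```python
import re

# The same pattern used against the Origin header in the Nginx check
ALLOWED_ORIGIN = re.compile(r"^https?://.*\.mckinsey\.com(:[0-9]+)?$", re.IGNORECASE)

def allow_origin(http_origin):
    """Return the Access-Control-Allow-Origin value, or None to omit the header."""
    if http_origin and ALLOWED_ORIGIN.match(http_origin):
        return http_origin  # echo the origin back, never the wildcard
    return None

print(allow_origin("https://www.mckinsey.com"))      # → https://www.mckinsey.com
print(allow_origin("http://app.mckinsey.com:8080"))  # → http://app.mckinsey.com:8080
print(allow_origin("https://evil.com"))              # → None
```

<p>Echoing $http_origin back when the regex matches, instead of sending the wildcard, is exactly what the Nginx configuration does.</p>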
<p>The resulting Nginx configuration file is shown in the following gist:</p>
<script src="https://gist.github.com/7084524.js"> </script>
<p>Since the <em>more_set_headers</em> directive is used in the solution, the <em><a href="http://wiki.nginx.org/HttpHeadersMoreModule">HttpHeadersMore</a></em> module must be enabled on the Nginx server. This can be done by <a href="http://wiki.nginx.org/HttpHeadersMoreModule#Installation">recompiling</a> Nginx with that module, or by installing the <em>nginx-extras</em> package, available for Ubuntu and Debian via <em>apt</em>.</p>
<p><a href="http://monkeys.chaordic.com.br/operation/2013/10/21/secure-cors-nginx.html">Secure CORS support on Nginx</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on October 21, 2013.</p>
<p>Welcome to the Chaordic Technology Blog. This blog is maintained by Chaordic’s engineers with the objective of sharing anything related to the technology we use on a daily basis to deliver top quality personalized recommendations to millions of users.</p>
<p>Please excuse the somewhat limited interface here, but we plan to take an incremental approach and improve the user experience gradually. The main objective for the moment is to focus on producing quality content to help and entertain fellow hackers out there.</p>
<p>We’re using the <a href="http://jekyllbootstrap.com/">Jekyll Bootstrap</a> toolkit to serve our content on <a href="http://pages.github.com/">GitHub Pages</a>. This choice provides a simple and collaborative environment where any engineer is able to push content whenever there’s something interesting to be shared.</p>
<p>If you just came across this website and haven’t heard about Chaordic before, you may want to check the following links:</p>
<ul>
<li><a href="http://www.chaordic.com.br/en/?hl=en-us">Chaordic Official Website</a></li>
<li><a href="https://github.com/chaordic">Chaordic Github</a></li>
<li><a href="https://chaordic.recruiterbox.com/">Chaordic Recruiting</a></li>
</ul>
<p>Happy hacking!</p>
<p><a href="http://monkeys.chaordic.com.br/info/2013/10/15/hello-world.html">Hello, World!</a> was originally published by Chaordic at <a href="http://monkeys.chaordic.com.br">Chaordic Code Monkeys</a> on October 15, 2013.</p>